Purposes of data compression and archiving
File archiving is a way to consolidate multiple input files into a single output archive, often applying integrated data compression to remove data redundancies, so the output is both smaller (saving disk space and upload/download bandwidth) and easier to handle than the separate input files - learn more: what is a compressed / archive file.
Optimize the compression method for the end user's goals
A common concern when compressing data - whether for backup, upload, or distribution - is balancing a worthwhile compression ratio with reasonably fast operation, so that, for example, end users can unpack the data in a timely fashion, or a backup process completes within a fixed maximum amount of time.
Since goals and constraints vary between scenarios, the factors affecting compression efficiency must be weighed carefully with the intended use of the data in mind; the following chapter provides suggestions for choosing strategies and parameters for optimal compression results.
Data compression best
practices
Quite obviously, the best data compression practices mean nothing if the file cannot reach the intended end user. If the archive needs to be shared, the first concern is which archive file types the end user is able to read - which archive formats are supported, or can be supported, on the end user's computing platform (Microsoft Windows, Google Android/ChromeOS, iOS, Apple OSX, Linux, BSD...) - assuming the user is willing and authorized to install the needed software.
Most of the time, the better choice in this case is to stay with the most common format (ZIP); RAR is quite popular on MS Windows platforms, TAR is ubiquitously supported on Unix-derived systems, and 7Z is becoming increasingly popular on all systems.
Some file sharing platforms, cloud services, and e-mail providers may block certain file types on the grounds that they are commonly abused (spam, viruses, illicit content), preventing them from reaching the intended end user(s), so it is important to read the terms of service to avoid this issue.
Usually, changing the file extension is not a solution: each archive file has a well defined internal structure (which is required for the file to function, so it can hardly be cloaked), and file format recognition is seldom based on simply parsing the file extension.
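Purely as an illustration of signature-based recognition (not how any specific service implements its filtering), a minimal Python sketch can detect common archive formats from their leading "magic" bytes instead of the extension; the input file name is a hypothetical example:

# Minimal sketch: recognize common archive formats by their leading
# signature bytes instead of the file extension.
SIGNATURES = {
    b"PK\x03\x04": "ZIP",
    b"Rar!\x1a\x07": "RAR",
    b"7z\xbc\xaf\x27\x1c": "7Z",
    b"\x1f\x8b": "GZip",
}

def detect_archive_format(path: str) -> str:
    with open(path, "rb") as f:
        header = f.read(8)  # enough bytes for the signatures above
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"

print(detect_archive_format("example.zip"))  # hypothetical file name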
In other cases, all encrypted files, or all files of unknown/unsupported formats, are blocked because the service provider is not able to inspect them / scan them for viruses.
Keep archive size under a mandatory maximum size
To meet maximum size constraints (e.g. an e-mail attachment limit or the size of physical media) you can divide the output into volumes of the desired size (file spanning), progressively numbered .001, .002, ... .nnn, so the receiver can extract the whole archive, usually by saving all volumes in the same path and starting extraction from the .001 file.
File splitting is the simplest and most reliable way to fit within a mandatory output size, rather than trying to improve the compression ratio with slower / heavier algorithms and settings in the hope of reaching the desired target size - which may not even be possible despite the speed penalty.
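As a rough illustration of file spanning (not PeaZip's own splitting routine), the following Python sketch splits a file into numbered volumes of a chosen size; the file name and volume size are example values, and the receiver simply concatenates the .001, .002, ... parts back in order:

# Minimal sketch of file spanning: split an archive into fixed-size
# volumes numbered .001, .002, ...
def split_into_volumes(path: str, volume_size: int) -> None:
    part = 1
    with open(path, "rb") as src:
        while True:
            chunk = src.read(volume_size)
            if not chunk:
                break
            with open(f"{path}.{part:03d}", "wb") as dst:
                dst.write(chunk)
            part += 1

split_into_volumes("backup.zip", 5 * 1024 * 1024)  # hypothetical file, 5 MB volumes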
The following blocks discuss the factors that most influence compression efficiency, which of them deserve more weight and attention when choosing the best compression strategy, and options / tips & tricks to obtain the best results.
More suggestions can be found in the compression algorithms comparison, entropy, and maximum e-mail attachment size articles on Wikipedia.
Evaluate the need for high compression formats and settings
The highest compression ratios are usually attained with slower, more computationally intensive algorithms: RAR compression is slower and more powerful than ZIP compression, 7Z compression is slower and more powerful than RAR, and PAQ / ZPAQ outperforms the other algorithms in terms of maximum compression ratio while requiring even more computing power.
See the file compression formats comparison and compression benchmarks for a comparison of the strongest compression algorithms, and the impact of different file archiving formats on speed and compression ratio.
Different data types may lead to different results with different compression algorithms: for example, the weaker RAR and ZIPX compression can close the gap with the stronger 7Z compression when multimedia files are involved, thanks to efficiently optimized multimedia filters employed by RAR and ZIPX when suitable data structures are detected - although lossy-compressed multimedia files remain poorly compressible in any case.
Switching to a more powerful algorithm is usually more effective at improving the compression ratio than using the highest compression settings of a weaker algorithm.
It is suggested to evaluate carefully whether better compression is really needed (after deduplication and after evaluating poorly compressible files), or whether the archive is mainly created for reasons other than decreasing file size, e.g. applying encryption, handling the content as a single file, etc.
If time is critical, speed should be the primary factor to take into account, and the fastest available algorithms should be preferred, such as zlib's Deflate (GZip, ZIP, Zopfli), Brotli, or Zstandard.
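To see the speed / ratio trade-off on your own data, a rough benchmark sketch using Python's standard zlib (Deflate, as used by ZIP and GZip) and lzma (LZMA, as used by 7Z) modules can be used; the input file name is a hypothetical example and results depend entirely on the data:

# Rough sketch comparing a fast algorithm (Deflate) with a stronger,
# slower one (LZMA) on the same data.
import lzma
import time
import zlib

with open("sample_data.bin", "rb") as f:  # hypothetical input file
    data = f.read()

for name, compress in (
    ("zlib level 9 (Deflate)", lambda d: zlib.compress(d, 9)),
    ("lzma default (LZMA)", lambda d: lzma.compress(d)),
):
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: ratio {len(packed) / len(data):.3f}, {elapsed:.2f}s")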
Identify poorly compressible files
Evaluate whether it is worth spending time compressing poorly compressible data or, rather, simply storing it "as is". Some data structures contain high levels of entropy, or entropy is introduced by previous processes such as encryption or compression - making further compression efforts difficult or even useless; computing power would be more productively spent reducing the size of other types of files, leading to both better results and faster operation.
Multimedia files (MP3, JPG, MPEG, AVI, DIVX...) tend to be poorly compressible, as those formats already feature lossy compression, and, especially videos, are usually very large compared to other file types (documents, applications), so it should be evaluated carefully whether they should be compressed at all - it is recommended to use the "Store" option for compression level, provided by most file archivers, which disables compression (fastest, as speed is bound only by disk copy performance) - or even to copy them "as is" without passing them to the compressor application at all.
For best practices to reduce disk usage of graphic files (JPEG, PNG, TIFF, BMP) see how to optimize compression of images for tips and tricks.
Some document formats (PDF, Open Office, and Microsoft Office 2007 and later file formats), and some databases, are already compressed (usually with fast Deflate-based lossless compression), so they generally do not compress well.
Archive files (7Z, RAR, ZIP...) are already compressed and cannot be compressed further to any useful extent (gains will be small, if any), but archives can be converted (extracted to the original uncompressed form, and then re-compressed) to a format providing a better compression ratio.
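As a simple sketch of such a conversion (one possible approach, not PeaZip's internal conversion procedure), an existing ZIP archive could be unpacked and re-compressed as tar.xz with Python's standard library; the archive and folder names are example values:

# Sketch of converting an existing archive to a stronger format:
# extract a ZIP, then re-compress its content as tar.xz.
import tarfile
import zipfile

with zipfile.ZipFile("old_archive.zip") as zf:  # hypothetical input archive
    zf.extractall("unpacked/")

with tarfile.open("old_archive.tar.xz", "w:xz") as tf:
    tf.add("unpacked/", arcname=".")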
Encrypted data is not compressible at all: being pseudo-random, there is no "shorter way" to represent the information carried in encrypted form, so attempting to compress encrypted files is not recommended.
Separating poorly compressible data from the rest is a good starting point for defining a compression policy and deciding the best strategy for handling each type of data.
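One simple way to sort files into "compress" and "store" groups is a quick compressibility probe: Deflate-compress a small sample of each file and check how much it shrinks. The Python sketch below illustrates the idea; the sample size, the 0.95 threshold, and the file names are arbitrary example values, not recommended settings:

# Sketch of a compressibility probe: files whose sample barely shrinks
# (already-compressed media, archives, encrypted data) are candidates
# for the "Store" setting.
import zlib

def looks_compressible(path: str, sample_size: int = 1 << 20, threshold: float = 0.95) -> bool:
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if not sample:
        return False
    ratio = len(zlib.compress(sample, 6)) / len(sample)
    return ratio < threshold

for name in ("report.txt", "movie.avi"):  # hypothetical files
    print(name, "compress" if looks_compressible(name) else "store")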
Evaluate solid compression advantages
Solid compression, available as an option for some archival formats like 7Z and RAR, can improve the final compression ratio: it works by providing a wider context in which the compression algorithm can find data redundancy and represent it more compactly, reducing the output file size.
But the context information is also needed during extraction, so extracting from a solid archive requires more time to parse all the relevant context data (the so-called "solid block") and can be significantly slower than extracting from a non-solid archive.
7Z allows choosing the block size used for solid mode operation (the "window" of data context used by the algorithm) to minimize this overhead, but a smaller block also slightly reduces the compression ratio improvement.
Applying XZ, Brotli, Bzip2, GZip, or ZSTD compression to a TAR archive is a two-step equivalent of solid mode compression.
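The two-step nature can be made explicit with a short Python sketch: first pack the files into a single uncompressed TAR stream, then compress that whole stream with XZ, so the compressor sees one continuous block of data (solid-style). Folder and file names are hypothetical examples:

# Step 1: archive only (no compression); step 2: compress the whole stream.
import lzma
import shutil
import tarfile

with tarfile.open("documents.tar", "w") as tf:
    tf.add("documents/")  # hypothetical input folder

with open("documents.tar", "rb") as raw, lzma.open("documents.tar.xz", "wb") as xz:
    shutil.copyfileobj(raw, xz)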
Choose carefully whether the intended use of the compressed data calls for high compression / solid compression: the more often the data needs to be extracted, the more times the computational overhead will apply for each end user.
For example, software distribution benefits greatly from maximum compression, as saving bandwidth is critical and the end user usually extracts the data only once, while the overhead may not be acceptable if the data needs to be accessed often and the fastest extraction time becomes a decisive efficiency advantage.
You usually don't need to archive duplicate files
A very obvious suggestion is to remove duplicate identical files (deduplication) whenever it is advisable, in order to avoid archiving redundant data.
Identifying and removing duplicate files before archiving decreases the input size, improving both operation time and the final size, and at the same time makes it easier for the end user to navigate/search a tidier archive. Do not remove duplicate files if they are strictly needed in their original path, e.g. by a piece of software or an automated procedure.
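As an illustrative sketch (separate from PeaZip's own duplicate finder), duplicates can be found by hashing every file under the input folder and reporting files whose digest has already been seen; the folder name is a hypothetical example, and the sketch only reports duplicates rather than deleting them:

# Sketch of finding duplicate files before archiving, using SHA-256.
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

seen = {}
for path in Path("to_archive").rglob("*"):  # hypothetical input folder
    if path.is_file():
        digest = file_digest(path)
        if digest in seen:
            print(f"duplicate: {path} == {seen[digest]}")
        else:
            seen[digest] = path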
Zeroing free space on virtual machines and disk images to remove non-meaningful information
The Zero delete function (File tools submenu) is intended for overwriting file data or free partition space with an all-zero stream, in order to fill the corresponding physical disk area with homogeneous, highly compressible data.
This saves space when compressing disk images, both low-level physical disk snapshots made for backup purposes and Virtual Machine guest virtual disks, as the 1:1 exact copy of the disk content is not burdened with leftover data in the free space area - some disk imaging utilities and Virtual Machine players/managers have built-in compression routines, and zeroing free space beforehand is strongly recommended to improve their compression ratio as well.
Zero deletion also offers a basic degree of security improvement over PeaZip's "Quick delete" function, which simply removes the file from the filesystem, making it not recoverable from the system's recycle bin but still susceptible to recovery with undelete utilities. Zero deletion, however, is not meant for advanced security, and PeaZip's Secure delete should be used instead when a file needs to be securely and permanently erased, or free space on a volume needs to be sanitized for privacy reasons.
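For reference, the general technique of zero-filling free space from inside a mounted filesystem can be sketched as follows (this is not PeaZip's implementation): write zeros until the volume is full, then delete the filler file so the freed blocks contain only highly compressible zeros. The target path is a hypothetical mount point, and this should be run inside the guest or on the mounted image, not on the host system disk:

# Rough sketch of zero-filling free space on a mounted volume.
import os

def zero_fill_free_space(target_dir: str, chunk_size: int = 1 << 20) -> None:
    filler = os.path.join(target_dir, "zerofill.tmp")
    zeros = b"\x00" * chunk_size
    try:
        with open(filler, "wb", buffering=0) as f:
            while True:
                f.write(zeros)
    except OSError:
        pass  # disk full: the free blocks have been overwritten with zeros
    finally:
        if os.path.exists(filler):
            os.remove(filler)

zero_fill_free_space("/mnt/guest_disk")  # hypothetical mount point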
Learn more about optimizing virtual machines and disk images compression.
Impact of using self-extracting archives
Self-extracting archives are useful to provide the end user with the appropriate extraction routines without the need to install any software, but since the extraction module is embedded in the archive it represents an overhead of some tens or hundreds of KB, which is a noticeable disadvantage only for very small archives (e.g. roughly less than 1 MB) - a size range, however, that is typical for an archive of a few text documents. Moreover, since a self-extracting archive is an executable file, some file sharing platforms, cloud providers, and e-mail servers may block it, preventing it from reaching the intended receiver(s).
Synopsis: How to optimize file compression ratio and speed. Best settings and options to improve file archiving and compression efficiency. Suggestions and best practices for maximum data compression performance.
Topics: how to optimize compression of files, what are the best compression options