It is a common experience that not all file types compress with equal
efficiency: some data structures, like text files or text-based
containers (spreadsheets, documents, databases), usually compress
well, reaching a high compression ratio, while other formats, usually
multimedia files, compress poorly and reach a very low compression
ratio no matter how powerful (and slow) a compression method is used.
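The difference described above can be observed with a minimal sketch
that uses only Python's standard zlib module (a Deflate implementation).
The sample data and the helper name compression_ratio are assumptions
made for illustration; random bytes stand in for encrypted or already
compressed content.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, level)) / len(data)


# Highly redundant text, similar to a plain-text document.
text = b"The quick brown fox jumps over the lazy dog. " * 2000
# Random bytes behave like encrypted or already compressed data.
noise = os.urandom(len(text))

print(f"text : {compression_ratio(text):.3f}")
print(f"noise: {compression_ratio(noise):.3f}")
```

The redundant text shrinks to a small fraction of its size, while the
random block stays at roughly its original size or grows slightly
because of container overhead.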
File types which usually
compress well
Text, uncompressed documents, spreadsheets, and databases (doc,
xls...), uncompressed images (bmp, tiff), audio, and video - especially
if computer generated or containing low noise from the analog source.
File types which
usually don't compress well
Compressed file archives (7z, rar, zip...), compressed documents,
spreadsheets, and databases (docx, xlsx...), pdf files (which usually
are Deflate-compressed), lossy or lossless compressed images (jpeg,
png), audio (mp3), and video (mp4, divx, mkv) - generally any
multimedia file from a noisy analog source.
File types which
usually do not compress at all
Encrypted files and disk images.
Most common reasons
why a file can't be efficiently compressed:
The file contains lots of non-significant
or duplicate data, or noise
Some data in the file may not be relevant and should be removed (using
the editing tools appropriate for the input file type) before
archiving the data, in order to reduce disk usage without losing
actual information.
Some miscellaneous
examples of non-meaningful data in files
- Disk images /
virtual machine disks
(QCOW2, VMDK, Microsoft VHD, VDI) often contain data which should be
cleared, such as recycle bin files, temporary files, RAM content (if
the machine was stopped rather than shut down), and data of erased
files still written on disk sectors if free space has not been
overwritten with zeros. This data is not counted as disk usage when
running the image, because the sectors are marked as free space after
deletion.
- Databases
can contain deleted entries until they are purged with the
appropriate actions for the specific database format; also, running
proper data deduplication routines is good practice before archiving /
backing up the database.
- Backups
should be checked to reduce occurrences of duplicate or temporary
files; an incremental backup strategy (saving only files modified
between two backup checkpoints) can be employed to dramatically
reduce the size of backup sets.
- Document
files (MS Office DOCX, Adobe Acrobat PDF...) can contain data of
obsolete revisions, or multimedia data not matching the resolution of
the document, which should be resampled to a more appropriate
definition / size either in the source multimedia files or during
document editing.
- Even
if uncompressed, graphic and multimedia files can contain a high level
of entropy in graphic/video and audio data (e.g. BMP, RAW, WAV and AVI
files), which should be carefully taken into account: for example, a
poor quality photo sensor or an inadequate microphone may lead to a low
signal-to-noise ratio, which leads both to a poor result and to poorly
compressible data.
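To get a rough idea of how noisy a block of data is before archiving
it, its byte-level Shannon entropy can be estimated. The sketch below is
an illustration only: the function name shannon_entropy is an
assumption, and this zero-order estimate looks at byte frequencies
alone, so it does not detect repeated patterns that a real compressor
would exploit.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Estimate entropy in bits per byte (0.0 = constant, 8.0 = uniform)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of p * log2(1/p) over every byte value that occurs.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


zero_filled = bytes(4096)      # e.g. free space overwritten with zeros
noisy = os.urandom(65536)      # stand-in for sensor noise or encrypted data

print(f"zero-filled block: {shannon_entropy(zero_filled):.2f} bits/byte")
print(f"noisy block      : {shannon_entropy(noisy):.2f} bits/byte")
```

Values close to 8 bits per byte suggest that general purpose
compression will gain little, while low values indicate data that
should compress well.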
How to reduce
size of data before compression
It is recommended to use specific editing tools for the format or the
scope before starting actual compression.
The free PeaZip archiver can help with some of those tasks, providing
tools to overwrite free space
with zeros, find and remove
duplicate files,
and resize, convert
and re-compress graphic files.
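As an illustration of the duplicate-finding step, the following sketch
groups files by the SHA-256 hash of their content. It is not PeaZip's
implementation; the function names file_digest and find_duplicates are
hypothetical and only the Python standard library is used.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: str) -> list[list[Path]]:
    """Return groups of files under root that have identical content."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[file_digest(path)].append(path)
    # Keep only hashes shared by two or more files.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling find_duplicates on a folder returns lists of paths with the
same content, so that all but one copy in each group can be left out of
the archive.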
The file is
already compressed
Some file formats are designed with built-in compression, so no further
compression is possible (or practical).
File size is already efficiently reduced by design, often employing
dedicated compression types (lossy compression is common in compressed
multimedia files) that are more efficient than general purpose
compression algorithms designed to work on a wide variety of data
structures.
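The effect of compressing already compressed data can be reproduced
with a short sketch based on Python's zlib module. The vocabulary and
the generated sample text are assumptions chosen for illustration; the
point is only to compare the sizes after one and two compression
passes.

```python
import random
import zlib

# Build a reproducible pseudo-text from a small, made-up vocabulary.
rng = random.Random(0)
vocabulary = ["archive", "backup", "file", "data", "image", "volume",
              "disk", "size", "format", "stream", "block", "header",
              "index", "entry", "path", "name"]
original = " ".join(rng.choice(vocabulary) for _ in range(20000)).encode("ascii")

once = zlib.compress(original, 9)   # first pass removes the redundancy
twice = zlib.compress(once, 9)      # second pass has almost nothing left to remove

print("original:", len(original))
print("once    :", len(once))
print("twice   :", len(twice))
```

The first pass reduces the size considerably, while the second pass
leaves the size essentially unchanged, since the output of the first
pass already looks like random data to the compressor.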
Examples of
already
compressed file formats
Graphic files such as JPEG and PNG will usually compress poorly
compared to uncompressed bitmap graphics such as RAW, BMP and
uncompressed TIFF files (read more on how to
optimize compression of images and media files).
Audio files
such as MP3 will usually compress very poorly compared to
uncompressed audio formats like WAV files.
Multimedia
video files
like AVI,
DIVX, MPEG, MKV... are very poorly compressible with the general
purpose lossless compression algorithms used in file archiving / backup
software, and need to be re-encoded (possibly from the highest possible
quality source) with a more efficient lossy audio/video algorithm
- e.g. encoding uncompressed AVI with H.264, H.265, or H.266 to MPEG-4
/ MP4, MKV or WebM format, rather than the obsolete MPEG-1 or MPEG-2
standards.
Some
document types such as Adobe Acrobat PDF, Open Office formats, and the
newer Microsoft
Office file formats (DOCX, XLSX, PPTX...), and some databases, already
contain compressed data (usually employing lossless
compression designed around the Deflate algorithm), and generally do
not compress
well compared to older uncompressed document formats like DOC, XLS,
PPT...
Please
note that a document or database is itself a container, so if it
stores compressed bitmap graphics and other multimedia data,
as a rule of thumb it will compress poorly compared to a file
of the same format containing plain text data. This is the reason why
even some uncompressed documents (DOC, XLS,
PPT) will compress less than other documents of the same type.
Possible
solutions
to compress already compressed files
- Extract
existing archives (ZIP, ACE, ...) and
recompress the content in a more powerful compression format, such as
RAR or 7Z: in this case the final archives will usually be smaller than
the original ones (depending on the nature of the input data), while
trying to compress the archives themselves will not provide comparable
results. PeaZip can automate extraction and re-compression of existing
archives with stronger compression, using its file
conversion tool.
- In some cases the built-in file format
compression is a trade-off between the need to reduce the final size of
the file and the need to quickly access the content of the file,
avoiding excessively powerful and complicated compression algorithms.
In those cases
(e.g. JPEG, PDF, DOCX, XLSX files...) applying a more powerful
compression (e.g. compressing to RAR or 7Z) can reduce the size of
those files.
- In the worst cases it is simply not possible to
further improve the current compression level of the files, and other
strategies should be considered, e.g. using utilities like PeaZip to
deduplicate files (search for
identical files to avoid archiving / backing up multiple copies of the
same data), or splitting the output archive into
multiple volumes of the desired size if the output must be kept
under mandatory maximum sizes (e.g. maximum
e-mail
attachment size, max
upload
file size, filesystem limits or built-in file size limitations,
etc). Also, in those cases it may be
useful to opt for lower compression settings (or no-compression
archiving, an option generally labelled as "store" compression level),
which will save time and computing power with no practical impact on
final file size.
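The first option in the list above (extract, then recompress with a
stronger method) can be sketched with the Python standard library. The
example uses tar.xz (LZMA) as a stand-in for a stronger archive format,
since RAR and 7Z are not supported by the standard library; the function
name repack_zip_as_tar_xz and the generated sample files are assumptions
made for illustration, not PeaZip's conversion tool.

```python
import io
import lzma
import random
import tarfile
import zipfile


def repack_zip_as_tar_xz(zip_bytes: bytes) -> bytes:
    """Extract every member of a ZIP and store it in a solid .tar.xz archive."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf, \
         tarfile.open(fileobj=out, mode="w:xz") as tar:
        for info in zf.infolist():
            if info.is_dir():
                continue
            data = zf.read(info)
            member = tarfile.TarInfo(name=info.filename)
            member.size = len(data)
            tar.addfile(member, io.BytesIO(data))
    return out.getvalue()


# Sample ZIP: ten text files sharing a large common part (made-up content).
rng = random.Random(1)
words = ["report", "total", "client", "invoice", "amount", "date", "notes", "item"]
shared = " ".join(rng.choice(words) for _ in range(3000)).encode("ascii")
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(10):
        unique = " ".join(rng.choice(words) for _ in range(500)).encode("ascii")
        zf.writestr(f"report_{i}.txt", shared + b" " + unique)
zip_bytes = buffer.getvalue()

recompressed_zip = lzma.compress(zip_bytes)     # compressing the archive itself
repacked = repack_zip_as_tar_xz(zip_bytes)      # extracting, then recompressing

print("original zip       :", len(zip_bytes))
print("zip compressed again:", len(recompressed_zip))
print("repacked as tar.xz :", len(repacked))
```

In this sample, compressing the ZIP file itself brings little benefit,
while extracting and repacking lets the stronger compressor work on the
original, redundant content. Actual gains depend on the nature of the
input data.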
The file is not
compressible at all
Encrypted
files are not
compressible, as they contain pseudo-random data - except, usually, for
a few
bytes in the file header. Plainly, there is
no shorter, more efficient way to represent the content of the file,
no matter how hard the compressor tries.
Examples of non
compressible file types
Encryption can be employed to protect many types of databases, backup
files, and some document formats such as PDF, and it is a common option
to protect archives. Archive formats supporting strong encryption
standards are 7Z, PEA,
RAR,
ZIP, and ZIPX.
Possible
solutions
to reduce space occupation of non compressible files
It is not possible to compress a single encrypted file, and switching
to more powerful compression algorithms / settings is of no use, but -
as in the previous cases - it
is possible to de-duplicate the content
of the archive / backup, removing
unnecessary duplicate copies of the same data and reducing the final
size as a consequence.
If even this strategy is not viable, it is possible to span (split) the output into multiple
volumes of the desired size, smaller than
the limits on transmission (email attachment, sharing, max upload size)
or
on storage (filesystem or archive format size limitations).
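Splitting the output into volumes of a fixed maximum size is simple to
sketch. The function name split_into_volumes is hypothetical, and the
example works on bytes in memory for brevity, whereas an archiver would
stream volumes to files with numbered extensions.

```python
def split_into_volumes(data: bytes, volume_size: int) -> list[bytes]:
    """Cut data into consecutive chunks of at most volume_size bytes."""
    if volume_size <= 0:
        raise ValueError("volume_size must be a positive number of bytes")
    return [data[i:i + volume_size] for i in range(0, len(data), volume_size)]


archive = b"x" * 2500                      # stand-in for an archive file
volumes = split_into_volumes(archive, 1000)

print([len(v) for v in volumes])
# Joining the volumes in order restores the original data.
print(b"".join(volumes) == archive)
```

Every volume except possibly the last has exactly the requested size,
and concatenating the volumes in order gives back the original data.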
An alternative approach is to decrypt the archive / backup image /
database before re-compressing it with a more powerful algorithm or
higher compression settings, and then re-applying encryption if
security is a concern.
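The reason why compression has to come before encryption can be shown
with a small sketch. The cipher below is a toy XOR keystream built from
SHA-256, written only to produce pseudo-random output for this
illustration: it is an assumption of this example, it is NOT secure, and
it is not the encryption used by any archive format.

```python
import hashlib
import zlib


def toy_stream_cipher(data: bytes, key: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream (toy cipher, NOT secure)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + counter).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


key = b"example key"
plain = b"customer record: name, address, order history\n" * 3000

# Wrong order: the compressor only sees pseudo-random bytes.
encrypt_then_compress = zlib.compress(toy_stream_cipher(plain, key), 9)
# Right order: redundancy is removed first, then the result is encrypted.
compress_then_encrypt = toy_stream_cipher(zlib.compress(plain, 9), key)

print("plain                :", len(plain))
print("encrypt then compress:", len(encrypt_then_compress))
print("compress then encrypt:", len(compress_then_encrypt))
```

Encrypting first leaves the compressor nothing to work with, so the
output stays about as large as the input, while compressing first and
encrypting afterwards keeps the size reduction.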
Synopsis: Why are some kinds
of files not compressible? Why does compressing some file formats (jpeg,
pdf, docx) result in poor compression? Why can't multimedia files (avi,
mp3, mpeg, mkv) be efficiently compressed in archives and backups? Why
can't zip files be further compressed? Why can't encrypted data be
compressed?
Topics: why some files
cannot be compressed, compressing media files, cannot compress already
compressed files, uncompressible files, random or encrypted