What is file archiving
File archiving means combining multiple files into a single file for easier management of the data (e.g. backup, or sharing by email attachment, FTP, torrent, cloud, or any other kind of network service). The host filesystem then treats all the data as a single file rather than as multiple ones, which eliminates the overhead of handling multiple objects: for each single file, locating the physical data on disk, locating possible fragments, checking file-level security permissions, and so on.
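As a minimal sketch of the idea above, the following Python snippet packs two small in-memory "files" into one uncompressed TAR archive using the standard `tarfile` module. The file names and contents are made up for illustration, and `make_archive` / `list_archive` are hypothetical helper names, not part of any archiver.

```python
import io
import tarfile


def make_archive(files):
    """Pack a dict of {name: bytes} into a single uncompressed TAR archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # the header must declare the member size
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_archive(blob):
    """Return the member names stored inside a TAR archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as archive:
        return archive.getnames()


# Hypothetical sample data: two small "files" kept in memory.
files = {"notes.txt": b"hello", "data.csv": b"a,b\n1,2\n"}
blob = make_archive(files)   # one object to store, send, or back up
names = list_archive(blob)   # the individual members are still addressable
```

The result is a single byte string that the filesystem or a network service can handle as one object, while the archive itself still records each member's name and size.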
The idea of archiving files pre-dates the .zip format by many years, as in the Unix AR format (later superseded by TAR, released in 1979 and standardized in 1988) and the LBR format in the CP/M / DOS world in the early '80s.
What is file compression
File compression means reducing the size of data on disk by encoding it to a smaller output, employing various strategies to efficiently map (in most cases) a larger input to a smaller output, e.g. using statistical analysis to reduce redundancy in the input data.
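The role of redundancy can be illustrated with a short Python sketch using the standard `zlib` module (a Deflate implementation). The two inputs are arbitrary test data chosen for this example: one highly repetitive, one random.

```python
import os
import zlib

# Two inputs of the same size: one highly redundant, one with almost no redundancy.
repetitive = b"ABCD" * 25_000       # 100,000 bytes of a repeating pattern
random_like = os.urandom(100_000)   # 100,000 bytes of random data

packed_repetitive = zlib.compress(repetitive)
packed_random = zlib.compress(random_like)

# Compressed size divided by original size (lower is better).
ratio_repetitive = len(packed_repetitive) / len(repetitive)
ratio_random = len(packed_random) / len(random_like)

print(f"repetitive input: {ratio_repetitive:.4f}")
print(f"random input:     {ratio_random:.4f}")
```

The repetitive input shrinks to a tiny fraction of its size, while the random input does not shrink at all, which is why the definition above says "in most cases": a compressor can only exploit redundancy that is actually present.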
Data compression, too, predates the development of the ZIP standard: once the input files were merged into a single output archive, the operation was often followed by lossless data compression to reduce the size of the archive, using various utilities available at the time such as SQ (DOS, CP/M), CRUNCH (CP/M), and compress (Unix).
The TAR format, for example, is still an uncompressed archive standard and relies on external compressors, nowadays usually GZ (fast Deflate-based compression, the same as in the ZIP format), BZ2 (more powerful compression), XZ (modern, very powerful LZMA-based compression, the default compression algorithm used in the 7Z format), BR (Google's Brotli, a modern, very fast compressor), and ZST (Facebook's Zstandard, another modern, very fast compressor).
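The pairing of TAR with an external compressor can be sketched with Python's standard `tarfile` module, which supports the GZ, BZ2 and XZ variants through its `mode` argument. Brotli and Zstandard are left out here because this sketch sticks to modules that ship with every recent Python 3 release; the sample payload and the `make_tar` helper are assumptions made for illustration.

```python
import io
import tarfile


def make_tar(payload, compressor=""):
    """Build a one-member TAR archive in memory, optionally compressed.

    `compressor` is "" (plain TAR), "gz", "bz2" or "xz".
    """
    mode = "w:" + compressor if compressor else "w"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as archive:
        info = tarfile.TarInfo(name="sample.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Hypothetical, deliberately repetitive sample text.
payload = b"the quick brown fox jumps over the lazy dog\n" * 2000

sizes = {name: len(make_tar(payload, name)) for name in ("", "gz", "bz2", "xz")}
for name, size in sizes.items():
    print(f"{name or 'tar (uncompressed)':>20}: {size} bytes")
```

The archive layout is the same in every case; only the outer compression layer changes, which is what lets TAR adopt newer compressors without changing the archive format itself.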
Learn more about similarities and differences in the lossy and lossless data compression paragraph. For general-purpose compressed archive files, however, compression means lossless compression: a reversible 1:1 mapping of the input to a smaller output.
SEA's ARC format (1985) combined archiving and (lossless) compression in a single pass, providing probably the first example of a general-purpose archive manager. It allowed users both to save storage space for backups and to save upload and download bandwidth (and time) when sharing files, at the time mainly through BBSes.
A few years later, after a controversy with SEA about PKARC allegedly being a derived work, Phil Katz superseded the previous works by releasing PKZIP, which achieved great success due to multiple factors: superior speed and efficiency, specifications released into the public domain, and relatively few competitors in years of fast PC market expansion.
How lossy and lossless compression work
Data compression can be defined as lossy or lossless, depending on whether the compression process is reversible, i.e. whether the original information is lost or preserved in the process. The two types of algorithms have different pros and cons, and different fields of application.
Lossless compression definition, file archiving
Lossless compression uses statistical models to map the input to a smaller output, eliminating redundancy in the data.
In this way the output carries exactly all the information contained in the input in fewer bytes, and can be expanded when needed to a 1:1 copy of the original data (restoring exactly the original content), which is a fundamental property for storing some types of data, e.g. software or a database.
For this reason lossless compression algorithms are used for data backup and for the archive file formats used in general-purpose archive manager utilities, like 7Z, RAR, and ZIP, where an exact and reversible image of the original data must be saved.
Examples of lossless compression algorithms are Deflate (used e.g. for the ZIP and GZ formats), BZip2 (used in the BZ2 format), PPMd (RAR, 7Z formats), and LZMA / LZMA2 (7Z / XZ formats).
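The 1:1 round trip can be checked directly with three of the algorithm families named above, using Python's standard `zlib`, `bz2` and `lzma` modules. This is only a sketch: these modules produce their own stream formats (for instance `zlib.compress` wraps Deflate data in a zlib container), not ZIP or 7Z archives, and the sample text is made up.

```python
import bz2
import lzma
import zlib

# Hypothetical sample input.
original = b"Lossless compression must restore every byte exactly.\n" * 500

codecs = {
    "deflate (zlib)": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "lzma (xz)": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(original)
    restored = decompress(packed)
    # Store the compressed size and whether the round trip was byte-exact.
    results[name] = (len(packed), restored == original)

for name, (size, identical) in results.items():
    print(f"{name:>15}: {len(original)} -> {size} bytes, identical={identical}")
```

Whatever the algorithm and however small the output, decompression yields a byte-for-byte copy of the input.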
Lossless compression is fully invertible, as a 1:1 copy of the original input can be recovered from the smaller, efficiently encoded output, so it is usually suitable for backup, file archiving, and other applications where no loss of information can be tolerated.
Some graphic file formats (notably PNG and Deflate-compressed TIFF) use lossless compression, which usually results in less compression but no image quality degradation after multiple cycles of modifying and saving the picture, making this kind of image format suitable for intermediate save files in image editing tools.
Lossy compression definition, multimedia data compression
Lossy compression, instead, works by identifying unnecessary or less relevant information (not just redundant data) and removing it.
Unlike lossless compression, the amount of information to compress is effectively reduced.
The loss of information / content is irreversible and, depending on the nature of the algorithm, will likely happen each time the content is modified and saved to a lossy file format, e.g. when editing a lossy JPEG image and saving it multiple times to intermediate work files.
In this way the compression ratio is improved, but at the cost of making lossy compression a non-reversible process, because part of the information is discarded. It is therefore a suitable choice only when, by design, the original content is not meant to be restored exactly.
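A toy illustration of the trade-off, not a model of how JPEG or MP3 actually work: the sketch below "compresses" 8-bit samples by coarse quantization (keeping only 16 of the 256 possible values) and then applies lossless Deflate to both versions. The random sample data, the step size and the `quantize` helper are assumptions chosen for this example.

```python
import random
import zlib


def quantize(samples, step=16):
    """Lossy step: round each 8-bit sample down to a multiple of `step`."""
    return bytes((value // step) * step for value in samples)


# Hypothetical "signal": 10,000 pseudo-random 8-bit samples (fixed seed).
rng = random.Random(42)
original = bytes(rng.randrange(256) for _ in range(10_000))

lossy = quantize(original)

# Many different inputs now map to the same output value,
# so the original samples cannot be recovered from `lossy`.
distinct_before = len(set(original))
distinct_after = len(set(lossy))

size_original = len(zlib.compress(original))
size_lossy = len(zlib.compress(lossy))

print(f"distinct values: {distinct_before} -> {distinct_after}")
print(f"deflated size:   {size_original} -> {size_lossy} bytes")
```

Discarding the fine detail makes the data far more compressible, but no decoder can tell which of the 16 original values each quantized sample came from. Real lossy codecs choose what to discard based on perception rather than simple rounding.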
Lossy compression is consequently not suitable for general-purpose file archiving (for example, losing a single byte of an executable file could make it stop working), but it works very well when discarding less relevant information is acceptable, as in graphic and multimedia file compression: for example, MP3 discards audio information below the audibility threshold, JPEG discards details that are not visible in images, and compressed video formats such as MPEG (commonly found in AVI, MKV, MPG, MP4... files) do both.
Information loss is destructive for the 1:1 reversibility of the algorithm (the information is permanently lost), but it does not impair the ability of end users to receive meaningful information: intelligible audio, a clear picture or video.
The most common lossy compression algorithms are consequently fine-tuned for the specific patterns of one multimedia data type.
For this very same reason, file types compressed with lossy algorithms will not compress well (or at all) if added to archive files compressed with general-purpose compression algorithms: already compressed files compress poorly, if at all.
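The last point can be checked with a small Python sketch that runs Deflate (via the standard `zlib` module) twice over the same data. Lossless Deflate output is used here as a stand-in for an already compressed media file, and the word list and the fixed random seed are assumptions made only to generate repeatable sample text.

```python
import random
import zlib

# Hypothetical sample: 20,000 words drawn from a small vocabulary.
vocabulary = [
    "archive", "file", "compress", "lossless", "lossy", "data", "format", "zip",
    "backup", "stream", "byte", "ratio", "codec", "volume", "header", "index",
]
rng = random.Random(0)
text = " ".join(rng.choice(vocabulary) for _ in range(20_000)).encode("ascii")

first_pass = zlib.compress(text)          # compress the raw text
second_pass = zlib.compress(first_pass)   # try to compress the compressed data

# Fraction of the size saved by each pass (can be slightly negative).
gain_first = 1 - len(first_pass) / len(text)
gain_second = 1 - len(second_pass) / len(first_pass)

print(f"first pass saves  {gain_first:.1%}")
print(f"second pass saves {gain_second:.1%}")
```

The first pass removes most of the redundancy, so the second pass has almost nothing left to work with and may even make the data marginally larger. The same happens when JPEG, MP3 or video files are added to a compressed archive.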
Due to the lossy nature of those compression schemes, however, professional editing work is usually performed, whenever feasible, on uncompressed data (e.g. WAV audio or TIFF images) or on data compressed in a lossless way (e.g. FLAC audio or PNG images), so that saving the work in progress multiple times does not lose bits of information each time, with progressive degradation of quality. Lossy compression is usually reserved for the final step, to create a reasonably sized output to distribute for media consumption.
Lossy vs lossless compression
Lossy and lossless compression algorithms are so different in scope that they cannot really be put in direct competition.
When the original content needs to be restored completely on decompression (binary files, raw data), lossless, fully reversible compression is the only option. When some degree of data loss is acceptable (e.g. finalizing work on multimedia files such as MP3 audio, MPEG video, JPEG graphics), the advantages of lossy compression over lossless compression in terms of speed and maximum compression ratio are generally so evident that lossy, non-reversible compression is the only viable choice to meet size and/or performance constraints.
Read the lossless compression and lossy compression definitions on Wikipedia.
What is a ZIP file
The ZIP format is a lossless data compression and archival format created in 1989 by Phil Katz, implemented for the first time in PKWARE's PKZIP.
The ZIP file format specifications were released into the public domain, and the format had long and lasting success, to the point that "zip" is often used colloquially for any generic compressed archive, and many package formats are based on Deflate compression and/or the same or very similar specifications: Java JAR / WAR / EAR, Android APK, Apple iOS IPA files (iPhone and iPad devices), Microsoft CAB and Office compound files.
WinZip 12.1 (2009) introduced the new ZIPX file format specifications, identifying a new archive standard that supports newer and more powerful compression algorithms.
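As a minimal sketch, a Deflate-compressed ZIP archive can be created and inspected with Python's standard `zipfile` module. The member names and contents below are made up for illustration, and the archive is kept in memory rather than written to disk.

```python
import io
import zipfile

buffer = io.BytesIO()

# Write two hypothetical members, compressed with Deflate.
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("readme.txt", "This is a sample text member.\n" * 200)
    archive.writestr("data/values.csv", "id,value\n" + "1,42\n" * 500)

# Reopen the archive and read back what it recorded about each member.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    members = {
        info.filename: (info.file_size, info.compress_size)
        for info in archive.infolist()
    }
    intact = archive.testzip() is None  # None means every CRC check passed

for name, (size, packed) in members.items():
    print(f"{name}: {size} -> {packed} bytes")
print("archive intact:", intact)
```

The same module can open many of the ZIP-based package formats mentioned above, since a file such as a JAR is a ZIP archive with additional conventions about its contents.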
What are RAR, ACE, 7Z files
During the '90s and beyond, multiple alternative archival standards emerged, such as ARJ, RAR (1993), ACE, and 7Z (1999), introducing unique features to distinguish them from the growing number of competitors, for example:
- usually, a stronger compression ratio than ZIP at the cost of slower operation, a disadvantage that would be paid off by the faster transfer time (especially on slow and public networks) of the smaller output file
- multi-volume archival, spanning the output over multiple files to meet constraints such as mail attachment size limits
- encryption, to protect the end user's privacy if the file is stolen or passes through insecure servers (an unencrypted public network, or any third-party controlled channel such as a mail server or remote storage service)
- error detection and error correction (as implemented in ARC and RAR formats), to prevent extraction in the event that data gets corrupted (e.g. faulty connection, damaged disk) and to attempt recovery from known good data.
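Two of the features listed above, multi-volume spanning and error detection, can be sketched together in a few lines of Python. The volume layout below (a CRC-32 checksum paired with each chunk) is a hypothetical scheme invented for this example and does not correspond to the on-disk format of RAR, 7Z or any other archiver; error correction, which needs extra recovery data, is not covered.

```python
import zlib


def split_volumes(blob, volume_size):
    """Split `blob` into (crc32, chunk) pairs of at most `volume_size` bytes."""
    volumes = []
    for start in range(0, len(blob), volume_size):
        chunk = blob[start:start + volume_size]
        volumes.append((zlib.crc32(chunk), chunk))
    return volumes


def join_volumes(volumes):
    """Reassemble the volumes, refusing to proceed if any checksum mismatches."""
    parts = []
    for index, (checksum, chunk) in enumerate(volumes):
        if zlib.crc32(chunk) != checksum:
            raise ValueError(f"volume {index} is corrupted")
        parts.append(chunk)
    return b"".join(parts)


blob = bytes(range(256)) * 40            # 10,240 bytes of sample data
volumes = split_volumes(blob, 4096)      # e.g. to fit a size limit
restored = join_volumes(volumes)

# Simulate a transfer error: flip one bit in the second volume.
checksum, chunk = volumes[1]
damaged_chunk = bytearray(chunk)
damaged_chunk[0] ^= 0x01
damaged = [volumes[0], (checksum, bytes(damaged_chunk)), volumes[2]]

try:
    join_volumes(damaged)
    detected = False
except ValueError:
    detected = True  # extraction is refused instead of yielding bad data
```

Storing a checksum per volume means a single damaged part can be identified and re-downloaded on its own, rather than discarding the whole archive.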
Evolution of file archiving formats
Archival file formats tend to be geared more towards powerful, computing-intensive features that enhance the manageability of data (high compression, strong encryption) than towards the ability to work on live data (rapid read and write access) as filesystems do, even if some archive management utilities offer various mechanisms to add or remove data inside archives, and to update or sync files already in the archive.
More choice in standards brought users more features and healthy competition between standards (see comparison of archive formats) and implementations, but also brought the need for flexible multi-purpose archival applications, like PeaZip, to deal with the different formats users may encounter and to make full use of the features of the different supported file formats.
In recent years the trend has been shifting toward very fast and highly efficient compression algorithms, aiming to minimize the compression and decompression overhead in data transmission and achieve near real-time speed, with algorithms like Google's Brotli (BR file format) and Facebook's Zstandard (ZST file format).
The definition of compressed / archive file is broadening, with package and disk image format standards implementing native compression (and in some cases even encryption) features: Microsoft CAB and WIM, Apple DMG.
More online resources about file archiving and compression formats: AR (Unix), LBR (CP/M), SEA company, WinZip's ZIPX standard, Google Brotli, and Facebook Zstandard project pages.