|
Data deduplication, the identification and (optionally) removal of duplicate content, reduces disk usage without loss of information, since the removed data still exists in other copies. It keeps the size of backups under control, possibly speeding up the process and saving space on backup media, and reduces the final size of compressed archives. Some compressors push the principle further and integrate mechanisms to identify and remove duplicate data blocks in order to improve the compression ratio. |
|
Search for duplicate files
When browsing a filesystem, the file browser can show the checksum or hash value of files on demand in the last column, making it possible to identify binary identical files, which share the same checksum/hash value.
Right-click the file manager column header and click the name of the function: PeaZip file manager will display the hash or checksum value for all (or selected) files. Click "Find duplicates" and PeaZip file manager will work as a duplicate finder utility, displaying size and hash or checksum value only for duplicate files (the same binary identical content found in two or more distinct files), and reporting the number of non-unique files identified.
|
In both cases, sorting by the CRC column groups all files (in the same folder, or matching the same search filter) with identical hash or checksum, making it easier to detect and, if necessary, remove binary identical files.
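The hash-based search described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not PeaZip's implementation: the names `file_digest` and `find_duplicates` are hypothetical, SHA-256 is an arbitrary default, and pre-grouping by file size is an added shortcut (files of different size cannot be identical).

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root, algorithm="sha256"):
    """Group files under root by (size, digest); keep groups of 2+ files."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            groups[(size, file_digest(path, algorithm))].append(path)
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
```

Each key of the returned dictionary is a (size, digest) pair, and each value lists the paths sharing that content, which mirrors the effect of sorting the file list by the checksum column.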
|
Set the algorithm to detect duplicates
The default verification function used to deduplicate files can be set in the main application menu (Organize, Browser, Checksum/hash). A wide selection of algorithms is available, ranging from simple checksum functions such as Adler32 and the CRC family (CRC16/24/32 and CRC64), to hash functions like eDonkey/eMule, MD4, and MD5, to cryptographically stronger hashes such as RIPEMD-160, SHA-1, SHA-2 (SHA256 and SHA512), SHA-3 (256 and 512 bit), BLAKE2s and BLAKE2b, and Whirlpool512.
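To illustrate how the choice of function changes the value being compared, the following Python sketch dispatches on an algorithm name. It is an assumption-laden example rather than PeaZip's code: `compute_value` is a hypothetical helper, only the algorithms shipped with Python's standard `zlib` and `hashlib` modules are covered, and names such as MD4, RIPEMD-160 or Whirlpool may or may not be available depending on the local OpenSSL build.

```python
import hashlib
import zlib

# Simple checksums from zlib; both accept a running value for streaming use.
CHECKSUMS = {"adler32": zlib.adler32, "crc32": zlib.crc32}


def compute_value(path, algorithm="crc32", chunk_size=1 << 20):
    """Return the checksum or hash of a file as a lowercase hex string."""
    name = algorithm.lower()
    with open(path, "rb") as handle:
        if name in CHECKSUMS:
            func = CHECKSUMS[name]
            value = func(b"")  # initial value: 1 for Adler32, 0 for CRC32
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                value = func(chunk, value)
            return f"{value:08x}"
        hasher = hashlib.new(name)  # raises ValueError for unknown names
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
        return hasher.hexdigest()
```

A 32 bit checksum is fast but far more likely to collide by chance than a 256 or 512 bit hash, so a match on CRC32 alone is best treated as a candidate duplicate rather than a proof.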
Detect duplicate files in archives
When browsing an archive, this on-demand verification is not available, but some archive types provide the same integrity-checking information, saving for each archived object a pre-computed checksum or hash value that depends on the archive format and on the archival settings employed (e.g. CRC32 in ZIP archives). This makes it possible to sort archive content by the CRC column to group identical files and find duplicates.
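For ZIP archives, the stored CRC32 can be read without extracting anything. The sketch below uses Python's standard `zipfile` module to group entries by stored CRC32 and uncompressed size; `zip_duplicates` is a hypothetical name, and pairing the CRC with the size is an added precaution, since a 32 bit checksum alone can collide.

```python
import zipfile
from collections import defaultdict


def zip_duplicates(archive_path):
    """Group ZIP entries sharing the same stored CRC32 and uncompressed size."""
    groups = defaultdict(list)
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            groups[(info.CRC, info.file_size)].append(info.filename)
    # Keep only the groups that contain two or more entries.
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Entries reported together are likely duplicates; extracting and comparing them remains the only way to be certain.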
Find similar images
When browsing a filesystem, PeaZip file manager can display image thumbnails to help deduplication: in the context menu, under Organize, check "show picture thumbnails", or select a file browser preset style that shows thumbnails.
While checksum/hash based inspection finds only exactly identical files (and images), thumbnails let the user visually detect similar images (e.g. the same picture or graphic saved in different formats, with different color depth or compression settings, or scaled to different sizes), helping to decide whether the (pseudo) duplication is acceptable, and which copy (or version) to keep or delete.
As a rule of thumb for deleting extra versions, the best quality image (higher resolution, lower compression, or possibly a lossless format such as RAW, BMP, TIFF, or PNG) should be kept, and lower quality copies discarded: once lost, information/quality cannot be recreated.
Compare multiple checksum and hash values at once
"Check files" launches a separate duplicate finder utility, available from the "File tools" submenu (context menu) or the "Test" button dropdown, which can verify multiple hash and checksum algorithms on multiple files at once.
Employing multiple functions, and relying on cryptographically strong hash algorithms such as RIPEMD, SHA-2, and Whirlpool, can reveal even malicious attempts to forge identical-looking files, detecting differences that would go undetected by weaker algorithms, for which collisions are easier to find.
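The idea of checking several functions at once can be sketched as a single pass over each file that feeds every hasher in turn. This is a minimal illustration, not the "Check files" utility itself: `multi_digest` and `same_content` are hypothetical names, and the default algorithm list is an arbitrary choice among those always present in Python's `hashlib`.

```python
import hashlib

DEFAULT_ALGORITHMS = ("sha256", "sha512", "blake2b")


def multi_digest(path, algorithms=DEFAULT_ALGORITHMS, chunk_size=1 << 20):
    """Compute several digests of one file while reading it only once."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def same_content(path_a, path_b, algorithms=DEFAULT_ALGORITHMS):
    """Report two files as matching only if every selected digest agrees."""
    return multi_digest(path_a, algorithms) == multi_digest(path_b, algorithms)
```

Requiring agreement on several independent functions means a forged file would need to collide under all of them simultaneously, which is much harder than defeating one weak algorithm.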
Byte-to-byte comparison (alternative deduplication method)
The "Compare files" utility in the "File tools" submenu performs a byte-to-byte comparison between two files. Unlike the checksum/hash method, it is not subject to collisions under any circumstance, and it can report exactly which bytes differ: it not only tells whether two files are identical, but also what changes were made to the content between the two versions.
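A byte-to-byte comparison that also reports where the files differ might look like the following Python sketch. It is an assumed, simplified illustration (the function name `compare_files`, the `max_diffs` cap and the use of `None` for a byte missing past the end of the shorter file are all choices made for this example), not the behavior of the "Compare files" utility.

```python
def compare_files(path_a, path_b, max_diffs=10, chunk_size=1 << 16):
    """Compare two files byte by byte.

    Returns (identical, diffs), where diffs lists up to max_diffs tuples of
    (offset, byte_in_a, byte_in_b); None marks a byte past the end of a file.
    """
    diffs = []
    identical = True
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if not chunk_a and not chunk_b:
                break  # both files fully read
            if chunk_a != chunk_b:
                identical = False
                for i in range(max(len(chunk_a), len(chunk_b))):
                    byte_a = chunk_a[i] if i < len(chunk_a) else None
                    byte_b = chunk_b[i] if i < len(chunk_b) else None
                    if byte_a != byte_b and len(diffs) < max_diffs:
                        diffs.append((offset + i, byte_a, byte_b))
            offset += max(len(chunk_a), len(chunk_b))
    return identical, diffs
```

Because the result depends on the actual bytes rather than on a digest, two files reported as identical here are identical, at the cost of reading both files in full.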
Read more: checksum and hash function definitions on Wikipedia.
Synopsis: Detect duplicate files with PeaZip file manager. Search for identical content. How to compare multiple CRC, MD5, SHA hash and checksum values at once. Free software to find and remove (deduplicate) redundant files.
Topics: find duplicate files, detect duplicate content by hash
|