Using relpipe-in-filesystem
we can gather various file attributes
– basic (name, size, type, …), extended (xattrs, e.g. the original URL), metadata embedded in files (JPEG Exif, PNG, PDF etc.), XPath values from XML, JAR/ZIP metadata…
– or compute hashes of the file content (SHA-256, SHA-512 etc.).
This example shows how to compute various file content hashes and how to do it efficiently on a machine with multiple CPU cores.
Background:
Contemporary storage (especially an SSD or even RAM) is usually fast enough that the bottleneck is the CPU, not the storage.
This means that computing hashes of multiple files sequentially takes much longer than it needs to,
so it is better to compute the hashes in parallel and utilize multiple cores of our CPU.
On the other hand, we are collecting several file attributes and working with structured data, which means that we have to preserve the structure and, in the end, merge all the pieces together without corrupting it.
This is a perfect task for Relational pipes and especially relpipe-in-filesystem,
which is the first tool in our collection that implements streamlets and parallel processing.
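The gain from parallel hashing can be sketched with plain coreutils alone – a minimal illustration of the principle, not relpipe itself (the temporary directory and file names are made up for the demo); xargs -P runs several sha256sum processes at once:

```shell
#!/bin/bash
# Illustration only: parallel hashing with plain coreutils (xargs -P).
# relpipe-in-filesystem applies the same idea via its --parallel option,
# while additionally keeping the results as structured relational data.
dir=$(mktemp -d)
printf 'hello' > "$dir/a.txt"
printf 'hello' > "$dir/b.txt"

# Run up to 4 sha256sum processes concurrently, one file per process:
hashes=$(find "$dir" -type f -print0 | xargs -0 -P 4 -n 1 sha256sum)
printf '%s\n' "$hashes"

rm -r "$dir"
```

Both demo files have the same content, so both lines show the same hash – exactly the kind of duplicity the pipeline below reports in the same_hash_count column.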
The following script prints a list of files in our /bin
directory together with their SHA-256 hashes and also tells us how many files with identical content (i.e. exactly the same bytes) we have:
#!/bin/bash

findFiles() {
	find /bin/ -print0;
}

fetchAttributes() {
	relpipe-in-filesystem \
		--parallel 4 \
		--file path \
		--file type \
		--file size \
		--streamlet hash;
}

aggregate() {
	relpipe-tr-sql \
		--relation "file_hashes" \
			"SELECT
				path,
				type,
				size,
				sha256,
				count(*) OVER (PARTITION BY sha256) AS same_hash_count
			FROM filesystem
			ORDER BY same_hash_count, sha256, path, type";
}

findFiles | fetchAttributes | aggregate | relpipe-out-tabular
The output looks like this:
file_hashes:
╭─────────────────────────────────────────┬───────────────┬────────────────┬──────────────────────────────────────────────────────────────────┬──────────────────────────╮
│ path (string) │ type (string) │ size (integer) │ sha256 (string) │ same_hash_count (string) │
├─────────────────────────────────────────┼───────────────┼────────────────┼──────────────────────────────────────────────────────────────────┼──────────────────────────┤
│ /bin/expiry │ f │ 31000 │ 006c97d68fbddf175f326e554693ceaea984d6406bb5f837f1a00a7c6008218d │ 1 │
│ /bin/mapscrn │ f │ 27216 │ 00941d8eb6dc9ddf4b7d0651bd21ea1df6e325259f1d3ba9f7916d1e29ec5977 │ 1 │
│ /bin/stdbuf │ f │ 51904 │ 00a5270c7b0262754886e4d26ebc1a5a03911c46fa3c02e2b8d2b346be1f924a │ 1 │
│ /bin/ps2ps2 │ f │ 669 │ 00d9eb918871124f72c14404158d08db63c24c38a9f426fbc0a556b4d7febab2 │ 1 │
│ /bin/kernel-install │ f │ 4639 │ 00e85383894393a0cf3a851839a57eb96056788bea2553c8c166fc4b814daa55 │ 1 │
│ /bin/ionice │ f │ 30800 │ 020a4770df648af0e608425a1dba3df35a14dad7bb4d3f17dde3e3142a35f820 │ 1 │
│ /bin/dh_python2 │ f │ 1056 │ 02d870b729b8c14e0fdf287a3dbfc161570d04ab75c242ff368801eaeb4dd742 │ 1 │
│ /bin/lavadecode │ f │ 18760 │ 03a751439b0be2b65827c0e54fd569dbc0cd6dc6fd561dc8afdd7df04bb0414c │ 1 │
│ /bin/libnetcfg │ f │ 15775 │ 03ea004e8921626bdfecbc5d4b200fca2185da59ce4b4bd5407109064525defa │ 1 │
│ /bin/opldecode │ f │ 18752 │ 040517423bce47a55d1b6ef6b8232226fe0dce90039447e2a1ab4e0838162128 │ 1 │
│ /bin/openssl │ f │ 736776 │ 04997b88144b719a6e71b5e206d2c8b067dd827f0bdcad1c0e5e7a395bcf54f0 │ 1 │
│ … │ … │ … │ … │ … │
│ /bin/i386 │ l │ 22880 │ b704c7eae64ebde4d9a989aa35e1573a310241e16266596dc049513fbfdc1bf3 │ 5 │
│ /bin/linux32 │ l │ 22880 │ b704c7eae64ebde4d9a989aa35e1573a310241e16266596dc049513fbfdc1bf3 │ 5 │
│ /bin/linux64 │ l │ 22880 │ b704c7eae64ebde4d9a989aa35e1573a310241e16266596dc049513fbfdc1bf3 │ 5 │
│ /bin/setarch │ f │ 22880 │ b704c7eae64ebde4d9a989aa35e1573a310241e16266596dc049513fbfdc1bf3 │ 5 │
│ /bin/x86_64 │ l │ 22880 │ b704c7eae64ebde4d9a989aa35e1573a310241e16266596dc049513fbfdc1bf3 │ 5 │
│ /bin/cc │ l │ 1100664 │ ee2ce583149c835d967a612e3dce22a4b46ca660c4b41783ac9a1cbcba202e0d │ 5 │
│ /bin/gcc │ l │ 1100664 │ ee2ce583149c835d967a612e3dce22a4b46ca660c4b41783ac9a1cbcba202e0d │ 5 │
│ /bin/gcc-8 │ l │ 1100664 │ ee2ce583149c835d967a612e3dce22a4b46ca660c4b41783ac9a1cbcba202e0d │ 5 │
│ /bin/x86_64-linux-gnu-gcc │ l │ 1100664 │ ee2ce583149c835d967a612e3dce22a4b46ca660c4b41783ac9a1cbcba202e0d │ 5 │
│ /bin/x86_64-linux-gnu-gcc-8 │ f │ 1100664 │ ee2ce583149c835d967a612e3dce22a4b46ca660c4b41783ac9a1cbcba202e0d │ 5 │
│ /bin/lzcat │ l │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
│ /bin/lzma │ l │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
│ /bin/unlzma │ l │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
│ /bin/unxz │ l │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
│ /bin/xz │ f │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
│ /bin/xzcat │ l │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
│ /bin/lzegrep │ l │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
│ /bin/lzfgrep │ l │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
│ /bin/lzgrep │ l │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
│ /bin/xzegrep │ l │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
│ /bin/xzfgrep │ l │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
│ /bin/xzgrep │ f │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
╰─────────────────────────────────────────┴───────────────┴────────────────┴──────────────────────────────────────────────────────────────────┴──────────────────────────╯
Record count: 1001
This pipeline consists of four steps:
findFiles
– prepares the list of files separated by the \0
byte;
we can also do some basic filtering here
fetchAttributes
– does the heavy work – computes the SHA-256 hash of each file;
thanks to the --parallel N
option, it utilizes N cores of our CPU;
we can experiment with the value of N and watch how the total time decreases
aggregate
– uses SQL to order the records and an SQL window function to show how many files have the same content;
in this step we could also use relpipe-tr-awk
or relpipe-tr-scheme
if we prefer AWK or Scheme to SQL
relpipe-out-tabular
– formats the results as a table in the terminal (we could use e.g. relpipe-out-gui
to call a GUI viewer, or format the results as XML, CSV or another format)
In the case of the /bin
directory, the results are not so exciting – we see that the files with the same content are just symlinks to the same binary.
But we can run this pipeline on a different directory and discover real duplicates that occupy precious space on our hard drives,
or we can build an index for fast searching (even of offline media) and for checking whether we have a file with a given content or not.
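For comparison, a quick-and-dirty duplicate check is also possible with plain coreutils – a sketch with made-up demo files, using GNU uniq's -w and --all-repeated options to group by hash – although unlike the pipeline above it yields flat text rather than structured data that could be queried further:

```shell
#!/bin/bash
# Illustration only: print files whose SHA-256 hash occurs more than once.
dir=$(mktemp -d)
printf 'same'  > "$dir/one"
printf 'same'  > "$dir/two"
printf 'other' > "$dir/three"

# sha256sum prints "<64 hex chars>  <path>"; sort + uniq -w64 compare only
# the hash prefix and --all-repeated keeps every member of a duplicate group:
dupes=$(find "$dir" -type f -print0 | xargs -0 sha256sum | sort | uniq -w64 --all-repeated)
printf '%s\n' "$dupes"

rm -r "$dir"
```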
The following script shows how we can compute hashes using multiple algorithms:
#!/bin/bash

findFiles() {
	find /bin/ -print0;
}

fetchAttributes1() {
	relpipe-in-filesystem \
		--parallel 4 \
		--file path \
		--file type \
		--file size \
		--streamlet hash \
			--option attribute md5 \
			--option attribute sha1;
}

fetchAttributes2() {
	relpipe-in-filesystem \
		--parallel 4 \
		--file path \
		--file type \
		--file size \
		--streamlet hash \
			--option attribute md5 \
		--streamlet hash \
			--option attribute sha1;
}

findFiles | fetchAttributes2 | relpipe-out-tabular
There are two variants:
In fetchAttributes1
we compute the MD5 hash and then the SHA-1 hash for each record (file), and we have parallelism (--parallel 4
) over records.
In fetchAttributes2
we compute the MD5 and SHA-1 hashes in parallel for each record (file), and we also have parallelism (--parallel 4
) over records.
This is the common way streamlets work:
If we ask a single streamlet instance to compute multiple attributes, it is usually done sequentially (this depends on the particular streamlet implementation).
But if we create multiple instances of a streamlet, we automatically get multiple processes that work in parallel on each record.
The advantage of this kind of parallelism is that we can utilize multiple CPU cores even with one or a few records.
The disadvantage is that if there is some common initialization phase (like parsing an XML file or another format), this work is duplicated in each process.
It is up to the user to choose the optimal (or good enough) way – there is no automagic mechanism.
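The two setups can be mimicked with ordinary shell processes – a rough sketch using coreutils md5sum/sha1sum as stand-ins for the hash streamlet (the temporary file is made up for the demo):

```shell
#!/bin/bash
# Rough analogy of the two streamlet setups, using plain processes.
f=$(mktemp)
printf 'data' > "$f"

# fetchAttributes1-style: one worker computes both attributes sequentially.
md5=$(md5sum "$f" | cut -d' ' -f1)
sha1=$(sha1sum "$f" | cut -d' ' -f1)

# fetchAttributes2-style: two workers handle the same record concurrently;
# "wait" is the merge point where the partial results come back together.
md5sum  "$f" > "$f.md5" &
sha1sum "$f" > "$f.sha1" &
wait
pmd5=$(cut -d' ' -f1 "$f.md5")
psha1=$(cut -d' ' -f1 "$f.sha1")

rm -f "$f" "$f.md5" "$f.sha1"
```

Note the trade-off mentioned above: the second variant starts (and initializes) two processes even though each of them reads the same input.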
Relational pipes, open standard and free software © 2018-2022 GlobalCode