CSV (RFC 4180) is quite good solution when we want to store or share relational data in a simple text format – both, human-readable and well supported by many existing applications and libraries. We have even ready-to-use GUI editors, so called spreadsheets e.g. LibreOffice Calc. (on the other hand, such simple formats have usually some drawbacks…)
In this example, we will show how to query a set of CSV files like it was a relational database.
Suppose we have a CSV file describing our network interfaces:
address,name
00:00:00:00:00:00,lo
00:D0:D8:00:26:00,eth0
00:01:02:01:33:70,eth1
and another CSV file with IP addresses assigned to them:
address,mask,version,interface
127.0.0.1,8,4,lo
::1,128,6,lo
192.168.1.2,24,4,eth0
192.168.1.8,24,4,eth0
10.21.12.24,24,4,eth0
75.748.86.91,95,4,eth1
23.75.345.200,95,4,eth1
2a01:430:2e::cafe:babe,64,6,eth1
Simplest task is to parse the file and print it as a table in our terminal or convert it to another format (XML, Recfile, ODS, YAML, XHTML, ASN.1 etc.)
We can also add relpipe-tr-sql
in the middle of our pipeline and run some SQL queries –
transform data on-the-fly and send the query result to the relpipe-out-tabular
(or other output filter) in place of the original data.
For now, we will filter just the IPv6 addresses:
cat ip.csv \
| relpipe-in-csv --relation 'ip' \
| relpipe-tr-sql \
--relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \
| relpipe-out-tabular
and get them printed:
ipv6:
╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮
│ address (string) │ mask (string) │ version (string) │ interface (string) │
├────────────────────────┼───────────────┼──────────────────┼────────────────────┤
│ ::1 │ 128 │ 6 │ lo │
│ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │
╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯
Record count: 2
It is alo possible to run several queries at once and thanks to the Relational pipes format, the result sets are not mixed together, their boundaries are retained and everything is safely passed to the next stage of the pipeline:
cat ip.csv \
| relpipe-in-csv --relation 'ip' \
| relpipe-tr-sql \
--relation 'ipv4' "SELECT * FROM ip WHERE version = 4" \
--relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \
| relpipe-out-tabular
resulting in two nice tables:
ipv4:
╭──────────────────┬───────────────┬──────────────────┬────────────────────╮
│ address (string) │ mask (string) │ version (string) │ interface (string) │
├──────────────────┼───────────────┼──────────────────┼────────────────────┤
│ 127.0.0.1 │ 8 │ 4 │ lo │
│ 192.168.1.2 │ 24 │ 4 │ eth0 │
│ 192.168.1.8 │ 24 │ 4 │ eth0 │
│ 10.21.12.24 │ 24 │ 4 │ eth0 │
│ 75.748.86.91 │ 95 │ 4 │ eth1 │
│ 23.75.345.200 │ 95 │ 4 │ eth1 │
╰──────────────────┴───────────────┴──────────────────┴────────────────────╯
Record count: 6
ipv6:
╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮
│ address (string) │ mask (string) │ version (string) │ interface (string) │
├────────────────────────┼───────────────┼──────────────────┼────────────────────┤
│ ::1 │ 128 │ 6 │ lo │
│ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │
╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯
Record count: 2
When "4"
and "6"
are not fixed values, we should not glue them to the query string like version = $version
,
because it is a dangerous practice that may lead to SQL injection.
We have parametrized queries for such tasks:
--relation 'ipv6' "SELECT * FROM ip WHERE version = ?" --parameter "6"
To load multiple CSV files into our in-memory database, we just concatenate the relational streams using the means of our shell – the semicolons and parenthesis:
(relpipe-in-csv --relation 'ip' < ip.csv; relpipe-in-csv --relation 'nic' < nic.csv) \
| relpipe-tr-sql \
--relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \
| relpipe-out-tabular
Generic version that loads all *.csv
files:
for csv in *.csv; do relpipe-in-csv --relation "$(basename "$csv" .csv)" < "$csv"; done \
| relpipe-tr-sql \
--relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \
| relpipe-out-tabular
Then we can JOIN data from multiple CSV files or do UNIONs, INTERSECTions etc.
ip_nic:
╭────────────────────────┬───────────────┬──────────────────┬────────────────────┬───────────────────┬───────────────╮
│ address (string) │ mask (string) │ version (string) │ interface (string) │ address (string) │ name (string) │
├────────────────────────┼───────────────┼──────────────────┼────────────────────┼───────────────────┼───────────────┤
│ 127.0.0.1 │ 8 │ 4 │ lo │ 00:00:00:00:00:00 │ lo │
│ ::1 │ 128 │ 6 │ lo │ 00:00:00:00:00:00 │ lo │
│ 192.168.1.2 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │
│ 192.168.1.8 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │
│ 10.21.12.24 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │
│ 75.748.86.91 │ 95 │ 4 │ eth1 │ 00:01:02:01:33:70 │ eth1 │
│ 23.75.345.200 │ 95 │ 4 │ eth1 │ 00:01:02:01:33:70 │ eth1 │
│ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │ 00:01:02:01:33:70 │ eth1 │
╰────────────────────────┴───────────────┴──────────────────┴────────────────────┴───────────────────┴───────────────╯
Record count: 8
Good practice is to wrap common code blocks into functions and thus make them reusable. In shell, the function still works with input and output streams and we can use them when building our pipelines. Shell functions can be seen as named reusable parts of a pipeline.
csv2relation() { for file; do relpipe-in-csv --relation "$(basename "$file" .csv)" < "$file"; done }
do_query() { relpipe-tr-sql --relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface"; }
format_result() { [[ -t 1 ]] && relpipe-out-tabular || cat; }
csv2relation *.csv | do_query | format_result
The format_result()
function checks whether the STDOUT is a terminal or not.
and when printing to the terminal, it generates a table.
When writing to a regular file or STDIN of another process, it passes through original relational data.
Thus ./our-script.sh
will print a nice table in the terminal, while ./our-script.sh > data.rp
will create a file containing machine-readable data
and ./our-script.sh | relpipe-out-xhtml > report.xhtml
will create an XHTML report and ./our-script.sh | relpipe-out-gui
will show a GUI window full of tables and maybe also charts.
SELECT
nic.name || ' IPv' || ip.version AS label,
nic.name AS interface,
ip.version AS ip_version,
count(*) AS address_count
FROM nic
LEFT JOIN ip ON (ip.interface = nic.name)
GROUP BY nic.name, ip.version
ORDER BY count(*) DESC
Shell scripts are not the only way to structure and organize our pipelines or generally our data-processing code. We can also use Make (the tool intended mainly for building sofware), write a Makefile and organize our code around some temporary files and other targets instead of functions.
all: print_summary
.PHONY: clean print_summary run_commands
clean:
rm -rf *.rp
%.rp: %.csv
relpipe-in-csv --relation "$(basename $(<))" < $(<) > $(@)
define SQL_IP_NIC
SELECT
ip.address AS ip_address,
nic.name AS interface,
nic.address AS mac_address
FROM ip
JOIN nic ON (nic.name = ip.interface)
endef
export SQL_IP_NIC
define SQL_COUNT_VERSIONS
SELECT
interface,
count(CASE WHEN version=4 THEN 1 ELSE NULL END) AS ipv4_count,
count(CASE WHEN version=6 THEN 1 ELSE NULL END) AS ipv6_count
FROM ip
GROUP BY interface
ORDER BY interface
endef
export SQL_COUNT_VERSIONS
# Longer SQL queries are better kept in separate .sql files,
# because we can enjoy syntax highlighting and other support in our editors.
# Then we use it like this: --relation "ip_nic" "$$(cat ip_nic.sql)"
summary.rp: nic.rp ip.rp
cat $(^) \
| relpipe-tr-sql \
--relation "ip_nic" "$$SQL_IP_NIC" \
--relation "counts" "$$SQL_COUNT_VERSIONS" \
> $(@)
print_summary: summary.rp
cat $(<) | relpipe-out-tabular
We can even combine advantages of Make and Bash together (without calling or including Bash scripts from Make) and have reusable shell functions available in the Makefile:
SHELL=bash
BASH_FUNC_read_nullbyte%%=() { local IFS=; for v in "$$@"; do export "$$v"; read -r -d '' "$$v"; done }
export BASH_FUNC_read_nullbyte%%
usage example:
run_commands: summary.rp
cat $(<) \
| relpipe-tr-cut --relation 'ip_nic' --invert-match relation true \
| relpipe-out-nullbyte \
| while read_nullbyte ip_address interface mac_address; do\
echo "network interface $$interface ($$mac_address) has IP address $$ip_address"; \
done;
Both approaches – the shell script and the Makefile – have pros and cons. With Makefile, we usually create some temporary files containing intermediate results. That avoids streaming. But on the other hand, we process (parse, transform, filter, format etc.) only data that have been changed.
Relational pipes, open standard and free software © 2018-2022 GlobalCode