CSV (RFC 4180) is quite good solution when we want to store or share relational data in a simple text format – both, human-readable and well supported by many existing applications and libraries. We have even ready-to-use GUI editors, so called spreadsheets (e.g. LibreOffice Calc). However, such simple formats have usually some drawbacks. CSV may contain only a single relation (table, sheet). This is not a big issue – we can use several files. A more serious problem is the absence of data types – in CSV, everything is just a text string. Thus it was impossible to have loss-less conversion to CSV and back.
$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-tabular
filesystem:
╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
│ path (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
│ license/ │ d │ 0 │ hacker │ hacker │
│ license/gpl.txt │ f │ 35147 │ hacker │ hacker │
╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2
Data types are missing in CSV by default:
$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv
"path","type","size","owner","group"
"license/","d","0","hacker","hacker"
"license/gpl.txt","f","35147","hacker","hacker"
The size
attribute was integer and now it is mere string:
$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-out-tabular
csv:
╭─────────────────┬───────────────┬───────────────┬────────────────┬────────────────╮
│ path (string) │ type (string) │ size (string) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼───────────────┼────────────────┼────────────────┤
│ license/ │ d │ 0 │ hacker │ hacker │
│ license/gpl.txt │ f │ 35147 │ hacker │ hacker │
╰─────────────────┴───────────────┴───────────────┴────────────────┴────────────────╯
Record count: 2
Since Relational pipes v0.18 we can encode the data types (currently strings, integers and booleans) in the CSV header and then recover them while reading. Such „CSV with data types“ is valid CSV according to the RFC specification and can be viewed or edited in any CSV-capable software.
The attribute name and data type are separated by the ::
symbol e.g. name::string,age::integer,member::boolean
.
Attribute names may contain ::
(unlike the data type names).
The data type declarations may be added simply by hand or automatically using relpipe-out-csv
.
$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true
"path::string","type::string","size::integer","owner::string","group::string"
"license/","d","0","hacker","hacker"
"license/gpl.txt","f","35147","hacker","hacker"
The relpipe-out-csv
+ relpipe-in-csv
round-trip now does not degrade the data quality:
$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true | relpipe-in-csv | relpipe-out-tabular
csv:
╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
│ path (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
│ license/ │ d │ 0 │ hacker │ hacker │
│ license/gpl.txt │ f │ 35147 │ hacker │ hacker │
╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2
So we can put e.g. a CSV editor between them while storing and versioning the data in a different format (like XML or Recfile).
Such workflow can be effectively managed by make
–
make edit
will convert versioned data to CSV and launch the editor,
make commit
will convert data back from the CSV and commit them in Mercurial, Git or other version control system (VCS).
Why put into VCS data in different format than CSV? Formats like XML or Recfile may have each attribute on a separate line which leads to more readable diffs. At a glance we can see which attributes have been changed. While in CSV we see just a changed long line and even with a better tools we need to count the comas to know which attribute it was.
The relpipe-out-csv
tool generates data types only when explicitly asked for: --write-types true
.
The relpipe-in-csv
tool automatically looks for these type declarations
and if all attributes have valid type declarations, they are used, otherwise they are considered to be a part of the attribute name.
This behavior can be disabled by --read-types false
(true
will require valid type declarations).
Sometimes we may also want to infer data types from the values automatically without any explicit declaration.
Then we put the relpipe-tr-infertypes
tool in our pipeline.
It buffers whole relations and checks all values of each attribute.
If they contain all integers or all booleans they are converted to given type.
$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-tr-infertypes | relpipe-out-tabular
csv:
╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
│ path (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
│ license/ │ d │ 0 │ hacker │ hacker │
│ license/gpl.txt │ f │ 35147 │ hacker │ hacker │
╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2
This approach is inefficient and contradicts streaming, however it is sometimes useful and convenient for small data coming from external sources.
We can e.g. download some data set from network and pipe it through relpipe-in-csv
+ relpipe-tr-infertypes
and improve the data quality a bit.
We may apply the type inference only on certain relations: --relation "my_relation"
or chose different mode: --mode data
or metadata
or auto
.
The data
mode is described above.
In the metadata
mode the relpipe-tr-infertypes
works similar to relpipe-in-csv --read-types true
.
The auto
mode checks for the metadata in attribute names first and if not found, it fallbacks to the data
mode.
This tool works with any relational data regardless their original format or source (not only with CSV).
Some CSV files contain just data – have no header line containing the column names.
Then we specify the attribute names and data types as CLI parameters of relpipe-in-csv
:
$ echo -e "a,b,c\nA,B,C" \
| relpipe-in-csv \
--relation 'just_data' \
--attribute 'x' string \
--attribute 'y' string \
--attribute 'z' string \
| relpipe-out-tabular
just_data:
╭────────────┬────────────┬────────────╮
│ x (string) │ y (string) │ z (string) │
├────────────┼────────────┼────────────┤
│ a │ b │ c │
│ A │ B │ C │
╰────────────┴────────────┴────────────╯
Record count: 2
We may also skip existing header line: tail -n +2
and force our own names and types.
However this will not work if there are multiline values in the header – which is not common –
in such cases we should use some relpipe-tr-*
tool to rewrite the names or types
(these tools work with relational data instead of plain text).
Relational pipes, open standard and free software © 2018-2022 GlobalCode