Data come in different shapes and formats. We can distinguish several main logical models: relational, tree and graph (a tree is a connected undirected graph with no cycles). Arbitrary trees or even graphs are more flexible, but they are also harder to comprehend and work with. The relational model is somewhat limited and easier to grasp, yet still flexible enough to describe almost anything (actually, it can describe anything; it is just a question of how nice and native it should look). Unsurprisingly, Relational pipes are built around the relational model. However, sometimes we have to interact with the tree/graph world and deal with data that have a shape other than relational. So we need to bridge the gap between trees/graphs and relations.
While we have just a few logical models, there is an abundance of serialization formats, i.e. mappings of a given logical model to a sequence of octets (bytes). Relations might be serialized as CSV, ODS, tables in a database, Recfiles etc. Trees might be serialized as XML, YAML, ASN.1, CBOR, JSON etc.
Why reinvent the wheel and repeat the same work for each format?
We already have reusable code for relational data – this is given by the design of Relational pipes, because it separates: inputs, transformations and outputs. Once the data (e.g. CSV) passes through the input filter, it becomes relational data and can be processed in a uniform way by any transformation(s) or output filter.
But what about the tree data? We have created a set of tools (input filters) that support various serialization formats. As of v0.18, these are:
relpipe-in-xmltable
relpipe-in-asn1table
relpipe-in-cbortable
relpipe-in-htmltable
relpipe-in-initable
relpipe-in-mimetable
relpipe-in-yamltable
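These tools follow the same design principle and offer the same user interface. So once users learn one tool, they can apply this knowledge to the other formats as well. The principle is: an XPath expression (--records) selects the nodes that become the records of a relation, and further XPath expressions (--attribute), evaluated relative to each record node, yield the attribute values.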
This is nothing new, and experienced SQL users should already know where the inspiration comes from: the XMLTable() SQL function, which converts an XML tree to a result set (relation).
We just implemented the same functionality as a separate CLI tool, without dependency on any SQL engine and with support for not only XML but also for alternative serialization formats.
And for all of them, we use the same query language: XPath.
Although this sounds very XML-ish, we do not translate the alternative formats to XML markup. There is no text full of angle brackets and ampersands in the middle of the process. In our case, we should see XML not as a markup text (meta)format, but rather as an in-memory model: a generic tree of node objects stored in RAM that allows us to perform various tree operations (queries, modifications).
Flat key-value lists sooner or later become insufficient for software configuration, and it becomes necessary to somehow manage trees of configuration items (or relations, of course). YAML is quite a good tree-serialization format. It is used e.g. for configuring Java Spring applications or for Netplan network configuration in the Ubuntu GNU/Linux distribution:
network:
  version: 2
  ethernets:
    enp84s0:
      dhcp4: false
      dhcp6: false
      accept-ra: false
  bridges:
    br0:
      interfaces: [enp84s0, eth0]
      addresses:
        - 192.168.1.101/24
      gateway4: 192.168.1.1
      nameservers:
        addresses:
          - 192.168.1.10
          - 192.168.1.11
      mtu: 1500
      parameters:
        stp: true
        forward-delay: 4
      dhcp4: false
      dhcp6: false
      accept-ra: false
We can use the following command to convert the tree to a set of relations:
#!/bin/bash

cat netplan.yaml \
	| relpipe-in-yamltable \
		--relation 'nic' \
			--records '/yaml/network/ethernets/*|/yaml/network/bridges/*' \
			--attribute 'name'      string  'name()' \
			--attribute 'type'      string  'name(..)' \
			--attribute 'dhcp4'     boolean 'dhcp4' \
			--attribute 'dhcp6'     boolean 'dhcp6' \
			--attribute 'accept-ra' boolean 'accept-ra' \
		--relation 'bridge' \
			--records '/yaml/network/bridges/*' \
			--attribute 'name'     string 'name()' \
			--attribute 'gateway4' string 'gateway4' \
			--attribute 'mtu'      string 'mtu' \
		--relation 'bridge_interface' \
			--records '/yaml/network/bridges/*/interfaces' \
			--attribute 'bridge'    string 'name(..)' \
			--attribute 'interface' string '.' \
		--relation 'ip' \
			--records '/yaml/network/bridges/*/addresses' \
			--attribute 'address'   string 'substring-before(., "/")' \
			--attribute 'mask'      string 'substring-after(., "/")' \
			--attribute 'interface' string 'name(..)' \
	| relpipe-out-tabular

# an early draft of the relational mapping
So we can do a full relational conversion of the original tree structure or extract just a few desired values (e.g. the gateway IP address). We can also pipe a relation to a shell loop and execute some command for each record (e.g. DNS server or IP address).
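Such a shell loop might look like the following sketch. It assumes an output filter that terminates every attribute value with a zero byte (we believe relpipe-out-nullbyte works this way, but please check its documentation). Here the stream is simulated by the hypothetical sample_records function, so the snippet runs even without Relational pipes installed; the second record is made up for illustration.

```shell
#!/bin/bash

# Hypothetical stand-in for: relpipe-in-yamltable … | relpipe-out-nullbyte
# Assumption: each attribute value is terminated by a zero byte.
sample_records() {
	printf '%s\0' \
		'192.168.1.101' '24' 'br0' \
		'192.168.1.102' '24' 'br1'   # made-up second record
}

# Reads three attributes per record (address, mask, interface)
# and executes a command (here just echo) for each record.
process_records() {
	local address mask interface
	while IFS= read -r -d '' address \
		&& IFS= read -r -d '' mask \
		&& IFS= read -r -d '' interface
	do
		echo "interface=$interface address=$address mask=$mask"
	done
}

sample_records | process_records
```

Because the values are delimited by zero bytes and not by whitespace or line ends, they may safely contain spaces or newlines.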
N.B. YAML is considered to be a superset of JSON, thus tools that can read YAML can also read JSON. In the current version (v0.18) of Relational pipes, relpipe-in-json and relpipe-in-jsontable are just symbolic links to their YAML counterparts.
There is also a similar example, Reading Libvirt XML files using XMLTable, where we build relations from an XML tree. The principles are the same for all input formats.
With relpipe-in-htmltable we can extract structured information from poor HTML pages. And unlike relpipe-in-xmltable, this tool does not require valid XML/XHTML, so it is good for the dirty work. Processing such invalid data is always a bit unreliable, but still better than nothing.
#!/bin/bash
HTML='
<p>Our company is focused on:
<ul>
<li>video game arcades
<li>laundry</li>
<LI>cigarette machines and trucking
<li>personal loans and politics
</ul>
<!-- TODO: add more GIFs -->
<P>Visit our <a href="index.htm">front page</a> and check the <A href="news.php">news</a>!!!
<!--
Yes, this HTML tagsoup is total mess.
But do you still remember the pure joy when you put your first website on the internet?
-->
<p>
download <a href="mp3.cgi?token=61686f6a-e1ab-4b92-c357-474e552f4c69" class="mp3">free MP3</a> now
</P>
<!--
Best viewed in Netscape Navigator and resolution 800×600
(c) Fvkgrra Pnaqyrf ltd. 1984-2022 -->
';

fetch_html() {
	# there might be a wget/curl call to download a fresh version of the web page
	echo "$HTML";
}

extract_relations() {
	relpipe-in-htmltable \
		--relation 'field_of_business' \
			--records '//li' \
			--attribute 'priority'   integer 'count(preceding::li)+1' \
			--attribute 'name'       string  '.' \
			--attribute 'normalized' string  'normalize-space(.)' \
		--relation 'hyperlink' \
			--records '//a' \
			--attribute 'url'   string '@href' \
			--attribute 'name'  string '.' \
			--attribute 'xpath' string '.' --mode xpath \
		--relation 'download_token' \
			--records '//a[@class="mp3"]' \
			--attribute 'value' string 'substring(@href, 15)' \
		--relation 'hidden_footer' \
			--records '//comment()[count(following::*)=0]' \
			--attribute 'text' string 'normalize-space(.)'
}

format_result() { [[ -t 1 ]] && relpipe-out-tabular || cat; }
# fetch_html | html2xml 1>&2
fetch_html | extract_relations | format_result
Although Mr. Ryszczyks is unable to create a valid document, this script will print:
field_of_business:
╭────────────────────┬──────────────────────────────────┬─────────────────────────────────╮
│ priority (integer) │ name                    (string) │ normalized             (string) │
├────────────────────┼──────────────────────────────────┼─────────────────────────────────┤
│                  1 │ video game arcades↲              │ video game arcades              │
│                  2 │ laundry                          │ laundry                         │
│                  3 │ cigarette machines and trucking↲ │ cigarette machines and trucking │
│                  4 │ personal loans and politics↲     │ personal loans and politics     │
╰────────────────────┴──────────────────────────────────┴─────────────────────────────────╯
Record count: 4
hyperlink:
╭────────────────────────────────────────────────────┬───────────────┬──────────────────────╮
│ url                                       (string) │ name (string) │ xpath       (string) │
├────────────────────────────────────────────────────┼───────────────┼──────────────────────┤
│ index.htm                                          │ front page    │ /html/body/p[2]/a[1] │
│ news.php                                           │ news          │ /html/body/p[2]/a[2] │
│ mp3.cgi?token=61686f6a-e1ab-4b92-c357-474e552f4c69 │ free MP3      │ /html/body/p[3]/a    │
╰────────────────────────────────────────────────────┴───────────────┴──────────────────────╯
Record count: 3
download_token:
╭──────────────────────────────────────╮
│ value                       (string) │
├──────────────────────────────────────┤
│ 61686f6a-e1ab-4b92-c357-474e552f4c69 │
╰──────────────────────────────────────╯
Record count: 1
hidden_footer:
╭─────────────────────────────────────────────────────────────────────────────────────────────╮
│ text                                                                               (string) │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│ Best viewed in Netscape Navigator and resolution 800×600 (c) Fvkgrra Pnaqyrf ltd. 1984-2022 │
╰─────────────────────────────────────────────────────────────────────────────────────────────╯
Record count: 1
And thanks to the terminal autodetection in the format_result() function, we can even pipe the result of this script to any relpipe-tr-* or relpipe-out-* and get machine-readable data instead of the ANSI-colored tables, so we can do some further processing or conversion to a different format (XHTML, GUI, ODS, Recfile etc.).
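The autodetection relies on the shell test -t 1, which succeeds when STDOUT is a terminal. The following sketch shows the same idea with an explicit if/else, which is slightly more defensive: in the A && B || C form, the cat branch would also run whenever the formatter fails. The pretty_print function is a hypothetical stand-in for relpipe-out-tabular, so the snippet runs without Relational pipes installed.

```shell
#!/bin/bash

# Hypothetical stand-in for a human-readable formatter such as relpipe-out-tabular.
pretty_print() { sed 's/^/| /'; }

format_result() {
	if [ -t 1 ]; then
		pretty_print   # STDOUT is a terminal: format the data for humans
	else
		cat            # STDOUT is a pipe or a file: pass the data through untouched
	fi
}

# STDOUT of format_result is a pipe here, so the data pass through unchanged:
echo 'raw data' | format_result | cat
```

When the same function is the last command of a pipeline started from an interactive terminal, the pretty_print branch is taken instead.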
2xml helper script: yaml2xml, json2xml, asn12xml, mime2xml etc.
Mapping from the original syntax to the tree structure is usually quite intuitive and straightforward.
However, sometimes it is useful to see the XML serialization of this in-memory model.
In the relpipe-in-xmltable.cpp repository we have a helper script called 2xml. This script is not intended to be called directly; instead, the user should create a symlink, e.g. ini2xml, yaml2xml, asn12xml etc. The 2xml script chooses the right input filter according to the symlink name and uses it for conversion from the source tree-serialization format to the XML tree-serialization format.
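Dispatching on the symlink name can be sketched as follows. This is not the code of the real 2xml script, just a minimal illustration: the filter_for function is hypothetical, and it assumes that every input filter follows the relpipe-in-<format>table naming scheme seen in the tool list above. A real script would pass "$0" instead of the literal paths used here.

```shell
#!/bin/bash

# Maps an invocation name such as "yaml2xml" to the name of an input filter.
# Hypothetical helper; assumes the relpipe-in-<format>table naming scheme.
filter_for() {
	local name format
	name="$(basename "$1")"
	case "$name" in
		*2xml) format="${name%2xml}" ;;   # strip the "2xml" suffix
		*) echo "unsupported name: $name" >&2; return 1 ;;
	esac
	if [ -z "$format" ]; then
		echo "call this script through a symlink, e.g. yaml2xml" >&2
		return 1
	fi
	echo "relpipe-in-${format}table"
}

filter_for /usr/local/bin/yaml2xml   # prints: relpipe-in-yamltable
filter_for ./asn12xml                # prints: relpipe-in-asn1table
```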
If we want to do the same thing without the helper script, it is quite simple. We use the appropriate relpipe-in-*table tool and extract a single relation with a single attribute and a single record. The --records expression is '/', i.e. the root node. The --attribute expression is '.', i.e. still the root node. And then we just add --mode raw-xml to this attribute, so we get the XML serialization of the given node (root) instead of its text content.
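Put together, the command might look like the sketch below. The options come from the description above; the relation and attribute names ('document', 'xml') are arbitrary, and the way the single value is printed (relpipe-out-nullbyte followed by tr) is our assumption, not necessarily what the 2xml script does. If the tools are not installed, the snippet only prints the command it would run.

```shell
#!/bin/bash

# A single relation with a single record (the root node) and a single
# attribute holding the XML serialization of that node.
yaml_to_xml=(
	relpipe-in-yamltable
	--relation 'document'
	--records '/'
	--attribute 'xml' string '.' --mode raw-xml
)

if command -v relpipe-in-yamltable >/dev/null 2>&1 \
	&& command -v relpipe-out-nullbyte >/dev/null 2>&1
then
	# Assumption: relpipe-out-nullbyte prints the bare values terminated
	# by zero bytes, which we strip to get plain XML text.
	printf 'a: 1\nb: [x, y]\n' \
		| "${yaml_to_xml[@]}" \
		| relpipe-out-nullbyte \
		| tr -d '\0'
else
	echo "relpipe tools not found; the command would be: ${yaml_to_xml[*]}"
fi
```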
In addition to this, the 2xml script also does formatting/indentation and syntax highlighting, if the respective tools (xmllint and pygmentize) are available and STDOUT is a terminal.
This script is useful when writing the expressions for relpipe-in-*table, but also as a pipeline filter that allows us to use the whole XML ecosystem for other formats as well. We can read YAML, JSON, INI, MIME or even some binary formats, apply an XSLT transformation to such data and generate e.g. an XHTML report or a DocBook document, validate such structures using an XSD or Relax NG schema, or process such data using the XQuery functional language.
Relational pipes, open standard and free software © 2018-2022 GlobalCode