In Relational pipes v0.18 we got a new powerful language for filtering and transformations: XPath.
It is now part of the toolset consisting of SQL, AWK, Scheme and others.
However XPath is originally a language designed for XML, in Relational pipes we can use it for relational data coming from various sources, not only XML,
and also for data that violates the rules of normal forms.
We can process quite complex tree structures entangled in records but we can also write simple and intuitive expressions like x = "a" or y = 123
.
Let us have some CSV data:
"title::string","year::integer","director::string","screenwriter::string"
Beetlejuice,1988,Tim Burton,Warren Skaaren + Michael McDowell
Vacation,1983,Harold Ramis,John Hughes
Christmas Vacation,1989,Jeremiah S. Chechik,John Hughes
Celebrity,1998,Woody Allen,Woody Allen
Great Balls of Fire!,1989,Jim McBride,Jack Baran + Jim McBride
Flatliners,1990,Joel Schumacher,Peter Filardi
Heathers,1988,Michael Lehmann,Daniel Waters
Blue Velvet,1986,David Lynch,David Lynch
That look like this formatted as a table:
csv:
╭──────────────────────┬────────────────┬─────────────────────┬───────────────────────────────────╮
│ title (string) │ year (integer) │ director (string) │ screenwriter (string) │
├──────────────────────┼────────────────┼─────────────────────┼───────────────────────────────────┤
│ Beetlejuice │ 1988 │ Tim Burton │ Warren Skaaren + Michael McDowell │
│ Vacation │ 1983 │ Harold Ramis │ John Hughes │
│ Christmas Vacation │ 1989 │ Jeremiah S. Chechik │ John Hughes │
│ Celebrity │ 1998 │ Woody Allen │ Woody Allen │
│ Great Balls of Fire! │ 1989 │ Jim McBride │ Jack Baran + Jim McBride │
│ Flatliners │ 1990 │ Joel Schumacher │ Peter Filardi │
│ Heathers │ 1988 │ Michael Lehmann │ Daniel Waters │
│ Blue Velvet │ 1986 │ David Lynch │ David Lynch │
╰──────────────────────┴────────────────┴─────────────────────┴───────────────────────────────────╯
Record count: 8
Attributes of particular relations are available in XPath under their names, so we can directly reference them in our queries:
cat relpipe-data/examples/film-1.csv \
| relpipe-in-csv --relation "film" \
| relpipe-tr-xpath \
--relation '.*' \
--where 'year >= 1980 and year < 1990' \
| relpipe-out-tabular
filtered result:
csv:
╭──────────────────────┬────────────────┬─────────────────────┬───────────────────────────────────╮
│ title (string) │ year (integer) │ director (string) │ screenwriter (string) │
├──────────────────────┼────────────────┼─────────────────────┼───────────────────────────────────┤
│ Beetlejuice │ 1988 │ Tim Burton │ Warren Skaaren + Michael McDowell │
│ Vacation │ 1983 │ Harold Ramis │ John Hughes │
│ Christmas Vacation │ 1989 │ Jeremiah S. Chechik │ John Hughes │
│ Great Balls of Fire! │ 1989 │ Jim McBride │ Jack Baran + Jim McBride │
│ Heathers │ 1988 │ Michael Lehmann │ Daniel Waters │
│ Blue Velvet │ 1986 │ David Lynch │ David Lynch │
╰──────────────────────┴────────────────┴─────────────────────┴───────────────────────────────────╯
Record count: 6
n.b. If there were any characters that are not valid XML name, they would be escaped in the same way as relpipe-in-*table
commands do it
i.e. by adding underscores and unicode codepoints of given characters – e.g. the weird:field
attribute will be available as weird_3a_field
in XPath.
The CSV above is not a best example of data modeling. Actually, it is quite terrible. But in the real world, we often have to deal with such data – either work with them directly or give them some better shape before we start doing our job.
Usually the best way is to normalize the model – follow the rules of Normal forms.
In this case, we would break this denormalized CSV table into several relations:
film
, director
, screenwriter
…
or rather film
, role
, person
, film_person_role
…
But for now, we will keep the data denormalized and just give them a better and machine-readable structure instead of limited and ambiguous notation of screenwriter = name1 + name2
(that makes trouble when the value contains certain characters and requires writing a parser for never-specified syntax).
So, we will keep some data in classic relational attributes and some in nested XML structure.
This approach allows us to combine rigid attributes with free-form rich tree structures.
film:
╭──────────────────────┬────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ title (string) │ year (integer) │ metadata (string) │
├──────────────────────┼────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Beetlejuice │ 1988 │ <film><director>Tim Burton</director><screenwriter>Warren Skaaren</screenwriter><screenwriter>Michael McDowell</screenwriter></film> │
│ Vacation │ 1983 │ <film><director>Harold Ramis</director><screenwriter>John Hughes</screenwriter></film> │
│ Christmas Vacation │ 1989 │ <film><director>Jeremiah S. Chechik</director><screenwriter>John Hughes</screenwriter></film> │
│ Celebrity │ 1998 │ <film><director>Woody Allen</director><screenwriter>Woody Allen</screenwriter></film> │
│ Great Balls of Fire! │ 1989 │ <film><director>Jim McBride</director><screenwriter>Jack Baran</screenwriter><screenwriter>Jim McBride</screenwriter></film> │
│ Flatliners │ 1990 │ <film><director>Joel Schumacher</director><screenwriter>Peter Filardi</screenwriter></film> │
│ Heathers │ 1988 │ <film><director>Michael Lehmann</director><screenwriter>Daniel Waters</screenwriter></film> │
│ Blue Velvet │ 1986 │ <film><director>David Lynch</director><screenwriter>David Lynch</screenwriter></film> │
╰──────────────────────┴────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Record count: 8
The relpipe-tr-xpath
seamlessly integrates the schema-backed (year
) and schema-free (metadata/film
) parts of our data model.
We use the same language syntax and principles for both kinds of attributes:
cat relpipe-data/examples/film-2.csv \
| relpipe-in-csv --relation "film" \
| relpipe-tr-xpath \
--relation '.*' \
--xml-attribute 'metadata' \
--where 'year = 1986 or metadata/film/screenwriter = "John Hughes"' \
| relpipe-out-tabular
Filtered result:
film:
╭────────────────────┬────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────╮
│ title (string) │ year (integer) │ metadata (string) │
├────────────────────┼────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Vacation │ 1983 │ <film><director>Harold Ramis</director><screenwriter>John Hughes</screenwriter></film> │
│ Christmas Vacation │ 1989 │ <film><director>Jeremiah S. Chechik</director><screenwriter>John Hughes</screenwriter></film> │
│ Blue Velvet │ 1986 │ <film><director>David Lynch</director><screenwriter>David Lynch</screenwriter></film> │
╰────────────────────┴────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────╯
Record count: 3
n.b. In current version, we have to mark the attributes containing XML: --xml-attribute 'metadata'
.
In later versions, there will be a dedicated data type for XML, so these hints will not be necessary.
This way, we can work with free-form attributes containing multiple values or run various functions on them. We can e.g. list films that have more than one screenwriter:
--where 'count(metadata/film/screenwriter) > 1'
Well, well… here we are:
film:
╭──────────────────────┬────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ title (string) │ year (integer) │ metadata (string) │
├──────────────────────┼────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Beetlejuice │ 1988 │ <film><director>Tim Burton</director><screenwriter>Warren Skaaren</screenwriter><screenwriter>Michael McDowell</screenwriter></film> │
│ Great Balls of Fire! │ 1989 │ <film><director>Jim McBride</director><screenwriter>Jack Baran</screenwriter><screenwriter>Jim McBride</screenwriter></film> │
╰──────────────────────┴────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Record count: 2
We can also run XPath from SQL queries (relpipe-tr-sql
) e.g. in PostgreSQL.
The relpipe-tr-xpath
does not only restriction but also projection.
It can add, remove or modify the attributes while converting the input to the result set.
cat relpipe-data/examples/film-2.csv \
| relpipe-in-csv --relation "film" \
| relpipe-tr-xpath \
--relation '.*' \
--xml-attribute 'metadata' \
--output-attribute 'title' string 'title' \
--output-attribute 'director' string 'metadata/film/director' \
--output-attribute 'screenwriter_count' integer 'count(metadata/film/screenwriter)' \
| relpipe-out-tabular
We removed some attributes and created new ones:
film:
╭──────────────────────┬─────────────────────┬──────────────────────────────╮
│ title (string) │ director (string) │ screenwriter_count (integer) │
├──────────────────────┼─────────────────────┼──────────────────────────────┤
│ Beetlejuice │ Tim Burton │ 2 │
│ Vacation │ Harold Ramis │ 1 │
│ Christmas Vacation │ Jeremiah S. Chechik │ 1 │
│ Celebrity │ Woody Allen │ 1 │
│ Great Balls of Fire! │ Jim McBride │ 2 │
│ Flatliners │ Joel Schumacher │ 1 │
│ Heathers │ Michael Lehmann │ 1 │
│ Blue Velvet │ David Lynch │ 1 │
╰──────────────────────┴─────────────────────┴──────────────────────────────╯
Record count: 8
Or we may concatenate the values:
cat relpipe-data/examples/film-2.csv \
| relpipe-in-csv \
| relpipe-tr-xpath \
--relation '.*' \
--xml-attribute 'metadata' \
--output-attribute 'sentence' string 'concat("The film ", title, " was directed by ", metadata/film/director, " in year ", year, ".")' \
| relpipe-out-nullbyte | tr \\0 \\n
and build some sentences:
The film Beetlejuice was directed by Tim Burton in year 1988.
The film Vacation was directed by Harold Ramis in year 1983.
The film Christmas Vacation was directed by Jeremiah S. Chechik in year 1989.
The film Celebrity was directed by Woody Allen in year 1998.
The film Great Balls of Fire! was directed by Jim McBride in year 1989.
The film Flatliners was directed by Joel Schumacher in year 1990.
The film Heathers was directed by Michael Lehmann in year 1988.
The film Blue Velvet was directed by David Lynch in year 1986.
Input data may come not only from some kind of database or some carefully designed data set, they may be e.g. scattered on our filesystem in some already defined file format never intended for use as a database… despite this fact, we can still collect and query such data in a relational way.
For example, Maven (a build system for Java) describe its modules in XML format in pom.xml
files.
Using the find
and relpipe-in-filesystem
we collect them and create a relation containing names and contents of such files:
find -type f -name 'pom.xml' -print0 \
| relpipe-in-filesystem \
--relation 'module' \
--file path \
--file content \
| relpipe-tr-xpath \
--namespace 'm' 'http://maven.apache.org/POM/4.0.0' \
--relation '.*' \
--xml-attribute 'content' \
--output-attribute 'path' string 'path' \
--output-attribute 'group_id' string 'content/m:project/m:groupId' \
--output-attribute 'artifact_id' string 'content/m:project/m:artifactId' \
--output-attribute 'version' string 'content/m:project/m:version' \
| relpipe-out-tabular
Then we extract desired values using relpipe-tr-xpath
and get:
module:
╭─────────────────────────────────────┬────────────────────────┬──────────────────────┬──────────────────╮
│ path (string) │ group_id (string) │ artifact_id (string) │ version (string) │
├─────────────────────────────────────┼────────────────────────┼──────────────────────┼──────────────────┤
│ ./java/jdbc-dk-driver/pom.xml │ info.globalcode.sql.dk │ jdbc-dk-driver │ 0.11-SNAPSHOT │
│ ./java/jdbc-loopback-driver/pom.xml │ info.globalcode.sql.dk │ jdbc-loopback-driver │ 0.11-SNAPSHOT │
│ ./java/sql-dk/pom.xml │ info.globalcode.sql.dk │ sql-dk │ 0.11-SNAPSHOT │
│ ./java/pom.xml │ info.globalcode.sql.dk │ sql-dk-parent │ 0.11-SNAPSHOT │
╰─────────────────────────────────────┴────────────────────────┴──────────────────────┴──────────────────╯
Record count: 4
This way we can harvest useful values from XML files – and not only XML files, also from various alternative formats, after we convert them (on-the-fly) to XML. Such conversions are already available for formats like INI, ASN.1, MIME, HTML JSON, YAML etc.
The abovementioned combination of classic relational attributes with free-form XML structures is definitely not a design of first choice. But sometimes it makes sense and sometimes we have to work with data not designed by us and need some tools to deal with them. When we are designing the data model ourselves, we should always pursue the normalized form …and break the rules only if we have really good reason to do so.
Relational pipes, open standard and free software © 2018-2022 GlobalCode