Sometimes a website contains interesting data in a semi-structured form. We can read such data and process them as relations using the XMLTable input and e.g. the SQL transformation. This example shows how to read the list of available Relpipe implementations, filter the commands (executables) and compute statistics, so we can see how many input filters, output filters and transformations we have:
#!/bin/bash
URL="https://relational-pipes.globalcode.info/v_0/implementation.xhtml";
# or we can use wget, if curl is missing or we like wget more:
# curl() { wget -O - "$1"; }
curl "$URL" \
	| relpipe-in-xmltable \
		--namespace "h" "http://www.w3.org/1999/xhtml" \
		--relation "implementation" \
		--records "//h:table[1]/h:tbody/h:tr" \
		--attribute "name" string "h:td[1]" \
		--attribute "type" string "h:td[2]" \
		--attribute "subtype" string "h:td[3]" \
	| relpipe-tr-sql \
		--relation "relpipe_commands" \
			"SELECT
				subtype AS type,
				count(*) AS count
			FROM implementation
			WHERE type = 'executable'
			GROUP BY type, subtype
			ORDER BY count DESC" \
	| relpipe-out-tabular
This script will generate a relation:
relpipe_commands:
 ╭────────────────┬────────────────╮
 │ type (string)  │ count (string) │
 ├────────────────┼────────────────┤
 │ output         │ 8              │
 │ transformation │ 8              │
 │ input          │ 7              │
 ╰────────────────┴────────────────╯
Record count: 3
Using these tools we can build e.g. an automatic system which watches a website and notifies us about changes. In SQL, we can use the EXCEPT operation to compare the current data with older ones and SELECT only the new or changed records.
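Such a watcher could be sketched roughly as follows. This is only an illustration: the snapshot file names are hypothetical, the fetch step reuses the pipeline above (and thus requires the relpipe tools), and the comparison uses the coreutils comm command on sorted CSV lines as a simple stand-in for the SQL EXCEPT operation.

```shell
#!/bin/bash
# Sketch of a simple change watcher. The fetch step assumes the relpipe
# tools from the example above are on $PATH; the comparison itself needs
# only coreutils.
URL="https://relational-pipes.globalcode.info/v_0/implementation.xhtml"

# Dump the table as sorted CSV, one record per line:
fetch_snapshot() {
	curl -s "$URL" \
		| relpipe-in-xmltable \
			--namespace "h" "http://www.w3.org/1999/xhtml" \
			--relation "implementation" \
			--records "//h:table[1]/h:tbody/h:tr" \
			--attribute "name" string "h:td[1]" \
			--attribute "type" string "h:td[2]" \
			--attribute "subtype" string "h:td[3]" \
		| relpipe-out-csv \
		| sort
}

# Analogue of the SQL EXCEPT operation on two sorted files:
# print the lines present in the new snapshot but missing from the old one.
new_or_changed() {
	comm -13 "$1" "$2"    # $1 = old snapshot, $2 = new snapshot
}

# Usage (run e.g. periodically from cron):
#   fetch_snapshot > current.csv
#   new_or_changed previous.csv current.csv    # the records to notify about
#   mv current.csv previous.csv
```

The advantage of line-oriented CSV here is that any standard diffing tool works on it; doing the same comparison inside relpipe-tr-sql with EXCEPT keeps the data relational but needs both snapshots loaded as relations.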
There are also some caveats:
What if the table structure changes? First, we must say that parsing a web page (which is a presentation form, not designed for machine processing) is always suboptimal and hackish. The proper way is to arrange a machine-readable format for data exchange (e.g. XML with a well-defined schema). But if we do not have this option and must parse some web page, we can improve it in two ways: we can write a more robust --records XPath expression, so that it selects the table with the exact number of columns and proper header names instead of simply the first table,
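A more robust --records expression might select the table by its expected header instead of by position. This is only a sketch: the header names "Name", "Type" and "Subtype" are assumptions for illustration and must be adjusted to what the page really contains.

```shell
# Instead of blindly taking the first table:
#   --records "//h:table[1]/h:tbody/h:tr"
# select the table by its expected structure (hypothetical header names):
relpipe-in-xmltable \
	--namespace "h" "http://www.w3.org/1999/xhtml" \
	--relation "implementation" \
	--records "//h:table[h:thead/h:tr[count(h:th)=3
	             and h:th[1]='Name' and h:th[2]='Type' and h:th[3]='Subtype']]
	           /h:tbody/h:tr" \
	--attribute "name" string "h:td[1]" \
	--attribute "type" string "h:td[2]" \
	--attribute "subtype" string "h:td[3]"
```

If the structure changes, such an expression matches nothing and we get an empty relation instead of silently reading wrong columns.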
What if the web page is invalid? Unfortunately, the current web is full of invalid and faulty documents that cannot be easily parsed. In such a case, we can pass the stream through the tidy tool, which fixes the bugs, and then pass it to relpipe-in-xmltable. It is just one additional step in our pipeline.
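The extra tidy step could look like the following sketch. The option names are from HTML Tidy (-asxml converts the input to well-formed XHTML, -q suppresses the summary, -numeric replaces named entities with numeric ones); note that tidy exits with a non-zero status when it emits warnings, which may matter under set -e.

```shell
curl "$URL" \
	| tidy -q -asxml -numeric 2>/dev/null \
	| relpipe-in-xmltable \
		--namespace "h" "http://www.w3.org/1999/xhtml" \
		--relation "implementation" \
		--records "//h:table[1]/h:tbody/h:tr" \
		--attribute "name" string "h:td[1]" \
		--attribute "type" string "h:td[2]" \
		--attribute "subtype" string "h:td[3]" \
	| relpipe-out-tabular
```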
Relational pipes, open standard and free software © 2018-2022 GlobalCode