The relpipe-in-filesystem
and the xpath
streamlet allows us to extract multiple values (attributes) from XML files.
We can use this feature to collect data from e.g. XHTML pages.
#!/bin/bash
XMLNS_H="http://www.w3.org/1999/xhtml"
# If we set xmlns_h="…", we can omit: --option xmlns_h "$XMLNS_H"
# because XML namespaces can be provided either as an option or as an environmental variable.
# Options have precedence.
findFiles() {
find -print0;
}
fetchAttributes() {
relpipe-in-filesystem \
--parallel 8 \
--file name \
--streamlet xpath \
--option xmlns_h "$XMLNS_H" \
--option attribute '.' --option mode boolean --as 'valid_xml' \
--option attribute 'namespace-uri()' --as 'root_xmlns' \
--option attribute '/h:html/h:head/h:title' --as 'title' \
--option attribute 'count(//h:h1)' --as 'h1_count' \
--option attribute 'count(//h:h2)' --as 'h2_count' \
--option attribute 'count(//h:h3)' --as 'h3_count'
}
filterAndOrder() {
relpipe-tr-sql \
--relation "pages" \
"SELECT
name,
title,
h1_count,
h2_count,
h3_count
FROM filesystem WHERE root_xmlns = ?
ORDER BY h1_count + h2_count + h3_count DESC
LIMIT 5" \
--type-cast 'h1_count' integer \
--type-cast 'h2_count' integer \
--type-cast 'h3_count' integer \
--parameter "$XMLNS_H";
}
findFiles | fetchAttributes | filterAndOrder | relpipe-out-gui -title "Pages and titles"
The script above will show this barchart and statistics:
This pipeline consists of four steps:
findFiles
– prepares the list of files separated by \0
byte;
we can add -iname '*.xhtml'
if we know the extension and make the pipeline more efficient
fetchAttributes
– does the heavy work – tries to parse each given file as a XML
and if valid, extracts several values specified by the XPath expressions;
thanks to --parallel N
option, utilizes N cores of our CPU;
we can experiment with the N value and look how the total time decreases
filterAndOrder
– uses SQL to skip the records (files) that are not XHTML
and takes five valid files with most number of headlines
relpipe-out-gui
– displays the data is a GUI window and generates a bar chart from the numeric values
(we could use e.g. relpipe-out-tabular
to display the data in the text terminal or format the results as XML, CSV or other format)
We can use a similar pipeline to extract any values from any set of XML files (e.g. Maven POM files or WSDL definitions).
Using --option mode raw-xml
we can extract even sub-trees (XML fragments) from the XML files, so we can collect also arbitrarily structured data, not only simple values like strings or booleans.
Relational pipes, open standard and free software © 2018-2022 GlobalCode