POKI_PUT_TOC_HERE
We think of CSV tables as rectangular: if there are 17 columns in the header then there are 17 columns for every row, else the data have a formatting error.
But heterogeneous data abound (today’s no-SQL databases for example). Miller handles this.
For I/O
CSV and pretty-print
Miller simply prints a newline and a new header when there is a schema change. When there is no schema change, you get CSV per se as a special case. Likewise, Miller reads heterogeneous CSV or pretty-print input the same way. The difference between CSV and CSV-lite is that the former is RFC4180-compliant, while the latter readily handles heterogeneous data (which is non-compliant). For example:
POKI_RUN_COMMAND{{cat data/het.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --ocsvlite cat data/het.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint cat data/het.dkvp}}HERE
|
Miller handles explicit header changes as just shown. If your CSV input contains ragged data — if there are implicit header changes — you can use --allow-ragged-csv-input (or keystroke-saver --ragged). For too-short data lines, values are filled with empty string; for too-long data lines, missing field names are replaced with positional indices (counting up from 1, not 0), as follows:
POKI_RUN_COMMAND{{cat data/ragged.csv}}HERE
|
POKI_RUN_COMMAND{{mlr --icsv --oxtab --allow-ragged-csv-input cat data/ragged.csv}}HERE
|
You may also find Miller’s group-like feature handy (see also
POKI_PUT_LINK_FOR_PAGE(reference.html)HERE):
POKI_RUN_COMMAND{{mlr --ocsvlite group-like data/het.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint group-like data/het.dkvp}}HERE
|
Key-value-pair, vertical-tabular, and index-numbered formats
For these formats, record-heterogeneity comes naturally:
POKI_RUN_COMMAND{{cat data/het.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --onidx --ofs ' ' cat data/het.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --oxtab cat data/het.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --oxtab group-like data/het.dkvp}}HERE
|
For processing
Miller operates on specified fields and takes the rest along: for example, if you are sorting on the
count field then all records in the input stream must have a count field but the other fields can
vary, and moreover the sorted-on field name(s) don’t need to be in the same position on each line:
POKI_RUN_COMMAND{{cat data/sort-het.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr sort -n count data/sort-het.dkvp}}HERE
|