POKI_PUT_TOC_HERE
Data
Test data were of the form
POKI_INCLUDE_ESCAPED(./data/small)HERE
POKI_INCLUDE_ESCAPED(./data/small.csv)HERE
for DKVP and CSV, respectively. Fields a and b each take one of five text values,
uniformly distributed; i is a line counter starting at 1; x and y
are independent, uniformly distributed floating-point numbers in the unit
interval.
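Data of this shape can be generated with a short script. The following is a minimal sketch, not the generator actually used for these tests: the five label strings, the function names, and the fixed seed are assumptions made for illustration, since the text above says only that a and b take one of five text values.

```python
import random

# Assumption: placeholder labels; the source says only "five text values".
LABELS = ["pan", "eks", "wye", "zee", "hat"]

def make_records(n, seed=0):
    """Yield n records with fields a, b, i, x, y as described above."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    for i in range(1, n + 1):
        yield {
            "a": rng.choice(LABELS),  # uniform over the five labels
            "b": rng.choice(LABELS),
            "i": i,                   # line counter starting at 1
            "x": rng.random(),        # uniform on [0, 1)
            "y": rng.random(),
        }

def to_dkvp(record):
    """Format one record as a DKVP line: key=value pairs joined by commas."""
    return ",".join(f"{key}={value}" for key, value in record.items())

def to_csv_lines(records):
    """Yield a CSV header line followed by one data line per record."""
    header_written = False
    for record in records:
        if not header_written:
            yield ",".join(record.keys())
            header_written = True
        yield ",".join(str(value) for value in record.values())
```

Writing `to_dkvp(r)` for each of `make_records(1_000_000)` to one file, and the output of `to_csv_lines` to another, would give a pair of inputs comparable in shape to the ones described here.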
Data files of one million lines (totalling about 50MB for CSV and 60MB for
DKVP) were used. In experiments not shown here, I also varied the file sizes;
run times scaled linearly with file size, as expected, so I produced no
file-size-dependent plots.
Comparands
The cat, cut, awk, sed, and sort tools
were compared to mlr on an 8-core Darwin laptop; RAM capacity was
nowhere near challenged. The catc program is a simple line-oriented
line-printer (source
here) which is intermediate between Miller (which is record-aware as well
as line-aware) and cat (which is only byte-aware).
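To make the "line-aware but not record-aware" distinction concrete, here is a minimal Python sketch of what a line-oriented line-printer does. It is an illustration only, not the catc source referred to above, and the function name is hypothetical.

```python
def cat_lines(instream, outstream):
    """Copy instream to outstream one line at a time.

    Each line is passed through whole: it is never split into fields,
    and this code never looks at the individual characters in a line.
    """
    for line in instream:
        outstream.write(line)
```

Calling `cat_lines(sys.stdin, sys.stdout)` (after `import sys`) gives a filter that handles its input in line-sized units, whereas cat can copy arbitrary byte buffers and Miller must additionally split each line into key-value pairs.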
Raw results
Note that for CSV data, the command is mlr --csvlite ... rather than mlr ....
POKI_INCLUDE_ESCAPED(perftbl.txt)HERE
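Timings of this kind can be collected with a small harness. The sketch below is one way to do it and is not the procedure used to produce the table above: the function name, the choice of best-of-several wall-clock time, and the discarding of standard output are all assumptions.

```python
import subprocess
import time

def time_command(argv, repeats=3):
    """Return the best wall-clock time, in seconds, over `repeats` runs of argv.

    Standard output is discarded so that terminal rendering speed does not
    dominate the measurement; a non-zero exit status raises CalledProcessError.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best
```

For example, `time_command(["mlr", "--csvlite", "cat", "big.csv"])` and `time_command(["cat", "big.csv"])` would time the two commands on the same input, where `big.csv` is a hypothetical file name standing in for the one-million-line CSV file.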
Analysis
As expected, cat is very fast: it needs only to stream bytes as quickly as possible; it doesn’t even need to touch individual bytes.
My catc is also faster than Miller: it needs to read and write lines, but it doesn’t segment lines into records; in fact it does no iteration over bytes in each line.
Miller does not outperform sed, which is string-oriented rather than record-oriented.
For the tools which do need to pick apart fields (cut,
awk, sort), Miller is comparable to them or outperforms them. As noted above,
this holds across file sizes, since run times scale linearly with file size.
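The extra work that field-aware tools do on every line can be sketched as follows. This is an illustrative Python parser for a single DKVP line, not Miller's implementation. The function and parameter names are hypothetical, and keying a field that lacks a separator by its position is an assumption of this sketch.

```python
def parse_dkvp(line, field_sep=",", pair_sep="="):
    """Split one DKVP line into an ordered dict of key -> value strings."""
    text = line.rstrip("\n")
    record = {}
    if not text:
        return record  # an empty line yields an empty record
    for position, field in enumerate(text.split(field_sep), start=1):
        key, sep, value = field.partition(pair_sep)
        if not sep:
            # Assumption: a field with no "=" is keyed by its 1-up position.
            key, value = str(position), field
        record[key] = value
    return record

# Every character of the line is scanned to find the separators.
print(parse_dkvp("a=pan,b=wye,i=1,x=0.5,y=0.25\n"))
# {'a': 'pan', 'b': 'wye', 'i': '1', 'x': '0.5', 'y': '0.25'}
```

A line-printer skips this step entirely, which is consistent with catc being faster than Miller in the results above.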
For univariate and bivariate statistics, I didn’t attempt to
compare to other tools wherein such computations are less straightforward;
rather, I attempted only to show that Miller’s processing time here is comparable to its own processing time for other problems.
Conclusion
For record-oriented data transformations, Miller meets or beats the Unix
toolkit in many contexts. Field renames in particular are worth doing as a
pre-pipe or post-pipe using sed, since Miller does not outperform sed at
string-oriented work.