It isn’t. Miller is one of many, many participants in the
online-analytical-processing culture; other key participants include
awk, SQL, spreadsheets, and so on. Far from being an original
concept, Miller explicitly strives to imitate several existing tools:
Unix toolkit: Intentional similarities, as described on the
feature-comparison page (feature-comparison.html).
Recipes abound for command-line data analysis using the Unix toolkit; a typical one is sketched below.
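For instance, here is a classic Unix-toolkit recipe of the kind meant here, frequency-counting the values in one column (the file name data.tsv and the column position are illustrative assumptions):

    # count distinct values in column 2 of a tab-delimited file, most frequent first
    cut -f 2 data.tsv | sort | uniq -c | sort -rn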
RecordStream: Miller owes particular inspiration
to RecordStream. The
key difference is that RecordStream is a Perl-based tool for manipulating JSON
(requiring other formats such as CSV to be separately converted into
and out of JSON), while Miller is fast C which handles its formats natively.
The similarities include the sort, stats1 (analog of
RecordStream’s collate), and delta operations, as well
as filter and put, and pretty-print formatting.
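For a flavor of put and filter chained together (the field names x and y and the file name mydata.csv are hypothetical):

    # compute a derived field, then keep only records where it exceeds a threshold,
    # reading CSV and writing pretty-printed tabular output
    mlr --icsv --opprint put '$xy = $x * $y' then filter '$xy > 10' mydata.csv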
stats_m: A third source of lineage is my Python
stats_m
module. This includes simple single-pass algorithms of the kind which form
Miller’s stats1 and stats2 subcommands.
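The single-pass idea is that moments are accumulated while streaming, so nothing needs to be held in memory. A minimal sketch in awk, using the naive sum-of-squares formulation (Miller's own accumulators may differ in detail):

    # streaming mean and variance of column 1, in one pass over the input
    awk '{ n++; s += $1; ss += $1*$1 } END { m = s/n; print "mean", m, "var", ss/n - m*m }' data.txt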
SQL: Fourthly, Miller’s group-by command
name is from SQL, as is the term aggregate.
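As a side-by-side sketch of the shared vocabulary (the table/column names and the file name mydata.csv are hypothetical):

    # SQL:
    #   SELECT color, AVG(x), SUM(x) FROM mytable GROUP BY color;
    # the Miller analog, grouping and aggregating by column name:
    mlr --csv stats1 -a mean,sum -f x -g color mydata.csv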
Added value:
Miller’s added value includes:
- Name-indexing, compared to the Unix toolkit’s positional indexing (see the example after this list).
- Raw speed, compared to awk, RecordStream, stats_m, or the ad-hoc Python/Ruby/etc. scripts one can easily create.
- Ability to handle text files directly on the Unix pipe, without first loading them into database tables, compared to SQL databases.
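A minimal illustration of the name-indexing point (the column names hostname and uptime and the file name hosts.csv are hypothetical):

    # positional: you must know hostname is column 1 and uptime is column 3
    cut -d, -f 1,3 hosts.csv
    # name-indexed: Miller's cut verb addresses columns by name
    mlr --csv cut -f hostname,uptime hosts.csv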
jq: Miller does for name-indexed text what
jq does for JSON. If you’re
not already familiar with jq, please check it out!
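To make the analogy concrete (the field name x and the file name mydata.json are hypothetical; the jq sketch assumes a top-level array):

    # jq: keep objects whose x field exceeds 1
    jq '.[] | select(.x > 1)' mydata.json
    # Miller: the same idea on name-indexed records
    mlr --json filter '$x > 1' mydata.json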
What about DOTADIW? One of the key points of the
Unix philosophy is
that a tool should do one thing and do it well. Hence sort and
cut do just one thing. Why does Miller put awk-like
processing, a few SQL-like operations, and statistical reduction all into one
tool (see also the reference page, reference.html)? This is a fair
question. First note that many standard tools, such as awk and
perl, do quite a few things, as does jq. I could instead have
pushed for putting format awareness and name-indexing options into
cut, awk, and so on (so you could do cut -f
hostname,uptime or awk '{sum += $x*$y}END{print sum}'). But patching
cut, sort, etc. on multiple operating systems is a
non-starter in terms of uptake. Moreover, it makes sense for me to have Miller
be a tool which collects format-aware record-stream processing into
one place, with good reuse of Miller-internal library code for its various
features.
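As a concrete sketch of what that consolidation looks like in Miller (the column names hostname, uptime, x, and y, and the file name mydata.csv, are illustrative):

    # name-aware cut, done by Miller itself
    mlr --csv cut -f hostname,uptime mydata.csv
    # the sum-of-products, in Miller's put DSL with record output suppressed
    mlr --csv put -q '@sum += $x * $y; end { emit @sum }' mydata.csv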
No, really, why one more command-line data-manipulation
tool? I wrote Miller because I was frustrated with tools like
grep, sed, and so on being line-aware without being
format-aware. The single most poignant example I can think of is seeing
people grep data lines out of their CSV files and sadly losing their header
lines. While some lighter-than-SQL processing is very nice to have, at core I
wanted the format-awareness of RecordStream combined
with the raw speed of the Unix toolkit. Miller does precisely that.
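To close with that poignant example made concrete (the column name date and the file name mydata.csv are hypothetical):

    # grep drops the CSV header line unless the pattern happens to match it
    grep 2015-08 mydata.csv
    # Miller is CSV-aware: the header is read, fields are addressed by name,
    # and the header is written back out with the matching records
    mlr --csv filter '$date =~ "2015-08"' mydata.csv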