Perform a SQL-like group by operation on a CSV file on a specified column or columns. Operation results in printing columns that were aggregation key (it will be distinct in every row) and appending all columns from aggregation functions values:
usage: csvgroup [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
[-p ESCAPECHAR] [-z MAXFIELDSIZE] [-e ENCODING] [-v] [-l]
[--zero] [-c COLUMNS] [-a FUNCTION COLUMNS] [-n]
[FILE]
Execute a SQL-like group by on specified column or columns
positional arguments:
FILE The CSV file to operate on. If omitted, will accept
input on STDIN.
optional arguments:
-c COLUMNS, --columns COLUMNS
The column name(s) on which to group by. Should be
either one name (or index) or a comma-separated list.
May also be left unspecified, in which case none
columns will be used
-a FUNCTION COLUMNS, --aggregation FUNCTION COLUMNS
Aggregate column values using max function
-n, --names Display column names and indices from the input CSV
and exit.
Also see: Arguments common to all utilities.
- max
- min
- sum
- count - count every row
- countA - count non zero rows
- common - returns most common value (handy with text fields aggregation)
Having a data:
$ csvlook examples/test_group.csv
|-----+----+----+----+----+-----|
| h1 | h2 | h3 | h4 | h5 | h6 |
|-----+----+----+----+----+-----|
| a | a | b | 1 | 2 | 3 |
| c | b | b | 0 | 0 | 0 |
| d | b | b | 6 | 7 | 1 |
| b | a | b | 3 | 2 | 1 |
|-----+----+----+----+----+-----|
Warning
Before grouping by any number of columns (see -c option) you should always sort data using the same columns identifiers ( e.g. use csvsort -c 1,2,3 | csvgroup -c 1,2,3). csvgroup pre assumes that data are already sorted using aggregation key. This is very similar to Linux sort and uniq idiom.
Sorting is however not necessary for grouping without specifying -c parameter (without aggregation key).
Lets aggregate by column h2 using min, max and count of columns h4, h5, h6:
$ csvsort -c h2 examples/test_group.csv | csvgroup -c h2 -a min h4 -a max h5 -a count h6 | csvlook
|-----+---------+---------+------------|
| h2 | min(h4) | max(h5) | count(h6) |
|-----+---------+---------+------------|
| a | 1 | 2 | 2 |
| b | 0 | 7 | 2 |
|-----+---------+---------+------------|
We can also define many columns for one aggregate:
$ csvsort -c h2 examples/test_group.csv | csvgroup -c h2 -a max 4-6 | csvlook
|-----+---------+---------+----------|
| h2 | max(h4) | max(h5) | max(h6) |
|-----+---------+---------+----------|
| a | 3 | 2 | 3 |
| b | 6 | 7 | 1 |
|-----+---------+---------+----------|
Aggregating by two columns:
$ csvsort -c h2,h3 examples/test_group.csv | csvgroup -c h2,h3 -a max 4-6 | csvlook
|-----+----+---------+---------+----------|
| h2 | h3 | max(h4) | max(h5) | max(h6) |
|-----+----+---------+---------+----------|
| a | b | 3 | 2 | 3 |
| b | b | 6 | 7 | 1 |
|-----+----+---------+---------+----------|
And by all rows:
$ csvgroup -a max 4-6 examples/test_group.csv | csvlook
|----------+---------+----------|
| max(h4) | max(h5) | max(h6) |
|----------+---------+----------|
| 6 | 7 | 3 |
|----------+---------+----------|
Note
Notice that aggregation by all rows does not require any kind of sorting, because every row is treated as unique.
Get most common value of every column:
$ csvgroup -a common 1-6 examples/test_group.csv | csvlook
|-----+----+----+----+----+-----|
| h1 | h2 | h3 | h4 | h5 | h6 |
|-----+----+----+----+----+-----|
| a | a | b | 1 | 2 | 1 |
|-----+----+----+----+----+-----|