csvgroup

Description

Perform a SQL-like group by operation on a CSV file on a specified column or columns. Operation results in printing columns that were aggregation key (it will be distinct in every row) and appending all columns from aggregation functions values:

usage: csvgroup [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
            [-p ESCAPECHAR] [-z MAXFIELDSIZE] [-e ENCODING] [-v] [-l]
            [--zero] [-c COLUMNS] [-a FUNCTION COLUMNS] [-n]
            [FILE]

Execute a SQL-like group by on specified column or columns

positional arguments:
  FILE                  The CSV file to operate on. If omitted, will accept
                        input on STDIN.

optional arguments:
  -c COLUMNS, --columns COLUMNS
                        The column name(s) on which to group by. Should be
                        either one name (or index) or a comma-separated list.
                        May also be left unspecified, in which case none
                        columns will be used
  -a FUNCTION COLUMNS, --aggregation FUNCTION COLUMNS
                        Aggregate column values using max function
  -n, --names           Display column names and indices from the input CSV
                        and exit.

Also see: Arguments common to all utilities.

Aggregation functions

  • max
  • min
  • sum
  • count - count every row
  • countA - count non zero rows
  • common - returns most common value (handy with text fields aggregation)

Examples

Having a data:

$ csvlook examples/test_group.csv
|-----+----+----+----+----+-----|
|  h1 | h2 | h3 | h4 | h5 | h6  |
|-----+----+----+----+----+-----|
|  a  | a  | b  | 1  | 2  | 3   |
|  c  | b  | b  | 0  | 0  | 0   |
|  d  | b  | b  | 6  | 7  | 1   |
|  b  | a  | b  | 3  | 2  | 1   |
|-----+----+----+----+----+-----|

Warning

Before grouping by any number of columns (see -c option) you should always sort data using the same columns identifiers ( e.g. use csvsort -c 1,2,3 | csvgroup -c 1,2,3). csvgroup pre assumes that data are already sorted using aggregation key. This is very similar to Linux sort and uniq idiom.

Sorting is however not necessary for grouping without specifying -c parameter (without aggregation key).

Lets aggregate by column h2 using min, max and count of columns h4, h5, h6:

$ csvsort -c h2 examples/test_group.csv | csvgroup -c h2 -a min h4 -a max h5 -a count h6 | csvlook
|-----+---------+---------+------------|
|  h2 | min(h4) | max(h5) | count(h6)  |
|-----+---------+---------+------------|
|  a  | 1       | 2       | 2          |
|  b  | 0       | 7       | 2          |
|-----+---------+---------+------------|

We can also define many columns for one aggregate:

$ csvsort -c h2 examples/test_group.csv | csvgroup -c h2 -a max 4-6 | csvlook
|-----+---------+---------+----------|
|  h2 | max(h4) | max(h5) | max(h6)  |
|-----+---------+---------+----------|
|  a  | 3       | 2       | 3        |
|  b  | 6       | 7       | 1        |
|-----+---------+---------+----------|

Aggregating by two columns:

$ csvsort -c h2,h3 examples/test_group.csv | csvgroup -c h2,h3 -a max 4-6 | csvlook
|-----+----+---------+---------+----------|
|  h2 | h3 | max(h4) | max(h5) | max(h6)  |
|-----+----+---------+---------+----------|
|  a  | b  | 3       | 2       | 3        |
|  b  | b  | 6       | 7       | 1        |
|-----+----+---------+---------+----------|

And by all rows:

$ csvgroup -a max 4-6 examples/test_group.csv | csvlook
|----------+---------+----------|
|  max(h4) | max(h5) | max(h6)  |
|----------+---------+----------|
|  6       | 7       | 3        |
|----------+---------+----------|

Note

Notice that aggregation by all rows does not require any kind of sorting, because every row is treated as unique.

Get most common value of every column:

$ csvgroup -a common 1-6 examples/test_group.csv | csvlook
|-----+----+----+----+----+-----|
|  h1 | h2 | h3 | h4 | h5 | h6  |
|-----+----+----+----+----+-----|
|  a  | a  | b  | 1  | 2  | 1   |
|-----+----+----+----+----+-----|

Project Versions

Table Of Contents

Previous topic

csvpys

Next topic

csvjson

This Page