NAME
data-prepare - prepare CSV data for automatic processing
SYNOPSIS
data-prepare [-f config] [-v] [[-u|-a|-k|-p|-sn colnum|-sk key] file...]
DESCRIPTION
Uses Data::Prepare to process the specified CSV files to make them suitable for automatic processing (such as data science applications). It will first delete columns specified in chop_cols
, then do the merge
operations, then the chop_lines
operations (in that order so that all lines are available for merge
purposes). Then primary keys are inserted.
If the flags that take files are given, the config file is not read and no operations executed.
Please note this program overwrites the data files with updated data. Your workflow needs to take this into account by e.g. copying the data into place before calling this program. Use of a version-control system such as Git is also recommended.
OPTIONS
- -u file...
-
For each given file (the config is ignored), prints out any non-unique column values in the first row ("column headers"), with the number of times they occur. Use this to see if further merge/modify operations are needed on column headers in order to achieve uniquely-named columns.
- -a file...
-
Print for each file, a sequence of numbers. Each number is the count of non-blank entries in that column (from left to right). This helps you spot columns with few or no entries that you may wish to "chop".
- -s[kn] col file...
-
Print for each file, either the zero-based-number-th column ("slice") from that file, or in the
-sk
form, it gets the number using "key_to_index" in Data::Prepare. - -v
-
Turn on verbose mode.
- -f config
-
Use the given YAML-formatted config file rather than the default of data-prepare-conf.yml. See below for format.
- -k file...
-
Requires a config file with a
pk_spec
key giving keysfile
,primary_key
, and an arrayalt_keys
.Print for each file, a data structure mapping each column that gave any matches to a further hash-ref mapping each of the potential key columns to how many matches it gave. This is to help you select the best main
primary_key
to specify for that file. - -p file...
-
Requires a config file with a
pk_spec
key giving keysfile
,primary_key
, and an arrayalt_keys
.Print for each file, data structures showing which primary-key columns gave how many exact matches, which gave approximate matches, and an array-ref of rows that gave no matches at all. These files will want to have already been "header merged".
CONFIGURATION FILE FORMAT
This is in YAML format. An example is given below (included in the distribution, together with the applicable CSV files, in the examples directory):
---
chop_cols:
examples/CoreHouseholdIndicators.csv: [0, 2, 4, 7, 10, 13, 16, 19, 21, 22, 23, 25, 26, 29, 32]
chop_lines:
examples/CoreHouseholdIndicators.csv: [0, 0, 0, -1, -1, -1, -1, -1]
merge:
- files:
- examples/CoreHouseholdIndicators.csv
spec:
- do:
- overwrite
from: up
fromspec: lastnonblank
line: 2
matchto: HH
to: self
- do:
- prepend
- ' '
from: self
line: 2
matchfrom: .
to: down
- do:
- prepend
- /
from: self
fromspec: left
line: 3
matchto: Year
to: self
- do:
- overwrite
from: self
fromspec: literal:Country
line: 3
to: self
tospec: index:0
pk_insert:
- files:
- examples/CoreHouseholdIndicators.csv
spec:
column_heading: ISO3CODE
local_column: Country
pk_column: official_name_en
use_fallback: true
pk_spec:
file: examples/country-codes.csv
primary_key: ISO3166-1-Alpha-3
alt_keys:
- ISO3166-1-Alpha-2
- UNTERM English Short
- UNTERM English Formal
- official_name_en
- CLDR display name
stopwords:
- islands
- china
- northern
This turns the first three lines of CSV excerpted from the supplied example data (spaces inserted for alignment reasons only):
,Proportion of households with, , ,
,(HH1) ,Year ,(HH2),Year
,Radio ,of data,TV ,of data
Belize,58.7 ,2019 ,78.7 ,2019
into the following. Note that the first two lines will still be present (not shown), possibly modified, so you will need your chop_lines to remove them. The columns of the third line are shown, one per line, for readability:
ISO3CODE,
Country,
Proportion of households with Radio,
Proportion of households with Radio/Year of data,
Proportion of households with TV,
Proportion of households with TV/Year of data
This achieves a single row of column-headings, with each column-heading being unique, and sufficiently meaningful. It also has a column that can be used as a globally-unique and meaningful primary key for joining data-sets.
WORKFLOW
This is one workflow, using the supplied example config, and recreating the supplied example data by re-downloading it from the International Telecommunication Union (ITU):
mkdir -p xlsx examples
# localc --convert-to xlsx:"Calc MS Excel 2007 XML" --outdir xlsx file.xls # convert other spreadsheet format
wget https://www.itu.int/en/ITU-D/Statistics/Documents/statistics/2020/CoreHouseholdIndicators.xlsx -P xlsx
# after: pip3 install --user xlsx2csv
~/.local/bin/xlsx2csv -i -a xlsx/CoreHouseholdIndicators.xlsx examples
data-prepare -f examples/data-prepare-conf.yml # the supplied example config
localc
is LibreOffice's spreadsheet program.