NAME
App::CSVUtils - CLI utilities related to CSV
VERSION
This document describes version 1.034 of App::CSVUtils (from Perl distribution App-CSVUtils), released on 2024-02-02.
DESCRIPTION
This distribution contains the following CLI utilities:
- 1. csv-add-fields
- 2. csv-avg
- 3. csv-check-cell-values
- 4. csv-check-field-names
- 5. csv-check-field-values
- 6. csv-check-rows
- 7. csv-cmp
- 8. csv-concat
- 9. csv-convert-to-hash
- 10. csv-csv
- 11. csv-delete-fields
- 12. csv-dump
- 13. csv-each-row
- 14. csv-fill-cells
- 15. csv-fill-template
- 16. csv-find-values
- 17. csv-freqtable
- 18. csv-gen
- 19. csv-get-cells
- 20. csv-grep
- 21. csv-info
- 22. csv-intrange
- 23. csv-list-field-names
- 24. csv-lookup-fields
- 25. csv-ltrim
- 26. csv-map
- 27. csv-munge-field
- 28. csv-munge-rows
- 29. csv-pick
- 30. csv-pick-cells
- 31. csv-pick-fields
- 32. csv-pick-rows
- 33. csv-quote
- 34. csv-replace-newline
- 35. csv-rtrim
- 36. csv-select-fields
- 37. csv-select-rows
- 38. csv-setop
- 39. csv-shuf
- 40. csv-shuf-fields
- 41. csv-shuf-rows
- 42. csv-sort
- 43. csv-sort-fields
- 44. csv-sort-fields-by-example
- 45. csv-sort-fields-by-spec
- 46. csv-sort-rows
- 47. csv-sorted
- 48. csv-sorted-fields
- 49. csv-sorted-rows
- 50. csv-split
- 51. csv-sum
- 52. csv-transpose
- 53. csv-trim
- 54. csv-uniq
- 55. csv-unquote
- 56. csv2csv
- 57. csv2ltsv
- 58. csv2paras
- 59. csv2td
- 60. csv2tsv
- 61. csv2vcf
- 62. list-csvutils
- 63. paras2csv
- 64. tsv2csv
FUNCTIONS
gen_csv_util
Usage:
gen_csv_util(%args) -> bool
Generate a CSV utility.
This routine is used to generate a CSV utility in the form of a Rinci function (code and metadata). You can then produce a CLI from the Rinci function simply using Perinci::CmdLine::Gen or, if you use Dist::Zilla, Dist::Zilla::Plugin::GenPericmdScript or, if on the command-line, gen-pericmd-script.
Using this routine, by providing just one or a few hooks and setting some parameters like a couple of extra arguments, you will get a complete CLI with decent POD/manpage, ability to read one or multiple CSV's and write one or multiple CSV's, some command-line options to customize how the input CSV's should be parsed and how the output CSV's should be formatted and named. Your CLI also has tab completion, usage and help message, and other features.
To create a CSV utility, you specify a name
(e.g. csv_dump
; must be a valid unqualified Perl identifier/function name) and optionally summary
, description
, and other metadata like links
or even add_meta_props
. Then you specify one or more of on_*
or before_*
or after_*
arguments to supply handlers (coderefs) for your CSV utility at various hook points.
THE HOOKS
All code for hooks should accept a single argument r
. r
is a stash (hashref) of various data, the keys of which will depend on which hook point being called. You can also add more keys to store data or for flow control (see hook documentation below for more details).
The order of the hooks, in processing chronological order:
on_begin
Called when utility begins, before reading CSV. You can use this hook e.g. to process arguments, set output filenames (if you allow custom output filenames).
before_read_input
Called before opening any input CSV file. This hook is still called even if your utility sets
reads_csv
to false.At this point, the
input_filenames
stash key (as well as other keys likeinput_filename
,input_filenum
, etc) has not been set. You can use this hook e.g. to set a custominput_filenames
.before_open_input_files
Called before an input CSV file is about to be opened, including for stdin (
-
). You can use this hook e.g. to check/preprocess input file. Flow control is available by setting$r->{wants_skip_files}
to skip reading all the input files and go directly to theafter_read_input
hook.before_open_input_file
Called before each input CSV file is about to be opened, including for stdin (
-
). For the first file, called afterbefore_open_input_file
hook. You can use this hook e.g. to check/preprocess input file. Flow control is available by setting$r->{wants_skip_file}
to skip reading a single input file and go to the next file, or$r->{wants_skip_files}
to skip reading the rest of the files and go directly to theafter_read_input
hook.on_input_header_row
Called when receiving header row. Will be called for every input file, and called even when user specify
--no-input-header
, in which case the header row will be the generated["field1", "field2", ...]
. You can use this hook e.g. to add/remove/rearrange fields.You can set
$r->{wants_fill_rows}
to a defined false if you do not want$r->{input_rows}
to be filled with empty string elements when it contains less than the number of fields (in case of sparse values at the end). Normally you only want to do this when you want to do checking, e.g. in csv-check-rows.on_input_data_row
Called when receiving each data row. You can use this hook e.g. to modify the row or print output (for line-by-line transformation or filtering).
after_close_input_file
Called after each input file is closed, including for stdin (
-
) (although for stdin, the handle is not actually closed). Flow control is possible by setting$r->{wants_skip_files}
to skip reading the rest of the files and go straight to theafter_close_input_files
hook.after_close_input_files
Called after the last input file is closed, after the last
after_close_input_file
hook, including for stdin (-
) (although for stdin, the handle is not actually closed).after_read_input
Called after the last row of the last CSV file is read and the last file is closed. This hook is still called, if you set
reads_csv
option to false. At this point the stash keys related to CSV reading have all been cleared, includinginput_filenames
,input_filename
,input_fh
, etc.You can use this hook e.g. to print output if you buffer the output.
on_end
Called when utility is about to exit. You can use this hook e.g. to return the final result.
THE STASH
The common keys that r
will contain:
gen_args
, hash. The arguments used to generate the CSV utility.util_args
, hash. The arguments that your CSV utility accepts. Parsed from command-line arguments (or configuration files, or environment variables).name
, str. The name of the CSV utility. Which can also be retrieved viagen_args
.code_print
, coderef. Routine provided for you to print something. Accepts a string. Takes care of opening the output files for you.code_print_row
, coderef. Routine provided for you to print a data row. You pass the row (either arrayref or hashref). Takes care of opening the output files for you, as well as printing header row the first time, if needed.code_print_header_row
, coderef. Routine provided for you to print header row. You don't need to pass any arguments. Will only print the header row once per output file if output header is enabled, even if called multiple times.
If you are accepting CSV data (reads_csv
gen argument set to true), the following keys will also be available (in on_input_header_row
and on_input_data_row
hooks):
input_parser
, a Text::CSV_XS instance for input parsing.input_fields
, array of str. Input CSV's field names.input_fields_idx
, hash with field name as keys and field index (0-based integer) as values.input_filenames
, array of str.input_filename
, str. The name of the current input file being read (-
if reading from stdin).input_filenum
, uint. The number of the current input file, 1 being the first file, 2 for the second, and so on.input_fh
, the handle to the current file being read.input_rownum
, uint. The number of rows that have been read (reset after each input file). Inon_input_header_row
phase, this will be 1 since header row (including the generated one) is the first row. Then inon_input_data_row
phase (called the first time for a file), it will be 2 for the first data row, even if physically it is the first row for CSV file that does not have a header.input_data_rownum
, uint. The number of data rows that have been read (reset after each input file). This will be equal toinput_rownum
less 1 if input file has header.input_row
, aos (array of str). The current input CSV row as an arrayref.input_row_as_hashref
, hos (hash of str). The current input CSV row as a hashref, with field names as hash keys and field values as hash values. This will only be calculated if utility wants it. Utility can express so by setting$r->{wants_input_row_as_hashref}
to true, e.g. in theon_begin
hook.input_header_row_count
, uint. Contains the number of actual header rows that have been read. If CLI user specifies--no-input-header
, this will stay at zero. Will be reset for each CSV file.input_data_row_count
, int. Contains the number of actual data rows that have read. Will be reset for each CSV file.
If you are outputting CSV (writes_csv
gen argument set to true), the following keys will be available:
output_emitter
, a Text::CSV_XS instance for output.output_fields
, array of str. Should be set to list of output field names. If unset, will be set to be the same asinput_fields
.output_fields_idx
, hash with field names as keys and field indexes (0-based integer) as values. Normally you do not need to set this manually; you just need to setoutput_fields
and this hash will be computed automatically for you just before the first output row is outputted.output_filenames
, array of str.output_filename
, str, name of current output file.output_filenum
, uint, the number of the current output file, 1 being the first file, 2 for the second, and so on.output_fh
, handle to the current output file.output_rownum
, uint. The number of rows that have been outputted (reset after each output file).output_data_rownum
, uint. The number of data rows that have been outputted (reset after each output file). This will be equal toinput_rownum
less 1 if input file has header.
For other hook-specific keys, see the documentation for associated hook point.
ACCEPTING ADDITIONAL COMMAND-LINE OPTIONS/ARGUMENTS
As mentioned above, you will get additional command-line options/arguments in $r->{util_args}
hashref. Some options/arguments are already added by gen_csv_util
, e.g. input_filename
or input_filenames
along with input_sep_char
, etc (when your utility declares reads_csv
), output_filename
or output_filenames
along with overwrite
, output_sep_char
, etc (when your utility declares writes_csv
).
If you want to accept additional arguments/options, you specify them in add_args
(hashref, with key being Each option/argument has to be specified first via add_args
(as hashref, with key being argument name and value the argument specification as defined in Rinci::function)). Some argument specifications have been defined in App::CSVUtils and can be used. See existing utilities for examples.
READING CSV DATA
To read CSV data, normally your utility would provide handler for the on_input_data_row
hook and sometimes additionally on_input_header_row
.
OUTPUTTING STRING OR RETURNING RESULT
To output string, usually you call the provided routine $r->{code_print}
. This routine will open the output files for you.
You can also return enveloped result directly by setting $r->{result}
.
OUTPUTTING CSV DATA
To output CSV data, usually you call the provided routine $r->{code_print_row}
. This routine accepts a row (arrayref or hashref). This routine will open the output files for you when needed, as well as print header row automatically.
You can also buffer rows from input to e.g. $r->{output_rows}
, then call $r->{code_print_row}
repeatedly in the after_read_input
hook to print all the rows.
READING MULTIPLE CSV FILES
To read multiple CSV files, you first specify reads_multiple_csv
. Then, you can supply handler for on_input_header_row
and on_input_data_row
as usual. If you want to do something before/after each input file, you can also supply handler for before_open_input_file
or after_close_input_file
.
WRITING TO MULTIPLE CSV FILES
Similarly, to write to many CSv files, you first specify writes_multiple_csv
. Then, you can supply handler for on_input_header_row
and on_input_data_row
as usual. To switch to the next file, set $r->{wants_switch_to_next_output_file}
to true, in which case the next call to $r->{code_print_row}
will close the current file and open the next file.
CHANGING THE OUTPUT FIELDS
When calling $r->{code_print_row}
, you can output whatever fields you want. By convention, you can set $r->{output_fields}
and $r->{output_fields_idx}
to let other handlers know about the output fields. For example, see the implementation of csv-concat.
This function is not exported by default, but exportable.
Arguments ('*' denotes required arguments):
add_args => hash
(No description)
add_args_rels => hash
(No description)
add_meta_props => hash
Add additional Rinci function metadata properties.
after_close_input_file => code
(No description)
after_close_input_files => code
(No description)
after_read_input => code
(No description)
before_open_input_file => code
(No description)
before_open_input_files => code
(No description)
before_read_input => code
(No description)
description => str
(No description)
examples => array
(No description)
links => array[hash]
(No description)
name* => perl::identifier::unqualified_ascii
(No description)
on_begin => code
(No description)
on_end => code
(No description)
on_input_data_row => code
(No description)
on_input_header_row => code
(No description)
reads_csv => bool (default: 1)
Whether utility reads CSV data.
reads_multiple_csv => bool
Whether utility accepts CSV data.
Setting this option to true will implicitly set the
reads_csv
option to true, obviously.summary => str
(No description)
writes_csv => bool (default: 1)
Whether utility writes CSV data.
writes_multiple_csv => bool
Whether utility outputs CSV data.
Setting this option to true will implicitly set the
writes_csv
option to true, obviously.
Return value: (bool)
compile_eval_code
Usage:
$coderef = compile_eval_code($str, $label);
Compile string code $str
to coderef in 'main' package, without use strict
or use warnings
. Die on compile error.
eval_code
Usage:
$res = eval_code($coderef, $r, $topic_var_value, $return_topic_var);
FAQ
My CSV does not have a header?
Use the --no-header
option. Fields will be named field1
, field2
, and so on.
My data is TSV, not CSV?
Use the --tsv
option.
I have a big CSV and the utilities are too slow or eat too much RAM!
These utilities are not (yet) optimized, patches welcome. If your CSV is very big, perhaps a C-based solution is what you need.
HOMEPAGE
Please visit the project's homepage at https://metacpan.org/release/App-CSVUtils.
SOURCE
Source repository is at https://github.com/perlancar/perl-App-CSVUtils.
SEE ALSO
Similar CLI bundles for other format
App::TSVUtils, App::LTSVUtils, App::SerializeUtils.
Other CSV-related utilities
xls2csv and xlsx2csv from Spreadsheet::Read
import-csv-to-sqlite from App::SQLiteUtils
Query CSV with SQL using fsql from App::fsql
Other non-Perl-based CSV utilities
Python
csvkit, https://csvkit.readthedocs.io/en/latest/
AUTHOR
perlancar <perlancar@cpan.org>
CONTRIBUTOR
Adam Hopkins <violapiratejunky@gmail.com>
CONTRIBUTING
To contribute, you can send patches by email/via RT, or send pull requests on GitHub.
Most of the time, you don't need to build the distribution yourself. You can simply modify the code, then test via:
% prove -l
If you want to build the distribution (e.g. to try to install it locally on your system), you can install Dist::Zilla, Dist::Zilla::PluginBundle::Author::PERLANCAR, Pod::Weaver::PluginBundle::Author::PERLANCAR, and sometimes one or two other Dist::Zilla- and/or Pod::Weaver plugins. Any additional steps required beyond that are considered a bug and can be reported to me.
COPYRIGHT AND LICENSE
This software is copyright (c) 2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016 by perlancar <perlancar@cpan.org>.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
BUGS
Please report any bugs or feature requests on the bugtracker website https://rt.cpan.org/Public/Dist/Display.html?Name=App-CSVUtils
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.