NAME
Data::Validate::CSV - read and validate CSV
SYNOPSIS
CSV Schema (JSON):
{
"@context": "http://www.w3.org/ns/csvw",
"url": "countries.csv",
"tableSchema": {
"columns": [{
"name": "country",
"datatype": { "base": "string", "length": 2 }
},{
"name": "country group",
"datatype": "string"
},{
"name": "name (en)",
"datatype": "string"
},{
"name": "name (fr)",
"datatype": "string"
},{
"name": "name (de)",
"datatype": "string"
},{
"name": "latitude",
"datatype": { "base": "number", "maximum": 90, "minimum": -90 }
},{
"name": "longitude",
"datatype": { "base": "number", "maximum": 180, "minimum": -180 }
}]
}
}
CSV Data:
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
Perl:
use Path::Tiny qw(path);
use Data::Validate::CSV;
my $table = Data::Validate::CSV::Table->new(
schema => path('countries.csv-metadata.json'),
input => path('countries.csv'),
has_header => !!0,
);
while (my $row = $table->get_row) {
for my $e (@{$row->errors}) {
warn $e;
}
printf(
"%s is at latitude %f, longitude %f.\n",
$row->get("name (en)")->value,
$row->get("latitude")->value,
$row->get("longitude")->value,
);
}
DESCRIPTION
There's not really a lot of documentation right now.
There are three main interfaces you need to know about: tables, rows, and cells. (There are also columns, schemas, and notes, but for most day-to-day usage, those can be considered internal implementation details.)
Table interface
The table is constructed with the following attributes:
schema
-
A schema for the table. Can be a hashref, a JSON string, a scalar ref to a JSON string, or a Path::Tiny path to a file containing the schema.
input
-
The CSV data for the table. Can be a filehandle, a scalar ref to a string of data, or a Path::Tiny path to a file.
has_header
-
A boolean indicating whether the CSV contains a header row. The header row is used to supply any column names missing from the schema, and is not returned by get_row.
reader
-
A coderef which, if given a filehandle, will return a parsed line of CSV. The default is basically something like:
sub { Text::CSV_XS->new->getline($_[0]) }
That's probably sufficient for most cases, but you may need to supply your own reader for handling tab-delimited files.
skip_rows
-
An integer: the number of additional rows to skip before the header. Some CSV files contain a title or credit line. Defaults to 0.
skip_rows_after_header
-
An integer: the number of additional rows to skip after the header. Defaults to 0.
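As a rough sketch of what a custom reader for tab-delimited input might look like, the coderef below follows the contract described for the reader attribute: it takes a filehandle and returns one parsed line. It assumes the parsed line should be an arrayref of fields, with undef at end of input, as Text::CSV_XS's getline returns. The name $tsv_reader is hypothetical, and this plain split does not understand quoted fields; constructing Text::CSV_XS with sep_char => "\t" is the more robust choice.

```perl
use strict;
use warnings;

# Hypothetical reader for simple tab-delimited input. It takes a
# filehandle and returns one parsed line as an arrayref of fields,
# or undef at end of input. Quoted fields are NOT handled.
my $tsv_reader = sub {
    my ($fh) = @_;
    my $line = <$fh>;
    return undef unless defined $line;
    $line =~ s/\r?\n\z//;                 # strip the line ending
    return [ split /\t/, $line, -1 ];     # -1 keeps trailing empty fields
};

# Exercise the reader on an in-memory filehandle.
my $data = "at\teu\tAustria\nbe\teu\t\n";
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";

my @rows;
while (my $fields = $tsv_reader->($fh)) {
    push @rows, $fields;
}
close $fh;

printf "%d rows, first country: %s\n", scalar @rows, $rows[0][0];
```

Such a coderef would be passed to the table constructor as reader => $tsv_reader.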
The table provides the following methods:
get_row
-
Returns a row object for the next row of the table.
all_rows
-
Gets all the rows as a list.
row_count
-
The number of non-skipped, non-header lines read so far.
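To make the interaction between skip_rows, has_header, skip_rows_after_header, and row_count concrete, here is an illustration in plain Perl. It only mimics the documented order (rows skipped before the header, then the header, then rows skipped after it) on made-up sample lines; it is not the module's implementation, and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Illustration only: plain Perl mimicking the documented skipping
# order, not the module's actual implementation.
my @lines = (
    'Countries of Europe (title line)',   # dropped by skip_rows => 1
    'country,group,name',                 # header row (has_header => 1)
    '(units row)',                        # dropped by skip_rows_after_header => 1
    'at,eu,Austria',
    'be,eu,Belgium',
);

my %opt = (skip_rows => 1, has_header => 1, skip_rows_after_header => 1);

my @queue = @lines;
splice @queue, 0, $opt{skip_rows};                     # rows before the header
my $header = $opt{has_header} ? shift @queue : undef;  # the header itself
splice @queue, 0, $opt{skip_rows_after_header};        # rows after the header

my $row_count = 0;
for my $line (@queue) {
    $row_count++;    # only data rows are counted, starting at 1
    print "row $row_count: $line\n";
}
```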
Row interface
The rows returned by get_row and all_rows are blessed objects. They provide the following methods:
raw_values
-
The values returned by Text::CSV_XS without any further processing.
values
-
The values returned by Text::CSV_XS, processed by datatype. Date and time datatypes will be reformatted from any CLDR-based format to ISO 8601. Booleans using non-standard representations will be changed to "1" and "0". Fields that have a separator defined will be split into an arrayref. Numbers given as percentages will be divided by 100. And so forth.
cells
-
Returns the same values as values but wrapped in cell objects. The following are equivalent:
$row->values->[0];
$row->cells->[0]->value;
$row->[0]; # $row overloads @{}
Why fetch a cell instead of directly fetching the value? The cell object offers a few other useful methods.
get($name)
-
Gets a single cell from the row by its name. Names are defined in the schema, or the header row if missing from the schema.
$row->get("country")->value;
row_number
-
The row number for this row in the table. Rows are numbered starting at 1. Headers and skipped rows are not counted.
key_string
-
For tables that have a primary key, this returns a string formed by joining together the primary key columns. It ought to be a unique identifier for this row within the table, and if it is not, this will be raised as an error.
errors
-
An arrayref of strings of errors associated with this row. This includes data validation problems.
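The difference between raw_values and values can be pictured with a small stand-alone sketch of the kinds of processing described above: non-standard booleans mapped to "1" and "0", percentages divided by 100, and separated fields split into an arrayref. The normalise function and the keys of its spec hashref (base, true, false, separator) are assumptions made up for this illustration and do not reflect the module's internal schema representation.

```perl
use strict;
use warnings;

# Hypothetical, simplified normaliser. The spec keys used here
# (base, true, false, separator) are assumptions for illustration.
sub normalise {
    my ($raw, $spec) = @_;

    # A separator turns one field into an arrayref of values,
    # each normalised on its own.
    if (defined $spec->{separator}) {
        my %item = %$spec;
        delete $item{separator};
        return [ map { normalise($_, \%item) }
                 split /\Q$spec->{separator}\E/, $raw ];
    }

    if ($spec->{base} eq 'boolean') {
        return '1' if $raw eq $spec->{true};
        return '0' if $raw eq $spec->{false};
        die "not a boolean: $raw\n";
    }

    if ($spec->{base} eq 'number' && $raw =~ /\A(.*)%\z/) {
        return $1 / 100;    # "25%" becomes 0.25
    }

    return $raw;            # everything else passes through unchanged
}

my $flag  = normalise('Y',     { base => 'boolean', true => 'Y', false => 'N' });
my $share = normalise('25%',   { base => 'number' });
my $langs = normalise('de;fr', { base => 'string', separator => ';' });

print "$flag $share @$langs\n";
```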
Cell interface
You can bypass the cell interface and access cell values directly from the rows. If you do work with cells, they provide the following methods:
raw_value
-
The value returned by Text::CSV_XS without any further processing.
value
-
The value returned by Text::CSV_XS, processed by datatype.
inflated_value
-
Like value but inflates some values to blessed objects. Date and time related datatypes will be returned as DateTime, DateTime::Incomplete, or DateTime::Duration objects. Booleans will be returned as JSON::PP::Boolean objects.
row_number
-
The row number for the cell's parent row in the table. Rows are numbered starting at 1. Headers and skipped rows are not counted.
col_number
-
The column number of this cell within the parent row. Columns are numbered starting at 1.
datatype
-
The datatype for this cell as a hashref.
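The contrast between value and inflated_value for booleans can be sketched without the module. The helper inflate_boolean below is hypothetical; it only shows what turning a processed "1" or "0" into a JSON::PP::Boolean object might look like, using the true and false functions that JSON::PP provides.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical helper: turn a processed "1"/"0" value into a
# JSON::PP::Boolean object, as inflated_value is described to do.
sub inflate_boolean {
    my ($value) = @_;
    return $value ? JSON::PP::true() : JSON::PP::false();
}

my $plain    = '0';                       # what value would hold
my $inflated = inflate_boolean($plain);   # what inflated_value would hold

print ref($inflated), "\n";
print $inflated ? "true\n" : "false\n";

# The object serialises as a real JSON boolean.
my $json = JSON::PP->new->encode([$inflated]);
print "$json\n";
```

The inflated form is mainly useful when the data is passed on to code that distinguishes booleans from strings, such as a JSON encoder.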
BUGS
Please report any bugs to http://rt.cpan.org/Dist/Display.html?Queue=Data-Validate-CSV.
SEE ALSO
https://www.w3.org/TR/2016/NOTE-tabular-data-primer-20160225/.
AUTHOR
Toby Inkster <tobyink@cpan.org>.
COPYRIGHT AND LICENCE
This software is copyright (c) 2019 by Toby Inkster.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
DISCLAIMER OF WARRANTIES
THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.