NAME
Genealogy::Gedcom - An OS-independent processor for GEDCOM data
Synopsis
See Genealogy::Gedcom::Reader::Lexer.
Description
Genealogy::Gedcom provides a processor for GEDCOM data.
See The GEDCOM Specification Ged551-5.pdf.
Distributions
This module is available as a Unix-style distro (*.tgz).
See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing distros.
Installation
Install Genealogy::Gedcom as you would for any Perl module:
Run:
cpanm Genealogy::Gedcom
or run:
sudo cpan Genealogy::Gedcom
or unpack the distro, and then either:
perl Build.PL
./Build
./Build test
sudo ./Build install
or:
perl Makefile.PL
make (or dmake or nmake)
make test
make install
Constructor and Initialization
See Genealogy::Gedcom::Reader::Lexer.
FAQ
Does this module handle utf8?
Yes. The input files are assumed to be in utf8. Files in ISO-8859-1 work automatically, too.
The default output log also handles utf8.
Does this module handle ANSEL?
No. ANSEL predates Unicode. Just create a utf8-encoded file, such as data/sample.7.ged.
That file was generated from data/GEDCOMANSELTable.xhtml by scripts/parse.sample.7.pl.
Thanks to Tamura Jones for creating that web page.
How are user-defined tags handled?
In the same way as GEDCOM tags.
They are identified by a leading '_', and otherwise follow the same syntax as GEDCOM tags. That is:
- o At level 0, they match /(_?(?:[A-Z]{3,4}))/.
- o At level > 0, they match /(_?(?:ADR[123]|[A-Z]{3,5}))/.
Each user-defined tag is stand-alone, meaning it can't be extended with CONC or CONT tags in the way some GEDCOM tags can.
See data/sample.4.ged.
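As a rough illustration of the two patterns above, the following sketch applies them to a few tags. The patterns are copied from this FAQ, but the anchors (^ and $) and the helper classify_tag() are assumptions added for this example; they are not part of the module. Note that 'standard syntax' only means the tag has the shape of a GEDCOM tag, not that it appears in the GEDCOM Specification.

```perl
use strict;
use warnings;

# Patterns copied from the FAQ, anchored here so the whole tag must match.
my $level_0_tag = qr/^(_?(?:[A-Z]{3,4}))$/;
my $level_n_tag = qr/^(_?(?:ADR[123]|[A-Z]{3,5}))$/;

# classify_tag() is a hypothetical helper, not part of Genealogy::Gedcom.
sub classify_tag
{
    my($level, $tag) = @_;
    my $pattern = $level == 0 ? $level_0_tag : $level_n_tag;

    return 'invalid' if $tag !~ $pattern;
    return substr($tag, 0, 1) eq '_' ? 'user-defined' : 'standard syntax';
}

for my $case ([0, 'INDI'], [0, '_UID'], [1, 'ADR1'], [1, '_MILT'], [0, 'HEADER'])
{
    my($level, $tag) = @$case;

    printf "Level %d, tag %-6s => %s\n", $level, $tag, classify_tag($level, $tag);
}
```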
How are CONC and CONT tags handled?
Nothing is done with them, meaning e.g. text flowing from a NOTE (say) onto a CONC or CONT is not concatenated.
Currently then, even GEDCOM tags are stand-alone.
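Since the lexer leaves CONC and CONT items untouched, a caller who wants the joined text has to merge them. The sketch below shows one possible approach, operating on item hashrefs with the 'tag' and 'data' keys described in this FAQ. The helper merge_continuations() is hypothetical, and it assumes the usual GEDCOM convention that CONC appends directly while CONT starts a new line. For brevity it does not check levels.

```perl
use strict;
use warnings;

# merge_continuations() is a hypothetical helper, not part of the module.
# It returns a new arrayref in which each CONC/CONT item has been folded
# into the data of the nearest preceding non-continuation item.
sub merge_continuations
{
    my($items) = @_;
    my @merged;

    for my $item (@$items)
    {
        if (@merged && $item->{tag} =~ /^CON[CT]$/)
        {
            # Assumption: CONC joins directly; CONT starts a new line.
            my $separator = $item->{tag} eq 'CONT' ? "\n" : '';

            $merged[-1]{data} .= $separator . $item->{data};
        }
        else
        {
            push @merged, {%$item}; # Shallow copy, so the input is untouched.
        }
    }

    return \@merged;
}

my $items =
[
    {level => 1, tag => 'NOTE', data => 'This is a long no'},
    {level => 2, tag => 'CONC', data => 'te.'},
    {level => 2, tag => 'CONT', data => 'Second line.'},
    {level => 1, tag => 'SEX',  data => 'M'},
];
my $merged = merge_continuations($items);

print scalar(@$merged), " items after merging\n";
print $merged->[0]{data}, "\n";
```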
How is the lexed data stored in RAM?
Items are stored in an arrayref. This arrayref is available via the "items()" method.
This method returns the same data as does "items()" in Genealogy::Gedcom::Reader.
Each element in the array is a hashref of the form:
{
count => $n,
data => $a_string,
level => $n,
line_count => $n,
tag => $a_tag,
type => $a_string,
xref => $a_string,
}
Key-value pairs are:
- o count => $n
-
Items are numbered from 1 up, so this is the array index + 1.
Note: Blank lines in the input file are skipped.
- o data => $a_string
-
This is any data associated with the tag.
Given the GEDCOM record:
1 NAME Given Name /Surname/
then data will be 'Given Name /Surname/', i.e. the text after the tag.
Given the GEDCOM record:
1 SUBM @SUBM1@
then data will be 'SUBM1'.
As with xref (below), the '@' characters are stripped.
- o level => $n
-
This is the level from the GEDCOM data.
- o line_count => $n
-
This is the line number from the GEDCOM data.
- o tag => $a_tag
-
This is the GEDCOM tag.
- o type => $a_string
-
This is a string indicating what broad class the tag refers to. Values:
- o (Empty string)
-
Used for various cases.
- o Address
- o Concat
- o Continue
- o Date
-
If the type is 'Date', then the date has been successfully parsed.
If parsing failed, the type will be 'Invalid date'.
- o Event
- o Family
- o File name
- o Header
- o Individual
- o Invalid date
-
This type means date parsing failed.
If parsing had succeeded, the type would be 'Date'.
- o Link to FAM
- o Link to INDI
- o Link to OBJE
- o Link to SUBM
- o Multimedia
- o Note
- o Place
- o Repository
- o Source
- o Submission
- o Submitter
- o Trailer
- o xref => $a_string
-
Given the GEDCOM record:
0 @I82@ INDI
then xref will be 'I82'.
As with data (above), the '@' characters are stripped.
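To make the shape of these hashrefs concrete, here is a minimal sketch which splits raw GEDCOM lines into items with the keys listed above. It is not the lexer's own code: line2item() and its regexp are assumptions made for this example, 'type' is left empty because the real classification logic is not shown here, and the sketch assumes line_count counts blank lines even though they are skipped.

```perl
use strict;
use warnings;

# line2item() is a hypothetical helper, not the lexer's own code.
# It splits one line into: level, optional @xref@, tag, optional data.
sub line2item
{
    my($count, $line_count, $line) = @_;

    my($level, $xref, $tag, $data) =
        $line =~ /^\s*(\d+)\s+(?:\@([^\@]+)\@\s+)?(\w+)(?:\s+(.*))?$/
        or die "Syntax error on line $line_count: $line\n";

    $data = '' if !defined $data;
    $xref = '' if !defined $xref;
    $data =~ s/^\@([^\@]+)\@$/$1/; # Strip '@' from pointers such as @SUBM1@.

    my %item =
    (
        count      => $count,
        data       => $data,
        level      => $level,
        line_count => $line_count,
        tag        => $tag,
        type       => '', # Left empty in this sketch.
        xref       => $xref,
    );

    return \%item;
}

my @lines = ('0 @I82@ INDI', '', '1 NAME Given Name /Surname/', '1 SUBM @SUBM1@');
my $line_count = 0;
my @items;

for my $line (@lines)
{
    $line_count++;

    next if $line =~ /^\s*$/; # Blank lines are skipped but still counted.

    push @items, line2item(scalar(@items) + 1, $line_count, $line);
}

for my $item (@items)
{
    printf "count=%d line=%d level=%d tag=%s xref='%s' data='%s'\n",
        map { $item->{$_} } qw(count line_count level tag xref data);
}
```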
What validation is performed?
There is no perfect answer as to what should be a warning and what should be an error.
So, the author's philosophy is that unrecoverable states are errors, and the code calls 'die'. See "Under what circumstances does the code call 'die'?".
And, the log level 'error' is not used. All validation failures are logged at level warning, leaving interpretation up to the user. See "How does logging work?".
Details:
- o Cross-references
-
Each xref pointer is checked to ensure it points to an xref which exists. Each dangling xref is reported only once.
- o Dates are validated
- o Duplicate xrefs
-
Xrefs which are (potentially) pointed to are checked for uniqueness.
- o String lengths
-
Maximum string lengths are checked as per the GEDCOM Specification.
Minimum string lengths are checked as per the value of the 'strict' option to new().
- o Strict 'v' Mandatory
-
Validation is mandatory, even with the 'strict' option set to 0. 'strict' only affects the minimum string length acceptable.
- o Tag nesting
-
Tag nesting is validated by the mechanism of nested method calls, with each method (called tag_*) knowing what tags it handles, and with each nested call handling its own tags.
This process starts with the call to tag_lineage(0, $line) in method "run()".
-
The lexer reports the first unexpected tag, meaning a tag which is not a GEDCOM tag and which does not start with '_'.
All validation failures are reported as log messages at level 'warning'.
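The cross-reference and duplicate-xref checks can be sketched as two passes over the items. This is only an illustration of the idea, not the lexer's implementation: check_xrefs() is a hypothetical helper, the message wording is invented, and the sketch assumes that items whose type starts with 'Link to' hold their target in 'data'.

```perl
use strict;
use warnings;

# check_xrefs() is a hypothetical helper. It returns a list of warnings.
sub check_xrefs
{
    my($items) = @_;
    my(%defined, %reported, @warnings);

    # Pass 1: Record each defined xref, flagging duplicates.
    for my $item (@$items)
    {
        my $xref = $item->{xref};

        next if $xref eq '';

        push @warnings, sprintf('Line %d: Duplicate xref: %s', $item->{line_count}, $xref)
            if $defined{$xref}++;
    }

    # Pass 2: Check each pointer, reporting each dangling target only once.
    for my $item (@$items)
    {
        next if $item->{type} !~ /^Link to /;

        my $target = $item->{data};

        next if $defined{$target} || $reported{$target}++;

        push @warnings, sprintf('Line %d: Dangling xref: %s', $item->{line_count}, $target);
    }

    return @warnings;
}

my $items =
[
    {line_count => 1, xref => 'I1', type => 'Individual',  data => ''},
    {line_count => 2, xref => '',   type => 'Link to FAM', data => 'F9'},
    {line_count => 3, xref => '',   type => 'Link to FAM', data => 'F9'},
    {line_count => 4, xref => 'I1', type => 'Individual',  data => ''},
];
my @warnings = check_xrefs($items);

print "$_\n" for @warnings;
```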
What other validation is planned?
Here are some suggestions from the mailing list:
- o Mandatory sub-tags
-
This means check that each tag has all its mandatory sub-tags.
- o Natural (not step-) parent must be older than child
- o Prior art
-
Many such checks are possible. E.g. Attribute type (p 43 of GEDCOM Specification) must be one of: CAST | EDUC | NATI | OCCU | PROP | RELI | RESI | TITL | FACT.
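A check of this kind could be as simple as a hash lookup. The sketch below uses the attribute types listed above; is_attribute_type() is a hypothetical helper, since this check is only planned and is not part of the module.

```perl
use strict;
use warnings;

# Attribute types as listed in the text above.
my %attribute_type = map { $_ => 1 } qw(CAST EDUC NATI OCCU PROP RELI RESI TITL FACT);

# is_attribute_type() is a hypothetical helper, not part of the module.
sub is_attribute_type
{
    my($tag) = @_;

    return $attribute_type{$tag} ? 1 : 0;
}

for my $tag (qw(OCCU BIRT))
{
    printf "%s is %s attribute type\n", $tag, is_attribute_type($tag) ? 'an' : 'not an';
}
```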
What other features are planned?
Here are some suggestions from the mailing list:
How does logging work?
- o Debugging
-
When new() is called as new(maxlevel => 'debug'), each method entry is logged at level 'debug'.
This has the effect of tracing all code which processes tags.
Since the default value of 'maxlevel' is 'info', all this output is suppressed by default. Such output is mainly for the author's benefit.
- o Log levels
-
Log levels are, from highest (i.e. most output) to lowest: 'debug', 'info', 'warning', 'error'. No lower levels are used. See Log::Handler::Levels.
'maxlevel' defaults to 'info' and 'minlevel' defaults to 'error'. In this way, levels 'info' and 'warning' are reported by default.
Currently, level 'error' is not used. Fatal errors cause 'die' to be called, since they are unrecoverable. See "Under what circumstances does the code call 'die'?".
- o Reporting
-
When new() is called as new(report_items => 1), the items are logged at level 'info'.
- o Validation failures
-
These are reported at level 'warning'.
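The effect of 'maxlevel' and 'minlevel' can be sketched as a range test over the ordered levels. This is an illustration only: is_reported() is a hypothetical helper and is not how Log::Handler is implemented.

```perl
use strict;
use warnings;

# Levels ordered from most output to least, as listed above.
my @levels = qw(debug info warning error);
my %rank;

@rank{@levels} = (0 .. $#levels);

# is_reported() is a hypothetical helper, not Log::Handler's API.
# A level is reported when it lies between maxlevel and minlevel inclusive.
sub is_reported
{
    my($level, $maxlevel, $minlevel) = @_;

    return $rank{$level} >= $rank{$maxlevel} && $rank{$level} <= $rank{$minlevel} ? 1 : 0;
}

# Defaults from the text: maxlevel => 'info', minlevel => 'error'.
for my $level (@levels)
{
    printf "%-7s %s\n", $level, is_reported($level, 'info', 'error') ? 'reported' : 'suppressed';
}
```

With the defaults, 'debug' is suppressed. Passing 'debug' as the second argument, mirroring new(maxlevel => 'debug'), lets the method-entry trace through.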
Under what circumstances does the code call 'die'?
- o When there is a typo in the field name passed in to check_length()
-
This is a programming error.
- o When an input file is not specified
-
This is a user (run time) error.
- o When there is a syntax error in a GEDCOM record
-
This is a user (data preparation) error.
How do I change the version of the GEDCOM grammar supported?
By sub-classing.
TODO
What is the purpose of this set of modules?
It's the basis of a long-term project to write a new interface to GEDCOM files.
How are the modules related?
- o Genealogy::Gedcom
-
This is a dummy module at the moment, which just occupies the namespace. It holds the FAQ though.
- o Genealogy::Gedcom::Reader
-
This employs the lexer to do the work. It may one day use the new (currently non-existent) parser too.
- o Genealogy::Gedcom::Reader::Lexer
-
This does the real work of finding tokens within GEDCOM files.
Run: perl scripts/lex.pl -help
Programs Supplied as part of this Package
- o find.unused.limits.pl
-
Helps me debug code.
- o lex.pl
-
Runs the lexer on a file and reports some statistics. Try lex.pl -h.
- o parse.sample.7.pl
-
This reads data/sample.7.html and writes data/sample.7.ged.
- o test.all.dates.pl
-
Reads all files in data/ and checks that each date is valid.
Repository
https://github.com/ronsavage/Genealogy-Gedcom
See Also
Gedcom::Date.
Machine-Readable Change Log
The file Changes was converted into Changelog.ini by Module::Metadata::Changes.
Version Numbers
Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.
Thanks
Many thanks are due to the people who worked on Gedcom.
Support
Email the author, or log a bug on RT:
https://rt.cpan.org/Public/Dist/Display.html?Name=Genealogy::Gedcom.
Author
Genealogy::Gedcom was written by Ron Savage <ron@savage.net.au> in 2011.
Home page: http://savage.net.au/index.html.
Copyright
Australian copyright (c) 2011, Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Perl License, a copy of which is available at:
http://dev.perl.org/licenses/