NAME

App::lcpan::Manual::Internals - App::lcpan internals

VERSION

version 1.065.000

INDEXING

Indexing is done in several steps. The last step (parsing release files) is itself done in at least three passes. We can skip one or more of these passes to save time if we don't need the information they gather.

First step: parse authors/01mailrc.txt.gz

First, we parse authors/01mailrc.txt.gz and insert the data into the author table. Some DarkPANs, like those produced by OrePAN, have authors/00whois.xml instead.
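As a rough illustration of this step, here is a minimal Python sketch (not lcpan's actual code, which is written in Perl). It assumes the usual 01mailrc line shape of alias CPANID "Full Name <email>". The function name and the tuple layout are made up for the example, and the sample line is built from the AUTHOR section of this page. The real file is gzip-compressed, so it would have to be decompressed first.

```python
import re

# Assumed line shape:  alias CPANID "Full Name <email>"
ALIAS_RE = re.compile(r'^alias\s+(\S+)\s+"(.*?)\s*<([^>]*)>"\s*$')

def parse_mailrc(lines):
    """Yield (cpanid, fullname, email) tuples; silently skip other lines."""
    for line in lines:
        m = ALIAS_RE.match(line)
        if m:
            yield m.group(1), m.group(2), m.group(3)

# Illustrative input line (hypothetical, built from this page's AUTHOR section).
sample = ['alias PERLANCAR "perlancar <perlancar@cpan.org>"']
authors = list(parse_mailrc(sample))
```

Each tuple would correspond to one row in the author table.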

Second step: parse modules/02packages.details.txt.gz

Then we parse modules/02packages.details.txt.gz, which is the core of the CPAN index. This file links package (module) names to release tarballs. A snippet from the file:

...
Log::ger                          0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
Log::ger::App                     0.014  P/PE/PERLANCAR/Log-ger-App-0.014.tar.gz
Log::ger::DBI::Query              0.001  P/PE/PERLANCAR/Log-ger-DBI-Query-0.001.tar.gz
Log::ger::Filter                  0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
Log::ger::Filter::Code            0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
...

We insert these records into the file table, so each release file gets a numeric file ID, and into the module table, so each module gets a numeric module ID as well as a link to its file ID.
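The ID assignment can be pictured with a small Python sketch. It is only an approximation: the real tables live in a database, while here a dict and a list stand in for the file and module tables, and the function name is hypothetical. It assumes the header block of the file has already been stripped and skips any line that does not have exactly three fields.

```python
def index_packages(lines):
    """Assign file IDs and module IDs from 02packages-style body lines."""
    files = {}    # release path -> file ID (stand-in for the file table)
    modules = []  # (module ID, name, version, file ID) (stand-in for the module table)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not a "package version path" record
        name, version, path = parts
        # Reuse the existing file ID if this release was already seen.
        file_id = files.setdefault(path, len(files) + 1)
        modules.append((len(modules) + 1, name, version, file_id))
    return files, modules

snippet = """\
Log::ger                          0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
Log::ger::App                     0.014  P/PE/PERLANCAR/Log-ger-App-0.014.tar.gz
Log::ger::DBI::Query              0.001  P/PE/PERLANCAR/Log-ger-DBI-Query-0.001.tar.gz
Log::ger::Filter                  0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
Log::ger::Filter::Code            0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
"""
files, modules = index_packages(snippet.splitlines())
```

With the snippet above, the five modules map to only three distinct release files, because Log::ger, Log::ger::Filter and Log::ger::Filter::Code all ship in the same tarball.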

At this point, we haven't parsed distribution names yet, because that requires information from the META.json or META.yml file inside each release file.

Third step: (release) files

Then we start to examine the release files. This is done in several passes, and you have the option to skip some of them. Multiple passes are needed because pass 2 detects links to scripts in POD, and to do that it needs the complete list of known scripts, which is collected in pass 1. Also, some passes are more high-level, experimental, or optional.

Third step pass 1: content, scripts, distribution metadata, dependency

First we list the content of each release archive and store the results into the content table. This will allow us to check whether a distribution has a distribution metadata file (META.yml or META.json), whether a distribution contains scripts, and so on.

We populate the script table by heuristically including content whose name looks like a script, e.g.:

script/foo
bin/whatever
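A name-based heuristic of this kind might look like the following Python sketch. The exact pattern is an assumption made for illustration (a file directly inside a bin/, script/ or scripts/ directory); lcpan's real rules may differ.

```python
import re

# Assumed heuristic: a file directly inside bin/, script/ or scripts/.
SCRIPT_PATH_RE = re.compile(r'(?:^|/)(?:bin|scripts?)/[^/]+$')

def looks_like_script(path):
    """Return True if the archive member path looks like a script by name."""
    return bool(SCRIPT_PATH_RE.search(path))

# Hypothetical archive member paths.
candidates = ["script/foo", "bin/whatever", "lib/Foo/Bar.pm", "Foo-1.0/bin/foo"]
scripts = [p for p in candidates if looks_like_script(p)]
```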

We then extract the distribution metadata file (either META.json or META.yml) and store the information it contains into the database. This includes the distribution name (written to the dist table) and the dependency information (written to the dep table).
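To make this concrete, here is a hedged Python sketch that pulls the distribution name and a flat dependency list out of a META.json document. It assumes the version 2 metadata layout, where prereqs is nested as phase, then relationship, then module name. The function name, the tuple layout, and the sample metadata (a made-up distribution called Foo-Bar) are hypothetical.

```python
import json

def dist_and_deps(meta_json_text):
    """Return (dist name, sorted list of (phase, rel, module, version))."""
    meta = json.loads(meta_json_text)
    deps = []
    for phase, rels in meta.get("prereqs", {}).items():
        for rel, mods in rels.items():
            for module, version in mods.items():
                deps.append((phase, rel, module, str(version)))
    return meta["name"], sorted(deps)

# Made-up metadata for illustration only.
sample_meta = """
{
  "name": "Foo-Bar",
  "version": "0.001",
  "prereqs": {
    "runtime": {"requires": {"perl": "5.010001", "Log::ger": "0.037"}},
    "test":    {"requires": {"Test::More": "0.98"}}
  }
}
"""
dist, deps = dist_and_deps(sample_meta)
```

The name would go to the dist table and each tuple to the dep table.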

At the end of this first pass, we already have a pretty useful database. One of the main uses of lcpan is to provide dependency information, so you can skip the other passes if that is all you need.

Third step pass 2: parse POD

In the second pass, we extract the module and script files inside each release file into a temporary directory, then parse their POD. This pass usually takes several times as long as the first pass. At the time of this writing (2020-04-19) on my computer, the first pass takes about 14 minutes and the second pass takes 72 minutes. A big release file that contains thousands of (mostly autogenerated) module files (yes, they exist; see Paws for example) can take 25 minutes on its own. You might want to skip those files if you never expect to need the module or distribution; see the lcpan update documentation. For example, in lcpan.conf you can put:

skip_index_file_patterns = ^Paws-\d
skip_index_file_patterns = ^Google-Ads-GoogleAds-Client-\d
skip_index_file_patterns = ^Google-Ads-AdWords-Client-\d
skip_index_file_patterns = ^eBay-API-\d
skip_index_file_patterns = ^Microsoft-AdCenter-\d
skip_index_file_patterns = ^VMOMI-\d
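To check what such patterns would exclude before committing them to the configuration, a quick Python sketch like the one below can help. It assumes the patterns are ordinary regular expressions matched against the base name of the release file; that is an assumption about lcpan's behavior, so consult the lcpan update documentation for the authoritative rules. The function name and the AUTHOR path are hypothetical.

```python
import re

# A few of the patterns from the configuration above.
SKIP_PATTERNS = [r'^Paws-\d', r'^eBay-API-\d', r'^VMOMI-\d']

def should_skip(release_path, patterns=SKIP_PATTERNS):
    """Return True if the release's base name matches any skip pattern."""
    basename = release_path.rsplit("/", 1)[-1]
    return any(re.search(p, basename) for p in patterns)

# Hypothetical release paths for illustration.
skipped = should_skip("A/AU/AUTHOR/Paws-0.42.tar.gz")
kept = not should_skip("P/PE/PERLANCAR/Log-ger-0.037.tar.gz")
```

Note that the trailing \d keeps a pattern like ^Paws-\d from also matching unrelated distributions whose names merely start with "Paws-" followed by a word.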

By parsing POD, we get the module/script abstract (stored in the module table) and mentions (i.e. a POD that links to another POD, stored in the mention table). The mentions information is mainly useful for knowing how closely related one module is to another (see the lcpan related-mods subcommand).
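The idea of collecting mentions can be sketched in Python with a regular expression over POD link codes. This is a deliberate simplification and not how lcpan does it: a proper POD parser is needed for doubled angle brackets and nested formatting codes. The function name and the sample text are hypothetical.

```python
import re

# Simplified: only single-bracket L<...> codes without nested formatting.
LINK_RE = re.compile(r'L<([^<>]+)>')

def pod_mentions(pod_text):
    """Return the sorted set of page names linked to from a POD string."""
    mentions = set()
    for raw in LINK_RE.findall(pod_text):
        target = raw.split("|")[-1]            # drop an optional "text|" prefix
        if "://" in target:
            continue                           # URL, not a POD page
        name = target.split("/")[0].strip().strip('"')  # drop "/section"
        if name:                               # empty for same-page links
            mentions.add(name)
    return sorted(mentions)

sample_pod = (
    "See L<Log::ger> and L<the app module|Log::ger::App/SYNOPSIS>, "
    "or L<https://metacpan.org/>. Also see L</INDEXING>."
)
found = pod_mentions(sample_pod)
```

Each name found this way would become one row in the mention table, linking the page being parsed to the page it mentions.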

Third step pass 3: subroutine

In this pass, we try to extract subroutine names in modules. This requires the use of a source code lexer (lcpan uses Compiler::Lexer). On my computer, this pass takes another 19 minutes. At the time of this writing, this pass is experimental and not enabled by default.
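For a feel of what this pass produces, here is a naive Python sketch that finds named subroutine declarations with a regular expression. Note that this swaps the lexer for a regex, so unlike a real lexer it can be fooled by heredocs, strings, or POD that happen to contain a line starting with "sub". The function name and the sample source are hypothetical.

```python
import re

# Naive: a line that starts with "sub NAME". Anonymous subs are not matched.
SUB_RE = re.compile(r'^\s*sub\s+([A-Za-z_][A-Za-z0-9_]*)', re.MULTILINE)

def sub_names(perl_source):
    """Return names of subroutines declared at the start of a line."""
    return SUB_RE.findall(perl_source)

sample_source = r'''
package Foo;
sub new { bless {}, shift }
sub _private {
    my $self = shift;
}
my $anon = sub { 1 };
1;
'''
names = sub_names(sample_source)
```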

AUTHOR

perlancar <perlancar@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2021, 2020 by perlancar@cpan.org.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.