NAME

Business::CompanyDesignator - module for matching and stripping/manipulating the company designators appended to company names

VERSION

Version: 0.13.

This module is considered a BETA release. Interfaces may change and/or break without notice until the module reaches version 1.0.

SYNOPSIS

Business::CompanyDesignator is a Perl module for matching and stripping/manipulating the typical company designators appended (or sometimes prepended) to company names. It supports both long forms (e.g. Corporation, Incorporated, Limited etc.) and abbreviations (e.g. Corp., Inc., Ltd., GmbH etc.).

use feature 'say';   # needed for the say() calls below
use Business::CompanyDesignator;

# Constructor
$bcd = Business::CompanyDesignator->new;
# Optionally, you can provide your own company_designator.yml file, instead of the bundled one
$bcd = Business::CompanyDesignator->new(datafile => '/path/to/company_designator.yml');

# Get lists of designators, which may be long (e.g. Limited) or abbreviations (e.g. Ltd.)
@des = $bcd->designators;
@long = $bcd->long_designators;
@abbrev = $bcd->abbreviations;

# Lookup individual designator records (returns B::CD::Record objects)
# Lookup record by long designator (unique)
$record = $bcd->record($long_designator);
# Lookup records by abbreviation or long designator (may not be unique)
@records = $bcd->records($designator);

# Get a regex for matching designators by type ('end'/'begin') and lang
# By default, returns 'end' regexes for all languages
$re = $bcd->regex;
$company_name =~ $re and say 'designator found!';
$company_name =~ /$re\s*$/ and say 'final designator found!';
my $re_begin_en = $bcd->regex('begin', 'en');

# Split $company_name on designator, returning a ($before, $designator, $after) triplet,
# plus the normalised form of the designator matched (can pass to records(), for example)
($before, $des, $after, $normalised_des) = $bcd->split_designator($company_name);

# Or in scalar context, return a Business::CompanyDesignator::SplitResult object
$res = $bcd->split_designator($company_name, lang => 'en');
print join ' / ', $res->designator_std, $res->short_name, $res->extra;

DATASET

Business::CompanyDesignator uses the company designator dataset from here:

https://github.com/ProfoundNetworks/company_designator

which is bundled with the module. You can use your own (updated or custom) version, if you prefer, by passing a 'datafile' parameter to the constructor.

The dataset defines multiple long form designators (like "Company", "Limited", or "Incorporée"), each of which has zero or more abbreviations (e.g. 'Co.', 'Ltd.', 'Inc.' etc.), and one or more language codes. The 'Company' entry, for instance, looks like this:

Company:
  abbr:
    - Co.
    - '& Co.'
    - and Co.
  lang: en

Long designators are unique across the dataset, but abbreviations are not: for example, 'Inc.' is used for both "Incorporated" and French "Incorporée".
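The non-uniqueness of abbreviations can be illustrated with a small sketch. The hash below is a hand-written excerpt in the shape of the dataset, not the bundled file, and the 'lang' values given for the two "Inc." entries are assumptions made for illustration.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hand-written excerpt in the dataset's shape; the 'lang' values for the
# two "Inc." entries are assumptions for illustration.
my %dataset = (
    'Company'      => { abbr => [ 'Co.', '& Co.', 'and Co.' ], lang => 'en' },
    'Incorporated' => { abbr => [ 'Inc.' ],                    lang => 'en' },
    'Incorporée'   => { abbr => [ 'Inc.' ],                    lang => 'fr' },
);

# Invert the mapping: abbreviation => list of long designators
my %long_for;
for my $long (sort keys %dataset) {
    push @{ $long_for{$_} }, $long for @{ $dataset{$long}{abbr} };
}

# Report abbreviations that map to more than one long designator
for my $abbr (sort keys %long_for) {
    my @longs = @{ $long_for{$abbr} };
    print "$abbr is ambiguous: @longs\n" if @longs > 1;
}
```

This is why lookups by long designator can return a single record, while lookups by abbreviation have to return a list.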

METHODS

new()

Creates a Business::CompanyDesignator object.

$bcd = Business::CompanyDesignator->new;

By default this uses the bundled company_designator dataset. You may provide your own (updated or custom) version by passing a 'datafile' parameter to the constructor.

$bcd = Business::CompanyDesignator->new(datafile => '/path/to/company_designator.yml');

designators()

Returns the full list of company designator strings from the dataset (both long form and abbreviations).

@designators = $bcd->designators;

long_designators()

Returns the full list of long form designators from the dataset.

@long = $bcd->long_designators;

abbreviations()

Returns the full list of abbreviation designators from the dataset.

@abbrev = $bcd->abbreviations;

record($long_designator)

Returns the Business::CompanyDesignator::Record object for the given long designator (and dies if not found).

records($designator)

Returns a list of Business::CompanyDesignator::Record objects for the given abbreviation or long designator (for long designators there will only be a single record returned, but abbreviations may map to multiple records).

Use this method for abbreviations, or if you aren't sure of a designator's type.
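A minimal sketch of the difference between the two lookups, assuming the bundled dataset is installed. 'No Such Designator' is a made-up string chosen only to trigger the failure path.

```perl
use strict;
use warnings;
use Business::CompanyDesignator;

my $bcd = Business::CompanyDesignator->new;

# records() accepts an abbreviation and may return several records
my @records = $bcd->records('Inc.');
printf "'Inc.' maps to %d record(s)\n", scalar @records;

# record() dies on an unknown long designator, so trap the exception
my $record = eval { $bcd->record('No Such Designator') };
print "record() failed: $@" unless defined $record;
```

Wrapping record() in eval is one way to handle untrusted input. Calling records() and checking for an empty list may be simpler when the designator's type is unknown.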

regex([$type], [$lang])

Returns a regex matching all designators of the given $type ('begin'/'end') and $lang (ISO 639-1 language code, e.g. 'en', 'es', 'de', etc.) from the dataset. $lang may be either a single language code scalar or an arrayref of language codes, to match several languages at once. The returned regex is case-insensitive and non-anchored.

$type defaults to 'end' and $lang to all languages, so without parameters regex() returns a regex matching all 'end' designators for all languages.
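Because the returned regex is not anchored, callers who only want trailing designators need to anchor it themselves. The following sketch shows one way to do that for a restricted set of languages. The company names are made up for illustration.

```perl
use strict;
use warnings;
use Business::CompanyDesignator;

my $bcd = Business::CompanyDesignator->new;

# 'end' regex restricted to English and French designators
my $re = $bcd->regex('end', [ 'en', 'fr' ]);

for my $name ('ABC Pty Ltd', 'Ltd Edition Prints') {
    # Anchor explicitly so only a designator at the very end counts
    my $status = $name =~ /$re\s*$/
        ? 'trailing designator found'
        : 'no trailing designator';
    print "$name: $status\n";
}
```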

split_designator($company_name, [lang => $lang], [allow_embedded => $bool])

Attempts to split $company_name on (the first) company designator found.

In array context split_designator returns a list of four items - a triplet of strings from $company_name ( $before, $designator, $after ), plus the standardised version of the designator as a fourth element.

($short_name, $des, $after_text, $des_std) = $bcd->split_designator($company_name);

In scalar context split_designator returns a Business::CompanyDesignator::SplitResult object.

$res = $bcd->split_designator($company_name, lang => $lang);

The array-context $des and the SplitResult $res->designator both hold the designator text exactly as it matched in $company_name, while the array-context $des_std and the SplitResult $res->designator_std hold the standardised version as found in the dataset.

For instance, "ABC Pty Ltd" would return "Pty Ltd" as the $designator, but "Pty. Ltd." as the standardised form, and the latter is what you would find in designators() or would look up with records(). Similarly, "Accessoires XYZ Ltee" (without the French acute) would match, returning "Ltee" (as found) for the $designator, but "Ltée" (with the acute) as the standardised form.
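The two calling conventions can be compared side by side. This is a small sketch using the "ABC Pty Ltd" example above and only the accessors already mentioned in this document. The `// ''` guards are a precaution in case a field comes back undefined.

```perl
use strict;
use warnings;
use Business::CompanyDesignator;

my $bcd  = Business::CompanyDesignator->new;
my $name = 'ABC Pty Ltd';

# Array context: three strings from the name, plus the standardised designator
my ($before, $des, $after, $des_std) = $bcd->split_designator($name);
printf "array:  before='%s' des='%s' std='%s'\n",
    $before // '', $des // '', $des_std // '';

# Scalar context: a SplitResult object exposing the same information
my $res = $bcd->split_designator($name);
printf "scalar: short_name='%s' designator='%s' std='%s'\n",
    $res->short_name // '', $res->designator // '', $res->designator_std // '';
```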

split_designator accepts the following optional (named) parameters:

lang => $lang

$lang can be a scalar ISO 639-1 language code ('en', 'fr', 'zh', etc.), or an arrayref containing multiple language codes. If $lang is defined, split_designator will only match designators for the specified set of languages, which can improve the accuracy of the split by reducing false positive matches.

allow_embedded => $boolean

allow_embedded is a boolean indicating whether designators may occur in the middle of strings, rather than only at the beginning or end. It defaults to true for backwards compatibility, which yields more matches but also more false positives. Setting it to false is safer but yields fewer matches (and embedded designators do occur surprisingly often in the wild).

For more discussion, see AMBIGUITIES below.
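As a rough sketch of how the two parameters combine, the loop below splits the same name with embedded matching switched on and then off, restricted to English designators. The `// ''` guards assume that unmatched fields may be undefined or empty.

```perl
use strict;
use warnings;
use Business::CompanyDesignator;

my $bcd  = Business::CompanyDesignator->new;
my $name = 'XYZ PC Repairs';

for my $embedded (1, 0) {
    # Restrict to English designators and toggle embedded matching
    my $res = $bcd->split_designator(
        $name,
        lang           => 'en',
        allow_embedded => $embedded,
    );
    printf "allow_embedded=%d: short_name='%s' designator='%s'\n",
        $embedded, $res->short_name // '', $res->designator // '';
}
```

Running both settings over a sample of your own data is a cheap way to judge how many extra matches embedded mode buys, and how many of them are false positives.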

AMBIGUITIES

Note that split_designator does not always get the split right. It checks for final designators first, then leading ones, and then finally looks for embedded designators (if allow_embedded is set to true).

Leading and trailing designators are usually reasonably accurate, but embedded designators are problematic. For instance, embedded designators allow names like these to split correctly:

Amerihealth Insurance Company of NJ
Trenkwalder Personal AG Schweiz
Vicente Campano S L (COMERCIAL VICAM)
Gvozdika, gostinitsa OOO ""Eko-Treyd""

but they will also cause names like the following to be split wrongly:

XYZ PC Repairs ('PC' is a designator meaning 'Professional Corporation')
Dr S L Ledingham ('S L' is a Spanish designator for 'Sociedad Limitada')

If you do want to allow splitting on embedded designators, you might want to pass a 'lang' parameter to split_designator if you know the language(s) used for your company names, as this will reduce the number of false positives by restricting the set of designators matched against. It won't eliminate the issue altogether though, so some post-processing might be required. (And I'd love to hear of ideas on how to improve this.)
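One possible form of post-processing is sketched below. It is a hypothetical heuristic, not part of the module: an embedded split is kept only when the text before the designator contains at least two words. The triplets are hand-written stand-ins for the kind of split the names above might produce, so the sketch runs without the module. The heuristic would also reject legitimate names whose stem is a single word, so treat it as a starting point only.

```perl
use strict;
use warnings;

# Hypothetical heuristic (not part of the module): accept an embedded split
# only when the text before the designator contains at least two words.
sub plausible_embedded_split {
    my ($before) = @_;
    my @words = grep { length } split /[\s,]+/, $before // '';
    return @words >= 2;
}

# Hand-written ($before, $designator, $after) triplets, standing in for
# what an embedded split of the example names might look like.
my @splits = (
    [ 'Amerihealth Insurance', 'Company', 'of NJ'     ],
    [ 'Trenkwalder Personal',  'AG',      'Schweiz'   ],
    [ 'XYZ',                   'PC',      'Repairs'   ],
    [ 'Dr',                    'S L',     'Ledingham' ],
);

for my $split (@splits) {
    my ($before, $des, $after) = @$split;
    my $verdict = plausible_embedded_split($before)
        ? 'keep split'
        : 'treat as unsplit';
    print "$before | $des | $after: $verdict\n";
}
```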

SEE ALSO

Finance::CompanyNames

AUTHOR

Gavin Carr <gavin@profound.net>

COPYRIGHT AND LICENCE

Copyright (C) 2013-2016 Gavin Carr

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.