NAME

String::CaseProfile - Get/Set the letter case profile of a string

VERSION

Version 0.17 - February 9, 2010

SYNOPSIS

use String::CaseProfile qw(get_profile set_profile copy_profile);

my $reference_string = 'Some reference string';
my $string = 'sample string';


# Typical, single-line usage
my $target_string = set_profile($string, get_profile($reference_string));

# Alternatively, you can use the 'copy_profile' convenience function:
$target_string = copy_profile(
                                    from => $reference_string,
                                    to   => $string,
                                );


# Get the profile of a string and access the details
my %ref_profile = get_profile($reference_string);

my $string_type = $ref_profile{string_type};
my $profile_str = $ref_profile{fold};             # 'fll'
my $word        = $ref_profile{words}[2]->{word}; # third word
my $word_type   = $ref_profile{words}[2]->{type};

# See a profile report
print "$ref_profile{report}";        # No need to add \n

# Apply the profile to another string
my $new_string  = set_profile($string, %ref_profile);


# Use custom profiles
my %profile1 = ( string_type => '1st_uc' );
$new_string  = set_profile($string, %profile1);

my %profile2 = ( string_type => 'all_lc', force_change => 1 );
$new_string  = set_profile($string, %profile2);

my %profile3 = (
                custom => {
                            default => 'all_lc',
                            all_uc  => '1st_uc',
                            index   => {
                                        3 => '1st_uc',
                                        5 => 'all_lc',
                                       },
                           }
                );
$new_string  = set_profile($string, %profile3);

DESCRIPTION

This module provides a convenient way of handling the recasing (letter case conversion) of sentences/phrases/chunks in machine translation, case-sensitive search and replace, and other text processing applications.

String::CaseProfile includes three functions:

get_profile determines the letter case profile of a string.

set_profile applies a letter case profile to a string; you can apply a profile determined by get_profile, or you can create your own custom profile.

copy_profile gets the profile of a string and applies it to another string in a single step.

These functions are Unicode-aware and support text in languages based on alphabets which feature lowercase and uppercase letter forms (Roman, Greek, Cyrillic and Armenian). You must feed them utf8-encoded strings.

get_profile and set_profile use the following identifiers to classify word and string types according to their case:

  • all_lc

    In word context, it means that all the letters are lowercase. In string context, it means that every word is of all_lc type.

  • all_uc

    In word context, it means that all the letters are uppercase. In string context, it means that every word is of all_uc type.

  • 1st_uc

    In word context, it means that the first letter is uppercase, and the other letters are lowercase. In string context, it means that the type of the first word is 1st_uc, and the type of the other words is all_lc.

  • other

    Undefined type (e.g. a CamelCase code identifier in word context, or a string containing several alternate types in string context.)
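The word-level types above can be illustrated with a minimal sketch. This is not the module's implementation: classify_word is a hypothetical helper written only for this illustration, and it relies on Perl's built-in lc, uc and ucfirst operating on character strings. A single uppercase letter such as 'C' is ambiguous between all_uc and 1st_uc; the sketch reports it as all_uc, which may differ from what get_profile reports.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper, not part of String::CaseProfile.
# Classifies a single word according to the case of its letters.
sub classify_word {
    my ($word) = @_;

    # Keep only the letters, so that digits and punctuation are ignored.
    (my $letters = $word) =~ s/\P{L}//g;

    return 'other'  if $letters eq '';
    return 'all_lc' if $letters eq lc $letters;
    return 'all_uc' if $letters eq uc $letters;   # assumption: covers single letters too
    return '1st_uc' if $letters eq ucfirst lc $letters;
    return 'other';
}

binmode STDOUT, ':encoding(UTF-8)';

for my $word ('string', 'STRING', 'String', 'CamelCase', 'Список') {
    printf "%-10s %s\n", $word, classify_word($word);
}
```

A string-level type could then be derived from the sequence of word types, as described for the string context of each identifier.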

FUNCTIONS

NOTE: The syntax of the get_profile function changed slightly in v0.16. The old syntax (see http://search.cpan.org/~enell/String-CaseProfile-0.15/lib/String/CaseProfile.pm) still works, but eventually it will be deprecated.

get_profile( $string, { exclude => $excluded, strict => $strict } )

Returns a hash containing the profile details for $string.

The string provided must be encoded as utf8. This is the only required parameter.

You can also specify a hash reference containing any of the following optional parameters:

  • exclude

    A reference to a list of terms that should not be considered when determining the profile of $string (e.g., the word "Internet" in some cases, or the first person personal pronoun in English, "I").

  • strict

    A parameter that you can set to a true value if you want 'other'-type words to be considered when determining the string type. By default, this parameter is set to false.

The keys of the returned hash are the following:

  • string_type

    Scalar containing the string type, if it can be determined; otherwise, its value is 'other'.

  • fold

    Pattern string created by mapping each word type to a single-letter code:

    1st_uc => 'f'
    all_uc => 'u'
    all_lc => 'l'
    other  => 'o'

    For instance, the patterns of the common types are:

    1st_uc:  ^fl*$
    all_uc:  ^u+$
    all_lc:  ^l+$

    This feature can be useful to process 'other' string types using regular expressions. E.g., you can use it to detect (probable) title case strings:

    if ( $profile{fold} =~ /^f[fl]*f$/ ) {
        # some code here
    }

  • words

    Reference to an array containing a hash for every word in the string. Each hash has two keys: word and type.

  • report

    Scalar containing a printable summary of the string profile.
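As an illustration of how the fold patterns listed above can be used, here is a minimal sketch that works on plain fold strings and does not call the module. Both fold_from_types and describe_fold are hypothetical helpers introduced for this example, and only the four documented word types are handled.

```perl
use strict;
use warnings;

# Single-letter codes, as listed for the 'fold' key.
my %code_for = (
    '1st_uc' => 'f',
    'all_uc' => 'u',
    'all_lc' => 'l',
    'other'  => 'o',
);

# Hypothetical helper: build a fold string from a list of word types.
sub fold_from_types {
    return join '', map { $code_for{$_} } @_;
}

# Hypothetical helper: name the pattern that a fold string matches.
sub describe_fold {
    my ($fold) = @_;
    return '1st_uc'         if $fold =~ /^fl*$/;
    return 'all_uc'         if $fold =~ /^u+$/;
    return 'all_lc'         if $fold =~ /^l+$/;
    return 'probable title' if $fold =~ /^f[fl]*f$/;
    return 'other';
}

my $fold = fold_from_types(qw(1st_uc all_lc all_lc));
print "$fold\n";                    # prints 'fll'
print describe_fold($fold), "\n";   # prints '1st_uc'
print describe_fold('flf'), "\n";   # prints 'probable title'
```

Note that the title case pattern requires at least two words, since it expects both a leading and a trailing 'f'.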

set_profile( $string, %profile )

Applies %profile to $string and returns a new string. $string must be encoded as utf8. The profile configuration parameters (hash keys) are the following:

  • string_type

    You can specify one of the string types mentioned above (except 'other') as the type that should be applied to the string.

  • custom

    As an alternative, you can define a custom profile as a reference to a hash with three kinds of entries: an 'index' key that assigns types to specific (zero-based) word positions, type conversion keys that map one of the types mentioned above to another type (e.g., all_uc => '1st_uc'), and a 'default' key that sets the type of the words to which none of the preceding rules applies. The order of evaluation is 1) index, 2) type conversion, 3) default type. For more information, see the examples below.

  • exclude

    Optionally, you can specify a list of words that should not be affected by the set_profile function. The value of the exclude key should be an array reference. The case profile of these words won't change unless the target string type is 'all_uc'.

  • force_change

    By default, set_profile ignores words of type 'other' when applying the profile. Set this boolean parameter to a true value to change these words as well.
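The order of evaluation described for the custom key (index, then type conversion, then default) can be sketched as follows. This is only an illustration of the precedence rule, not the module's code: resolve_type is a hypothetical helper, it ignores the exclude and force_change parameters, and it assumes that a word stays as it is when no rule applies.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of String::CaseProfile: pick the target type
# for the word at zero-based position $i whose current type is $type.
sub resolve_type {
    my ($custom, $i, $type) = @_;

    # 1) A rule for this specific word position wins.
    return $custom->{index}{$i}
        if exists $custom->{index} && exists $custom->{index}{$i};

    # 2) Otherwise, a conversion rule for the word's current type.
    return $custom->{$type} if exists $custom->{$type};

    # 3) Otherwise, the default type, if one was given.
    return $custom->{default} if exists $custom->{default};

    # Assumption of this sketch: with no applicable rule, keep the type.
    return $type;
}

# Same structure as the custom profile shown in the SYNOPSIS.
my %custom = (
    default => 'all_lc',
    all_uc  => '1st_uc',
    index   => { 3 => '1st_uc', 5 => 'all_lc' },
);

print resolve_type(\%custom, 5, 'all_uc'), "\n";   # prints 'all_lc' (index rule)
print resolve_type(\%custom, 0, 'all_uc'), "\n";   # prints '1st_uc' (type conversion)
print resolve_type(\%custom, 1, '1st_uc'), "\n";   # prints 'all_lc' (default)
```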

copy_profile( from => $source, to => $target, [ exclude => $array_ref, strict => $strict ] )

Gets the profile of $source, applies it to $target, and returns the resulting string.

You can also specify words that should be excluded in both the source string and the target string, as well as the strict parameter described for get_profile:

copy_profile(
                from    => $source,
                to      => $target,
                exclude => $array_ref,
                strict  => $strict,
            );

This is just a convenience function. If copy_profile cannot determine the profile of the source string, it will leave the target string unchanged. If you need more control, you should use the get_profile and set_profile functions.

NOTES:

When these functions process excluded words, they also consider compound words that include them, like "Internet-based" or "I've".

The list of excluded words is case-sensitive (i.e., if you exclude the word 'MP3', its lowercase version, 'mp3', won't be excluded unless you add it to the list).

EXAMPLES

use String::CaseProfile qw(
                            get_profile
                            set_profile
                            copy_profile
                           );
use Encode;

my @strings = (
                'Entorno de tiempo de ejecución',
                'è un linguaggio dinamico',
                'langages dérivés du C',
              );


# Decode the strings into Perl character strings (utf8), if necessary
my @samples = map { decode('iso-8859-1', $_) } @strings;

my $new_string;


# EXAMPLE 1: Get the profile of a string

my %profile = get_profile( $samples[0] );

print "$profile{string_type}\n";   # prints '1st_uc'
my @types = map { $_->{type} } @{ $profile{words} }; # 1st_uc all_lc all_lc all_lc all_lc
my @words = @{ $profile{words} };                    # an array of hashes (word, type)



# EXAMPLE 2: Get the profile of a string and apply it to another string

my $ref_string1 = 'REFERENCE STRING';
my $ref_string2 = 'Another reference string';

$new_string = set_profile( $samples[1], get_profile( $ref_string1 ) );
# The current value of $new_string is 'È UN LINGUAGGIO DINAMICO'

$new_string = set_profile( $samples[1], get_profile( $ref_string2 ) );
# Now it's 'È un linguaggio dinamico'

# Alternative, using copy_profile
$new_string = copy_profile( from => $ref_string1, to => $samples[1] );
$new_string = copy_profile( from => $ref_string2, to => $samples[1] );



# EXAMPLE 3: Change a string using several custom profiles

my %profile1 = ( string_type  => 'all_uc' );

$new_string = set_profile( $samples[2], %profile1 );
# $new_string is 'LANGAGES DÉRIVÉS DU C'

my %profile2 = ( string_type => 'all_lc', force_change => 1 );

$new_string = set_profile( $samples[2], %profile2 );
# $new_string is 'langages dérivés du c'

my %profile3 = (
                custom  => {
                            default => 'all_lc',
                            index   => { '1'  => 'all_uc' }, # 2nd word
                           }
               );

$new_string = set_profile( $samples[2], %profile3 );
# $new_string is 'langages DÉRIVÉS du C'

my %profile4 = ( custom => { all_lc => '1st_uc' } );

$new_string = set_profile( $samples[2], %profile4 );
# $new_string is 'Langages Dérivés Du C'

More examples, this time excluding words:

# A second batch of sample strings
@strings = (
            'conexión a Internet',
            'An Internet-based application',
            'THE ABS MODULE',
            'Yes, I think so',
            "this is what I'm used to",
           );
           
# Decode the strings into Perl character strings (utf8), if necessary
@samples = map { decode('iso-8859-1', $_) } @strings;



# EXAMPLE 4: Get the profile of a string excluding the word 'Internet'
#            and apply it to another string

%profile = get_profile( $samples[0], { exclude => ['Internet'], } );

print "$profile{string_type}\n";      # prints  'all_lc'
print "$profile{words}[2]->{word}\n"; # prints 'Internet'
print "$profile{words}[2]->{type}\n"; # prints 'excluded'

# Set this profile to $samples[1], excluding the word 'Internet'
$profile{exclude} = ['Internet'];

$new_string = set_profile( $samples[1], %profile );

print "$new_string\n"; # prints "an Internet-based application", preserving
                       # the case of the 'Internet-based' compound word



# EXAMPLE 5: Set the profile of a string containing a '1st_uc' excluded word
#            to 'all_uc'

%profile = ( string_type => 'all_uc', exclude => ['Internet'] );

$new_string = set_profile( $samples[0], %profile );

print "$new_string\n";   # prints 'CONEXIÓN A INTERNET', as expected, since
                         # the case profile of an excluded word is not preserved
                         # if the target string type is 'all_uc'



# EXAMPLE 6: Set the profile of a string containing an 'all_uc'
#            excluded word to 'all_lc'

%profile = ( string_type => 'all_lc', exclude => ['ABS'] );

$new_string = set_profile( $samples[2], %profile );

print "$new_string\n";   # prints 'the ABS module', preserving the 
                         # excluded word case profile


# EXAMPLE 7: Get the profile of a string containing the word 'I' and
#            apply it to a string containing the compound word 'I'm'
#            using the copy_profile function

$new_string = copy_profile(
                            from    => $samples[3],
                            to      => $samples[4],
                            exclude => ['I'],
                          );

print "$new_string\n";   # prints "This is what I'm used to"



# EXAMPLE 8: Change a string using a custom profile

%profile = (
                custom  => {
                            default => '1st_uc',
                            index   => { '1'  => 'all_lc' }, # 2nd word
                           },
                exclude => ['ABS'],
           );

$new_string = set_profile( $samples[2], %profile );
print "$new_string\n";  # prints 'The ABS Module'

Yet more examples using other alphabets:

# Samples using other alphabets

use utf8;

binmode STDOUT, ':utf8';


my @samples = ( 
                'Ծրագրի հեղինակների ցանկը', # Armenian
                'Λίστα των συγγραφέων του προγράμματος', # Greek
                'Список авторов программы', # Russian
              );

my $new_string;


# EXAMPLE 9: Get the profile of a string

my %profile = get_profile( $samples[0] );

print "$profile{string_type}\n";   # prints '1st_uc'


# EXAMPLE 10: Change a string using a custom profile

%profile = ( string_type  => 'all_uc');

$new_string = set_profile($samples[0], %profile);

print "$new_string\n"; # prints 'ԾՐԱԳՐԻ ՀԵՂԻՆԱԿՆԵՐԻ ՑԱՆԿԸ'


# EXAMPLE 11: Get the profile of a string and apply it to another string

print set_profile($samples[1], get_profile($new_string)); # prints 'ΛΊΣΤΑ ΤΩΝ ΣΥΓΓΡΑΦΈΩΝ ΤΟΥ ΠΡΟΓΡΆΜΜΑΤΟΣ'
print "\n";


# EXAMPLE 12: More custom profiles

my %profile1 = (
            custom  => {
                        default => 'all_lc',
                        index   => { '1'  => 'all_uc' }, # 2nd word
                       }
            );
            
my %profile2 = ( custom => { 'all_lc' => '1st_uc' } );

print set_profile($samples[2], %profile1); # prints 'список АВТОРОВ программы'
print "\n";

print set_profile($samples[2], %profile2); # prints 'Список Авторов Программы'
print "\n";

EXPORT

None by default.

LIMITATIONS

Since String::CaseProfile is a multilanguage module and title case is a language-dependent feature, the functions provided don't handle title case capitalization (the See Also section lists modules that you can use for this task). However, you can use the profile information provided by get_profile to implement a solution for your particular case.

For the German language, which capitalizes every noun, these functions may be of limited use, but you can still use the profile information to create and apply custom profiles.

SEE ALSO

Lingua::EN::Titlecase

Text::Capitalize

http://en.wikipedia.org/wiki/Capitalization

ACKNOWLEDGEMENTS

Many thanks to Xavier Noria and Joaquín Ferrero for wise suggestions.

AUTHOR

Enrique Nell, <blas.gordon@gmail.com>

BUGS

Please report any bugs or feature requests to bug-string-caseprofile at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=String-CaseProfile. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc String::CaseProfile

COPYRIGHT AND LICENSE

Copyright (C) 2007-2010 by Enrique Nell, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.