NAME

Ufal::NameTag - bindings to the NameTag library http://ufal.mff.cuni.cz/nametag.

DESCRIPTION

Ufal::NameTag is a Perl binding to the NameTag library http://ufal.mff.cuni.cz/nametag.

All classes can be imported into the current namespace using the :all export tag.

The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Perl interface; see the Ufal::NameTag::Forms source for reference. Static methods and enumerations are available only through the module, not through object instances.
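Because the wrapped vectors are opaque objects rather than Perl arrays, it can be convenient to copy their contents into a native list. The sketch below relies only on the size() and get() methods that the run_ner example in this document calls on such vectors. The helper name vector_to_list and the FakeVector stand-in class are hypothetical and not part of Ufal::NameTag; the stand-in exists only so the sketch runs without the library installed.

```perl
use strict;
use warnings;

# Hypothetical helper: copy any vector-like object that exposes
# size() and get($i) into a plain Perl list.
sub vector_to_list {
  my ($vector) = @_;
  return map { $vector->get($_) } 0 .. $vector->size() - 1;
}

# Hypothetical stand-in for a wrapped vector (not part of Ufal::NameTag).
# With the module loaded, a Forms or NamedEntities object would be passed instead.
{
  package FakeVector;
  sub new  { my ($class, @items) = @_; return bless [@items], $class }
  sub size { return scalar @{ $_[0] } }
  sub get  { return $_[0][ $_[1] ] }
}

my @forms = vector_to_list(FakeVector->new(qw(Vaclav Havel)));
print join(' ', @forms), "\n";
```

The helper is duck-typed on purpose, so the same function should serve any of the wrapped vector classes, provided they offer the same two methods.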

Wrapped C++ API

The C++ API being wrapped follows. For an API reference of the original C++ API, see http://ufal.mff.cuni.cz/nametag/api-reference.

Helper Structures
-----------------

  typedef vector<string> Forms;
  
  struct TokenRange {
    size_t start;
    size_t length;
  };
  typedef vector<TokenRange> TokenRanges;
  
  struct NamedEntity {
    size_t start;
    size_t length;
    string type;
  
    NamedEntity();
    NamedEntity(size_t start, size_t length, const string& type);
  };
  typedef vector<NamedEntity> NamedEntities;


Main Classes
------------

  class Version {
   public:
    unsigned major;
    unsigned minor;
    unsigned patch;
  
    static Version current();
  };
  
  class Tokenizer {
   public:
    virtual void setText(const char* text);
    virtual bool nextSentence(Forms* forms, TokenRanges* tokens);
  
    static Tokenizer* newVerticalTokenizer();
  };
  
  class Ner {
   public:
    static Ner* load(const char* fname);
  
    virtual void recognize(Forms& forms, NamedEntities& entities) const;
  
    virtual Tokenizer* newTokenizer() const;
  };

Example

run_ner

Simple example performing named entity recognition.

use strict;
use warnings;
use open qw(:std :utf8);

use Ufal::NameTag qw(:all);

sub encode_entities($) {
  my ($text) = @_;
  $text =~ s/[&<>"]/$& eq "&" ? "&amp;" : $& eq "<" ? "&lt;" : $& eq ">" ? "&gt;" : "&quot;"/ge;
  return $text;
}

sub sort_entities($) {
  my ($entities) = @_;
  my @entities = ();
  for (my ($i, $size) = (0, $entities->size()); $i < $size; $i++) {
    push @entities, $entities->get($i);
  }
  return sort { $a->{start} <=> $b->{start} || $b->{length} <=> $a->{length} } @entities;
}

@ARGV >= 1 or die "Usage: $0 recognizer_model\n";

print STDERR "Loading ner: ";
my $ner = Ner::load($ARGV[0]);
$ner or die "Cannot load recognizer from file '$ARGV[0]'\n";
print STDERR "done\n";
shift @ARGV;

my $forms = Forms->new();
my $tokens = TokenRanges->new();
my $entities = NamedEntities->new();
my @sorted_entities;
my @open_entities;
my $tokenizer = $ner->newTokenizer();
$tokenizer or die "No tokenizer is defined for the supplied model!";

for (my $not_eof = 1; $not_eof; ) {
  my $text = '';

  # Read block
  while (1) {
    my $line = <>;
    last unless ($not_eof = defined $line);
    $text .= $line;
    chomp($line);
    last unless length $line;
  }

  # Tokenize and recognize
  $tokenizer->setText($text);
  my $t = 0;
  while ($tokenizer->nextSentence($forms, $tokens)) {
    $ner->recognize($forms, $entities);
    @sorted_entities = sort_entities($entities);

    # Write entities
    for (my ($i, $size, $e) = (0, $tokens->size(), 0); $i < $size; $i++) {
      my $token = $tokens->get($i);
      my ($token_start, $token_length) = ($token->{start}, $token->{length});

      print encode_entities(substr $text, $t, $token_start - $t);
      print '<sentence>' if $i == 0;

      # Open entities starting at current token
      for (; $e < @sorted_entities && $sorted_entities[$e]->{start} == $i; $e++) {
        printf '<ne type="%s">', encode_entities($sorted_entities[$e]->{type});
        push @open_entities, $sorted_entities[$e]->{start} + $sorted_entities[$e]->{length} - 1;
      }

      # The token itself
      printf '<token>%s</token>', encode_entities(substr $text, $token_start, $token_length);

      # Close entities ending after current token
      while (@open_entities && $open_entities[-1] == $i) {
        print '</ne>';
        pop @open_entities;
      }
      print '</sentence>' if $i + 1 == $size;
      $t = $token_start + $token_length;
    }
  }
  # Write rest of the text
  print encode_entities(substr $text, $t);
}
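
The output loop above depends on the order produced by sort_entities: entities are sorted by start token, and for equal starts the longer one comes first. An enclosing entity is therefore opened before anything nested inside it, and the stack of end positions closes entities innermost first. The following is a minimal sketch of that idea, using plain hash references in place of NamedEntity objects and assuming the entities nest properly. The function name nest_entities, the sample tokens, and the entity types are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap tokens in <ne> tags given entity spans
# ({start, length, type}, measured in tokens). Assumes spans nest properly.
sub nest_entities {
  my ($tokens, @entities) = @_;

  # Earlier start first; for equal starts, the longer (outer) entity first.
  my @sorted = sort { $a->{start} <=> $b->{start} || $b->{length} <=> $a->{length} } @entities;

  my ($out, $e, @open) = ('', 0);
  for my $i (0 .. $#$tokens) {
    # Open every entity starting at token $i and remember its last token index.
    for (; $e < @sorted && $sorted[$e]{start} == $i; $e++) {
      $out .= qq(<ne type="$sorted[$e]{type}">);
      push @open, $sorted[$e]{start} + $sorted[$e]{length} - 1;
    }

    $out .= $tokens->[$i];

    # Close entities ending at token $i, innermost first.
    while (@open && $open[-1] == $i) {
      $out .= '</ne>';
      pop @open;
    }
    $out .= ' ' if $i < $#$tokens;
  }
  return $out;
}

print nest_entities(
  [qw(Charles University in Prague)],
  { start => 3, length => 1, type => 'gu' },
  { start => 0, length => 4, type => 'if' },
), "\n";
```

Sorting by descending length for equal starts is what keeps the generated markup well nested; with the opposite tie-break, an inner entity could be opened before its enclosing one.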

AUTHORS

Milan Straka <straka@ufal.mff.cuni.cz>

Jana Straková <strakova@ufal.mff.cuni.cz>

COPYRIGHT AND LICENCE

Copyright 2014 by Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic.

NameTag is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

NameTag is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with NameTag. If not, see <http://www.gnu.org/licenses/>.