NAME

gen_word_break_data.pl - Generate word break table and tests

SYNOPSIS

perl gen_word_break_data.pl [-c] UCD_SRC_DIR

DESCRIPTION

This script generates the tables used to look up Unicode word break properties for the StandardTokenizer. It also converts the word break test suite from the UCD to JSON.

UCD_SRC_DIR should point to a directory containing the files WordBreakProperty.txt, WordBreakTest.txt, and DerivedCoreProperties.txt from the Unicode Character Database available at http://www.unicode.org/Public/6.3.0/ucd/.
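For orientation, UCD property files such as WordBreakProperty.txt and DerivedCoreProperties.txt list one code point or code point range per line, followed by a semicolon, the property value, and an optional "#" comment. The following is a minimal sketch of reading such a line. The helper name parse_ucd_line is hypothetical and is not taken from this script, which may parse the files differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of gen_word_break_data.pl): parse one line
# of a UCD property file. Returns (first, last, property) with the code
# points as integers, or an empty list for blank and comment-only lines.
sub parse_ucd_line {
    my ($line) = @_;
    $line =~ s/#.*//;    # strip the trailing comment, if any
    return
        unless $line
        =~ /^\s*([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)/;
    my ( $first, $last, $prop ) = ( $1, $2, $3 );
    $last = $first unless defined $last;    # single code point, no range
    return ( hex($first), hex($last), $prop );
}

my ( $first, $last, $prop )
    = parse_ucd_line('0041..005A    ; ALetter # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z');
printf "U+%04X..U+%04X => %s\n", $first, $last, $prop;
```

Lines with a single code point (no ".." range) yield the same value for the first and last code point.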

OUTPUT FILES

modules/unicode/ucd/WordBreak.tab
modules/unicode/ucd/WordBreakTest.json

OPTIONS

-c

Show the total table size for different shift values.
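The sketch below illustrates why the shift value affects table size. It assumes that the shift refers to a typical two-stage lookup table, where the high bits of a code point (code point >> shift) select a block and the low bits select an entry within it, and where identical blocks are stored only once. This is an assumption about the table layout; the function name two_stage_size is hypothetical, and sizes are counted in entries rather than bytes.

```perl
use strict;
use warnings;

# Hypothetical sketch (not part of gen_word_break_data.pl): count the
# entries needed by a two-stage table built from a flat array of values.
# $shift is the number of low bits used as the offset within a block.
# Assumes the number of values is a multiple of the block size.
sub two_stage_size {
    my ( $values, $shift ) = @_;
    my $block_size = 1 << $shift;
    my %seen;               # block contents => block number
    my @index;              # first stage: block number per block slot
    my $num_blocks = 0;     # number of distinct blocks stored
    for ( my $start = 0; $start < @$values; $start += $block_size ) {
        my $end = $start + $block_size - 1;
        $end = $#$values if $end > $#$values;
        my $key = join ',', @{$values}[ $start .. $end ];
        $seen{$key} = $num_blocks++ unless exists $seen{$key};
        push @index, $seen{$key};
    }
    # First-stage index entries plus the entries of all distinct blocks.
    return scalar(@index) + $num_blocks * $block_size;
}

# Toy data: 256 values, all 0 except for the range 65..90.
my @values = (0) x 256;
$values[$_] = 1 for 65 .. 90;

for my $shift ( 2 .. 6 ) {
    printf "shift %d: %d entries\n", $shift,
        two_stage_size( \@values, $shift );
}
```

A small shift produces many small blocks and a large index, while a large shift produces a small index but fewer opportunities to share blocks, so the smallest total usually lies somewhere in between.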