NAME
gen_word_break_data.pl - Generate word break table and tests
SYNOPSIS
perl gen_word_break_data.pl [-c] UCD_SRC_DIR
DESCRIPTION
This script generates the tables used to look up Unicode word break properties for the StandardTokenizer. It also converts the UCD word break test suite to JSON.
UCD_SRC_DIR should point to a directory containing the files WordBreakProperty.txt, WordBreakTest.txt, and DerivedCoreProperties.txt from the Unicode Character Database, available at http://www.unicode.org/Public/6.3.0/ucd/.
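For orientation, UCD property files such as WordBreakProperty.txt list one code point or code point range per line, followed by a semicolon, the property value, and a "#" comment. The following is a minimal sketch of reading that format, not the script's actual parsing code; parse_ucd_ranges and the inline sample lines are hypothetical names and data chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of gen_word_break_data.pl): parse UCD
# property lines of the form "XXXX[..YYYY] ; Value # comment" and
# return a list of [first, last, value] ranges.
sub parse_ucd_ranges {
    my @lines = @_;
    my @ranges;
    for my $line (@lines) {
        $line =~ s/\s*#.*//s;        # strip the trailing comment
        next unless $line =~ /\S/;   # skip blank and comment-only lines
        my ($first, $last, $value) =
            $line =~ /^\s*([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)\s*$/
            or die "Unrecognized line: $line";
        # A single code point is treated as a one-element range.
        push @ranges, [ hex($first), hex($last // $first), $value ];
    }
    return @ranges;
}

# Sample lines in the style of WordBreakProperty.txt.
my @sample = (
    "# WordBreakProperty-style sample\n",
    "0027          ; Single_Quote # Po       APOSTROPHE\n",
    "0041..005A    ; ALetter # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z\n",
);
my @ranges = parse_ucd_ranges(@sample);
printf "%04X..%04X %s\n", @$_ for @ranges;
```

In practice the lines would be read from the files in UCD_SRC_DIR rather than from an inline list.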
OUTPUT FILES
modules/unicode/ucd/WordBreak.tab
modules/unicode/ucd/WordBreakTest.json
OPTIONS
-c
Show total table size for different shift values
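This page does not describe the table layout, so the following is only a sketch under an assumption: that the table is a two-stage lookup in which a code point is split into a block index (cp >> shift) and an offset within the block, and identical blocks are stored once. Under that assumption, the shift value trades index size against block size, which is the kind of comparison -c would report. two_stage_size and the toy property data are hypothetical, and sizes are counted in entries, not bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: number of entries in a two-stage table built
# from @$props (one property value per code point) for a given shift.
sub two_stage_size {
    my ($props, $shift) = @_;
    my $block_size = 1 << $shift;
    my %seen;
    my $num_blocks = 0;
    for (my $start = 0; $start < @$props; $start += $block_size) {
        my $end = $start + $block_size - 1;
        $end = $#$props if $end > $#$props;
        # Blocks with identical contents share one stored copy.
        my $key = join ',', @{$props}[$start .. $end];
        $seen{$key} = 1;
        $num_blocks++;
    }
    # One index entry per block plus one full block per distinct pattern.
    return $num_blocks + scalar(keys %seen) * $block_size;
}

# Toy data: 256 code points, property 1 for A-Z and a-z, 0 elsewhere.
my @props = (0) x 256;
$props[$_] = 1 for 0x41 .. 0x5A, 0x61 .. 0x7A;

for my $shift (2 .. 6) {
    printf "shift %d: %d entries\n", $shift, two_stage_size(\@props, $shift);
}
```

A small shift gives many index entries and few stored blocks; a large shift gives the reverse, so the smallest total usually lies somewhere in between.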