NAME

WordList::ID::Common::Wikipedia2500 - Top 2500 words from Wikipedia Indonesia pages

VERSION

This document describes version 0.006 of WordList::ID::Common::Wikipedia2500 (from Perl distribution WordLists-ID-Common), released on 2020-10-11.

SYNOPSIS

use WordList::ID::Common::Wikipedia2500;

my $wl = WordList::ID::Common::Wikipedia2500->new;

# Pick a (or several) random word(s) from the list
my $word = $wl->pick;
my @words = $wl->pick(3);

# Check if a word exists in the list
if ($wl->word_exists('foo')) { ... }

# Call a callback for each word
$wl->each_word(sub { my $word = shift; ... });

# Iterate
my $first_word = $wl->first_word;
while (defined(my $word = $wl->next_word)) { ... }

# Get all the words
my @all_words = $wl->all_words;

DESCRIPTION

This module contains the 2500 most frequently used Indonesian words in Indonesian Wikipedia pages.

Here's how the list was produced. First, the Indonesian Wikipedia XML.bz2 dump [1] was downloaded (last downloaded: Dec 30, 2017). Then a couple of ad-hoc, rather simplistic Perl scripts were used to process this large file: one to split the file into per-page files, and the other to strip the Wikimedia markup. All-lowercase words were then extracted from these files and merged into a single file. Finally, the list was curated (false positives and misspellings removed) to get the final {1000,2500,5000} top words.
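The extraction and ranking steps can be illustrated with a minimal sketch. This is not the author's actual script (those live in the repository's devscripts/ directory); the sample text, the tokenizing regex, and the variable names are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for markup-stripped page text.
my $text = "Lagu ini adalah lagu yang populer. "
         . "Kota ini adalah kota besar di Indonesia.";

# Count only all-lowercase ASCII words; tokens containing capitals
# (e.g. "Lagu", "Indonesia") are skipped, as are empty tokens.
my %freq;
for my $token (split /[^A-Za-z]+/, $text) {
    next unless $token =~ /\A[a-z]+\z/;
    $freq{$token}++;
}

# Rank by descending frequency, breaking ties asciibetically.
my @ranked = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;

# Keep the top N words (the real lists use 1000, 2500, or 5000).
my $top_n = 3;
my @top = @ranked[0 .. $top_n - 1];
print join(",", @top), "\n";    # adalah,ini,besar
```

The manual curation step (removing false positives and misspellings) has no counterpart in this sketch.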

Note that Wikipedia article pages do not represent general Indonesian text; some words are overrepresented, e.g. "lagu" (in articles about particular songs) or "filum".

Some words are derived forms (non-root words), e.g. "makanannya" or "berdasarkan".

The order of the words in this wordlist is asciibetical, as required by the WordList convention. If you want to know the ranks of words by frequency, as well as the scripts used to generate the result, see the devscripts/ and work/ directories in the Git repository.
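As a small illustration of asciibetical order, the sketch below sorts a few hypothetical sample words with Perl's default sort, which compares strings by code point when no locale pragma is in effect. The sample words are assumptions chosen for the example, not an excerpt of the module's data.

```perl
use strict;
use warnings;

# Hypothetical sample words, not the module's actual data.
my @words = qw(yang dan di itu adalah);

# Without `use locale`, the default sort compares strings with `cmp`,
# i.e. by code point, which is asciibetical order for ASCII input.
my @sorted = sort @words;
print "@sorted\n";    # adalah dan di itu yang

# Asciibetical order places uppercase before lowercase ASCII letters.
my @mixed = sort qw(b A a B);
print "@mixed\n";     # A B a b
```

Since the words in this list are all lowercase, the uppercase case does not arise here; it is shown only to distinguish asciibetical from dictionary order.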

[1] https://id.wikipedia.org/wiki/Wikipedia:Wikipedia_bahasa_Indonesia_versi_luring

WORDLIST STATISTICS

+----------------------------------+-------+
| key                              | value |
+----------------------------------+-------+
| avg_word_len                     | 7.074 |
| longest_word_len                 | 18    |
| num_words                        | 2500  |
| num_words_contain_nonword_chars  | 0     |
| num_words_contain_unicode        | 0     |
| num_words_contain_whitespace     | 0     |
| num_words_contains_nonword_chars | 0     |
| num_words_contains_unicode       | 0     |
| num_words_contains_whitespace    | 0     |
| shortest_word_len                | 2     |
+----------------------------------+-------+

The statistics are available in the %STATS package variable.

HOMEPAGE

Please visit the project's homepage at https://metacpan.org/release/WordLists-ID-Common.

SOURCE

Source repository is at https://github.com/perlancar/perl-WordLists-ID-Common.

BUGS

Please report any bugs or feature requests on the bugtracker website https://rt.cpan.org/Public/Dist/Display.html?Name=WordLists-ID-Common

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

AUTHOR

perlancar <perlancar@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2020, 2018, 2017 by perlancar@cpan.org.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.