NAME

WordList::ID::Common::Wikipedia::Top500 - Top 500 words from Wikipedia Indonesia pages

VERSION

This document describes version 0.006 of WordList::ID::Common::Wikipedia::Top500 (from Perl distribution WordLists-ID-Common), released on 2020-10-11.

SYNOPSIS

use WordList::ID::Common::Wikipedia::Top500;

my $wl = WordList::ID::Common::Wikipedia::Top500->new;

# Pick a (or several) random word(s) from the list
my $word = $wl->pick;
my @words = $wl->pick(3);

# Check if a word exists in the list
if ($wl->word_exists('foo')) { ... }

# Call a callback for each word
$wl->each_word(sub { my $word = shift; ... });

# Iterate
my $first_word = $wl->first_word;
while (defined(my $word = $wl->next_word)) { ... }

# Get all the words
my @all_words = $wl->all_words;

DESCRIPTION

This module contains the 500 most frequently used Indonesian words in Indonesian Wikipedia pages.

Here is how the list was produced. First, the Indonesian Wikipedia XML.bz2 dump [1] was downloaded (last downloaded: Oct 11, 2020). Then a couple of ad-hoc, rather simplistic Perl scripts were used to process this large file: one to split the file into individual pages, and another to strip the Wikimedia markup. Words were then extracted case-insensitively from these files and merged into a single file. Finally, the list was manually curated (false positives and misspellings removed) to get the final top 500 words.
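
The extraction-and-counting step can be illustrated with a minimal sketch. This is not the script that was actually used to build the list; the helper name top_words() and the sample text are hypothetical, and the sketch assumes that a word is simply a run of ASCII letters.

```perl
use strict;
use warnings;

# Count words case-insensitively and return the $n most frequent ones.
# top_words() is a hypothetical helper, not the author's actual script.
sub top_words {
    my ($text, $n) = @_;

    my %count;
    # Assumption: a "word" is a run of ASCII letters; lowercase before counting.
    $count{lc $1}++ while $text =~ /([A-Za-z]+)/g;

    # Most frequent first; ties are broken alphabetically for stable output.
    my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

# Made-up sample text standing in for a stripped Wikipedia page.
my $text = "Lagu ini adalah lagu yang populer, dan lagu itu adalah lagu yang lama";
print join(", ", top_words($text, 3)), "\n";
```

A real run would feed the markup-stripped page files through the same kind of counter and keep the top entries for manual curation.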

Note that Wikipedia article pages are not representative of general Indonesian text: some words are overrepresented, e.g. "lagu" (in articles about particular songs) or "filum".

Some words are derived forms (non-root words), e.g. "makanannya" or "berdasarkan".

WORDLIST STATISTICS

+----------------------------------+-------+
| key                              | value |
+----------------------------------+-------+
| avg_word_len                     | 6.134 |
| longest_word_len                 | 13    |
| num_words                        | 500   |
| num_words_contain_nonword_chars  | 0     |
| num_words_contain_unicode        | 0     |
| num_words_contain_whitespace     | 0     |
| num_words_contains_nonword_chars | 0     |
| num_words_contains_unicode       | 0     |
| num_words_contains_whitespace    | 0     |
| shortest_word_len                | 2     |
+----------------------------------+-------+

These statistics are available in the %STATS package variable.
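
As a rough sketch, the hash can be read by its fully qualified name once the module is loaded. The helper wordlist_stats() below is hypothetical (it is not part of this distribution); it returns nothing if the module is not installed, so the example runs either way.

```perl
use strict;
use warnings;

# Return a copy of a wordlist module's %STATS hash as a hashref, or
# nothing if the module cannot be loaded.
# wordlist_stats() is a hypothetical helper, not part of the distribution.
sub wordlist_stats {
    my ($class) = @_;

    # Turn Foo::Bar into Foo/Bar.pm so require() can load it by path.
    (my $file = "$class.pm") =~ s{::}{/}g;
    return unless eval { require $file; 1 };

    no strict 'refs';    # needed to reach %STATS through a symbolic name
    return { %{"${class}::STATS"} };
}

my $stats = wordlist_stats('WordList::ID::Common::Wikipedia::Top500');
if ($stats) {
    printf "%d words, average length %s, longest %d\n",
        $stats->{num_words}, $stats->{avg_word_len}, $stats->{longest_word_len};
} else {
    print "WordList::ID::Common::Wikipedia::Top500 is not installed\n";
}
```

When the module is already loaded with "use", the same data can be read directly as %WordList::ID::Common::Wikipedia::Top500::STATS.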

HOMEPAGE

Please visit the project's homepage at https://metacpan.org/release/WordLists-ID-Common.

SOURCE

Source repository is at https://github.com/perlancar/perl-WordLists-ID-Common.

BUGS

Please report any bugs or feature requests on the bugtracker website https://rt.cpan.org/Public/Dist/Display.html?Name=WordLists-ID-Common

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

AUTHOR

perlancar <perlancar@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2020, 2018, 2017 by perlancar@cpan.org.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.