NAME
WordList::ID::Common::Wikipedia::Top500 - Top 500 words from Wikipedia Indonesia pages
VERSION
This document describes version 0.006 of WordList::ID::Common::Wikipedia::Top500 (from Perl distribution WordLists-ID-Common), released on 2020-10-11.
SYNOPSIS
use WordList::ID::Common::Wikipedia::Top500;
my $wl = WordList::ID::Common::Wikipedia::Top500->new;
# Pick a (or several) random word(s) from the list
my $word = $wl->pick;
my @words = $wl->pick(3);
# Check if a word exists in the list
if ($wl->word_exists('foo')) { ... }
# Call a callback for each word
$wl->each_word(sub { my $word = shift; ... });
# Iterate
my $first_word = $wl->first_word;
while (defined(my $word = $wl->next_word)) { ... }
# Get all the words
my @all_words = $wl->all_words;
DESCRIPTION
This module contains the 500 most frequently used Indonesian words in Wikipedia Indonesia pages.
Here's how the list was produced. First, Wikipedia Indonesia's XML.bz2 file [1] was downloaded (last downloaded: Oct 11, 2020). Then a couple of ad hoc, rather simplistic Perl scripts were used to process this large file: one script to split the file into one file per page, and another to strip the Wikimedia markup. Words were then extracted case-insensitively from these files and merged into a single file. Finally, the list was manually curated to get the final top 500 words (false positives and misspellings removed).
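The word-extraction and counting step described above could be sketched roughly as follows. This is a minimal illustration, not the actual scripts used to build the list: the helper name top_words is hypothetical, the input is assumed to be already stripped of markup, and a word is assumed to be a run of ASCII letters.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of this distribution): count words
# case-insensitively across text chunks and return the $n most frequent
# ones, with ties broken alphabetically.
sub top_words {
    my ($n, @texts) = @_;
    my %count;
    for my $text (@texts) {
        # Lowercase first so that "Lagu" and "lagu" count as the same word
        $count{$_}++ for lc($text) =~ /[a-z]+/g;
    }
    my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

my @top = top_words(2, "Lagu ini adalah lagu pop", "Filum ini adalah filum besar");
print join(", ", @top), "\n";   # prints "adalah, filum"
```

The output of such a count still needs manual curation, since it cannot tell misspellings or markup leftovers apart from real words.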
Note that Wikipedia article pages do not represent general Indonesian text: some words are overrepresented, e.g. "lagu" (in articles about particular songs) or "filum".
Some words are derived forms (non-root words), e.g. "makanannya" or "berdasarkan".
WORDLIST STATISTICS
+----------------------------------+-------+
| key                              | value |
+----------------------------------+-------+
| avg_word_len                     | 6.134 |
| longest_word_len                 | 13    |
| num_words                        | 500   |
| num_words_contain_nonword_chars  | 0     |
| num_words_contain_unicode        | 0     |
| num_words_contain_whitespace     | 0     |
| num_words_contains_nonword_chars | 0     |
| num_words_contains_unicode       | 0     |
| num_words_contains_whitespace    | 0     |
| shortest_word_len                | 2     |
+----------------------------------+-------+
The statistics are available in the %STATS package variable.
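To illustrate what the length-related keys mean, here is a minimal sketch of how such values could be derived from a list of words. The helper name word_stats is hypothetical and is not how this distribution generates %STATS; it assumes a non-empty list.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Hypothetical helper: derive a few of the statistics above from a
# non-empty list of words.
sub word_stats {
    my @words = @_;
    my @len = map { length } @words;
    return (
        num_words         => scalar(@words),
        avg_word_len      => sum(@len) / @len,
        shortest_word_len => min(@len),
        longest_word_len  => max(@len),
    );
}

my %stats = word_stats(qw(yang dan di berdasarkan));
printf "%-18s %s\n", $_, $stats{$_} for sort keys %stats;
```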
HOMEPAGE
Please visit the project's homepage at https://metacpan.org/release/WordLists-ID-Common.
SOURCE
Source repository is at https://github.com/perlancar/perl-WordLists-ID-Common.
BUGS
Please report any bugs or feature requests on the bugtracker website https://rt.cpan.org/Public/Dist/Display.html?Name=WordLists-ID-Common
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
AUTHOR
perlancar <perlancar@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2020, 2018, 2017 by perlancar@cpan.org.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.