
NOTES ON THE CONSTRUCTION OF THE WORD LIST
A preliminary version of this spell checking dictionary was assembled
with the help of my web crawler "An Crúbadán":

  http://borel.slu.edu/crubadan/

BUILDING TEXT CORPORA FOR MINORITY LANGUAGES
Initially, a small collection of "seed" texts is fed to the crawler
(a few hundred words of running text have been sufficient in practice).
Queries combining words from these texts are generated and passed to
the Google API, which returns a list of documents potentially written
in the target language.  These are downloaded, processed into plain text,
and formatted.  A combination of statistical techniques bootstrapped from
the initial seed texts (and refined as more texts are added to the database)
is used to determine which documents (or sections thereof) are written in
the target language.   The crawler then recursively follows links contained
within documents that are in the target language.   When these run out,
the entire process is repeated, with a new set of Google queries generated
from the new, larger corpus.
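
The loop just described can be sketched roughly as follows.  This is only
an illustration under stated assumptions, not the code used by An Crúbadán:
the names make_queries and crawl are hypothetical, and the search engine,
the downloader and the language test are passed in as plain functions
(search, fetch, is_target_language) because the notes above do not say how
they are implemented.  Picking query words at random is likewise an
assumption, and the sketch judges whole documents rather than sections.

```python
import random
from collections import deque


def make_queries(corpus, n_queries=5, words_per_query=3, rng=None):
    """Build search queries by combining words drawn from the corpus.

    Random sampling is an assumption; the notes only say that queries
    "combine words" from the texts.
    """
    rng = rng or random.Random(0)
    vocab = sorted({w for text in corpus for w in text.lower().split()})
    if len(vocab) < words_per_query:
        return []
    return [" ".join(rng.sample(vocab, words_per_query))
            for _ in range(n_queries)]


def crawl(seed_texts, search, fetch, is_target_language, rounds=2):
    """Grow a corpus from seed texts.

    search(query)            -> list of candidate URLs (hypothetical)
    fetch(url)               -> (plain_text, list_of_links) (hypothetical)
    is_target_language(text) -> bool (hypothetical)
    """
    corpus = list(seed_texts)
    seen = set()
    for _ in range(rounds):
        # Each round starts from fresh queries built on the larger corpus.
        queue = deque()
        for query in make_queries(corpus):
            queue.extend(search(query))
        while queue:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            text, links = fetch(url)
            # Keep the text, and follow its links, only if the document
            # is judged to be in the target language.
            if is_target_language(text):
                corpus.append(text)
                queue.extend(links)
    return corpus
```

Passing the three helpers in as arguments keeps the sketch independent of
any particular search service or language identification method.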

EXTRACTING A CLEAN WORD LIST
The raw texts downloaded using the scheme just described contain
a lot of pollution and are unsuitable for use without further processing.   
I have been able to extract reasonably accurate spell checking dictionaries
by applying a series of simple filters.   First, the texts are tokenized
and used to generate a word list sorted by frequency, from which the
least frequent words are removed.   Then, depending on the target language,
correctly-spelled words from one or more "polluting" languages
are set aside to be checked by hand later.  Usually this means English,
but I also filter Dutch from the Frisian corpus, Spanish from Chamorro, etc.
The remaining words are used to generate 3-gram statistics for the target
language.  These are used to flag as "suspect" any remaining words containing
one or more improbable 3-grams.  Finally, pairs of words differing only
in the presence or absence of diacritical marks are flagged.
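
The filters above might look something like the following sketch.  It is
an illustration only: the function names, the thresholds, the "#" word
boundary padding and the rule "a 3-gram is improbable if it occurs fewer
than min_trigram_count times in the word list" are all assumptions, since
the notes do not give the actual cutoffs or the statistic used.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"[^\W\d_]+", text)


def frequent_words(texts, min_count=2):
    """Count word frequencies and drop words seen fewer than min_count times."""
    counts = Counter(word for text in texts for word in tokenize(text))
    return Counter({w: c for w, c in counts.items() if c >= min_count})


def split_polluting(words, polluting_words):
    """Separate words that also belong to a "polluting" language word list.

    Returns (kept, set_aside); the second list is meant for hand checking.
    """
    kept = sorted(w for w in words if w not in polluting_words)
    set_aside = sorted(w for w in words if w in polluting_words)
    return kept, set_aside


def trigrams(word):
    """Return the 3-grams of a word padded with '#' at both ends."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def flag_improbable(words, min_trigram_count=2):
    """Flag words containing at least one rare 3-gram (assumed criterion)."""
    counts = Counter(t for w in words for t in trigrams(w))
    return sorted(w for w in words
                  if any(counts[t] < min_trigram_count for t in trigrams(w)))


def strip_diacritics(word):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


def diacritic_variants(words):
    """Group words that differ only in their diacritical marks."""
    groups = defaultdict(set)
    for w in words:
        groups[strip_diacritics(w)].add(w)
    return {base: sorted(ws) for base, ws in groups.items() if len(ws) > 1}
```

Each step returns the flagged or set-aside words rather than deleting them,
in keeping with the idea that suspect words are checked by hand afterwards.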

Please contact me at the address below if you are interested in applying
these techniques to a new language.

Kevin Scannell 
<scannell@slu.edu>
March 2004