Making the Training Text
The training text should be a plain text file containing
text similar to what you intend to write.
The larger the file, the better;
we think that 300K is a good size to aim for.
Example training texts that you could use are:
- Take all the documents you have written, and glue them
all together in one big document.
- Use novels; for example, we used Jane Austen's Emma from
Project Gutenberg. The problem with using
just one or two novels, however, is that particular words (like Emma or Alice)
occur very frequently, so novels are not ideal for a general-purpose training text.
- Use all the email messages you have written, and glue them
all together in one big document.
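
As a minimal sketch of the "glue them all together" suggestions above, assuming your documents are already plain-text files ending in .txt under a single folder, something like the following Python would do. The function name build_training_text, the folder layout and the *.txt pattern are illustrative assumptions, not part of Dasher.

```python
from pathlib import Path


def build_training_text(source_dir, output_path, pattern="*.txt"):
    """Concatenate every file matching `pattern` under `source_dir` into one file.

    Returns the size of the resulting training text in bytes.
    """
    out = Path(output_path).resolve()
    parts = []
    for path in sorted(Path(source_dir).rglob(pattern)):
        if path.resolve() == out:
            continue  # skip an earlier output file stored in the same folder
        # errors="replace" keeps going if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        if text:
            parts.append(text)
    # Separate documents with a blank line so they read as distinct paragraphs.
    combined = "\n\n".join(parts) + "\n"
    out.write_text(combined, encoding="utf-8")
    return len(combined.encode("utf-8"))
```

The returned byte count makes it easy to check how close the result is to the 300K target.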
How to make a general-purpose training text
Here's how I made the training text for the English version of
Dasher.
- Get lots of English documents. Get far more material than you think you need,
so that a well-balanced set of sentences can be selected from it
in a sensible way, as follows.
- Pre-process them all so that there is exactly one sentence per line.
I did this using a Perl program I wrote,
processbook.p,
with csh scripts like this:
    foreach f ( alice emma )
        processbook.p /books0/$f > /books/$f
    end
- Now, obtain a listing of the 2000 most frequent words in
the language. The idea is, since these words are common, it is important that we should
have them represented several times each in the final corpus, in a variety of
contexts. We will use these words to select which sentences are included
from our over-large corpus.
I obtained such a list from the internet and put it in a file called dict.
I removed from dict any absurdly common words that prevented the remaining steps from
working nicely.
- Use another program to select from each pre-processed book the sentences
that contain the 2000 required words. Go through the required words in order,
so that the resulting corpus is also ordered, with the top of the corpus containing
examples of use of the most common words; that way, the corpus can be shrunk by cutting
its tail off, and should still be an appropriate corpus for its size.
Glue the sentences together into plausible-sized paragraphs, so as to emulate
normal writing.
I did this step by using the Linux utility glimpse and my Perl
program corpus.p:
    rm /data/coll/mackay/books/*~
    glimpseindex -b -B -H ~/dasher/ /data/coll/mackay/books/
    corpus.p k=1 f=4 o=corpus4.txt
That's how I made this corpus (316K),
which is used in Dasher 1.6.8.
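
The steps above can be sketched in Python as follows. This is only an illustration of the idea, not a reconstruction of processbook.p or corpus.p, whose source is not shown here. The function names, the naive sentence-splitting rule and the per_word and per_paragraph parameters are all assumptions made for the sketch; in particular, per_word is not claimed to correspond to the k or f options of corpus.p.

```python
import re

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# This mis-splits abbreviations such as "Mr. Knightley".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_WORD = re.compile(r"[a-z']+")


def split_sentences(text):
    """Return the sentences of `text`, one string per sentence."""
    flat = " ".join(text.split())  # undo hard line wraps
    return [s for s in _SENTENCE_END.split(flat) if s]


def select_sentences(sentences, required_words, per_word=4):
    """Pick up to `per_word` unused sentences for each required word.

    Words are processed in the order given (most frequent first), so the
    most common words are illustrated at the top of the result.
    """
    # Index: word -> positions of the sentences that contain it.
    index = {}
    for pos, sentence in enumerate(sentences):
        for word in set(_WORD.findall(sentence.lower())):
            index.setdefault(word, []).append(pos)
    used = set()
    chosen = []
    for word in required_words:
        taken = 0
        for pos in index.get(word.lower(), []):
            if taken == per_word:
                break
            if pos not in used:  # never include the same sentence twice
                used.add(pos)
                chosen.append(sentences[pos])
                taken += 1
    return chosen


def make_paragraphs(sentences, per_paragraph=5):
    """Join consecutive sentences into paragraphs separated by blank lines."""
    groups = [
        sentences[i:i + per_paragraph]
        for i in range(0, len(sentences), per_paragraph)
    ]
    return "\n\n".join(" ".join(group) for group in groups)
```

Because select_sentences keeps the order of the word list, shortening its result (for example chosen[:1000]) before calling make_paragraphs is one way to cut the tail off while keeping the examples of the most common words.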
If people make good corpora in other languages and wish to share
them, I can put them on this site.
The Inference Group is supported by the Gatsby Foundation and by a partnership award from IBM Zurich Research Laboratory.
David MacKay
Site last modified Fri Oct 1 10:33:20 BST 2010