Inference Group: Dasher Project: Making the Training Text

Making the Training Text

The training text should be a plain text file containing text similar to what you intend to write. The larger the better; we think that about 300K is a good size to aim for.

Here are some ways to obtain a training text:

Take all the documents you have written, and glue them together into one big document.
Use novels; e.g., we used Jane Austen's Emma from Project Gutenberg. The problem with using just one or two novels, however, is that particular words (like Emma or Alice) occur very frequently, so novels are not ideal for a general-purpose training text.
Use all the email messages you have written, and glue them together into one big document.
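
The "glue them all together" step can be sketched in a few lines of Python. This is only an illustration, not part of Dasher: the function names, the `*.txt` pattern, and the choice to separate documents with a blank line are all assumptions.

```python
from pathlib import Path


def glue_documents(texts):
    """Join documents with one blank line between them, skipping empty ones."""
    cleaned = [t.strip() for t in texts if t.strip()]
    return "\n\n".join(cleaned) + "\n"


def build_training_text(source_dir, output_path, pattern="*.txt"):
    """Concatenate every matching file under source_dir into output_path.

    Returns the size of the result in bytes, so it can be compared with
    the 300K target mentioned above. All names here are hypothetical.
    """
    out = Path(output_path).resolve()
    # Skip the output file itself in case it lives inside source_dir.
    paths = [p for p in sorted(Path(source_dir).rglob(pattern))
             if p.resolve() != out]
    text = glue_documents(
        p.read_text(encoding="utf-8", errors="replace") for p in paths
    )
    out.write_text(text, encoding="utf-8")
    return len(text.encode("utf-8"))
```

Calling `build_training_text("my_documents", "training.txt")` (both paths are placeholders) writes the combined file and reports its size; if the result is much smaller than 300K, add more material.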

How to make a general-purpose training text

Here's how I made the training text for the English version of Dasher.

Get lots of English documents. Get far more material than you think you need, so that we can select a well-balanced set of sentences in a sensible way, as follows.
Pre-process them all so that there is exactly one sentence per line.
I did this using a Perl program I wrote, processbook.p, with scripts like this:
```
# csh loop: write a one-sentence-per-line copy of each book
foreach f ( alice emma )
  processbook.p  /books0/$f > /books/$f
end
```
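processbook.p itself is not shown here, so the following Python sketch is not a reconstruction of it; it only illustrates the one-sentence-per-line idea. The splitting rule is an assumption and is deliberately naive: it breaks after `.`, `!` or `?` when the next word starts with a capital letter or an opening quote, so it will wrongly split after abbreviations such as "Mr." and will miss sentences that end with a closing quote.

```python
import re

# Whitespace that follows ., ! or ? and precedes a capital letter,
# an opening quote, or an opening parenthesis.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'(])")


def one_sentence_per_line(text):
    """Return the sentences of text as a list, one entry per sentence."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    if not flat:
        return []
    return _SENTENCE_BOUNDARY.split(flat)


def preprocess(text):
    """Return text rewritten with exactly one sentence on each line."""
    return "\n".join(one_sentence_per_line(text)) + "\n"
```

For a language other than English the boundary pattern would need to change, since it relies on capital letters and on Western punctuation.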
Now, obtain a listing of the 2000 most frequent words in the language. The idea is, since these words are common, it is important that we should have them represented several times each in the final corpus, in a variety of contexts. We will use these words to select which sentences are included from our over-large corpus.
I obtained such a list from the internet and put it in a file called dict. I removed from dict any absurdly common words that prevented the remaining steps from working nicely.
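If no ready-made frequency list is available for your language, one alternative (not what I did) is to count words in the over-large collection itself. The sketch below assumes English-style words made of the letters a to z with optional internal apostrophes; `skip_top` is a hypothetical knob standing in for the manual removal of absurdly common words, and how many to drop is a judgment call.

```python
import re
from collections import Counter

# Lower-case words, allowing internal apostrophes as in "don't".
_WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")


def frequent_words(text, n=2000, skip_top=0):
    """Return n words from text ordered from most to least frequent.

    The skip_top most frequent words are dropped first.
    """
    counts = Counter(_WORD.findall(text.lower()))
    ranked = [word for word, _ in counts.most_common()]
    return ranked[skip_top:skip_top + n]
```

Writing the result to a file with one word per line gives something that can play the role of dict in the steps that follow.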
Use another program to select from each pre-processed book the sentences that contain the 2000 required words. Go through the required words in order, so that the resulting corpus is also ordered, with the top of the corpus containing examples of use of the most common words; that way, the corpus can be shrunk by cutting its tail off, and should still be an appropriate corpus for its size.
Glue the sentences together into plausible-sized paragraphs, so as to emulate normal writing.
I did this step using the Linux utility glimpse and my Perl program corpus.p:
```
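# remove editor backup files (names ending in ~) so they are not indexed
rm  /data/coll/mackay/books/*~
# build the glimpse index of the pre-processed books
glimpseindex -b  -B   -H ~/dasher/  /data/coll/mackay/books/
corpus.p k=1 f=4 o=corpus4.txt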
```
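The selection and gluing steps can also be sketched in Python without glimpse. This is a simplified illustration, not a description of corpus.p: `per_word` and `per_paragraph` are hypothetical parameters and are not meant to mirror the `k` and `f` arguments above, and fixed-length paragraphs are a crude stand-in for plausible-sized ones.

```python
import re

_WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")


def select_sentences(sentences, required_words, per_word=3):
    """Pick up to per_word unused sentences for each required word, in order.

    Because required_words is walked from most to least common, the top of
    the result covers the most common words, so the tail can be cut off.
    """
    # Map each word to the indices of the sentences that contain it.
    index = {}
    for i, sentence in enumerate(sentences):
        for word in set(_WORD.findall(sentence.lower())):
            index.setdefault(word, []).append(i)

    used = set()
    selected = []
    for word in required_words:
        taken = 0
        for i in index.get(word, []):
            if taken == per_word:
                break
            if i in used:
                continue  # each sentence appears at most once in the corpus
            used.add(i)
            selected.append(sentences[i])
            taken += 1
    return selected


def glue_paragraphs(sentences, per_paragraph=5):
    """Join sentences into paragraphs separated by blank lines."""
    paragraphs = [
        " ".join(sentences[i:i + per_paragraph])
        for i in range(0, len(sentences), per_paragraph)
    ]
    return "\n\n".join(paragraphs)
```

Shrinking the corpus then amounts to dropping sentences from the end of the list returned by `select_sentences` before gluing.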
That's how I made this corpus (316K), which is used in Dasher 1.6.8.

If people make good corpora in other languages and wish to share them, I can put them on this site.

The Inference Group is supported by the Gatsby Foundation
and by a partnership award from IBM Zurich Research Laboratory

David MacKay

Site last modified Fri Oct 1 10:33:20 BST 2010