Inference Group: Dasher Project: 18

Combining accents

Many languages make use of accents or other marks that modify characters. For example, some European languages use acute, circumflex, umlaut, grave, cedilla, or macron (over-bar) accents, as shown in the example image above right.

Unicode provides two ways to create such characters: either with a single 'COMPOSED' character such as "è", "é", "ê", or "ë"; or by COMBINING a simple "e" with a second character, a 'combining accent', such as "◌̀", "◌́", or "◌̂". When you use the second, 'DECOMPOSED' form, if the font on your computer knows what it is doing, it should automatically replace the sequence "e◌́" by "é" (to see if your browser does it right, look at this: "é").
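The difference between the two forms can be checked in code. The following is a minimal Python sketch, not part of Dasher, that uses the standard `unicodedata` module to list the code points in each form of "é"; the helper name `describe` is made up for this illustration.

```python
import unicodedata

composed = "\u00e9"      # one code point: e with acute
decomposed = "e\u0301"   # two code points: "e" plus a combining acute accent


def describe(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


print(describe(composed))
print(describe(decomposed))

# The two strings look the same on screen but are not equal as code point sequences.
print(composed == decomposed)

# Normalising the decomposed form to NFC yields the composed character.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The two strings compare as different even though a well-behaved font draws them identically, which is why the form used in a text matters to any program that compares characters one by one.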

In Dasher, there are similarly two ways to set things up so that you can write characters with accents. The simplest method, which we used in most alphabets in version 3, is to include all the accented characters in the alphabet, as illustrated by the upper alphabet in this French alphabet file. This method is clumsy, because it makes the alphabet larger than necessary, and it can be tricky for users to learn to find the accented characters.

Especially if the user thinks of the accent as being written after the basic character is written, it may be more natural to use the second method -- an alphabet with combining characters, as illustrated by the second alphabet in this French file.

This is also an elegant solution for languages in which accents are used very rarely, such as English. For people who occasionally want to write accented words such as passé in English, we recommend using English with combining characters. For an alphabet that supplies the full set of combining accents needed for any Latin ISO-8859 alphabet, see Latin ISO-8859.

If you use combining characters, you must be aware of three things. (1) The training text must also use combining characters in the same way. If you have a training text consisting of COMPOSED characters, you should convert it into decomposed form. On Linux systems, decomposing can be done using a Perl program like this. (2) The output file will consist of decomposed characters. Not all computer programs display these characters correctly, so you may wish to run the output file through a composing filter. (3) When you have combining accents, it will be possible to write nonsense in Dasher, e.g., to attach accents to characters that should not have them in that language, or to stack up multiple accents in ways that no language allows; but the language model should make such sequences difficult to write.
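The conversions in points (1) and (2) can be sketched with Unicode normalisation. The Python example below is an illustration only, not the Perl program mentioned above. It assumes that Unicode canonical decomposition (NFD) produces the same sequences that the combining alphabet expects, and that the files are UTF-8 encoded; the function names and file paths are hypothetical.

```python
import unicodedata


def decompose(text):
    """Split composed characters into base letter plus combining accents (NFD)."""
    return unicodedata.normalize("NFD", text)


def compose(text):
    """Merge base letters and combining accents into composed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


def convert_file(src_path, dst_path, convert):
    """Read a UTF-8 text file, apply convert to its contents, write the result."""
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(convert(text))


if __name__ == "__main__":
    word = "pass\u00e9"                # "passé" with a composed e-acute
    print(len(word))                   # composed length
    print(len(decompose(word)))        # one code point longer after decomposition
    print(compose(decompose(word)) == word)
```

Under those assumptions, a training text would be prepared with something like `convert_file("training.txt", "training_decomposed.txt", decompose)`, and Dasher's output passed through `convert_file("output.txt", "output_composed.txt", compose)` before being opened in a program that handles combining accents badly.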

The behaviour of Dasher should be much the same whichever alphabet you use. The language model's predictions may be a little different when you use combining accents.

A third option, which will work in Dasher version 4 when nested groups are supported correctly, is to use a long alphabet with composed characters, but render the groups of related characters differently. The third French alphabet illustrates this option.

Languages for which we provide and recommend a combining alphabet (number of accented characters):
Catalan (10); Czech (13); French; Italian (6); Portuguese (16); Spanish.

Languages for which a combining alphabet might be good, but we have not made it yet (number of accented characters):
Breton (9); Corsican (5); Dutch (5); Esperanto (6); Galician; German (3); Hungarian (9); Kurdish (7); Gaelic (5); Latvian (13); Lithuanian (9); Luxembourgish (7); Macedonian? Serbian?; Maltese (3); Moldavian (7); Occitan (12); Polish (9); Romansch (7); Scots Gaelic (7); Slovak (many); Slovene (3); Welsh (28).

Languages for which we think an ordinary composed alphabet is probably OK (number of accented characters):
Albanian (2); Basque (1); Bosnian and Croatian (4, and one non-ascii character (Đ)); Danish (7) (but the conventional alphabetical order has "Ä" and "Å" at the end of the alphabet, away from "A" "Á", so combining accents would not be helpful?); Estonian (6) (like Danish); Faroese (like Danish); Finnish (like Danish); Icelandic (like Danish); Norwegian (like Danish); Swedish (like Danish) (3+).