Combining accents
Many languages make use of accents or other marks
that modify characters.
For example, some European languages use acute, circumflex, umlaut, grave, cedilla, or macron (over-bar)
accents, as shown in the example image above right.
Unicode provides two ways to create such characters:
either with a single `COMPOSED' character such as
"è",
"é",
"ê", or
"ë";
or by COMBINING a simple "e" with a second character, a `combining accent', such as
"◌̀",
"◌́", or
"◌̂". When you use the second `DECOMPOSED' form,
if the font on your computer knows what it is doing, it should replace the
sequence
"e◌́" automatically by "é" (to see if your browser
does it right, look at this: "é").
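The difference between the two forms can be inspected directly. The following is a minimal sketch in Python, using only the standard unicodedata module; it is an illustration and not part of Dasher. The variable names are chosen for this example.

```python
import unicodedata

composed = "\u00e9"      # single precomposed character: e with acute
decomposed = "e\u0301"   # plain "e" followed by COMBINING ACUTE ACCENT

# The two strings should render alike, but they are different code point sequences.
print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False

# Unicode normalization converts between the two forms:
# NFD decomposes, NFC composes.
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.name(decomposed[1]))  # COMBINING ACUTE ACCENT
```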
In Dasher, there are similarly two ways to set things up so that you can
write characters with accents.
The simplest method, which we used in most alphabets in version 3,
is to include in the alphabet all the accented
characters, as illustrated by
the upper
alphabet in this French alphabet file;
this method is clumsy, because it makes a larger alphabet than necessary,
and it can be tricky for users to learn to find the accented characters.
Especially if the user thinks of the accent as being written after
the basic character is written, it may be more natural to use
the second method -- an alphabet with combining characters, as
illustrated by
the second
alphabet in this French file.
This is also an elegant solution for languages in which the accents are used
very rarely, such as English.
For people who want to occasionally write accented words such as passé
in English, we recommend using
english
with combining characters.
For an alphabet that supplies the
full set of all combining accents needed for any
Latin ISO-8859 alphabet, see
Latin ISO-8859.
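To give a rough feel for how much smaller a combining alphabet can be, the sketch below counts the distinct combining accents needed for a set of accented French letters. The letter list is an assumption made for this illustration; it is not taken from Dasher's French alphabet file, and it leaves out ligatures such as "œ", which have no decomposed form.

```python
import unicodedata

# Hypothetical list of accented lower-case French letters (illustration only):
# a-grave, a-circumflex, c-cedilla, e-acute, e-grave, e-circumflex,
# e-diaeresis, i-circumflex, i-diaeresis, o-circumflex, u-grave,
# u-circumflex, u-diaeresis, y-diaeresis
composed_letters = (
    "\u00e0\u00e2\u00e7\u00e9\u00e8\u00ea\u00eb"
    "\u00ee\u00ef\u00f4\u00f9\u00fb\u00fc\u00ff"
)

bases = set()
marks = set()
for letter in composed_letters:
    # NFD splits each letter into its base character plus combining marks.
    base, *accents = unicodedata.normalize("NFD", letter)
    bases.add(base)
    marks.update(accents)

print(len(composed_letters), "composed letters")   # 14 composed letters
print(len(marks), "combining accents")             # 5 combining accents
for mark in sorted(marks):
    print(f"U+{ord(mark):04X}", unicodedata.name(mark))
```

Under these assumptions, a composed alphabet needs 14 extra symbols where a combining alphabet needs 5, since the base letters are already present.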
If you use combining characters, you must be aware of three things:
(1) the training text must also use combining characters in the same way.
If you have a training text consisting of COMPOSED characters then you
should convert it into decomposed form. On Linux systems decomposing can be done using
a Perl program like this.
(2) the output file will consist of decomposed characters. Not all computer programs display
these characters correctly, so you may wish to run the output file through a composing filter.
(3) it will be possible to write nonsense in Dasher when you have combining
accents, e.g., to attach accents to characters
that should not have them in that language, or stack up multiple accents in ways
that no language allows; but the language model should make such sequences
difficult to write.
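The conversions mentioned in points (1) and (2) can be sketched with Unicode normalization. The code below is a minimal Python illustration, not the Perl program referred to above and not part of Dasher; the function names are hypothetical. Note that NFD decomposes every canonically decomposable character in the text, not only Latin accented letters. The last helper is a crude check related to point (3): it flags any place where two combining marks follow one another, on the assumption that the combining marks of interest have a non-zero canonical combining class, which holds for the common Latin accents.

```python
import unicodedata


def decompose(text: str) -> str:
    """Convert text to decomposed form (NFD), e.g. for a training text."""
    return unicodedata.normalize("NFD", text)


def compose(text: str) -> str:
    """Convert text to composed form (NFC), e.g. as an output filter."""
    return unicodedata.normalize("NFC", text)


def has_stacked_accents(text: str) -> bool:
    """Return True if two or more combining marks occur in a row."""
    previous_was_mark = False
    for ch in unicodedata.normalize("NFD", text):
        # combining() returns the canonical combining class; 0 means
        # the character is not treated as a combining mark here.
        is_mark = unicodedata.combining(ch) != 0
        if is_mark and previous_was_mark:
            return True
        previous_was_mark = is_mark
    return False


training = decompose("pass\u00e9")        # "passe" + COMBINING ACUTE ACCENT
print(len("pass\u00e9"), len(training))   # 5 6
print(compose(training) == "pass\u00e9")  # True
print(has_stacked_accents("e\u0301\u0302"))  # True
print(has_stacked_accents("pass\u00e9"))     # False
```

To convert a whole file, one could read it with an explicit encoding such as UTF-8, pass the contents through decompose or compose, and write the result back out.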
The behaviour of Dasher should be much the same whichever alphabet you use.
The language model's predictions may be a little different when you use
combining accents.
A third option, which will work in Dasher version 4 when nested groups are
supported correctly, is to use a long alphabet with composed characters,
but render the groups of related characters differently.
The
third French alphabet
illustrates this option.
Languages for which we provide and recommend a combining alphabet (number of accented characters):
Catalan (10); Czech (13); French; Italian (6); Portuguese (16); Spanish.
Languages for which a combining alphabet might be good, but we have not made it yet (number of accented characters):
Breton (9); Corsican (5); Dutch (5); Esperanto (6); Galician; German (3); Hungarian (9); Kurdish (7); Gaelic (5); Latvian (13); Lithuanian (9); Luxembourgish (7); Macedonian? Serbian?; Maltese (3); Moldavian (7); Occitan (12); Polish (9); Romansch (7); Scots Gaelic (7); Slovak (many); Slovene (3); Welsh (28).
Languages for which we think an ordinary composed alphabet is probably OK (number of accented characters):
Albanian (2); Basque (1); Bosnian and Croatian (4, and one non-ASCII character (Đ)); Danish (7) (but the conventional alphabetical order has "Ä" and "Å" at the end of the alphabet, away from "A" and "Á", so combining accents would not be helpful?); Estonian (6) (like Danish); Faroese (like Danish); Finnish (like Danish); Icelandic (like Danish); Norwegian (like Danish); Swedish (like Danish) (3+).