Aspell tutorial

Aspell: Create a new language dictionary
Create a dictionary template
Getting aspell-lang

Several files are essential for creating an Aspell dictionary. You can create them by hand, but it is more convenient to generate a basic template with aspell-lang and fill in the contents.

First, download aspell-lang. Find the latest distribution file, download it, and unpack it. For example, at the time of writing:

$ curl -O ftp://ftp.gnu.org/gnu/aspell/aspell-lang-20101122.tar.gz
$ tar xzvf aspell-lang-20101122.tar.gz

Alternatively, you can download the CVS repository as follows:

$ cvs -z3 -d:pserver:[email protected]:/sources/aspell co aspell-lang

pre: Create the default directory

Now go into the aspell-lang directory and run the command pre LANG CHARSET. For example, for Manchu, with mnc as LANG and iso-8859-1 as CHARSET, run the following commands.

$ cd aspell-lang
$ ./pre mnc iso-8859-1

Now you can see that a directory called mnc has been created. It contains the following files.

$ cd mnc
$ ls
Copyright info mnc.dat mnc.wl proc

You need to fill in these files with the appropriate content. The following sections explain what goes in each.
LANG: Language code

I used the name mnc for Manchu. This name follows the ISO 639 language codes: English is en and Korean is ko. If a two-letter code exists, use it. Manchu uses the three-letter code mnc because it has no two-letter code.

Sometimes several dictionaries exist for the same language and need to be told apart. In that case the dictionary name has the form language_region-variant-size. For example, English is en, American English is en_US, British English is en_GB, and Canadian English is en_CA. German is de, and the dictionary that follows the old spelling is named de-alt.
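
The naming scheme can be illustrated with a small Python sketch. It is not part of Aspell; the function name split_dict_name and the regular expression are assumptions made for illustration, and the parts after the region are returned as-is without guessing which is the variant and which is the size.

```python
import re

# Assumed shape: language code, optional _REGION, optional -extra parts.
_NAME = re.compile(r"^([a-z]{2,3})(?:_([A-Z]{2}))?((?:-[A-Za-z0-9_]+)*)$")

def split_dict_name(name):
    """Split a dictionary name such as 'en_GB' or 'de-alt' into its parts."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError("unrecognized dictionary name: " + name)
    lang, region, rest = match.groups()
    # rest looks like '-alt' or '' ; drop the empty piece before the first '-'.
    extras = rest.split("-")[1:] if rest else []
    return {"lang": lang, "region": region, "extras": extras}
```

For example, split_dict_name("de-alt") yields the language de, no region, and the single extra part alt.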
CHARSET: character set

I used the iso-8859-1 character set for Manchu. It corresponds to the first 256 characters of Unicode. If you are creating a dictionary for a language written only with the basic Latin alphabet, you can use iso-8859-1.

In Aspell, the character set and the encoding are separate things. Aspell handles Unicode without problems, but internally it processes text as 8-bit characters. The character sets supported by default are defined in individual files in the aspell-lang/maps directory.

You may need to define a new character set for the language you want to support. If you are using Unicode characters, you must define the character set with u-<LANG>.cmap and u-<LANG>.cset files, or with l-<LANG>.cmap and l-<LANG>.cset files.

Let's look at an example. Suppose that, besides the letters a-z, Manchu needs two Unicode characters, š (U+0161) and ū (U+016B). In aspell-lang/maps, create a file named l-mnc.txt as shown below.

include iso-8859-1.txt

0x80 U+0161
0x81 U+016B

Here, include iso-8859-1.txt pulls in every character of iso-8859-1, including all ASCII characters. The lines below it list the additional characters. 0x80 and 0x81 are the codes Aspell uses internally for them.
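
To make the idea of an internal 8-bit code concrete, here is a minimal Python sketch of the mapping described above. It only models the concept and is not how Aspell itself is implemented; the names OVERRIDES, to_unicode, and to_internal are invented for this illustration.

```python
# Internal 8-bit code -> Unicode character, as described for l-mnc.txt.
OVERRIDES = {0x80: "\u0161", 0x81: "\u016b"}  # š and ū

def to_unicode(data):
    """Decode internal 8-bit bytes: overrides first, iso-8859-1 otherwise."""
    return "".join(
        OVERRIDES.get(byte, bytes([byte]).decode("iso-8859-1")) for byte in data
    )

def to_internal(text):
    """Encode Unicode text into the internal 8-bit representation."""
    reverse = {char: code for code, char in OVERRIDES.items()}
    out = bytearray()
    for char in text:
        if char in reverse:
            out.append(reverse[char])
        else:
            # Raises UnicodeEncodeError for characters outside the set.
            out.extend(char.encode("iso-8859-1"))
    return bytes(out)
```

With this table, the word šun is stored as the three bytes 0x80, u, n, and decoding those bytes gives šun back.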

You can now generate the l-mnc.cmap and l-mnc.cset files from the aspell-lang directory with the mkchardata command, then copy them into the mnc directory. l-mnc.txt is usually kept under mnc/misc.

$ ./mkchardata maps/l-mnc.txt
$ cp maps/l-mnc.* mnc/
$ mkdir mnc/misc
$ mv mnc/l-mnc.txt mnc/misc/

Writing basic information

The following files are now in the mnc directory.

$ ls
Copyright info mnc.dat mnc.wl proc

Here you need to put information about the dictionary in info and LANG.dat. LANG.wl is a word list, which we'll discuss in the next section.
info

First, create the info file, which records information about the dictionary. For Manchu it can look like this.

name_english Manchu
lang mnc
data-file mnc_affix.dat
author:
  name You Hyun Jo
  email you at cpan.org
copyright GPLv3
url https://github.com/youhyunjo/manchu-spell
version 20130123-0
source-version 20130123
complete false
accurate true
alias mnc manchu
dict:
  name mnc
  add mnc

The files listed under data-file become required files. If a word list alone is sufficient, no additional data files are needed. Manchu has many kinds of affixes and therefore needs affix rules, so mnc_affix.dat is added.
LANG.dat

Now fill in the LANG.dat file. Its contents are simple. For Manchu the file is mnc.dat, and it contains the following.

name mnc
charset iso-8859-1
data-encoding utf-8
affix mnc
affix-compress true
partially-expand true
repl-table mnc_affix.dat
special '- **

If you use only ASCII characters, specify charset iso-8859-1. If you created a separate character set, use its name instead; for example, charset l-mnc for the character set created earlier. The last line, special, allows special characters to be treated as part of a word. The Aspell manual describes it in detail.
Licensing and Copyright

Next, write the Copyright file, which holds the copyright notice. Write whatever is appropriate for your dictionary. COPYING files for widely used licenses are provided automatically.
proc script

Now you can run the proc script to generate the remaining files automatically. However, the mnc_affix.dat file listed under data-file does not exist yet. Create an empty file for now, then run the proc script.

$ touch mnc_affix.dat
$ ./proc
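
When an info file lists several data files, the same check can be done in bulk. The following Python sketch, under the assumption that each data-file line holds file names separated by whitespace, creates any listed file that is missing, like touch does. The helper names data_files and create_missing are hypothetical and not part of aspell-lang.

```python
import os
import tempfile

def data_files(info_text):
    """Return the file names listed on data-file lines of an info file."""
    names = []
    for line in info_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "data-file":
            names.extend(fields[1:])
    return names

def create_missing(info_text, directory):
    """Create an empty file for every listed data file that does not exist."""
    created = []
    for name in data_files(info_text):
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            open(path, "a").close()  # same effect as touch on a missing file
            created.append(name)
    return created
```

Calling create_missing with the text of info and the path of the mnc directory returns the names of the files it had to create.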

This creates several necessary files: README, COPYING, and so on are provided, and configure, Makefile.pre, .alias, .multi, and so on are generated.
Create word lists and affix rules

Everything is ready; now it is time to build the actual dictionary. If a plain list of words covers the language, creating LANG.wl is sufficient. The word list file LANG.wl contains one word per line. If words do not change form, a word list is all you need. If words change form, you must create both a word list and affix rules. In most cases you need to create LANG_affix.dat, because affix rules are required.

Let's take Manchu as an example. The Manchu noun beye takes the forms beyei, beyede, and beyebe; here -i, -de, and -be are suffixes. The verb arambi takes the forms arame, arafi, araha, arara, and araci; ara- is the stem, and -mbi, -me, -fi, -ha, -ra, and -ci are the endings. Likewise, the verb genembi takes the forms geneme, genefi, genehe, genere, and geneci; gene- is the stem, and -mbi, -me, -fi, -he, -re, and -ci are the endings. The choice between -ha/-he and -ra/-re depends on the form of the stem.

Let's create mnc.wl as follows.

beye/N
ara/V
gene/V

Here beye, ara, and gene are stems. N and V are flags that identify the affix group; each group of affixes is represented by a single letter. mnc_affix.dat must describe the rules for the N and V affix groups. For example:

SFX N Y 3
SFX N 0 i .
SFX N 0 de .
SFX N 0 be .

SFX V Y 8
SFX V 0 mbi .
SFX V 0 me .
SFX V 0 fi .
SFX V 0 ha a
SFX V 0 he e
SFX V 0 ra a
SFX V 0 re e
SFX V 0 ci .

SFX in the first line means suffix. N is the name that identifies the affix group. Y means that the group can be combined with prefixes. 3 means that the suffix group N contains three rules.

The second line is an actual suffix rule. SFX N names the suffix group N, and the three columns that follow give the characters to strip, the characters to add, and the condition, respectively. In 0 i ., the 0 means that nothing is stripped from the stem, i means that i is appended, and . means that there is no condition.

Now look at the suffix group V. SFX V 0 ha a means: when the stem ends with a, append ha without stripping anything. SFX V 0 he e means: when the stem ends with e, append he without stripping anything. Thus ara/V expands to araha and gene/V expands to genehe.
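
The way these rules expand a stem can be modeled in a few lines of Python. This is a simplified sketch for illustration, not Aspell's implementation: it handles only the cases used here (a literal condition or . for no condition, and 0 for nothing to strip), and the names RULES and expand are invented for the example.

```python
# Each rule is (strip, add, condition), mirroring the SFX lines for N and V.
RULES = {
    "N": [("0", "i", "."), ("0", "de", "."), ("0", "be", ".")],
    "V": [
        ("0", "mbi", "."), ("0", "me", "."), ("0", "fi", "."),
        ("0", "ha", "a"), ("0", "he", "e"),
        ("0", "ra", "a"), ("0", "re", "e"),
        ("0", "ci", "."),
    ],
}

def expand(entry):
    """Expand a word-list entry such as 'ara/V' into the stem and its forms."""
    stem, _, flags = entry.partition("/")
    forms = [stem]
    for flag in flags:
        for strip, add, condition in RULES[flag]:
            # '.' means no condition; otherwise the stem must end with it.
            if condition != "." and not stem.endswith(condition):
                continue
            if strip == "0":
                base = stem
            elif stem.endswith(strip):
                base = stem[: -len(strip)]
            else:
                continue
            forms.append(base + add)
    return forms
```

Under these rules, expand("gene/V") skips the ha and ra rules because gene does not end with a, and produces genehe and genere instead.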

Install

All files are now ready. Before installing, you must build the dictionary file. The following commands create LANG.cwl and then LANG.rws, the dictionary file that gets installed.

$ ./configure
$ make

Installation is as follows.

$ make install

Of course, to use the dictionary you have created, you must have aspell 0.60 or later installed.
Using

The following command prints a list of dictionaries currently installed on the system.

$ aspell dump dicts

Now print the word list of an installed dictionary. The -l <lang> option selects the language. For Manchu:

$ aspell -l mnc dump master
ara/V
beye/N
gene/V

You can use expand and munch to check whether the affix rules in the installed dictionary work properly.

$ echo 'ara/V' | aspell -l mnc expand
arambi arame arafi araha arara araci
$ echo 'araha' | aspell -l mnc munch
araha ara/V
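
Conceptually, munch runs the suffix rules in reverse: it looks for a stem and flag that could have produced a given form. The Python sketch below illustrates that idea under the Manchu rules from this tutorial. It is a simplified model, not Aspell's algorithm, and it does not check whether the stem is actually in the word list; the names RULES and munch are chosen for the example.

```python
# (flag, strip, add, condition), mirroring the SFX lines for N and V.
RULES = [
    ("N", "0", "i", "."), ("N", "0", "de", "."), ("N", "0", "be", "."),
    ("V", "0", "mbi", "."), ("V", "0", "me", "."), ("V", "0", "fi", "."),
    ("V", "0", "ha", "a"), ("V", "0", "he", "e"),
    ("V", "0", "ra", "a"), ("V", "0", "re", "e"),
    ("V", "0", "ci", "."),
]

def munch(word):
    """Return candidate 'stem/flag' entries that could produce the word."""
    candidates = []
    for flag, strip, add, condition in RULES:
        if not word.endswith(add):
            continue
        # Undo the rule: remove what was added, restore what was stripped.
        stem = word[: -len(add)] + ("" if strip == "0" else strip)
        if not stem:
            continue
        if condition != "." and not stem.endswith(condition):
            continue
        candidates.append(stem + "/" + flag)
    return candidates
```

In this model munch("araha") finds ara/V, while munch("arahe") finds nothing, because the he rule requires a stem ending in e.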

If you want to see how spell checking works:

$ echo 'arahe' | aspell -l mnc -a
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.6.1)
& arahe 2 0: arah, arah
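
The -a option makes aspell speak the Ispell pipe protocol: a line starting with & reports a misspelled word, the number of suggestions, the offset of the word in the input line, and the suggestions themselves; a line starting with # reports a misspelled word with no suggestions; and * reports a correct word. A minimal Python sketch of a reader for such lines follows. The function name parse_result and the returned dictionary layout are assumptions for illustration, and the sketch ignores other line types such as ? guesses.

```python
def parse_result(line):
    """Parse one result line of Ispell pipe-mode output (as from aspell -a)."""
    line = line.strip()
    if not line or line.startswith("@"):
        return None  # blank separator or version banner
    if line[0] in "*+-":
        return {"status": "ok"}
    if line.startswith("#"):
        _, word, offset = line.split()
        return {"status": "miss", "word": word,
                "offset": int(offset), "suggestions": []}
    if line.startswith("&"):
        head, _, tail = line.partition(":")
        _, word, _count, offset = head.split()
        suggestions = [item.strip() for item in tail.split(",") if item.strip()]
        return {"status": "miss", "word": word,
                "offset": int(offset), "suggestions": suggestions}
    return None  # line type not handled by this sketch
```

Feeding it each line of the output above yields None for the banner and a record for arahe with its offset and suggestion list.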

Deploying

Once you are confident that the dictionary is built correctly, you can distribute it. The distribution file can be created from the dictionary directory with the following command:

$ make dist

This command creates the file aspell6-mnc-20130123-0.tar.bz2, which you can distribute.
Installing Aspell

On Debian Linux, you can install it with the following command:

$ sudo apt-get install