pavelsayekat

Aspell tutorial

Jun 19th, 2018
  1. Aspell: Create a new language dictionary
  2. Create a dictionary template
  3. Getting aspell-lang
  4.  
  5. Several files are essential for creating an Aspell dictionary. You can create them yourself, but it is more convenient to generate a basic template with aspell-lang and fill in the contents.
  6.  
  7. First, download aspell-lang: find the latest distribution file, download it, and unpack it. For example, with the current release:
  8.  
  9. $ curl -O ftp://ftp.gnu.org/gnu/aspell/aspell-lang-20101122.tar.gz
  10. $ tar xzvf aspell-lang-20101122.tar.gz
  11.  
  12. Alternatively, you can download the CVS repository as follows:
  13.  
  14. $ cvs -z3 -d:pserver:anonymous@cvs.savannah.gnu.org:/sources/aspell co aspell-lang
  15.  
  16. pre: Create the default directory
  17.  
  18. Now go into the aspell-lang directory and run the pre command in the form pre LANG CHARSET. For Manchu, with mnc as LANG and iso-8859-1 as CHARSET, the command is:
  19.  
  20. $ cd aspell-lang
  21. $ ./pre mnc iso-8859-1
  22.  
  23. Now you can see that a directory called mnc has been created. It contains the following files.
  24.  
  25. $ cd mnc
  26. $ ls
  27. Copyright info mnc.dat mnc.wl proc
  28.  
  29. You should fill in these files with the appropriate content. The following sections explain what to put in them.
  30. LANG: Language code
  31.  
  32. I used the name mnc for Manchu. This name follows the ISO 639 language codes: English is en and Korean is ko. If a two-letter code exists, use it. Manchu uses the three-letter code mnc because it has no two-letter code.
  33.  
  34. In some cases, several dictionaries exist for the same language and need to be told apart. The dictionary name then takes the form "language_REGION-variant-size". For example, English is en, US English is en_US, British English is en_GB, and Canadian English is en_CA. German is de, and the dictionary that follows the old spelling is de-alt.
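The naming scheme can be taken apart with plain shell parameter expansion. The following is only a sketch: the function names dict_lang and dict_region are made up for illustration and are not part of Aspell or aspell-lang.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Aspell): split a dictionary name of the
# form language[_REGION][-variant[-size]] into language code and region.

dict_lang() {
    name=$1
    base=${name%%-*}              # drop "-variant-size" if present
    printf '%s\n' "${base%%_*}"   # drop "_REGION" if present
}

dict_region() {
    name=$1
    base=${name%%-*}
    case $base in
        *_*) printf '%s\n' "${base#*_}" ;;  # text after the underscore
        *)   printf '\n' ;;                 # no region in the name
    esac
}

dict_lang en_GB      # prints: en
dict_region en_GB    # prints: GB
dict_lang de-alt     # prints: de
```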
  35. CHARSET: character set
  36.  
  37. I used the iso-8859-1 character set for Manchu. It corresponds to the first 256 characters of Unicode. If you are creating a dictionary for a language that uses only the basic Latin alphabet, you can use iso-8859-1.
  38.  
  39. In Aspell, the character set and the encoding are two different things. Aspell can handle Unicode text without problems, but internally it processes every character as 8 bits. The character sets supported by default are defined in individual files in the aspell-lang/maps directory.
  40.  
  41. You may need to create a new character set for the language you want to support. If you use Unicode characters, you must define the character set with either u-<LANG>.cmap and u-<LANG>.cset files or l-<LANG>.cmap and l-<LANG>.cset files.
  42.  
  43. Let's look at an example. Suppose that, besides the letters a-z, you need two Unicode characters, š (U+0161) and ū (U+016B), to write Manchu. In aspell-lang/maps, create a file named l-mnc.txt as shown below.
  44.  
  45. include iso-8859-1.txt
  46.  
  47. 0x80 U+0161
  48. 0x81 U+016B
  49.  
  50. Here, include iso-8859-1.txt brings the ASCII characters into the Manchu character set. The additional characters are then listed below it. 0x80 and 0x81 are the codes that Aspell uses internally for them.
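Why single-byte internal codes are needed can be checked by counting bytes: in UTF-8, š (U+0161) and ū (U+016B) each occupy two bytes, so they do not fit an 8-bit table as they are. A minimal sketch, where utf8_bytes is a hypothetical helper and the characters are written as octal escapes so the script does not depend on the file's own encoding:

```shell
#!/bin/sh
# Hypothetical helper: count the bytes produced by a printf format string.
utf8_bytes() {
    # $1 is deliberately used as the format so octal escapes are expanded.
    printf "$1" | wc -c | awk '{ print $1 }'
}

utf8_bytes 'a'          # plain ASCII letter: one byte
utf8_bytes '\305\241'   # UTF-8 bytes C5 A1, i.e. U+0161 (š): two bytes
utf8_bytes '\305\253'   # UTF-8 bytes C5 AB, i.e. U+016B (ū): two bytes
```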
  51.  
  52. You can now create the l-mnc.cmap and l-mnc.cset files from the aspell-lang directory with the following commands, then copy them into the mnc directory. l-mnc.txt is usually kept under mnc/misc.
  53.  
  54. $ ./mkchardata maps/l-mnc.txt
  55. $ cp maps/l-mnc.* mnc/
  56. $ mkdir mnc/misc
  57. $ mv mnc/l-mnc.txt mnc/misc/
  58.  
  59. Writing basic information
  60.  
  61. The following files are now in the mnc directory.
  62.  
  63. $ ls
  64. Copyright info mnc.dat mnc.wl proc
  65.  
  66. Here you need to put information about the dictionary in info and LANG.dat. LANG.wl is the word list, which is covered in a later section.
  67. info
  68.  
  69. First, let's create the info file. This file records information about the dictionary. For Manchu, it can be written as follows.
  70.  
  71. name_english Manchu
  72. lang mnc
  73. data-file mnc_affix.dat
  74. author:
  75.   name You Hyun Jo
  76.   email you at cpan.org
  77. copyright GPLv3
  78. url https://github.com/youhyunjo/manchu-spell
  79. version 20130123-0
  80. source-version 20130123
  81. complete false
  82. accurate true
  83. alias mnc manchu
  84. dict:
  85.   name mnc
  86.   add mnc
  87.  
  88. The files named with data-file are required files. If a word list alone is enough, no additional data files are needed. Manchu has many kinds of affixes, so affix rules are necessary, which is why mnc_affix.dat is listed.
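Because every file named with data-file has to exist, it can help to check for missing ones before going further. The sketch below assumes the info file uses one "data-file NAME" line per file, as in the example above; missing_data_files is a hypothetical helper, not an aspell-lang tool.

```shell
#!/bin/sh
# Hypothetical helper: print each file named on a "data-file" line of an
# info file that does not exist in the given dictionary directory.
missing_data_files() {
    info=$1
    dir=$2
    awk '$1 == "data-file" { print $2 }' "$info" |
    while read -r f; do
        [ -e "$dir/$f" ] || printf '%s\n' "$f"
    done
}

# Demonstration in a throwaway directory.
demo=$(mktemp -d)
printf 'lang mnc\ndata-file mnc_affix.dat\n' > "$demo/info"
missing_data_files "$demo/info" "$demo"   # lists mnc_affix.dat as missing
touch "$demo/mnc_affix.dat"
missing_data_files "$demo/info" "$demo"   # prints nothing now
rm -rf "$demo"
```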
  89. LANG.dat
  90.  
  91. Now let's write the LANG.dat file. Its content is simple. For Manchu, the file is mnc.dat and contains the following.
  92.  
  93. name mnc
  94. charset iso-8859-1
  95. data-encoding utf-8
  96. affix mnc
  97. affix-compress true
  98. partially-expand true
  99. repl-table mnc_affix.dat
  100. special '- **
  101.  
  102. If you use only ASCII characters, set charset iso-8859-1. If you created a separate character set, use its name instead; for example, charset l-mnc for the character set created earlier. The last line, special, allows special characters to be treated as part of a word. The Aspell manual describes it in detail.
  103. Licensing and Copyright
  104.  
  105. Next, write the Copyright file, which holds the copyright notice. Write whatever is appropriate for your dictionary. COPYING files for widely used licenses are supplied automatically.
  106. proc script
  107.  
  108. Now you can use the proc script to generate the necessary files automatically. However, the mnc_affix.dat file listed under data-file does not exist yet. For now, create an empty file and run the proc script.
  109.  
  110. $ touch mnc_affix.dat
  111. $ ./proc
  112.  
  113. This creates several necessary files: README, COPYING, and so on are supplied, and configure, Makefile.pre, .alias, .multi, and so on are generated.
  114. Create word lists and affix rules
  115.  
  116. Everything is ready. Now it is time to make the actual dictionary. If a plain list of words is enough for the language, creating LANG.wl is sufficient. The word list file LANG.wl contains one word per line. If words do not change form, a word list alone is enough. If words change form, you must create both a word list and affix rules. In most cases you need LANG_affix.dat, because affix rules are required.
  117.  
  118. Let's take Manchu as an example. The Manchu noun beye takes the forms beyei, beyede, and beyebe. Here -i, -de, and -be are suffixes. The verb arambi takes the forms arame, arafi, araha, arara, and araci: ara- is the stem, and -mbi, -me, -fi, -ha, -ra, and -ci are the endings. Likewise, the verb genembi takes the forms geneme, genefi, genehe, genere, and geneci: gene- is the stem, and -mbi, -me, -fi, -he, -re, and -ci are the endings. The choice between -ha/-he and -ra/-re depends on the form of the stem.
  119.  
  120. Let's create mnc.wl as follows.
  121.  
  122. beye/N
  123. ara/V
  124. gene/V
  125.  
  126. Here, beye, ara, and gene are stems. N and V are flags that identify the type of affix; each group of affixes is represented by a single letter. mnc_affix.dat must describe the rules for the N and V affix groups. For example:
  127.  
  128. SFX N Y 3
  129. SFX N 0 i .
  130. SFX N 0 de .
  131. SFX N 0 be .
  132.  
  133. SFX V Y 8
  134. SFX V 0 mbi .
  135. SFX V 0 me .
  136. SFX V 0 fi .
  137. SFX V 0 ha a
  138. SFX V 0 he e
  139. SFX V 0 ra a
  140. SFX V 0 re e
  141. SFX V 0 ci .
  142.  
  143. SFX in the first line means suffix. N is the name that identifies the affix group. Y means that the suffix can be combined with a prefix. 3 means that there are three rules in suffix group N.
  144.  
  145. The second line is an actual suffix rule. SFX N means suffix group N, and the three columns that follow give the text to strip, the text to add, and the condition, in that order. In "0 i .", 0 means that nothing is stripped from the stem, i means that i is added, and . means that there is no restriction on the condition.
  146.  
  147. Now let's look at suffix group V. SFX V 0 ha a means: when the stem ends with a, add ha without stripping anything. SFX V 0 he e means: when the stem ends with e, add he without stripping anything. Thus ara/V expands to araha, and gene/V expands to genehe.
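To see how such rules play out without installing anything, the rule logic can be imitated with a few lines of awk. This is only a rough sketch of the subset used in this tutorial (strip column always 0, condition either "." or a single final letter), not a reimplementation of Aspell's affix handling; expand_stem is a hypothetical name, and the rules are written with whitespace-separated columns.

```shell
#!/bin/sh
# Hypothetical helper: read SFX rules on stdin and print the forms that
# the given flag produces for the given stem.
expand_stem() {
    stem=$1
    flag=$2
    awk -v stem="$stem" -v flag="$flag" '
        # Rule lines have 5 fields; the 4-field header line is skipped.
        $1 == "SFX" && $2 == flag && NF == 5 && $3 == "0" {
            cond = $5
            if (cond == "." || substr(stem, length(stem)) == cond)
                printf "%s%s\n", stem, $4
        }'
}

rules='SFX V Y 8
SFX V 0 mbi .
SFX V 0 me .
SFX V 0 fi .
SFX V 0 ha a
SFX V 0 he e
SFX V 0 ra a
SFX V 0 re e
SFX V 0 ci .'

printf '%s\n' "$rules" | expand_stem ara V    # arambi arame arafi araha arara araci
printf '%s\n' "$rules" | expand_stem gene V   # genembi geneme genefi genehe genere geneci
```

Each form is printed on its own line; the comments list them on one line for brevity.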
  148.  
  149. Install
  150.  
  151. All files are now ready. Before installing, you must first build the dictionary files. The following commands create LANG.cwl and then LANG.rws, which is the dictionary file to be installed.
  152.  
  153. $ ./configure
  154. $ make
  155.  
  156. Installation is as follows.
  157.  
  158. $ make install
  159.  
  160. Of course, to use the dictionary you have created, you must have aspell 0.60 or later installed.
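The version requirement can be checked from the banner line that Aspell prints. The sketch below assumes a banner of the form "(but really Aspell X.Y...)"; aspell_at_least_060 is a hypothetical helper, and the call to aspell itself is left as a comment so the snippet runs on machines without it.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the Aspell version found in a banner
# line is 0.60 or later.
aspell_at_least_060() {
    banner=$1
    # Pull out "MAJOR MINOR" from the text following the word "Aspell".
    ver=$(printf '%s\n' "$banner" |
        sed -n 's/.*Aspell \([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p')
    [ -n "$ver" ] || return 1
    set -- $ver
    [ "$1" -gt 0 ] || [ "$2" -ge 60 ]
}

banner='@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.6.1)'
if aspell_at_least_060 "$banner"; then
    echo "new enough"
else
    echo "too old"
fi

# On a real system the banner could come from Aspell itself, for example:
#   aspell_at_least_060 "$(aspell -v)"
```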
  161. Using
  162.  
  163. The following command prints a list of dictionaries currently installed on the system.
  164.  
  165. $ aspell dump dicts
  166.  
  167. Let's print the word list of an installed dictionary. The -l <lang> option specifies the language. For Manchu:
  168.  
  169. $ aspell -l mnc dump master
  170. ara/V
  171. beye/N
  172. gene/V
  173.  
  174. You can use expand and munch to check that the affix rules in the installed dictionary work properly.
  175.  
  176. $ echo 'ara/V' | aspell -l mnc expand
  177. ara arambi arame arafi araha arara araci
  178. $ echo 'araha' | aspell -l mnc munch
  179. araha ara/V
  180.  
  181. If you want to see how spell checking works:
  182.  
  183. $ echo 'arahe' | aspell -l mnc -a
  184. @(#) International Ispell Version 3.1.20 (but really Aspell 0.60.6.1)
  185. & arahe 2 0: arah, arah
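In this Ispell-compatible pipe output, a line starting with & reports a misspelled word together with suggestions, a line starting with # reports a misspelled word with no suggestions, and * marks a correct word. A minimal sketch for pulling out just the misspelled words follows; misspelled_words is a hypothetical helper, and sugg1 and sugg2 are placeholders rather than real Aspell suggestions.

```shell
#!/bin/sh
# Hypothetical helper: read "aspell -a" style output on stdin and print
# only the words that were reported as misspelled.
misspelled_words() {
    awk '$1 == "&" || $1 == "#" { print $2 }'
}

# Sample input in pipe-mode format; the suggestions are placeholders.
printf '%s\n' \
    '@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.6.1)' \
    '*' \
    '& arahe 2 0: sugg1, sugg2' |
misspelled_words    # prints: arahe
```

With Aspell installed, the same function could be fed from a pipeline such as echo 'arahe' | aspell -l mnc -a | misspelled_words.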
  186.  
  187. Deploying
  188.  
  189. You can distribute the dictionary once you are confident that it is built correctly. The distribution file can be created from the dictionary directory with the following command:
  190.  
  191. $ make dist
  192.  
  193. This command creates the file aspell6-mnc-20130123-0.tar.bz2, which you can distribute.
  194. Installing Aspell
  195.  
  196. On Debian Linux, you can install it with the following command:
  197.  
  198. $ sudo apt-get install