mizinamo

Untitled

Mar 30th, 2025
  1. It depends on what you consider a "letter".
  2.  
  3. For example, in German, **ä ö ü ß** are not considered parts of the alphabet; "the German alphabet" has 26 letters in a particular order.
  4.  
  5. **ä ö ü ß** don't have a position in that alphabetical order; when you want to sort German words, there are at least two ways to take them into account ("phonebook" and "dictionary" – "phonebook" sorts **ä ö ü ß** as if they were two letters each **ae oe ue ss** and "dictionary" sorts them as if they were **a o u ss**).
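
A minimal sketch of those two orders as Python sort keys. The function names are made up for illustration, and real collation rules (tie-breaking, upper case, other accented letters) are more involved than this:

```python
# Hypothetical helpers: each maps a word to the string it should be sorted by.
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def phonebook_key(word: str) -> str:
    """Sort ä ö ü ß as if they were written ae oe ue ss."""
    return word.lower().translate(PHONEBOOK)


def dictionary_key(word: str) -> str:
    """Sort ä ö ü ß as if they were written a o u ss."""
    return word.lower().translate(DICTIONARY)


words = ["Ofen", "Öl", "Oda"]
print(sorted(words, key=phonebook_key))   # ['Oda', 'Öl', 'Ofen']
print(sorted(words, key=dictionary_key))  # ['Oda', 'Ofen', 'Öl']
```

The same three words come out in a different order depending on which convention you pick, so "alphabetical order" is already a choice your app has to make.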
  6.  
  7. So they are "graphemes used in writing German", but whether they are "letters" or not depends on your definition of "letter".
  8.  
  9. Then there is the issue that **ß** was traditionally only used in lower case; when you capitalise a word such as *Größe* "size", you would traditionally get GRÖSSE rather than GRÖẞE, which uses the capital **ẞ** that only became part of the official German orthography in 2017.
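
This matters in code, too. As a small illustration (the two function names are hypothetical), Python's built-in `str.upper()` follows the traditional rule, so upper-casing can change the number of characters in a word:

```python
import unicodedata


def capitalise_traditional(word: str) -> str:
    """Upper-case with ß -> SS, which is what str.upper() does by default."""
    return unicodedata.normalize("NFC", word).upper()


def capitalise_modern(word: str) -> str:
    """Upper-case but keep a single character by using capital ẞ (U+1E9E)."""
    return unicodedata.normalize("NFC", word).replace("ß", "ẞ").upper()


print(capitalise_traditional("Größe"))  # GRÖSSE
print(capitalise_modern("Größe"))       # GRÖẞE
print("GRÖSSE".lower())                 # grösse (the ß cannot be recovered)
```

So if your function picks a random letter and then upper-cases it, picking **ß** may hand you back two letters.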
  10.  
  11. Meanwhile, in crossword puzzles, it's common to "write out" umlauts and Eszett, i.e. a crossword puzzle might have squares that you write **G R O E S S E** into, rather than expecting **G R Ö S S E** or **G R Ö ẞ E**. (So that the **O** can be part of another, crossing, word that does not have **Ö** in it.)
  12.  
  13. For English, you cannot write "fiancée, café, résumé/resumé" with those letters because **é** is missing. Is that a separate letter from **e**? Is "résumé/resumé" a separate word from "resume", or does "resume" have two different pronunciations, one for the meaning "continue" and one for the meaning "CV"?
  14.  
  15. (And conservative British English needs even more graphemes if you want to write words such as coöperate, naïve, manœuvre, pædiatrician, learnèd, rôle, …)
  16.  
  17. Same with Portuguese: you can't even write "você" if you have no **ê** in your alphabet / list of graphemes.
  18.  
  19. That letter might not be listed when you recite the alphabet, but diacritics make a difference in Portuguese: *a* "the" is not the same word as *à* "to the".
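
One way to make the "is **é** a separate letter from **e**?" decision explicit in code is Unicode normalisation. This is only a sketch, with a hypothetical helper name, and it assumes that "letter with its diacritic removed" is the definition of "base letter" you want:

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Split each character into base letter + combining marks, then drop the marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("résumé"))    # resume
print(strip_diacritics("você"))      # voce
print(strip_diacritics("à"))         # a
print(strip_diacritics("manœuvre"))  # manœuvre (œ is not "o plus a mark")
```

Whether you apply something like this before building your letter list is exactly the "letter vs. grapheme" decision: with it, *a* and *à* collapse into one entry; without it, they stay distinct.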
  20.  
  21. Also, when you say
  22.  
  23. > I need to create a function that returns a random letter from the chosen language
  24.  
  25. , do you want each letter to be equally likely, including letters that are only used in a single word in that language, or do you want to weight them (a bit like Scrabble tiles) so that more-frequent letters have a higher probability of being chosen?
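
The two options look roughly like this in Python. The function names are hypothetical, and the sample text is only a placeholder; for real weights you would count letters in a large corpus of the language in question:

```python
import random
from collections import Counter


def letter_frequencies(sample_text: str) -> Counter:
    """Count how often each letter occurs in some sample text."""
    return Counter(ch for ch in sample_text.lower() if ch.isalpha())


def uniform_letter(alphabet, rng=random):
    """Every letter is equally likely, however rare it is in real words."""
    return rng.choice(list(alphabet))


def weighted_letter(frequencies, rng=random):
    """Letters are drawn in proportion to their counts, like Scrabble tiles."""
    letters = list(frequencies)
    weights = [frequencies[letter] for letter in letters]
    return rng.choices(letters, weights=weights, k=1)[0]


# Placeholder corpus, far too small for real use.
frequencies = letter_frequencies("the quick brown fox jumps over the lazy dog")
print(uniform_letter(frequencies.keys()))
print(weighted_letter(frequencies))
```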
  26.  
  27. For Japanese, well – that’s a basic set of syllable characters, and you can write any Japanese word using *versions* of those letters, but depending (again!) on whether you are going for "letter" or "grapheme", you will probably want some more:
  28.  
  29. * がぎぐげござじずぜぞだぢづでどばびぶべぼ (voiced consonants, written with the dakuten mark ゛)
  30. * ぱぴぷぺぽ (p-sounds, written with the handakuten mark ゜)
  31. * っ (for long consonants)
  32. * ゃゅょ (for syllables with -y- in them, as in Tokyo)
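
The first two groups above are systematic "versions" of basic kana, and Unicode models them that way: each one is a basic kana plus a combining mark. A sketch with hypothetical helper names, relying on Python's standard `unicodedata` module:

```python
import unicodedata

DAKUTEN = "\u3099"     # combining voiced sound mark
HANDAKUTEN = "\u309a"  # combining semi-voiced sound mark


def add_mark(kana: str, mark: str) -> str:
    """Combine a basic kana with a mark, e.g. か + dakuten -> が."""
    composed = unicodedata.normalize("NFC", kana + mark)
    if len(composed) != 1:
        raise ValueError(f"{kana!r} has no marked version with this mark")
    return composed


def base_kana(kana: str) -> str:
    """Return the basic kana that a marked kana is built on, e.g. ぱ -> は."""
    return unicodedata.normalize("NFD", kana)[0]


print(add_mark("か", DAKUTEN))     # が
print(add_mark("は", HANDAKUTEN))  # ぱ
print(base_kana("ど"))             # と
print(base_kana("ゃ"))             # ゃ (small kana are separate characters)
```

The small kana っ ゃ ゅ ょ cannot be derived this way, so they would need their own entries in whatever list you build.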
  33.  
  34. For Korean, you have the issue that Korean *does* have an alphabet, but it doesn't simply write the letters left to right as in English. Instead, it puts the letters of one syllable into a little square.
  35.  
  36. So while you could have a minimal alphabet that looks something like
  37.  
  38. * ㄱ ㄴ ㄷ ㄹ ㅁ ㅂ ㅅ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ (consonants)
  39. * ㅏ ㅑ ㅓ ㅕ ㅗ ㅛ ㅜ ㅠ ㅡ ㅣ (vowels)
  40.  
  41. you would still have to be able to put them together into something such as 김 (the common Korean family name usually spelled "Kim" in English): see how the individual letters ㄱ ㅣ ㅁ are put together?
  42.  
  43. The example syllables you have are combinations of consonant + vowel ㅏ "a", i.e. something like "ga na da ra…".
  44.  
  45. And syllables can get as complex as 쐞 ( ㅅ ㅅ ㅗ ㅏ ㅣ ㄹ ㅍ; or ㅆ ㅙ ㄿ if you use a larger "alphabet" that includes some combinations of consonants and vowels as individual units).
  46.  
  47. So you will have to have code that can compose Hangeul syllables from individual Hangeul letters, as well as decompose syllables into their individual letters.
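
A minimal sketch of such code, using the arithmetic layout of the precomposed Hangeul syllable block in Unicode (19 initial consonants × 21 vowels × 28 optional finals, starting at U+AC00). The function names are hypothetical, and this uses the larger "alphabet" in which double consonants and compound vowels count as single units:

```python
LEADS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")       # 19 initial consonants
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")  # 21 vowels
# 28 finals: "no final" plus 27 consonants and consonant clusters
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

FIRST_SYLLABLE = 0xAC00  # 가


def compose(lead: str, vowel: str, tail: str = "") -> str:
    """Build one syllable block from its letters."""
    index = (LEADS.index(lead) * len(VOWELS) + VOWELS.index(vowel)) * len(TAILS)
    return chr(FIRST_SYLLABLE + index + TAILS.index(tail))


def decompose(syllable: str) -> tuple:
    """Split one syllable block into (initial, vowel, final); final may be ''."""
    index = ord(syllable) - FIRST_SYLLABLE
    if not 0 <= index < len(LEADS) * len(VOWELS) * len(TAILS):
        raise ValueError(f"{syllable!r} is not a precomposed Hangeul syllable")
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return LEADS[lead], VOWELS[vowel], TAILS[tail]


print(compose("ㄱ", "ㅣ", "ㅁ"))  # 김
print(decompose("김"))            # ('ㄱ', 'ㅣ', 'ㅁ')
print(decompose("쐞"))
```

Going further down to the minimal alphabet (splitting ㅆ into ㅅ ㅅ, or ㅙ into ㅗ ㅏ ㅣ) would need an extra lookup table on top of this.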
  48.  
  49. Then there are languages where some letters of the alphabet are composed of multiple graphemes; for example, **dzs** is the eighth letter of the Hungarian alphabet, which also contains the letters **d, z, s, dz, zs** each with their own sound and their own position in the alphabet. I'm not sure whether your current data format can handle this.
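
To give an idea of what "handling this" could mean, here is a sketch that splits a word into Hungarian letters by always trying the longest multi-character letter first. The function name and the data layout are hypothetical, not something from your code:

```python
# The multi-character letters of the Hungarian alphabet.
HUNGARIAN_MULTIGRAPHS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]


def split_letters(word: str, multigraphs=HUNGARIAN_MULTIGRAPHS) -> list:
    """Split a word into letters, preferring the longest match at each position."""
    ordered = sorted(multigraphs, key=len, reverse=True)
    letters = []
    i = 0
    while i < len(word):
        for candidate in ordered:
            chunk = word[i:i + len(candidate)]
            if chunk.lower() == candidate:
                letters.append(chunk)
                i += len(candidate)
                break
        else:
            # No multi-character letter starts here: take a single character.
            letters.append(word[i])
            i += 1
    return letters


print(split_letters("dzsungel"))  # ['dzs', 'u', 'n', 'g', 'e', 'l']
print(split_letters("zsák"))      # ['zs', 'á', 'k']
```

Longest-match splitting is only an approximation: where two separate letters happen to meet at a compound boundary, it will wrongly merge them, and fixing that needs a dictionary rather than a rule.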
  50.  
  51. (Even in English, it may or may not be useful to have individual tiles for common digraphs such as **th sh ch** even if they do not count as individual letters in English.)
  52.  
  53. tl;dr: it’s a lot more complicated than you seem to think, and there is no one correct answer. Maybe you will have to think more about the needs of your app.