- <spectie> no obvious POS, but i suppose you can tell in some cases
- <spectie> from the verb morphology in kk and in english
- <nathan0n5ire> the words that end in - are verbs
- <spectie> and sometimes it has POS
- <spectie> АЙЛЫ /adj./ moonlight, moonlit.
- <nathan0n5ire> yes adjectives are marked
- <nathan0n5ire> I don't think the other ones are though
- <spectie> it seems like kazakh stems are marked by being only in uppercase
- <spectie> and also in cyrillic (obv)
- <spectie> so something could be done with that
- <spectie> to extract just the stems
- <nathan0n5ire> all of the kazakh words are in uppercase
- <spectie> АЙЛЫҚ monthly; monthly wages; ... MEP-31M period of one month.
- <spectie>
- <spectie> ah
- <nathan0n5ire> sometimes the ocr also messed up
- <nathan0n5ire> like AJIMAJIA- to embrace.
- <spectie> MEP-31M is ocr error
- <spectie> i would approach this in passes
- <spectie> i would start by extracting the really easy stuff
- <spectie> where you just have two words:
- <spectie> АЙТУ pronunciation.
- <spectie> АЙМАҚТЫҚ regional.
- <spectie> АЙНАЛАДА around.
- <spectie>
- <spectie> etc.
- <spectie> i would put these in a separate file
- <spectie> then i would extract the ones with only comma + full stop as punctuation
- <spectie> АЙЛАКЕР sly, cunning one.
- <spectie> АЙЛАКЕРЛІК slyness, cunning.
- <spectie> АЙЛАЛЫ adroit, resourceful.
- <spectie> АЙЛАСЫЗ artless, unsophisticated.
- <spectie> etc.
- <nathan0n5ire> around 8K start with a kazakh character
- <nathan0n5ire> *8K lines
- <spectie> nice
- <spectie> lines starting with a kazakh character ?
- <nathan0n5ire> [АаӘәБбВвГгҒғДдЕеЁёЖжЗзИиЙйКкҚқЛлМмНнҢңОоӨөПпРрСсТтУуҰұҮүФфХхҺһЦцЧчШшЩщЪъЫыІіЬьЭэЮюЯя]
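The pass-based approach discussed above could be sketched roughly as follows. This is only an illustration, not the code either participant actually used: the names `KK_LETTERS`, `split_entry`, `classify` and the labels `pass1`, `pass2`, `later`, `no_headword` are made up for this example. It assumes one dictionary entry per line, a headword made only of uppercase Kazakh Cyrillic letters (with an optional trailing `-` for verbs), and an English gloss made of plain ASCII letters.

```python
import re

# Character class quoted in the log above (upper- and lowercase Kazakh Cyrillic).
KK_LETTERS = "АаӘәБбВвГгҒғДдЕеЁёЖжЗзИиЙйКкҚқЛлМмНнҢңОоӨөПпРрСсТтУуҰұҮүФфХхҺһЦцЧчШшЩщЪъЫыІіЬьЭэЮюЯя"

# Assumption: headwords use only the uppercase half of that class.
KK_UPPER = "".join(ch for ch in KK_LETTERS if ch.isupper())

# Headword (optionally ending in "-" for verbs), whitespace, then the gloss.
ENTRY_RE = re.compile(rf"^([{KK_UPPER}]+-?)\s+(.+)$")

# Pass 1: the gloss is a single English word followed by a full stop.
PASS1_RE = re.compile(r"[A-Za-z]+\.")

# Pass 2: the only punctuation in the gloss is commas and one final full stop.
PASS2_RE = re.compile(r"[A-Za-z ,]+\.")


def split_entry(line):
    """Return (headword, gloss), or None if the line has no clean headword."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


def classify(line):
    """Assign a line to the pass that would pick it up."""
    entry = split_entry(line)
    if entry is None:
        # E.g. OCR turned the Cyrillic headword into Latin lookalikes.
        return "no_headword"
    _, gloss = entry
    if PASS1_RE.fullmatch(gloss):
        return "pass1"
    if PASS2_RE.fullmatch(gloss):
        return "pass2"
    # POS tags, semicolons, embedded Kazakh examples, etc. are left for later passes.
    return "later"


if __name__ == "__main__":
    samples = [
        "АЙТУ pronunciation.",
        "АЙЛАКЕР sly, cunning one.",
        "АЙЛЫ /adj./ moonlight, moonlit.",
        "AJIMAJIA- to embrace.",
    ]
    for sample in samples:
        print(classify(sample), sample, sep="\t")
```

Writing each class to its own file, as suggested in the log, would then just be a matter of grouping lines by the label `classify` returns. Glosses containing hyphens or apostrophes fall through to `later` under these patterns, which may or may not be what is wanted.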