////////////////////////////////////
// INFORMATION ON FREQUENCY LISTS //
////////////////////////////////////

Date created: 2014-02-27.
Total size of files: 9.7 MB.
Source used: http://forum.ge
22667472 words analyzed

Word frequency lists include the 50000 most common words used in message board posts at forum.ge. No attempt has been made to reduce the list to headwords only, so inflected forms of the same word appear as separate entries.

Phrase frequency lists are divided into separate files of 2-word and 3-word phrases. Separate entries for phrases taken from an identical or very similar larger phrase often appear in the lists. This is because the list generator reads every combination of adjacent words in a single sentence, including phrases that overlap by one or more words.

These files are probably the largest and most relevant lists of word and phrase frequencies bearing any resemblance to colloquial usage of Georgian that are (currently) freely available on the internet.

Over 20 GB of data was downloaded from the source website and filtered down to post content only. The following content was filtered out:

* Content from the forum section ყიდვა - გაყიდვა - გაცვლა ('Buying - Selling - Exchange'). This section includes a lot of redundant usage of words like ლარად 'for lari', იყიდება 'is being sold', მოყვება 'is included', etc.
* Quote blocks (although some outer content in nested quote blocks is included due to parsing issues in the source)
* References to user names in bold font (pervasive and extremely common) and other content in bold font (uncommon)
* Multiple consecutive occurrences of a single word (often to the effect of "word spamming")
* Words not containing any Georgian characters
* Words containing numbers

Therefore any post content written in one of the various Roman-alphabet transliterations of Georgian (common elsewhere, e.g. Facebook posts, comment sections in articles, video portals, etc.) was ignored and not included in the lists.
As a result of this and the list of ignored content above, only about half (roughly 51%) of the content accessed from message board posts (not counting the forum section ყიდვა - გაყიდვა - გაცვლა) was actually included in the analysis: 22667472 out of 44472664 words.

These lists should nevertheless prove to be a relevant source for anyone interested in obtaining a general profile of the most frequently used words and phrases in present-day colloquial Georgian.

/////////////////////////////////
// INFORMATION ON SOURCES USED //
/////////////////////////////////

Forum.GE - TBILISIS FORUMI [http://forum.ge]
Content accessed: 16872 forum topics, 44472664 words
Topic IDs 34482817 to 34517999, Jan. 8, 2013 to Apr. 8, 2013
Description from http://top.ge: მაღალინტელექტუალური ფორუმი ('Highly intelligent forum').
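
The word filters and the phrase-generation behavior described in the first section can be illustrated with a short Python sketch. This is not the generator that produced these lists; it is a minimal reconstruction under stated assumptions. The function names (clean_words, ngrams, count_phrases), the tokenization by runs of word characters, the use of the Mkhedruli range U+10D0 to U+10FA as the test for "Georgian characters", and the choice to collapse repeated words to a single occurrence are all hypothetical.

```python
import re
from collections import Counter

# Assumption: "Georgian characters" means Mkhedruli letters (U+10D0..U+10FA).
GEORGIAN_LETTER = re.compile(r"[\u10D0-\u10FA]")
HAS_DIGIT = re.compile(r"\d")


def clean_words(sentence):
    """Tokenize one sentence and apply the word-level filters."""
    # Assumption: a "word" is a run of word characters.
    words = re.findall(r"\w+", sentence)
    # Drop words with no Georgian characters and words containing numbers.
    kept = [w for w in words
            if GEORGIAN_LETTER.search(w) and not HAS_DIGIT.search(w)]
    # Collapse consecutive repeats of the same word ("word spamming").
    deduped = []
    for w in kept:
        if not deduped or deduped[-1] != w:
            deduped.append(w)
    return deduped


def ngrams(words, n):
    """Return every run of n adjacent words, overlapping runs included."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_phrases(sentences, n):
    """Count n-word phrases sentence by sentence (never across sentences)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(clean_words(sentence), n))
    return counts


# Hypothetical sample input, not taken from the forum data.
sample = ["ეს არის არის კარგი ფორუმი", "ეს არის forum 2013"]
bigrams = count_phrases(sample, 2)
trigrams = count_phrases(sample, 3)
```

In this sample the first sentence contributes two 3-word phrases and three 2-word phrases that overlap with them, which is why parts of one larger phrase show up as separate entries in the phrase lists.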