BinYamin

UNICODE letter(combination) frequency counter

Nov 16th, 2014
/* Author: Shaha Hassan
 * Date: 16/11/2014
 * GNU License: feel free to do whatever you like with the code, as long as you give me, Shaha Hassan, credit for it.
 *
 * This program was written to count the occurrences of Unicode letters, or of their combinations (2 and 3 characters long).
 * It can be used to find which letter occurs the most in a language, or which combinations of letters occur the most.
 * The user lists the Unicode characters to focus on in one file and provides the text to be analysed in another.
 * The program then reads the text file, counts the occurrences of combinations of consecutive required Unicode characters,
 * and lists them in descending order.
 */
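
The counting described in the header comment can be sketched in Java roughly as follows. This is only an illustrative sketch, not the code from the Driver, Matrix or Map files: the class ComboCounter and its methods are hypothetical names, and it assumes that a combination is counted only when every one of its consecutive characters belongs to the chosen character set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not one of the project's own classes.
final class ComboCounter {

    // Counts every run of n consecutive characters of text in which all n
    // characters belong to charSet (the assumed meaning of "required" characters).
    // Characters are handled as Unicode code points.
    static Map<String, Integer> count(String text, Set<Integer> charSet, int n) {
        int[] cps = text.codePoints().toArray();
        Map<String, Integer> counts = new HashMap<>();
        int run = 0; // length of the current run of in-set characters ending at i
        for (int i = 0; i < cps.length; i++) {
            run = charSet.contains(cps[i]) ? run + 1 : 0;
            if (run >= n) {
                String combo = new String(cps, i - n + 1, n);
                counts.merge(combo, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the entries sorted by count, highest first; ties are ordered by combination text.
    static List<Map.Entry<String, Integer>> descending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries;
    }
}
```

Calling count once each with n = 1, 2 and 3 would give the data for the three output files; a character outside the set simply breaks the current run, so no combination spans it.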

/*
 * This project is written in Java and has 4 files:
 * Driver http://pastebin.com/ZWBeq6MV
 * Matrix http://pastebin.com/SypU8jJe
 * Map http://pastebin.com/Z5M7aSJ1
 * Documentation
 * This file is Documentation.txt (search through my pastes to find the other files).
 */

--------Default values for file names-------------------------
Input files
input file D:\in.txt
char-set D:\set.txt

Output files (auto-generated)
out1combo D:\out1.txt
out2combo D:\out2.txt
out3combo D:\out3.txt

Using the default names is recommended,
so type a period (.) when prompted for a file name to use these default values.
--------------------------------------------------------------
--------Providing your own file names----------
To supply a file name, type the full path and file name WITHOUT THE EXTENSION.
Do not use spaces in folder names.
E.g. if the input file is
E:\My_folder\novel.txt
type
E:\My_folder\novel

The same applies to the other files.
For the output files, 1, 2 and 3 are suffixed automatically. If you type
E:\file
the program will generate the files
file1.txt, file2.txt, file3.txt
--------------------------------------------------------------
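
The file name rules from the two sections above could be implemented along these lines. This is a sketch under stated assumptions rather than the Driver's actual code: the class FileNames and its method names are hypothetical, the program is assumed to append ".txt" itself, and the space check is applied to the whole typed path, which is stricter than "no spaces in folder names".

```java
// Hypothetical helper for illustration; the default paths are the ones listed above.
final class FileNames {
    static final String DEFAULT_INPUT = "D:\\in.txt";
    static final String DEFAULT_CHAR_SET = "D:\\set.txt";
    static final String DEFAULT_OUTPUT_BASE = "D:\\out";

    // Input or char-set file: "." selects the default, otherwise ".txt" is appended.
    static String inputName(String typed, String defaultName) {
        String base = typed.trim();
        if (base.equals(".")) {
            return defaultName;
        }
        requireNoSpaces(base);
        return base + ".txt";
    }

    // Output files: the suffixes 1, 2 and 3 plus ".txt" are appended to the base name.
    static String[] outputNames(String typed) {
        String base = typed.trim();
        if (base.equals(".")) {
            base = DEFAULT_OUTPUT_BASE;
        } else {
            requireNoSpaces(base);
        }
        String[] names = new String[3];
        for (int n = 1; n <= 3; n++) {
            names[n - 1] = base + n + ".txt";
        }
        return names;
    }

    // Assumption: any space in the typed path is rejected.
    private static void requireNoSpaces(String path) {
        if (path.indexOf(' ') >= 0) {
            throw new IllegalArgumentException("spaces are not allowed in the path: " + path);
        }
    }
}
```

With this sketch, typing E:\My_folder\novel for the input file yields E:\My_folder\novel.txt, and typing a period for the output files yields the three default output paths.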


--------Character set file format-------------------------
set.txt should strictly follow this format:
Each line begins with a tilde (~) or a caret (^),
and the last line should contain just one asterisk (*).

-----------Range (~)
A tilde means the set includes the characters from char1 to char2 (both inclusive).
A line beginning with ~ should have exactly two characters after it and end with a line break (carriage return, ASCII value 13).
Note: the Unicode value of char1 should be less than that of char2.

Syntax:
~char1char2

Example:
~ch
--includes the characters from c to h: cdefgh
~VZ
--includes VWXYZ
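
As a small illustration of the range rule, the expansion of a ~ line might look like the following sketch. The class CharRange and its method expand are hypothetical names, the line is assumed to be passed without its trailing line break, and char1 is required to be strictly less than char2, following the note above.

```java
// Hypothetical helper for illustration; not taken from the project's own files.
final class CharRange {

    // Expands a range line such as "~ch" into the string "cdefgh".
    static String expand(String line) {
        int[] cps = line.codePoints().toArray();
        if (cps.length != 3 || cps[0] != '~') {
            throw new IllegalArgumentException("expected ~ followed by exactly two characters: " + line);
        }
        if (cps[1] >= cps[2]) {
            throw new IllegalArgumentException("char1 must have a lower Unicode value than char2: " + line);
        }
        StringBuilder chars = new StringBuilder();
        for (int cp = cps[1]; cp <= cps[2]; cp++) {
            chars.appendCodePoint(cp); // both ends of the range are included
        }
        return chars.toString();
    }
}
```

The comparison is done on Unicode code points, so the order is the numeric order of the characters, not any alphabetical order of a particular language.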


------------Individual chars (^)
A line beginning with ^ should contain at least one character after it.
All the characters in this line will be included in the set.

Syntax:
^charlist

Example:
^,
--the comma will be included
^pl%d v
--the characters p, l, %, d, <space> and v will be included.

------------Asterisk (*)
The last line must be an asterisk. Anything following the asterisk is simply ignored,
so you can write comments, notes, descriptions or love letters there.

--------------------------Examples of set files
~ad
^yfB
~GJ
~NP
*
These chars will be included: a,b,c,d, y,f,B, G,H,I,J, N,O,P.
All Unicode characters are allowed except whitespace; the actual space character is the one exception and is allowed.
-------------------------------------------
~16
*
digits 1 to 6
-----------
~az
~AZ
^., +-
~09
*
will contain all the letters a-z and A-Z, the digits 0-9, the characters . , + - and the space
----------
~10
*
is WRONG, since the value of 1 is greater than that of 0.


Note: The number and order of the lines beginning with ^ and ~ do not matter.
But lines starting with ~ should have exactly 2 chars after the ~,
and lines starting with ^ should have at least 1 char after the ^.
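
Putting the format rules together, a set file could be read with a parser like the sketch below. It is not the project's own parser: the class SetFileParser is a hypothetical name, the lines are assumed to have been read already (for example with java.nio.file.Files.readAllLines), and a trailing carriage return left on a line is tolerated.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not one of the project's own classes.
final class SetFileParser {

    // Builds the character set (as Unicode code points) from the lines of a set file.
    static Set<Integer> parse(List<String> lines) {
        Set<Integer> set = new LinkedHashSet<>();
        for (String raw : lines) {
            // Tolerate a stray carriage return at the end of the line.
            String line = raw.endsWith("\r") ? raw.substring(0, raw.length() - 1) : raw;
            if (line.equals("*")) {
                return set; // everything after the asterisk line is ignored
            }
            int[] cps = line.codePoints().toArray();
            if (line.startsWith("~")) {
                // Range line: exactly two characters, char1 strictly less than char2.
                if (cps.length != 3 || cps[1] >= cps[2]) {
                    throw new IllegalArgumentException("bad range line: " + line);
                }
                for (int cp = cps[1]; cp <= cps[2]; cp++) {
                    set.add(cp);
                }
            } else if (line.startsWith("^")) {
                // Individual characters: at least one character after the caret.
                if (cps.length < 2) {
                    throw new IllegalArgumentException("empty character list: " + line);
                }
                for (int i = 1; i < cps.length; i++) {
                    set.add(cps[i]);
                }
            } else {
                throw new IllegalArgumentException("line must start with ~ or ^, or be a single *: " + line);
            }
        }
        throw new IllegalArgumentException("missing terminating * line");
    }
}
```

For the first example set file above, this sketch returns the 14 characters a,b,c,d, y,f,B, G,H,I,J, N,O,P, and it rejects the ~10 example with an exception.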