- /*Author Shaha Hassan
- *date 16/11/2014
- * GNU License: feel free to do whatever you like with the code, as long as you give me, Shaha Hassan, credit for it.
- *
- * This program was written to count the occurrences of Unicode letters, or of their combinations (2 and 3 characters long).
- * It can be used to find which letter occurs most often in a language, or which combinations of letters occur most often.
- * The user lists the Unicode characters to focus on in one file and supplies the text to be analysed in another.
- * The program then reads the text file, counts the occurrences of each combination of consecutive characters from the required set,
- * and lists them in descending order.
- */
- /*
- * This project is in Java and has 4 files:
- * Driver http://pastebin.com/ZWBeq6MV
- * Matrix http://pastebin.com/SypU8jJe
- * Map http://pastebin.com/Z5M7aSJ1
- * Documentation
- * This one is Documentation.txt (search through my pastes to find the other files).
- */
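The counting described in the header comment can be sketched roughly as follows. This is only an illustration and not the code from the Driver, Matrix or Map pastes: the class name ComboCounter and its methods are hypothetical, a plain HashMap is assumed for the tallies, and characters are treated as single Java chars (so characters outside the Basic Multilingual Plane are not handled).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; NOT the author's Driver/Matrix/Map classes.
class ComboCounter {

    // Counts every run of n consecutive characters in text whose characters
    // are all members of the allowed set.
    static Map<String, Integer> count(String text, Set<Character> allowed, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            String combo = text.substring(i, i + n);
            boolean allAllowed = true;
            for (int j = 0; j < n; j++) {
                if (!allowed.contains(combo.charAt(j))) {
                    allAllowed = false;
                    break;
                }
            }
            if (allAllowed) {
                counts.merge(combo, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the entries sorted by count, highest first.
    // Ties are broken by the combination itself so the order is stable.
    static List<Map.Entry<String, Integer>> descending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Set<Character> allowed = Set.of('a', 'b', 'n');
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + "-char combinations: "
                    + descending(count("banana band", allowed, n)));
        }
    }
}
```

A combination is skipped as soon as one of its characters is outside the set, so in "banana band" the space and the d break up the runs.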
- --------Default values for file names are---------------------
- Input files
- input file: D:\in.txt
- char-set: D:\set.txt
- Output files (auto generated)
- out1combo: D:\out1.txt
- out2combo: D:\out2.txt
- out3combo: D:\out3.txt
- Using the default names is recommended,
- so type a period (.) when prompted for a filename to use its default value.
- --------------------------------------------------------------
- --------Providing user's filenames----------
- When entering a filename, provide the full path, WITHOUT THE EXTENSION.
- Do not use spaces in folder names.
- E.g. if the input file is
- E:\My_folder\novel.txt
- type
- E:\My_folder\novel
- Similarly for other files.
- For the output files, the suffixes 1, 2 and 3 are added automatically. If you type
- E:\file
- it will generate the files
- file1.txt, file2.txt, file3.txt
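The filename rules above could be implemented along these lines. This is a sketch under assumptions, not the Driver's actual code: the class FileNames and its methods are hypothetical, and it assumes that ".txt" is the extension the program appends to whatever is typed.

```java
// Hypothetical sketch of the filename rules; not taken from the Driver paste.
class FileNames {
    static final String DEFAULT_INPUT = "D:\\in.txt";
    static final String DEFAULT_SET = "D:\\set.txt";
    static final String DEFAULT_OUTPUT_BASE = "D:\\out";

    // A period selects the default; anything else is treated as a full path
    // without extension, to which ".txt" is appended (assumption).
    static String inputFile(String typed, String defaultName) {
        return typed.equals(".") ? defaultName : typed + ".txt";
    }

    // Builds the three output names by suffixing 1, 2 and 3 to the base name.
    static String[] outputFiles(String typed) {
        String base = typed.equals(".") ? DEFAULT_OUTPUT_BASE : typed;
        String[] names = new String[3];
        for (int n = 1; n <= 3; n++) {
            names[n - 1] = base + n + ".txt";
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(inputFile("E:\\My_folder\\novel", DEFAULT_INPUT));
        System.out.println(inputFile(".", DEFAULT_SET));
        for (String name : outputFiles("E:\\file")) {
            System.out.println(name);
        }
    }
}
```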
- -======================================================
- --------character set file format-------------------------
- set.txt should strictly follow this format:
- Each line begins with a tilde (~) or a caret (^),
- and the last line should contain just one asterisk (*).
- -----------Range (~)
- A tilde means the set includes every character from char1 to char2 (both inclusive).
- A line beginning with ~ should have just two characters after the ~, and end with a line break (carriage return, ASCII value 13).
- Note: the Unicode value of char1 should be less than that of char2.
- Syntax:
- ~char1char2
- Example:
- ~ch
- --includes the characters from c to h: cdefgh
- ~VZ
- --includes VWXYZ
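Expanding a range line can be sketched as below. The class name RangeLine and the method expand are hypothetical, and throwing an exception for a malformed line is an assumption; the documentation does not say how the program reacts to bad input.

```java
// Hypothetical sketch of expanding one "~char1char2" line.
class RangeLine {

    // Expands a line such as "~ch" into the string "cdefgh".
    static String expand(String line) {
        if (line.length() != 3 || line.charAt(0) != '~') {
            throw new IllegalArgumentException("Expected ~ followed by exactly 2 characters: " + line);
        }
        char first = line.charAt(1);
        char last = line.charAt(2);
        if (first >= last) {
            throw new IllegalArgumentException("char1 must have a lower Unicode value than char2: " + line);
        }
        StringBuilder chars = new StringBuilder();
        // An int loop variable avoids wrap-around if last is the highest char value.
        for (int c = first; c <= last; c++) {
            chars.append((char) c);
        }
        return chars.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("~ch"));
        System.out.println(expand("~VZ"));
    }
}
```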
- ------------Individual chars(^)
- A line beginning with ^ should contain at least one character after the ^.
- All the characters in this line will be included in the set.
- Syntax:
- ^charlist
- Example:
- ^,
- --comma will be included
- ^pl%d v
- --characters p,l,%,d,<space> and v will be included.
- ------------Asterisk (*)
- The last line must be an asterisk. Anything following the asterisk will simply be ignored,
- so you can write comments, notes, descriptions, love letters here.
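Putting the three line types together, a set file could be read roughly like this. It is a sketch under assumptions, not the program's own parser: the class SetFileParser is hypothetical, the lines are assumed to be already read into a list, a trailing carriage return is stripped from each line, and malformed lines are rejected with an exception.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a set-file reader; not the author's implementation.
class SetFileParser {

    // Collects the characters named by the ~ and ^ lines, stopping at the * line.
    static Set<Character> parse(List<String> lines) {
        Set<Character> set = new LinkedHashSet<>();
        for (String raw : lines) {
            // Drop a trailing carriage return left over from Windows line endings.
            String line = raw.endsWith("\r") ? raw.substring(0, raw.length() - 1) : raw;
            if (line.equals("*")) {
                return set; // everything after the asterisk is ignored
            }
            if (line.startsWith("~") && line.length() == 3 && line.charAt(1) < line.charAt(2)) {
                // Range line: every character from char1 to char2, both inclusive.
                for (int c = line.charAt(1); c <= line.charAt(2); c++) {
                    set.add((char) c);
                }
            } else if (line.startsWith("^") && line.length() >= 2) {
                // Individual characters: everything after the caret, spaces included.
                for (int i = 1; i < line.length(); i++) {
                    set.add(line.charAt(i));
                }
            } else {
                throw new IllegalArgumentException("Malformed set line: " + line);
            }
        }
        throw new IllegalArgumentException("Set file must end with a line containing only *");
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("~ch", "^pl%d v", "*", "notes go here")));
    }
}
```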
- --------------------------Examples of set files
- ~ad
- ^yfB
- ~GJ
- ~NP
- *
- These characters will be included: a,b,c,d, y,f,B, G,H,I,J, N,O,P.
- All Unicode characters are allowed except whitespace; the one exception is the actual space character, which is allowed.
- -------------------------------------------
- ~16
- *
- digits 1 to 6
- -----------
- ~az
- ~AZ
- ^., +-
- ~09
- *
- will contain all the letters a to z and A to Z, the digits 0 to 9, the characters . , + - and the space.
- ----------
- ~10
- *
- is WRONG, since the value of 1 is greater than that of 0.
- Note: The number and order of the lines beginning with ^ and ~ do not matter,
- but lines starting with ~ should have exactly 2 characters after the ~
- and lines starting with ^ should have at least 1 character.