- #No surprise, it's borked!
- This is a little side project I've been working on for the last few weeks, because I decided I wanted to create a dynamic test case system
- for a friend of mine's machine learning class homework. Needless to say, I've had my hands full and I'm kinda tired of working
- on the f\*\*ker! It's a little broken; the ps0 assignment that I finished should be somewhere along the lines of complete, save for a few things.
- ***It works for the original test cases but not my randomly generated ones***, and that could be either because I fucked up the generator somehow
- or it doesn't cover every string case...
- Most likely it's a generator error, because that's the thing that's taken me the longest to program. Between minor optimizations and design choices,
- I had to completely figure the damn thing out myself. *Nothing like the actual assignment, for which I was given a list of constraints.*
- Also, since this is my first time using Python for anything wholesome, please forgive bad implementation. I'll figure it out eventually...
- > Note: I actually didn't plot the data using a pyplot loglog, but instead chose a format that looks a lot prettier.
- #PS0 Constraints
- (5 points) Write a python function create_corpus(d) that reads each .txt file
- in the directory d and returns a dictionary mapping the file name (without the directory
- name) to a string representing the entire text document. The following modules/functions
- will probably be useful.
- (5 points) Write a python function corpus_char_stats(corpus) that takes a
- dictionary corpus as the one you created in question 1, and returns a 2 element tuple in
- which element 0 is itself a tuple that contains the length (in characters) of the shortest
- file along with that file’s name. Element 1 of the returned tuple should similarly be a
- two element tuple with the length and name of the longest file in the corpus. Just count
- the characters in each file here; don’t do any fanciness. Put these lengths (and the file
- names) in the function’s doc comment.
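One way to sketch question 2, exploiting Python tuple ordering so the file name breaks length ties (a sketch, not necessarily the intended approach):

```python
def corpus_char_stats(corpus):
    """Return ((shortest_len, shortest_name), (longest_len, longest_name))."""
    # Sorting (length, name) pairs puts the shortest file first, longest last.
    lengths = sorted((len(text), name) for name, text in corpus.items())
    return (lengths[0], lengths[-1])
```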
- (5 points) Create a python function words(data) that takes the raw data string (the
- value associated with a filename key in your corpus) and: (1) splits the data string into
- a sequence of white-space delimited tokens; (2) changes all alphabetic characters (i.e.,
- a-zA-Z) to lower case (i.e., a-z); and (3) filters tokens into two lists: those containing
- only alphabetic characters, and those containing other stuff. Note that delimiters (i.e.,
- whitespace) should be discarded and should not appear in any list. (4) the words()
- function should return a tuple containing the list of alphabetic tokens and the list of
- non-alphabetic tokens.
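A possible sketch for question 3. Note that `str.isalpha()` accepts any Unicode letter, so this version uses a regex to enforce the stated a-z restriction (the regex choice is my assumption):

```python
import re

def words(data):
    """Split data on whitespace, lowercase it, and partition the tokens
    into (alphabetic_tokens, non_alphabetic_tokens)."""
    alpha, non_alpha = [], []
    for tok in data.lower().split():
        # fullmatch so tokens like "don't" or "42" land in the other list
        (alpha if re.fullmatch(r"[a-z]+", tok) else non_alpha).append(tok)
    return (alpha, non_alpha)
```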
- (5 points) Create a python function find_word_ratios(corpus) that returns a
- sorted list containing the fraction of alphabetic tokens in each file in the corpus (so each
- element in the list is a tuple with the fraction and then the file name). Use this function
- to determine the document with the lowest ratio of purely alphabetic words. Put the
- document name and the fraction of alphabetic tokens it contains into the function’s
- doc-comment.
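Question 4 might be sketched like this, reusing the `words()` logic from the previous part (inlined here so the snippet stands alone; the guard against zero-token files is my own addition):

```python
import re

def words(data):
    """Partition lowercased whitespace tokens into alphabetic vs. other."""
    alpha, non_alpha = [], []
    for tok in data.lower().split():
        (alpha if re.fullmatch(r"[a-z]+", tok) else non_alpha).append(tok)
    return (alpha, non_alpha)

def find_word_ratios(corpus):
    """Return a sorted list of (alphabetic_fraction, file_name) tuples;
    element 0 is then the document with the lowest ratio."""
    ratios = []
    for name, data in corpus.items():
        alpha, non_alpha = words(data)
        total = len(alpha) + len(non_alpha)
        ratios.append((len(alpha) / total if total else 0.0, name))
    return sorted(ratios)
```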
- (5 points) Create a python function word_frequencies(corpus) that records
- the frequency with which each alphabetic token appears in the entire corpus. That is, for each element in your corpus this function should: (1) use your words() function
- to isolate the alphabetic tokens; (2) update frequencies for all words observed in the
- document; (3) return a sorted list where each element in the list is a tuple of (frequency,
- word). Use this function to find the ten most frequently occurring words in the corpus
- and their specific frequencies. Put these results into your doc-comment.
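For question 5, `collections.Counter` handles the bookkeeping (`words()` inlined again so the sketch stands alone):

```python
import re
from collections import Counter

def words(data):
    """Partition lowercased whitespace tokens into alphabetic vs. other."""
    alpha, non_alpha = [], []
    for tok in data.lower().split():
        (alpha if re.fullmatch(r"[a-z]+", tok) else non_alpha).append(tok)
    return (alpha, non_alpha)

def word_frequencies(corpus):
    """Return a sorted list of (frequency, word) over the whole corpus."""
    counts = Counter()
    for data in corpus.values():
        counts.update(words(data)[0])  # element 0 is the alphabetic list
    return sorted((freq, word) for word, freq in counts.items())
```

Since the list is sorted ascending, the ten most frequent words are its last ten elements.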
- (5 points) Using the data from word_frequencies(corpus), create a pdf loglog
- plot of the frequencies of all words in the corpus on the y axis, and the rank on
- the x axis. Save this as loglog.pdf. For 3 points of extra credit, figure out how to do a
- least squares line fit and plot the fitted line in addition to the observed data. (Hint, use
- the scipy package).