- #No surprise, it's borked!
- This is a little side project I've been working on for the last few weeks, because I decided I wanted to create a dynamic test case system
- for a friend of mine's machine learning class homework. Needless to say, I've had my hands full and I'm kinda tired of working
- on the f\*\*ker! It's a little broken; the ps0 assignment that I finished should be somewhere along the lines of complete, save for a few things.
- ***It works for the original test cases but not my randomly generated ones***, and that could be either because I fucked up the generator somehow
- or it doesn't cover every string case...
- Most likely it's a generator error, because that's the thing that's taken me the longest to program. Between minor optimizations and design choices,
- I had to completely figure the damn thing out myself. *Nothing like the actual assignment, for which I was given a list of constraints.*
- Also, since this is my first time using Python for anything wholesome, please forgive bad implementation. I'll figure it out eventually...
- > Note: I actually didn't plot the data using a pyplot loglog, but instead chose a format that looks a lot prettier.
- #PS0 Constraints
- (5 points) Write a python function create_corpus(d) that reads each .txt file
- in the directory d and returns a dictionary mapping the file name (without the directory
- name) to a string representing the entire text document. The following modules/functions
- will probably be useful.
- (5 points) Write a python function corpus_char_stats(corpus) that takes a
- dictionary corpus as the one you created in question 1, and returns a 2 element tuple in
- which element 0 is itself a tuple that contains the length (in characters) of the shortest
- file along with that file’s name. Element 1 of the returned tuple should similarly be a
- two element tuple with the length and name of the longest file in the corpus. Just count
- the characters in each file here; don’t do any fanciness. Put these lengths (and the file
- names) in the function’s doc comment.
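One way to sketch question 2, exploiting Python tuple ordering so the file name breaks length ties (a sketch, not necessarily the intended approach):

```python
def corpus_char_stats(corpus):
    """Return ((shortest_len, shortest_name), (longest_len, longest_name))."""
    # Sorting (length, name) pairs puts the shortest file first, longest last.
    lengths = sorted((len(text), name) for name, text in corpus.items())
    return (lengths[0], lengths[-1])
```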
- (5 points) Create a python function words(data) that takes the raw data string (the
- value associated with a filename key in your corpus) and: (1) splits the data string into
- a sequence of white-space delimited tokens; (2) changes all alphabetic characters (i.e.,
- a-zA-Z) to lower case (i.e., a-z); and (3) filters tokens into two lists: those containing
- only alphabetic characters, and those containing other stuff. Note that delimiters (i.e.,
- whitespace) should be discarded and should not appear in any list. (4) the words()
- function should return a tuple containing the list of alphabetic tokens and the list of
- non-alphabetic tokens.
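A possible sketch for question 3. Note that `str.isalpha()` accepts any Unicode letter, so this version uses a regex to enforce the stated a-z restriction (the regex choice is my assumption):

```python
import re

def words(data):
    """Split data on whitespace, lowercase it, and partition the tokens
    into (alphabetic_tokens, non_alphabetic_tokens)."""
    alpha, non_alpha = [], []
    for tok in data.lower().split():
        # fullmatch so tokens like "don't" or "42" land in the other list
        (alpha if re.fullmatch(r"[a-z]+", tok) else non_alpha).append(tok)
    return (alpha, non_alpha)
```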
- (5 points) Create a python function find_word_ratios(corpus) that returns a
- sorted list containing the fraction of alphabetic tokens in each file in the corpus (so each
- element in the list is a tuple with the fraction and then the file name). Use this function
- to determine the document with the lowest ratio of purely alphabetic words. Put the
- document name and the fraction of alphabetic tokens it contains into the function’s
- doc-comment.
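Question 4 might be sketched like this, reusing the `words()` logic from the previous part (inlined here so the snippet stands alone; the guard against zero-token files is my own addition):

```python
import re

def words(data):
    """Partition lowercased whitespace tokens into alphabetic vs. other."""
    alpha, non_alpha = [], []
    for tok in data.lower().split():
        (alpha if re.fullmatch(r"[a-z]+", tok) else non_alpha).append(tok)
    return (alpha, non_alpha)

def find_word_ratios(corpus):
    """Return a sorted list of (alphabetic_fraction, file_name) tuples;
    element 0 is then the document with the lowest ratio."""
    ratios = []
    for name, data in corpus.items():
        alpha, non_alpha = words(data)
        total = len(alpha) + len(non_alpha)
        ratios.append((len(alpha) / total if total else 0.0, name))
    return sorted(ratios)
```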
- (5 points) Create a python function word_frequencies(corpus) that records
- the frequency with which each alphabetic token appears in the entire corpus. That is, for each element in your corpus this function should: (1) use your words() function
- to isolate the alphabetic tokens; (2) update frequencies for all words observed in the
- document; (3) return a sorted list where each element in the list is a tuple of (frequency,
- word). Use this function to find the ten most frequently occurring words in the corpus
- and their specific frequencies. Put these results into your doc-comment.
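For question 5, `collections.Counter` handles the bookkeeping (`words()` inlined again so the sketch stands alone):

```python
import re
from collections import Counter

def words(data):
    """Partition lowercased whitespace tokens into alphabetic vs. other."""
    alpha, non_alpha = [], []
    for tok in data.lower().split():
        (alpha if re.fullmatch(r"[a-z]+", tok) else non_alpha).append(tok)
    return (alpha, non_alpha)

def word_frequencies(corpus):
    """Return a sorted list of (frequency, word) over the whole corpus."""
    counts = Counter()
    for data in corpus.values():
        counts.update(words(data)[0])  # element 0 is the alphabetic list
    return sorted((freq, word) for word, freq in counts.items())
```

Since the list is sorted ascending, the ten most frequent words are its last ten elements.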
- (5 points) Using the data from word_frequencies(corpus), create a pdf loglog
- plot of the frequencies of all words in the corpus on the y axis, and the rank on
- the x axis. Save this as loglog.pdf. For 3 points of extra credit, figure out how to do a
- least squares line fit and plot the fitted line in addition to the observed data. (Hint, use
- the scipy package).