#No surprise, it's borked!

This is a little side project I've been working on for the last few weeks, because I decided I wanted to create a dynamic test case system for a friend of mine's machine learning class homework. Needless to say, I've had my hands full and I'm kinda tired of working on the f\*\*ker! It's a little broken; the ps0 assignment that I finished should be somewhere along the lines of complete, save for a few things. ***It works for the original test cases but not my randomly generated ones***, and that could be either because I fucked up the generator somehow or it doesn't cover every string case...

Most likely it's a generator error, because that's the thing that's taken me the longest to program. Between minor optimizations and design choices, I had to completely figure the damn thing out myself. *Nothing like the actual assignment, for which I was given a list of constraints.*

Also, since this is my first time using Python for anything wholesome, please forgive the bad implementation. I'll figure it out eventually...

> Note: I actually didn't plot the data using a pyplot loglog, but instead chose a format that looks a lot prettier.

#PS0 Constraints

(5 points) Write a python function create_corpus(d) that reads each .txt file in the directory d and returns a dictionary mapping the file name (without the directory name) to a string representing the entire text document. The following modules/functions will probably be useful.

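For what it's worth, a bare-bones sketch of the shape I think this wants (I'm guessing the "useful modules/functions" it alludes to are os.listdir and os.path.join; this isn't my actual submission):

```python
import os

def create_corpus(d):
    """Map each .txt file name in directory d to the file's full text."""
    corpus = {}
    for name in os.listdir(d):
        if name.endswith(".txt"):
            with open(os.path.join(d, name)) as f:
                corpus[name] = f.read()
    return corpus
```
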
(5 points) Write a python function corpus_char_stats(corpus) that takes a dictionary corpus like the one you created in question 1, and returns a 2 element tuple in which element 0 is itself a tuple that contains the length (in characters) of the shortest file along with that file’s name. Element 1 of the returned tuple should similarly be a two element tuple with the length and name of the longest file in the corpus. Just count the characters in each file here; don’t do any fanciness. Put these lengths (and the file names) in the function’s doc comment.

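Sorting (length, name) pairs gets both extremes in one pass, so a rough sketch can be pretty short:

```python
def corpus_char_stats(corpus):
    """Return ((shortest_len, shortest_name), (longest_len, longest_name))."""
    lengths = sorted((len(text), name) for name, text in corpus.items())
    return (lengths[0], lengths[-1])
```
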
(5 points) Create a python function words(data) that takes the raw data string (the value associated with a filename key in your corpus) and: (1) splits the data string into a sequence of white-space delimited tokens; (2) changes all alphabetic characters (i.e., a-zA-Z) to lower case (i.e., a-z); and (3) filters tokens into two lists: those containing only alphabetic characters, and those containing other stuff. Note that delimiters (i.e., whitespace) should be discarded and should not appear in any list. (4) the words() function should return a tuple containing the list of alphabetic tokens and the list of non-alphabetic tokens.

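Something along these lines should cover it; I check a-z by hand rather than with str.isalpha(), since isalpha() also accepts accented letters the spec doesn't mention:

```python
def words(data):
    """Return (alphabetic_tokens, non_alphabetic_tokens) from a raw string."""
    alpha, non_alpha = [], []
    for token in data.lower().split():           # split() discards all whitespace
        if all('a' <= c <= 'z' for c in token):  # strictly a-z, per the spec
            alpha.append(token)
        else:
            non_alpha.append(token)
    return (alpha, non_alpha)
```
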
(5 points) Create a python function find_word_ratios(corpus) that returns a sorted list containing the fraction of alphabetic tokens in each file in the corpus (so each element in the list is a tuple with the fraction and then the file name). Use this function to determine the document with the lowest ratio of purely alphabetic words. Put the document name and the fraction of alphabetic tokens it contains into the function’s doc-comment.

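A quick sketch building on words() above (it assumes no file tokenizes to zero tokens, or the division blows up):

```python
def find_word_ratios(corpus):
    """Return a sorted list of (alphabetic_fraction, file_name) tuples."""
    ratios = []
    for name, text in corpus.items():
        alpha, non_alpha = words(text)
        total = len(alpha) + len(non_alpha)    # assumes total > 0
        ratios.append((len(alpha) / total, name))
    return sorted(ratios)                      # lowest ratio first
```

The lowest-ratio document is then just find_word_ratios(corpus)[0].
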
(5 points) Create a python function word_frequencies(corpus) that records the frequency with which each alphabetic token appears in the entire corpus. That is, for each element in your corpus this function should: (1) use your words() function to isolate the alphabetic tokens; (2) update frequencies for all words observed in the document; (3) return a sorted list where each element in the list is a tuple of (frequency, word). Use this function to find the ten most frequently occurring words in the corpus and their specific frequencies. Put these results into your doc-comment.

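collections.Counter does the bookkeeping for free; a sketch:

```python
from collections import Counter

def word_frequencies(corpus):
    """Return a sorted list of (frequency, word) tuples over the whole corpus."""
    counts = Counter()
    for text in corpus.values():
        alpha, _ = words(text)   # only alphabetic tokens count
        counts.update(alpha)
    return sorted((freq, word) for word, freq in counts.items())
```

The ten most frequent words then sit at the tail: word_frequencies(corpus)[-10:].
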
(5 points) Using the data from word_frequencies(corpus), create a pdf loglog plot of the frequencies of all words in the corpus on the y axis, and the rank on the x axis. Save this as loglog.pdf. For 3 points of extra credit, figure out how to do a least squares line fit and plot the fitted line in addition to the observed data. (Hint, use the scipy package).
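
Since I said above that I skipped the vanilla pyplot loglog for a prettier format, here's roughly what the plain version the assignment describes would look like, least-squares fit included (a sketch, not what I turned in):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_loglog(freq_list, out="loglog.pdf"):
    """freq_list is the sorted (frequency, word) list from word_frequencies()."""
    freq = np.array([f for f, _ in freq_list])[::-1]  # rank 1 = most frequent
    rank = np.arange(1, len(freq) + 1)
    plt.loglog(rank, freq, '.', label="observed")
    # extra credit: least-squares fit in log-log space (a Zipf-style power law)
    slope, intercept = stats.linregress(np.log10(rank), np.log10(freq))[:2]
    plt.loglog(rank, 10**intercept * rank**slope, label="least squares fit")
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.legend()
    plt.savefig(out)
```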