MickeyLater

WEVA--Webpage English Vocabulary Assessor

Feb 9th, 2018
"""WEVA--Webpage English Vocabulary Assessor

***THIS APPLICATION IS UNDER DEVELOPMENT***

Evaluates a source text for vocabulary difficulty and returns a qualitative evaluation, related vocabulary statistics, and a list of difficult words.

Difficulty is evaluated against two reference word sets:

   :basic:         Basic English words extracted into a set from https://simple.wikipedia.org/wiki/Wikipedia:List_of_1000_basic_words

   :combined:      Combined lists of English words extracted into a set from https://simple.wikipedia.org/wiki/Wikipedia:Basic_English_combined_wordlist

The qualitative evaluation returned is one of:

   :BASIC:         At least 85% of unique words in text are found in basic words set

   :INTERMEDIATE:  Between 50% and 84% of unique words in text are found in basic words set, and at least 85% in combined words set

   :CHALLENGING:   Fewer than 50% of unique words in text are found in basic words set, and between 50% and 84% in combined words set

   :ADVANCED:      Fewer than 50% of unique words in text are found in basic words set, and fewer than 50% in combined words set

The evaluation also returns the two percentages, as well as two sets of words:

   :intermediate words:    Unique words in text found in combined words set but not in basic words set

   :challenging words:     Unique words in text found in neither the basic nor the combined words set
"""

import os
import pickle
import re
import sys
from string import ascii_uppercase, digits, punctuation

import bs4
import requests

def get_raw_words(url):
    """Retrieves a word list page and extracts its words into a set.

       :url:       URL of the webpage the words are extracted from

       :returns:   Set of words extracted from the page
    """

    raw_data = requests.get(url)
    soup = bs4.BeautifulSoup(raw_data.text, 'html.parser')
    # Keep only the tags that resolve to a single string (Tag.string is None otherwise).
    raw_strings = [item.string for item in soup.find_all() if item.string]
    word_set = set()
    for string in raw_strings:
        cleaned = ''
        for char in string:
            # Stop at the first uppercase letter or digit; the rest of the string is dropped.
            if char in ascii_uppercase or char in digits:
                break
            # Replace punctuation with a space so it acts as a word separator.
            cleaned += ' ' if char in punctuation else char
        word_set.update(cleaned.split())
    return word_set


def pickle_word_set(url, word_set):
    """Pickles a word set to disk unless its file already exists, to minimize rescraping of the same pages.

       :url:       URL of the page the word set is from

       :word_set:  Set of words from the URL

       :returns:   Status message
    """

    # Build a file name from the URL by replacing "/" and ":" with "-".
    filename = "./" + re.sub(r"[/:]", "-", url) + ".p"
    # os.path.exists checks the file system; os._exists is a private helper that does not.
    if os.path.exists(filename):
        return "File {} exists, skipping.".format(filename)
    sys.setrecursionlimit(5000)
    with open(filename, 'wb') as handle:
        pickle.dump(word_set, handle)
    return "File {} pickled successfully.".format(filename)


def set_up_reference():
    """Pickles the original basic and combined word list reference files. Executed only once per installation.

       :returns:       None
    """

    basic_words_url = "https://simple.wikipedia.org/wiki/Wikipedia:List_of_1000_basic_words"
    combined_words_url = "https://simple.wikipedia.org/wiki/Wikipedia:Basic_English_combined_wordlist"

    basic_set = get_raw_words(basic_words_url)
    status = pickle_word_set(basic_words_url, basic_set)
    print(status)
    combined_set = get_raw_words(combined_words_url)
    status = pickle_word_set(combined_words_url, combined_set)
    print(status)


# Run the one-time setup only when the file is executed as a script.
if __name__ == "__main__":
    set_up_reference()
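
The paste stops before the evaluation step described in the module docstring. The following is a minimal sketch of how that step might look, not the author's implementation. The names `classify` and `assess_vocabulary`, the regex tokenizer, the returned dictionary layout, and the `UNCLASSIFIED` fallback are all assumptions introduced here. The docstring's "between 50% and 84%" is read as "at least 50% and below 85%", and any percentage combination the docstring does not cover falls through to the fallback.

```python
import re


def classify(basic_pct, combined_pct):
    """Maps the two coverage percentages to a level from the module docstring."""
    if basic_pct >= 85:
        return "BASIC"
    if basic_pct >= 50 and combined_pct >= 85:
        return "INTERMEDIATE"
    if basic_pct < 50 and 50 <= combined_pct < 85:
        return "CHALLENGING"
    if basic_pct < 50 and combined_pct < 50:
        return "ADVANCED"
    # Assumption: the docstring leaves some combinations undefined
    # (for example 60% basic and 70% combined), so they are flagged here.
    return "UNCLASSIFIED"


def assess_vocabulary(text, basic_words, combined_words):
    """Hypothetical evaluator: returns the level, percentages, and word sets."""
    # Assumption: a word is any run of ASCII letters, compared in lowercase.
    words = set(re.findall(r"[a-z]+", text.lower()))
    if not words:
        raise ValueError("text contains no words")
    basic_pct = 100 * len(words & basic_words) / len(words)
    combined_pct = 100 * len(words & combined_words) / len(words)
    return {
        "level": classify(basic_pct, combined_pct),
        "basic_pct": basic_pct,
        "combined_pct": combined_pct,
        # In the combined set but not in the basic set.
        "intermediate_words": (words & combined_words) - basic_words,
        # In neither reference set.
        "challenging_words": words - basic_words - combined_words,
    }
```

The two reference sets would come from the pickled files written by `set_up_reference()`. Passing them in as arguments keeps the sketch independent of how and where those files are loaded.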