readable

a guest
Feb 19th, 2018
As part of my mission to discover hidden gold in the Steemit stream, I have been writing tools to help sift the nuggets from the mud.

It turns out there are existing algorithms that can make life easier. One of them is "Readability Scoring".

What is a Readability Score?
There are tons of options, from US "grade level" scores (e.g. your content could be read by a Grade 5 student) through to highly complex math that is way over my head.

What I want is something that essentially says "This text would be easily read by half of the audience". It is just one filter out of many, so it does not need to be perfect.

Python to the rescue

# these modules help us build the script
import html2text
import requests
import sys

First, we get the basic imports that will help us build the scaffolding for the script.

We need to:

- specify the URL
- load content from the specified URL
- remove HTML from the web page that we load

Then we need to pass the text through the function that will give us the readability score. That function lives in the textstat library:

# this is the important library that actually does the work
from textstat.textstat import textstat

Without stripping the HTML we will confuse the function, so we need to ensure even hyperlinks are removed:

# create the converter that turns HTML into plain text
h = html2text.HTML2Text()

# we do not want to include even link tags
h.ignore_links = True

# function to remove html - could be more robust
def remove_html(in_text):
    return h.handle(in_text)

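
To make the "strip the HTML" step a little more concrete, here is a rough sketch of the same idea using only Python's standard library. This is not what html2text does internally; the names TextOnly and strip_tags are made up for illustration, and the approach is deliberately crude (it keeps text nodes and throws away everything else).

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # text inside these tags is code, not readable content
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_tags(in_text):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextOnly()
    parser.feed(in_text)
    parser.close()
    # collapse the runs of whitespace left behind by the removed markup
    return " ".join("".join(parser.parts).split())


print(strip_tags('<p>Read <a href="https://example.com">this</a> now.</p>'))
```

Unlike html2text, this sketch flattens everything onto one line, so it loses paragraph breaks. That may matter for a readability score, because sentence and paragraph boundaries feed into the calculation.
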
For simplicity, I am specifying the URL via a command line argument for now. Later it will be part of my Discord bot.

We pass in the argument, then grab the page from the web using Requests.

# grab whatever URL was specified on the command line
url = sys.argv[1]

# fetch the page, then strip the html tags
test_string = remove_html(requests.get(url).text)

# show what we grabbed
print(test_string)
print()

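
Reading sys.argv[1] directly raises an IndexError if the script is run without an argument. A minimal sketch of a more forgiving version is below; the function name url_from_args and the script name readable.py are assumptions for the example, not part of the original script.

```python
from urllib.parse import urlparse


def url_from_args(argv):
    """Return the URL passed as the first argument, or raise ValueError."""
    if len(argv) < 2:
        # readable.py is a hypothetical script name used only in this message
        raise ValueError("usage: readable.py <url>")
    url = argv[1].strip()
    parts = urlparse(url)
    # only accept web addresses, since the next step fetches the page over HTTP
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("not an http(s) URL: " + url)
    return url


# in the real script the list would be sys.argv
print(url_from_args(["readable.py", "https://steemit.com/"]))
```

In the script itself this would replace the bare sys.argv[1] lookup, with the ValueError caught and printed as a usage message.
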
All that remains is to run it through the test!

# So how readable is it?
# the score is a number, so convert it before joining it to the label
print(str(textstat.flesch_reading_ease(test_string)) + " /100")
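
For anyone curious what that number means, the standard Flesch Reading Ease formula is simple arithmetic on three counts: words, sentences and syllables. The sketch below applies the published formula to counts you supply yourself. The function name flesch_score is made up for this example, and textstat does its own counting of sentences and syllables, so its result for a real page will not necessarily match a hand count.

```python
def flesch_score(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease from raw counts (higher means easier to read)."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    # long sentences and long words both pull the score down
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# example counts: 100 words in 5 sentences, 150 syllables in total
print(round(flesch_score(100, 5, 150), 1))
```

With those example counts the score comes out at about 59.6. Scores are usually read on a 0 to 100 scale, but the formula itself is not capped, so very simple text can land above 100 and very dense text can go below 0.
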