readable

As part of my mission to discover hidden gold in the Steemit stream, I have been writing tools to help sift the nuggets from the mud.

It turns out there are existing algorithms that can make life easier. One of them is "Readability Scoring".

What is a Readability Score?
There are tons of options, from US "grade level" scores (e.g. your content could be read by a Grade 5 student) through to highly complex math that is way over my head.

What I want is something that essentially says "This text would be easily read by half of the audience". It is just one filter out of many, so it does not need to be perfect.

Python to the rescue
# these modules help us do the script
import html2text
import requests
import sys
First, we bring in the basic imports that provide the scaffolding for the script.

We need to:

specify the URL

load content from the specified URL

remove HTML from the web page that we load

Then we need to pass it through the function that will give us the readability score. That function lives in the textstat library:

# this is the important library that actually does the work
from textstat.textstat import textstat
If we don't strip the HTML we will confuse the function, so we need to make sure even hyperlinks are removed:

# create the html2text converter that does the stripping
h = html2text.HTML2Text()

# we don't want to include even link tags
h.ignore_links = True

# function to remove html - could be more robust
def remove_html(in_text):

    return h.handle(in_text)

For simplicity, I am specifying the URL via a command line argument for now. Later it will be part of my Discord bot.

We read the argument, then fetch the page from the web using Requests.

# grab whatever content was specified in the command line
url = sys.argv[1]

# we need to strip html tags
test_string = remove_html(requests.get( url ).text)

# show what we grabbed
print( test_string )
print()
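
One thing to watch: reading `sys.argv[1]` directly raises an `IndexError` when no URL is supplied. A minimal sketch of a guard, using only the standard library, might look like the following. The helper name `get_url_from_args` is my own invention for illustration, and the check for an http/https scheme is an assumption about what counts as a usable URL here.

```python
import sys
from urllib.parse import urlparse


def get_url_from_args(argv):
    """Return the URL given as the first command line argument.

    Hypothetical helper: exits with a message if the argument is
    missing or does not look like an http(s) URL.
    """
    if len(argv) < 2:
        raise SystemExit("usage: pass the URL as the first argument")

    url = argv[1]
    parsed = urlparse(url)

    # assumption: only http and https pages are worth fetching
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise SystemExit("not a usable URL: " + url)

    return url
```

In the script, `url = get_url_from_args(sys.argv)` could take the place of reading `sys.argv[1]` directly.
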
All that remains is to run it through the test!

# So how readable is it?
print( str(textstat.flesch_reading_ease(test_string)) + " /100" )
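
For reference, the Flesch reading ease score is built from two averages: words per sentence and syllables per word. The sketch below applies the standard formula to counts you already have rather than to raw text, since counting syllables reliably is the hard part that textstat takes care of. The function names and the cut-off of 60 are my own assumptions for illustration, not part of textstat.

```python
def flesch_reading_ease_from_counts(words, sentences, syllables):
    """Standard Flesch reading ease formula, applied to pre-computed counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("need at least one word and one sentence")

    words_per_sentence = words / sentences
    syllables_per_word = syllables / words

    # higher scores mean easier text; the result is not strictly
    # limited to the 0-100 range for very simple or very dense text
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def is_easy_enough(score, minimum=60.0):
    """Hypothetical filter: treat scores at or above `minimum` as readable.

    Assumption: 60 is used because scores of roughly 60-70 are
    conventionally described as plain English.
    """
    return score >= minimum


# example: 100 words in 5 sentences with 150 syllables
score = flesch_reading_ease_from_counts(100, 5, 150)
print(round(score, 3))
print(is_easy_enough(score))
```

Long sentences and long words both pull the score down, which is why a simple cut-off like this can work as one rough filter among many.
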