- ## Phrase Detection
- To provide an arguably more efficient and useful index of the webpage, the pipeline could have included phrase detection during processing. Multi-word expressions would then be indexed as single terms rather than as unrelated words.
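- As a rough illustration of the idea, the sketch below treats any adjacent word pair that repeats as a candidate phrase. The function name `detect_phrases` and the `min_count` threshold are assumptions made for this example, not part of the original pipeline. NLTK's collocation finders (for example `BigramCollocationFinder`) offer more principled scoring.

```python
from collections import Counter


def detect_phrases(tokens, min_count=2):
    """Return adjacent word pairs that occur at least `min_count` times."""
    # Pair each token with its successor to form bigrams.
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {" ".join(pair): n for pair, n in bigrams.items() if n >= min_count}


# Hypothetical token stream standing in for a processed webpage.
tokens = (
    "information retrieval systems rank pages and "
    "information retrieval relies on an index"
).split()
phrases = detect_phrases(tokens)
```

- Here only "information retrieval" repeats, so it is the only pair kept. A raw count threshold is crude and would also pick up frequent stop-word pairs on longer text, so stop-word filtering would normally run first.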
- ---
- ## TF*IDF/Ranking
- Rather than indexing terms only by how often they appear within the webpage, the pipeline could have weighted them using TF*IDF. However, this might not have been effective in this circumstance, as very few pages are being processed and the IDF component has little to distinguish between. Words in headings or in bold could also have been given a higher weight, as they might be deemed more important or relevant than others. This could have been done by applying a multiplier to any word that appears in a heading, in bold, and so on.
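- A minimal sketch of this weighting follows, assuming pages are already tokenised. The function name `tf_idf`, the `boosted` mapping of emphasised terms and the `boost` value of 2.0 are all assumptions for illustration; the original pipeline only recorded term frequencies.

```python
import math
from collections import Counter


def tf_idf(docs, boosted=None, boost=2.0):
    """Score terms per page.

    docs:    {page_id: [tokens]}
    boosted: {page_id: set of terms found in headings or bold text}
    """
    boosted = boosted or {}
    n_docs = len(docs)
    # Document frequency: number of pages that contain each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))

    scores = {}
    for page, tokens in docs.items():
        counts = Counter(tokens)
        scores[page] = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / df[term])
            weight = tf * idf
            # Multiplier for terms that were emphasised on this page.
            if term in boosted.get(page, set()):
                weight *= boost
            scores[page][term] = weight
    return scores


# Hypothetical two-page corpus.
docs = {
    "a": ["search", "engine", "index"],
    "b": ["search", "ranking", "ranking"],
}
scores = tf_idf(docs, boosted={"a": {"index"}})
```

- With only two pages, "search" appears in both and its weight collapses to zero, which shows why IDF is of limited use on a very small collection. The boosted term "index" ends up with twice the weight of "engine", which has the same frequency but no emphasis.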
- ---
- ## Named Entity Recognition
- The pipeline could also have split the terms into categories such as names, places and dates. NLTK, the library used for this assignment, supports Named Entity Recognition through `ne_chunk` (named-entity chunking). This may have provided a more structured and more useful index of terms.
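- One way this might look is sketched below. The helper names `group_entities` and `extract_entities` are assumptions for this example. The NLTK calls require the tokenizer, tagger and chunker data to be downloaded first, so the import is kept inside the function.

```python
from collections import defaultdict


def group_entities(tree):
    """Group named-entity chunks by label (e.g. PERSON, GPE).

    `tree` is expected to look like the output of nltk.ne_chunk: an iterable
    whose items are either (word, pos_tag) tuples or subtrees that expose
    .label() and iterate over (word, pos_tag) leaves.
    """
    groups = defaultdict(list)
    for node in tree:
        if hasattr(node, "label"):  # a named-entity subtree, not a plain token
            groups[node.label()].append(" ".join(word for word, _tag in node))
    return dict(groups)


def extract_entities(text):
    """Tokenise, POS-tag and chunk `text`, then group the entities found."""
    # Imported lazily: needs NLTK and its downloaded models.
    import nltk

    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return group_entities(nltk.ne_chunk(tagged))
```

- The grouped result could then be stored alongside the term index, for example one list per entity label. Which labels appear depends on the NLTK model in use.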