Feb 22nd, 2018
## Phrase Detection
To provide an arguably more efficient and useful index of the webpage, the pipeline could have included phrase detection during processing, so that multi-word expressions are indexed as single terms rather than as unrelated words.
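
As a rough illustration of the idea, the sketch below treats any adjacent word pair that recurs as a candidate phrase. It is not the assignment's code: the function name `detect_phrases`, the `min_count` threshold and the tiny `STOPWORDS` set are all assumptions made for the example, and it uses only the standard library. NLTK's `nltk.collocations` module offers more principled scoring.

```python
from collections import Counter

# Illustrative stopword list (an assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "is", "of", "the"}


def detect_phrases(tokens, min_count=2):
    """Return {phrase: count} for adjacent word pairs seen at least min_count times."""
    pairs = Counter(
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        # Skip pairs containing a stopword, e.g. ("york", "is").
        if first not in STOPWORDS and second not in STOPWORDS
    )
    return {" ".join(pair): n for pair, n in pairs.items() if n >= min_count}


tokens = "new york is big and new york is busy".split()
phrases = detect_phrases(tokens)
print(phrases)  # {'new york': 2}
```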

---
## TF*IDF/Ranking
Rather than indexing terms only by how often they appear within the webpage, the pipeline could have weighted them using TF*IDF. However, this may not have been worthwhile here, as very few pages are processed and document frequencies over so small a collection carry little information. Terms in headings or in bold could also have been given a higher weight, as these words may be deemed more important or relevant than others. This could have been done by applying a multiplier to any word that appears in a heading, in bold, etc.
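
The weighting described above might look something like the following sketch. It assumes each page is already tokenised, uses the common `tf * log(N / df)` form of TF*IDF, and takes a hypothetical `emphasised` mapping of page name to the set of terms found in headings or bold. The names `tf_idf`, `emphasised` and `boost` are invented for the example. With only two pages, a term that appears in both gets a weight of zero, which shows how coarse IDF is on a very small collection.

```python
import math
from collections import Counter


def tf_idf(docs, emphasised=None, boost=2.0):
    """Weight terms per page.

    docs:       {page_name: [token, ...]}
    emphasised: {page_name: {terms seen in headings or bold}} (hypothetical input)
    boost:      multiplier applied to emphasised terms (assumed value)
    """
    emphasised = emphasised or {}
    n_docs = len(docs)

    # Document frequency: the number of pages each term appears in.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        marked = emphasised.get(name, set())
        weights[name] = {
            term: (count / len(tokens))
            * math.log(n_docs / df[term])
            * (boost if term in marked else 1.0)
            for term, count in counts.items()
        }
    return weights


docs = {"a": ["cat", "sat", "cat"], "b": ["dog", "sat"]}
weights = tf_idf(docs, emphasised={"a": {"cat"}})
```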

---
## Named Entity Recognition
The pipeline could also have classified terms into categories such as names, places and dates. NLTK, the library used for this assignment, supports Named Entity Recognition through `ne_chunk` (named-entity chunking). This may have produced a more structured and more useful index of terms.
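
A minimal sketch of how `ne_chunk` output could be grouped by entity type is shown below. The helper names `group_entities` and `extract_entities` are assumptions made for the example, not part of NLTK or of the assignment. `extract_entities` needs NLTK plus its tokenizer, tagger and chunker data, which are fetched with `nltk.download` under resource names that vary between NLTK versions. `group_entities` relies only on the shape of the chunked result: entity subtrees expose `label()` and `leaves()`, while other tokens are plain `(word, tag)` tuples.

```python
from collections import defaultdict


def group_entities(chunked):
    """Group entity chunks by label, e.g. {"PERSON": ["Ada Lovelace"]}."""
    entities = defaultdict(list)
    for node in chunked:
        # Entity chunks are subtrees with a label; ordinary tokens are tuples.
        if hasattr(node, "label"):
            phrase = " ".join(word for word, _tag in node.leaves())
            entities[node.label()].append(phrase)
    return dict(entities)


def extract_entities(text):
    """Tokenise, POS-tag and chunk the text with NLTK, then group the entities."""
    # Imported here so group_entities stays usable without NLTK installed.
    import nltk

    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    return group_entities(nltk.ne_chunk(tagged))
```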