
## Phrase Detection
To provide an arguably more efficient and useful index of the webpage, the pipeline could have included phrase detection during processing, so that multi-word expressions are indexed as single terms rather than as unrelated words.

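
As a rough illustration of the idea, the sketch below counts adjacent word pairs and keeps those that repeat. It is not what the pipeline does: the function name `detect_phrases`, the `min_count` threshold and the tiny `STOP_WORDS` set are all assumptions made for the example. NLTK also provides collocation finders in `nltk.collocations` that could serve the same purpose.

```python
from collections import Counter

# Illustrative stop-word list only; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "and", "a", "of", "to", "in"}


def detect_phrases(tokens, min_count=2, stop_words=STOP_WORDS):
    """Return two-word phrases that occur at least min_count times.

    Pairs containing a stop word are ignored so that sequences such as
    "the search" are not reported as phrases.
    """
    pairs = zip(tokens, tokens[1:])
    counts = Counter(
        pair for pair in pairs
        if pair[0] not in stop_words and pair[1] not in stop_words
    )
    return {" ".join(pair): n for pair, n in counts.items() if n >= min_count}


tokens = "the search engine indexes pages and the search engine ranks pages".split()
phrases = detect_phrases(tokens)
print(phrases)
```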
---

## TF*IDF/Ranking
Rather than indexing terms by raw frequency alone, the pipeline could have weighted them using TF*IDF. However, this might not have been effective here: IDF is estimated from the whole collection, and very few pages are being processed. Headings and words in bold could also have been given a higher weight, as these words might be deemed more important or relevant than others. This could have been done by applying a multiplier to any word that appears in a heading or in bold.

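
The weighting described above could be sketched roughly as follows. This is a minimal example under stated assumptions, not the pipeline's code: `tf_idf` and `apply_emphasis` are hypothetical names, the IDF variant used is the plain `log(N / df)`, and the multiplier of 2.0 is arbitrary. With only two pages, any term that appears on both receives a weight of zero, which illustrates why so small a collection limits the usefulness of IDF.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF*IDF weights.

    docs maps a page name to its list of tokens. The result maps each
    page name to a {term: weight} dictionary, where
    weight = (count / page length) * log(number of pages / pages containing term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per page

    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


def apply_emphasis(term_weights, emphasised, multiplier=2.0):
    """Scale the weight of terms that appeared in headings or bold text."""
    return {
        term: weight * multiplier if term in emphasised else weight
        for term, weight in term_weights.items()
    }


# Hypothetical two-page collection.
docs = {
    "page_a": ["python", "index", "python"],
    "page_b": ["index", "ranking"],
}
weights = tf_idf(docs)
boosted = apply_emphasis(weights["page_a"], emphasised={"python"})
```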
---
## Named Entity Recognition
The pipeline could also have classified terms into categories such as names, places and dates. NLTK, the library used for this assignment, is capable of Named Entity Recognition using `ne_chunk` (named-entity chunking). This may have provided a more structured and more useful index of terms.
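
A minimal sketch of how `ne_chunk` might be used is shown below. The helper names `extract_entities` and `named_entities` are assumptions for the example, not part of NLTK or of the pipeline. `named_entities` also assumes that the NLTK tokenizer, tagger and chunker data have already been fetched with `nltk.download`; the exact resource names depend on the installed NLTK version.

```python
def extract_entities(tree):
    """Collect (entity text, label) pairs from a chunk tree.

    Accepts any iterable whose items are either plain (word, tag) tuples
    or subtrees exposing label() and leaves(), as nltk.Tree does.
    """
    entities = []
    for node in tree:
        if hasattr(node, "label"):  # plain (word, tag) tuples have no label
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


def named_entities(text):
    """Tokenise, POS-tag and chunk text with NLTK, then list its entities."""
    import nltk  # imported here so extract_entities works without NLTK

    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    return extract_entities(nltk.ne_chunk(tagged))
```

Calling `named_entities` on a page's text would return a list of pairs that could be stored alongside the term index. The labels and their accuracy depend on NLTK's bundled chunker model.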