
	<section id="lab4">

        <h2>Lab 4 Questions and Answers</h2>

		<ol>

			<li class="qandA">

				<h3>Question 1a</h3>

				<p class="question">Have a closer look at the term frequency matrix. Given this table, how well do you think the tokenizer did its job? Do you see any room for improvement?</p>

                <p class="answer">The tokenizer did what it's supposed to. It shows the frequency of each word in every chapter correctly.</p>

                <p class="answer">To improve the tokenizer, it should ignore stop words such as 'the', 'that' and 'a', as these are not descriptive of the chapter.</p>
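                <p class="answer">The stop word suggestion above can be sketched as follows. This is a minimal Python illustration, not the lab's R tokenizer; the names <code>STOP_WORDS</code>, <code>tokenize</code> and <code>term_frequencies</code> and the short stop word list are assumptions made for the example.</p>

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop word list; a real one would be longer.
STOP_WORDS = {"the", "that", "a", "an", "and", "of", "to", "in", "it"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


def term_frequencies(text):
    """Count how often each remaining (non stop word) term occurs."""
    return Counter(tokenize(text))


sample = "The Rabbit took a watch out of the pocket and the Rabbit hurried on."
print(term_frequencies(sample).most_common(3))
```

                <p class="answer">With the stop words removed, the most frequent remaining terms are content words rather than 'the' or 'a'.</p>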

			</li>

			<li class="qandA">

				<h3>Question 1b</h3>

				<p class="question">Looking at the top 30, does it make sense to include all of them in the stop words list? Explain your answer.</p>

				<p class="answer">It does not make sense to include all of them in the stop words list. Stop words should be excluded because, in most cases, they add little to the relevance of a text, but not every frequent word is a stop word. The tokenizer should be able to distinguish between actual stop words (such as 'the') and words that appear to be stop words but really are not (such as in the title 'The Who').</p>

			</li>

			<li class="qandA">

				<h3>Question 1c</h3>

				<p class="question">Looking at the top 30, do you think the frequency follows Zipf’s law? You can run <code>z &lt;- zipf(alice_index$total)</code> to find out. Explain in your own words what "following Zipf’s law" means in this case.</p>

				<p class="answer">No, the frequency does not follow Zipf's law. The second most frequent term roughly fits the prediction, but the subsequent terms deviate from it by quite a lot.</p>

				<p class="answer">Following Zipf's law means that the frequency of a term is inversely proportional to its rank. Since the most frequent word ('the') appears 1644 times, the second most frequent word should appear approximately 822 times (1644/2). The third most frequent word should appear about 548 times (1644/3), the fourth approximately 411 times (1644/4), and so forth.</p>
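				<p class="answer">The expected counts can be reproduced with a small calculation. The sketch below is in Python and independent of the lab's <code>zipf</code> function; <code>zipf_expected</code> is a hypothetical helper name, and the only input taken from the lab is the count of 1644 for 'the'.</p>

```python
def zipf_expected(top_frequency, rank):
    """Expected count at a given rank if frequency is proportional to 1/rank."""
    return top_frequency / rank


top_frequency = 1644  # count of 'the', the most frequent term

for rank in range(1, 5):
    print(rank, round(zipf_expected(top_frequency, rank)))
# prints:
# 1 1644
# 2 822
# 3 548
# 4 411
```

				<p class="answer">Comparing these expected values with the observed counts at each rank shows how far the top 30 deviates from the law.</p>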

			</li>

			<li class="qandA">

				<h3>Question 1d</h3>

				<p class="question">Similarly, run <code>h &lt;- heaps(alice_index$total)</code> to determine whether Heaps’ law applies. Explain in your own words what "Heaps’ law applies" means in this case.</p>

				<p class="answer">Yes, Heaps' law applies in this case. The curve of vocabulary size against the amount of text scanned is steep at first and then flattens out. This means that the rate at which new distinct words appear decreases as more of the text is scanned, even though the vocabulary itself keeps growing.</p>
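				<p class="answer">The flattening curve can be illustrated by tracking the vocabulary size token by token. This is a Python sketch on a made-up sentence, not the lab's <code>heaps</code> function; <code>vocabulary_growth</code> is a hypothetical helper name.</p>

```python
def vocabulary_growth(tokens):
    """Return the number of distinct terms seen after each token."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Made-up example text, chosen so that words start repeating quickly.
tokens = "the cat saw the dog and the dog saw the cat".split()
print(vocabulary_growth(tokens))
# prints: [1, 2, 3, 3, 4, 5, 5, 5, 5, 5, 5]
```

				<p class="answer">The count rises quickly over the first few tokens and then stalls as words repeat, which is the behaviour the plot shows on a larger scale.</p>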

			</li>

			<li class="qandA">

				<h3>Question 2a</h3>

				<p class="question">How do you assess the quality of the TF-based search? What can you say about precision and recall?</p>

				<p class="answer">To assess the quality of the TF-based search, we can examine the relevance of each result to the query.</p>

				<p class="answer">Precision is the fraction of the returned results that are relevant.</p>

				<p class="answer">Recall is the fraction of all relevant results that are actually returned.</p>
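				<p class="answer">Both measures can be written down in a few lines. The Python sketch below uses made-up chapter labels and relevance judgements purely for illustration; they are not results from the lab.</p>

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved results that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of all relevant results that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical search outcome: 4 chapters returned, 3 chapters truly relevant.
retrieved = ["ch1", "ch2", "ch3", "ch4"]
relevant = ["ch2", "ch4", "ch9"]

print(precision(retrieved, relevant))  # 2 of the 4 returned are relevant: 0.5
print(recall(retrieved, relevant))     # 2 of the 3 relevant were returned
```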

			</li>

			<li class="qandA">

				<h3>Question 2b</h3>

				<p class="question">Compare the TF-based search with the TFIDF-based search. What has changed? Try to find a few terms that score relatively high on TFIDF for the highest ranking document. List the three terms with the highest TFIDF score that you found, along with their score. Do you think these terms "describe" the associated Alice chapter well?</p>

				<p class="answer">When using TF-based search, the scores returned are whole numbers (raw counts), whereas when using TFIDF-based search, the scores are decimals.</p>

				<p class="answer">When using TF-based search, each term has the same weight. When using TFIDF-based search, each term receives a weight that depends on how many chapters it appears in: terms that occur in only a few chapters are weighted more heavily than terms that occur in almost every chapter.</p>

				<p class="answer">Chapter IX is about the Mock Turtle. The three highest scoring terms using TFIDF are: moral - 5.46, turtle - 4.62 and mock - 4.52.</p>

				<p class="answer">These terms seem to describe the chapter well enough. They tell us that it is about the Mock Turtle and that morals are discussed.</p>
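				<p class="answer">The difference between the two weightings can be sketched as follows. This Python example assumes the common formulation of term count times log(N / document frequency); the lab's R code may use a different variant, so the scores here will not match the values listed above. The toy chapters and the name <code>tf_idf</code> are made up for illustration.</p>

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score every term per document as count * log(N / document frequency).

    documents maps a document name to its list of tokens.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        scores[name] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Made-up toy "chapters", not the real Alice text.
docs = {
    "ch_a": ["the", "mock", "turtle", "the", "turtle"],
    "ch_b": ["the", "queen", "the", "croquet"],
    "ch_c": ["the", "rabbit", "the", "the"],
}
scores = tf_idf(docs)
print(scores["ch_a"])
```

				<p class="answer">In this toy data 'the' occurs in every chapter, so its score is 0, while 'turtle' occurs only in one chapter and ends up with the highest score there.</p>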
			</li>

			<li class="qandA">

				<h3>Question 2c</h3>

				<p class="question">Alice in Wonderland only has 11 chapters. What can you say about the precision at k, where k=5, for both types of searches? That is, what if you would only look at the 5 highest ranked results, and discard the other results as irrelevant?</p>

				<p class="answer">Using TF, the top 5 terms are not descriptive of the chapter at all, as they are almost exclusively stop words. This means that the precision is very low, close to or equal to 0.</p>

				<p class="answer">Using TFIDF, the top 5 terms are much more descriptive of the chapter, so the precision would be much higher than that of the TF search.</p>
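				<p class="answer">Precision at k only looks at the first k entries of a ranked list. A minimal Python sketch, with a made-up ranking and made-up relevance judgements (the name <code>precision_at_k</code> is an assumption for this example):</p>

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top k ranked results that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(relevant)
    hits = sum(1 for result in ranked_results[:k] if result in relevant)
    return hits / k


# Hypothetical ranking, best match first, and hypothetical relevant set.
ranked = ["ch9", "ch10", "ch3", "ch7", "ch1", "ch2"]
relevant = {"ch9", "ch10", "ch1"}

print(precision_at_k(ranked, relevant, 5))  # 3 of the top 5 are relevant: 0.6
```

				<p class="answer">Everything ranked below position k is ignored, so a relevant result at position 6 does not affect the score.</p>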

			</li>

			<li class="qandA">

				<h3>Question 2d</h3>

				<p class="question">What would happen to precision and recall if you implemented the improvements to the tokenizer you suggested in question 1a?</p>

				<p class="answer">The precision of the TF search will increase, as stop words are ignored, meaning the top 5 terms will be much more representative of the chapter.</p>

				<p class="answer">The recall of the TFIDF search will increase, since the weight of each term will be increased by some amount.</p>

			</li>

		</ol>

	</section>