Captain_Chen

Assignment

Oct 1st, 2013
  1. Objective: The purpose of assignment 1 is to write programs that take a collection of documents and generate its inverted index.
  2.  
  3. Test collection:
  4.  
  5. You will use the CACM collection, a standard test collection for IR research. It contains titles and abstracts from the journal Communications of the ACM (CACM) and covers articles published between 1958 and 1979. There are 3204 documents and 10446 terms in total. The main file you need for this assignment is cacm.all, which contains the text of the documents. You should keep the information from the following fields: .I (document ID), .T (title), .W (abstract), .B (publication date), and .A (author list). The terms are extracted from the title and the abstract.
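
As a rough illustration of reading the fields listed above, here is a minimal Python sketch. It assumes that each field in cacm.all starts on a line of its own consisting of a dot and one capital letter (with the ID on the same line as .I), and that field contents follow on the next lines. Check this against the real file before relying on it. The names parse_cacm and MARKER are made up for this example.

```python
import re

# Assumed layout: a field marker is a dot plus one capital letter at line start,
# optionally followed by a value on the same line (used here for ".I <id>").
MARKER = re.compile(r"^\.([A-Z])(?:\s+(.*))?$")

def parse_cacm(lines):
    """Return one dict per document with the fields I, T, W, B and A."""
    docs = []
    current = None
    field = None
    for raw in lines:
        line = raw.rstrip("\n")
        m = MARKER.match(line)
        if m:
            field = m.group(1)
            if field == "I":
                # A new .I line starts a new document.
                current = {"I": int(m.group(2)), "T": "", "W": "", "B": "", "A": []}
                docs.append(current)
        elif current is not None and line.strip():
            if field == "A":
                current["A"].append(line.strip())      # one author per line
            elif field in ("T", "W", "B"):
                current[field] = (current[field] + " " + line.strip()).strip()
            # Lines under any other field are skipped.
    return docs

# Invented sample in the assumed layout (not taken from the real collection).
SAMPLE = """.I 1
.T
A Sample Title
Continued on a Second Line
.W
A short abstract.
.B
CACM January, 1960
.A
Doe, J.
Roe, R.
.N
ignored bookkeeping line
.I 2
.T
Another Title
"""

docs = parse_cacm(SAMPLE.splitlines())
```

With the real file, the same function could be called as parse_cacm(open("cacm.all")), assuming the file sits in the working directory.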
  6.  
  7. Requirements:
  8.  
  9. 1. You need to write a program, invert, to do the index construction. The input to the program is the document collection. The output consists of two files: a dictionary file and a postings lists file. Each entry in the dictionary should include a term, its document frequency, and a link to its postings list. You should use a proper data structure to build the dictionary (e.g. a hash table, a search tree, or another structure) that supports efficient random lookup and insertion of new terms. All the terms should be sorted in alphabetical order. The postings list for each term should include a posting for every document the term occurs in, ordered by document ID. Each posting stores the document ID, the term frequency in the document, and the positions of all occurrences of the term in the document.
  10.  
  11. 2. You should have a component for stop word removal, using the stop word list provided in the CACM collection or a shorter stop word list. You should also have a stemming component implemented using Porter's Stemming algorithm or other stemming algorithms. Please make sure these two components can be turned on or off when you run the program.
  12.  
  13. 3. You need to write a second program, test, to test your inverting program. The inputs to the program are the two files generated by the previous program, invert. It then repeatedly asks the user to type in a single term. If the term occurs in at least one document in the collection, the program should display the document frequency and all the documents that contain the term. For each document, it should display the document ID, the title, the term frequency, all the positions at which the term occurs in that document, and a summary of the document highlighting the first occurrence of the term with 10 terms in its context. When the user types in the term ZZEND, the program will stop (this requirement applies only when your program doesn't have a graphical interface). Each time the user types in a valid term, the program should also output the time elapsed from getting the user input to outputting the results. Finally, when the program stops, the average of these times should also be displayed.
  14.  
  15. 4. Write a brief report (one or two pages) to describe the algorithms and data structures you have used for the first program. The report should also include instructions on how to run your programs.
  16.  
  17. 5. You can choose one of the following languages: Java, C++, C#, Perl, Python or Ruby.
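
One possible shape for invert (requirements 1 and 2 above), sketched in Python under several assumptions: a hash table (dict) serves as the in-memory dictionary, positions are 1-based token offsets counted before stop word removal, and the "link" to a postings list is a byte offset into the postings file. The file layout, the tiny stop word list, and every function name here are invented for illustration. Note that toy_stem is only a placeholder and is not Porter's algorithm.

```python
import json
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is", "for"}  # stand-in list

def toy_stem(term):
    """Placeholder suffix stripper; a real solution would use Porter's algorithm."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def tokenize(text, use_stop=True, use_stem=True):
    """Yield (position, term); positions count every token, removed or not."""
    for pos, tok in enumerate(re.findall(r"[a-z0-9]+", text.lower()), start=1):
        if use_stop and tok in STOP_WORDS:
            continue
        yield pos, (toy_stem(tok) if use_stem else tok)

def build_index(docs, use_stop=True, use_stem=True):
    """docs: iterable of (doc_id, text). Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in sorted(docs):
        for pos, term in tokenize(text, use_stop, use_stem):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def write_index(index, dict_path, post_path):
    """Write terms in alphabetical order; each dictionary line is term, df, offset."""
    with open(dict_path, "w") as d, open(post_path, "wb") as p:
        for term in sorted(index):
            # One posting per document: [doc_id, term_frequency, positions].
            postings = [[doc_id, len(pos), pos]
                        for doc_id, pos in sorted(index[term].items())]
            offset = p.tell()  # byte offset where this postings list starts
            p.write((json.dumps(postings) + "\n").encode("utf-8"))
            d.write(f"{term}\t{len(postings)}\t{offset}\n")

sample_docs = [(2, "Indexing the index"), (1, "An index of terms")]
index = build_index(sample_docs)                                  # both components on
raw_index = build_index(sample_docs, use_stop=False, use_stem=False)  # both off
```

The two boolean parameters are one way to make stop word removal and stemming switchable at run time; in a full program they could be wired to command-line flags.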
  18.  
  19. Note: You can write more than two programs and generate more files if necessary.
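
For the test program in requirement 3, the query loop, context summary, and timing could look roughly like the Python sketch below. To keep the example short it works on in-memory structures rather than the two files: index is assumed to map a term to {doc_id: [positions]} and docs to map a doc_id to (title, tokens). Both layouts and all function names are hypothetical. The "10 terms in its context" requirement is read here as a window of up to 10 tokens around the first occurrence, which is only one possible interpretation.

```python
import time

def context(tokens, position, width=10):
    """Return up to `width` tokens around a 1-based position, bracketing the hit."""
    hit = position - 1
    start = max(0, hit - width // 2)
    end = min(len(tokens), start + width)
    return " ".join(
        f"[{tok}]" if i == hit else tok
        for i, tok in enumerate(tokens[start:end], start)
    )

def describe(term, index, docs):
    """Format the answer for one term, or return None if it is not indexed."""
    postings = index.get(term)
    if not postings:
        return None
    lines = [f"document frequency: {len(postings)}"]
    for doc_id in sorted(postings):
        title, tokens = docs[doc_id]
        positions = postings[doc_id]
        lines.append(f"doc {doc_id} | {title} | tf={len(positions)} | positions={positions}")
        lines.append("  " + context(tokens, positions[0]))
    return "\n".join(lines)

def query_loop(index, docs, read=input, write=print):
    """Prompt until ZZEND; time each valid query and report the average at the end."""
    timings = []
    while True:
        term = read("term> ").strip()
        if term == "ZZEND":
            break
        started = time.perf_counter()          # clock starts once input is received
        answer = describe(term.lower(), index, docs)
        if answer is None:
            write("term not found")
            continue                           # only valid terms are timed
        write(answer)
        elapsed = time.perf_counter() - started
        timings.append(elapsed)
        write(f"time: {elapsed:.6f} s")
    if timings:
        write(f"average time: {sum(timings) / len(timings):.6f} s")
    return timings
```

Passing read and write as parameters is a convenience for testing the loop without a terminal; calling query_loop(index, docs) with the defaults uses the console.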