a guest
Oct 1st, 2014
Welcome to the Team 5 WikiStats tool.
README
The WikiStats tool analyzes page-view statistics for Wikipedia.
CONFIGURATION
To configure the tool, open "wikistats.pbs". Set the number of nodes to use for this run and the maximum run time (wall time) by modifying the following lines:
#PBS -l nodes=NO_OF_NODES:ppn=8,pmem=2gb
#PBS -l walltime=HH:MM:SS
Next, edit the command line arguments section (lines 33-37). The arguments are:
ARGSLANGS : The M most popular languages to analyze spikes for.
ARGSSPIKES : The N top page spikes to view for each of the popular languages.
ARGSODAYS : The O-day period to use for this run. The run covers every O-day period in the input data.
ARGSNODES : The number of nodes to use for this run. This is required in order to calculate the ideal number of reducers.
ARGSMAXDAYS : The number of days of data to consider.
Next, go to LOCALINPUT in line 50. Set this to the directory where the input data is stored. Ideally, this directory contains the same number of days of data as set in ARGSMAXDAYS. Any additional files in the directory will be read, but their data will not be considered.
Finally, go to LOCALOUTPUT in line 54. Set this to the directory where the output should be stored. The output is written to a single file.
For a collection of Wikistats data, see:
/lustre/cs562178/wikistats
The data in this directory is not guaranteed to be accurate or persistent.
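Before submitting a run, it can help to check how many days of data the input directory actually holds and compare that with ARGSMAXDAYS. The following is a minimal Python sketch, not part of the tool: the function name count_input_days is hypothetical, and it assumes the file naming protocol given in the INPUT section.

```python
import re

# Naming protocol from the INPUT section: pagecounts-yyyymmdd-hh0000.gz
NAME_PATTERN = re.compile(r"pagecounts-(\d{8})-\d{2}0000\.gz")


def count_input_days(filenames):
    """Count the distinct yyyymmdd dates among correctly named files.

    Hypothetical helper: files that do not follow the naming protocol
    are skipped here, which is an assumption about what counts as input.
    """
    days = set()
    for name in filenames:
        match = NAME_PATTERN.fullmatch(name)
        if match:
            days.add(match.group(1))
    return len(days)
```

To use it, pass the listing of whatever directory LOCALINPUT points to, for example count_input_days(os.listdir(input_dir)), and compare the result with the ARGSMAXDAYS value.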
EXECUTION
To run the program, execute:
chmod u+x runit.sh
./runit.sh
INPUT:
Files within the given input directory should be named according to the following protocol:
pagecounts-yyyymmdd-hh0000.gz
where yyyy is the year, mm is the month, dd is the day, and hh is the hour (24-hour clock) of the corresponding wikistats data.
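As an illustration of the naming protocol, the sketch below validates a file name and extracts the date and hour it encodes. It is written in Python for convenience and is not the tool's own code; parse_pagecounts_name is a hypothetical name.

```python
import re
from datetime import datetime

# pagecounts-yyyymmdd-hh0000.gz, as described in the naming protocol
PAGECOUNTS_NAME = re.compile(r"pagecounts-(\d{8})-(\d{2})0000\.gz")


def parse_pagecounts_name(filename):
    """Return the datetime encoded in the file name, or None if it does not match."""
    match = PAGECOUNTS_NAME.fullmatch(filename)
    if match is None:
        return None
    try:
        # Reject impossible dates or hours such as month 13 or hour 25
        return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H")
    except ValueError:
        return None
```

A name such as pagecounts-20141001-130000.gz parses to 1 October 2014, 13:00, while a name that does not follow the protocol yields None.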
Each line of an input file should follow this protocol:
ln pgName pageViews bytes
where ln is the language code, pgName is the name of the Wikipedia page, pageViews is the number of page views for that time period, and bytes is the size of the page. Language codes that are not exactly two characters long are ignored.
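The line protocol and the two-character language rule can be sketched as follows. This is an illustrative Python reading of the description, not the tool's parser. The names PageCount and parse_line are hypothetical, and returning None for malformed lines is an assumption, since the README does not say how such lines are handled.

```python
from typing import NamedTuple, Optional


class PageCount(NamedTuple):
    lang: str        # ln
    page: str        # pgName
    views: int       # pageViews
    size_bytes: int  # bytes


def parse_line(line: str) -> Optional[PageCount]:
    """Parse one 'ln pgName pageViews bytes' line; return None if it is skipped."""
    fields = line.split()
    if len(fields) != 4:
        return None  # assumption: malformed lines are skipped
    lang, page, views, size = fields
    if len(lang) != 2:
        return None  # language codes other than two characters are ignored
    try:
        return PageCount(lang, page, int(views), int(size))
    except ValueError:
        return None  # assumption: non-numeric counts are skipped
```

Under this reading, a line whose first field is "en" is kept, while one whose first field is "en.b" or "simple" is dropped.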
OUTPUT:
The output for this tool is a single file with the following format:
ln1 pageName1 pageSpike1
...
ln1 pageNameN pageSpikeN
Language: ln1 uniquePages
...
lnM pageName1 pageSpike1
...
lnM pageNameN pageSpikeN
Language: lnM uniquePages
The pages with the highest spikes for a language are output first, followed by a line giving that language and its unique pages for the dataset. This is repeated for each of the M top languages.
For sample output, see 5-10-1-5.txt. This file uses parameters of M = 5, N = 10, and O = 1, over 5 days of input data.
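For anyone post-processing the output file, the format above could be read back with a short Python sketch like the one below. The function name parse_output is hypothetical. The spike and unique-page values are kept as strings because the README does not state their numeric type, and the sketch assumes whitespace-separated fields with no spaces inside page names.

```python
def parse_output(lines):
    """Group spike rows under the 'Language:' line that follows them.

    Returns {lang: {"unique_pages": str, "spikes": [(page, spike), ...]}}.
    """
    results = {}
    pending = []  # spike rows seen since the last 'Language:' line
    for raw in lines:
        fields = raw.split()
        if len(fields) != 3:
            continue  # assumption: blank or unexpected lines are skipped
        if fields[0] == "Language:":
            _, lang, unique_pages = fields
            results[lang] = {"unique_pages": unique_pages, "spikes": pending}
            pending = []
        else:
            _, page, spike = fields
            pending.append((page, spike))
    return results
```

It accepts any iterable of lines, so an open file handle for the output file can be passed directly.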