- crawl started in: crawl
- rootUrlDir = urls
- threads = 10
- depth = 3
- topN = 2
- Injector: starting
- Injector: crawlDb: crawl/crawldb
- Injector: urlDir: urls
- Injector: Converting injected urls to crawl db entries.
- Injector: Merging injected urls into crawl db.
- Injector: done
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: crawl/segments/20100317120411
- Generator: filtering: true
- Generator: topN: 2
- Generator: jobtracker is 'local', generating exactly one partition.
- Generator: Partitioning selected urls by host, for politeness.
- Generator: done.
- Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
- Fetcher: starting
- Fetcher: segment: crawl/segments/20100317120411
- Fetcher: threads: 10
- QueueFeeder finished: total 1 records.
- fetching http://thestar.com.my/
- -finishing thread FetcherThread, activeThreads=9
- -finishing thread FetcherThread, activeThreads=8
- -finishing thread FetcherThread, activeThreads=7
- -finishing thread FetcherThread, activeThreads=6
- -finishing thread FetcherThread, activeThreads=5
- -finishing thread FetcherThread, activeThreads=4
- -finishing thread FetcherThread, activeThreads=3
- -finishing thread FetcherThread, activeThreads=2
- -finishing thread FetcherThread, activeThreads=1
- -finishing thread FetcherThread, activeThreads=0
- -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
- -activeThreads=0
- Fetcher: done
- CrawlDb update: starting
- CrawlDb update: db: crawl/crawldb
- CrawlDb update: segments: [crawl/segments/20100317120411]
- CrawlDb update: additions allowed: true
- CrawlDb update: URL normalizing: true
- CrawlDb update: URL filtering: true
- CrawlDb update: Merging segment data into db.
- CrawlDb update: done
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: crawl/segments/20100317120421
- Generator: filtering: true
- Generator: topN: 2
- Generator: jobtracker is 'local', generating exactly one partition.
- Generator: Partitioning selected urls by host, for politeness.
- Generator: done.
- Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
- Fetcher: starting
- Fetcher: segment: crawl/segments/20100317120421
- Fetcher: threads: 10
- QueueFeeder finished: total 2 records.
- fetching http://thestar.com.my/news/story.asp?file=/2010/3/17/nation/5873356&sec=nation
- -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
- * queue: http://203.115.194.20
- maxThreads = 1
- inProgress = 0
- crawlDelay = 1000
- minCrawlDelay = 0
- nextFetchTime = 1268798666076
- now = 1268798665960
- 0. http://thestar.com.my/news/story.asp?file=/2010/3/17/nation/20100317104521&sec=nation
- fetching http://thestar.com.my/news/story.asp?file=/2010/3/17/nation/20100317104521&sec=nation
- -finishing thread FetcherThread, activeThreads=9
- -finishing thread FetcherThread, activeThreads=7
- -finishing thread FetcherThread, activeThreads=7
- -finishing thread FetcherThread, activeThreads=1
- -finishing thread FetcherThread, activeThreads=2
- -finishing thread FetcherThread, activeThreads=3
- -finishing thread FetcherThread, activeThreads=4
- -finishing thread FetcherThread, activeThreads=5
- -finishing thread FetcherThread, activeThreads=6
- -finishing thread FetcherThread, activeThreads=0
- -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
- -activeThreads=0
- Fetcher: done
- CrawlDb update: starting
- CrawlDb update: db: crawl/crawldb
- CrawlDb update: segments: [crawl/segments/20100317120421]
- CrawlDb update: additions allowed: true
- CrawlDb update: URL normalizing: true
- CrawlDb update: URL filtering: true
- CrawlDb update: Merging segment data into db.
- CrawlDb update: done
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: crawl/segments/20100317120431
- Generator: filtering: true
- Generator: topN: 2
- Generator: jobtracker is 'local', generating exactly one partition.
- Generator: Partitioning selected urls by host, for politeness.
- Generator: done.
- Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
- Fetcher: starting
- Fetcher: segment: crawl/segments/20100317120431
- Fetcher: threads: 10
- QueueFeeder finished: total 2 records.
- fetching http://thestar.com.my/news/story.asp?file=/2010/3/17/nation/5878747&sec=nation
- -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
- * queue: http://203.115.194.20
- maxThreads = 1
- inProgress = 0
- crawlDelay = 1000
- minCrawlDelay = 0
- nextFetchTime = 1268798676585
- now = 1268798676215
- 0. http://thestar.com.my/news/nation/
- fetching http://thestar.com.my/news/nation/
- -finishing thread FetcherThread, activeThreads=8
- -finishing thread FetcherThread, activeThreads=8
- -finishing thread FetcherThread, activeThreads=1
- -finishing thread FetcherThread, activeThreads=2
- -finishing thread FetcherThread, activeThreads=3
- -finishing thread FetcherThread, activeThreads=4
- -finishing thread FetcherThread, activeThreads=5
- -finishing thread FetcherThread, activeThreads=6
- -finishing thread FetcherThread, activeThreads=7
- -finishing thread FetcherThread, activeThreads=0
- -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
- -activeThreads=0
- Fetcher: done
- CrawlDb update: starting
- CrawlDb update: db: crawl/crawldb
- CrawlDb update: segments: [crawl/segments/20100317120431]
- CrawlDb update: additions allowed: true
- CrawlDb update: URL normalizing: true
- CrawlDb update: URL filtering: true
- CrawlDb update: Merging segment data into db.
- CrawlDb update: done
- LinkDb: starting
- LinkDb: linkdb: crawl/linkdb
- LinkDb: URL normalize: true
- LinkDb: URL filter: true
- LinkDb: adding segment: file:/C:/nutch-1.0/crawl/segments/20100317120411
- LinkDb: adding segment: file:/C:/nutch-1.0/crawl/segments/20100317120421
- LinkDb: adding segment: file:/C:/nutch-1.0/crawl/segments/20100317120431
- LinkDb: done
- Indexer: starting
- Indexer: done
- Dedup: starting
- Dedup: adding indexes in: crawl/indexes
- Dedup: done
- merging indexes to: crawl/index
- Adding file:/C:/nutch-1.0/crawl/indexes/part-00000
- done merging
- crawl finished: crawl
- Found 1 hits
- Html content:
- <ul>
- <li>boost = 0.14707822</li>
- <li>digest = 02930a5240bf62309821cfec88b819b3</li>
- <li>segment = 20100317120431</li>
- <li>title = The Star Online: Nation</li>
- <li>tstamp = 20100317040436985</li>
- <li>url = http://thestar.com.my/news/nation/</li>
- </ul>
- Created html file
- Start open calais web service.....
- End open calais web service.....
- Title is: The Star Online: Nation
- (http://thestar.com.my/news/nation/)
- Date Fetched: Wed Mar 17 12:04:36 SGT 2010
- ... promise, UN sec-gen tell rich countries Rich nations have not kept their promises ...
- ----------------------------------------
