- [hbase@sandbox bin]$ echo 'https://docs.oracle.com/javase/7/docs/api/' > seeds.txt
- [hbase@sandbox bin]$ ./nutch inject seeds.txt
- InjectorJob: starting at 2017-10-25 09:25:28
- InjectorJob: Injecting urlDir: seeds.txt
- InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
- InjectorJob: total number of urls rejected by filters: 0
- InjectorJob: total number of urls injected after normalization and filtering: 1
- Injector: finished at 2017-10-25 09:25:36, elapsed: 00:00:07
- [hbase@sandbox bin]$ ./nutch generate -topN 100
- GeneratorJob: starting at 2017-10-25 09:26:05
- GeneratorJob: Selecting best-scoring urls due for fetch.
- GeneratorJob: starting
- GeneratorJob: filtering: true
- GeneratorJob: normalizing: true
- GeneratorJob: topN: 100
- GeneratorJob: finished at 2017-10-25 09:26:26, time elapsed: 00:00:21
- GeneratorJob: generated batch id: 1508923565-2134519272 containing 1 URLs
- [hbase@sandbox bin]$ ./nutch fetch -all
- FetcherJob: starting at 2017-10-25 09:26:40
- FetcherJob: fetching all
- FetcherJob: threads: 10
- FetcherJob: parsing: false
- FetcherJob: resuming: false
- FetcherJob : timelimit set for : -1
- Using queue mode : byHost
- Fetcher: threads: 10
- QueueFeeder finished: total 1 records. Hit by time limit :0
- Fetcher: throughput threshold: -1
- Fetcher: throughput threshold sequence: 5
- -finishing thread FetcherThread1, activeThreads=9
- -finishing thread FetcherThread7, activeThreads=8
- -finishing thread FetcherThread6, activeThreads=7
- -finishing thread FetcherThread5, activeThreads=6
- -finishing thread FetcherThread2, activeThreads=5
- -finishing thread FetcherThread4, activeThreads=4
- -finishing thread FetcherThread3, activeThreads=3
- fetching https://docs.oracle.com/javase/7/docs/api/ (queue crawl delay=5000ms)
- -finishing thread FetcherThread8, activeThreads=2
- -finishing thread FetcherThread9, activeThreads=1
- 0/1 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 1 queues
- -finishing thread FetcherThread0, activeThreads=0
- 0/0 spinwaiting/active, 1 pages, 0 errors, 0.1 0 pages/s, 2 4 kb/s, 0 URLs in 0 queues
- -activeThreads=0
- Using queue mode : byHost
- Fetcher: threads: 10
- QueueFeeder finished: total 0 records. Hit by time limit :0
- -finishing thread FetcherThread0, activeThreads=0
- Fetcher: throughput threshold: -1
- Fetcher: throughput threshold sequence: 5
- -finishing thread FetcherThread1, activeThreads=7
- -finishing thread FetcherThread2, activeThreads=6
- -finishing thread FetcherThread3, activeThreads=5
- -finishing thread FetcherThread4, activeThreads=4
- -finishing thread FetcherThread5, activeThreads=3
- -finishing thread FetcherThread6, activeThreads=2
- -finishing thread FetcherThread7, activeThreads=1
- -finishing thread FetcherThread8, activeThreads=0
- -finishing thread FetcherThread9, activeThreads=0
- 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
- -activeThreads=0
- FetcherJob: finished at 2017-10-25 09:27:06, time elapsed: 00:00:25
- [hbase@sandbox bin]$ hbase shell
- 2017-10-25 09:27:27,494 INFO [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
- HBase Shell; enter 'help<RETURN>' for list of supported commands.
- Type "exit<RETURN>" to leave the HBase Shell
- Version 0.98.4.2.2.4.2-2-hadoop2, rdd8a499345afc1ac49dc5ef212ba64b23abfe110, Tue Mar 31 16:18:12 EDT 2015
- hbase(main):001:0> list
- TABLE
- SLF4J: Class path contains multiple SLF4J bindings.
- SLF4J: Found binding in [jar:file:/usr/hdp/2.2.4.2-2/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: Found binding in [jar:file:/usr/hdp/2.2.4.2-2/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
- iemployee
- webpage
- 2 row(s) in 4.6180 seconds
- => ["iemployee", "webpage"]
- hbase(main):002:0> scan 'webpage'
- ROW COLUMN+CELL
- com.oracle.docs:https/javase/7/docs/api/ column=f:bas, timestamp=1508923620099, value=https://docs.oracle.com/javase/7/docs/api/
- com.oracle.docs:https/javase/7/docs/api/ column=f:bid, timestamp=1508923585567, value=1508923565-2134519272
- com.oracle.docs:https/javase/7/docs/api/ column=f:cnt, timestamp=1508923620099, value=<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/T
- R/html4/frameset.dtd">\x0A<!-- NewPage -->\x0A<html lang="en">\x0A<head>\x0A<!-- Generated by javadoc on Mon Oct 09 00:19:07
- PDT 2017 -->\x0A<title>Java Platform SE 7 </title>\x0A<script type="text/javascript">\x0A tmpTargetPage = "" + window.lo
- cation.search;\x0A if (tmpTargetPage != "" && tmpTargetPage != "undefined")\x0A tmpTargetPage = tmpTargetPage.subs
- tring(1);\x0A if (tmpTargetPage.indexOf(":") != -1 || (tmpTargetPage != "" && !validURL(tmpTargetPage)))\x0A tmpTa
- rgetPage = "undefined";\x0A targetPage = tmpTargetPage;\x0A function validURL(url) {\x0A try {\x0A u
- rl = decodeURIComponent(url);\x0A }\x0A catch (error) {\x0A return false;\x0A }\x0A v
- ar pos = url.indexOf(".html");\x0A if (pos == -1 || pos != url.length - 5)\x0A return false;\x0A va
- r allowNumber = false;\x0A var allowSep = false;\x0A var seenDot = false;\x0A for (var i = 0; i < url.l
- ength - 5; i++) {\x0A var ch = url.charAt(i);\x0A if ('a' <= ch && ch <= 'z' ||\x0A
- 'A' <= ch && ch <= 'Z' ||\x0A ch == '$' ||\x0A ch == '_' ||\x0A ch
- .charCodeAt(0) > 127) {\x0A allowNumber = true;\x0A allowSep = true;\x0A } else if
- ('0' <= ch && ch <= '9'\x0A || ch == '-') {\x0A if (!allowNumber)\x0A
- return false;\x0A } else if (ch == '/' || ch == '.') {\x0A if (!allowSep)\x0A r
- eturn false;\x0A allowNumber = false;\x0A allowSep = false;\x0A if (ch == '.')\
- x0A seenDot = true;\x0A if (ch == '/' && seenDot)\x0A return false;\x
- 0A } else {\x0A return false;\x0A }\x0A }\x0A return true;\x0A }\x0A
- function loadFrames() {\x0A if (targetPage != "" && targetPage != "undefined")\x0A top.classFrame.locat
- ion = top.targetPage;\x0A }\x0A</script>\x0A</head>\x0A<frameset cols="20%,80%" title="Documentation frame" onload="top.l
- oadFrames()">\x0A<frameset rows="30%,70%" title="Left frames" onload="top.loadFrames()">\x0A<frame src="overview-frame.html"
- name="packageListFrame" title="All Packages">\x0A<frame src="allclasses-frame.html" name="packageFrame" title="All classes
- and interfaces (except non-static nested types)">\x0A</frameset>\x0A<frame src="overview-summary.html" name="classFrame" tit
- le="Package, class and interface descriptions" scrolling="yes">\x0A<noframes>\x0A<noscript>\x0A<div>JavaScript is disabled o
- n your browser.</div>\x0A</noscript>\x0A<h2>Frame Alert</h2>\x0A<p>This document is designed to be viewed using the frames f
- eature. If you see this message, you are using a non-frame-capable web client. Link to <a href="overview-summary.html">Non-f
- rame version</a>.</p>\x0A</noframes>\x0A</frameset>\x0A</html>\x0A
- com.oracle.docs:https/javase/7/docs/api/ column=f:fi, timestamp=1508923536028, value=\x00'\x8D\x00
- com.oracle.docs:https/javase/7/docs/api/ column=f:prot, timestamp=1508923620099, value=\x02\x00\x00
- com.oracle.docs:https/javase/7/docs/api/ column=f:pts, timestamp=1508923620099, value=\x00\x00\x01_R\xD9\xD4\xB1
- com.oracle.docs:https/javase/7/docs/api/ column=f:st, timestamp=1508923620099, value=\x00\x00\x00\x02
- com.oracle.docs:https/javase/7/docs/api/ column=f:ts, timestamp=1508923620099, value=\x00\x00\x01_R\xDB5D
- com.oracle.docs:https/javase/7/docs/api/ column=f:typ, timestamp=1508923620099, value=text/html
- com.oracle.docs:https/javase/7/docs/api/ column=h:Accept-Ranges, timestamp=1508923620099, value=bytes
- com.oracle.docs:https/javase/7/docs/api/ column=h:Connection, timestamp=1508923620099, value=close
- com.oracle.docs:https/javase/7/docs/api/ column=h:Content-Encoding, timestamp=1508923620099, value=gzip
- com.oracle.docs:https/javase/7/docs/api/ column=h:Content-Length, timestamp=1508923620099, value=1083
- com.oracle.docs:https/javase/7/docs/api/ column=h:Content-Type, timestamp=1508923620099, value=text/html
- com.oracle.docs:https/javase/7/docs/api/ column=h:Date, timestamp=1508923620099, value=Wed, 25 Oct 2017 09:26:58 GMT
- com.oracle.docs:https/javase/7/docs/api/ column=h:ETag, timestamp=1508923620099, value="e6133d8aad7082b3c3290041f83cc357:1508252293"
- com.oracle.docs:https/javase/7/docs/api/ column=h:Last-Modified, timestamp=1508923620099, value=Thu, 12 Oct 2017 04:20:15 GMT
- com.oracle.docs:https/javase/7/docs/api/ column=h:Server, timestamp=1508923620099, value=Apache
- com.oracle.docs:https/javase/7/docs/api/ column=h:Vary, timestamp=1508923620099, value=Accept-Encoding
- com.oracle.docs:https/javase/7/docs/api/ column=mk:_ftcmrk_, timestamp=1508923620099, value=1508923565-2134519272
- com.oracle.docs:https/javase/7/docs/api/ column=mk:_gnmrk_, timestamp=1508923620099, value=1508923565-2134519272
- com.oracle.docs:https/javase/7/docs/api/ column=mk:_injmrk_, timestamp=1508923620099, value=y
- com.oracle.docs:https/javase/7/docs/api/ column=mk:dist, timestamp=1508923620099, value=0
- com.oracle.docs:https/javase/7/docs/api/ column=mtdt:_rs_, timestamp=1508923620099, value=\x00\x00\x06L
- com.oracle.docs:https/javase/7/docs/api/ column=s:s, timestamp=1508923536028, value=?\x80\x00\x00
- 1 row(s) in 0.4910 seconds
- hbase(main):003:0> exit
- [hbase@sandbox bin]$ ./nutch parse -all
- ParserJob: starting at 2017-10-25 09:28:14
- ParserJob: resuming:false
- ParserJob: forced reparse:false
- ParserJob: parsing all
- Parsing https://docs.oracle.com/javase/7/docs/api/
- ParserJob: success
- ParserJob: finished at 2017-10-25 09:28:32, time elapsed: 00:00:17
- [hbase@sandbox bin]$ ./nutch updatingdb -all
- Error: Could not find or load main class updatingdb
- [hbase@sandbox bin]$ ./nutch solrindex http://127.0.0.1:8983/solr/#/collection1 -all
- IndexingJob: starting
- Active IndexWriters :
- SOLRIndexWriter
- solr.server.url : URL of the SOLR instance (mandatory)
- solr.commit.size : buffer size when sending to SOLR (default 1000)
- solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
- solr.auth : use authentication (default false)
- solr.auth.username : username for authentication
- solr.auth.password : password for authentication
- SolrIndexerJob: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected content type application/octet-stream but got text/html;charset=ISO-8859-1. <html>
- <head>
- <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
- <title>Error 405 HTTP method POST is not supported by this URL</title>
- </head>
- <body><h2>HTTP ERROR 405</h2>
- <p>Problem accessing /solr/admin.html. Reason:
- <pre> HTTP method POST is not supported by this URL</pre></p><hr /><i><small>Powered by Jetty://</small></i><br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- <br/>
- </body>
- </html>
- at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:455)
- at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
- at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
- at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
- at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:146)
- at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:146)
- at org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:124)
- at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:186)
- at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
- at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
- at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
- [hbase@sandbox bin]$
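
A note on the final solrindex failure: the URL passed on the command line contains `#/collection1`, which is the form the Solr admin UI shows in a browser address bar. Everything after `#` is a URL fragment, and HTTP clients do not send fragments to the server. This is consistent with the 405 response above, which reports a problem accessing `/solr/admin.html` rather than anything under `collection1`. The sketch below uses only Python's standard library to show how the URL splits. The helper name `split_request_target` is hypothetical, and the fragment-free URL `http://127.0.0.1:8983/solr/collection1` is an assumption about what this Solr instance expects, not something confirmed by the log.

```python
from urllib.parse import urlsplit


def split_request_target(url):
    """Return (path_sent_to_server, fragment_kept_by_client) for a URL.

    Hypothetical helper for illustration only: it mimics the fact that
    an HTTP client sends the path (and query) but never the fragment.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target, parts.fragment


if __name__ == "__main__":
    # URL exactly as typed in the transcript (admin-UI style, with '#').
    as_typed = "http://127.0.0.1:8983/solr/#/collection1"
    # Assumed fragment-free form; the core name is taken from the fragment.
    assumed_fix = "http://127.0.0.1:8983/solr/collection1"

    for url in (as_typed, assumed_fix):
        path, fragment = split_request_target(url)
        print(f"{url}\n  path sent to server: {path!r}\n  fragment (not sent): {fragment!r}")
```

With the URL as typed, the server only ever sees the path `/solr/`, so the core name never reaches Solr. If the assumption holds, rerunning `./nutch solrindex` with the fragment-free URL would be the first thing to try.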