Useful commands.
Start a crawl
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Get URL crawl statistics
bin/nutch readdb crawl/crawldb -stats
Dump the full content of the segments
bin/nutch readseg -dump crawl/segments/* segmentAllContent
Dump only the parsed text of the segments
bin/nutch readseg -dump crawl/segments/* segmentTextContent -nocontent -nofetch -nogenerate -noparse -noparsedata
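The glob in the readseg commands above expands to every segment directory, which is only safe when the crawl produced a single segment, assuming readseg -dump takes one segment directory and one output directory per call. A minimal sketch of a per-segment loop follows; the NUTCH and SEGMENTS_DIR variables, the dump_segments function name, and the per-segment output suffix are hypothetical and not part of the original notes.

```shell
#!/bin/sh
# Hypothetical helper: dump the parsed text of each segment separately.
# NUTCH and SEGMENTS_DIR are assumed names; override them as needed.
NUTCH="${NUTCH:-bin/nutch}"
SEGMENTS_DIR="${SEGMENTS_DIR:-crawl/segments}"

dump_segments() {
  for seg in "$SEGMENTS_DIR"/*; do
    # Skip anything that is not a directory (including an unmatched glob).
    [ -d "$seg" ] || continue
    name=$(basename "$seg")
    # One output directory per segment, named after the segment timestamp.
    "$NUTCH" readseg -dump "$seg" "segmentTextContent_$name" \
      -nocontent -nofetch -nogenerate -noparse -noparsedata
  done
}

dump_segments
```

Setting NUTCH=echo before calling dump_segments prints the commands instead of running them, which is a quick way to check the paths first.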
Dump the link database (incoming links for each URL)
bin/nutch readlinkdb crawl/linkdb/ -dump linkContent
Dump the crawl database (per-URL status, fetch times, and other metadata)
bin/nutch readdb crawl/crawldb/ -dump crawlContent
. , index-extra .