Nutch: reading and adding metadata

I recently started looking at Apache Nutch. I was able to set it up and crawl the web pages I am interested in, but I do not quite understand how to read the crawled data. I basically want to associate each page's data with some metadata (random data for now) and store it locally, so that it can later be used for (semantic) search. Do I need Solr or Lucene for this? I am new to all of this. As far as I know, Nutch is used to crawl web pages. Can it also do additional things, such as adding metadata to the crawled data?

1 answer

Useful commands.

Start a crawl

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Get statistics on the crawled URLs

bin/nutch readdb crawl/crawldb -stats

Read a segment (dumps all of the data from the web pages)

bin/nutch readseg -dump crawl/segments/* segmentAllContent

Read a segment (dumps only the parsed text)

bin/nutch readseg -dump crawl/segments/* segmentTextContent -nocontent -nofetch -nogenerate -noparse -noparsedata
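To work with the dumped text outside Nutch, the dump can be split back into per-page records. This is a minimal sketch, not part of Nutch. It assumes the dump is a plain text file (typically named `dump` inside the output directory) in which each record begins with a `Recno:: ` line, carries a `URL:: ` line, and has its text after a `ParseText::` line. `segment_text_sizes` is a hypothetical helper name, and the markers may differ between Nutch versions.

```shell
# Print "<url><TAB><length of parsed text>" for each record in a readseg text dump.
# Assumed record layout: "Recno:: N", "URL:: <url>", then "ParseText::" and the text.
segment_text_sizes() {
  awk '
    function emit() { if (url != "") printf "%s\t%d\n", url, chars }
    /^Recno:: /    { emit(); url = ""; chars = 0; intext = 0; next }  # new record starts
    /^URL:: /      { url = substr($0, 7); next }                      # text after "URL:: "
    /^ParseText::/ { intext = 1; next }                               # text begins on the next line
    intext         { chars += length($0) }                            # add up line lengths
    END            { emit() }                                         # last record
  ' "$@"
}
```

It could be called as `segment_text_sizes segmentTextContent/dump`, assuming that is where the dump file ended up.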

Dump the link database: the known inbound links for each URL, including the anchor text of those links.

bin/nutch readlinkdb crawl/linkdb/ -dump linkContent
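The link dump can be summarised per page with a few lines of awk. This is only a sketch under an assumed layout: each entry starts with a line `<url><TAB>Inlinks:` and is followed by one `fromUrl: ... anchor: ...` line per inbound link. `count_inlinks` is a hypothetical helper name, and the layout should be checked against the actual dump before relying on it.

```shell
# Print "<url><TAB><number of inbound links>" for each entry in a readlinkdb dump.
# Assumed layout: "<url><TAB>Inlinks:" then one "fromUrl: ... anchor: ..." line per link.
count_inlinks() {
  awk -F '\t' '
    $2 ~ /^Inlinks:/ {                       # header line of a new entry
      if (url != "") printf "%s\t%d\n", url, n
      url = $1; n = 0; next
    }
    index($0, "fromUrl: ") > 0 { n++ }       # one inbound link
    END { if (url != "") printf "%s\t%d\n", url, n }
  ' "$@"
}
```

For example, `count_inlinks linkContent/part-00000`, where the part file name is an assumption about how the dump directory is laid out.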

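Dump every URL in the crawl database, along with details such as whether it was fetched, its fetch time, its modified time, and so on.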

bin/nutch readdb crawl/crawldb/ -dump crawlContent
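Once the crawl database is dumped to text, ordinary shell tools can filter it. The sketch below lists only the URLs that were fetched. It assumes each record starts with a line `<url><TAB>Version: N` and is followed by a `Status: ...` line that names the state (for example `db_fetched`). `list_fetched_urls` is a hypothetical helper name, and the patterns may need adjusting for a given Nutch version.

```shell
# Print every URL whose crawldb dump record has status db_fetched.
# Assumed layout: "<url><TAB>Version: N" followed by "Status: <code> (<name>)".
list_fetched_urls() {
  awk -F '\t' '
    $2 ~ /^Version: /       { url = $1 }      # remember the URL of the current record
    /^Status: .*db_fetched/ { print url }     # emit it when the record was fetched
  ' "$@"
}
```

For example, `list_fetched_urls crawlContent/part-00000 > fetched.txt`, where the part file name is an assumption about how the dump directory is laid out.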

For the second part, adding metadata, look at the index-extra plugin.

Refer to: this and this
