What is going on inside Nutch 2?

Question

I really want to know (and need to know, because it is part of my project) about Nutch and the algorithms it uses to extract, classify, and so on (in general, crawling).
I am reading this material, but it is a little difficult to understand.
Can anyone explain it to me in a complete and understandable way?
Thanks in advance.

algorithm analysis nutch

Soroush Jul 27 '12 at 22:22

1 answer

Xantix · Accepted Answer · 2012-07-28T06:24:55+0000

Short answer

In short, they have developed a web crawler designed to crawl the web very efficiently from a multi-machine environment (but which can also run on a single computer).

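You can start crawling the web without actually needing to know how they implemented it.

The page you reference describes how it is implemented.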

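They use Hadoop, an open-source Java project designed along the same lines as MapReduce. MapReduce is the technology Google uses to crawl and organize the web.

I have attended a few lectures on MapReduce/Hadoop and, unfortunately, I don't know whether anyone can currently explain it in a way that is both complete and easy to understand (the two goals are somewhat opposed).

Take a look at the Wikipedia page for MapReduce.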

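The basic idea is to send a job to the master node. The master breaks the work into pieces and sends (maps) them to the worker nodes (other computers or threads), which perform their assigned tasks and send their sub-results back to the master.

Once the master node has all (or some) of the sub-results, it starts combining (reducing) them into the final answer.

All of these tasks run at the same time, and each computer is given the right amount of work to keep it busy the whole time.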
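To make the map and reduce idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop or Nutch code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for illustration, and the "workers" are simulated by a plain loop rather than separate machines.

```python
from collections import defaultdict


def map_phase(chunks, mapper):
    """Simulate workers: each chunk is mapped to a list of (key, value) pairs."""
    pairs = []
    for chunk in chunks:
        pairs.extend(mapper(chunk))
    return pairs


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Combine the grouped sub-results into one final value per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(chunk):
    # Mapper: emit (word, 1) for every word in this piece of the input.
    return [(word, 1) for word in chunk.split()]


def sum_counts(key, values):
    # Reducer: add up the partial counts for one word.
    return sum(values)


def run_job(chunks):
    return reduce_phase(shuffle(map_phase(chunks, count_words)), sum_counts)


result = run_job(["nutch crawls the web", "hadoop runs nutch"])
```

In a real cluster the chunks would be processed in parallel on different nodes, and the grouping step would move data across the network.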

Crawling consists of 4 jobs:

Generate
Fetch
Parse
Update Database

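* Generate

Start with a list of the web pages you want to begin crawling from: the "webtable".

The master node sends all the pages in that list to its slaves (if two pages have the same domain, they are sent to the same slave).

Each slave takes its assigned pages and:

Checks whether the page has already been generated. If so, it skips it.
Normalizes the URL, since "http://www.google.com/" and "http://www.google.com/../" are actually the same web page.
Returns an initial score, along with the page, to the master.

(The master partitions the pages when it sends them to its slaves, so that they all finish at about the same time.)

The master now selects the topN pages (for example, the user may want to start with just 10) and marks them as selected in the webtable.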
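The Generate step can be sketched roughly as follows. This is an illustrative approximation, not Nutch's actual generator: the webtable is modelled as a plain dictionary, and the names `normalize_url`, `inject`, `partition_by_host`, and `generate` are assumptions made for this example.

```python
import posixpath
import zlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Collapse '.' and '..' path segments and drop the fragment."""
    parts = urlsplit(url)
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath strips a trailing slash; restore it if the input had one.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def inject(seed_urls, initial_score=1.0):
    """Build a toy webtable: normalized URL -> row, skipping duplicates."""
    webtable = {}
    for url in seed_urls:
        row = {"score": initial_score, "generated": False}
        webtable.setdefault(normalize_url(url), row)
    return webtable


def partition_by_host(urls, num_workers):
    """Send every URL of a given host to the same worker bucket."""
    buckets = [[] for _ in range(num_workers)]
    for url in urls:
        host = urlsplit(url).netloc
        # crc32 is used instead of hash() so the result is stable across runs.
        buckets[zlib.crc32(host.encode()) % num_workers].append(url)
    return buckets


def generate(webtable, top_n):
    """Pick the top_n highest-scoring rows not yet generated and mark them."""
    candidates = [(row["score"], url) for url, row in webtable.items()
                  if not row.get("generated")]
    candidates.sort(key=lambda c: (-c[0], c[1]))  # best score first, then URL
    chosen = [url for _, url in candidates[:top_n]]
    for url in chosen:
        webtable[url]["generated"] = True
    return chosen
```

Calling `inject` on the two Google URLs from the example above yields a single row, because both normalize to the same string.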

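* Fetch

The master looks at each URL in the webtable and maps the ones that were marked onto slaves for processing.

Slaves fetch each URL from the Internet as fast as their connection allows; they keep a queue for each domain.

They return the URL, along with the HTML text of the page, to the master.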
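A rough sketch of the per-domain queues used while fetching is shown below. It is only an illustration: `fetch_page` is a hypothetical callable supplied by the caller (so the example makes no real network requests), and a real fetcher would also add politeness delays and run queues in parallel threads.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def build_host_queues(urls):
    """Group the URLs to fetch into one FIFO queue per host."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).netloc].append(url)
    return queues


def fetch_all(urls, fetch_page):
    """Take one URL from each host queue in turn until all queues are empty.

    Returns a list of (url, html) pairs, which is what gets sent back
    to the master in the description above.
    """
    queues = build_host_queues(urls)
    results = []
    while queues:
        for host in list(queues):
            url = queues[host].popleft()
            results.append((url, fetch_page(url)))
            if not queues[host]:
                del queues[host]  # this host has nothing left to fetch
    return results
```

Rotating between hosts means no single site receives a burst of consecutive requests, which is the usual reason for keeping one queue per domain.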

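* Parse

The master looks at each page in the webtable and, if it is marked as fetched, sends it to its slaves to be parsed.

A slave first checks whether the page has already been parsed by another slave; if so, it skips it.

Otherwise, it parses the page and saves the result to the webtable.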
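The Parse step can be illustrated with Python's standard `html.parser` module. This is a simplified sketch, not Nutch's parser plugins: the row layout (`html`, `parsed`, `outlinks` keys) and the names `LinkExtractor` and `parse_row` are assumptions for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def parse_row(url, row):
    """Parse one fetched row; return False if it was already parsed."""
    if row.get("parsed"):
        return False
    extractor = LinkExtractor(url)
    extractor.feed(row["html"])
    row["outlinks"] = extractor.links
    row["parsed"] = True
    return True
```

The `parsed` flag plays the role of the "already parsed by another slave" check: a second call on the same row does nothing.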

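* Update Database

The master looks at each page in the webtable and sends the parsed rows to its slaves.

The slaves receive these parsed URLs, compute a score for them based on the number of links leading away from those pages (and the text near those links), and send the URLs and scores back to the master (sorted by score by the time they arrive, because of the partitioning).

The master computes and updates the page scores based on the number of links to those pages from other pages.

The master stores all of this in the database.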
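As a very rough illustration of the update step, the sketch below scores each page by the number of distinct pages linking to it and adds newly discovered URLs to the table. Real Nutch scoring is pluggable and more elaborate; the plain inlink count, the dictionary-based webtable, and the name `update_db` are all simplifying assumptions.

```python
from collections import Counter


def update_db(webtable):
    """Recompute scores from inlink counts and add newly discovered URLs."""
    inlinks = Counter()
    for url, row in webtable.items():
        # Count each (source, target) pair once, and ignore self-links.
        for target in set(row.get("outlinks", [])):
            if target != url:
                inlinks[target] += 1
    for target, count in inlinks.items():
        # Pages seen only as link targets get a fresh row in the table.
        row = webtable.setdefault(target, {"score": 0.0})
        row["score"] = float(count)
    return webtable
```

Pages that nobody links to keep whatever score they already had, and pages discovered only through links become candidates for the next round.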

Repeat

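When the pages were parsed, the links leading out of them were added to the webtable. You can now repeat the process on just the pages you have not looked at yet, to keep expanding the set of visited pages. Eventually, after enough iterations of the four steps above, you will reach most of the Internet.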

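MapReduce is a great system.

A lot of effort has gone into making it as efficient as possible.

It can handle computers breaking down in the middle of a job by reassigning the work to other slaves. It can also handle some slaves being faster than others.

MapReduce lets you write simple code:

Define a Mapper, an optional Partitioner, and a Reducer.

Then let MapReduce figure out how best to run that with all the computing resources it has access to, whether that is a single computer with a slow Internet connection or a kilo-cluster (or perhaps even a mega-cluster).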
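To show how the three pieces fit together, here is a small in-memory sketch in which the caller supplies a mapper, an optional partitioner, and a reducer. It only mimics the shape of the programming model: `run_mapreduce`, `hash_partitioner`, `link_mapper`, and `inlink_reducer` are hypothetical names, not Hadoop classes.

```python
import zlib
from collections import defaultdict


def hash_partitioner(key, num_reducers):
    """Default partitioner: decide which reducer receives a given key."""
    return zlib.crc32(key.encode()) % num_reducers


def run_mapreduce(records, mapper, reducer,
                  partitioner=hash_partitioner, num_reducers=2):
    # One group table per reducer; the partitioner routes each key to one.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for key, value in mapper(record):
            partitions[partitioner(key, num_reducers)][key].append(value)
    output = {}
    for groups in partitions:
        for key in sorted(groups):  # each reducer sees its keys in order
            output[key] = reducer(key, groups[key])
    return output


def link_mapper(record):
    # Input record: (source page, list of pages it links to).
    source, targets = record
    return [(target, 1) for target in targets]


def inlink_reducer(key, values):
    # Total number of links pointing at this page.
    return sum(values)


counts = run_mapreduce([("a", ["b", "c"]), ("b", ["c"])],
                       link_mapper, inlink_reducer)
```

Because all values for a key always land in the same partition, the final result does not depend on how many reducers are used.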
