What is going on inside Nutch 2?

I really want to know (and need to know, because it is part of my project) how Nutch works and which algorithms it uses to fetch, classify, ... (in general, crawling).
I am reading this material, but it is a little difficult to understand.
Can anyone explain this to me in a complete and understandable way?
Thanks in advance.

1 answer

Short answer

In short, they developed a web crawler designed to crawl the web very efficiently in a multi-machine environment (but which can also run on a single computer).

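Nutch uses Hadoop, an open-source Java project built along the lines of MapReduce. MapReduce is the approach Google used to crawl and index the web.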

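As for MapReduce/Hadoop, unfortunately I am not sure anyone can explain it in a way that is both complete and easy to understand (the two goals are somewhat at odds).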

Take a look at the Wikipedia page for MapReduce.

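The basic idea is to send a job to the Master Node, which breaks the work into pieces and sends (maps) them to the Worker Nodes (other computers or threads); each worker performs its sub-task and sends its sub-result back to the Master.

Once the Master Node has the sub-results (or some of them), it starts combining (reducing) them into the final answer.

All of these tasks run in parallel, with the work split so that the machines are kept busy.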
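The master/worker idea described above can be sketched in a few lines of plain Python. This is only an illustration, not Hadoop or Nutch code: threads stand in for worker nodes, and the names `count_hosts` and `run_job` and the example URLs are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def count_hosts(urls):
    """Worker sub-task (the "map" side): count pages per host in one chunk."""
    return Counter(urlsplit(u).hostname for u in urls)


def run_job(urls, workers=3):
    """Master: split the job, hand the chunks to workers, merge the sub-results."""
    chunks = [urls[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_hosts, chunks):
            total.update(partial)  # the "reduce" side: combine sub-results
    return dict(total)


urls = [
    "http://a.example/1", "http://a.example/2", "http://b.example/1",
    "http://c.example/1", "http://a.example/3",
]
result = run_job(urls)
```

Here `result` maps each host to the number of URLs seen for it, regardless of which worker handled which chunk.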

Crawling consists of 4 jobs:

  • Generate
  • Fetch
  • Parse
  • Update Database

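* Generate

Start with a list of the web pages you want to begin crawling from: the "webtable".

The Master Node sends the pages in that list to its slaves (pages from the same domain are sent to the same slave).

Each slave takes its assigned web pages and: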

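  • Checks whether the page has already been generated. If so, it skips it.
  • Normalizes the URL, since "http://www.google.com/" and "http://www.google.com/../" are actually the same web page.
  • Returns an initial score along with the web page to the Master.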

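(The Master partitions the web pages when it sends them to the slaves, so that they all finish at about the same time.)

The Master then selects the topN pages (for example, the user may want to start with only 10) and marks them as selected in the webtable.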
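A rough sketch of the Generate step in plain Python follows. It is not Nutch's implementation: the webtable layout (a dict with `score` and `generated` fields) and the function names are assumptions made for illustration, and the normalization covers only the cases mentioned above.

```python
import posixpath
import zlib
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lower-case the host and collapse '.' and '..' path segments.
    Note: this simple version also drops a trailing slash on non-root paths."""
    parts = urlsplit(url)
    path = posixpath.normpath(parts.path) if parts.path else "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def partition(url, num_workers):
    """Send URLs of the same host to the same worker (stable hash of the host)."""
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_workers


def generate(webtable, top_n):
    """Pick the top_n best-scored pages that were not generated before.
    webtable: url -> {"score": float, "generated": bool} (hypothetical layout)."""
    candidates = {}
    for url, row in webtable.items():
        if row["generated"]:
            continue  # already generated: skip it
        norm = normalize(url)
        candidates[norm] = max(row["score"], candidates.get(norm, 0.0))
    return sorted(candidates, key=lambda u: (-candidates[u], u))[:top_n]


webtable = {
    "http://www.google.com/": {"score": 1.0, "generated": False},
    "http://www.google.com/../": {"score": 0.5, "generated": False},
    "http://b.example/page": {"score": 2.0, "generated": False},
    "http://c.example/": {"score": 3.0, "generated": True},
}
selected = generate(webtable, top_n=2)
```

The two google.com URLs collapse into one entry after normalization, and the already generated page is skipped even though it has the highest score.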

* Fetch

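The Master looks at each URL in the webtable and maps the ones that were marked onto the slaves for processing.

Slaves fetch each URL from the Internet as fast as the connection allows; they keep a queue for each domain.

They return the URL along with the HTML text of the web page to the Master.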
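The per-domain queues can be illustrated as follows. This sketch only computes the order in which a slave would visit its URLs, taking one URL per host in turn; it does not download anything, and `fetch_order` is a hypothetical helper, not a Nutch function.

```python
from collections import deque
from urllib.parse import urlsplit


def fetch_order(urls):
    """Return the URLs in round-robin order across hosts, one queue per host."""
    queues = {}
    for url in urls:
        queues.setdefault(urlsplit(url).hostname, deque()).append(url)
    order = []
    while queues:
        for host in list(queues):
            # A real fetcher would download the page here.
            order.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return order


order = fetch_order([
    "http://a.example/1", "http://a.example/2", "http://b.example/1",
    "http://a.example/3", "http://c.example/1",
])
```

Interleaving hosts this way means no single site receives a burst of consecutive requests while other queues still have work.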

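* Parse

The Master looks at each web page in the webtable and, if it is marked as fetched, sends it to its slaves to be parsed.

A slave first checks whether the page has already been parsed by another slave, and if so, skips it.

Otherwise, it parses the web page and saves the result to the webtable.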
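As a minimal sketch of the Parse step, the code below extracts the outgoing links from a fetched page with Python's standard `html.parser` and skips rows that are already parsed. The row layout (`url`, `html`, `parsed`, `outlinks`) is an assumption for the example; Nutch's own parsers extract much more than links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def parse_page(row):
    """Parse one webtable row in place unless it was parsed already."""
    if row["parsed"]:
        return row  # already parsed elsewhere: skip
    extractor = LinkExtractor(row["url"])
    extractor.feed(row["html"])
    row["outlinks"] = extractor.links
    row["parsed"] = True
    return row


row = parse_page({
    "url": "http://a.example/dir/index.html",
    "html": '<a href="/x">X</a> <a href="y.html">Y</a> <a name="n">no link</a>',
    "parsed": False,
    "outlinks": [],
})
```

Relative links are resolved against the page URL, so the outlinks can be added to the webtable directly.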

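* Update Database

The Master looks at each web page in the webtable and sends the parsed rows to its slaves.

The slaves receive these parsed URLs, calculate a score for them based on the number of links leading away from those pages (and the text near those links), and send the URLs and scores back to the Master (sorted by score by the time they arrive, because of the partitioning).

The Master calculates and updates the web page scores based on the number of links to those pages from other pages.

The Master stores all of this in the database.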
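To make the scoring idea concrete, here is a deliberately simplified sketch that scores each page by the number of distinct other pages linking to it. Nutch's actual scoring is more elaborate; the webtable layout and `update_scores` are assumptions for illustration only.

```python
from collections import defaultdict


def update_scores(webtable):
    """Score pages by inlink count and add newly discovered pages.
    webtable: url -> {"outlinks": [...], "score": float} (hypothetical layout)."""
    inlinks = defaultdict(set)
    for url, row in webtable.items():  # "map": emit (target, source) pairs
        for target in row["outlinks"]:
            if target != url:  # ignore self-links
                inlinks[target].add(url)
    for target, sources in inlinks.items():  # "reduce": one score per target
        row = webtable.setdefault(target, {"outlinks": [], "score": 0.0})
        row["score"] = float(len(sources))
    return webtable


table = update_scores({
    "a": {"outlinks": ["b", "c"], "score": 0.0},
    "b": {"outlinks": ["c", "d"], "score": 0.0},
    "c": {"outlinks": ["c"], "score": 0.0},
})
```

Page "d" was not in the table before the update; it appears afterwards with a score of its own, which is how newly discovered links enter the next crawl round.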

Repeat

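When a page was parsed, the links leading out of it were added to the webtable. You can now repeat the process on just the pages you have not visited yet, which keeps expanding the set of visited pages. After enough iterations of the four steps above, you will eventually reach most of the Internet.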
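The repeated cycle can be sketched as a loop over a tiny in-memory stand-in for the web. Everything here is hypothetical (`FAKE_WEB`, `crawl`, the `fetched` flag); the point is only to show how each round discovers pages that the next round then visits.

```python
# A tiny fake web: page -> links found on that page (illustrative data only).
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def crawl(seeds, top_n, max_rounds):
    """Run generate/fetch/parse/update rounds until nothing new is left."""
    webtable = {url: {"fetched": False} for url in seeds}
    for _ in range(max_rounds):
        # Generate: choose up to top_n pages that were not fetched yet.
        batch = sorted(u for u, row in webtable.items() if not row["fetched"])[:top_n]
        if not batch:
            break
        for url in batch:
            outlinks = FAKE_WEB.get(url, [])  # Fetch + Parse (stand-in)
            webtable[url]["fetched"] = True
            for link in outlinks:  # Update: add links not seen before
                webtable.setdefault(link, {"fetched": False})
    return webtable


one_round = crawl(["http://a.example/"], top_n=10, max_rounds=1)
full = crawl(["http://a.example/"], top_n=10, max_rounds=10)
```

After one round only the seed is fetched and two new pages are known; after enough rounds all four pages of the fake web have been visited.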

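MapReduce is a robust system.

A lot of effort has gone into making it efficient.

It can cope with machines failing in the middle of a job and reassigns their work to other slaves.

It can also cope with some slaves being faster than others.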

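MapReduce lets you write simple code:

Define a Mapper, an optional Partitioner, and a Reducer.

Then let MapReduce work out how best to run it on whatever computing resources it has access to, whether that is a single computer or a kilo-cluster (maybe even a mega-cluster).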
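The three pieces can be shown with a small single-process sketch. It imitates the shape of a MapReduce job in plain Python and does not use the Hadoop API; `mapper`, `partitioner`, `reducer` and `run_local_job` are illustrative names, and word counting is just a convenient example job.

```python
import zlib
from collections import defaultdict


def mapper(record):
    """Emit (key, value) pairs: here, (word, 1) for every word in a line."""
    for word in record.split():
        yield word.lower(), 1


def partitioner(key, num_reducers):
    """Decide which reducer receives a key (stable hash of the key)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)


def run_local_job(records, num_reducers=2):
    """Stand-in for the framework: map, partition and group, then reduce."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for key, value in mapper(record):
            partitions[partitioner(key, num_reducers)][key].append(value)
    output = {}
    for part in partitions:
        for key in sorted(part):  # each reducer sees its keys in sorted order
            out_key, out_value = reducer(key, part[key])
            output[out_key] = out_value
    return output


counts = run_local_job(["to be or not", "To be"])
```

Only the three small functions are specific to the job; the runner is the part a framework would replace with distributed execution.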
