I am looking for a way to find all web pages and subdomains in a domain. For example, in the uoregon.edu domain, I would like to find all web pages in this domain and in all subdomains (for example, cs.uoregon.edu).
I looked at Nutch, and I think it can handle this task. However, it looks like Nutch downloads entire web pages and indexes them for later searching, whereas I only want the crawler to scan each web page for URLs belonging to the same domain. In addition, Nutch seems to store its linkdb in a serialized format. How can I read it? I tried Solr, and it can read the collected data, but I don't think I need Solr, since I am not performing any searches. All I need are the URLs that belong to this domain.
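To make the goal concrete, here is a rough sketch of the per-page behavior I am after, written with only the Python standard library. It is not Nutch code; the names `LinkCollector`, `in_domain`, and `same_domain_links` are made up for illustration, and fetching the page and managing the crawl queue are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def same_domain_links(html, base_url, domain):
    """Return the set of absolute http(s) URLs in `html` that fall under `domain`."""
    parser = LinkCollector()
    parser.feed(html)
    found = set()
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https") and in_domain(absolute, domain):
            found.add(absolute)
    return found
```

A crawler built on this would only need to keep the returned URLs (and queue the unseen ones), without storing or indexing the page content.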
Thanks.