Find all web pages in a domain and its subdomains

Question

I am looking for a way to find all web pages and subdomains in a domain. For example, in the uoregon.edu domain, I would like to find all web pages in this domain and in all subdomains (for example, cs.uoregon.edu).

I looked at Nutch, and I think it can handle this task. However, Nutch appears to download entire web pages and index them for later searching, whereas I only want the crawler to scan each page for URLs belonging to the same domain. In addition, Nutch seems to save its linkdb in a serialized format. How can I read it? I tried Solr, and it can read the crawled data, but I do not think I need Solr, since I am not performing any searches. All I need are the URLs belonging to this domain.

Thanks

+3

url web-crawler solr nutch

gmemon Apr 22 '12 at 23:01

2 answers

[Answer text garbled in extraction. The surviving fragments mention a DNS zone transfer, DNS servers, HTTP, and FTP.]

0

sarnold Apr 22 '12 at 23:12

sunnyrjuneja · Accepted Answer · 2012-04-22T23:08:58+0000

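If you are familiar with Ruby, consider using Anemone. It is a wonderful crawler. Here is a short example that works out of the box.

require 'anemone'

# Assumed starting point; replace it with the site you want to crawl.
site_url = 'http://www.uoregon.edu/'
urls = []

Anemone.crawl(site_url) do |anemone|
  # Record the URL of every page the crawler visits.
  anemone.on_every_page do |page|
    urls << page.url
  end
end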

https://github.com/chriskite/anemone

Disclaimer: You need to apply a patch from the project's issues to crawl subdomains, and you may want to set a maximum number of pages.
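For illustration, here is a minimal sketch of the kind of host check needed to keep only URLs in a domain and its subdomains, together with a page cap. It uses only Ruby's standard `uri` library. The helper names `in_domain?` and `collect_in_domain` and the default limit of 1000 are hypothetical and are not part of Anemone or the patch mentioned above.

```ruby
require 'uri'

# Hypothetical helper (not part of Anemone): true when the URL's host is
# the given domain itself or any subdomain of it.
def in_domain?(url, domain)
  host = URI.parse(url.to_s).host
  return false if host.nil?

  host = host.downcase
  domain = domain.downcase
  # Match "uoregon.edu" exactly, or anything ending in ".uoregon.edu".
  host == domain || host.end_with?(".#{domain}")
rescue URI::InvalidURIError
  # Strings that do not parse as URIs are treated as out of domain.
  false
end

# Hypothetical helper: keep unique in-domain URLs, at most max_pages of them.
def collect_in_domain(urls, domain, max_pages = 1000)
  urls.map(&:to_s).select { |u| in_domain?(u, domain) }.uniq.first(max_pages)
end
```

Inside the on_every_page block above, the same check could be applied as `urls << page.url if in_domain?(page.url, 'uoregon.edu')`. Matching on the leading dot keeps look-alike hosts such as notuoregon.edu out of the results.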
