Help Needed for a Web Spider

Question

I am writing a very simple web spider in Java. The problem I am facing is that the content downloaded for a URL differs from what the browser shows for the same URL. For example, take the following search URL:

http://www.google.co.in/search?sourceid=chrome&ie=UTF-8&q=web+spider#sclient=psy&hl=en&source=hp&q=web+spider&aq=f&aqi=&aql=&oq=web+spider&pbx=1&fp=d8e8e41d6d2bda33&biw=1366&bih=643

If you load this URL in a browser and also download it through Java's URL class, the content is different. This may be due to the following reasons:

JavaScript may send XMLHttpRequest calls and combine the results to build the final HTML.
URL redirection may lead to the final HTML.
Some other reason that I don't know about.
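
One further possible reason, specific to URLs like the one above: everything after `#` is a fragment, which a browser keeps on the client side for its scripts and does not send in the HTTP request. A plain download through Java's URL class would therefore only ever ask the server for the part before `#`. The sketch below illustrates this with the standard `java.net.URI` class; the class name `FragmentDemo`, the helper `withoutFragment`, and the shortened example URL are made up for illustration.

```java
import java.net.URI;

public class FragmentDemo {

    /** Returns the URL without its fragment, i.e. the part a plain HTTP download asks the server for. */
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        // Shortened, illustrative version of the search URL from the question.
        String url = "http://www.google.co.in/search?sourceid=chrome&ie=UTF-8&q=web+spider"
                + "#source=hp&fp=d8e8e41d6d2bda33";

        URI uri = URI.create(url);
        System.out.println("Sent to the server : " + withoutFragment(url));
        System.out.println("Query (sent)       : " + uri.getRawQuery());
        System.out.println("Fragment (not sent): " + uri.getRawFragment());
    }
}
```

If the page reads the fragment in JavaScript and fetches results based on it, a spider that does not execute scripts will not see that content.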

So, is there a way to emulate the browser in my Java program? Are there any third-party libraries that load a page the way a browser does and return the final content? Any help is appreciated.

+3

java html browser web scraping

hnm May 31 '11 at 3:43

1 answer

Frederic bazin · Accepted Answer · 2011-05-31T06:02:16+0000

Try HtmlUnit. It can emulate browser behavior and handle JavaScript.
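
A minimal sketch of what that might look like, assuming HtmlUnit 3.x or later is on the classpath (package `org.htmlunit`; the 2.x releases used `com.gargoylesoftware.htmlunit` instead). The class name `RenderedPageFetcher`, the helper `fetchRendered`, and the wait time are illustrative choices, not something the answer prescribes.

```java
import org.htmlunit.BrowserVersion;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

import java.io.IOException;

public class RenderedPageFetcher {

    /**
     * Loads the URL in a headless HtmlUnit browser, gives background scripts
     * some time to run, and returns the resulting DOM serialized as XML.
     * Assumes the URL points at an HTML page.
     */
    static String fetchRendered(String url, long jsWaitMillis) throws IOException {
        try (WebClient client = new WebClient(BrowserVersion.CHROME)) {
            // Real-world pages often contain scripts HtmlUnit cannot run; keep going anyway.
            client.getOptions().setThrowExceptionOnScriptError(false);
            // A spider usually does not need styling, so skip CSS processing.
            client.getOptions().setCssEnabled(false);

            HtmlPage page = client.getPage(url);
            // Wait for pending background JavaScript (for example XMLHttpRequest calls).
            client.waitForBackgroundJavaScript(jsWaitMillis);
            return page.asXml();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0 ? args[0] : "http://www.google.co.in/search?q=web+spider";
        System.out.println(fetchRendered(url, 5_000));
    }
}
```

HtmlUnit follows redirects by default, so this covers both the JavaScript case and the redirection case from the question. Its JavaScript support is an emulation, so the output may still differ from a real browser on script-heavy pages.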
