Help Needed for a Web Spider

I am writing a very simple web spider in Java. The problem I am facing is that the content downloaded for a URL differs from the content shown in the browser. For example, try the following search URL:

http://www.google.co.in/search?sourceid=chrome&ie=UTF-8&q=web+spider#sclient=psi&hl=ep&source=hp&Q=Web+spider&water=F&AQI=&acl=&OQ=Web+spider&PBX=1&Fp=d8e8e41d6d2bda33&BIW=1366&BiH=643

If you load this URL in a browser and also download it with the Java URL class, the content differs. This may be due to the following reasons:

  • JavaScript can send XMLHttpRequest calls and combine the results to build the final HTML.
  • URL redirection may happen before the final HTML is served.
  • Any other reasons that I don't know about.
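For reference, here is a minimal sketch of the kind of plain download described above, assuming only the standard library. The class name `PlainFetch` and the `User-Agent` value are illustrative assumptions, not taken from the original program. A fetch like this returns the markup exactly as the server sent it: no script is executed, and the part of the URL after `#` (the fragment) is not sent to the server at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name PlainFetch is not from the original post.
class PlainFetch {

    // Downloads the raw response body for the given URL without running any scripts.
    static String fetch(String url) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        if (conn instanceof HttpURLConnection) {
            HttpURLConnection http = (HttpURLConnection) conn;
            // Redirects are followed by default, but not across protocols
            // (for example from http to https).
            http.setInstanceFollowRedirects(true);
            // Some servers return different markup depending on the client (assumed value).
            http.setRequestProperty("User-Agent", "Mozilla/5.0");
        }
        try (InputStream in = conn.getInputStream()) {
            // Assumes UTF-8; a real spider would read the charset from Content-Type.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

Calling `PlainFetch.fetch(url)` on a page that builds its content with JavaScript returns the unexecuted `<script>` tags rather than the markup the browser ends up displaying.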

So, is there a way to emulate a browser in my Java program? Are there any third-party libraries that load a page the way a browser does and then return the final content? Any help is appreciated.

1 answer

Try HtmlUnit. It can emulate browser behavior and execute JavaScript.
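A rough sketch of what that could look like, assuming HtmlUnit 3.x or later is on the classpath (packages under `org.htmlunit`; the older 2.x releases use `com.gargoylesoftware.htmlunit` instead). The class name `RenderedFetch`, the method name, and the wait time are illustrative assumptions, and the right settings depend on the site being crawled.

```java
import java.io.IOException;
import org.htmlunit.BrowserVersion;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

// Hypothetical helper; the name RenderedFetch is not from the original answer.
class RenderedFetch {

    // Loads the page in a headless HtmlUnit client, lets scripts run,
    // and returns the resulting DOM serialized as XML.
    static String fetchRendered(String url, long waitMillis) throws IOException {
        try (WebClient client = new WebClient(BrowserVersion.CHROME)) {
            client.getOptions().setJavaScriptEnabled(true);
            // CSS is not needed just to read the content.
            client.getOptions().setCssEnabled(false);
            // Real-world pages often contain scripts HtmlUnit cannot run cleanly.
            client.getOptions().setThrowExceptionOnScriptError(false);

            // Assumes the response is HTML; other content types are not HtmlPage.
            HtmlPage page = client.getPage(url);
            // Give background scripts (such as XMLHttpRequest calls) time to finish.
            client.waitForBackgroundJavaScript(waitMillis);
            return page.asXml();
        }
    }
}
```

For example, `RenderedFetch.fetchRendered(url, 5000)` waits up to five seconds for background scripts before returning. The output may still differ from a real browser, since HtmlUnit's JavaScript support is an emulation.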
