Help with screen scraping / parsing

I am trying to scrape and ultimately parse some data (in particular, prices and availability) from hostels.com, for example http://www.hostels.com/hosteldetails.php/HostelNumber.11890 . The problem is that after you select the number of nights and click "Book Now", nothing is passed through the URL string (I believe it is all done through Ajax), so I cannot go directly to a specific date or time frame.

I tried browser emulators such as Selenium, IRobotSoft, and FakeApp. I got Selenium and Fake to do most of the work, grabbing the full page source, but it was awkward and still tedious to scrape (and then parse with other software) even a few pages a day.

I also tried HTML DOM Parser, PHP Scriptable Web Browser, HTMLUnit, cScrape.php, and Crowbar. Either they could not cope with Ajax, or I had no luck with them even when they did run.

Ideally, I would like something that can run from a server, with as few dependencies as possible, but at this point I would settle for anything that runs.

Having spent many hours trying to get this to work, I still feel like I don't know where to start. Can someone point me in the right direction? Should I go back and spend more time with HTMLUnit? What would be the best practice for a site like this?

Thanks.

+3
4 answers

I'm really into Node.js at the moment (server-side JavaScript, if you are not familiar with it), so that is what I recommend. What's great is that you can use it to scrape sites, and you can use jQuery or whatever your favorite JS framework is to do all the work of parsing out the information you want! To get started, check out the following resources:

http://blog.dtrejo.com/scraping-made-easy-with-jquery-and-selectorga

https://github.com/tmpvar/jsdom

https://github.com/chriso/node.io/wiki/Scraping

https://github.com/joshfire/node-crawler

+2

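In the end, even with AJAX, the page is still just making HTTP requests. Because it is AJAX, the data is most likely sent as a POST (in the request body, not in the URL as it would be with a GET).

Use Firebug to see what is being sent in the POST. Then replicate that request. Finally, parse the HTML that the POST returns.

Also, +1 for the question.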
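The replicate-and-parse approach described in this answer could look roughly like the sketch below, written in Python with only the standard library. Everything site-specific here is an assumption: `BOOKING_URL` is a made-up placeholder, the field names `ArrivingField` and `NumNights` are borrowed from the form element names on the page and may not match what is really posted, and the `price` class in the sample markup is invented. Copy the real endpoint and fields from the request Firebug shows.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: replace with the URL Firebug shows for the POST.
BOOKING_URL = "http://www.hostels.com/hypothetical-booking-endpoint"


def build_booking_request(url, arrival, nights):
    """Build (but do not send) a POST request carrying the form fields."""
    # Assumed field names; use the ones from the captured request instead.
    body = urlencode({"ArrivingField": arrival, "NumNights": nights})
    return Request(
        url,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


class PriceParser(HTMLParser):
    """Collect the text of each element whose class list contains 'price'."""

    def __init__(self):
        super().__init__()
        self._open_tag = None  # tag name of the matching element we are inside
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._open_tag is None and "price" in classes:
            self._open_tag = tag
            self.prices.append("")

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        if self._open_tag is not None:
            self.prices[-1] += data


def extract_prices(html):
    """Return the stripped text of every 'price' element in the HTML."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return [price.strip() for price in parser.prices]


# Invented sample of what the POST response might contain.
SAMPLE_RESPONSE = (
    '<table><tr><td class="room">6 Bed Dorm</td>'
    '<td class="price">US$ 24.50</td></tr>'
    '<tr><td class="room">Private</td>'
    '<td class="price"> US$ 60.00 </td></tr></table>'
)
```

To actually send the request, pass the result of `build_booking_request(...)` to `urllib.request.urlopen`, decode the response body, and hand it to `extract_prices`. The site may also expect cookies or extra headers, which the captured request in Firebug would show.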

+2

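You say you tried Selenium, IRobotSoft and FakeApp, as well as HTML DOM Parser, PHP Scriptable Web Browser, HTMLUnit, cScrape.php and Crowbar.

Have you tried iMacros? http://wiki.imacros.net/Data_Extraction

Unlike HTMLUnit, iMacros copes well with ajaxy sites.

A script for your example could look like this: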

URL GOTO=http://www.hostels.com/hostels/ottawa/ottawa-backpackers-inn/11890
TAG POS=1 TYPE=INPUT:TEXT FORM=NAME:theForm ATTR=ID:ArrivingField CONTENT=15<SP>Jun<SP>2011
TAG POS=1 TYPE=DIV FORM=NAME:theForm ATTR=CLASS:calIcon
TAG POS=1 TYPE=SELECT FORM=NAME:theForm ATTR=NAME:NumNights CONTENT=%3
TAG POS=1 TYPE=SELECT FORM=NAME:theForm ATTR=NAME:NumNights CONTENT=%4
TAG POS=1 TYPE=INPUT:SUBMIT FORM=NAME:theForm ATTR=VALUE:Book<SP>Now
+2

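You could try Celerity (http://celerity.rubyforge.org), a JRuby wrapper around HTMLUnit.

With Celerity you write the script in Ruby, while the actual browsing is done by a Java library (HTMLUnit).

It handles DHTML and Ajax; sometimes you need a sleep() to give an Ajax call time to finish before you read the page.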
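A fixed sleep() after an Ajax call is either too short or wastes time, so a common alternative is to poll until the content you need has appeared. Here is a minimal sketch of that idea in Python (not Ruby, and not part of Celerity's API). The helper name `wait_until` and its parameters are made up for illustration, and `condition` stands for whatever check your browser driver lets you run, such as "the price table exists".

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call condition() until it returns a truthy value or the timeout elapses.

    Returns the truthy value, or raises TimeoutError if it never arrives.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Short pause between checks so the page is not polled in a tight loop.
        time.sleep(interval)
```

The condition is checked at least once, so a page that is already loaded returns immediately, and a page that never finishes loading fails with a clear error instead of hanging.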

Good luck!

+1
