How do you scrape a site that requires credentials (SSL)?


I was wondering if anyone could point me in the right direction. I want to scrape the HTML/text content from an SSL-enabled website (https in the URL). The site has several branches in its directory structure that need to be covered.

My question:

How do I provide credentials for an external website from my Rails application?

Thanks!

+5

ruby-on-rails ssl web-scraping screen-scraping

Symba Sep 25 '12 at 22:26


3 answers

You can try the Typhoeus gem.

For example, using Typhoeus in irb:

1.9.3p194 :001 > Typhoeus # Checking that Typhoeus gem is being used.
 => Typhoeus 
1.9.3p194 :002 > url = "https://twitter.com/"
 => "https://twitter.com/" 
1.9.3p194 :003 > response = Typhoeus::Request.get(url, :timeout => 5000)

 => #<Typhoeus::Response:0x007fdd8cc00488 @code=200, @curl_return_code=0, @curl_error_message="No error", @status_message=nil, @http_version=nil, @headers="HTTP/1.1 200 OK\r\nDate: Tue, 25 Sep 2012 23:56:32 GMT\r\nStatus: 200 OK\r\nX-Runtime: 0.08814\r\nX-MID: 0cfcab7a410834bf31115f9a5cd7fb62651aa568\r\nStrict-Transport-Security: max-age=631138519\r\nCache-Control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0\r\nContent-Type: text/html; charset=utf-8\r\nX-Frame-Options: SAMEORIGIN\r\nLast-Modified: Tue, 25 Sep 2012 23:56:32 GMT\r\nETag: \"95db45f50f8dc1a45be3895e03a23d53\"\r\nExpires: Tue, 31 Mar 1981 05:00:00 GMT\r\nX-Transaction: 72253ef75f0755e1\r\nPragma: no-cache\r\nSet-Cookie: k=10.35.35.113.1348617392068257; path=/; expires=Tue, 02-Oct-12 23:56:32 GMT; domain=.twitter.com\r\nSet-Cookie: guest_id=v1%3A134861739271966362; domain=.twitter.com; path=/; expires=Fri, 26-Sep-2014 11:56:32 GMT\r\nSet-Cookie: _twitter_sess=BAh7CToPY3JlYXRlZF9hdGwrCFBS3P85AToMY3NyZl9pZCIlNTY2MzNjOTM0%250AOTIyMDE4ZmNkY2E4NjViZmE3ZTBkMDAiCmZsYXNoSUM6J0FjdGlvbkNvbnRy%250Ab2xsZXI6OkZsYXNoOjpGbGFzaEhhc2h7AAY6CkB1c2VkewA6B2lkIiViYjAw%250AY2Q1YWZkMDAwNmExNWJhNjAyYmNiNzBhOTA0Yg%253D%253D--5ffbea931432fe65a2128be90048e3bb6fc9dbca; domain=.twitter.com; path=/; HttpOnly\r\nX-XSS-Protection: 1; mode=block\r\nVary: Accept-Encoding\r\nContent-Encoding: gzip\r\nContent-Length: 13733\r\nServer: tfe\r\n\r\n", @body="<!DOCTYPE html>\n<html lang=\"en\">\n  <head>\n    <meta charset=\"utf-8\">\n    \n    <script>document.domain='twitter.com'</script>\n\n      <title>Twitter</title>\n\n    <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\">\n    \n      <meta name=\"description\" content=\"Instantly connect to what&#39;s most important to you. 
Follow your friends, experts, favorite celebrities, and breaking news.\">\n    \n    \n      <link href=\"/favicons/favicon.ico\" rel=\"shortcut icon\" type=\"image/x-icon\">\n    \n    \n          <link rel=\"stylesheet\" href=\"https://twimg0-a.akamaihd.net/a/1348559220/t1/css/t1_core_logged_out.bundle.css\" type=\"text/css\" media=\"screen\">\n    \n        <link rel=\"stylesheet\" href=\"https://twimg0-a.akamaihd.net/a/13485592

1.9.3p194 :005 >    response.body # returns html document
 => "<!DOCTYPE html>\n<html lang=\"en\">\n  <head>\n    <meta charset=\"utf-8\">\n    \n    <script>document.domain='twitter.com'</script>\n\n      <title>Twitter</title>\n\n    <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\">\n    \n      <meta name=\"description\" content=\"Instantly connect to what&#39;s most important to you. Follow your friends, experts, favorite celebrities, and breaking news.\">\n    \n    \n      <link href=\"/favicons/favicon.ico\" rel=\"shortcut icon\" type=\"image/x-icon\">\n    \n    \n          <link rel=\"stylesheet\" href=\"https://twimg0-a.akamaihd.net/a/1348559220/t1/css/t1_core_logged_out.bundle.css\" type=\"text/css\" media=\"screen\">\n    \n        <link rel=\"stylesheet\" href=\"https://twimg0-a.akamaihd.net/a/1348559220/t1/css/t1_more.bundle.css\" type=\"text/css\" media=\"screen\">\n    \n          <script>\n      (function() {\n        function getPhxPath(){var a=l.href.match(/#(.)(.*)$/);return a&&a[1]==\"!\"&&a[2]}function getEvent(a){return a?(a=a.replace(/^#|\\/$/,\"\").toLowerCase(),a.match(/^[a-z0-9_]+$/)?a:!1):!1}function redirectEventPath(a){var a=getEvent(a);if(a){var b=document.referrer||\"none\",c=\"ev_redir_\"+a+\"=\"+b+\"; path=/\";document.cookie=c,l.replace(\"/hashtag/\"+a)}}function resolveInlineRedirects(){var a=getPhxPath();a&&l.replace(a),l.hash!=\"\"&&redirectEventPath(l.hash.substr(1).toLowerCase())}var l=window.location;resolveInlineRedirects(),window.addEventListener?window.addEventListener(\"hashchange\",resolveInlineRedirects,!1):window.attachEvent&&window.attachEvent(\"onhashchange\",resolveInlineRedirects);\n      }());\n      </script>\n    \n    <script>\n      \n      \n      (func


+2

Jason Kim Sep 26 '12 at 0:01

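With open-uri you can pass HTTP basic authentication credentials directly. On Ruby 3.0 and later, call URI.open, because open-uri no longer extends Kernel#open:

URI.open("http://...", :http_basic_authentication=>[user, password])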
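If you would rather stay with the standard library, the same idea can be sketched with Net::HTTP. This is only an illustration under assumptions: `authenticated_get` and `fetch_pages` are hypothetical helper names, the site is assumed to use HTTP Basic authentication, and the URLs and credentials are placeholders. One connection is reused for every path, which may suit a site with several branches.

```ruby
require "net/http"
require "uri"

# Build a GET request that carries HTTP Basic credentials.
# (Hypothetical helper; assumes the site uses Basic authentication.)
def authenticated_get(uri, user, password)
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(user, password)
  request
end

# Fetch several paths from one site over a single connection.
# Returns a hash of path => response body.
def fetch_pages(base_url, paths, user, password)
  base = URI.parse(base_url)
  # use_ssl turns on TLS when the base URL is https.
  Net::HTTP.start(base.host, base.port, use_ssl: base.scheme == "https") do |http|
    paths.each_with_object({}) do |path, pages|
      uri = URI.join(base_url, path)
      response = http.request(authenticated_get(uri, user, password))
      response.value # raises an HTTP error unless the status is 2xx
      pages[path] = response.body
    end
  end
end

# Usage (placeholder host, paths and credentials):
# pages = fetch_pages("https://example.com/", ["/docs/", "/reports/"], "user", "secret")
```

Certificate verification is left at the Net::HTTP default, so a misconfigured certificate on the server will surface as an error rather than being silently ignored.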

Here is a fuller example that sends custom request headers:

require "open-uri"
require "timeout"
require "zlib"

SHINSO_HEADERS = {
  'Accept'          => '*/*',
  'Accept-Charset'  => 'utf-8, windows-1251;q=0.7, *;q=0.6',
  'Accept-Encoding' => 'gzip,deflate',
  'Accept-Language' => 'bg-BG, bg;q=0.8, en;q=0.7, *;q=0.6',
  'Connection'      => 'keep-alive',
  'Cookie'          => '',
  'From'            => 'email@example.com',
  'Referer'         => 'http://svejo.net/',
  'User-Agent'      => 'Your user agent'
}

# Meant to live in a class that defines an `errors` accessor.
def crawl(url_address)
  self.errors = Array.new
  begin
    begin
      url_address = URI.parse(url_address)
    rescue URI::InvalidURIError
      # URI.decode / URI.encode were removed in Ruby 3.0; the RFC 2396
      # parser offers the equivalent unescape / escape methods.
      parser = URI::RFC2396_Parser.new
      url_address = URI.parse(parser.escape(parser.unescape(url_address)))
    end
    url_address.normalize!
    stream = ""
    # Give up if the request takes longer than 8 seconds.
    Timeout.timeout(8) { stream = url_address.open(SHINSO_HEADERS) }
    if stream.size > 0
      # base_uri is the address that was actually served.
      url_crawled = URI.parse(stream.base_uri.to_s)
    else
      self.errors << "Server said status 200 OK but document file is zero bytes."
      return
    end
  rescue StandardError => exception
    self.errors << exception
    return
  end
end

url_crawled is the address that was actually fetched, after any redirects.
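The address clean-up step inside crawl can be looked at on its own. The sketch below is a minimal version of that fallback, assuming a hypothetical helper name `safe_parse` and a hypothetical constant `ESCAPER`: it tries to parse the address as given and only re-escapes it when parsing fails.

```ruby
require "uri"

# RFC 2396 parser, used only for its escape / unescape methods.
# (ESCAPER is a name chosen for this sketch.)
ESCAPER = URI::RFC2396_Parser.new

# Parse an address, repairing unescaped characters such as spaces.
# Unescaping first avoids turning an existing %20 into %2520.
def safe_parse(address)
  URI.parse(address).normalize
rescue URI::InvalidURIError
  URI.parse(ESCAPER.escape(ESCAPER.unescape(address))).normalize
end

puts safe_parse("http://Example.com/a b")
```

normalize lowercases the scheme and host, so equivalent addresses compare equal when you keep track of pages you have already visited.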

See also: https://developer.mozilla.org/en-US/docs/HTTP_access_control

If you still get an error, the server may not be configured correctly certificate-wise, and you should check that.

And if you're serious about parsing, you can also use the CharGuess and Zlib libraries to read the content correctly, and then convert the problematic documents with Iconv. Here is an example.

if    stream.content_encoding.include?('gzip')
  document = Zlib::GzipReader.new(stream).read
elsif stream.content_encoding.include?('deflate')
  # Decompressing needs Inflate; Deflate would compress the data again.
  document = Zlib::Inflate.inflate(stream.read)
#elsif stream.content_encoding.include?('x-gzip') or
#elsif stream.content_encoding.include?('compress')
else
  document = stream.read
end
# Guess the character set so the text can be converted afterwards.
self.charset_guess = CharGuess.guess(document)

Then just use Iconv on the content.
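Iconv was removed from Ruby's standard library in Ruby 2.0, so on current versions the conversion step can be done with String#encode instead. The following is a rough, self-contained sketch of both steps, not the exact code from my crawler: `decode_body` and `to_utf8` are hypothetical helper names, and the charset is assumed to come from the response headers or from a guesser such as CharGuess.

```ruby
require "zlib"
require "stringio"

# Undo the transfer compression named in the Content-Encoding header.
# content_encodings is an array such as ["gzip"], as open-uri reports it.
def decode_body(raw, content_encodings)
  if content_encodings.include?("gzip")
    Zlib::GzipReader.new(StringIO.new(raw)).read
  elsif content_encodings.include?("deflate")
    Zlib::Inflate.inflate(raw)
  else
    raw
  end
end

# Label the bytes with their source charset, then transcode to UTF-8.
# Characters that cannot be converted are replaced instead of raising.
def to_utf8(bytes, charset)
  bytes.dup.force_encoding(charset).encode("UTF-8", invalid: :replace, undef: :replace)
end

# Usage with made-up data: a gzip-compressed body in windows-1251.
body = decode_body(Zlib.gzip("\xEF\xF0\xE8\xE2\xE5\xF2".b), ["gzip"])
puts to_utf8(body, "windows-1251")
```

Keeping decompression and charset conversion in separate helpers makes it easy to swap in whatever charset detection you prefer.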

Hope this helps you.

Regards, Yavor

+1

Yavor Ivanov Sep 26 '12 at 15:04


Symba · Accepted Answer · 2012-11-06T23:10:24+0000

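require 'httpclient'
require 'nokogiri'

client = HTTPClient.new

# Register the credentials for every URL under this domain.
client.set_auth("http://domain.com", "username", "password")

# Fetch a page from that same domain and parse it with Nokogiri.
doc = Nokogiri::HTML(client.get_content("http://domain.com"))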

, , , . . ( , ). , openuri, mechanize .., , MD5 . .
