I need to write a web crawler for a specific user agent

I need to write a web crawler and want to scan using a well-known user agent. For example, I want my crawler to work as an iphone to scan the mobile website of a website, then scan it again using the Mozilla PC agent, etc.

Thus, Ill will be able to scan every "type" of the site (mobile and PC). However, I also want to have the crawler user agent installed, so webmasters also see in their statistics that it is a crawler who visited the entire site, not real users.

So my question is: do you guys know how to install mobile agent + crawler agent simultaneously in PHP? Is it possible?

+2
source share
2 answers

For a>:

10.15 User-Agent

The User-Agent request-request header field contains information about the user agent sending the request. This is done for statistical purposes, tracking protocol violations and automatically recognizing user agents to adapt responses to avoid specific user agent restrictions. Although not required, user agents should include this request field. The field may contain several product tokens (section 3.7) and comments identifying the agent and any offal that make up a significant part of the user agent. From convention, product tokens are listed in order of their value to identify the application.

 User-Agent     = "User-Agent" ":" 1*( product | comment )

Example:

  User-Agent: CERN-LineMode/2.15 libwww/2.17b3

, , . GoogleBot-Mobile:

iPhone

Mozilla/5.0 (iPhone; U; CPU iPhone OS) (compatible; MyBot/1.0; +http://about.my/bot")
+3
    function crawl($url){

        $headers[]  = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13"; // <-- this is user agent
        $headers[]  = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        $headers[]  = "Accept-Language:en-us,en;q=0.5";
        $headers[]  = "Accept-Encoding:gzip,deflate";
        $headers[]  = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $headers[]  = "Keep-Alive:115";
        $headers[]  = "Connection:keep-alive";
        $headers[]  = "Cache-Control:max-age=0";

        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
        curl_setopt($curl, CURLOPT_ENCODING, "gzip");
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        $data = curl_exec($curl);
        curl_close($curl);
        return $data;

    }

echo crawl("http://www.google.com"); // revenge

, , http://m.facebook.com/ , - , .

0

All Articles