How do you prohibit scanning on the origin server and yet have robots.txt distributed correctly?

Question


I ran into a rather unusual problem. If you are scaling up large sites and working with a company like Akamai, you have origin servers that Akamai talks to. Whatever you serve to Akamai, they will distribute on their CDN.

But how do you handle the robots.txt file? You do not want Google to crawl your origin. This can be a HUGE security issue. Think of denial of service attacks.

But if you put a robots.txt file on your origin with "Disallow", then your whole site will be uncrawlable!

The only solution I can think of is to serve a different robots.txt file to Akamai than to the rest of the world: disallow for the world, but allow for Akamai. But that is very hacky and prone to so many problems that I am reluctant to rely on it.
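A minimal sketch of that two-file idea, assuming the origin already has some way to tell whether a request comes from the CDN (for example a source-address check). The names `ALLOW_ALL`, `DENY_ALL`, and `robots_txt_for` are hypothetical and not part of any Akamai or web-server API:

```python
# robots.txt bodies: an empty Disallow permits everything,
# "Disallow: /" blocks the whole site.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
DENY_ALL = "User-agent: *\nDisallow: /\n"


def robots_txt_for(request_is_from_cdn: bool) -> str:
    """Pick the robots.txt body for a single request.

    Requests from the CDN get the permissive file, which the CDN then
    caches and serves to crawlers under the public hostname. Anything
    that reaches the origin directly gets the blocking file.
    """
    return ALLOW_ALL if request_is_from_cdn else DENY_ALL
```

The fragile part is the detection step, not the selection: if the origin ever misclassifies a CDN fetch, the blocking file ends up cached and distributed.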

(Of course, origin servers should not be publicly reachable, but I would venture to say that most of them are, for practical reasons ...)

This seems like something the protocol itself should handle better. Or perhaps search engines could allow a site-specific, hidden robots.txt file in their webmaster tools ...

Thoughts?


robots.txt cdn akamai

joedevon May 11 '11 at 11:03


1 answer

Chris Adams · Answer 1 · 2012-04-27T03:03:59+0000

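If you really want your origins to be secret, I would use a firewall / access control to restrict access for any host other than Akamai, which is the best way to avoid mistakes, and it is the only way to stop bots and attackers who simply scan public IP ranges looking for web servers.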
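Restricting origin access to the CDN's addresses normally belongs in the firewall itself, but the allow-list check can be sketched in a few lines. This assumes you maintain a list of the CDN's source ranges; the networks below are placeholder documentation ranges (RFC 5737), not real Akamai addresses, and `is_allowed_client` is a hypothetical helper:

```python
import ipaddress

# Placeholder ranges only (RFC 5737 documentation blocks).
# Replace with the source ranges your CDN actually publishes.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed_client(remote_addr: str) -> bool:
    """Return True if remote_addr lies inside one of the allowed networks."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a valid IP literal: reject rather than guess.
        return False
    return any(addr in net for net in ALLOWED_NETWORKS)
```

For example, `is_allowed_client("192.0.2.10")` is accepted, while an address outside both ranges, or a malformed string, is rejected.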

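That said, if all you want is to keep out non-malicious spiders, consider a redirect on your origin server that sends any request whose Host header does not specify your public hostname to the official name. You generally want something like that anyway, to avoid confusion or search-rank dilution if you have variations of the canonical hostname. With Apache this could use mod_rewrite, or even a simple virtualhost setup where the default server has RedirectPermanent / http://canonicalname.example.com/.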
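The Host-header rule can be illustrated outside Apache as a small decision function. This is only a sketch of the logic, not a replacement for the server configuration; `redirect_target` is a hypothetical name, and the canonical hostname is the example one used above:

```python
from typing import Optional

CANONICAL_HOST = "canonicalname.example.com"


def redirect_target(host_header: Optional[str], path: str = "/") -> Optional[str]:
    """Return the URL to permanently redirect to, or None to serve normally.

    Any request whose Host header is missing, or names something other
    than the canonical hostname (a raw IP, the origin's own name, ...),
    is sent to the canonical name with its path preserved.
    """
    # Drop an optional ":port" suffix and compare case-insensitively.
    host = (host_header or "").split(":")[0].strip().lower()
    if host == CANONICAL_HOST:
        return None
    return f"http://{CANONICAL_HOST}{path}"
```

A crawler that reaches the origin by IP address and asks for `/robots.txt` would be pointed at `http://canonicalname.example.com/robots.txt`, so it never sees origin-specific content, while CDN requests that carry the public hostname are served as usual.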

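If you do use this approach, you could either add the production names to the hosts file on your test systems when necessary, or create and whitelist an internal-only hostname (e.g., cdn-bypass.mycorp.com), so that you can reach the origin directly when you need to.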
