
The returned page is only viewable in a text editor, and looks like this:

<html style="height:100%">
  <head>
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    <meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=2977d8d74f63d7f8fedbea018b7a1d05"></script>
  </head>
  <body style="margin:0px;height:100%">
    <iframe src="/_Incapsula_Resource?CWUDNSAI=23&xinfo=8-12690372-0 0NNN RT(1406173695342 164) q(0 -1 -1 -1) r(0 -1) B12(4,315,0) U10000&incident_id=257000050029892977-66371435311988824&edet=12&cinfo=4b6fe7bcc753855a04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 257000050029982977-66371435131988824</iframe>
  </body>
</html>

I'm doing the following in perl:

# Suddenly web robot.
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->agent_alias('Mac Safari');

How are they detecting it? I wouldn't think it could be just the user agent string. Is there any way to bypass this? I'm not doing anything nasty, just trying to download my retirement account savings without having to do it manually.

I see several results on how to honor a robots.txt, but nothing on how to escape detection.

Looking through the page with Chrome, it seems that they use these guys somehow:

http://www.incapsula.com/website-security/

Anyone have any ideas?

John O

2 Answers


I recommend that you use an alternative that lets you drive a real browser for automation.

This has the side benefit of enabling you to work with JavaScript, which is likely to be a requirement of this website anyway.

Two options are:

  1. WWW::Mechanize::Firefox - use Firefox as if it were WWW::Mechanize

  2. Selenium::Remote::Driver - Perl Client for Selenium Remote Driver
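As an illustration of the second option, here is a minimal sketch using Selenium::Remote::Driver. It assumes a Selenium server is already running on localhost:4444, and the URL is a placeholder, not the actual site:

```perl
# Sketch only: drive a real browser through a running Selenium
# server, so JavaScript (and any cookies it sets) is handled by
# the browser itself rather than by your Perl code.
use strict;
use warnings;
use Selenium::Remote::Driver;

# Assumes a Selenium server is listening on localhost:4444.
my $driver = Selenium::Remote::Driver->new(
    remote_server_addr => 'localhost',
    port               => 4444,
    browser_name       => 'firefox',
);

$driver->get('https://example.com/login');   # placeholder URL
my $html = $driver->get_page_source();       # fully rendered page
$driver->quit();
```

Because the page is fetched by an ordinary browser, anything the bot-detection script checks on the client side runs exactly as it would for a human visitor.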

Miller
    I may try that again, but last time I played with ::Firefox I could not get it to run. Been a few years, and I'm on a new machine, so worth a shot. – John O Jul 24 '14 at 15:47
  • Selenium can also be a good option. Saw a presentation on it recently, although haven't yet tried it myself. Here are the slides: http://www.slideshare.net/Caroline_Burns/selenium-perl – Miller Jul 24 '14 at 17:13

It's using a bot agent detection technique.

Bot agent detection is done to identify the most common bot agents that perform site scraping and to stop them from causing any further harm. Various tools are used for this that automatically differentiate between robots and actual human users. The site you mentioned is using software from Incapsula to detect bots. I would suggest not trying to scrape the data if they don't allow it. They might be setting some cookies via JavaScript, and those would not be picked up by Mechanize.
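One way to see the cookie problem in practice is to dump Mechanize's cookie jar after a fetch (a sketch; the URL is a placeholder). Cookies set by JavaScript in a real browser simply never appear there:

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('https://example.com/');   # placeholder URL

# Mechanize only stores cookies delivered in HTTP Set-Cookie
# headers; anything the page sets via document.cookie in
# JavaScript never reaches this jar.
print $mech->cookie_jar->as_string;
```

If the site's detection script expects a JavaScript-set cookie on the next request, Mechanize will fail that check every time.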

Also read: Detecting Bots and Spiders with Plack Middleware and How do I prevent site scraping?

Hints on bypassing:

  1. Try adding calls to sleep to prevent triggering the bot-detection code.

  2. Use LiveHTTPHeaders to see what gets submitted by the browser and replicate that.
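A sketch combining both hints with WWW::Mechanize. The header values and URL list here are illustrative; copy the exact headers your own browser sends, as shown by LiveHTTPHeaders:

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 0 );
$mech->agent_alias('Mac Safari');

# Hint 2: replicate the extra headers a real browser sends.
# These values are examples; use what LiveHTTPHeaders shows.
$mech->add_header(
    'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' => 'en-US,en;q=0.8',
);

my @urls = ('https://example.com/');   # placeholder list

for my $url (@urls) {
    $mech->get($url);
    # Hint 1: pause between requests so the traffic pattern
    # looks less like a bot hammering the server.
    sleep 2 + int rand 3;   # 2-4 seconds
}
```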

Chankey Pathak
  • Thanks for the hint, but it's useless. It does this on the first page fetch. I don't think it's flagged my IP either, since I can still log in via the browser... – John O Jul 24 '14 at 04:25
  • Then it might be using cookies or something else. Use LiveHTTPHeaders and see what's the difference. – Chankey Pathak Jul 24 '14 at 04:40
  • I was missing a few of the request headers that Chrome sends. I've since fixed that. Doesn't seem to matter, they won't even let me get the index... – John O Jul 24 '14 at 05:24