I'm using PHP and cURL to scrape a web page, but the page is poorly designed (no classes or ids on its tags), so I need to search for specific text, go to the tag holding it (e.g. <p>), then move to the next child (or the next <p>) and get its text.

There are various things I need to get from the page, including the text inside an onclick attribute, as in <a onclick="get this stuff here">. So I think I need to use cURL to fetch the source code into a PHP variable, then use PHP to parse it and find what I need.

Does this sound like the best way to do this? Does anyone have any pointers, or can anyone demonstrate how to put the source code from cURL into a variable?

Thanks!

EDIT (Working/Current Code) -----------

<?php

class Scrape
{
public $cookies = 'cookies.txt';
private $user = null;
private $pass = null;

/*Data generated from cURL*/
public $content = null;
public $response = null;

/* Links */
private $url = array(
                    'login'      => 'https://website.com/login.jsp',
                    'submit'     => 'https://website.com/LoginServlet',
                    'page1'      => 'https://website.com/page1',
                    'page2'      => 'https://website.com/page2', 
                    'page3'      => 'https://website.com/page3'
                    );

/* Fields */
public $data = array();

public function __construct ($user, $pass)
{

    $this->user = $user;
    $this->pass = $pass;

}

public function login()
{

            $this->cURL($this->url['login']);

            if($form = $this->getFormFields($this->content, 'login'))
            {
                $form['login'] = $this->user;
                $form['password'] =$this->pass;
                // echo "<pre>".print_r($form,true);exit;
                $this->cURL($this->url['submit'], $form);
                //echo $this->content;//exit;
            }
           //echo $this->content;//exit;
}

// NEW TESTING
public function loadPage($page)
{
            $this->cURL($this->url[$page]);
            echo $this->content;//exit;
}

/* Scan for form */
private function getFormFields($data, $id)
{
        // preg_quote() keeps any regex metacharacters in $id from altering the pattern
        if (preg_match('/(<form.*?name=.?'.preg_quote($id, '/').'.*?<\/form>)/is', $data, $matches)) {
            $inputs = $this->getInputs($matches[1]);

            return $inputs;
        } else {
            return false;
        }

}

/* Get Inputs in form */
private function getInputs($form)
{
    $inputs = array();

    $elements = preg_match_all('/(<input[^>]+>)/is', $form, $matches);

    if ($elements > 0) {
        for($i = 0; $i < $elements; $i++) {
            $el = preg_replace('/\s{2,}/', ' ', $matches[1][$i]);

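            if (preg_match('/name=(?:["\'])?([^"\'\s]*)/i', $el, $nameMatch)) {
                $name  = $nameMatch[1];
                $value = '';

                // Use a separate match array: preg_match() overwrites its third argument
                // with an empty array when nothing matches, which would clobber $value
                if (preg_match('/value=(?:["\'])?([^"\']*)/i', $el, $valueMatch)) {
                    $value = $valueMatch[1];
                }

                $inputs[$name] = $value;
            }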
        }
    }

    return $inputs;
}

/* Perform curl function to specific URL provided */
public function cURL($url, $post = false)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
        // "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the page as a string instead of printing it
    // Certificate checks are disabled here; enable them if the site has a valid certificate
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_VERBOSE, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookies);    // write cookies here after the request
    curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookies);   // read cookies from here before the request
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
    curl_setopt($ch, CURLOPT_TIMEOUT, 120);

    if($post)   //if post is needed
    {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
    }

    curl_setopt($ch, CURLOPT_URL, $url);
    $this->content = curl_exec($ch);
    $this->response = curl_getinfo( $ch );
    $this->url['last_url'] = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
}
}


$sc = new Scrape('user','pass');
$sc->login();

$sc->loadPage('page1');
echo "<h1>TESTTESTEST</h1>";

$sc->loadPage('page2');
echo "<h1>TESTTESTEST</h1>";

$sc->loadPage('page3');
echo "<h1>TESTTESTEST</h1>";

(Note: credit to @Ramz, from "scrape a website with secured login".)

Kenny

2 Answers

You can divide your problem into several parts.

  1. Retrieving the data from the data source. For that, you can use cURL or file_get_contents(), depending on your requirements. Code examples are easy to find, e.g. http://php.net/manual/en/function.file-get-contents.php and http://php.net/manual/en/curl.examples-basic.php

  2. Parsing the retrieved data. For that, I would start by looking into "PHP Simple HTML DOM Parser", which you can use to extract data from an HTML string: http://simplehtmldom.sourceforge.net/

  3. Building and generating the output. This is simply a question of what you want to do with the data that you have extracted. For example, you can print it, reformat it, or store it to a database/file.
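The first two steps can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it uses PHP's built-in DOMDocument in place of the Simple HTML DOM library mentioned above, and the helper names fetchHtml() and extractOnclickValues() are made up for illustration. It assumes the page is reachable without a login and that the onclick attributes of <a> tags are what you want to collect.

```php
<?php
// Hypothetical helper: fetch a URL and return the response body as a string.
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    return $body; // the handle is released when $ch goes out of scope
}

// Hypothetical helper: collect the onclick attribute of every <a> tag that has one.
function extractOnclickValues(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $values = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('onclick')) {
            $values[] = $a->getAttribute('onclick');
        }
    }

    return $values;
}
```

With these in place, something like print_r(extractOnclickValues(fetchHtml('https://website.com/page1'))); would cover step 3 by printing what was found. For pages behind a login, the cookie handling from the question's own cURL() method would still be needed.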

MegaAppBear

I suggest you use a ready-made scraper. I use Goutte (https://github.com/FriendsOfPHP/Goutte), which lets me load website content and traverse it the same way you would with jQuery. For example, if I want the content of the <div id="content">, I call filter('#content')->text() on the crawler returned by $client->request().

It even allows me to find and 'click' on links and submit forms to retrieve and process the content.

It makes life so much easier than using cURL or file_get_contents() and working your way through the HTML manually.

Horaland
  • Thanks for the response, Horaland, is Goutte able to search for plain-text? Because this terrible website has no css class or ids, so I have to use tricky methods. Also, I looked into Goutte, but it was very confusing on how to set up. It required Guzzle I think, and I'm not familiar with how it's used, could you please provide some enlightenment? Thanks! – Kenny Feb 26 '15 at 17:15
  • I use Symfony so setting up Guzzle & Goutte is simply a case of making a "composer require fabpot/goutte" command. It needs some kind of tag to work on but doesn't need an ID or a class so you can get the content of the tag or a tag.
    – Horaland Feb 26 '15 at 18:01
  • So it cannot do a match like `preg_match`? I'd need to do something like `preg_match('website.com/page1&id=' . $id);`, i.e. search for the actual text, find the tag it's contained in, then move to the next child of that tag. (So if the desired text is in a tag, find the next occurrence of that tag and get the content within.)

    – Kenny Feb 26 '15 at 18:06
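The text search described in this comment can be done without classes or ids. Below is a minimal sketch using PHP's built-in DOMDocument, not Goutte. The function name textOfNextSameTag() is hypothetical, and the sketch assumes that the "next occurrence" is a following sibling of the matching tag and that a case-insensitive substring match is enough.

```php
<?php
// Hypothetical helper: find an element of type $tag whose text contains $needle
// (case-insensitive), then return the trimmed text of the next sibling element
// with the same tag name. Returns null when no such pair exists.
function textOfNextSameTag(string $html, string $needle, string $tag = 'p'): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName($tag) as $node) {
        // textContent includes the text of any nested tags
        if (stripos($node->textContent, $needle) === false) {
            continue;
        }

        // Walk forward over siblings, skipping whitespace text nodes and other tags
        for ($next = $node->nextSibling; $next !== null; $next = $next->nextSibling) {
            if ($next->nodeType === XML_ELEMENT_NODE && $next->nodeName === $tag) {
                return trim($next->textContent);
            }
        }
        // No following sibling of that type: keep looking at later matches
    }

    return null;
}
```

If a real pattern is needed rather than a substring, the stripos() check could be swapped for preg_match() with a properly delimited pattern. $tag should be given in lowercase, since DOMDocument reports HTML element names that way.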