
DOM manipulation with PHP, the ultimate page scraper?


 

January 12, 2010 | By: Adrian

These days we hear a good deal about DOM manipulation with JavaScript, but some lesser-known technologies (for now; they're quickly gaining ground) are XPath, XQuery, and XSLT.

Fellow developers will know that historically we've had to rely on regular expressions to scrape a page. While this is often fast, the expressions can be horrendous to read and edit, as a tiny mistake can render a regular expression useless. That's not to say regular expressions aren't useful when the hierarchy is small and simple, but in today's world of Web 2.0 designs it often isn't.

What is page scraping?

Page scraping is a method that allows you to pull information from a web page, so that the data can be manipulated inside your own script. In your script, you can connect to another URL and request a page, just like a browser would do it. Once you make the request, the web server will send back the page you asked for and your script can manipulate the data and extract specific information.
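As a minimal sketch of the request step described above, the helper below asks a server for a page and hands back the raw HTML. The function name `fetchPage()`, the timeout, and the User-Agent string are assumptions made for illustration, not part of any library.

```php
<?php

/**
 * Fetch the raw HTML of a page, or return null if the request fails.
 * fetchPage() is a hypothetical helper written for this example.
 */
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    // Send a User-Agent and a timeout along with the request,
    // roughly as a browser would. Both values are assumptions.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleScraper/1.0',
        ],
    ]);

    // file_get_contents() returns false when the request fails;
    // the @ keeps the warning out of the output.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}

// Usage from the command line: php fetch.php https://example.com
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $html = fetchPage($argv[1]);
    echo $html === null
        ? "Request failed\n"
        : strlen($html) . " bytes received\n";
}
```

Returning null instead of false keeps the failure case explicit, so the caller has to decide what to do before any parsing starts.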

What exactly does the DOMDocument object do?

If you're not familiar with the DOM, then the following explanation probably isn't going to make much sense, as there really isn't much to say besides this: DOMDocument transforms an HTML page into a tree of elements. A browser does the same thing when it loads a page, and manipulating that tree is what JavaScript is most often used for. I find a visual representation often helps to understand the tree model, so here is a simplified version of it:

[Image: diagram of the DOM tree]

The above is a representation of the following HTML:

<div>
    <ul>
        <li>
            <a href="#">URL</a>
        </li>
        <li>
            <a href="#"><img src="#" /></a>
        </li>
        <li>
            <a href="#">URL</a>
        </li>
    </ul>
</div>

Sidenote: If your development background is in more traditional languages, this will look familiar, as it resembles a binary tree. However, a DOM node can have any number of children.
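To make the tree model concrete, here is a small sketch that loads the markup above (with the list closed properly) and prints one indented line per element. `describeTree()` is a hypothetical helper written for this example, not a DOMDocument method.

```php
<?php

// The sample markup from the article.
$html = <<<HTML
<div>
    <ul>
        <li><a href="#">URL</a></li>
        <li><a href="#"><img src="#" /></a></li>
        <li><a href="#">URL</a></li>
    </ul>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);

/**
 * Return one indented line per element below $node.
 * Text nodes (including whitespace) are skipped.
 */
function describeTree(DOMNode $node, int $depth = 0): array
{
    $lines = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $lines[] = str_repeat('  ', $depth) . $child->tagName;
            // Recurse: a node may have any number of children.
            $lines = array_merge($lines, describeTree($child, $depth + 1));
        }
    }
    return $lines;
}

// loadHTML() wraps the fragment in <html><body>, so start at our <div>.
$div  = $doc->getElementsByTagName('div')->item(0);
$tree = array_merge(['div'], describeTree($div, 1));

echo implode("\n", $tree), "\n";
```

The output is nine lines, starting with `div`, then `ul`, then three `li` branches, each holding an `a`, with the `img` nested one level deeper under the second link.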

Through the DOMDocument object we are given the ability to traverse nodes, create them, and remove them as we see fit. However, traversing the levels of the DOM solely through the methods the object provides is sometimes cumbersome and altogether impractical given the depth of the information we need. Enter XPath: it is to XML-compliant markup languages what SQL is to databases, a query language. The entire breadth of XPath is outside the scope of this particular post but is covered in depth here. If you're familiar with jQuery or any other JavaScript framework that supports CSS-style selectors, this will be easy for you.
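The following sketch shows both halves of that paragraph on a small list: creating and removing nodes through DOMDocument's own methods, then reading values back with a single XPath query. The `href` values and link text are made up for illustration.

```php
<?php

// Hypothetical markup; the hrefs are placeholders.
$html = '<div><ul>'
      . '<li><a href="/one">One</a></li>'
      . '<li><a href="/two"><img src="two.png" /></a></li>'
      . '<li><a href="/three">Three</a></li>'
      . '</ul></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$ul = $doc->getElementsByTagName('ul')->item(0);

// Create a new <li><a href="/four">Four</a></li> and append it.
$li = $doc->createElement('li');
$a  = $doc->createElement('a', 'Four');
$a->setAttribute('href', '/four');
$li->appendChild($a);
$ul->appendChild($li);

// Remove the second list item (the image link).
$items = $ul->getElementsByTagName('li');
$ul->removeChild($items->item(1));

// One XPath expression replaces the nested loops we would
// otherwise need to reach every href in the list.
$xPath = new DOMXPath($doc);
$hrefs = [];
foreach ($xPath->query('//ul/li/a/@href') as $attr) {
    $hrefs[] = $attr->value;
}

echo implode(', ', $hrefs), "\n"; // prints: /one, /three, /four
```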

How to use XPath with DOMDocument

We'll start off with something basic as an introduction: we'll scrape the “Why us” section of TECKpert's homepage.

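<?php

// Define our URL and start the DOMDocument
$url = 'https://teckpert.com';
$doc = new DOMDocument();

// Real-world HTML is rarely valid; collect parser warnings
// instead of printing them
libxml_use_internal_errors(true);

// Load the HTML into our object
$doc->loadHTMLFile($url);

// Alternatively, fetch the page yourself and parse the string:
// $html = file_get_contents($url);
// $doc->loadHTML($html);

libxml_clear_errors();

// Now that we've created our DOM object,
// create the XPath object for it
$xPath = new DOMXPath($doc);

// Query TECKpert's DOM for the 'why us' section
$results = $xPath->query('//div[@class="why_us"]');

// Only print when the query actually matched something
if ($results !== false && $results->length > 0) {
    echo $results->item(0)->textContent;
}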

Simple to use, right? There will be more to come on the technologies associated with XML traversal and its related query languages.
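One caveat with the query above is that `@class="why_us"` only matches when the attribute is exactly that string. The sketch below uses made-up markup (not TECKpert's real page) to show a common XPath 1.0 idiom for matching a single class among several, and a relative query that runs inside the matched node.

```php
<?php

// Hypothetical markup standing in for a real page.
$html = '<div class="section why_us"><h2>Why us</h2><p>We ship.</p></div>'
      . '<div class="why_us_footer"><p>Not this one.</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xPath = new DOMXPath($doc);

// Exact comparison: fails, because the attribute is "section why_us".
$exact = $xPath->query('//div[@class="why_us"]');

// Token comparison: pad the class list with spaces and look for
// " why_us " so that "why_us_footer" does not match by accident.
$token = $xPath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " why_us ")]'
);

echo $exact->length, "\n"; // prints: 0
echo $token->length, "\n"; // prints: 1

// A relative query (note the leading dot) searches only inside
// the node passed as the second argument.
$paragraph = $xPath->query('.//p', $token->item(0))->item(0);
echo $paragraph->textContent, "\n"; // prints: We ship.
```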

Note on this article
There exists the possibility of violating copyright laws using techniques outlined in this article if you misuse data you scrape. Please scrape responsibly.