Fellow developers will know that, historically, we’ve had to rely on regular expressions to scrape a page. While this approach is usually fast, the expressions can be horrendous to read and edit, because a TINY mistake can render a regular expression useless. That’s not to say regular expressions aren’t useful when the page hierarchy is small and simple, but in today’s world of Web 2.0 designs it often isn’t.
What is page scraping?
Page scraping is a method that allows you to pull information from a web page, so that the data can be manipulated inside your own script. In your script, you can connect to another URL and request a page, just like a browser would do it. Once you make the request, the web server will send back the page you asked for and your script can manipulate the data and extract specific information.
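The request-then-extract flow described above can be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: `fetchPage` and `extractTitle` are hypothetical helper names, the user agent string is a placeholder, and the sample HTML is a stand-in for a real server response so the sketch runs without a network connection. It leans on PHP’s DOMDocument class, which the rest of this article covers.

```php
<?php
// Hypothetical helper: request a page much like a browser would and return its HTML.
function fetchPage(string $url): string
{
    $context = stream_context_create([
        'http' => [
            'timeout'    => 10,                   // seconds to wait for the server
            'user_agent' => 'ExampleScraper/1.0', // placeholder name (assumption)
        ],
    ]);
    $html = file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $html;
}

// Hypothetical helper: pull one specific piece of information (the <title>) out of the HTML.
function extractTitle(string $html): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $titles = $doc->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->textContent) : null;
}

// Stand-in for what fetchPage() would return from a real URL.
$html = '<html><head><title>Example page</title></head><body><p>Hello</p></body></html>';
echo extractTitle($html), PHP_EOL; // prints "Example page"
```

Against a live site you would call `extractTitle(fetchPage($url))` instead of passing in the sample string.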
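What exactly does the DOMDocument object do?
DOMDocument parses a page into the Document Object Model (DOM): a tree of nodes in which every element sits under a parent and can hold children of its own. Such a tree is a representation of the following HTML: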
<div>
  <ul>
    <li> <a href="#">URL</a> </li>
    <li> <a href="#"><img src="#" /></a> </li>
    <li> <a href="#">URL</a> </li>
  </ul>
</div>
Sidenote: If your development background is in the more traditional languages, this will look familiar to you because it resembles a binary tree. However, a DOM node can have any number of children, whereas a binary tree node has at most two.
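To make the tree structure concrete, here is a small sketch that loads the list markup from this section and prints each element indented by its depth. `describeTree` is a hypothetical helper written for this illustration, not part of the DOM extension.

```php
<?php
// Hypothetical helper: return one line per element, indented by its depth in the tree.
function describeTree(DOMNode $node, int $depth = 0): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip text nodes such as the "URL" link labels
        }
        $out .= str_repeat('  ', $depth) . $child->nodeName . "\n";
        $out .= describeTree($child, $depth + 1);
    }
    return $out;
}

// The example markup from this section, written without whitespace between tags.
$html = '<div><ul>'
      . '<li><a href="#">URL</a></li>'
      . '<li><a href="#"><img src="#" /></a></li>'
      . '<li><a href="#">URL</a></li>'
      . '</ul></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep any parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

// loadHTML() wraps fragments in <html><body>, so start from the <div> itself.
$div = $doc->getElementsByTagName('div')->item(0);
echo "div\n" . describeTree($div, 1);

// The <ul> holds three children, more than a binary tree node could.
$ul = $div->getElementsByTagName('ul')->item(0);
echo 'ul children: ', $ul->childNodes->length, PHP_EOL;
```

The output is an indented outline: `div`, then `ul`, then three `li` entries, each containing an `a`, with the `img` nested one level deeper under the second link.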
How to use XPath with DOMDocument
We’ll start off with something basic as an introduction: scraping the “Why us” section of TECKpert’s homepage.
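<?php
// Define our URL and start the DOM document
$url = 'https://teckpert.com';
$doc = new DOMDocument();

// Real-world HTML is rarely perfectly valid, so collect parser warnings instead of printing them
libxml_use_internal_errors(true);

// Load the HTML into our object
$doc->loadHTMLFile($url);

// Alternatively, this works too (use one approach or the other, not both):
// $html = file_get_contents($url);
// $doc->loadHTML($html);

// Now that we've created our DOM object, create the XPath object
$xPath = new DOMXPath($doc);

// Query TECKpert's DOM for the 'why us' section
$results = $xPath->query('//div[@class="why_us"]');

// item(0) is null when nothing matched, so check before reading it
if ($results->length > 0) {
    echo $results->item(0)->textContent;
}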
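If you want to experiment with XPath queries without hitting a live site, a sketch along these lines may help. The markup here is made up for illustration (only the `why_us` class name comes from the example above), and the second query shows one way to select attributes rather than elements.

```php
<?php
// Made-up markup standing in for a downloaded page (assumption, not TECKpert's real HTML).
$html = <<<HTML
<html><body>
  <div class="why_us"><p>We ship on time.</p></div>
  <div class="links">
    <a href="/about">About</a>
    <a href="/contact">Contact</a>
  </div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xPath = new DOMXPath($doc);

// Element query: the text of the first div whose class attribute is exactly "why_us".
$section = $xPath->query('//div[@class="why_us"]')->item(0);
$whyUs = $section !== null ? trim($section->textContent) : '';

// Attribute query: collect the href of every link inside the "links" div.
$hrefs = [];
foreach ($xPath->query('//div[@class="links"]//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

echo $whyUs, PHP_EOL;              // prints "We ship on time."
echo implode(', ', $hrefs), PHP_EOL; // prints "/about, /contact"
```

Keep in mind that `@class="why_us"` only matches when the attribute is exactly that string. For an element carrying several classes, a test such as `contains(concat(' ', normalize-space(@class), ' '), ' why_us ')` is a common workaround.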
Simple to use, right? There will be more to come on the technologies associated with XML traversal and its related query languages.
Note on this article
There exists the possibility of violating copyright laws using techniques outlined in this article if you misuse data you scrape. Please scrape responsibly.