How to Build a Website Scraper in PHP

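Do you want to extract the content of a web page programmatically? That means scraping the site: fetching the page and pulling the data you need out of its HTML. In this tutorial, let’s build a website scraper in PHP with the help of Symfony Components. After following it, you should be able to scrape the content of most server-rendered pages. Content that a page loads later through JavaScript is not part of the fetched HTML, so this approach will not see it.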

To see scraping in practice, I’ll use this blog as the example. I’ll extract the titles, links, excerpts, and images from the home page. If you inspect the home page, you will find the CSS selectors for these elements, as shown in the screenshot.

[Screenshot: inspect-page]

On your target site the CSS selectors will be different. Find them by inspecting the page, then adjust the code we’ll write in the next steps.

Install Symfony Components

Symfony Components are standalone, reusable PHP libraries. They are framework-agnostic, so you can use them in any PHP project, whatever framework or CMS it runs on.

To build a website scraper, I am going to use the following two components.

  • DomCrawler: This component eases DOM navigation for HTML and XML documents.
  • CssSelector: This component converts CSS selectors to XPath expressions, so you can pick DOM elements with CSS selectors much as you would in jQuery.

Install these components through the commands below.

composer require symfony/dom-crawler
composer require symfony/css-selector

Website Scraping in PHP

The home page of my blog lists 10 articles. To extract all the titles, I’ll first grab the content of the page using the file_get_contents() function. Then I’ll pass that content to the crawler, loop through each title element, and print its text.

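<?php
require_once "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

$url = "https://artisansweb.net";

// Download the raw HTML; file_get_contents() returns false on failure.
$html = file_get_contents($url);
if ($html === false) {
    exit("Could not fetch $url");
}

// Passing the page URL lets the crawler resolve relative links and image paths.
$crawler = new Crawler($html, $url);

// each() runs the callback on every matched node and returns the results as an array.
$title = $crawler->filter('h3.greatwp-fp04-post-title a')->each(function (Crawler $node, $i) {
    return $node->text();
});

echo "<ul>";
foreach ($title as $t) {
    // Escape the scraped text before printing it as HTML.
    echo "<li>" . htmlspecialchars($t) . "</li>";
}
echo "</ul>";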

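Here, filter() selects every element that matches the CSS selector h3.greatwp-fp04-post-title a. The each() method then runs the callback on every matched node and collects the return values into an array, and text() returns the text content of the node.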
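The fetching step deserves a little care, because file_get_contents() returns false when the request fails and has no explicit timeout set here. Below is a minimal sketch of a wrapper that uses only built-in PHP functions. The function name fetchHtml(), the 10-second timeout, and the User-Agent string are my own assumptions for illustration, so adjust them to your needs.

```php
<?php

/**
 * Fetch a page and return its HTML, or null when the request fails.
 * fetchHtml() is a hypothetical helper; the timeout and User-Agent
 * values below are illustrative assumptions.
 */
function fetchHtml(string $url, int $timeoutSeconds = 10): ?string
{
    // The "http" context options also apply to https:// URLs.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'MyScraper/1.0', // hypothetical identifier
        ],
    ]);

    // file_get_contents() returns false and raises a warning on failure;
    // the warning is suppressed here and turned into a null return value.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}
```

You could then call fetchHtml($url) in place of file_get_contents($url) and skip the crawling step whenever it returns null.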

Run this code and you will get the titles of the front-page articles, as shown below.

[Screenshot: scrap-title]

Similarly, you can get the post excerpts as follows.

$snippet = $crawler->filter('.greatwp-fp04-post-snippet p')->each(function(Crawler $node, $i){
    return $node->text();
});

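To get the URL of an anchor element, call the link() method, which returns a Link object, and then call getUri() on it. If the page uses relative links, give the crawler the page URL as its second constructor argument, new Crawler($html, $url), so it has a base to resolve them against. Without it, a relative href makes link() throw an exception.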

$link = $crawler->filter('h3.greatwp-fp04-post-title a')->each(function(Crawler $node, $i){
    return $node->link()->getUri();
});

To fetch the image URLs, use the image() method along with getUri().

$image = $crawler->filter('.greatwp-fp04-post-thumbnail a img')->each(function(Crawler $node, $i){
    return $node->image()->getUri();
});
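The snippets above produce four separate arrays. To work with one record per post, you can zip them together by index. The sketch below uses only built-in PHP, with sample data standing in for the scraped arrays. The helper name combinePosts() is my own, and it assumes the arrays line up post by post. If a post in the middle of the list has no thumbnail, the image array gets shorter and the later images shift to the wrong posts, so check that assumption against your target page.

```php
<?php

/**
 * Zip the parallel arrays returned by the separate filter() calls into
 * one record per post. combinePosts() is a hypothetical helper; it assumes
 * index N in every array belongs to the same post. Missing entries become null.
 */
function combinePosts(array $titles, array $links, array $snippets, array $images): array
{
    $posts = [];
    $count = max(count($titles), count($links), count($snippets), count($images));

    for ($i = 0; $i < $count; $i++) {
        $posts[] = [
            'title'   => $titles[$i]   ?? null,
            'link'    => $links[$i]    ?? null,
            'excerpt' => $snippets[$i] ?? null,
            'image'   => $images[$i]   ?? null,
        ];
    }

    return $posts;
}

// Sample data standing in for the $title, $link, $snippet and $image arrays.
$posts = combinePosts(
    ['First post', 'Second post'],
    ['https://example.com/first', 'https://example.com/second'],
    ['Intro to the first post.', 'Intro to the second post.'],
    ['https://example.com/first.jpg'] // the last post has no thumbnail
);

echo json_encode($posts, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), PHP_EOL;
```

In the real scraper you would pass $title, $link, $snippet, and $image instead of the sample arrays.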

This is how one can build a website scraper in PHP with Symfony Components. Give it a try and let me know your thoughts and suggestions in the comment section below.
