Scraping Reddit Posts with PHP

Jan 9, 2024 · 5 min read

Web scraping is the process of automatically extracting data from websites. This handy PHP script scrapes post data from Reddit by fetching the HTML content of a Reddit page and then using DOM parsing and CSS selectors to extract information like titles, scores, authors, etc.


Let's walk through it step-by-step.

Prerequisites

To run this code, you'll need:

  • PHP installed
  • The cURL extension enabled in PHP
  • The simple_html_dom library

First, make sure you have the PHP CLI installed:

    php -v
    

Then you can install simple_html_dom by downloading it from SourceForge or via Composer:

    composer require sunra/php-simple-html-dom-parser
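
Before going further, it can help to confirm the prerequisites from inside PHP itself. The following is a minimal sketch and not part of the original script; the function name check_prerequisites and the assumption that simple_html_dom.php sits next to the script are purely illustrative.

```php
<?php
// Hypothetical helper: report whether the pieces this tutorial relies on are available.
function check_prerequisites(string $lib_path): array {
    return [
        'php_version'   => PHP_VERSION,              // version of the running interpreter
        'curl_loaded'   => extension_loaded('curl'), // true when the cURL extension is enabled
        'library_found' => is_file($lib_path),       // true when the library file exists
    ];
}

// Assumption: the library file was placed in the same directory as this script.
foreach (check_prerequisites(__DIR__ . '/simple_html_dom.php') as $name => $value) {
    echo $name . ': ' . var_export($value, true) . "\n";
}
```

If curl_loaded or library_found comes back false, fix that before running the scraper.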
    

Importing Libraries

We start by including the simple_html_dom library, which handles parsing and searching the HTML. The require below assumes simple_html_dom.php sits in the same directory as the script:

    require('simple_html_dom.php');
    

Defining URLs and Headers

Next, we define the Reddit URL we want to scrape and a User-Agent header to send with the request:

    $reddit_url = "https://www.reddit.com";

    // CURLOPT_HTTPHEADER expects a list of "Name: value" strings
    $headers = array(
      "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    );
    

Identifying your scraper honestly in the User-Agent is the best practice. Some sites block requests that do not look like they come from a browser, however, so this script sends a browser User-Agent string to avoid being rejected.
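
One detail worth spelling out: cURL's CURLOPT_HTTPHEADER option takes a plain list of "Name: value" strings, not a key/value array. The sketch below shows one way to build that list from name/value pairs. The helper name build_header_lines and the self-identifying User-Agent value are hypothetical placeholders.

```php
<?php
// Hypothetical helper: turn ["Name" => "value"] pairs into the
// "Name: value" strings that CURLOPT_HTTPHEADER expects.
function build_header_lines(array $headers): array {
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

$header_lines = build_header_lines([
    'User-Agent' => 'my-reddit-scraper/0.1 (contact: you@example.com)', // placeholder identity
    'Accept'     => 'text/html',
]);

print_r($header_lines);
```

The resulting $header_lines array can be passed directly as the value of CURLOPT_HTTPHEADER.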

Initializing cURL

We use cURL to make the HTTP requests in PHP, so we start by initializing a cURL session:

    $ch = curl_init();
    

Then we configure its options:

    curl_setopt($ch, CURLOPT_URL, $reddit_url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    

Here we set the URL to fetch, enable CURLOPT_RETURNTRANSFER so that curl_exec() returns the response body as a string instead of printing it, and attach our custom User-Agent header.
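
The same options can also be applied in a single call with curl_setopt_array(). The sketch below does that and, as an assumption beyond the original script, also follows redirects and sets timeouts. The helper name build_curl_options, the timeout values, and the placeholder User-Agent are illustrative only. No request is sent here.

```php
<?php
// Hypothetical helper: collect the request options in one array.
// The redirect and timeout settings are additions, not part of the original script.
function build_curl_options(string $url, array $header_lines): array {
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
        CURLOPT_HTTPHEADER     => $header_lines, // list of "Name: value" strings
        CURLOPT_FOLLOWLOCATION => true,          // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,            // seconds allowed for connecting
        CURLOPT_TIMEOUT        => 30,            // seconds allowed for the whole request
    ];
}

$ch = curl_init();
$applied = curl_setopt_array($ch, build_curl_options(
    'https://www.reddit.com',
    ['User-Agent: my-reddit-scraper/0.1'] // placeholder User-Agent
));

echo $applied ? "Options applied\n" : "Could not apply options\n";
```

Keeping the options in one array makes it easier to see the whole request configuration at a glance.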

Making the Request

With cURL configured, we use curl_exec() to make the GET request:

    $response = curl_exec($ch);
    

We also check that it was successful by verifying that the response code is 200 OK:

    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
      // Request succeeded!
    } else {
      // Request failed
    }
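
The status code does not cover every failure. curl_exec() returns false when the transfer itself fails (for example on a DNS error or a timeout), and curl_error() then holds a description. A minimal sketch of combining both checks follows; the function name describe_outcome and its messages are hypothetical.

```php
<?php
// Hypothetical helper: classify the result of a cURL request.
// $response is the return value of curl_exec(): a string, or false on a transport failure.
function describe_outcome($response, int $status_code, string $curl_error): string {
    if ($response === false) {
        return "transport error: $curl_error";
    }
    if ($status_code !== 200) {
        return "unexpected status code $status_code";
    }
    return "ok";
}

// In the script this would be called as:
//   describe_outcome($response, curl_getinfo($ch, CURLINFO_HTTP_CODE), curl_error($ch));
echo describe_outcome('<html></html>', 200, '') . "\n";
echo describe_outcome('Too Many Requests', 429, '') . "\n";
echo describe_outcome(false, 0, 'Could not resolve host') . "\n";
```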
    

And save the HTML content to a file:

    $html_content = $response;
    $filename = "reddit_page.html";
    file_put_contents($filename, $html_content);

This saves the raw HTML we'll parse next.
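
file_put_contents() returns the number of bytes written, or false if the write fails, so the result is worth checking. The sketch below wraps that check in a helper. The name save_html and the temporary demo file are assumptions made for illustration.

```php
<?php
// Hypothetical helper: write the HTML and fail loudly if the write does not succeed.
function save_html(string $html, string $filename): int {
    $bytes = file_put_contents($filename, $html);
    if ($bytes === false) {
        throw new RuntimeException("Could not write $filename");
    }
    return $bytes;
}

// Demo with a throwaway file in the system temp directory.
$demo_html = '<html><body>demo</body></html>';
$demo_file = sys_get_temp_dir() . '/reddit_page_demo.html';
$written = save_html($demo_html, $demo_file);

echo "Wrote $written bytes to $demo_file\n";
```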

Parsing HTML

With simple_html_dom, parsing HTML is easy. We just create a new DOM object and load the HTML content:

    $html = new simple_html_dom();
    $html->load($response);

The resulting object exposes DOM traversal methods such as find(), which we use next to extract data.

Extracting Data with Selectors

Inspecting the elements

Upon inspecting the HTML in Chrome, you will see that each post is wrapped in a shreddit-post element and carries its own set of class descriptors…

This is where most people struggle with web scraping: working out the CSS selectors that actually match the content you want.

Let's break this down:

    $blocks = $html->find('shreddit-post[class=block relative cursor-pointer bg-neutral-background focus-within:bg-neutral-background-hover hover:bg-neutral-background-hover xs:rounded-[16px] p-md my-2xs nd:visible]');
    

The key things to understand are:

  • find() returns the DOM elements that match our CSS selector
  • shreddit-post selects elements with that tag name
  • [class=block...] restricts the match to elements whose class attribute has that value
  • Together, they match the Reddit post blocks that carry the needed classes

This gives us all the post block elements.
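
A selector that spells out the full class string is brittle: assuming the comparison is made against the whole attribute value, any added, removed, or reordered class on Reddit's side would stop it from matching. One more forgiving approach, sketched here under that assumption, is to check only for the classes you care about. The helper name has_all_classes and the sample class string are hypothetical.

```php
<?php
// Hypothetical helper: true when every required class appears in the class attribute,
// regardless of order or of any additional classes.
function has_all_classes(string $class_attr, array $required): bool {
    $present = preg_split('/\s+/', trim($class_attr), -1, PREG_SPLIT_NO_EMPTY);
    return count(array_diff($required, $present)) === 0;
}

// Made-up class string for illustration.
$class_attr = 'block relative cursor-pointer p-md extra-class';

var_dump(has_all_classes($class_attr, ['block', 'cursor-pointer'])); // bool(true)
var_dump(has_all_classes($class_attr, ['block', 'missing']));        // bool(false)
```

With simple_html_dom, this could be combined with a tag-only selector such as find('shreddit-post'), filtering each result on the value of getAttribute('class').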

Extracting Post Data

We then loop over the blocks and use other DOM methods to read attributes and values from each one:

    foreach ($blocks as $block) {
      $permalink = $block->getAttribute('permalink');
      $content_href = $block->getAttribute('content-href');
      $comment_count = $block->getAttribute('comment-count');

      // find() returns null when no title element exists, so guard before reading it
      $title_node = $block->find('div[slot=title]', 0);
      $post_title = $title_node ? trim($title_node->plaintext) : '';

      $author = $block->getAttribute('author');
      $score = $block->getAttribute('score');

      // Print post data
    }
    
  • getAttribute() returns the value of the named attribute
  • find() searches inside this block only
  • plaintext returns the element's text content with the tags stripped

Finally, we print out all the extracted fields.
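
To try the extraction logic without the network or any third-party library, the same idea can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is a swapped-in technique, not the simple_html_dom API used in this article. The sample markup is made up and far simpler than a real Reddit page, and the function name extract_posts is hypothetical.

```php
<?php
// Made-up sample markup; real Reddit pages are much larger and may differ.
$sample = '<html><body>'
    . '<shreddit-post permalink="/r/php/comments/abc/hello/" author="alice" score="42" comment-count="7">'
    . '<div slot="title"> Hello world </div>'
    . '</shreddit-post>'
    . '<shreddit-post permalink="/r/php/comments/def/untitled/" author="bob" score="3" comment-count="0">'
    . '</shreddit-post>'
    . '</body></html>';

// Hypothetical helper: pull the post fields out of an HTML string.
function extract_posts(string $html): array {
    // Custom tags such as shreddit-post trigger parser warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $posts = [];
    foreach ($xpath->query('//shreddit-post') as $post) {
        // The title lives in a child element; it may be missing.
        $title = $xpath->query('.//div[@slot="title"]', $post)->item(0);
        $posts[] = [
            'permalink'     => $post->getAttribute('permalink'),
            'author'        => $post->getAttribute('author'),
            'score'         => $post->getAttribute('score'),
            'comment_count' => $post->getAttribute('comment-count'),
            'title'         => $title ? trim($title->textContent) : '',
        ];
    }
    return $posts;
}

$posts = extract_posts($sample);
foreach ($posts as $p) {
    echo "{$p['author']} ({$p['score']}): {$p['title']}\n";
}
```

Note the guard around the title lookup: a post without a matching title element yields an empty string instead of an error.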

Full Code

Here is the complete script for reference:

    <?php
    // Include the simple_html_dom library for HTML parsing
    require('simple_html_dom.php');
    
    // Define the Reddit URL you want to download
    $reddit_url = "https://www.reddit.com";
    
    // Define a User-Agent header (cURL expects a list of "Name: value" strings)
    $headers = array(
        "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"  // Replace with your User-Agent string
    );
    
    // Initialize a cURL session
    $ch = curl_init();
    
    // Set the cURL options
    curl_setopt($ch, CURLOPT_URL, $reddit_url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    
    // Send the GET request to the URL
    $response = curl_exec($ch);
    
    // Check if the request was successful (status code 200)
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
        // Get the HTML content of the page
        $html_content = $response;
    
        // Specify the filename to save the HTML content
        $filename = "reddit_page.html";
    
        // Save the HTML content to a file
        file_put_contents($filename, $html_content);
    
        echo "Reddit page saved to $filename\n";
    } else {
        echo "Failed to download Reddit page (status code " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . ")\n";

        // Stop here: there is no usable HTML to parse
        curl_close($ch);
        exit(1);
    }
    
    // Create a DOM object
    $html = new simple_html_dom();
    $html->load($response);
    
    // Find all blocks with the specified tag and class
    $blocks = $html->find('shreddit-post[class=block relative cursor-pointer bg-neutral-background focus-within:bg-neutral-background-hover hover:bg-neutral-background-hover xs:rounded-[16px] p-md my-2xs nd:visible]');
    
    // Iterate through the blocks and extract information from each one
    foreach ($blocks as $block) {
        $permalink = $block->getAttribute('permalink');
        $content_href = $block->getAttribute('content-href');
        $comment_count = $block->getAttribute('comment-count');
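        // find() returns null when no title element exists, so guard before reading it
        $title_node = $block->find('div[slot=title]', 0);
        $post_title = $title_node ? trim($title_node->plaintext) : '';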
        $author = $block->getAttribute('author');
        $score = $block->getAttribute('score');
    
        // Print the extracted information for each block
        echo "Permalink: $permalink\n";
        echo "Content Href: $content_href\n";
        echo "Comment Count: $comment_count\n";
        echo "Post Title: $post_title\n";
        echo "Author: $author\n";
        echo "Score: $score\n\n";
    }
    
    // Close the cURL session
    curl_close($ch);
