Troubleshooting HTTrack "Forbidden" and "Access Denied" Errors

Apr 2, 2024 · 2 min read

When using HTTrack to mirror or download a website, you may encounter "403 Forbidden" or "401 Unauthorized" (often displayed as "Access Denied") errors. These status codes mean the server is refusing to let HTTrack access certain pages or files.

This can happen for several reasons:

The Site is Actively Blocking HTTrack

Some sites actively try to prevent scraping and mirroring by detecting and blocking tools like HTTrack. When HTTrack attempts to download these sites, the server returns 403 or 401 errors.

Unfortunately, if a site doesn't want to allow mirroring, there is little you can do besides contacting the site owner to request access. Using tricks to disguise HTTrack rarely works with sites actively trying to block scrapers.

Session or Login Required

Many sites restrict access to pages and files behind a login: intranets, webmail services, and social networks, for example. HTTrack cannot log in to these sites on its own, so you get errors when it tries to fetch restricted pages.
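The effect is easy to reproduce locally. The sketch below (Python, with a throwaway local server standing in for the real site and a made-up cookie value) shows a login-gated page: a request without the cached session cookie is refused with 401, while the same URL succeeds once the cookie is attached.

```python
# Minimal sketch of cookie-gated access. The cookie name/value
# ("session=abc123") is an illustrative assumption, not anything
# HTTrack-specific.
import threading
import urllib.request
from urllib.error import HTTPError
from http.server import BaseHTTPRequestHandler, HTTPServer

class GatedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the page only if the session cookie is present.
        if "session=abc123" in self.headers.get("Cookie", ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"private page")
        else:
            self.send_response(401)
            self.end_headers()

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), GatedHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# Without the session cookie, the server denies access.
try:
    urllib.request.urlopen(url)
    status_without = 200
except HTTPError as e:
    status_without = e.code

# With the cookie attached, the same URL succeeds.
req = urllib.request.Request(url, headers={"Cookie": "session=abc123"})
status_with = urllib.request.urlopen(req).status

print(status_without, status_with)  # 401 200
server.shutdown()
```

This is the situation a mirroring tool faces on every restricted page: the fix is to supply the cookie your browser already holds, which is what the first solution below amounts to.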

Possible solutions:

  • Log in with your browser first, then mirror the site. HTTrack can reuse the session cookie (via a cookies.txt file in the project folder) while mirroring.
  • Look for a publicly accessible login-free portion of the site to mirror instead.
File or Folder Permissions

Some files and folders on a server may be set to restrict public access with permissions. For example, directories like /admin, /dashboard, and /download are commonly protected.

HTTrack lacks the proper permissions to access these folders, hence the 403/401 errors.

This is unlikely to be resolvable in most cases: the permissions are intentionally set to prevent public access.

Blocking Based on User Agent

Sites trying to prevent scraping may block requests from certain User Agents, including HTTrack's default one.

Try setting a custom User Agent in HTTrack (under Set Options > Browser ID, or with the -F command-line flag) to mimic a normal browser, so you bypass blocks based on the default User Agent:

User Agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36
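To see what such UA filtering looks like from the server's side, here is a minimal sketch (Python, with a local stand-in server that refuses any client identifying itself as HTTrack; the browser UA string used is an illustrative example, not a required value):

```python
# Sketch of User-Agent-based blocking and how a custom UA bypasses it.
import threading
import urllib.request
from urllib.error import HTTPError
from http.server import BaseHTTPRequestHandler, HTTPServer

class UAFilterHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Refuse any request whose User-Agent mentions HTTrack.
        ua = self.headers.get("User-Agent", "")
        self.send_response(403 if "HTTrack" in ua else 200)
        self.end_headers()

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), UAFilterHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

def fetch_status(user_agent):
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        return urllib.request.urlopen(req).status
    except HTTPError as e:
        return e.code

# A UA that advertises HTTrack gets blocked...
blocked = fetch_status("Mozilla/4.5 (compatible; HTTrack 3.0x)")
# ...while an ordinary browser-style UA passes through.
allowed = fetch_status("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Safari/537.36")
print(blocked, allowed)  # 403 200
server.shutdown()
```

Real sites combine this check with other signals (request rate, headers, IP), so swapping the UA alone may not be enough, but it is the cheapest thing to try first.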

Other Possible Causes

Here are some other things that can cause 403/401 errors:

  • Blocking based on IP address range
  • Hotlink protection on images or files
  • Trying to access CGI, ASP, PHP scripts directly
  • Custom server rules blocking the HTTrack crawler

Unfortunately these cases can be trickier to resolve: each usually requires custom configuration on the server side to allow HTTrack through.
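One of these, hotlink protection, checks the Referer header: the server only serves an image or file when the request appears to come from its own pages. A minimal local sketch (Python; "example.com" stands in as a hypothetical allowed referrer):

```python
# Sketch of Referer-based hotlink protection: direct fetches are
# refused, fetches that claim to come from the site's own pages pass.
import threading
import urllib.request
from urllib.error import HTTPError
from http.server import BaseHTTPRequestHandler, HTTPServer

class HotlinkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the file only if the Referer points at the site itself.
        referer = self.headers.get("Referer", "")
        self.send_response(200 if "example.com" in referer else 403)
        self.end_headers()

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), HotlinkHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/logo.png"

# A direct request with no Referer is refused.
try:
    urllib.request.urlopen(url)
    direct = 200
except HTTPError as e:
    direct = e.code

# The same URL succeeds when a matching Referer is supplied.
req = urllib.request.Request(
    url, headers={"Referer": "http://example.com/page.html"})
with_referer = urllib.request.urlopen(req).status
print(direct, with_referer)  # 403 200
server.shutdown()
```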

Key Takeaways

  • Active blocking of scrapers cannot be bypassed easily
  • Mimic a real browser's User Agent string
  • Mirror sites while logged in to cache session cookies
  • If you control the server, allow the IP address range HTTrack crawls from
  • Some parts of sites will always be restricted
  • Getting past 403 and 401 errors takes trial and error

I hope these tips give you some ideas for overcoming common access restrictions.
