How to stop Sitebulb from crawling specific URLs

There are several ways to stop Sitebulb from crawling specific URLs, paths or domains. This guide consolidates these methods to help you work out which rules you need in order to customise your crawl.

As a core premise, internal and external URLs are treated differently by the software, so they will each have their own section below.

Excluding Internal URLs

There are 4 ways in which you can exclude particular internal URLs from being crawled:

  1. Excluding specific URLs or paths
  2. Including specific URLs or paths (subtle but important difference)
  3. Excluding query string parameters
  4. Rewriting URLs on the fly

In each of these cases, you configure the crawler so that certain URLs are never added to the crawl queue. You do this via the URL Exclusions option in the left-hand menu of the audit setup.

URL Exclusions

As you scroll down the right hand side, you will see the different ways to exclude URLs, each of which is covered below.

#1 Excluding specific URLs or paths

Excluded URLs restricts the crawler by letting you specify URLs or entire directories to avoid.

Any URL that matches the excluded list will not be crawled at all. This means that any URL reachable only via an excluded URL will not be crawled either, even if it does not itself match the excluded list.

The list is pre-filled with some common patterns, which you can either overwrite or add to using the lines underneath. As an example, if I were crawling the Sitebulb website and wanted to avoid all the 'Product' pages, I would simply add the line:
/product/

URL to Exclude
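The knock-on effect described above (pages reachable only through an excluded URL are never discovered) can be sketched with a toy crawl queue. This is a minimal illustration, not Sitebulb's implementation: the `crawl` function, the example.com link graph and the plain substring matching are all assumptions made for the example.

```python
from collections import deque


def crawl(start_url, links, excluded_patterns):
    """Breadth-first walk of a link graph, skipping URLs that match an exclusion.

    `links` is a hypothetical map of page URL -> URLs linked from that page.
    Matching is assumed to be a plain substring test (an assumption, not
    documented Sitebulb behaviour).
    """
    seen = {start_url}
    queue = deque([start_url])
    crawled = []
    while queue:
        url = queue.popleft()
        crawled.append(url)
        for target in links.get(url, []):
            if target in seen:
                continue
            if any(pattern in target for pattern in excluded_patterns):
                continue  # never queued, so its own links are never discovered
            seen.add(target)
            queue.append(target)
    return crawled


# Hypothetical site: /specs/ is linked only from the excluded /product/ page.
links = {
    "https://example.com/": [
        "https://example.com/product/",
        "https://example.com/blog/",
    ],
    "https://example.com/product/": ["https://example.com/specs/"],
    "https://example.com/blog/": ["https://example.com/"],
}

crawled = crawl("https://example.com/", links, ["/product/"])
print(crawled)
```

In this toy graph, /specs/ does not match the exclusion rule, but it still never appears in the result because the only page linking to it was excluded.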

#2 Including specific URLs or paths

Included URLs restricts the crawler in the opposite way: the crawl is limited to only the URLs or directories you specify.

As an example, if I were crawling the Sitebulb website and only wanted to crawl the 'Product' pages, I would simply add the line:
/product/

URLs to include

It is worth noting a couple of things:

  • Excluded URLs override included URLs, so ensure your rules do not clash.
  • Your Start URL must contain at least one link to an included URL, otherwise the crawler will simply crawl 1 URL and then stop.
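The precedence between the two lists can be expressed as a small decision function. This is a sketch under stated assumptions: `should_crawl` is a hypothetical helper, matching is assumed to be a substring test, and an empty include list is assumed to mean "no restriction".

```python
def should_crawl(url, included, excluded):
    """Decide whether a discovered URL would be queued.

    Assumptions for illustration only: substring matching, and an empty
    `included` list places no restriction on the crawl.
    """
    # Exclusions are checked first, so they win over any include rule.
    if any(pattern in url for pattern in excluded):
        return False
    if not included:
        return True
    return any(pattern in url for pattern in included)


print(should_crawl("https://sitebulb.com/product/", ["/product/"], []))
print(should_crawl("https://sitebulb.com/blog/", ["/product/"], []))
print(should_crawl("https://sitebulb.com/product/legacy/", ["/product/"], ["/legacy/"]))
```

The third call shows a clash: the URL matches an include rule, but the exclude rule takes priority and the URL is skipped.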

#3 Excluding query string parameters

By default Sitebulb will crawl all internal URLs with query string parameters. However, on some sites you may wish to avoid this, such as on sites with a large, crawlable, faceted search system.

To stop Sitebulb from crawling any URL that contains query string parameters, untick the 'Crawl Parameters' box. In the 'Safe Query String Parameters' box below it, you can add the parameters that you do want Sitebulb to crawl, such as pagination parameters (e.g. 'page' or 'p').

Exclude query string parameters
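The interaction between 'Crawl Parameters' and the safe list can be sketched as follows. The function name `allowed_by_parameter_rules` is hypothetical, and the handling of a URL that mixes safe and unsafe parameters (skipped here) is an assumption; the guide does not say how Sitebulb treats that case.

```python
from urllib.parse import parse_qsl, urlsplit


def allowed_by_parameter_rules(url, crawl_parameters, safe_parameters):
    """Return True if the URL passes the query string rules.

    Assumption: when 'Crawl Parameters' is off, a URL is allowed only if
    every parameter it carries is on the safe list.
    """
    names = [name for name, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)]
    if crawl_parameters or not names:
        return True
    return all(name in safe_parameters for name in names)


safe = {"page", "p"}
print(allowed_by_parameter_rules("https://example.com/shoes?colour=red", False, safe))
print(allowed_by_parameter_rules("https://example.com/shoes?page=2", False, safe))
```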

#4 Rewriting URLs on the fly

URL Rewriting instructs Sitebulb to modify URLs on the fly as it discovers them. It is most useful when a site appends parameters to URLs in order to track things like the click path. Typically these URLs are canonicalized to the 'non-parameterized' version, which really just messes up your audit...unless you use URL rewriting.

You use URL Rewriting to strip parameters, so for example:

  • https://sitebulb.com/category/?ut_source=megamenu

Can become:

  • https://sitebulb.com/category/

And you end up with 'clean' URLs in your audit.

To set up the example above, you would enter the parameter 'ut_source' in the box in the middle of the page. If you also wish to add other parameters, add one per line.

Alternatively, the tickboxes at the top allow you to automatically rewrite all upper case characters into lower case, or remove ALL parameters, respectively. With the latter option you do not need to bother writing parameters into the box; it will just strip everything.
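The rewriting options can be approximated with Python's standard `urllib.parse` module. This is a sketch of the idea rather than Sitebulb's own logic; `rewrite_url` and its argument names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def rewrite_url(url, strip_params=(), remove_all_params=False, lowercase=False):
    """Strip the listed parameters (or all of them) and optionally lowercase the URL."""
    parts = urlsplit(url)
    if remove_all_params:
        query = ""
    else:
        kept = [
            (name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name not in strip_params
        ]
        query = urlencode(kept)
    rewritten = urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))
    return rewritten.lower() if lowercase else rewritten


print(rewrite_url("https://sitebulb.com/category/?ut_source=megamenu", strip_params={"ut_source"}))
# https://sitebulb.com/category/
```

Parameters that are not listed survive the rewrite, so a URL such as /category/?ut_source=megamenu&page=2 would keep its page parameter unless the remove-all option is used.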

Exclude parameters

Then, you can test your settings at the bottom by entering example URLs.

Test Excluded URLs

Excluding External URLs

When it comes to external URLs, it is worth noting that Sitebulb does not actually 'crawl' them in the first place. It merely does an HTTP status check on them. This allows you to check for broken links and redirects, without extracting and following links from another website (and accidentally crawling the entire internet...).
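To show the difference between a status check and a crawl, here is a minimal sketch using Python's standard library. It requests a URL, reports the status code, and never parses the response body or follows a redirect. The use of a HEAD request and the names `NoRedirect` and `check_status` are assumptions for illustration; the guide does not say which request method Sitebulb uses.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError carrying the 3xx code.
        return None


def check_status(url, timeout=10.0):
    """Return the HTTP status code for `url` without following any links."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, method="HEAD")  # HEAD is an assumption
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 3xx, 4xx and 5xx responses arrive here; the code is what we report.
        return err.code
```

A call such as `check_status("https://example.com/")` returns only the status code, which is enough to flag broken links and redirects.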

Excluding external URLs can be controlled in two different sections:

  1. In the audit settings, which only affects a specific audit
  2. In the global settings, which affects every audit

#1 Audit Settings

When setting up your audit, make sure that 'Search Engine Optimization' is toggled on in the 'Audit Data' section (it is always on by default), then hit the 'Advanced Settings' button to open up the options underneath.

If you wish for Sitebulb to not check links to external websites, you need to uncheck this option.

Uncheck external links

#2 Global settings

While the above options give you most of the flexibility you need, sometimes you may require a bit more control, for instance if you DID want to check external links and get their status codes, but DID NOT want to do this for a specific domain.

The URL Profiler site, for instance, links out to t.co a bunch of times:

Links to t.co

In order to exclude only these t.co links, you need to go to the global settings, navigate to Excluded External URLs and add 't.co' to the Excluded Hosts.

Exclude global settings

The typical use case for this is when you do want to check external links in general, but you know that you have tens or hundreds of thousands of links to a specific domain and you don't want them included in your audit, as they make it more difficult to navigate. Social sharing links on every single product page of an ecommerce store are a common example.

Excluding external subdomains

A quick note on external subdomains, as they are treated differently to 'internal' subdomains (i.e. subdomains of the start URL).

Consider these external links to Majestic's site from URL Profiler:

Majestic URLs

If I only wanted to exclude the link to the blog subdomain, I would need to add this rule to the Excluded Hosts:

  • blog.majestic.com

But if I wanted to exclude all of the links in the table above, I would need to add this rule to the Excluded Hosts:

  • majestic.com
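The two rules above behave like a "host or any subdomain of it" match. The sketch below illustrates that reading; `host_is_excluded` is a hypothetical helper and the suffix-matching logic is an assumption inferred from the examples in this section, not confirmed Sitebulb behaviour.

```python
from urllib.parse import urlsplit


def host_is_excluded(url, excluded_hosts):
    """Return True if the URL's host equals an excluded host or is a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == rule or host.endswith("." + rule) for rule in excluded_hosts)


print(host_is_excluded("https://blog.majestic.com/post", ["blog.majestic.com"]))
print(host_is_excluded("https://majestic.com/reports", ["blog.majestic.com"]))
print(host_is_excluded("https://blog.majestic.com/post", ["majestic.com"]))
```

Requiring a leading dot before the rule keeps unrelated hosts that merely end with the same characters from being caught by mistake.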

Excluding external paths

By adding paths to the list of Excluded Paths you will stop any external URLs that include these paths from being scheduled and checked by the Sitebulb crawler.

Adding in 'tweet' would exclude:

  • Any URLs that had /tweet/ in the folder name (e.g. https://example.com/tweet/abc)
  • Any URLs that had tweet in the filename (e.g. https://example.com/abc/tweet.php)

You can make the rule more specific: for instance, adding 'tweet.php' will only match URLs that contain that exact string.
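The difference between a broad and a narrow path rule can be checked with a short sketch. It assumes the rule is a substring test against the URL path; `path_is_excluded` is a hypothetical name used only for this example.

```python
from urllib.parse import urlsplit


def path_is_excluded(url, excluded_paths):
    """Return True if any excluded fragment appears in the URL's path (assumed substring match)."""
    path = urlsplit(url).path
    return any(fragment in path for fragment in excluded_paths)


folder_url = "https://example.com/tweet/abc"
file_url = "https://example.com/abc/tweet.php"

print(path_is_excluded(folder_url, ["tweet"]), path_is_excluded(file_url, ["tweet"]))
print(path_is_excluded(folder_url, ["tweet.php"]), path_is_excluded(file_url, ["tweet.php"]))
```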