There are numerous ways in which you can stop Sitebulb from crawling specific URLs, paths or domains. This guide consolidates all of these methods, to help you understand which rules you will need to customise your crawl.
As a core premise, internal and external URLs are treated differently by the software, so they will each have their own section below.
There are 4 ways in which you can exclude particular internal URLs from being crawled: Excluded URLs, Included URLs, query string parameters, and URL Rewriting.
In all of these cases, you will need to configure the crawler to exclude certain URLs so that they are never added to the crawl queue. You can do this via the URL Exclusions option in the left-hand menu of the audit setup.
As you scroll down the right-hand side, you will see the different ways to exclude URLs, each of which is covered below.
Excluded URLs is a method for restricting the crawler, which allows you to specify URLs or entire directories to avoid.
Any URL that matches the excluded list will not be crawled at all. This also means that any URL only reachable via an excluded URL will not be crawled either, even if it does not itself match the excluded list.
The list is pre-filled with some common patterns, which you can either overwrite or add to using the lines underneath. As an example, if I were crawling the Sitebulb website and wanted to avoid all the 'Product' pages, I would simply add the line:
/product/
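To illustrate why pages that are only linked from an excluded URL also drop out of the crawl, here is a minimal Python sketch. It is not Sitebulb's implementation: the link graph is made up, and it assumes a URL is excluded when it contains one of the patterns as a substring.

```python
from collections import deque

# Hypothetical link graph: each URL maps to the URLs it links to.
LINKS = {
    "/": ["/product/", "/blog/"],
    "/product/": ["/product/features/", "/pricing/"],
    "/blog/": ["/blog/post-1/"],
    "/product/features/": [],
    "/pricing/": [],  # only linked from /product/
    "/blog/post-1/": [],
}

def is_excluded(url, patterns):
    # Assumption: a URL is excluded if it contains any pattern as a substring.
    return any(pattern in url for pattern in patterns)

def crawl(start, links, patterns):
    """Breadth-first crawl that never queues an excluded URL."""
    seen = {start}
    queue = deque([start])
    crawled = []
    while queue:
        url = queue.popleft()
        crawled.append(url)
        for target in links.get(url, []):
            if target not in seen and not is_excluded(target, patterns):
                seen.add(target)
                queue.append(target)
    return crawled

crawled = crawl("/", LINKS, ["/product/"])
print(crawled)
```

In this toy graph, /pricing/ does not match the pattern, but it is never discovered because the only page linking to it was excluded.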
Included URLs is also a method for restricting the crawler, but it works the other way around: it allows you to restrict the crawl to only the URLs or directories specified.
As an example, if I were crawling the Sitebulb website and only wanted to crawl the 'Product' pages, I would simply add the line:
/product/
It is worth noting a couple of things:
By default Sitebulb will crawl all internal URLs with query string parameters. However, on some sites you may wish to avoid this, such as on sites with a large, crawlable, faceted search system.
To stop Sitebulb crawling any URLs with parameters, untick the 'Crawl Parameters' box. In the box below for 'Safe Query String Parameters', you can add in parameters which you do want Sitebulb to crawl, such as pagination parameters (e.g. 'page' or 'p').
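The logic of a parameter filter with a safe list can be sketched roughly as follows. This is only an illustration in Python, not Sitebulb's code: the function name and example URLs are hypothetical, and it assumes a parameterised URL is crawled only when every one of its parameters is on the safe list.

```python
from urllib.parse import urlsplit, parse_qsl

def should_crawl(url, crawl_parameters, safe_params):
    """Decide whether a URL passes a hypothetical query-string filter."""
    query = urlsplit(url).query
    # URLs without parameters are never affected by this setting.
    if crawl_parameters or not query:
        return True
    names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
    # Assumption: every parameter on the URL must appear in the safe list.
    return names <= set(safe_params)

safe = ["page", "p"]
print(should_crawl("https://example.com/shoes", False, safe))
print(should_crawl("https://example.com/shoes?page=2", False, safe))
print(should_crawl("https://example.com/shoes?colour=red&page=2", False, safe))
```

Under these assumptions the paginated URL is kept, while the faceted URL with the extra 'colour' parameter is skipped.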
URL Rewriting is a method for instructing Sitebulb to modify the URLs it discovers on the fly. It is most useful when you have a site that appends parameters to URLs in order to track things like the click path. Typically these URLs are canonicalized to the 'non-parameterized' version, which can completely mess up your audit... unless you use URL rewriting.
You use URL Rewriting to strip parameters, so for example:
Can become:
And you end up with 'clean' URLs in your audit.
To set up the example above, you would enter the parameter 'ut_source' in the box in the middle of the page. If you also wish to add other parameters, add one per line.
Alternatively, the tickboxes at the top allow you to automatically rewrite all upper case characters into lower case, or remove ALL parameters, respectively. With the latter option you do not need to write parameters into the box, as it will strip everything.
Then, you can test your settings at the bottom by entering example URLs.
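If you want a feel for what these rewriting options do to a URL before you set them, the behaviour can be approximated with a short Python sketch. The function and the example.com URLs are hypothetical, and the real tool may treat edge cases (ordering, encoding, fragments) differently.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def rewrite_url(url, strip_params=(), lowercase=False, remove_all_params=False):
    """Approximate parameter stripping and lowercasing on a single URL."""
    parts = urlsplit(url)
    if remove_all_params:
        pairs = []
    else:
        # Keep every parameter except the ones listed for stripping.
        pairs = [
            (name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name not in strip_params
        ]
    rewritten = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), parts.fragment)
    )
    return rewritten.lower() if lowercase else rewritten

print(rewrite_url("https://example.com/page?ut_source=nav", strip_params={"ut_source"}))
print(rewrite_url("https://example.com/page?ut_source=nav&page=2", strip_params={"ut_source"}))
print(rewrite_url("https://example.com/Page?x=1&y=2", lowercase=True, remove_all_params=True))
```

The first call returns the bare URL, the second keeps only the 'page' parameter, and the third shows the two tickbox options combined.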
When it comes to external URLs, it is worth noting that Sitebulb does not actually 'crawl' them in the first place - it merely does an HTTP status check on them. This allows you to check for broken links and redirects, without extracting and following links from another website (and accidentally crawling the entire internet...).
Excluding external URLs can be controlled in two different sections:
When setting up your audit, make sure that 'Search Engine Optimization' is toggled on in the 'Audit Data' section (it is always on by default), then hit the 'Advanced Settings' button to open up the options underneath.
If you wish for Sitebulb to not check links to external websites, you need to uncheck this option.
While the above options give you most of the flexibility you need, sometimes you may require a bit more control, for instance if you DID want to crawl external links and get their status codes, but DID NOT want to do this for a specific domain.
The URL Profiler site, for instance, links out to t.co a bunch of times:
In order to exclude only these t.co links, you need to go to the global settings, navigate to Excluded External URLs and add 't.co' to the Excluded Hosts.
The typical use case for this is if you do want to check external links in general, but you know that you have tens or hundreds of thousands of links to a specific domain and you don't want them included in your audit as they make it more difficult to navigate. For instance, social sharing links on every single product page of an ecommerce store.
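Conceptually, a host exclusion is just a check on the hostname of each external link before it is scheduled. The Python sketch below is a simplification rather than Sitebulb's actual matching logic: the function name and URLs are hypothetical, and it assumes a rule only matches the exact hostname, so it does not model how subdomains are handled.

```python
from urllib.parse import urlsplit

def should_check_external(url, excluded_hosts):
    """Return False when the link's hostname is on the excluded list."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: a rule matches only the exact hostname it names.
    return host not in {h.lower() for h in excluded_hosts}

excluded_hosts = ["t.co"]
print(should_check_external("https://t.co/abc123", excluded_hosts))
print(should_check_external("https://example.org/some-page", excluded_hosts))
```

Only the t.co link is skipped; every other external link still gets its status check.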
External subdomains deserve a quick note, as they are treated differently to 'internal' subdomains (i.e. subdomains of the start URL).
Consider these external links to Majestic's site from URL Profiler:
If I only wanted to exclude the link to the blog subdomain, I would need to add this rule to the Excluded Hosts:
But if I wanted to exclude all of the links in the table above, I would need to add this rule to the Excluded Hosts:
By adding paths to the list of Excluded Paths you will stop any external URLs that include these paths from being scheduled and checked by the Sitebulb crawler.
Adding in 'tweet' would exclude:
You can limit this to make it more specific, for instance adding 'tweet.php' will only match URLs with that specific string.
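The difference between a broad and a specific path rule can be shown with a small Python sketch. The URLs are made up, and it assumes a rule matches whenever it appears anywhere in the path of the URL, which may not be exactly how Sitebulb evaluates it.

```python
from urllib.parse import urlsplit

def matches_excluded_path(url, excluded_paths):
    """True if any excluded fragment appears in the URL's path."""
    path = urlsplit(url).path
    # Assumption: simple substring match against the path component.
    return any(fragment in path for fragment in excluded_paths)

# Hypothetical external URLs found during an audit.
urls = [
    "https://social.example/intent/tweet?text=hello",
    "https://share.example/tweet.php?url=abc",
    "https://share.example/about",
]

broad = [u for u in urls if matches_excluded_path(u, ["tweet"])]
specific = [u for u in urls if matches_excluded_path(u, ["tweet.php"])]
print(broad)
print(specific)
```

With 'tweet' both sharing URLs are excluded, whereas 'tweet.php' only catches the second one.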