How to find orphan pages on your website

It is relatively common to find orphan pages when conducting website audits. They are the debris of consistently evolving websites, where content is being changed, added or removed on a regular basis.

Orphan pages are generally not considered to be a good thing, although they are also not inherently bad, and they typically represent old pages that have been forgotten about to some degree.

For background information on what orphan pages are and why they are important, check out our guide to what orphan pages are, why they're bad for SEO, and how to approach fixing them.

Table of contents

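This article focuses on how to find orphan pages using Sitebulb, and breaks down into a few sections:

  • What are orphan pages?
  • How does Sitebulb define orphan pages?
  • How to set up the crawler to find orphan pages
  • How to find orphan pages in your audit
  • How to export orphan pages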

What are orphan pages?

Orphan pages are URLs that are not linked to by other internal URLs on the same website. This means they are not part of the main website architecture, and website visitors would not be able to find them simply by browsing the website. Similarly, search engine crawlers would not be able to discover orphan pages simply by crawling the website.

This means that orphan URLs are discovered by some other means, for example:

  • They are listed on an XML sitemap
  • They are findable on search engine results pages
  • They are linked to from other external websites
  • They are referenced externally in some other way (e.g. AdWords landing pages, bookmarked pages, etc.)

How does Sitebulb define orphan pages?

The Sitebulb tool defines orphan pages as URLs that were discovered in the audit, but NOT by the crawler.

They will only appear in your audit if you connect other crawl sources, such as XML Sitemaps or Google Analytics.
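
As a rough illustration of this definition (not Sitebulb's actual implementation), orphan detection amounts to a set difference: every URL known from some non-crawler source, minus the URLs the crawler reached by following internal links. The function name, source names and example URLs below are hypothetical.

```python
def find_orphans(crawled_urls, other_sources):
    """Return URLs seen in any non-crawler source but never reached by the crawler.

    crawled_urls: iterable of URLs discovered by following internal links.
    other_sources: dict mapping a source name (e.g. "xml_sitemap") to an
        iterable of URLs found in that source.
    """
    crawled = set(crawled_urls)
    known_elsewhere = set()
    for urls in other_sources.values():
        known_elsewhere.update(urls)
    # Anything known from another source but not crawled is an orphan
    return sorted(known_elsewhere - crawled)


# Hypothetical example data
crawled = ["https://example.com/", "https://example.com/blog/"]
sources = {
    "xml_sitemap": ["https://example.com/", "https://example.com/old-offer/"],
    "google_analytics": ["https://example.com/blog/", "https://example.com/thank-you/"],
}
print(find_orphans(crawled, sources))
```

This also shows why no orphans can appear when the crawler is the only source: with nothing in `other_sources`, the difference is always empty.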

How to set up the crawler to find orphan pages

Firstly, you'll need to crawl the website, so start a new website audit or project.

Then it is a case of connecting up other URL sources, all of which is done on the main audit setup page.

Whilst the options below will show all the different sources, all of them are optional. So for instance, if you only want to identify orphan URLs that are contained in XML Sitemaps, only connect the XML Sitemaps in the audit setup.

Connecting Google Analytics

Scroll down to the 'Google Analytics' tickbox, and then follow these 3 steps:

  1. Tick to include Google Analytics data in your audit
  2. Add or select a Google Account, then choose the appropriate property/view from the options
  3. Tick the box at the bottom of 'Extract and crawl URLs found in Google Analytics'

Connecting Google Analytics

This third step is crucial, as it means that URLs which Sitebulb finds in Google Analytics but not in the crawl will also be included in the audit.

Connecting Google Search Console

Scroll down to the 'Google Search Console' tickbox, and then follow these 3 steps:

  1. Tick to include Google Search Console data in your audit.
  2. Add or select a Google Account, then choose the appropriate property from the options.
  3. Tick the box at the bottom to extract and crawl URLs found in Google Search Console.

Connecting GSC

Connecting XML Sitemaps

There are a few different ways to add XML Sitemaps. To access the XML Sitemaps bit you need to scroll further down still to the section entitled 'Select URL sources to Audit.'

The most basic way is simply to tick the box, like this:

XML Site Maps crawl sources

This works if the XML Sitemaps you wish to crawl are easy for Sitebulb to discover automatically. For instance, if they are listed in your robots.txt file, Sitebulb will go and grab these and add them to the list. Similarly, if you have connected Google Search Console, Sitebulb will also go and find any listed in there.
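
To check in advance what can be discovered this way, you can look at the `Sitemap:` lines in your robots.txt file. The sketch below shows one way to pull them out of the file's text. It is an illustration only and not how Sitebulb itself does it; the function name and sample content are hypothetical.

```python
def sitemap_urls_from_robots(robots_txt):
    """Extract the URLs declared on 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop inline comments and surrounding whitespace
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own colons are kept
        field, _, value = line.partition(":")
        if field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content
sample_robots = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news-sitemap.xml  # field names are case-insensitive
"""
print(sitemap_urls_from_robots(sample_robots))
```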

To see which XML Sitemaps Sitebulb has queued up, click the words 'XML Sitemaps' to open up the full options panel. This will show you which URLs Sitebulb has found and queued to crawl already (in some cases this might be empty, in which case Sitebulb will warn you):

Show Sitemaps found

In this case, we can see 2 listed. This is because we (stupidly) refer to the XML Sitemap URL without the trailing slash in the robots.txt file, but with the trailing slash when we submitted to GSC (doh!). Either way, if you want to remove any sitemaps from the list, just hit the red Delete button over on the right.

Similarly, you can manually add extra XML Sitemap URLs which Sitebulb did not automatically discover, either one by one in the entry box, or lots all at once by hitting the green Add Multiple XML Sitemap URLs button.

Finally, if you want to add XML Sitemaps in file format rather than URLs, you can drag and drop these into the panel at the bottom.

Add XML Sitemap Files
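
Before uploading a sitemap file, it can be handy to check which URLs it actually contains. The sketch below reads the `<loc>` entries from a sitemap using Python's standard library, assuming the file uses the standard sitemaps.org namespace. The function name and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps and sitemap index files
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sitemap content
sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc> https://example.com/old-offer/ </loc></url>"
    "</urlset>"
)
print(urls_from_sitemap(sample_sitemap))
```

For a sitemap index file, the same function returns the locations of the child sitemaps rather than page URLs.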

Connecting URL Lists

This one perhaps takes a little more explaining than any of the options already covered. A URL List is simply a list of URLs that Sitebulb will process during the audit, with the condition that they must be on the same root domain as the start URL.

In terms of finding orphan pages, you would use it as a means of confirming whether a set of pre-defined pages are orphaned or not, for instance:

  • AdWords landing pages
  • 'Thank you' pages
  • Pages with external incoming backlinks (e.g. from a tool like Ahrefs or Majestic)

You can add a URL List as a URL source from the same 'Select URL sources to Audit' section that you control XML Sitemaps from.

It is a more straightforward process, however: simply tick the box and drop in the .csv or .txt file to import.

Upload URL list to Sitebulb

The file must be in .csv or .txt format and contain the list of URLs (either in the first column or in a column with a header of 'URL'). Only URLs that contain the same root domain as the start URL will be included.
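
Here is a minimal sketch of preparing such a file, assuming you already have a plain list of candidate URLs (for example, landing pages exported from another tool). It writes a single-column CSV with a 'URL' header and keeps only URLs on the start URL's domain. The 'same root domain' check is a simplified hostname suffix match, which is an assumption and not Sitebulb's own rule; the function names are hypothetical.

```python
import csv
from urllib.parse import urlsplit


def same_root_domain(url, root_domain):
    """Simplified check: hostname equals root_domain or is a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    root = root_domain.lower()
    return host == root or host.endswith("." + root)


def write_url_list(urls, root_domain, path):
    """Write matching URLs to a one-column CSV with a 'URL' header."""
    kept = [u for u in urls if same_root_domain(u, root_domain)]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["URL"])
        writer.writerows([u] for u in kept)
    return kept
```

For example, `write_url_list(candidate_urls, "example.com", "url-list.csv")` would write the file and return the URLs it kept, so you can see which ones were dropped.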

Making sure to select the Crawler

This is perhaps obvious, but bears repeating: in the URL Sources section, make sure to tick the 'Crawler' option. If you do not, then Sitebulb will not be able to tell you which URLs are orphaned, as it won't know which ones were not discoverable via internal links in the website.

An audit set up with all of the Crawler, XML Sitemaps and URL List as sources will look like this:

Selecting crawl sources

How to find orphan pages in your audit

Once the audit has finished running, you can locate orphaned pages from a few different places:

  1. In a visual chart on the Audit Overview
  2. As one of the Hints in the Links section
  3. As a pre-filtered list in the URL Explorer

This caters for different workflows, allowing you to access the data visually, via the in-built 'Hints' system, or through a data table.

#1 Audit Overview

The first place you will find it is the Audit Overview. If you scroll down to the chart 'HTML URL Sources', it shows you all the different sources and the URLs found in each:

Missing from the crawler

For each source, this chart shows 3 datapoints:

  • Found (Green) - URLs found by this source
  • Only (Orange) - URLs found only by this source
  • Missing (Red) - URLs not found by this source
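
These three datapoints can be thought of as simple set operations over the URLs found by each source. The sketch below is only an illustration of that idea, with hypothetical source names and URLs; it is not taken from Sitebulb.

```python
def source_breakdown(sources):
    """For each source, count the URLs that are Found, Only and Missing.

    sources: dict mapping a source name to an iterable of URLs it found.
    """
    as_sets = {name: set(urls) for name, urls in sources.items()}
    all_urls = set().union(*as_sets.values())
    breakdown = {}
    for name, urls in as_sets.items():
        # URLs found by every source except this one
        others = set().union(*(u for n, u in as_sets.items() if n != name))
        breakdown[name] = {
            "found": len(urls),               # found by this source
            "only": len(urls - others),       # found by this source alone
            "missing": len(all_urls - urls),  # found elsewhere, not by this source
        }
    return breakdown


# Hypothetical example data
example_sources = {
    "crawler": {"/a", "/b"},
    "xml_sitemap": {"/a", "/c"},
    "google_analytics": {"/b", "/c", "/d"},
}
print(source_breakdown(example_sources)["crawler"])
```

In this example the crawler's 'missing' count covers `/c` and `/d`, which are exactly the orphan URLs.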

In the case of orphan URLs, the segment we want is 'Crawler - Missing', AKA 'URLs not found by the Crawler'. If you click this segment in the chart, it will open up a URL List of the URLs in question:

URL List of Orphaned URLs

Cool cool, this looks pretty handy.

However, you can see that it is more useful still by scrolling right, to see the 'source' columns themselves:

URL in sitemap but not other sources

The row I have highlighted shows that this URL was found in the XML Sitemap, but not in GA, GSC or the URL List. So you can quickly get a good idea of where orphan URLs are coming from. As with all URL Lists, they can be sorted, filtered and exported to spreadsheet format.

You can also see this data in the single URL Details view (which you can access by right-clicking on the URL):

viewing URL details

#2 Links Hint

Many users make use of Sitebulb's Hint system as a core part of their website auditing workflow - taking each section in turn, and browsing through the prioritized Hints to understand all the issues within that particular section.

Since orphan URLs are to do with links (or lack thereof), if Sitebulb identifies any orphaned URLs you will find them in the Hints for the Links section:

Orphaned URL Hint

You can click the blue View URLs button to see the URL data. This will again bring up a URL List showing you all the orphaned URLs, along with the crawl sources.

As with all the Hints within Sitebulb, if you click the blue outlined button 'Learn more about this hint and how to fix this issue', it will open a browser window and take you to the 'Learn more' page on the Sitebulb website, in this case for the Hint 'URL is orphaned and was not found by the crawler'.

This page will give you further context on why orphan pages are important, and assistance in resolving the issue.

#3 URL Explorer Filter

Whilst some Sitebulb users prefer visual workflows, and others like to work through the Hints, some users just like to look at big lists of data. Sitebulb has this preference covered too, with the URL Explorer, which is located in the top menu bar.

To find orphaned URLs in the URL Explorer, click the 'Internal' dropdown menu and select 'Orphaned':

URL explorer orphaned pages

This will again take you through to a list of the URL data for all the orphaned pages.

However, this time it will remain within the 'frame' of the URL Explorer, and won't take you off to another page:

URL Explorer Orphaned URLs

How to export orphan pages

All of the workflows above lead you to a 'big list of all the URL data', the idea being that you can dig into the data further in this view, and potentially filter or sort the list further.

For example, I could add a filter like this to my list:

adding and removing filters to URL lists

Which might constitute 'the things I want to send to my client to fix.'

And then, to generate the spreadsheet to actually send them, all I need to do is hit the green Export button, and select either CSV or Google Sheets.

how to export URL lists to CSV
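
If you would rather trim the list after exporting, a CSV export can be filtered with a few lines of Python. This is a minimal sketch: the column names in the sample are hypothetical, and the real headers in your export may differ, so the column to filter on is passed in as a parameter.

```python
import csv
import io


def filter_export(csv_file, column, keep):
    """Return rows from an exported CSV where keep(row[column]) is true.

    csv_file: an open text file (or file-like object) containing the export.
    column: header name to test; raises KeyError if the export lacks it.
    keep: function taking the cell value and returning True or False.
    """
    reader = csv.DictReader(csv_file)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in export")
    return [row for row in reader if keep(row[column])]


# Hypothetical export content; real column names may differ
sample_export = io.StringIO(
    "URL,Source\n"
    "https://example.com/old-offer/,XML Sitemap\n"
    "https://example.com/thank-you/,Google Analytics\n"
)
rows = filter_export(sample_export, "Source", lambda value: value == "XML Sitemap")
print([row["URL"] for row in rows])
```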

Please check out this guide to learn more about incorporating Google Sheets into your audit workflows.