Sitebulb Release Notes

Transparent Release Notes for every Sitebulb update. Critically acclaimed by some people on Twitter. May contain NUTS.

Current Version v7.3

What's in this version

Version 7.3

Released on 5th December 2023

Bug Fixes

  • With the v7.1 update we made some pretty big changes to our Chrome Crawler, and this has had a knock-on effect in some other areas. In some instances this has resulted in Sitebulb being unable to crawl the website with the Chrome Crawler. With this update we have improved it to handle frameworks a bit better, and changed how redirects work. We have also introduced a new Advanced Chrome Setting: 'Enable Service Workers,' which is disabled by default:

    Enable Service Workers

    Most websites will not need this option enabled, but if you are dealing with a website that absolutely relies on service workers to render, then you may need to tick this in order to allow Sitebulb to crawl the website.
  • GA4, the gift that keeps on giving, was stripping the trailing slash from URL paths when we collected them from the API. Many notable SEOs were commenting about it on Twitter (and Charlie Whitworth was as well). We have fixed it now so it doesn't do this any more.
  • Some websites are so cutting edge they have already upgraded to HTTP/3 (probably because the likes of Cloudflare made it a button-press operation). Unfortunately Sitebulb was not quite ready for this change, recording the protocol as HTTP/1 (which was released in 1996...) instead of HTTP/3 (released in 2022). What can I say, we're getting on a bit.
  • In a rather shambolic affair, Sitebulb was not properly identifying subdomains when loaded in via URL List, and would decide to go off on a complete tangent, stripping the subdomain and instead appending the page path to the root domain - creating entirely new and unsurprisingly 404-ing URLs. Sitebulb is no longer doing this!
  • Fixed an issue with the structured data 'Review' Search Feature validation - it will no longer report that the name is required if it is missing but was found in the itemReviewed child type.
  • Google is once again shafting everyone by removing the Mobile-Friendly Test - and the Mobile-Friendly Test API. It still seems to be live as of today, but they claim to be retiring it 'starting December 1, 2023.' To ensure we don't get caught with our pants down during the festive season, we have proactively removed it from the tool. It is no longer present in the 'Tools' dropdown, nor the Single Page Analysis. At the same time we have also removed the 'Fetch and Render' tool, as this is basically defunct now we have upgraded the Response vs Render reporting in the Single Page Analysis to be Full Awesome.
  • Fixed a column naming inconsistency in the Link Explorer, where the column 'Linking URL' was listed as 'Referring URL' in the advanced filter.
  • Sitebulb was including whitespace in URLs if the XML Sitemap was set up in such a way that whitespace was present in the <loc> entries, which would result in errors, and a horrible, messy audit.
  • Changed the way we determine whether or not to download the HTML of a page for analysis, based upon the URL (this was manifesting to a customer as 'HTML is missing or empty' on URLs with Japanese characters)
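
For the curious, the XML Sitemap whitespace fix above boils down to trimming each <loc> value before treating it as a URL. Here's a minimal sketch in Python using only the standard library. The function name and the sample sitemap are made up for illustration, and this is not Sitebulb's actual code:

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> values from a sitemap, with surrounding whitespace removed."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        # loc.text is None for an empty element, so guard before stripping.
        url = (loc.text or "").strip()
        if url:
            urls.append(url)
    return urls


# Hypothetical sitemap with newlines and spaces inside the <loc> entries.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>
      https://example.com/page-one/
    </loc>
  </url>
  <url><loc> https://example.com/page-two/ </loc></url>
</urlset>"""

print(extract_sitemap_urls(sample))
```

Without the strip() call, the first URL would carry a leading newline and a run of spaces, which is exactly the kind of thing that makes for a horrible, messy audit.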

Version 7.2.1

Released on 6th November 2023

The inevitable 'bug fix update to fix all the bugs we accidentally introduced in the last big update.' Sigh.

Bug Fixes

Raise your hand if you spotted any of these beauties. Well done, you win a Mars bar.

  • From one audit to the next, if a new hint had triggered that previously was not present, Sitebulb was claiming 'NO CHANGE.' When I raised this with Gareth and we checked the code, it appeared that Sitebulb was trying to divide by 0. Luckily, there was a catch in place so that it ACTUALLY DIDN'T DIVIDE BY 0, because if it had we would have obviously created a singularity event and caused the formation of a black hole, destroying our entire solar system and all known life in the universe.
  • If you are too busy and important to regularly read the release notes, you may have missed the fact that in the last update we added some shiny new hints, including 'title outside of <head>' and 'meta description outside of <head>', while at the same time updating our Chrome Crawler to automatically flatten embedded iframe content. I'm sure you can see where this is going... on pages which contained iframes, Sitebulb would find and flatten the iframe, then see the new title/meta description from the iframe as HTML elements within the <body> of the parent page. Facepalm o'clock. Cue many emails of 'this doesn't look right.' We screwed this up, and we're very sorry.
  • On some (particularly slow) sites we found that some of Sitebulb's new Chrome settings would clash, but only when 'Performance' was checked. It was only very few sites, affecting a very small number of customers. We told them to uncheck 'Performance' and just crack on with it, but this didn't go down very well, so we fixed the bug instead.
  • Sitebulb had started rejecting some pages with really heavy HTML content (one of the examples we looked at was a URL where the document was 8MB!). We duly administered 70 (seventy) lashes in an attempt to purge Sitebulb's discriminatory outlook, and so far she seems better behaved, in public at least.
  • We'd received some feedback that Sitebulb was crawling thousands of unnecessary URLs, and when we looked into it, this was being caused by HTTP checks to many different image versions, both from srcsets and lazy loaded images. We took the decision to NOT check these by default, and instead add an option to switch them on in the Page Resources options:

    Bloat Audit
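
Going back to the 'NO CHANGE' bug from the first fix in this version: a percentage change from a previous count of zero is undefined, so the zero case needs its own branch rather than a silent fallback. A minimal sketch of that idea, where the function name and the labels are our own assumptions rather than anything lifted from Sitebulb's code:

```python
from typing import Optional


def hint_change(previous: int, current: int) -> tuple[str, Optional[float]]:
    """Classify how a hint's URL count changed between two audits."""
    if previous == current:
        return ("NO CHANGE", 0.0)
    if previous == 0:
        # Percentage change from zero is undefined, so flag the hint as new
        # instead of dividing by zero (or quietly reporting 'NO CHANGE').
        return ("NEW", None)
    pct = (current - previous) / previous * 100
    return ("INCREASE" if pct > 0 else "DECREASE", pct)


print(hint_change(0, 12))   # a hint that was not present in the last audit
print(hint_change(10, 15))  # a hint that got worse
```

No black holes were created in the running of this example.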

Version 7.1

Released on 27th October 2023

There are many important and wildly consequential updates in this version that we hope you'll like, some of the key highlights include:

30+ New Hints

In this update we've included a whole bundle of new Hints. This is basically just Sitebulb moving with the times, we now exclusively wear <INSERT NAME OF FASHIONABLE CLOTHING> as we are nothing if not down with the kids.

Response vs Render Hints

The biggest change is a brand new Hints section for our Response vs Render report, which gives further emphasis to rendering issues, following customer-feedback calls with Aleyda 'she's so famous she only needs one name' Solis and renowned mayonnaise pioneer Arnout Hellemans.

Their feedback was basically:

"We don't want core SEO elements to be rendered by JavaScript - only rely on it for less consequential stuff - as this can cause pages to take a lot longer to get indexed, especially if the page content is changing all the time."

So now, if there's rendering issues you probably should pay attention to, we're highlighting them more proactively with 12 new hints, plus a Response vs Render score:

Response vs Render Hints

The new hints, in order of severity, are as follows:

And here's all the other new Hints, which are nothing to do with Response vs Render:

New On Page Hints

New Internal URLs Hints

New Links Hint

Has broken bookmarks

New Performance Hint

Transferred image size is over 100KB

New Indexability Hints

We added this Hint as a Potential Issue:

Contains possible soft 404 phrases

In addition, we added a whole bunch of Insight Hints, which are also picked out via a new table in the Indexability report. These are additional 'indexing and serving rules' that go beyond the basic noindex/nofollow stuff, and allow you more granular control over what gets displayed in the SERPs.

For instance, folks might be interested in the nocache or noarchive directives, which allow you to exclude website content from being used for training Microsoft's generative AI models. You can find out more about these directives via our documentation.

Indexing and Serving Rules

Save Rendered HTML

In the same vein as the new Response vs Render Hints, we've also made it easier to see how on-page elements change after rendering - Sitebulb can now be configured to save both the response HTML and the rendered HTML so they can be compared in the UI.

For (hopefully?) obvious reasons, this feature only works when using the Chrome Crawler, and will appear as a new option on the audit setup screen:

New response vs render settings

NOTE IN CAPITALS FOR PEOPLE WHO DON'T READ - THIS WILL TAKE UP A LOT OF DISK SPACE. DON'T JUST SWITCH IT ON FOR EVERY AUDIT!!

Imagine the site you're auditing has about 200KB of HTML per page, you'll be saving this twice, which makes 400KB (that Maths degree is coming in handy again!). We're compressing the HTML, so it's ~15% of that, about 60KB. Do this on every page of a 10,000 page site and it'll take up 600MB disk space. On a 100k page site it's 6GB...
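
If you want to run the same sums for your own site, here is the arithmetic above as a small Python sketch. The 200KB page size, the two saved copies and the ~15% compression figure come straight from the example above; the function name is made up, and real sites will vary a lot, so treat the output as a rough estimate only:

```python
KB_PER_MB = 1_000  # round numbers, to match the back-of-envelope maths above


def saved_html_disk_usage_kb(
    pages: int,
    html_kb_per_page: float = 200,    # assumed average HTML size per page
    copies: int = 2,                  # response HTML + rendered HTML
    compression_ratio: float = 0.15,  # compressed size is ~15% of the original
) -> float:
    """Estimate the disk space (in KB) needed to save HTML for a whole audit."""
    return pages * html_kb_per_page * copies * compression_ratio


for pages in (10_000, 100_000):
    mb = saved_html_disk_usage_kb(pages) / KB_PER_MB
    print(f"{pages:>7,} pages -> roughly {mb:,.0f} MB")
```

Swap in your own page count and average HTML size before deciding whether to tick the box.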

So we suggest only switching this on for certain specific audits, and for sampled audits on bigger websites.

Please note that the Hints will still be checked even if you don't save the HTML, so just use it when it makes sense.

You have been warned.

Once the audit is complete, Sitebulb will offer a new 'View Differences' button on any URL List in the Response vs Render report, including on any of the Hints. 

View Differences - Response vs Render

Then this will show you everything that has changed from the response to the rendered version:

  • Anything that has disappeared from the HTML response will be highlighted red
  • Anything that has been modified during rendering will be highlighted orange
  • Anything that has been added during rendering will be highlighted green
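
To make the red/orange/green idea concrete, here is a rough sketch using Python's standard difflib module to classify line-level differences between two versions of a page. This is purely an illustration of the concept: Sitebulb's own comparison works on DOM elements, links, images and text rather than raw lines, and the function name and sample HTML are invented for this example:

```python
import difflib

# Map difflib's opcode names onto the colour scheme described above.
COLOURS = {"delete": "red", "replace": "orange", "insert": "green"}


def classify_changes(response_lines, rendered_lines):
    """Return (colour, response_chunk, rendered_chunk) for each difference."""
    matcher = difflib.SequenceMatcher(
        a=response_lines, b=rendered_lines, autojunk=False
    )
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged content gets no highlight
        changes.append((COLOURS[tag], response_lines[i1:i2], rendered_lines[j1:j2]))
    return changes


# Hypothetical response HTML vs rendered HTML, one element per line.
response = [
    "<title>Old title</title>",
    "<h1>Welcome</h1>",
    "<p>Server-side paragraph</p>",
    "<footer>Footer</footer>",
]
rendered = [
    "<title>New title</title>",
    "<h1>Welcome</h1>",
    "<footer>Footer</footer>",
    "<a href='/offers'>Offers</a>",
]

for colour, before, after in classify_changes(response, rendered):
    print(colour, before, after)
```

In this sample the title is modified (orange), the paragraph disappears (red) and the link is added by JavaScript (green).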

Look how easy it is to spot issues:

Rendering Issues saved HTML

See all the different tabs? These allow you to dig into all the different sections. The DOM option shows you all the major DOM elements that have changed (i.e. all the really important on page SEO shit). Then all of Links/Images/Text/HTML show you side by side comparisons of what has been added/removed/modified by JavaScript.

Now, let's say you forgot to switch on to save the 'Response vs Render' HTML (or you were chastened by my warnings above and you're now too frightened to ever turn it on) but the Hints show you that LOTS is changing during rendering.

Fret not! You don't need to run the audit again. Just shimmy on over to the Single Page Analysis tool and pop the URL in there, you'll get a live-action version of the same thing.

Check out all the renders you can render:

Response vs Render

Save Text, Screenshots and HTML

We have for a number of years had a 'crawl diagnosis' feature which allows you to save the HTML and/or take screenshots of pages while auditing the website.

It was kinda hidden, because it was mostly designed to help you figure out why pages aren't rendering like you thought they should, which is a bit of a niche venture.

But occasionally folks would find it and turn it on, and then they'd be like 'Yo, I turned on screenshots. Where's the folder with all my screenshots saved?'

And we'd be all, 'Erm it's just made for diagnosis, there's no save folder.'

Then they'd get really angry, and in most cases throw multiple punches*;

'I have to look at them one by one?! That is whack dude.'

Tired of suffering such relentless abuse, we have finally relented, and added a method to save crawl data in bulk, via a new audit option:

*this is entirely fictional

Save the Crawl Data

The options are threefold:

  • Extract and save just the text content, which is accessible in the finished audit through the 'Bulk Exports'.
  • Take screenshots of each rendered page, either in mobile or desktop (or both) - accessible via a saved folder on your local computer
  • Extract and save the HTML for each page - accessible via a saved folder on your local computer

So let's say you chose the screenshots option (which is only available with the Chrome Crawler), then you'll be able to select the folder to which it will save on your local computer:

Screenshots save directory

As it runs, Sitebulb will render off all the pages and save the screenshots into a hierarchical folder structure, like this:

Screenshots folder structure

You'll get new subfolders for each different website, and each different audit, in case (for some reason) you wish to keep historical records of all this data.

Two things to note:

  1. Like the 'Response vs Render' save HTML option above, these options will take up lots of disk space on your machine, so use them wisely!
  2. This is really a desktop only feature. With cloud you can save HTML to be viewed in the URL Details page - but not into a folder - and there's no way to save the screenshots using cloud.

Advanced Chrome Crawler Settings

Sometime last year, a UK agency called Merj did some original research on 'session isolation': Validating Session Isolation For Web Crawling To Provide Data Integrity - it's a fascinating read, if data integrity is your jam (otherwise perhaps not so much...).

TL;DR it is to do with how the rendering of one web page can affect the functionality or the content of another - so think how you see 'items you recently viewed' in the browser when viewing an ecommerce store. They made a solid argument for why this is something that should be taken seriously and why it is very important when crawling certain websites.

Merj carried out a number of tests, and it was sad to see that Sitebulb failed a lot of them. We weren't the only tool - in fact most of the crawlers they tested failed at least one of the tests. But we knew it was something we wanted to put right, so we have added a new 'Incognito' mode, which will pass all the Merj validation tests for session isolation.

This has been added to a new setting panel which appears when you select the Chrome Crawler, our Advanced Chrome Crawler Settings:

Session Isolation Sitebulb

It's a simple tickbox to turn on incognito, and it is unticked by default for a very good reason - you do not always want to crawl with this on. In fact, on some sites, it will cause you massive headaches (e.g. on some sites that use session IDs, every Chrome instance would be given a fresh session ID and potentially use up an enormous amount of resources).

In fact, all of these new Chrome settings are not designed for typical use-cases. By default, Google will flatten the shadow DOM and flatten iframes, so by default Sitebulb does likewise - but these are now options you can switch off if you so wish.

The two at the top - 'Considered Loaded Event' and 'Navigation Timeout' - are designed for folks who are exploring the rendering process on their website, so if you are unfamiliar with these concepts then just leave them with the recommended settings.

Now, anyone that is familiar with these concepts will most likely realise that Sitebulb now offers Puppeteer-esque controls out of the box, so please feel free to have a play!

Structured Data Updates

As always, we like to keep our exceptional Structured Data Validation up-to-date, so there's a bunch of new things added:

  • Updated Schema.org validation to their latest update, version 23
  • New Rich Result for Vehicle listing, for all you smooth-talking car salespeople out there
  • Updated a required property for Video to only be recommended now

And by the way, don't forget we have our Structured Data Alerts system to keep you in-the-know when it comes to any new changes in the world of structured data.

Improvements to James Bond mode

Unobservant people tend to miss this feature, but Sitebulb has a dark mode which you can toggle in the top right hand corner:

Dark Mode Toggle

Most people tend to call it 'James Bond mode', actually.

Well, in reality it is only called that by people who work for Sitebulb.

Oh FINE.

I'm the only one that says it.

Anyway, I'm not even bothered. The point is there were some colour choices that were so difficult to see they made you want to stab your own eyeballs repeatedly with one of those fountain pens that were popular back in the 90s (and/or toggle 'light mode' back on).

So James Bond mode is a bit nicer to use now.

Just fuck off if you don't want to say it.

I don't even care.

Small (but not notable) improvements

  • Added an alphabetic sort to the Google Analytics 4 account dropdown, so that it's easier to find the right one.
  • Added colour-coding to the values on the URL tab of the performance report, so it's easier to decipher where scoring falls for web vitals metrics. Looking at a picture will make this much more obvious what I'm going on about:

Performance Scores

Bug Fixes

It appears that Charlie Whitworth has made a career based almost entirely on his nuisance value. He is responsible for finding these two bugs, so we thank you Charlie for your contribution:

  • Sitebulb (very cleverly) saves disallowed resource URLs and tells you about them - because that can screw up rendering innit. However it also (very stupidly) would show you a warning at the end of the audit saying 'there were some URLs you didn't crawl, press this button to finish the audit.' So Charlie (and most likely others) would see that button, press that button, and be perpetually annoyed that even after re-running, there were still uncrawled URLs! (the same disallowed resources). You won't get that warning any more Charlie, be calm.
  • To add insult to injury, if you, like Charlie, kept pressing this button and hoping in vain to finally get a fully completed audit, Sitebulb would actually populate the audit with duplicate URLs! I think it's fair to say this wasn't our finest moment.

There's also a load of bugs that Charlie didn't notice - and arguably he probably should have if we're being honest about it.

These are ones that were actually stopping users from crawling websites at all:

  • Sitebulb would on occasion refuse to crawl websites with the spell checker turned on. Gareth explained to me that this was happening because it could not find the dictionary.  It turns out his wife had borrowed said dictionary during a rather heated Scrabble tournament one evening, trying to debunk Gareth's claim that 'browbub' was a real word, and failed to put it back where it is supposed to go (in the bookshelf by the fireplace - the one they don't actually ever light). For those wondering, Gareth tried to claim that, 'a browbub is obviously when you find an eyebrow hair in your sandwich.'
  • There had been a number of websites that Sitebulb was struggling to crawl with the Chrome Crawler. As part of our Chrome improvements work (see above) we also fixed all of these issues.

We fixed a couple of issues we found with duplicate content detection:

  • With Sitebulb Cloud we now have customers crawling websites up to 10 million URLs </humblebrag> and with that volume of data, we were finding that our old hashing method was causing some false positives (this is known as 'hash collision', apparently). So now we run a second, more complicated hash, which eliminates these false positives.
  • We updated the 'Duplicate Content' Hint so that it is now only looking at the content area, rather than all of the HTML, which makes it a more useful and usable data-point. It also makes it more common! So if you find it suddenly triggering in your audits, to avoid being like: 'why the fuck is that there?' make sure you commit these release notes to memory word for word.
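
For anyone wondering what 'run a second, more complicated hash' looks like in practice, here is a minimal sketch of the general two-stage technique: a cheap hash buckets candidate duplicates, then a stronger hash confirms them. The specific hash functions (CRC32 and SHA-256) and the function name are our assumptions for illustration, not a description of what Sitebulb actually uses:

```python
import hashlib
import zlib
from collections import defaultdict


def find_duplicates(pages, cheap_hash=zlib.crc32):
    """Group URLs with identical content using a cheap hash, then a strong one."""
    # Stage 1: a fast, small hash buckets the candidates. Collisions are possible.
    buckets = defaultdict(list)
    for url, content in pages.items():
        buckets[cheap_hash(content.encode("utf-8"))].append(url)

    # Stage 2: within each bucket, confirm with SHA-256 to weed out collisions.
    groups = []
    for urls in buckets.values():
        if len(urls) < 2:
            continue
        confirmed = defaultdict(list)
        for url in urls:
            digest = hashlib.sha256(pages[url].encode("utf-8")).hexdigest()
            confirmed[digest].append(url)
        groups.extend(group for group in confirmed.values() if len(group) > 1)
    return groups


# Hypothetical content areas. Using len() as a deliberately terrible first-stage
# hash forces a 'collision' between /a, /b and /c so that stage 2 has work to do.
pages = {
    "/a": "hello world",
    "/b": "hello world",
    "/c": "HELLO WORLD",
    "/d": "something else entirely",
}
print(find_duplicates(pages, cheap_hash=len))
```

The second stage only runs on pages that already share a first-stage hash, so the extra cost stays small even at millions of URLs.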

Some issues with content extraction:

  • One customer came to us with a client site that was rendering JavaScript content so slowly that our content extraction didn't pick it up. Gareth grudgingly improved the way this works in Sitebulb, whilst also muttering something about how he wouldn't piss on the website if it were on fire.
  • When extracting 'All matched items', Sitebulb would top out at 5 (even if there were more than 5 to scrape).
  • There was a silly bug where if the user added a rule name that included an apostrophe, Sitebulb's database would fall over (SQL logic error).
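
The apostrophe bug is a classic, and the usual cure is to pass user-supplied values to the database as parameters rather than pasting them into the SQL string. Here is a minimal sketch using Python's built-in sqlite3 module. The choice of SQLite, the table name and the column name are all assumptions for the sake of the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extraction_rules (name TEXT)")  # hypothetical table

rule_name = "Product's price"
error = None

# Unsafe: with string formatting, the apostrophe ends the SQL string literal
# early and the statement no longer parses.
try:
    conn.execute(f"INSERT INTO extraction_rules (name) VALUES ('{rule_name}')")
except sqlite3.OperationalError as exc:
    error = str(exc)

# Safe: a '?' placeholder hands the value to the database separately,
# so the apostrophe is stored as plain data.
conn.execute("INSERT INTO extraction_rules (name) VALUES (?)", (rule_name,))
stored = conn.execute("SELECT name FROM extraction_rules").fetchone()[0]

print("formatted query failed:", error is not None)
print("stored rule name:", stored)
```

Placeholders also close the door on SQL injection, which is a nice bonus for a one-character change.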

Some 'compare audits' bugs squashed:

  • There was an astonishingly annoying bug where you would go and tick two audits to compare (from the project page) and the UI would refresh in the background, unticking the audits in the process! If you had ninja-like reactions you could quickly select them and press 'Compare Audits', but when I suggested ninja training to the customer who reported the issue, I was met with a stern 'don't fuck with me' glare. So we just fixed the bug instead.
  • The compare feature had another bug, in the comparison export - only when you were checking 'Structured Data' - we were duplicating the 'Google Search Features' row.
  • The final compare bug was only on Sitebulb Cloud, and only when connecting to your cloud server via the desktop application (it did not happen in the browser) - clicking on some of the buttons would throw an 'Error Loading Website Audit' message.

And then just some random singular bugs that can't be neatly grouped:

  • Old-skool SEO and original inventor of Crawl Maps, Ian Lurie, was experiencing a bug with the Audit History Google Sheet creating new/duplicate export files instead of modifying the existing sheet to append new rows - which was then breaking his Looker Studio reports. This issue would only manifest when mixing-and-matching the bulk export with the auto-export to Google Sheets method. It no longer does this, and Ian is happy.
  • Sitebulb was caching the Hint change values, so if you viewed an audit where Hints were 'No Change', this would also show on audits where there actually were changes. Restarting the app would reset this, so many of you will not have noticed any issues. There are however a species of, I don't know how to phrase this politely... 'Mac users', who tend to leave their machines on for months at a time, only ever turning them off due to what is known as 'force majeure' events, in legal parlance (and 'holy fuck' events to you and me).
  • The malformed links report was leading to a SQL error in the Link Explorer, which was due to the 'Unique Links' stuff we added in v7.
  • Sitebulb wasn't crawling disallowed PDFs even if you have the 'Crawl Disallowed URLs' setting ticked, and it also wasn't displaying crawled disallowed URLs in the URL explorer(!) - both these things have now been fixed.
  • In the Link Explorer, the 'Does not start with' filters were not working - both for referring URLs and target URLs. You could hack it with 'contains' instead, but that's not the point really is it?
  • Sitebulb was sometimes flagging 'images missing alt text' when they actually did have it. Turns out Sitebulb was picking up noscript images and stylesheet images; it was also parsing the srcset images and not applying the original alt text to them. So we have made a bunch of improvements to how alt text issues are reported. But, to be frank, the concept of 'images missing alt text' is pretty antiquated in the first place. Sometimes images are lazy loaded. Sometimes images are loaded in with base-64 string and then the real image afterwards. Sometimes it is images loaded in via srcset. I mean I get it, folks want to have an easy report they give to a dev and go 'add alt text', but it's no longer as straightforward as all that. Anyway, I guess that's a discussion best saved for a future release note...

Version 7.0

Released on 31st July 2023

Note that this update contains big infrastructure changes: BEFORE you upgrade to v7 please make sure to let any currently running audits finish (DO NOT pause -> upgrade -> resume)

Sitebulb Cloud Launched

We are really excited to announce the launch of our new offering, 'Sitebulb Cloud.'

It answers the question that has annoyed many of you for years: 'what would Sitebulb be like if it was in the cloud?'

We now have the answer! Sitebulb Cloud has all the awesome stuff you already love about the desktop version, now accessible via a web browser.

Check it out:

The video covers some important points that I will reiterate here:

  • Sitebulb Cloud is comfortably cheaper than all the other cloud crawlers. I mean it's not even close.
  • There's no arbitrary project limits that are designed to extract more money from you.
  • We don't charge you extra for JavaScript rendering (like all the others do).

Sitebulb Cloud is an evolution of the 'Server' product we have been shilling since the beginning of the year, but now has the added benefit of the browser login, making it much more accessible.

It's going to be super useful to a whole bunch of customers - it is especially good for teams working together. We already have a number of customers using it and seeing the benefits:

"The most impactful change that Sitebulb Cloud provided was the ability for everyone to work from the same crawl data"

Case Study from Moving Traffic Media

Prices start from £195/month, and we can go all the way up to massive custom plans for loads of users and millions of URLs.

BUT it won't be for everyone. Sitebulb Pro is still going to be the weapon of choice for most consultants and smaller agencies, and we are still committed to relentlessly improving the desktop product (see below!).

If your interest is piqued by Sitebulb Cloud though, please check it out or get in touch.

New Feature: HTML Templates

Every other week there's a new SEO industry study out about how frustrating it is when we make audit recommendations and they are then just ignored by clients or developers. 

And we all know that we shouldn't write audit recommendations like this:

"The website has this issue <INSERT ISSUE NAME> on 4762 pages. Pls fix now - Dave"

Very reasonably, you might expect a developer to read this recommendation and think:

"Fuck. Right. Off."

Sitebulb's new feature is designed to help make that developer response a bit closer to "Ok I'll do it now!" (although this level of enthusiasm might be pushing it!).

It is designed to help make your recommendations clearer in the first place:

"The website has this issue <INSERT ISSUE NAME> on 4762 pages. The issue is only present on two of the page templates - Blog Posts (4102 URLs) and Subcategory Pages (760 URLs). If you can resolve the underlying issue on those two templates, it will eradicate the problem across the site. I would be forever in your debt if you could prioritize this fix for me. I have the honour to be your obedient servant, David."

Ok ok. Sitebulb won't turn you into a fawning sycophant, but it will help you deliver more meaningful recommendations by automatically identifying the HTML template that is being used for each URL.

What do you mean by HTML template?

I'm glad you asked. In this modern world of hipster coffee and sourdough bakeries, most websites are built using a content management system (CMS), which allows users to create content without necessarily having lots of technical expertise about how the underlying pages are built.

Typically, developers will build out a range of page templates for the website, which can be selected by users of the CMS. So, for example, a content writer might select the 'Blog Post' template when writing a new post. Or the ecommerce team might select the 'Product Page' template when adding a new product to the store-front.

Essentially, all pages using the same template will inherit the same underlying HTML - which will typically mean that if there is a problem on one page of a template, it will also be present on all the other pages using the same template.

As such, it is very valuable when auditing a website to be able to:

  1. Recognise the different HTML templates that are in use
  2. Group pages together based on the template they use
  3. Segment SEO issues based on template

And this is what Sitebulb's HTML Template feature allows you to do.
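
To give a feel for the three steps above, here is a deliberately simple sketch that fingerprints each page by its tag-and-class skeleton and groups URLs that share a fingerprint. To be clear, this is a toy approach we made up for illustration and says nothing about how Sitebulb actually detects templates. All the names and sample pages are hypothetical:

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Collect the sequence of opening tags (plus class names), ignoring text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        classes = sorted((dict(attrs).get("class") or "").split())
        self.parts.append(tag + "." + ".".join(classes))


def template_fingerprint(html: str) -> str:
    """Hash the page's tag skeleton so identical structures share a fingerprint."""
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return hashlib.sha256("|".join(parser.parts).encode("utf-8")).hexdigest()[:12]


def group_by_template(pages: dict) -> dict:
    """Group URLs by the structural fingerprint of their HTML."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[template_fingerprint(html)].append(url)
    return dict(groups)


pages = {
    "/blog/a": "<html><body><article class='post'><h1>A</h1><p>Text</p></article></body></html>",
    "/blog/b": "<html><body><article class='post'><h1>B</h1><p>More text</p></article></body></html>",
    "/product/x": "<html><body><div class='product'><h1>X</h1><span class='price'>5</span></div></body></html>",
}
for fingerprint, urls in group_by_template(pages).items():
    print(fingerprint, urls)
```

An exact-match fingerprint like this falls apart as soon as two pages on the same template have a different number of paragraphs, so a real implementation would need something far more tolerant.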

Here's our second video of the day:

You don't need to do anything in order to 'switch on' HTML Templates, as Sitebulb will automatically do it.

Once the audit has completed, HTML Template information will be available via the 'HTML Templates' report option on the left hand menu:

HTML Templates

This will show you all the URLs on the website, split into the different page template groupings.

You can then go ahead and name all the templates, like in the video. For more comprehensive instructions on how to use this feature, please check out our guide here.

This is the sort of thing you can expect to see, once you have finished naming all the templates:

All Finished Templates

See how it works? Instead of an unorganized list of URLs, you can work with URL data in a way that reflects how the website has actually been built.

Without any further analysis, this list of templates is already useful data, as it helps you understand the underlying structure of the website you are dealing with, and the potential scope of actually getting issues fixed.

To dig into each template further, you can navigate onto the template report by selecting the template from the dropdown navigation, clicking the name or clicking the 'View' button:

Drill down into template

Each of these will take you to the same place, the report page for this specific template:

Documentation Template

From here you can explore the data as you would in the rest of the Sitebulb tool, with the added context that everything you see on this page relates only to the page template you have selected. You can click through into the URL Lists or tables, or view the triggered hints for this template.

For example, clicking through the pie chart on the left to view indexable URLs will show you the indexable URLs that use this page template. You will notice that the HTML Template is shown as a column in the URL List:

Template name in URL List

The Hints data alongside each template will also help you easily spot when a particular issue is only present on certain templates. In the example below, we can see that there is a critical issue which affects only the 'Blog Post' template.

Critical Hint

If we click through to view this template, we can see the Hints values for the template, and view the triggered Hints for this template by clicking into the Hints tab:

Blog Post Template

This shows us all the triggered Hints for this specific template, ordered by importance, and therefore showing the Critical one at the top. To dig in further, you can click through to View URLs:

Mixed Content hint

Once you have named all your templates, this data will enrich your reports in other areas of Sitebulb as well.

When you view URLs, you can identify which page templates an issue was present on, which aids your understanding of the issue, and can make your communication with your client/developers a lot clearer.

Template Issues

You will also find the HTML Template in areas such as the URL Explorer, and configurable as a column you can add for any URL List it is not already present on.

URL Explorer Template Name

You'll also notice a 'HTML Templates' tab that becomes available in the different reports. This will show you a top-down matrix of common issues, and on which page templates they occur.

Templates within reports
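
If you ever want to rebuild this kind of issue-by-template matrix from an export, the underlying data structure is just a count of URLs per (hint, template) pair. A minimal sketch, where the rows, hint names and function name are all invented for illustration:

```python
from collections import Counter

# Hypothetical export rows: (url, template, hint).
rows = [
    ("/blog/a", "Blog Post", "Mixed content"),
    ("/blog/b", "Blog Post", "Mixed content"),
    ("/blog/b", "Blog Post", "Missing meta description"),
    ("/shoes/", "Subcategory Page", "Missing meta description"),
]


def issue_matrix(rows):
    """Return {hint: {template: url_count}} covering every hint/template pair."""
    counts = Counter((hint, template) for _, template, hint in rows)
    hints = sorted({hint for _, _, hint in rows})
    templates = sorted({template for _, template, _ in rows})
    # Counter returns 0 for missing pairs, so every cell in the matrix is filled.
    return {hint: {t: counts[(hint, t)] for t in templates} for hint in hints}


for hint, by_template in issue_matrix(rows).items():
    print(hint, by_template)
```

A row that is zero everywhere except one template is the tell-tale sign of a template-level issue.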

The HTML Templates tab only appears on certain reports - those that make sense in terms of the issues covered - and in some places you will find them more valuable than others.

Within Performance, for example, it is hugely beneficial to see the breakdown of different Web Vitals metrics on each template, particularly since the work involved in improving performance is so heavily template-based.

Performance - Template

You can dig in further by exporting to CSV or Google Sheets, or by clicking through to View:

Web Vitals data for one template

Accessing and understanding your audit data through the lens of HTML templates should make it easier for you as the SEO to diagnose issues, and importantly should make your client/developer communication a lot clearer. If this topic is dear to your heart, check out the recent webinar that I was a guest on, along with Areej: O/SEO/O E16: Opinions About Leveling Up Tech Audits.

Final aside: you will occasionally come across some websites where the HTML Template data is NOT all that useful. These sites are typically doing something that injects dynamic elements into each page, which makes them look different when Sitebulb parses them - on these sorts of sites you'll have tons of templates with like 1 URL in each one. Not helpful at all!

Updated: Reduced database sizes

If you've been a Sitebulb user for a while, you may have noticed that crawl data can be pretty chunky. Like hundreds of GB worth of chunky.

Over time, this would cause your computer to swell uncontrollably, like your body's response to your lack of self-restraint every Christmas Day.

Sitebulb audits that you run on v7 onwards will now be impressively svelte in comparison - we're talking before-and-after pictures that are 20% of their former size! 

Please consider this upgrade like the 'New Year New You' programme you follow every January - where for two weeks you treat your body like a temple - this does not turn back time and erase the months of sheer gluttony that preceded it. Similarly, if you have old Sitebulb database files, they will retain their unwieldy stature.

Your new Sitebulb audits on the other hand have just opened an Instagram account, and in 3-6 months you can expect to see Facebook ads for their new fitness program.

Updated: Unique links

The launch of Sitebulb Cloud has forced us to reassess some of the data decisions we have made in the past, and make provisions for crawling bigger websites.

The area where this was most apparent was with internal links, as it has the biggest potential to blow up.

Take a site like sitebulb.com – we have about 500 HTML pages, and about 40,000 internal links – so on average, each page has about 80 links pointing at it. At most, maybe a thousand.

When looking at a site like this, it is very reasonable to want to look at every single link, and the scale of the data means this is feasible to do – either within Sitebulb's user interface (the 'Link Explorer') or as an export in Excel/Sheets.

Now consider that the average website is about 10,000 pages: with a similar link ratio, that would be 800,000 – 1 million links. You can still, just about, wrangle this in spreadsheet format (although you are really knocking on the ceiling of Excel's limits), and Sitebulb's Link Explorer can handle it no problem.
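To put rough numbers on that, here is a back-of-envelope sketch using the figures quoted above. The function name is made up for illustration, and the only outside fact is the row limit of a modern Excel worksheet (1,048,576 rows):

```python
# Back-of-envelope estimate of internal link volume, using the figures from the text.
EXCEL_MAX_ROWS = 1_048_576  # row limit of a single .xlsx worksheet


def estimate_links(pages: int, avg_links_per_page: float) -> int:
    """Rough total of internal links for a site of the given size (illustrative helper)."""
    return round(pages * avg_links_per_page)


# sitebulb.com figures quoted above: ~500 pages and ~40,000 internal links.
ratio = 40_000 / 500  # average links per page

for pages in (500, 10_000, 5_000_000):
    links = estimate_links(pages, ratio)
    fits = links <= EXCEL_MAX_ROWS
    print(f"{pages:>9,} pages -> {links:>11,} links (fits in one Excel sheet: {fits})")
```

The 10,000 page site squeezes under the limit; the 5 million page one does not come close.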

But with Sitebulb Cloud, we're dealing with sites that have millions of URLs, and oftentimes, hundreds of millions of internal links. How do you look at THAT in a spreadsheet?

The answer is, you don't. Once you start to reach a certain scale, pairwise analysis of links just starts to break down. Spreadsheets aren't built to deal with 220 million rows of link data, and neither is your brain.

The granularity simply becomes redundant. Say you crawl a 5 million URL site, and every page links to the homepage in the header. This then corresponds to 5 million rows of data in your link table. I mean it is literally the same thing, 5 million times in a row. That is not useful data.

(Re-)introducing: unique links

The solution is to stop looking at every link as a pairwise relation, and instead group like links together.

Those 5 million links above all link to the same page, from the same location in the HTML, using the same anchor text. It is 1 unique link, that occurs 5 million times.
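As a minimal sketch of that grouping idea (this is not Sitebulb's actual implementation; the `Link` record, the `location` label and the output field names are all assumptions made for illustration), collapsing pairwise rows into unique links might look something like this:

```python
from collections import defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    source_url: str
    target_url: str
    anchor_text: str
    location: str  # hypothetical label for where the link sits in the HTML, e.g. "header"


def unique_links(links):
    """Collapse pairwise link rows into one row per (target, anchor text, location)."""
    groups = defaultdict(list)
    for link in links:
        key = (link.target_url, link.anchor_text, link.location)
        groups[key].append(link.source_url)
    return [
        {
            "target_url": target,
            "anchor_text": anchor,
            "location": location,
            "example_linking_url": sources[0],   # first page seen with this link
            "linking_urls": len(set(sources)),   # number of distinct linking pages
        }
        for (target, anchor, location), sources in groups.items()
    ]


# Every page links to the homepage from the header: many rows, one unique link.
pages = [f"https://example.com/page-{i}" for i in range(1000)]
rows = [Link(p, "https://example.com/", "Home", "header") for p in pages]
rows.append(Link(pages[0], "https://example.com/pricing", "Pricing", "body"))

for row in unique_links(rows):
    print(row)
```

Here 1,001 pairwise rows collapse into 2 unique links, and the same logic applies at 5 million rows.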

Unique links are actually not a new notion in Sitebulb. We have had them for years, but we've never given them the focus they now have or set any rules in place for when they should become the default.

So this is what's changed. If we find over 2.5 million links on the website, Sitebulb will stop showing 'All links' as options you can click on, and instead default to showing unique links.

Unique Links Report

Clicking through to any of those Unique Links values will drop you on the list of unique links in the Link Explorer:

Link Explorer

As you can see, scrolling across shows you the important data points you need to make decisions about these links:

  • Target URL
  • Anchor Text
  • Example Linking URL
  • Number of Linking URLs

If you want/need to explore all links pointing at a particular page, you can use the 'All Links to Target' button.

Added: Additional advanced settings

In the same vein as the above, we have added some other options to make it quicker and easier to do very large audits. By default, Sitebulb's processing stage will tackle some jobs that may not really be necessary for enormous sites, and can add hours to the processing time:

Pre-build exports

During the report building phase, Sitebulb will pre-generate lots of the CSV export files, so they are instantly available when the audit is complete, including all the hint exports. On sites with millions of URLs, this can take a few hours. Considering you might not actually use a lot of this data, this is time you could trim off your audit by clicking this button in the 'Data Exports' advanced settings:

Skip report creation

Disable URL Rank calculation

URL Rank is our internal link popularity metric, based on the number of incoming internal links, relative to other pages on the same site. It is a useful metric for determining how powerful or important internal pages are. However, to calculate it we run an iterative formula, not dissimilar to PageRank, which can take a long time when you are dealing with millions of URLs.

It will always be on by default (as long as 'Link Analysis' is checked), but you can disable it in the Advanced Settings for Search Engine Optimization: 

Disable URL Rank

You may wish to do this if you are crawling a very big site but you are not interested in looking at page importance data.
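To give a feel for why an iterative calculation like this gets slow at scale, here is a simplified PageRank-style power iteration. To be clear, this is NOT the URL Rank formula (which is not published in these notes); the function name, damping factor and iteration count are assumptions chosen purely for illustration:

```python
def link_rank(graph, damping=0.85, iterations=50):
    """Simplified PageRank-style power iteration (illustrative, not URL Rank itself).

    graph maps each URL to the list of URLs it links to.
    """
    urls = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(urls)
    rank = {u: 1.0 / n for u in urls}
    for _ in range(iterations):
        # Every URL starts each round with a small baseline score.
        new_rank = {u: (1.0 - damping) / n for u in urls}
        for url in urls:
            targets = graph.get(url, [])
            if targets:
                # Split this page's score evenly across the pages it links to.
                share = damping * rank[url] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its score across all URLs.
                share = damping * rank[url] / n
                for u in urls:
                    new_rank[u] += share
        rank = new_rank
    return rank


site = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/"], "/c": ["/"]}
for url, score in sorted(link_rank(site).items(), key=lambda kv: -kv[1]):
    print(f"{url:<4} {score:.4f}")
```

Each pass has to touch every link on the site, and you need many passes before the scores settle, which is why this sort of job can eat hours once you are into millions of URLs.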

Sitebulb Server Update

We are no longer publicly selling the 'DIY' server license, but existing customers can grab the latest installers from the Server Release Notes page here.

Bugs

  • We had some sort of Schrödinger's Cat scenario going on - Sitebulb was generating different word counts when 'Readability' was turned on and when it was turned off. Now I don't know about you, but I always thought that there would be the same number of words on a page whether or not I was reading them?  
  • If you did a content search, then added some columns into the URL List and tried to export it, Sitebulb would ignore all the columns you added, as if your opinion was not important. Regardless of the accuracy of this supposition, Sitebulb should probably not have been doing this.
  • Insight metrics were being duplicated, for any project with two or more audits. You'd only see this if you hovered over the little sparkline graphs - it would say 'July 10th 1203 URLs' and then the next dot would be 'July 10th 1203 URLs'.
  • In the structured data section, the option to use Advanced Filtering for URLs in "Property Explorer" was missing.
  • Running the 'All Hints' export would hit an API error if you had Accessibility turned on. This was because some of the Hint names are very long (and in my opinion, far too long, such that they are significantly LESS accessible for people who aren't very good at reading). This was creating Google Sheets with ridiculously long filenames, which was making Google Sheets have a meltdown. We are now truncating the Sheet filenames before they get uploaded, because truncating them as they come in is beyond the scope of an organization such as Google.
  • In the keywords report, it was not possible to search for specific keywords (it is now).

