Some Experiments into How Google's Crawler Works
Why experiment with Googlebot?
Beyond the fact that it is interesting to understand how it works, it is potentially useful if you can use the information to help it find your most important content. For this article I want to concentrate on two claims I have read: first, that internal links don't matter, and second, that every crawled page is rendered with its JavaScript executed. The second claim is particularly important if your site relies heavily on client-side rendering (CSR), where the content can only really be seen once it has been rendered.
This isn't a deep dive into the topic; we just want quick answers on these two questions. So please take any results as indicative rather than conclusive.
The claim behind internal linking is that if a page has more internal links pointing to it, it is likely more important. So how can we test that?
Easy: let's set up a page called one.html. The page has three links on it: the first goes to two.html, the second goes to three.html, and the third also goes to three.html.
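For concreteness, here is a minimal sketch of how the test pages might look. The original markup wasn't published, so the exact HTML is an assumption; only the file names and the link structure come from the experiment described above.

```python
# Sketch of the test setup: one.html links once to two.html and
# twice to three.html; the target pages just carry a word of content.
# The markup is assumed -- only the link structure is from the text.
from pathlib import Path

pages = {
    "one.html": (
        "<html><body>"
        '<a href="two.html">two</a> '
        '<a href="three.html">three</a> '
        '<a href="three.html">three again</a>'
        "</body></html>"
    ),
    "two.html": "<html><body>two</body></html>",
    "three.html": "<html><body>three</body></html>",
}

for name, html in pages.items():
    Path(name).write_text(html)
```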
It's important to be clear that what we are testing is crawl priority, not importance for ranking (though crawl priority potentially affects whether a page gets into the index at all).
On a crawler that does not prioritize, we would expect the crawler to read the page, find the link to two.html (which is higher up) and add it to the crawl queue, find the first link to three.html and add that to the queue after it, then find the second link to three.html and do nothing with it, since three.html is already queued.
Our sequence of page crawls is then clear: one.html, two.html, three.html.
In a crawler that prioritizes we would see something different. We expect it to read one.html but then add three.html to the queue ahead of two.html, because three.html has more links pointing to it.
So in a prioritized crawler we would see: one.html, three.html, two.html.
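To make the difference concrete, here's a toy model of the two queueing behaviours. It is purely illustrative and makes no claim about Google's actual scheduler:

```python
# Toy comparison of the two crawl-queue strategies described above.
from collections import Counter

# Links discovered on one.html, in document order.
discovered = ["two.html", "three.html", "three.html"]

# Non-prioritizing crawler: first seen, first queued; duplicates skipped.
fifo_order = list(dict.fromkeys(discovered))
print(["one.html"] + fifo_order)      # ['one.html', 'two.html', 'three.html']

# Prioritizing crawler: pages with more inbound links are queued first.
counts = Counter(discovered)
priority_order = sorted(counts, key=counts.get, reverse=True)
print(["one.html"] + priority_order)  # ['one.html', 'three.html', 'two.html']
```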
What we saw:
Googlebot read one.html and then read three.html. It did not read two.html. Its cousin GoogleOther, which seems to be some kind of experimentation bot, did read two.html in between, though. That seems fair (we experiment on them, they experiment on us), but it does make the results a bit more inconclusive than we'd like for this quick experiment. Still, if we set GoogleOther aside, the main Googlebot read one.html and then three.html, which matches the prioritized order.
On to the second claim: that every crawled page is rendered. Vercel did a long article on this, stating:
"Out of over 100,000 Googlebot fetches analyzed on nextjs.org, excluding status code errors and non-indexable pages, 100% of HTML pages resulted in full-page renders, including pages with complex JS interactions."
Let's make four.html. It simply contains the word "four" and a script tag loading five.js, a piece of JavaScript that switches the text to the word "five".
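A sketch of what those two files could look like (the actual files weren't published; this just matches the description):

```python
# Assumed contents for four.html and five.js, matching the description:
# the raw HTML says "four", and five.js rewrites it to "five" on load,
# so the word "five" only ever exists in the rendered DOM.
from pathlib import Path

Path("four.html").write_text(
    '<html><body>four<script src="five.js"></script></body></html>'
)
Path("five.js").write_text(
    'document.body.firstChild.textContent = "five";'
)
```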
If we submit this page for indexing we can see what happens: the crawler fetches four.html.
"GET /four.html HTTP/1.1" 200 121 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.7049.95 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
But it doesn't fetch five.js after it.
Initially that would make the claim seem untrue, and having CSR seem like a bad idea. But...
There are instances in the logs of Google fetching five.js from when I previously submitted the URL. And if I go to Google and search for:
five site:thesitenamegoeshere.com
then the page does come up. There's not much for Google to use as a description, so it uses "five", taken from the modified DOM body!
So not only does Googlebot seem to render these pages, it also seems to cache previously seen resources quite aggressively, which would explain why five.js wasn't re-fetched this time.
I should say that this only means CSR content can get rendered and indexed; it doesn't say anything about whether CSR is a good idea. Rendering complex JavaScript is clearly going to take longer, and crawl rate does seem closely related to speed. But fundamentally, CSR doesn't preclude you from being indexed.