Some Experiments into How Google's Crawler Works
Why experiment with Googlebot?
Beyond the fact that it is interesting to understand how it works, it is potentially useful if you can use the information to help it find your most important content. For this article I want to concentrate on two claims I have read: first, that internal links don't matter, and second, that every crawled page is rendered with its JavaScript executed. The second claim is particularly important if your site relies heavily on client-side rendering (CSR), where the content can only really be seen once it has been rendered.
This isn't a deep dive into the topic; we just want quick answers on these two questions. So please take any results as indicative rather than conclusive.
The claim behind internal linking is that if a page has more internal links pointing to it, it is likely more important. So how can we test that?
Easy: let's set up a page called one.html. The page has three links on it: the first goes to two.html, the second goes to three.html, and the third also goes to three.html.
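For concreteness, here is a minimal sketch of how the test pages might look. The original markup wasn't published, so the exact HTML is an assumption; only the file names and the link structure come from the experiment described above.

```python
# Sketch of the test setup: one.html links once to two.html and
# twice to three.html; the target pages just carry a word of content.
# The markup is assumed -- only the link structure is from the text.
from pathlib import Path

pages = {
    "one.html": (
        "<html><body>"
        '<a href="two.html">two</a> '
        '<a href="three.html">three</a> '
        '<a href="three.html">three again</a>'
        "</body></html>"
    ),
    "two.html": "<html><body>two</body></html>",
    "three.html": "<html><body>three</body></html>",
}

for name, html in pages.items():
    Path(name).write_text(html)
```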
It's important to be clear that what we are testing is crawl priority, not importance for ranking (though crawl priority potentially affects whether a page gets into the index at all).
On a crawler that does not prioritize, we would expect the crawler to read the page, find the link to two.html (which is higher up) and add it to the crawl queue, find the first link to three.html and add that to the queue after it, then find the second link to three.html and do nothing with it, since three.html is already queued.
Our sequence of page crawls is then clear: one.html, two.html, three.html.
In a crawler that prioritizes we would see something different. We expect it to read one.html but then add three.html to the queue ahead of two.html, because three.html has more links pointing to it.
So in a prioritized crawler we would see: one.html, three.html, two.html.
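To make the difference concrete, here's a toy model of the two queueing behaviours. It is purely illustrative and makes no claim about Google's actual scheduler:

```python
# Toy comparison of the two crawl-queue strategies described above.
from collections import Counter

# Links discovered on one.html, in document order.
discovered = ["two.html", "three.html", "three.html"]

# Non-prioritizing crawler: first seen, first queued; duplicates skipped.
fifo_order = list(dict.fromkeys(discovered))
print(["one.html"] + fifo_order)      # ['one.html', 'two.html', 'three.html']

# Prioritizing crawler: pages with more inbound links are queued first.
counts = Counter(discovered)
priority_order = sorted(counts, key=counts.get, reverse=True)
print(["one.html"] + priority_order)  # ['one.html', 'three.html', 'two.html']
```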
What we saw:
Googlebot read one.html and then read three.html. It did not read two.html. Its cousin GoogleOther, which seems to be some kind of experimentation bot, did read two.html in between, though. That seems fair (we experiment on them, they experiment on us), but it does make the results a bit more inconclusive than we'd like for this quick experiment. Still, if we set GoogleOther aside, the main Googlebot read one.html and then three.html, which matches the prioritized order.
On to the second claim: that every crawled page is rendered. Vercel did a long article on this, stating:
"Out of over 100,000 Googlebot fetches analyzed on nextjs.org, excluding status code errors and non-indexable pages, 100% of HTML pages resulted in full-page renders, including pages with complex JS interactions."
Let's make four.html. It simply contains the word "four" and a script tag loading five.js, a piece of JavaScript that switches the text to the word "five".
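A sketch of what those two files could look like (the actual files weren't published; this just matches the description):

```python
# Assumed contents for four.html and five.js, matching the description:
# the raw HTML says "four", and five.js rewrites it to "five" on load,
# so the word "five" only ever exists in the rendered DOM.
from pathlib import Path

Path("four.html").write_text(
    '<html><body>four<script src="five.js"></script></body></html>'
)
Path("five.js").write_text(
    'document.body.firstChild.textContent = "five";'
)
```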
If we submit this page for indexing we can see what happens: the crawler fetches four.html.
"GET /four.html HTTP/1.1" 200 121 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.7049.95 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
But it doesn't fetch five.js after it.
Initially that would make the claim seem untrue, and having CSR seem like a bad idea. But...
There are instances in the logs of Google fetching five.js from when I previously submitted the URL. And if I go to Google and search for:
five site:thesitenamegoeshere.com
then the page does come up. There's not much for Google to use as a description, so it uses "five", taken from the modified DOM body!
So not only does Googlebot seem to render these pages, it also seems to cache previously seen resources quite aggressively, which would explain why five.js wasn't re-fetched this time.
I should say that this only means CSR content can get rendered and indexed; it doesn't say anything about whether CSR is a good idea. Rendering complex JavaScript is clearly going to take longer, and crawl rate does seem closely related to speed. But fundamentally, CSR doesn't preclude you from being indexed.