
How Google handles JavaScript throughout the indexing process

200 points · 2 months ago · vercel.com
palmfacehn2 months ago

The rich snippet inspection tool will give you an idea of how Googlebot renders JS.

Although they will happily crawl and render JS-heavy content, I strongly suspect bloat negatively impacts the "crawl budget", though in 2024 this part of the metric probably matters much less than overall request latency. If Googlebot can process several orders of magnitude more sanely built pages with the same memory requirement as a single React page, it isn't unreasonable to assume they would economize.

Another consideration would be that, "properly" used, a JS-heavy page would most likely be an application of some kind on a single URL, whereas purely informative pages, such as blog articles or tables of data, would exist on a larger number of URLs. Of course there are always exceptions.

Overall, bloated pages are a bad practice. If you can produce your content as classic "prerendered" HTML and use JS only for interactive content, both bots and users will appreciate you.

HN has already debated the merits of React and other frameworks. Let's not rehash this classic.

tentacleuno2 months ago

> If you can produce your content as classic "prerendered" HTML and use JS only for interactive content, both bots and users will appreciate you.

Definitely -- as someone who's spent quite a lot of time in the JavaScript ecosystem, we tend to subject ourselves to much more complexity than is warranted. This, of course, leads to [mostly valid] complaints about toolchain pain[0], etc.

> HN has already debated the merits of React and other frameworks.

I'll note though that while React isn't the cure-all, we shouldn't be afraid of reaching for it. In larger codebases, it can genuinely make the experience substantially easier than plain HTML+JS (anyone maintain a large jQuery codebase?).

The ecosystem alone has definitely played into React's overall success -- in some cases, I've found the complexity of hooks to be unwarranted, and have struggled to use them. Perhaps I'm just not clever enough, or perhaps the paradigm does have a few rough edges (useEffect in particular.)

[0]: Toolchain pain is definitely a thing. I absolutely hate setting toolchains up. I spent several hours trying to set up an Expo app; curiously, one of the issues I found (which I may be misremembering) is that the .tsx [TypeScript React] extension wasn't actually supported. Definitely found that odd, as you'd assume a React toolkit would support that OOTB.

osrec2 months ago

HTML is often not flexible or capable enough. JS exists for a reason and is an integral part of the web. Without it, you will struggle to express certain things online, and lean JS sites can be really quite nice to use (and are generally indexed well by Google).

Bloated JS sites are a horrible thing, but they almost sideline themselves. I rarely visit a bloated site after an initial bad experience, unless I'm forced.

jraph2 months ago

For documents, you can absolutely have all the structured content in HTML, and add JS to improve things. This way, you have your feature-rich experience, the bot can build its index without having to run this extra JS, and I have my lightweight experience.

Progressive enhancement :-)
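
To make it concrete, a minimal sketch (markup and IDs are hypothetical): the server already ships the full table as plain HTML, and the script only layers a convenience on top, so nothing is lost if it never runs.

    // The page already contains e.g. <table id="prices">...</table> rendered server-side.
    // This script only adds client-side column sorting; without it the content is still readable.
    document.addEventListener('DOMContentLoaded', () => {
      const table = document.getElementById('prices');
      if (!table) return; // no table (or no JS at all): nothing breaks
      for (const th of table.querySelectorAll('th')) {
        th.style.cursor = 'pointer';
        th.addEventListener('click', () => sortByColumn(table, th.cellIndex));
      }
    });

    function sortByColumn(table, index) {
      const rows = [...table.tBodies[0].rows];
      rows.sort((a, b) =>
        a.cells[index].textContent.localeCompare(b.cells[index].textContent));
      table.tBodies[0].append(...rows); // re-appending moves the rows into sorted order
    }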

dlevine2 months ago

I work for a company that enables businesses to drop eCommerce into their websites. When I started, this was done via a script that embedded an iFrame. This wasn't great for SEO, and some competitors started popping up with SEO-optimized products.

Since our core technology is a React app, I realized that we could just mount the React app directly on any path at the customer's domain. I won't get into the exact implementation, but it worked, and our customers' product pages started being indexed just fine. We even ranked competitively with the upstarts who used server-side rendering. We had a prototype in a few months, and then a few months after that we had the version that scaled to 100s of customers.

We then decided to build a new version of our product on Remix (an SSR framework similar to Next.js). It required us to basically start over from scratch, since most of our technologies weren't compatible with Remix. 2 years later, we still aren't quite done. When all is said and done, I'm really curious to see how this new product performs in SEO compared to the existing one.

chrisabrams2 months ago

Given that your Remix version has been ~2 years in development by X number of developers, what are the other expected outcomes? It sounds like potential SEO performance is unknown? Is the development team happy with the choice? I can't recall working somewhere that allowed us to work on a project for two years without releasing to production; how did you get business buy-in?

giraffe_lady2 months ago

You don't need buy-in when they tell you to do it!

Not OP, but I've definitely seen a "leadership has decided on a rewrite into a new technology" project not ship for a couple of years. I doubt it ever shipped; I didn't stay around to find out.

dlevine2 months ago

The other outcomes are a redesign and more configurability plus a bunch of new features. It wasn't really an apples to apples comparison. The non-iFrame version was more of a 1.1, where the new thing we are building is a 2.0. Based on some other projects, I do suspect it would have gone faster if we built it on the old stack.

The development team made the choice to go with Remix (well, the tech lead and VP of engineering). No one had used this tech before. We have subsequently talked about whether it would have been better to do the whole thing with Rails + Hotwire. We have been using this approach elsewhere in our stack, and it seems to be conceptually a lot simpler than rendering JS server-side and then hydrating it.

sanex2 months ago

Currently building something similar and following your path. We render an iframe and are working on bundling the React app into a custom HTML tag and dropping that onto the page. Would be curious to hear more about your experience.

dlevine2 months ago

We ended up using a reverse proxy since we wanted each product page to have a separate URL. Basically /shop/* would resolve to our React app, which rendered the correct page based on the URL. You could configure this pretty easily using NGINX or Apache, but our customers were pretty technically unsophisticated so it was too much work for them to do it this way.
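
For illustration, the proxy is roughly this shape in Node (using Express and http-proxy-middleware here; the host and port are made up, and our actual setup differed):

    // Forward anything under /shop/* to the hosted React app, so product
    // pages live at real URLs on the customer's own domain.
    const express = require('express');
    const { createProxyMiddleware } = require('http-proxy-middleware');

    const app = express();
    app.use('/shop', createProxyMiddleware({
      target: 'https://storefront.example.com', // hypothetical app host
      changeOrigin: true, // send the Host header the upstream app expects
    }));
    app.listen(8080);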

In the end, we built a Wordpress plugin since it turned out that most of our customers used Wordpress. This plugin acted as the reverse proxy. We went a step beyond this and did some cool stuff to let them use a shortcode to render our eCommerce menu within their existing Wordpress template.

One wrinkle with ditching the iFrame was getting our CSS to not conflict with their CSS. I ended up putting our stuff within a shadow DOM, which was a bit of work but ended up working pretty well.
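
The shadow DOM mounting looked roughly like this (simplified; the element and app names are made up):

    // Custom element that mounts the widget inside a shadow root, so the host
    // page's CSS can't leak in and the widget's CSS can't leak out.
    class ShopWidget extends HTMLElement {
      connectedCallback() {
        const root = this.attachShadow({ mode: 'open' });
        const style = document.createElement('style');
        style.textContent = '/* widget styles go here, scoped to the shadow root */';
        const mountPoint = document.createElement('div');
        root.append(style, mountPoint);
        // e.g. ReactDOM.createRoot(mountPoint).render(React.createElement(App));
      }
    }
    customElements.define('shop-widget', ShopWidget);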

38 · 2 months ago

> nextjs

FYI nextjs is notoriously user-hostile and one of the worst pieces of client-side code I've ever seen, second only to Widevine. Who dumps 2 MB of JSON directly into the HTML?

cjblomqvist2 months ago

It's been SSR SPA best practice for a decade at least (when keeping your model data client side).
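
The generic pattern (a sketch, not Next.js's exact output) is to render the markup on the server and embed the same model data as JSON, so the client can hydrate without refetching:

    // Server side: render HTML and serialize the page data alongside it.
    const pageData = { products: [{ id: 1, name: 'Example' }] }; // hypothetical data
    const html = `
      <div id="root"><ul><li>Example</li></ul></div>
      <script type="application/json" id="__APP_STATE__">${JSON.stringify(pageData)}</script>`;
    // (real implementations also escape "</script>" inside the JSON)

    // Client side: read the embedded state before hydrating, no extra request needed.
    const stateEl = document.getElementById('__APP_STATE__');
    const initialState = stateEl ? JSON.parse(stateEl.textContent) : {};
    // e.g. hydrateRoot(document.getElementById('root'), <App {...initialState} />);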

CharlieDigital2 months ago

You can do that and still not dump some really bad markup and JSON. Compare Astro.js output HTML+JS to Next.js and there's a striking difference.

38 · 2 months ago

SSR is when the server creates HTML, not JSON. Next.js is very obviously client-side rendering.

jxi2 months ago

I actually worked on this part of the Google Search infrastructure a long time ago. It's just JSC with a bunch of customizations and heuristics tuned for performance to run at a gigantic scale. There's a lot of heuristics to penalize bad sites, and I spent a ton of time debugging engine crashes on ridiculous sites.

esprehn2 months ago

This hasn't been accurate for a number of years now. They run headless Chrome (as mentioned in the article). No more hacked-up engine or JSC.

emptysea2 months ago

Do you know why they use JSC rather than V8?

esprehn2 months ago

They don't, but at one point you might do that for lower resource consumption.

stonethrowaway2 months ago

Could we trouble you for a blog post? Super curious to read more about this.

ta12197231937 · 2 months ago

Care to explain why you sold your soul? Working for an ad agency, that is.

jessyco2 months ago

This line of questioning doesn't invite any kind of positive conversation. Why not ask more politely, or just change your question so it invites thoughtful answers?

orenlindsey2 months ago

I really think it would be cool if Google started being more open about their SEO policies. Projects like this use 100,000 sites to try to discover what Google does, when Google could just come right out and say it, and it would save everyone a lot of time and energy.

The same outcome is gonna happen either way: either Google says what its policy is, or people spend time and bandwidth figuring it out. Either way, Google's policy becomes public.

Google could even come out and publish stuff about how to have good SEO, and end all those scammy SEO help sites. Even better, they could actively try to promote good things like less JS when possible and less ads and junk. It would help their brand image and make things better for end users. Win-win.

capnjngl2 months ago

I'm sure there's some bias since it's coming from the horse's mouth, but Google does publish this stuff. Their webmaster guidelines have said for years to make content for users, not robots, and some recent updates have specifically addressed some of the AI SEO spam that's flooding the internet[1]. Their site speed tools and guidelines give very specific recommendations on how to minimize the performance impact of javascript[2].

[1] https://developers.google.com/search/docs/fundamentals/creat...

[2] https://developers.google.com/speed/docs/insights/v5/about

sureIy2 months ago

Spam makes transparency impossible. Like you, spammers have to spend months figuring out what works and what doesn't. If Google is clear about it, it just gets abused. You can see this every day with free services: they either have to make things harder for everyone or just succumb.

dplgk2 months ago

Except that spammers have the incentive to spend months figuring it out and normal people don't. So the spammers prevail anyway.

mirkonasato2 months ago

It's from 2019 so things may have changed since, but there's a great video on YouTube explaining "How Google Search indexes JavaScript sites" straight from the horse's mouth: https://youtu.be/LXF8bM4g-J4

StressedDev2 months ago

Google will never tell you how the ranking algorithm works, because if they did, the ranking algorithm would be gamed. Basically, the problem is that a lot of people will try to get less relevant content to rank higher than the best content. If you tell these people how Google's ranker works, they will make Google Search worse because they will learn how to deceive the ranker.

A ranker is a piece of software which determines what results should be shown to a user on the search results page.

TZubiri2 months ago

duh.

The nerve of parent comment telling Google what to do.

encoderer2 months ago

I did experiments like this in 2018 when I worked at Zillow. This tracks with our findings then, with a big caveat: it gets weird at scale. If you have a very large number of pages (hundreds of thousands or millions), Google doesn't just give you limitless crawling and indexing. We had JS content waiting days after being crawled to make it into the index.

Also, competition. In a highly competitive SEO environment like US real estate, we were constantly competing with 3 or 4 other well-funded and motivated companies. A couple of times we tried going dynamic-first with a page and we lost rankings. Maybe it's because FCP was later? I don't know, because we ripped it all out and did it server side. We did use Next.js when rebuilding Trulia, but it's self-hosted and only uses SSR.

dheera2 months ago

I actually think intentionally downranking sites that require JavaScript to render static content is not a bad idea. It also impedes accessibility-related plugins trying to extract the content and present it to the user in whatever way is compatible with their needs.

Please only use JavaScript for dynamic stuff.

dmazzoni2 months ago

> It also impedes accessibility-related plugins trying to extract the content and present it to the user in whatever way is compatible with their needs.

I'm not sure I agree that this is relevant advice today. Screen readers and other assistive technology fully support dynamic content in web pages, and have for years.

Yes, it's good for sites to provide content without JavaScript where possible. But don't make the mistake of conflating the "without JavaScript" version with the accessible version.

niutech2 months ago

Screen readers aren't the only assistive user agents. There are terminal-based web browsers too, like Links and Lynx, which don't support JS.

dheera2 months ago

> Screen readers and other assistive technology

Readers for the blind are not the only form of assistive technology, and unnecessary JS usage where JS is not needed makes it hard to develop new ones.

There is a huge spectrum of needs in between that LLMs will help fulfill. For example, it can be something as simple as needing a paraphrase of each section at the top, removing triggering textual content, translating fancy English into simple English, or answering voice questions about the text like "how many tablespoons of olive oil", etc.

These are all assistive technologies that would highly benefit from having static text be static.

creesch2 months ago

It is also still general overhead, which requires capable devices and a good internet connection, something a lot of developers with very capable computers and fast internet connections tend to overlook.

Specifically, if you are targeting a global audience, there are entire geographic regions where the internet is much, much slower and less reliable. So not only do these people experience slow load times with packet drops and all that, but some JavaScript libraries might not even load. Which isn't a huge deal if your main content does not rely on JavaScript, but of course is if it does.

In addition to that, in these same regions people often access the internet through much cheaper and slower devices.

sureIy2 months ago

> Please only use JavaScript for dynamic stuff.

Pretty sure that ship has sailed in 2015. It’s good to see people focusing on SSR again but that’s just an extra step and it’s hard to mess up. Too many developers don’t think it’s worth it. Just try to visit any top websites without JS, even just to read them.

dheera2 months ago

I do this all the time, because a couple of websites display all the text and then run a script that erases it and replaces it with a stupid paywall, whereas if I disable JS I can just read it.

rvnx2 months ago

Strange article: it seems to imply that Google has no problem indexing JS-rendered pages, and then the final conclusion is "Client-Side Rendering (CSR), support: Poor / Problematic / Slow".

madeofpalk2 months ago

The final recommendation is to use their semi-lock-in product.

meiraleal2 months ago

Vercel needs people to believe they deliver any value for the absurd price of their AWS wrapper.

mdhb2 months ago

Hint: they don’t and their entire business model is actively reliant upon deceiving naive junior developers as far as I can tell.

CharlieDigital2 months ago

They've deceived plenty of non-juniors as well.

See Target.com, Walmart.com, OpenAI.com and any number of major websites.

React (and specifically Next.js) is the new IBM[0]: You’ll never get fired for picking it, but it’s going to be expensive, bloated, difficult to get right, and it’s going to be joyless implementing it every step of the way.

[0] https://chrlschn.dev/blog/2023/02/react-is-the-new-ibm/

elorant2 months ago

Well, it is slow. You have to render the page through a headless browser, which is resource-intensive.

ea016 · 2 months ago

A really great article. However, they tested on nextjs.org only, so it's still possible Google doesn't waste rendering resources on smaller domains.

ryansiddle2 months ago

Martin Splitt mentioned in a LinkedIn post[1], as a follow-up to this, that larger sites may have a crawl budget applied.

> That was a pretty defensive stance in 2018 and, to be fair, using server-side rendering still likely gives you a more robust and faster-for-users setup than CSR, but in general our queue times are significantly lower than people assumed and crawl budget only applies to very large (think 1 million pages or more) sites and matter mostly to those, who have large quantities of content they need updated and crawled very frequently (think hourly tops).

We have also tested smaller websites and found that Google consistently renders them all. What was very surprising about this research is how fast the render occurred after crawling the webpage.

[1] https://www.linkedin.com/feed/update/urn:li:activity:7224438...

globalise83 · 2 months ago

Would be interested to know how well Google copes with web components, especially those using Shadow DOM to encapsulate styles. Anyone have an insight there?

niorad2 months ago

According to them, the shadow DOM and light DOM are flattened.

https://developers.google.com/search/docs/crawling-indexing/...

globalise83 · 2 months ago

Thanks - nice find!

orliesaurus2 months ago

If Google handles front-end JS so well, and the world is basically a customer-centric SEO game to make money - why do we even bother using server-side components in Next.js?

azangru2 months ago

> why do we even bother using server-side components

The naive hope among some web developers is that not everything we do on the web is done for the sole reason of SEO :-) Sometimes we do things in order to improve the end-user experience. Sometimes we even do things to improve website accessibility. Doing stuff either at build time or dynamically on the server allows us to send less JavaScript to the client for faster page loads (well, not necessarily when Next.js is involved; but still), as well as to send the prepared HTML to the browser, enabling progressive enhancement / graceful degradation. It also gives us full control over HTTP status codes instead of always responding with a 200 OK. Plus relevant previews for when the URL is shared in chat apps or social media.
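
To give one concrete example of the status-code point, here's a Next.js Pages Router sketch (fetchPost is a hypothetical helper): because the page is rendered on the server, a missing record can answer with a real 404 instead of a 200 plus a client-rendered "not found" screen.

    export async function getServerSideProps({ params }) {
      const post = await fetchPost(params.slug); // hypothetical data fetch
      if (!post) {
        return { notFound: true }; // the response carries an actual 404 status
      }
      return { props: { post } };
    }

    export default function PostPage({ post }) {
      return <article><h1>{post.title}</h1></article>;
    }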

meiraleal2 months ago

Could you please sum up what you think nextjs adds in this context? I read your comment a few times and still can't see any value added by nextjs

jjordan2 months ago

IMO for most use cases it's really not needed. Clients are fast and traditional single-page applications are easy and straightforward to build.

Vercel is a hosting company, and that was never more apparent than when they enabled App Router (i.e. server-side rendering) by default way before it was actually ready for mainstream adoption.

acdha2 months ago

> Clients are fast and traditional single-page applications are easy and straightforward to build.

None of that is true universally unless your target audience is corporate users with hardwired connections and high-end computers. If you measure performance, very few SPAs manage to approach the performance of a 2010 SSR app and even fewer developers spend the time to handle errors as well as the built-in browser page reloading behavior.

That’s not to say that an SPA can’t be fast or robust, only that the vast majority of developers never get around to doing it.

leerob2 months ago

FYI: The Next.js App Router prerenders by default, meaning pages will be statically generated as HTML. This has been the behavior since Next.js 9 (and the Pages Router). React Server Components, despite the first impression the name gives, can be prerendered and do not require an always-on server to use.
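
For example (a sketch, with getPost as a hypothetical data fetch): an App Router page like the one below is a Server Component, and with generateStaticParams it is rendered to static HTML at build time, so no always-on server is involved.

    // app/blog/[slug]/page.js
    export async function generateStaticParams() {
      return [{ slug: 'hello-world' }, { slug: 'second-post' }]; // hypothetical slugs
    }

    export default async function Page({ params }) {
      const post = await getPost(params.slug); // hypothetical data fetch
      return (
        <article>
          <h1>{post.title}</h1>
        </article>
      );
    }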

solardev2 months ago

Hey leerob,

Thanks for your work on Next! It's still my go-to framework for most projects.

But can I ask, please, what you think the primary advantages of RSCs and the App Router are over Pages?

Is it worth the additional complexity over Pages for most use cases? When does it make sense and when does it not?

solardev2 months ago

Vercel's gotta eat somehow.

niutech2 months ago

Google is not the only search engine - others may have problems with indexing CSR pages.

bbarnett2 months ago

I kinda wish Google would not index JS rendered stuff. The world would be so much better.

StressedDev2 months ago

There are a lot of web sites which will not work without JavaScript. I am not sure why so many people do not like JavaScript but it seems to work well.

The other reason Google will always support JavaScript on web sites is because advertisers (including Google) use it. In fact, I suspect one of the motivations for improving Chrome's JavaScript performance was to make ads work better (i.e. allow ads to run more complex scripts to improve ad targeting, and fight ad fraud).

bux93 · 2 months ago

Do those sites need to be indexed, though? Web applications? Games? What is lost if a newspaper's animated pie charts are not indexed for search? I see assertions in this thread that some websites need JavaScript, but does the JavaScript-mediated content really add something for search? And if it does, is it perhaps content that actually ought not to have been mediated by JS?

mdaniel2 months ago

> I am not sure why so many people do not like JavaScript but it seems to work well.

"the dose makes the poison" springs to mind: some interactivity or XHR is often a win for UX, but it also opens up negligent actor's ability to completely torpedo a site for any number of silly reasons. Take the most common thing one wishes to use a web browser for: reading content. Clicking on a link and staring at a blank page because 85mb of JS is still loading, and when it does load it makes swooshing things that hijack the scroll behavior is very likely why so many people do not like JavaScript

revskill2 months ago

Like Hacker News?

enjoyyourlife2 months ago

HN loads without JavaScript

tempor6767 · 2 months ago

Amen to that!

kawakamimoeki2 months ago

That's right.

EcommerceFlow2 months ago

They tested Google's ability to index and render JS, but not how well those sites ranked. As an SEO, I know those results would look completely different. When you're creating content to monetize, the thought process is "why risk it?" with JS.

rstupek2 months ago

What does your experience tell you about Wix websites, which return 100% JavaScript and render the content entirely with JavaScript?

lobsterthief2 months ago

We have a customer who recently migrated from Wix to our headless CMS offering (which is WP/Next.js based) and their organic search traffic has improved 30% over the course of a month. But we believe this was largely due to CWV/speed improvements; impossible to parse out how much was due to Wix’s JS-based architecture itself.

EcommerceFlow2 months ago

Wouldn't want to presume without seeing any data and barely using Wix, but I wouldn't say Wix is known as a ranking hero platform lol.

TZubiri2 months ago

It's not a coincidence Google developed Chrome. They needed to understand what the fuck they were looking at, so they were developing a JS + DOM parser anyways.

zooq_ai2 months ago

The CTO of Vercel was a core Google Search Developer. He could have just written the article without any research :), but I guess that would be a professional breach of ethics / trust and the audience wouldn't have believed it anyway.

azangru2 months ago

Here's Google search team talking about this in a podcast: https://search-off-the-record.libsyn.com/rendering-javascrip...

llmblockchain2 months ago

This blog post is akin to Philip Morris telling us smoking isn't bad. Go on, try it.

toastal2 months ago

Just a reminder: there are alternative search engines, & most of those don't execute JavaScript when crawling & should still be a target.

sylware2 months ago

Headless Blink, like the new bots: headless Blink with mouse & keyboard driven by an AI (I don't think Google has click farms with real humans like hackers do).

kawakamimoeki2 months ago

I have noticed that more and more blog posts from yesteryear are not appearing in Google's search results lately. Is there an imbalance between content ratings and website ratings?

jeffbee2 months ago

Did you check whether the origin sites still exist? Google is not Internet Archive.

kawakamimoeki2 months ago

Yes. Adjusting the query quite finely will give you a hit. That's the information I really want, but I can't get to it with a simple query.

DataDaemon2 months ago

This is a great self-promotion article, but everyone knows Googlebot is busy; give it content generated on the server right away, or don't bother Googlebot at all.