But we don't use just any web browser: we use Chrome, specifically a headless version of Chrome that we have customized for our own needs. We made it more performant, more resilient, more scalable, and more.
If you have been using Puppeteer on your own, you will know it comes with lots of intricacies and pain points. One of those is ensuring all of your content is fully loaded before outputting your result, whether that is a PDF or an image. What is the best way to ensure your content is all there?
Puppeteer has an option called waitUntil that accepts several values. These values change how and when Puppeteer considers the page finished, and therefore when it returns the result.
Below are the options currently available as of this writing:
- load: consider navigation to be finished when the load event is fired.
- domcontentloaded: consider navigation to be finished when the DOMContentLoaded event is fired.
- networkidle0: consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.
- networkidle2: consider navigation to be finished when there are no more than 2 network connections for at least 500 ms.
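As a minimal sketch of how one of these values is passed to Puppeteer, the helper below forwards waitUntil to page.goto and then renders a PDF. The helper name renderPdf and the default of "load" are assumptions made for illustration; page is expected to be a Puppeteer Page, for example one returned by browser.newPage().

```javascript
// The four waitUntil values listed above.
const WAIT_UNTIL_OPTIONS = ["load", "domcontentloaded", "networkidle0", "networkidle2"];

// Hypothetical helper: navigate with the chosen waitUntil value, then render a PDF.
// `page` is assumed to be a Puppeteer Page, e.g.:
//   const puppeteer = require("puppeteer");
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
async function renderPdf(page, url, waitUntil = "load") {
  if (!WAIT_UNTIL_OPTIONS.includes(waitUntil)) {
    throw new Error(`Unknown waitUntil value: ${waitUntil}`);
  }
  // Navigation resolves only once the chosen condition is met.
  await page.goto(url, { waitUntil });
  // Render whatever is in the page at that moment.
  return page.pdf({ format: "A4" });
}
```

Swapping the third argument is all it takes to compare how the same page renders under each option.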
So that raises the question: why doesn't it just wait until it's "finished" and render the result? After all, when I go to a page, the browser loads the page and then it's done, right? Well, no, actually. The progress bar in your browser may have stopped, and the page may appear to have finished, but in reality many websites are still holding open connections to the server.
These connections are used to give you real-time updates, notifications, and things like that. They are necessary, and although not every website uses them, many do. Because of this, the browser has no definitive way of knowing whether the website is actually finished. There are, however, several indicators you can use to decide whether you "think" it's finished.
So you have load and domcontentloaded, mentioned above. These are based on static events and will be very consistent. However, if you are getting inconsistent content loading using those events, you will want to move on to the more heuristic-based options. That's what you would use networkidle0 and networkidle2 for. Since these are heuristic-based, they are not perfect and will not cover every scenario.
Note: We discuss some edge cases these options don't cover, and what you can do about them, further below.
So when should you use each one?
networkidle0 is specifically tailored for SPA-based applications or applications whose code explicitly closes its connections when finished. For example, anything that uses
networkidle2 is tailored more towards pages that use streams or long-lived connections, such as polling or background tasks that involve network connections. It's important to note that if the website keeps more than 2 active connections open, this option will time out and still indicate that the page is completed.
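If you are running plain Puppeteer yourself, page.goto rejects when the navigation timeout elapses instead of returning. A rough sketch of the "time out, but render anyway" behavior is shown below. The helper name gotoOrRenderAnyway and the 30 second default are assumptions, and the check relies on Puppeteer's timeout errors carrying the name "TimeoutError".

```javascript
// Hypothetical helper: navigate with a network-idle heuristic, but treat a
// navigation timeout as "good enough" so the caller can still render the page.
// `page` is assumed to be a Puppeteer Page.
async function gotoOrRenderAnyway(page, url, { waitUntil = "networkidle2", timeout = 30000 } = {}) {
  try {
    await page.goto(url, { waitUntil, timeout });
    return { timedOut: false };
  } catch (err) {
    // Assumption: Puppeteer timeout errors have name "TimeoutError".
    if (err && err.name === "TimeoutError") {
      return { timedOut: true };
    }
    // Anything else (DNS failure, invalid URL, ...) is a real error.
    throw err;
  }
}
```

The returned timedOut flag lets the caller log or flag renders that may be incomplete rather than silently treating them as finished.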
There are some edge cases that none of these options will fix. This is by no means a complete list, but these are the most common:
- Lazy loaded images
- Lazy loaded content based on scrolling position
- Videos, or animated content
For lazy-loaded images and lazy-loaded content, the fix is relatively simple: scroll the content of the page to the end and render the result. You would also most likely want to use either networkidle0 or networkidle2 in conjunction. One catch to this solution is that infinitely scrolling sites could cause a memory exception or an excessively large rendering, so you will want to build in some techniques to prevent that from occurring. Alternatively, you could use our service, where we have done all the hard work for you and all you have to do is pass in the options.
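One way to sketch the scrolling fix in plain Puppeteer, with a guard against infinite scrolling, is to scroll in steps and stop either when the page height stops growing or when a step limit is reached. The helper name scrollToEnd, the maxScrolls cap of 20, and the 250 ms pause are all assumptions chosen for illustration; real pages may need different values.

```javascript
// Hypothetical helper: scroll to the bottom of the page in steps so that
// lazy-loaded content gets a chance to load. Returns the number of scrolls
// performed. `page` is assumed to be a Puppeteer Page.
async function scrollToEnd(page, { maxScrolls = 20, pauseMs = 250 } = {}) {
  let previousHeight = 0;
  for (let i = 0; i < maxScrolls; i++) {
    // Measure the current document height inside the browser.
    const height = await page.evaluate(() => document.body.scrollHeight);
    // The height did not grow since the last scroll: nothing new was loaded.
    if (height === previousHeight) return i;
    // Jump to the current bottom of the page.
    await page.evaluate((y) => window.scrollTo(0, y), height);
    // Give lazy loaders a moment to fetch and insert new content.
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
    previousHeight = height;
  }
  // The cap was hit; the page may scroll indefinitely.
  return maxScrolls;
}
```

The maxScrolls cap is what keeps an infinitely scrolling site from growing the page without bound before the render.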
scrollTo will simulate scrolling the page. It attempts to scroll down the page, which causes the lazy-loaded elements on the page to render themselves. If you combine this with networkidle2, this could give you the result you are looking for, but each scenario behaves differently, so there is no guarantee.
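For readers working with plain Puppeteer rather than the service option described above, a rough approximation of "scroll, then wait for the network to settle" might look like the sketch below. It is not the service's implementation. The helper name settleThenScreenshot and the default timings are assumptions, and it relies on page.waitForNetworkIdle, which is available in recent Puppeteer versions and rejects if the network never goes idle within the timeout.

```javascript
// Hypothetical helper: scroll to the bottom, wait for network activity to
// settle, then capture the full page. `page` is assumed to be a Puppeteer Page.
async function settleThenScreenshot(page, { idleTime = 500, timeout = 15000 } = {}) {
  // Trigger lazy loaders by jumping to the bottom of the document.
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  // Wait until the network has been quiet for `idleTime` ms (or time out).
  await page.waitForNetworkIdle({ idleTime, timeout });
  // Capture the whole scrollable page, not just the viewport.
  return page.screenshot({ fullPage: true });
}
```

As with the waitUntil heuristics, this only improves the odds that lazy content is present; it does not guarantee it.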