Overview
This article covers ways to improve performance when using the Puppeteer library.
The example code was written against version 19.4.1, so it may not work as-is on older or newer versions.
Way 1: Skip unnecessary resources
Most resource types are not needed to scrape a web page.
In particular, a page spends a lot of time loading stylesheets (CSS) and images.
Example code
typescript
import puppeteer from 'puppeteer'

const browser = await puppeteer.launch({
  devtools: true,
  headless: false,
})
// Open page
const page = await browser.newPage()
// Interception must be enabled before requests can be aborted or continued
await page.setRequestInterception(true)
page.on('request', (req) => {
  if (req.resourceType() === 'stylesheet'
    || req.resourceType() === 'image'
    || req.resourceType() === 'media'
    || req.resourceType() === 'font'
    || req.resourceType() === 'manifest'
    || req.resourceType() === 'websocket'
  ) {
    req.abort() // Skip resource types that scraping does not need
  } else {
    req.continue()
  }
})
req.abort(): cancels the request, so the resource is not loaded.
req.continue(): lets the request proceed as normal.
Major resource types
Type | Description |
---|---|
stylesheet | CSS stylesheets |
image | Images (JPEG, PNG, GIF, ...) |
media | Media (video, audio) |
font | Fonts |
manifest | Web app manifest |
websocket | WebSocket connections |
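Since the abort-or-continue decision depends only on the resource type, it can be pulled into a small function and checked without launching a browser. The following is a minimal sketch, not Puppeteer's own API: `InterceptableRequest`, `shouldBlockType`, and `handleRequest` are hypothetical names, and the interface lists only the methods the handler calls, which Puppeteer's request object is assumed to provide.

```typescript
// Hypothetical minimal shape of a request; Puppeteer's HTTPRequest is assumed to satisfy it.
interface InterceptableRequest {
  resourceType(): string
  isInterceptResolutionHandled(): boolean
  abort(): Promise<void>
  continue(): Promise<void>
}

// Resource types from the example above that scraping usually does not need.
const BLOCKED_TYPES: ReadonlySet<string> = new Set([
  'stylesheet',
  'image',
  'media',
  'font',
  'manifest',
  'websocket',
])

// Pure decision function: true when a request of this type should be aborted.
function shouldBlockType(resourceType: string): boolean {
  return BLOCKED_TYPES.has(resourceType)
}

// Request handler: does nothing if another handler already resolved the request.
function handleRequest(req: InterceptableRequest): void {
  if (req.isInterceptResolutionHandled()) {
    return
  }
  if (shouldBlockType(req.resourceType())) {
    void req.abort()
  } else {
    void req.continue()
  }
}

// Intended usage with Puppeteer (not executed here):
//   await page.setRequestInterception(true)
//   page.on('request', handleRequest)
```

Keeping the blocked types in one set means a type can be added or removed in a single place, and the `isInterceptResolutionHandled()` guard avoids resolving the same request twice when more than one handler is registered.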
Way 2: Block requests to specific URLs
You can block requests to specific URLs that are not needed for scraping.
Example code
typescript
import puppeteer from 'puppeteer'

const blocked_domains = [
  'googlesyndication.com',
  'adservice.google.com',
]
const browser = await puppeteer.launch({
  devtools: true,
  headless: false,
})
// Open page
const page = await browser.newPage()
await page.setRequestInterception(true)
page.on('request', (req) => {
  if (req.url().endsWith('.png') || // Block URLs ending in .png
    req.url().endsWith('.jpg') || // Block URLs ending in .jpg
    blocked_domains.some(domain => req.url().includes(domain)) // Block the domains listed above
  ) {
    req.abort()
  } else {
    req.continue()
  }
})
req.url(): returns the full URL of the request as a string.
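Plain substring checks on the full URL can misfire: `endsWith('.png')` misses an image URL that carries a query string, and `includes(domain)` also matches a domain that only appears inside a query parameter. As a hedged alternative, the sketch below parses the URL first and compares the path and hostname separately. `shouldBlockUrl`, `BLOCKED_DOMAINS`, and `BLOCKED_EXTENSIONS` are hypothetical names introduced for illustration.

```typescript
const BLOCKED_DOMAINS = ['googlesyndication.com', 'adservice.google.com']
const BLOCKED_EXTENSIONS = ['.png', '.jpg']

// Returns true when the URL points at a blocked file extension or a blocked domain.
function shouldBlockUrl(rawUrl: string): boolean {
  let url: URL
  try {
    url = new URL(rawUrl)
  } catch {
    // Not a parseable absolute URL: let the request through rather than guess.
    return false
  }

  // Compare against the path only, so a query string does not hide the extension.
  const path = url.pathname.toLowerCase()
  if (BLOCKED_EXTENSIONS.some((ext) => path.endsWith(ext))) {
    return true
  }

  // Match the domain itself or any of its subdomains, but nothing else.
  const host = url.hostname.toLowerCase()
  return BLOCKED_DOMAINS.some((domain) => host === domain || host.endsWith('.' + domain))
}

// Intended usage inside the request handler (not executed here):
//   page.on('request', (req) => (shouldBlockUrl(req.url()) ? req.abort() : req.continue()))
```

Whether the stricter matching is worth it depends on the target site; for a quick script the substring checks may be enough.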
Way 3: Change launch arguments
You can disable Chromium features that you don't need by passing command-line arguments at launch. Many arguments are available, so consult the full list of Chromium command-line switches. This article covers some useful ones.
typescript
import puppeteer from 'puppeteer'

const args = [
  '--autoplay-policy=user-gesture-required',
  '--disable-background-networking', // Disable several subsystems that run network requests in the background. Intended for network performance testing, to avoid noise in the measurements.
  '--disable-background-timer-throttling', // Disable throttling of timer tasks from background pages.
  '--disable-backgrounding-occluded-windows', // Disable backgrounding of renderers for occluded windows. Used in tests to avoid nondeterministic behavior.
  '--disable-breakpad', // Disable crash reporting.
  '--disable-dev-shm-usage', // May be needed when running in Docker or a VM; see https://pptr.dev/troubleshooting/#tips
  '--disable-domain-reliability',
  '--disable-extensions',
  '--disable-hang-monitor', // Suppress hang monitor dialogs in renderer processes. This may allow slow unload handlers on a page to prevent the tab from closing, but the Task Manager can be used to terminate the offending process in that case.
  '--disable-ipc-flooding-protection',
  '--disable-notifications',
  '--disable-offer-store-unmasked-wallet-cards',
  '--disable-popup-blocking',
  '--disable-print-preview',
  '--disable-prompt-on-repost',
  '--disable-renderer-backgrounding',
  '--disable-setuid-sandbox',
  '--disable-speech-api', // Disable the Web Speech API (both speech recognition and synthesis).
  '--disable-sync',
  '--hide-scrollbars', // Hide scrollbars.
  '--ignore-gpu-blocklist', // Named --ignore-gpu-blacklist in older Chromium versions.
  '--metrics-recording-only',
  '--mute-audio',
  '--no-default-browser-check',
  '--no-first-run',
  '--no-pings', // Don't send hyperlink auditing pings.
  '--no-sandbox', // Disable the sandbox for all process types that are normally sandboxed. Meant to be used as a browser-level switch for testing purposes only.
  '--no-zygote', // Disable the zygote process used for forking child processes.
  '--password-store=basic',
  '--use-gl=swiftshader',
  '--use-mock-keychain',
]
const browser = await puppeteer.launch({
  devtools: true,
  headless: false,
  args,
})
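Not every flag suits every environment. For example, the sandbox and `/dev/shm` flags are mostly relevant inside containers. One possible way to keep this manageable is to assemble the list from groups, as in the sketch below. The grouping, the `inContainer` option, and the names `buildArgs`, `BASE_ARGS`, and `CONTAINER_ARGS` are assumptions made for illustration; only the flags themselves come from the list above.

```typescript
interface ArgOptions {
  inContainer: boolean // hypothetical switch: true when running in Docker or a similar container
  extra?: string[] // additional flags supplied by the caller
}

// Flags that are assumed here to be harmless in any environment.
const BASE_ARGS = ['--disable-extensions', '--disable-sync', '--mute-audio', '--no-first-run']

// Flags that are assumed here to be needed only inside a container.
const CONTAINER_ARGS = ['--disable-dev-shm-usage', '--no-sandbox', '--disable-setuid-sandbox']

// Builds the final argument list and removes duplicates while keeping the order.
function buildArgs({ inContainer, extra = [] }: ArgOptions): string[] {
  const containerArgs = inContainer ? CONTAINER_ARGS : []
  const all = BASE_ARGS.concat(containerArgs, extra)
  return Array.from(new Set(all))
}

// Intended usage with Puppeteer (not executed here):
//   const browser = await puppeteer.launch({ args: buildArgs({ inContainer: true }) })
```

Leaving `--no-sandbox` out unless it is required keeps Chromium's sandbox active on machines where it works.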
Way 4: Cache resources in a directory
Setting the userDataDir option makes Chromium keep its profile data, including cached resources, in the given directory, so that later runs can reuse them.
Example code
typescript
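import puppeteer from 'puppeteer'

const browser = await puppeteer.launch({
  userDataDir: './my/path', // Directory where Chromium stores its profile data, including the cache
})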
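One way to check that the cache is actually reused is to count how many responses are served from it on a second run. The sketch below assumes the response object exposes `fromCache()`, as Puppeteer's `HTTPResponse` does. `CacheAwareResponse` and `createCacheCounter` are hypothetical names introduced for illustration.

```typescript
// Hypothetical minimal shape of a response; Puppeteer's HTTPResponse is assumed to satisfy it.
interface CacheAwareResponse {
  fromCache(): boolean
}

interface CacheStats {
  cached: number // responses served from the browser cache
  network: number // responses fetched over the network
}

// Creates a counter that tallies cached and network responses.
function createCacheCounter() {
  const stats: CacheStats = { cached: 0, network: 0 }
  return {
    stats,
    record(res: CacheAwareResponse): void {
      if (res.fromCache()) {
        stats.cached += 1
      } else {
        stats.network += 1
      }
    },
    // Fraction of responses served from cache; 0 when nothing has been recorded yet.
    hitRatio(): number {
      const total = stats.cached + stats.network
      return total === 0 ? 0 : stats.cached / total
    },
  }
}

// Intended usage with Puppeteer (not executed here):
//   const counter = createCacheCounter()
//   page.on('response', (res) => counter.record(res))
//   await page.goto(url)
//   console.log(counter.hitRatio())
```

A hit ratio that stays near zero on repeated runs suggests the directory is not being reused, for example because the path changes between runs.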