4 ways to improve puppeteer performance

4 ways to improve puppeteer performance

2023-01-07
3 min read

Overview

The article is talking about how to improve performance or speed when you are using puppeteer library.
Example codes are under 19.4.1 version, so lower or higher version may not work properly.

1 Way: pass unnecessary resources

Most types of resource are not important to scrap the web page.
Especially, Website spends a lot of time to load stylesheet such as CSS and optimize images.

Example code

typescript
const browser = await puppeteer.launch({
    devtools: true,
    headless: false,
  })
  // Open page
  const page = await browser.newPage()
  await page.setRequestInterception(true)

  page.on('request', (req) => {
    if (req.resourceType() === 'stylesheet'
      || req.resourceType() === 'image'
      || req.resourceType() === 'media'
      || req.resourceType() === 'font'
      || req.resourceType() === 'manifest'
      || req.resourceType() === 'websocket'
    ) {
      req.abort()
    } else {
      req.continue()
    }
  })

req.abort()

It passes to load

req.continue()

It continues to load

Major Resource type

TypeDescription
stylesheetStylesheet
imageImages (jpeg, png, gif...)
mediamedia (Video, Audio)
fontFont family
websocketwebsocket

2 way: block to request with specific URL

You can prevent to load specific URL that is not required to scrap

Example code

typescript
const blocked_domains = [
  'googlesyndication.com',
  'adservice.google.com',
]

const browser = await puppeteer.launch({
    devtools: true,
    headless: false,
  })
  // Open page
  const page = await browser.newPage()
  await page.setRequestInterception(true)

  page.on('request', (req) => {
    if (req.url().endsWith('.png') || // Prevent to load end with .png
      req.url().endsWith('.jpg'), // Prevent to load end with .png
      blocked_domains.some(domain => req.url().includes(domain)) // Block domains that you intialize
  ) {
      req.abort()
    } else {
      req.continue()
    }
  })

req.url()

It returns string type of full URL.

3 way: change arguments

You can disable some settings that you don't need. There are a lot of arguments, so look at list. The article is mentioning some useful arguments.

typescript
const args = [
  '--autoplay-policy=user-gesture-required',
  '--disable-background-networking', // Disable several subsystems which run network requests in the background. This is for use when doing network performance testing to avoid noise in the measurements
  '--disable-background-timer-throttling', // Disable task throttling of timer tasks from background pages. 
  '--disable-backgrounding-occluded-windows', // Disable backgrounding renders for occluded windows. Done for tests to avoid nondeterministic behavior.
  '--disable-breakpad', // Disables the crash reporting
  '--disable-dev-shm-usage', // You may need when you use Docker(VM) https://pptr.dev/troubleshooting/#tips
  '--disable-domain-reliability',
  '--disable-extensions',
  '--disable-hang-monitor', // Suppresses hang monitor dialogs in renderer processes. This may allow slow unload handlers on a page to prevent the tab from closing, but the Task Manager can be used to terminate the offending process in this case
  '--disable-ipc-flooding-protection',
  '--disable-notifications',
  '--disable-offer-store-unmasked-wallet-cards',
  '--disable-popup-blocking',
  '--disable-print-preview',
  '--disable-prompt-on-repost',
  '--disable-renderer-backgrounding',
  '--disable-setuid-sandbox',
  '--disable-speech-api', // Disables the Web Speech API (both speech recognition and synthesis). 
  '--disable-sync',
  '--hide-scrollbars', //Hide scrollbar
  '--ignore-gpu-blacklist',
  '--metrics-recording-only',
  '--mute-audio',
  '--no-default-browser-check',
  '--no-first-run',
  '--no-pings', // Don't send hyperlink auditing pings
  '--no-sandbox',
  '--no-zygote', // Disables the sandbox for all process types that are normally sandboxed. Meant to be used as a browser-level switch for testing purposes only
  '--password-store=basic',
  '--use-gl=swiftshader',
  '--use-mock-keychain',
]
const browser = await puppeteer.launch({
    devtools: true,
    headless: false,
    args,
})

4 way: Cache resources into Directory

Using userDataDir option enable you to cache your resources.

Example code

typescript
const browser = await puppeteer.launch({
  userDataDir: './my/path'
})

Ref

Free open source project made by Youngjin