# Puppeteer Ripper

The Puppeteer ripper is an alternative to the Phantom and Jsoup rippers.

It uses Puppeteer (a headless Chrome API for Node.js) as its core technology.

This article describes how to install, configure, and use the Puppeteer ripper.

# Installation

To get the mrf-puppeteer command on your machine, pull MarfeelXP and execute the following in a terminal:

mrf-env -R
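
Once the environment is set up, you can check that the command is available by listing its supported flags:

mrf-puppeteer --help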

# Available commands

  • mrf-puppeteer extract executes the Puppeteer ripper to extract tenant data.
  • mrf-puppeteer launch starts a headless Chrome instance. By default, the mrf-puppeteer extract command tries to connect to this browser instance; if the connection fails, it launches its own browser instance.
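
For example, you can start a long-lived browser instance first and then run extractions against it; the URI and script path below are placeholders (both flags are described in the next section):

mrf-puppeteer launch
mrf-puppeteer extract --uri http://tenant.com/section1 --scriptPath ~/path/to/whiteCollarScript.js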

# mrf-puppeteer extract command flags

To see all supported flags, use the mrf-puppeteer --help command.

# --uri

Required. Followed by the URI of the web page that the headless Chrome instance will open to perform the content extraction.

mrf-puppeteer extract --uri http://tenant.com/section1

# --scriptPath

Required. Absolute path to the WhiteCollar script that will be injected into the page.

mrf-puppeteer extract --scriptPath ~/path/to/whiteCollarScript.js

# --dev

Activates dev mode: the browser UI opens so you can visually follow the command execution. Useful for debugging, since you get access to all of the JavaScript injected into the page.

mrf-puppeteer extract --dev
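
Since --uri and --scriptPath are required, a typical dev-mode invocation combines them with --dev (the values below are placeholders):

mrf-puppeteer extract --dev --uri http://tenant.com/section1 --scriptPath ~/path/to/whiteCollarScript.js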

# --metadataProviderFiles

Injects metadata provider files into the page. It must be followed by a comma-separated list of absolute paths to the metadata provider files.

mrf-puppeteer extract --metadataProviderFiles ~/path/to/metadata/provider1.js,~/path/to/metadata/provider2.js

TIP

This is only needed for local testing; in production, the metadata is always available.

# --enableExternalScriptRequests

Enables external scripts to load.

By default, mrf-puppeteer prevents external scripts (from other domains) from loading.

WARNING

This flag should only be used in very rare edge cases; whiteListedDomains should be enough. Before adding it, check with the content-platform chapter.

mrf-puppeteer extract --enableExternalScriptRequests true

# --whiteListedDomains

Enables external scripts to be loaded from specific domains. It must be followed by a comma-separated list of domains.

Useful when the page uses libraries hosted on CDN servers (e.g. jQuery) that are required for the page to function correctly.

mrf-puppeteer extract --whiteListedDomains domain1.com,domain2.com

# --enableImageRequests

Enables image file requests; by default, mrf-puppeteer disables them for performance reasons.

mrf-puppeteer extract --enableImageRequests true

# --enableFontRequests

Enables font file requests; by default, mrf-puppeteer disables them for performance reasons.

mrf-puppeteer extract --enableFontRequests true

# --enableStylesheetRequests

Enables stylesheet file requests; in mrf-puppeteer these are enabled by default.

mrf-puppeteer extract --enableStylesheetRequests true

# --pageSelectorPrev

The CSS selector used to detect the link to the previous page; the default is [rel='prev'].

# --pageSelectorNext

The CSS selector used to detect the link to the next page; the default is [rel='next'].
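
As an illustration, a tenant whose pagination links carry no rel attributes could point the ripper at its own markup; the selector below is hypothetical:

mrf-puppeteer extract --pageSelectorNext "a.pagination-next"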

# --sortItemsByDOM

Uses DOM order to sort the extracted items. By default, the order is based on relevance.

mrf-puppeteer extract --sortItemsByDOM true

# --waitUntil

Configures the point at which the extraction starts. The default is domcontentloaded: the headless Chrome instance waits for the DOM to be loaded before starting the extraction process.

Possible values:

  • domcontentloaded: When the DOMContentLoaded event is fired.
  • load: When the load event is fired.
  • networkidle0: When there are no more than 0 network connections for at least 500ms.
  • networkidle2: When there are no more than 2 network connections for at least 500ms.

mrf-puppeteer extract --waitUntil networkidle2

# --userAgent

Configures the userAgent to be used during the extraction. By default, the headless Chrome instance sets its own default userAgent.

Possible values:

  • mobile: Sets the mobile userAgent.
  • marfeel: Sets Marfeel-crawler as the userAgent.
  • Custom string: any other value is used as-is as the userAgent.

mrf-puppeteer extract --userAgent "mobile"

or, for a custom string:

mrf-puppeteer extract --userAgent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"

# Enable Puppeteer Ripper

To enable the Puppeteer ripper for a tenant, set the feedRipper property in the configuration of definition.json:

"configuration": {
    ...
    "feedRipper": "puppeteerRipper",
    ...
}

or in the section definition to enable it for a specific section:

 "sectionDefinitions": [
    {
        ...
        "feedDefinitions": [
            {
                ...
                "alibabaDefinition" : {
                    "configuration" : {
                        "feedRipper": "puppeteerRipper"
                    }
                }    
                ...
            }
        ]
    }
]

# Puppeteer Ripper Flags in Production

To enable the flags mentioned above in production, use the following syntax in the configuration of definition.json:

"puppeteerRipper:<FLAG>": "<VALUE>"
"configuration": {
    ...
"puppeteerRipper:whiteListedDomains":"randomDomain.com"
    ...
}
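
Other flags documented above follow the same pattern; for instance (the values are illustrative, not recommendations):

"configuration": {
    ...
    "puppeteerRipper:waitUntil": "networkidle2",
    "puppeteerRipper:sortItemsByDOM": "true"
    ...
}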

# User Interaction Library

The User Interaction Library is an API registered on the window object and accessible through window.userInteraction; it enables Puppeteer to use the scrollPage function.

It allows Puppeteer to simulate a user scrolling to the end of the document so that all content (even lazy-loaded content) is loaded by the time extraction starts.

To enable it, scrollPage has to be called in the WhiteCollar setup function.

Example of usage:

async setupFunction(callback) {
  await window.userInteraction.scrollPage(350, 200, 10);
  return callback();
}

# scrollPage function

window.userInteraction.scrollPage(pageScrollAwaitPeriod, articleLoadAwaitInterval, maxArticleLoadScrolls): Promise

Scrolls to the bottom of the DOM, waits for new content to load, and checks whether more content was loaded. If so, it repeats the iteration.

Configuration parameters:

# pageScrollAwaitPeriod

Defines the wait time (for articles to load) after the first scroll.

Default value: 350ms

# articleLoadAwaitInterval

Defines the wait time (for articles to load) once the end of the document is reached.

Default value: 2000ms

# maxArticleLoadScrolls

Defines the maximum number of content loads allowed.

Default value: 10
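
Putting the parameters together, a WhiteCollar setup that scrolls more patiently on a slow-loading page might look like this (the values are illustrative, not recommendations):

async setupFunction(callback) {
  // Wait 500ms for articles after the first scroll, 3000ms once the end of
  // the document is reached, and allow at most 20 content loads.
  await window.userInteraction.scrollPage(500, 3000, 20);
  return callback();
}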