# Section pages extraction

Section pages are the first navigation level of every tenant. They list the different articles that compose the site. They are extracted through Marfeel Alibaba.

Generally, section pages are transformed according to these principles:

  • One Marfeel section page can contain several Desktop section pages: this is configured in the tenant's site repository.
  • Section pages can contain lists of articles, widgets, and static content.

Specifically, a section processed by Alibaba goes through a ripper: the tool that grabs the data from the tenant's site to generate the section page.

Three rippers are currently available:

  1. WhiteCollar ripper
  2. Jsoup ripper
  3. MarfeelPress ripper

TIP

Different sections of the same tenant can use different rippers. Choose the one that best fits each section's characteristics.
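
As a rough illustration of that choice, the sketch below encodes the criteria described for each ripper on this page as a small decision function. The `SectionTraits` fields, the `choose_ripper` helper, and the order of the checks are assumptions made for the example; they are not part of Alibaba or of any tenant configuration.

```python
from dataclasses import dataclass
from enum import Enum


class Ripper(Enum):
    WHITE_COLLAR = "WhiteCollarRipper"
    JSOUP = "JsoupRipper"
    MARFEEL_PRESS = "MarfeelPressRipper"


@dataclass
class SectionTraits:
    # Hypothetical flags describing one section; not real configuration keys.
    is_dynamic: bool = False          # the section is a dynamic section
    uses_marfeel_press: bool = False  # the tenant runs the MarfeelPress plugin
    needs_javascript: bool = False    # content only appears after JavaScript runs


def choose_ripper(traits: SectionTraits) -> Ripper:
    """Pick a ripper for one section, following the rules on this page."""
    if traits.is_dynamic:
        # Dynamic sections must be extracted using JSOUP.
        return Ripper.JSOUP
    if traits.uses_marfeel_press:
        # MarfeelPress tenants get the most performant ripper.
        return Ripper.MARFEEL_PRESS
    if traits.needs_javascript:
        # Only a headless browser sees content rendered by JavaScript.
        return Ripper.WHITE_COLLAR
    # Assumption for this sketch: plain static HTML goes to the lighter ripper.
    return Ripper.JSOUP
```

For example, `choose_ripper(SectionTraits(needs_javascript=True))` returns `Ripper.WHITE_COLLAR`, while a section with no flags set falls through to `Ripper.JSOUP`.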

# WhiteCollarRipper

Currently the most commonly used ripper. It works well for most section pages, and it is required when a page needs JavaScript to load all of its content.

Behind the scenes, whiteCollar uses Puppeteer as a headless browser at extraction time, against the tenant's desktop website. This gives us access to the tenant's HTML after its JavaScript and CSS have loaded.

Because whiteCollar runs in a headless browser, we can define and execute JavaScript functions in our whiteCollar files. These functions manipulate the DOM to grab the data needed for the extraction.

TIP

Existing whiteCollar files often mix layouts and extractions. For new repositories, follow the recommendations in the whiteCollar documentation.

TIP

The migration to Puppeteer is not complete, so the WhiteCollar ripper can also use PhantomJS, a deprecated headless browser.

# JsoupRipper

Based on JSOUP, a Java library that provides an API for extracting and manipulating data using the DOM. It fetches and parses the HTML from a URL and lets you query and manipulate it.

Unlike a headless browser, it does not execute any JavaScript or apply any CSS. This is its strength: it only grabs the HTML, which makes the extraction more efficient, and it works well when all the data we want is already in the HTML.
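
To make the "HTML only" behavior concrete, here is a minimal sketch that uses Python's standard `html.parser` as a stand-in for JSOUP (it is not the JSOUP API, and the markup and URLs are invented for the example). The parser collects the links present in the static markup, and the link that the inline script would add in a real browser never shows up because the script is not executed.

```python
from html.parser import HTMLParser

# Invented sample page: two articles in the markup, one more added by script.
STATIC_HTML = """
<section>
  <article><a href="/news/first">First article</a></article>
  <article><a href="/news/second">Second article</a></article>
</section>
<script>
  const a = document.createElement("a");
  a.href = "/news/injected-by-js";
  a.textContent = "Injected by JavaScript";
  document.querySelector("section").appendChild(a);
</script>
"""


class ArticleLinkParser(HTMLParser):
    """Collects (href, text) pairs for every <a href> found in the markup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Script bodies also arrive here, but no <a> is open at that point.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = ArticleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_links(STATIC_HTML)` yields only the two links written in the markup. A section whose articles are injected by JavaScript would need a headless browser instead.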

Dynamic sections

In Marfeel, dynamic sections must be extracted using JSOUP. Check the dynamic sections creation guide for details.

# MarfeelPressRipper

Specific to tenants using the MarfeelPress plugin, it takes advantage of WordPress's API and its JSON rendition of pages.

It extracts the content directly from the WordPress database, which makes it the most performant ripper. It should be used on all tenants that use WordPress.
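
The sketch below shows why a JSON rendition is cheap to consume: there is no markup to scrape, only fields to read. The sample payload follows the shape of the standard WordPress REST API posts response (`id`, `link`, `title.rendered`). The real MarfeelPress endpoints and fields may differ, so treat the payload, the URLs, and the `articles_from_json` helper as assumptions.

```python
import html
import json

# Invented payload, shaped like a WordPress REST API list of posts.
SAMPLE_PAYLOAD = """
[
  {"id": 101, "link": "https://example.com/first/",
   "title": {"rendered": "First &amp; foremost"}},
  {"id": 102, "link": "https://example.com/second/",
   "title": {"rendered": "Second article"}}
]
"""


def articles_from_json(payload):
    """Turn a JSON list of posts into plain article records."""
    posts = json.loads(payload)
    return [
        {
            "id": post["id"],
            "url": post["link"],
            # Rendered titles carry HTML entities, so decode them.
            "title": html.unescape(post["title"]["rendered"]),
        }
        for post in posts
    ]
```

Each record comes straight from structured fields, so nothing here depends on the tenant's desktop layout.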

Learn the details about MarfeelPressRipper in the MarfeelPress API endpoints article.