# WhiteCollar

The whiteCollar is a set of JavaScript files in a tenant's site repository that controls exactly which articles make up each section. It lets you focus on article selection and leave the design decisions to the layout descriptor files.

As with the layout descriptor, whiteCollar is also the name of the folder containing all the files for the feature. All (non-MarfeelPress) site repositories are created with a default main.js, which applies to all sections.

Create one additional file for each section with specific extraction needs. The default file must be referenced in the definition.json, as the value of the whiteCollarScript property:

    "whiteCollarScript" : "index/src/whiteCollar/main.js",

Additional files must be named after the section they describe. The structure of the site repository is:

www.example.com
└─── index
    └─── src
        └─── whiteCollar
            ├─── main.js
            ├─── home.js
            └─── travel.js

A main.js file is created in every site repository during scaffolding, with the following structure:

document.whiteCollar = (function () {
    // Put here local variables to avoid polluting the global scope

    return {
        getItems: [
            {
                selector: '',
                extractors: {
                    title: 'h2 > a',
                    uri: 'h2 > a',
                    media: 'img',
                    excerpt: '',
                    date: '',
                    subtitle: '',
                    author: '',
                    pocket: {}
                },
                modifiers: []
            }
        ],
        modifiers: []
    }
})();

# Ripper configuration

By default, WhiteCollar uses Puppeteer to retrieve the content from the tenant's site.

PhantomJS is the Ripper configuration that Marfeel used before Puppeteer. It is deprecated and should be avoided.

# User-agent

By default, WhiteCollar uses a custom Marfeel user-agent, representing a desktop browser. See the whiteCollarUserAgent documentation for its default value and how to change it.

# Interface to implement

# getItems

getItems: [ {...}, {...} ]

Mandatory array. It contains one object for each separate selection required. A separate selection is required when articles are grouped in different parts of the DOM (thus requiring different selectors), or when the design prototype uses content groups.

Read the article about items to get more details about what each item can include.
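
As a minimal sketch of two separate selections, assume a page whose top stories and latest news live in different parts of the DOM. Every selector below is hypothetical, and the object is stored in a plain constant only so the snippet can run on its own; in a real whiteCollar file it would be the object returned to document.whiteCollar.

```javascript
// Sketch only: the selectors ('.top-stories article', '.latest-news li', ...)
// are hypothetical and must be adapted to the tenant's markup.
const whiteCollarSketch = {
    getItems: [
        {
            // First selection: articles in the (assumed) top stories block.
            selector: '.top-stories article',
            extractors: {
                title: 'h2 > a',
                uri: 'h2 > a',
                media: 'img',
                excerpt: '',
                date: '',
                subtitle: '',
                author: '',
                pocket: {}
            },
            modifiers: []
        },
        {
            // Second selection: articles in the (assumed) latest news list,
            // which uses different markup and therefore different selectors.
            selector: '.latest-news li',
            extractors: {
                title: 'h3 > a',
                uri: 'h3 > a',
                media: 'img',
                excerpt: '',
                date: '',
                subtitle: '',
                author: '',
                pocket: {}
            },
            modifiers: []
        }
    ],
    modifiers: []
};
```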

# getContentGroups (puppeteerRipper only)

getContentGroups: [ {...}, {...} ]

Optional array. It contains one object for each separate selection required. A separate selection is required when content groups are in different parts of the DOM (thus requiring different selectors).

Read the article about content group filler to get more details about what each content group can include.

# modifiers

modifiers: [
    WC.limitArticles(30),
    WC.filterEqConsecutiveArticles(),
    functionC
]

Optional function or array of functions that manipulate each item from getItems. Defaults to empty.

All the functions apply to all the items. This array can also be an item property, to apply modifiers to a specific group only.

All modifiers are functions returning a function. You can build your own, or use one available through the WC library.

Custom modifier functions always receive the list of items as argument, and must always return the modified list.
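
A custom modifier following this contract could look like the sketch below. The name excludeUrisMatching is hypothetical (it is not part of the WC library), and the sketch assumes that each entry in the list is an object exposing a uri string, mirroring the uri extractor.

```javascript
// Hypothetical custom modifier: a function returning a function.
// Assumption: every item in the list has a `uri` string property.
function excludeUrisMatching(pattern) {
    // The returned function receives the list of items...
    return function (items) {
        // ...and must return the modified list.
        return items.filter(function (item) {
            return !pattern.test(item.uri);
        });
    };
}
```

It would then be listed next to the WC modifiers, for example as `modifiers: [ excludeUrisMatching(/\/sponsored\//) ]`.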

# setup

setup: function(collectItems) {
    ...
    collectItems();
}

Optional function. Defaults to empty. Use it to apply changes to the DOM before extracting articles. You must always call the callback function inside the setup; otherwise, no extraction takes place.
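
As an illustration, the sketch below removes some nodes from the DOM and then hands control back through the callback. The function name and the '.sponsored' selector are hypothetical; the only part taken from the contract above is that the callback is always called.

```javascript
// Hypothetical setup function; '.sponsored' is an assumed class name.
function removeSponsoredTeasers(collectItems) {
    var nodes = document.querySelectorAll('.sponsored');

    // Drop the unwanted nodes so they are never extracted.
    Array.prototype.forEach.call(nodes, function (node) {
        node.remove();
    });

    // Always call the callback, even when nothing was removed.
    collectItems();
}
```

It would be referenced from the whiteCollar object as `setup: removeSponsoredTeasers`.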

# removeDuplicates

removeDuplicates: true

Optional boolean. Defaults to false. Set to true to remove duplicates (determined by URI).

When an article appears more than once, the instance to keep is chosen in the following priority order:

Article part of a content group > Article with a pocket > any other article.

If the duplicates of an article share the same priority, only the first instance of that article is kept.
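
The priority rule can be pictured with the sketch below. It only illustrates the logic described here and is not Marfeel's implementation; the item shape (uri, a hypothetical inContentGroup flag, and pocket) is assumed for the example.

```javascript
// Illustration only, not Marfeel's actual code.
// Assumed item shape: { uri, inContentGroup (hypothetical flag), pocket }.
function priorityOf(item) {
    if (item.inContentGroup) return 2;                                // content group
    if (item.pocket && Object.keys(item.pocket).length > 0) return 1; // has a pocket
    return 0;                                                         // any other article
}

function dedupeByUri(items) {
    var best = new Map(); // uri -> index of the instance to keep

    items.forEach(function (item, index) {
        var keptIndex = best.get(item.uri);
        // Strictly greater: on equal priority the first instance stays.
        if (keptIndex === undefined ||
            priorityOf(item) > priorityOf(items[keptIndex])) {
            best.set(item.uri, index);
        }
    });

    var kept = new Set(best.values());
    // Survivors keep their original relative order.
    return items.filter(function (_, index) {
        return kept.has(index);
    });
}
```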

`numColumns`

All existing whiteCollar files also have a deprecated numColumns numerical property.

Modifying its value will change the extraction results, so leave it as it is!

It is a relic of Marfeel's L tablet theme.

# enableDuplicate

enableDuplicate: true

Optional boolean. Defaults to false. By default, Marfeel only extracts one instance of each article, ignoring duplicates.

Set the flag to true to force the rendering of all articles.

INFO

The removeDuplicates flag takes precedence over enableDuplicate. Check that the WhiteCollar does not have the removeDuplicates flag set before adding enableDuplicate.

# pagination

pagination: {
    prevPage: "...",
    nextPage: "...",
    currentPage: "..."
}

Optional object containing the selectors used for section pagination. It has the following properties:

  • prevPage: Selector of the link to the previous pagination page.
  • nextPage: Selector of the link to the next pagination page.
  • currentPage: Selector of the link to the current pagination page. This property isn't required in all cases. Only use it if you need to fix the label of the current page.

Each string is passed to a querySelector call.

Each selected node must have an href attribute containing the pagination URL, and its innerText must be the page number. The following example fixes nodes that do not meet these requirements:

<!-- HTML code -->
<a class="next" data-url="https://tenantexample.com/page/3">next</a>
<a class="prev" data-url="https://tenantexample.com/page/2">prev</a>

// WhiteCollar implementation
// Note: currentPage (a number) and isFirstPage() are not defined in this
// excerpt; they are assumed to be defined elsewhere in the whiteCollar file.
function fixPaginationLabels(callback) {
    const nextPageElement = document.querySelector('.next');
    const prevPageElement = document.querySelector('.prev');

    // Set the label to the page number and copy the URL into href.
    if (nextPageElement) {
        nextPageElement.innerText = currentPage + 1;
        nextPageElement.href = nextPageElement.dataset.url;
    }

    if (prevPageElement) {
        prevPageElement.innerText = currentPage - 1;
        prevPageElement.href = prevPageElement.dataset.url;
    }

    // Hand control back so that the extraction can take place.
    return callback && callback();
}
...
setup: fixPaginationLabels,
...
pagination: {
    prevPage: !isFirstPage() && '.prev',
    nextPage: '.next'
}

Don't use this interface if the original HTML already meets these requirements, as in this example:

<a class="next" href="https://tenantexample.com/page/3">3</a>
<a class="prev" href="https://tenantexample.com/page/2">2</a>

`pagination`

Use this interface only as a last resort to implement pagination: when the tenant has a custom pagination, or when you need to fix (or add) the page labels. Check the section pagination documentation to find out whether you need to configure the whiteCollar at all, since there are other ways to get a section's pages.

# Available libraries

The WhiteCollar code doesn't have to be in vanilla JavaScript. The following libraries are available: Ramda.js, the WC library, and jQuery.

# Ramda.js

Ramda.js v0.22.1 is available. Check its documentation for the available functions, to avoid writing something that already exists.

Usage example:

R.always(' ')

# WC Library

WC is a Marfeel utility library to manipulate DOM elements and create whiteCollar-compatible objects. It also uses Ramda. Prefer it over writing your own methods.

Usage example:

WC.getUniqueUri()

See all the available methods in the article dedicated to the WC library.

# jQuery

jQuery v1.6.4 is available.

Usage example:

$('.article').attr('title')

# Test or debug the whiteCollar?

There is no unit test structure for the whiteCollar.

However, you can run it in isolation, without any actual extraction happening. Use glue section:rip -d to debug in your local environment.

# Articles order

Heuristics determine each article's relevance and, with it, the order in which articles are extracted. In addition to DOM order, the visual order (with CSS) is taken into account.

It is possible to control the display order of articles in a section with two properties: order and position, as described in the whiteCollar items article.

# Real-life use cases

Check different use cases in the whiteCollar guide.