# Content group filler

Content group filler is a whitecollar feature that automatically groups items by a key. The developer only declares which elements are items and which are content groups. The relation between each item and its content group is then worked out without any extra work from the developer.

This feature only works with puppeteerRipper whitecollars.

# Benefits

We can avoid any logic related to content groups in the item extraction, which is a good step towards a very simple whitecollar.

# Interface to implement

getContentGroups is an optional array in the whitecollar script (see the whitecollar article).

This article details all the properties each content group can have, for example:

{
	selector: ".balcon",
	extractors: {
		name: ".title"
	}
}

# selector

Mandatory string, passed to querySelectorAll under the hood, that selects all the content groups. It can contain several comma-separated selectors, each matching the element that wraps the information of a content group (e.g. #latest-news ARTICLE, .featured-items .post).

# prefix

Optional property to prefix the keys of the content groups matched by the selector. Every content group detected will have its name prefixed with this value. This is really useful in conjunction with the startsWith feature of the layoutDescriptor.
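As an illustration, two regions of a page could share the same group names and be told apart by prefix. This is only a sketch: the selectors and prefix values are made up for the example, and the exact shape of the resulting key (for instance, whether a separator is added) is not covered here.

```javascript
// Hypothetical configuration: the selectors and prefix values are illustrative only.
const getContentGroups = [
  {
    selector: ".main .content-group",
    // Names of groups found in the main area are prefixed with "main".
    prefix: "main",
    extractors: {
      name: ".title",
    },
  },
  {
    selector: ".sidebar .content-group",
    // Names of groups found in the sidebar are prefixed with "sidebar".
    prefix: "sidebar",
    extractors: {
      name: ".title",
    },
  },
];
```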

# extractors

Mandatory object containing the instructions on how to extract the content group properties. Every extractor instruction is applied to each content group node found by the selector.

# name

This property is a selector resolved inside the content group node. The text of the matched element is sanitized and used as the identifier of the content group.

For example:

<div class="separator">
     <span class="title">Breaking news</span>
</div>

With name: ".title", the name will be extracted as breakingnews.
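The sanitization step can be pictured with a small sketch. The function below is hypothetical and only reproduces the examples in this article (lowercasing and dropping spaces); the real sanitizer may handle accents and other characters differently.

```javascript
// Hypothetical helper: not the actual sanitizer used by content group filler.
function sanitizeGroupName(text) {
  return text
    .toLowerCase()
    // Assumption: anything that is not a lowercase ASCII letter or a digit is dropped.
    .replace(/[^a-z0-9]/g, "");
}

console.log(sanitizeGroupName("Breaking news")); // breakingnews
```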

The special keyword INNER_TEXT allows you to get the text content of the selected group.

This is especially useful for content groups without children, for example:

<div class="group-title">Breaking news</div>
<article></article>
<article></article>
<article></article>

<div class="group-title">Repairing news</div>
<article></article>
<article></article>

With name: "INNER_TEXT", it will extract two content groups: breakingnews with 3 articles and repairingnews with 2 articles.
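A configuration for the markup above could look like the following sketch. It only uses the properties described in this article, and children is left out because the articles are siblings of the group title rather than nested inside it.

```javascript
// Sketch of a configuration for headings that act as content groups.
const getContentGroups = [
  {
    // Every .group-title element is a content group of its own.
    selector: ".group-title",
    extractors: {
      // Use the text of the selected node itself as the group name.
      name: "INNER_TEXT",
    },
  },
];
```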

WARNING

Take into account that INNER_TEXT reads the innerText property of the HTML node matched by the selector. So for content groups with children it will return the innerText of the whole content group, including the text of the items inside it.

# title

Selector to extract the title of the content group. Not sanitized.

TIP

You can also use INNER_TEXT for the title property.

# children

This is a special property to define the type of content group.

There are two types of content groups: with children or without.

It's important to define the type properly, because the strategy used to link items to a content group is different for each type.

# Content group with children

In this case, we have to set the children property to true.

Example:

<div class="content-group">
	<div class="title">Breaking news</div>
	<article>....</article>
	<article>....</article>
</div>

# Content group without children

In this case we don't need to set the children property, as it is false by default.

Example:

<div class="content-group">
	<div class="title">Breaking news</div>
</div>
<article>....</article>
<article>....</article>

# What happens under the hood

If everything is configured correctly, each item will carry the key of the content group it belongs to.

For example:

<div class="content-group">
	<div class="title">Ultimas noticias</div>
</div>
<article>....</article>
<article>....</article>
<article>....</article>
<article>....</article>

With this configuration:

getContentGroups: [
	{
		selector: ".content-group",
		extractors: {
			name: ".title"
		}
	}
]

The extracted items will look like this:

[
	{
		"title": "El Síndic achaca las largas listas de espera en Sanidad  a los pacientes del resto de España",
		"uri": "https://www.vozpopuli.com/elliberal/politica/Sindic-achaca-Sanidad-pacientes-Espana_0_1307869371.html",
		"subtitle": "12:03",
		"relevance": 9189,
		"column": 1,
		"media": null,
		"pocket": {
			"key": "ultimasnoticias"
		}
	},
	{
		"title": "El PNV advierte al PSOE: también es “importante” que avancen las conversaciones con ellos",
		"uri": "https://www.vozpopuli.com/politica/PNV-advierte-PSOE-importante-conversaciones_0_1307869397.html",
		"subtitle": "11:38",
		"relevance": 9330,
		"column": 1,
		"media": null,
		"pocket": {
			"key": "ultimasnoticias"
		}
	}
]

So on the layoutDescriptor you can create a content group with that key.

# How to disable it

In general, if you don't define getContentGroups in the whitecollar, nothing will be applied. If you want to disable it for a particular item, create a pocket with a key on that item. When an item already has a pocket with a key, content group filler won't do anything for that particular item.
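The per-item rule can be sketched as follows. The function name and the way the key is written are assumptions made for illustration; only the pocket.key shape comes from the example output shown earlier in this article.

```javascript
// Hypothetical sketch of the skip rule; not the actual implementation.
function assignGroupKey(item, groupKey) {
  // An item that already has a pocket with a key is left untouched.
  if (item.pocket && item.pocket.key) {
    return item;
  }
  // Otherwise the key of the detected content group is written into the pocket.
  return { ...item, pocket: { ...item.pocket, key: groupKey } };
}

const manual = assignGroupKey({ title: "A", pocket: { key: "custom" } }, "ultimasnoticias");
const automatic = assignGroupKey({ title: "B" }, "ultimasnoticias");
console.log(manual.pocket.key, automatic.pocket.key);
```

Here the first item keeps its own key, while the second one receives the key of the content group.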

# How it works

The algorithm saves all detected content groups and, for every item, tries to assign the item to one of them. To do that it has two strategies:

# Contains strategy

Imagine this scenario:

<div class="content-group">
	<div class="title">Breaking news</div>
	<article>....</article>
	<article>....</article>
	<article>....</article>
	<article>....</article>
</div>

In this case, the content group has to be defined with children set to true. Something like:

{
	"selector": ".content-group",
	"children": true,
	"extractors": {
		"name": ".title"
	}
},

The strategy here to know whether an item belongs to a particular content group is to check if the item's DOM element is inside the content group.
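A minimal sketch of this check, assuming each detected content group is kept as a { key, node } pair (a shape invented for this example) and relying on the standard DOM method Node.contains:

```javascript
// Hypothetical sketch: `groups` is a list of { key, node } pairs, where `node`
// is the DOM element matched by the content group selector.
function findContainingGroupKey(itemNode, groups) {
  // Node.contains() is true when itemNode is the node itself or one of its descendants.
  const match = groups.find((group) => group.node.contains(itemNode));
  // Items outside every content group get no key.
  return match ? match.key : null;
}
```

This version returns the first matching group. How nested content groups are resolved is not covered by this sketch.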

# Position strategy

Imagine this other scenario:

<div class="content-group">
	<div class="title">Breaking news</div>
</div>
<article>....</article>
<article>....</article>
<article>....</article>
<article>....</article>

In this case, the content group has to be defined with children set to false, which is the default. Something like:

{
	"selector": ".content-group",
	"extractors": {
		"name": ".title"
	}
},

The strategy here is a little more complex: we rely on the computed CSS layout to understand where the item is positioned on the screen. Put simply, if an item is below a content group, we say it belongs to that content group. This takes into account different columns and so on, in order to relate each item to the proper key.
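The idea can be illustrated with a simplified sketch. It assumes each content group and item comes with a rectangle shaped like the result of getBoundingClientRect(), and it treats horizontal overlap as "same column". Both the data shape and the column rule are assumptions for this example, and the real algorithm may take more into account.

```javascript
// Hypothetical sketch. Rects have the shape returned by getBoundingClientRect():
// { top, left, right }.
function overlapsHorizontally(a, b) {
  return a.left < b.right && b.left < a.right;
}

function findGroupKeyAbove(itemRect, groups) {
  let best = null;
  for (const group of groups) {
    const isAbove = group.rect.top <= itemRect.top;
    // Ignore groups that start below the item or sit in another column.
    if (!isAbove || !overlapsHorizontally(group.rect, itemRect)) continue;
    // Keep the closest group above the item.
    if (best === null || group.rect.top > best.rect.top) best = group;
  }
  return best ? best.key : null;
}

const groups = [
  { key: "breakingnews", rect: { top: 0, left: 0, right: 300 } },
  { key: "sports", rect: { top: 500, left: 0, right: 300 } },
  { key: "opinion", rect: { top: 0, left: 320, right: 620 } },
];
console.log(findGroupKeyAbove({ top: 100, left: 0, right: 300 }, groups));
```

In this layout, an item at top 100 in the left column is assigned to breakingnews, while an item at the same height in the right column would be assigned to opinion.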