# Extraction configuration migration

Any tenant at Marfeel can combine several extraction strategies. For example, one section can be extracted with the Whitecollar ripper while the others use the MarfeelPress ripper. Section and article extraction configurations can also be mixed: a site can extract all its articles with BoilerpipePressExtractor and all its section pages with Jsoup Ripper. There are as many possible configurations as there are tenants.

This article describes how to change a tenant's article extraction strategy and section extraction strategy.

WARNING

Changing the configuration for a tenant using BoilerpipePressExtractor or marfeelPressRipper is discouraged because it lowers performance. Do it only as a last resort.

TIP

"Removing MarfeelPress" and "migrating the section" are different ways of saying the same thing: modifying the extraction configuration.

For example, if you "remove MarfeelPress for the home section" of a site, you're migrating the home section to a different ripper.

If you "remove MarfeelPress from articles", you're changing the article extraction strategy.

# Article extraction strategy

Article extraction is handled by Boilerpipe, which processes the content to produce the Marfeel version. By default, regular tenants use BoilerpipeExtractor and MarfeelPress tenants use BoilerpipePressExtractor.

To change the article extraction strategy of a tenant, set the itemExtractorType flag in the global configuration of the definition.json file. Its value is the name of the desired extractor.
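A minimal sketch of the flag, assuming the global configuration is a JSON object and the extractor name is given as a string. Only the key itemExtractorType and the extractor names come from this article. Where the global configuration object sits inside definition.json is not shown here.

```json
{
  "itemExtractorType": "BoilerpipeExtractor"
}
```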

The Fetcher is the Boilerpipe component in charge of retrieving the content from the tenant's site.

BoilerpipePressExtractor is bound to MarfeelPressFetcher, so its Fetcher can't be changed.

BoilerpipeExtractor can use other fetchers, described in the Fetchers section of the Article pages extraction article. To change the Fetcher used by a tenant, add the boilerpipeFetcher flag in the global configuration of the definition.json file.
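As a hedged illustration, the two flags could sit side by side in the global configuration. The value <FetcherName> is a placeholder, not a real fetcher: replace it with one of the names listed in the Fetchers section of the Article pages extraction article. The string value type and the surrounding object are assumptions.

```json
{
  "itemExtractorType": "BoilerpipeExtractor",
  "boilerpipeFetcher": "<FetcherName>"
}
```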

WARNING

All sections of a tenant must use the same Extractor and Fetcher. That's why both are set in the global configuration, not in the section configuration.

# Section extraction strategy

Section extraction is handled by Alibaba, which can be configured to use different Rippers depending on the tenant's needs.

The different Rippers available are described in the Sections pages extraction article.

WARNING

Rippers can be configured per section, so a single tenant can use several Rippers at the same time.

To change the Ripper configuration of a section, use the feedRipper flag in the desired section configuration.
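A minimal sketch of the flag inside one section's configuration. The value <RipperName> is a placeholder for one of the Rippers listed in the Sections pages extraction article. The string value type and the shape of the enclosing section object are assumptions.

```json
{
  "feedRipper": "<RipperName>"
}
```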

TIP

The Ripper can also be set in the global configuration of a tenant. In that case, it will be used by default when a section doesn't have a specific Ripper configured.

# Global configuration

# Programmed invalidations

Tenants using the MarfeelPress plugin don't need programmed (scheduled) invalidations, as they have a different strategy to detect when the content needs to be extracted.

Tenants without the MarfeelPress plugin rely on scheduled invalidations instead.

When an active tenant installs the MarfeelPress plugin, the scheduled invalidations can be disabled. This will reduce the load on the tenant's servers.

To disable the scheduled invalidations:
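A minimal sketch, assuming that disabling is the exact inverse of the re-enabling step described in the following paragraph, and that both flags take JSON booleans in the global configuration. Verify these values against a tenant that already has the plugin installed before relying on them.

```json
{
  "quartzInvalidation": false,
  "disabledConsumerInvalidation": true
}
```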

If an active tenant using the MarfeelPress plugin uninstalls it, the scheduled invalidations must be enabled again. In that case, set quartzInvalidation to true and disabledConsumerInvalidation to false.
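Written out as a fragment of the global configuration, these values might look as follows. The flag names and values come from the paragraph above. Their representation as JSON booleans and the enclosing object are assumptions.

```json
{
  "quartzInvalidation": true,
  "disabledConsumerInvalidation": false
}
```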

# Common issues

When migrating a tenant to use BoilerpipePressExtractor, the following flags may be required to guarantee a successful extraction:

TIP

These flags are configured by default for tenants that install the MarfeelPress plugin.