# Extraction flags

The following are all the flags that can be defined in a Tenant's definition.json regarding extraction.

# allowJavascriptLoad, alibabaWaitPageOpen, allowExternalJavascriptLoad

WARNING

Before using these flags (allowJavascriptLoad, alibabaWaitPageOpen, allowExternalJavascriptLoad), make sure you fully understand that they load unknown JavaScript, which might seriously affect Marfeel's behaviour and even prevent it from working.

Used on section pages that are rendered with JavaScript. Each flag is, in order, a more effective but more costly way of extracting such pages.

Add the flags one at a time, in the order listed, and test after each addition. If extraction still fails, add the next flag, until all three are enabled together.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "allowJavascriptLoad": "true",
        "alibabaWaitPageOpen":"true",
        "allowExternalJavascriptLoad":"true",
        ...
    },
    ...
}

# articlePathLastParts

For short article paths, this flag defines the minimum number of words that the last part of the path must contain for the page to be considered an article.

  • Type: number
  • Default: 1 if the path has two parts or fewer, and 2 for longer paths.

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "articlePathLastParts": 3,
        ...
    },
    ...
}

WARNING

Before using these flags (articlePathParts and articlePathLastParts), make sure you fully understand how they work.

These flags are meant to be used in URLs following a pattern and NEVER for a single URL case. If you want a specific URL to be detected as a section instead of an article you might want to develop a static section for this specific case.

If you still need to use these flags, keep in mind that you are changing how this tenant is identifying articles and sections, so please make sure to test several articles and sections to check everything is still working properly.

# articlePathParts

Defines the patterns to use to check whether a page is an article or not.

  • Type: number
  • Default: 4

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "articlePathParts": 2,
        ...
    },
    ...
}

# authorBio (MarfeelPress-specific)

Used only on MarfeelPress Tenants. Adds the author's bio in the article details.

  • Type: string
  • Format: One of:
    • bottom - The bio is added after the content of the article
    • top - The bio is added before the content of the article

Example:

{
    ...,
    "title" : "Title of the awesome example site",
    "uri" : "www.example.com",
    "configuration" : {
        ...,
        "authorBio" : "bottom",
        ...
    },
    ...
}

MarfeelPress specific

This flag is only active with the MarfeelPressFetcher.

# blacklist

Prevents the extraction of the specified elements from article pages. See more information in the documentation about blacklist and whitelist.

  • Type: string
  • Default: undefined
  • Template: comma-separated list

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "blacklist": "desktop-footer,==off-phones",
        ...
    },
    ...
}

# blacklistedUrlPatterns

Blacklists content based on URL patterns.

WARNING

The patterns only check against the path, not the domain or the protocol.

The value is an AntPathMatcher pattern.

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "blacklistedUrlPatterns": "/*/example-url-pattern.shtml**",
        ...
    },
    ...
}
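
To illustrate that only the path takes part in the match, here is a minimal Python sketch. It is not Spring's AntPathMatcher: it assumes a simplified translation where `?` matches one character within a path segment, `*` matches any run of characters within a segment, and `**` matches any run of characters including slashes. The helper names `ant_to_regex` and `is_blacklisted` are hypothetical.

```python
import re
from urllib.parse import urlsplit


def ant_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified Ant-style pattern into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")        # any characters, slashes included
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # any characters within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")      # exactly one character within a segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_blacklisted(url: str, pattern: str) -> bool:
    """Match the pattern against the URL path only, ignoring protocol and domain."""
    path = urlsplit(url).path
    return ant_to_regex(pattern).fullmatch(path) is not None


pattern = "/*/example-url-pattern.shtml**"
print(is_blacklisted("https://www.example.com/news/example-url-pattern.shtml", pattern))   # True
print(is_blacklisted("http://other.example.org/news/example-url-pattern.shtml", pattern))  # True
print(is_blacklisted("https://www.example.com/example-url-pattern.shtml", pattern))        # False
```

The first two URLs differ in protocol and domain but share the same path, so both match. The third has no leading path segment for `/*/` to consume, so it does not.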

WARNING

When blacklisting a whole section, check that the section is not defined in definition.json or, if it is, that it is of type EXTERNAL.

When the tenant is using MarfeelCDN, the pattern has to be blacklisted at CDN level.

# boilerEnableSecureConnections

Enables the secure connections on the Boilerpipe for articles.

This flag is not necessary if:

  • The definition already has the hasHttps flag set to true.
  • The first section from sectionDefinition uses https.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "boilerEnableSecureConnections": "true",
        ...
    },
    ...
}

# boilerpipeFetcher

Adds a custom RSS fetcher for the Boilerpipe. Review the fetchers article to know more about it.

  • Type: string
  • Default: htmlFetcher

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "boilerpipeFetcher": "tenantRssFetcher",
        ...
    },
    ...
}

# boilerpipeIgnoreInlineImageDimensions

Sets the order of the getImageDimension methods during the extraction to:

  1. QueryParam
  2. FromPath
  3. File headers

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "boilerpipeIgnoreInlineImageDimensions": true,
        ...
    },
    ...
}

WARNING

Remove this flag completely to disable it.

Setting it to false won't work.

# boilerpipeIgnoreImageNameDimensions

Sets the order of the getImageDimension methods during the extraction to:

  1. CustomWidthAndHeightAttr
  2. WidthAndHeightAttr
  3. StylesAttr
  4. File headers

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "boilerpipeIgnoreImageNameDimensions": "true",
        ...
    },
    ...
}

WARNING

Remove this flag completely to disable it.

Setting it to false won't work.

# boilerpipeUserAgent

Specifies the User Agent that Boilerpipe has to use to browse the site's HTML as rendered on a specific device.

  • Type: string
  • Default: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 MarfeelMan
  • Format: must be a valid user agent. It is used as-is, with MarfeelMan appended at the end.

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "boilerpipeUserAgent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Mobile Safari/537.36",
        ...
    },
    ...
}

# cleanerFetcherBlacklist

Defines the blacklist for the cleaner fetcher if it's selected as the Boilerpipe fetcher.

  • Type: string
  • Default: undefined
  • Format: comma-separated list of DOMString

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "cleanerFetcherBlacklist": ".aside-inner, .block.comments",
        ...
    },
    ...
}

# cronRefresh

Defines the frequency of section reloads as a cron expression, in the format generated by cronmaker.

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "cronRefresh": "0 0/3 * 1/1 * ? *",
        ...
    },
    ...
}
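
To read an expression like the one above, it can help to label its fields. The following Python sketch assumes the Quartz field order (seconds, minutes, hours, day of month, month, day of week, optional year), which is the format cronmaker generates; the function name `describe_cron` is hypothetical.

```python
# Field names in Quartz order; the seventh (year) field is optional.
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week", "year"]


def describe_cron(expression: str) -> dict:
    """Label each whitespace-separated field of a Quartz-style cron expression."""
    values = expression.split()
    if len(values) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(values)}")
    return dict(zip(QUARTZ_FIELDS, values))


fields = describe_cron("0 0/3 * 1/1 * ? *")
print(fields["minutes"])  # 0/3
```

Under that assumption, `0/3` in the minutes field means every 3 minutes starting at minute 0, so the example reloads sections every three minutes.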

TIP

To configure it for a specific section, refer to this article.

# customTagActions

Transforms an HTML tag into another element.

  • Type: string
  • Default: undefined
  • Format: One of:
    • GenericVideoAttrElement
    • ImageElement
    • PinterestElement
    • CustomStyleElement
    • IgnorableElement
    • IframeElement
    • DIVElement
    • ScriptMetadataElement
    • NextPageTagAction

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "customTagActions": "PICTURE:DIVElement",
        ...
    },
    ...
}

In this case, it will transform the PICTURE HTML elements into DIV elements.

# defaultTopMediaMediaSelectorStrategy

Selects the Top Media based on a specified option. The available options are included in MediaSelector.java in Gutenberg.

  • Type: string
  • Default: FORCE_DETAIL
  • Format: One of:
    • HINT_OR_DETAIL - The image is extracted from section pages. If not there, it is extracted from article pages.
    • DETAIL_OR_HINT - This is the default value. With this strategy, Marfeel first tries to extract the image from article pages, before moving on to section pages.
    • FORCE_DETAIL - The image is extracted from article pages.
    • FORCE_HINT - The image is extracted from section pages.
    • TOPMEDIA - The image is extracted from the top media.
    • META_OR_DETAILS - The meta image is extracted. If not there, the image is extracted from article pages.
    • HINT_OR_META - The image is extracted from section pages. If not there, the meta image is extracted.
    • DETAIL_OR_HINT_OR_META - The image is extracted first from article pages, then section pages, and then the meta.

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "userInterface":{
        ...,
        "features":{
            ...,
            "defaultTopMediaMediaSelectorStrategy": "DETAIL_OR_HINT",
            ...
        },
        ...
    },
    ...
}
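
The fallback order of the strategies listed above can be pictured as an ordered list of sources to try. The Python sketch below is only an illustration of those descriptions, not Gutenberg's MediaSelector: the source keys `hint` (section page), `detail` (article page) and `meta` (meta image), and the function `select_image`, are hypothetical. TOPMEDIA is left out because the description above does not name a fallback for it.

```python
from typing import Optional

# Ordered sources each strategy tries, following the descriptions above.
# The keys "hint", "detail" and "meta" are illustrative names only.
STRATEGY_SOURCES = {
    "HINT_OR_DETAIL": ["hint", "detail"],
    "DETAIL_OR_HINT": ["detail", "hint"],
    "FORCE_DETAIL": ["detail"],
    "FORCE_HINT": ["hint"],
    "META_OR_DETAILS": ["meta", "detail"],
    "HINT_OR_META": ["hint", "meta"],
    "DETAIL_OR_HINT_OR_META": ["detail", "hint", "meta"],
}


def select_image(strategy: str, candidates: dict) -> Optional[str]:
    """Return the first available image among the sources the strategy allows."""
    for source in STRATEGY_SOURCES[strategy]:
        image = candidates.get(source)
        if image:
            return image
    return None


candidates = {"hint": None, "detail": "article.jpg", "meta": "og.jpg"}
print(select_image("HINT_OR_DETAIL", candidates))  # article.jpg
print(select_image("FORCE_HINT", candidates))      # None
print(select_image("HINT_OR_META", candidates))    # og.jpg
```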

# defaultMediaSelectorStrategy

Defines how the image used in section pages is selected. If only a certain group of articles needs it, it is recommended to use forceStrategy in the whiteCollar instead.

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "userInterface":{
        ...,
        "features":{
            ...,
            "defaultMediaSelectorStrategy": "META_OR_DETAILS",
            ...
        },
        ...
    },
    ...
}

# detailsExcerpt (MarfeelPress-specific)

Used only on MarfeelPress Tenants. Adds the excerpt returned by the boilerpipePressExtractor.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title" : "Title of the awesome example site",
    "uri" : "www.example.com",
    "configuration" : {
        ...,
        "detailsExcerpt" : "true",
        ...
    },
    ...
}

MarfeelPress specific

This flag is only active with the MarfeelPressFetcher.

# detailItemsProcessor

Allows choosing a different item processor. When a webpage is slow or there is a lot of content to extract, using the throttled processor makes the process more robust.

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "detailItemsProcessor": "throttledDetailItemsProcessor",
        ...
    },
    ...
}

# disableAMPCacheForImages

If set to true, the src of the image will be AMP_CACHE_URL_imageURL, where AMP_CACHE_URL is https://cdn.amproject.org/i/.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "userInterface":{
        ...,
        "features":{
            ...,
            "disableAMPCacheForImages": true,
            ...
        },
        ...
    },
    ...
}

WARNING

Remove this flag completely to disable it.

Setting it to false won't work.

# disabledConsumerInvalidation

  • Type: boolean
  • Default: false

Disables default article invalidation configuration when true.

Default invalidation

When a consumer gets a request for an article older than 24h, it extracts it again.

Not necessary with MarfeelPress, since that content is refreshed via API calls.
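
The default rule and the effect of the flag can be summarised in a few lines. This Python sketch only restates the behaviour described above; the function name `needs_reextraction` and its parameters are hypothetical and do not correspond to Marfeel code.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=24)  # default age threshold described above


def needs_reextraction(extracted_at, now, disabled_consumer_invalidation=False):
    """Return True when an article should be extracted again on request."""
    if disabled_consumer_invalidation:
        return False
    return now - extracted_at > MAX_AGE


now = datetime(2024, 1, 2, 12, 0)
print(needs_reextraction(datetime(2024, 1, 1, 11, 0), now))        # True (25 hours old)
print(needs_reextraction(datetime(2024, 1, 1, 11, 0), now, True))  # False (flag set)
```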

# pageNumberStartsFromZero

  • Type: boolean
  • Default: false

This pagination flag alters section page number calculation. When set to true, it will consider that the page number starts from zero instead of one.

Example:

page 1 -> https://www.diariodemorelos.com/noticias/categories/virales
page 2 -> https://www.diariodemorelos.com/noticias/categories/virales?page=1
page 3 -> https://www.diariodemorelos.com/noticias/categories/virales?page=2

In this example, by default the pagination links show the wrong label, which triggers a page links inconsistency exception and an alarm. Setting pageNumberStartsFromZero to true allows us to support this behaviour properly.

Example:

{
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        "pageNumberStartsFromZero": "true"
    }
}
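
The label mismatch can be reproduced with a small calculation. The Python sketch below assumes the page number travels in a `page` query parameter, as in the URLs above; the function name `page_label` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def page_label(url: str, page_number_starts_from_zero: bool = False) -> int:
    """Return the human-facing page label for a section URL."""
    query = parse_qs(urlsplit(url).query)
    if "page" not in query:
        return 1  # no page parameter: this is the first page
    number = int(query["page"][0])
    return number + 1 if page_number_starts_from_zero else number


base = "https://www.diariodemorelos.com/noticias/categories/virales"
print(page_label(base + "?page=1"))        # 1
print(page_label(base + "?page=1", True))  # 2
```

Without the flag, the second page (`?page=1`) gets the same label as the first one. With the flag, it is labelled 2.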

# disableDefaultPagePattern

  • Type: boolean
  • Default: false

This flag disables the default page pattern set by Gutenberg: "/page/([0-9]+)/?".

When it is enabled, only the page patterns defined in the tenant's definition configuration and in each section definition's pagePatterns are applied.

# disablePhantomDiskCache

When enabled, disables the disk cache when using PhantomJS.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "disablePhantomDiskCache": true,
        ...
    },
    ...
}

# disableProxyScripts

When set to true, scripts do not go through the cache.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "userInterface":{
        ...,
        "features":{
            ...,
            "disableProxyScripts": true,
            ...
        },
        ...
    },
    ...
}

# disableSectionValidation

Disables section validation, which would normally avoid duplicated sections.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "disableSectionValidation": true,
        ...
    },
    ...
}

WARNING

Remove this flag completely to disable it.

Setting it to false won't work.

# dynamicItemContentConfiguration

Extracts the specified content block from the DOM of the client. Later it can be consumed from any JSP file that you specify.

  • Type: string
  • Default: undefined
  • Format: semicolon-separated list of entries, each a CSS selector followed by a comma and the widget name:
    • .exampleSelector,exampleWidget;#someID > div,otherWidget

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "dynamicItemContentConfiguration": ".generic-widget > .discounts,dynamicContentWidget;.news-related,newsRelatedWidget",
        ...
    },
    ...
}
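
To make the value format concrete, the Python sketch below splits such a string into selector and widget pairs. It assumes that the widget name is whatever follows the last comma of each entry; the function name `parse_dynamic_item_content` is hypothetical and this is not how Marfeel parses the flag internally.

```python
def parse_dynamic_item_content(value: str) -> list:
    """Split the flag value into (css_selector, widget_name) pairs."""
    pairs = []
    for entry in value.split(";"):
        if not entry.strip():
            continue
        # Assumption: the widget name follows the last comma of each entry.
        selector, _, widget = entry.rpartition(",")
        pairs.append((selector.strip(), widget.strip()))
    return pairs


value = ".generic-widget > .discounts,dynamicContentWidget;.news-related,newsRelatedWidget"
for selector, widget in parse_dynamic_item_content(value):
    print(f"{widget}: {selector}")
# dynamicContentWidget: .generic-widget > .discounts
# newsRelatedWidget: .news-related
```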

How to use it: the following example shows how to retrieve, in a JSP file, the HTML of a previously selected content block. It follows the example .generic-widget > .discounts,dynamicContentWidget:

<%@taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>

<c:set var="detailWidgetDiscounts" value="${item.getDetailItem().getWidget(null, '', 'dynamicContentWidget')}" scope="request" />

<c:if test="${detailWidgetDiscounts != null}">
    ${detailWidgetDiscounts.getHtml()}
</c:if>

For AMP, you will also need to add the jigsaw:ampTranslator tag:

<%@taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<%@taglib prefix="jigsaw" uri="http://dev.marfeel.com/jsp/mrf/jigsaw" %>

<c:set var="detailWidgetDiscounts" value="${item.getDetailItem().getWidget(null, '', 'dynamicContentWidget')}" scope="request" />

<c:if test="${detailWidgetDiscounts != null}">
    <jigsaw:ampTranslator>
            ${detailWidgetDiscounts.getHtml()}
    </jigsaw:ampTranslator>
</c:if>

See a real tenant example on GitHub.

# dynamicSectionAllowedQueryParams

Avoids stripping the specified query parameters when extracting a section. Use it if a section path is different depending on query parameters. This flag is expected to only work with dynamic sections, but it is possible to use it with the default section type too.

{
  "name" : "section_name",
  "title" : "Section title",
  "configuration" : {
    "dynamicSectionAllowedQueryParams" : "tag"
  },
...
}

Be mindful of the query parameter

Don't use this flag for just any query parameter. If you plan to keep a parameter called page or p whose value is a number, look out! You might be re-creating section pagination!

# enableUnsecureMedia

By default, Marfeel forces all media sources to go through https. With this flag, Marfeel keeps http if the original site uses it.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "userInterface":{
        ...,
        "features":{
            ...,
            "enableUnsecureMedia": true,
            ...
        },
        ...
    },
    ...
}

# extractImagesFromNoScript

Enables the extraction of images located inside <noscript> tags.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "extractImagesFromNoScript": "true",
        ...
    },
    ...
}

# extractionQueryParams

This flag adds parameters to the extraction query.

  • Type: string
  • Default: undefined
  • Format: query string

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "extractionQueryParams": "mrfCacheBuster={timestamp}&key=value",
        ...
    },
    ...
}

{timestamp} is automatically replaced by a timestamp at extraction time.
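
The substitution can be sketched as follows. This Python example is an illustration only: the function name `build_extraction_url` is hypothetical, and the unit of the timestamp (seconds since the epoch here) is an assumption, since it is not specified above.

```python
import time
from typing import Optional


def build_extraction_url(url: str, extraction_query_params: str,
                         timestamp: Optional[int] = None) -> str:
    """Append the configured query string, filling in the {timestamp} placeholder."""
    if timestamp is None:
        timestamp = int(time.time())  # assumption: seconds since the epoch
    query = extraction_query_params.replace("{timestamp}", str(timestamp))
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}{query}"


print(build_extraction_url("https://www.example.com/news",
                           "mrfCacheBuster={timestamp}&key=value",
                           timestamp=1700000000))
# https://www.example.com/news?mrfCacheBuster=1700000000&key=value
```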

Partial Deprecation

Using this flag only as a cache buster is deprecated. Use the mrfCacheBuster flag for this purpose.

# mrfCacheBuster

This flag adds the mrfCacheBuster=${actualTimestamp} query parameter to the extraction query. It is a simplified way of using the extractionQueryParams flag and it is recommended that you use mrfCacheBuster instead of extractionQueryParams when you are including the timestamp parameter only to avoid cache issues.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "mrfCacheBuster": true,
        ...
    },
    ...
}

# feedRipper

It defines the way articles are extracted.

In the case of rssRipper, the uri in sectionDefinitions must be in XML format. New tenants should not use this option.

  • Type: string
  • Default: whiteCollarRipper
  • Format: One of:
    • marfeelPressRipper
    • whiteCollarRipper
    • jsoupRipper
    • rssRipper
    • puppeteerRipper

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "feedRipper": "jsoupRipper",
        ...
    },
    ...
}

# galleryBlackList

Prevents an image from being processed as a gallery image in an article page, and treats it as part of the article's textual content. This is especially useful for images that are links or buttons.

  • Type: string
  • Default: undefined
  • Format: DOMString

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "galleryBlackList": ".author img,[src*='gravatar']",
        ...
    },
    ...
}

# greedyWhitelist

It prioritizes whitelist over blacklist. If set to true, an element that is both blacklisted and whitelisted will show in article pages.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "greedyWhitelist": "true",
        ...
    },
    ...
}

# hasHttps

This flag is used when editing the code of a Tenant to avoid discrepancies between HTTPS sites and the local environment.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "hasHttps": "true",
        ...
    },
    ...
}

# ignoreSSLErrors

Ignores SSL errors on the PhantomJS command.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "ignoreSSLErrors": true,
        ...
    },
    ...
}

WARNING

Remove this flag completely to disable it.

Setting it to false won't work.

# imageCaptionFromAttributes

Specifies the attribute name from the HTML element to be used for the image caption.

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "imageCaptionFromAttributes": "data-source-name",
        ...
    },
    ...
}

# imageResizer

This flag removes the mrf-detailsMedia and mrf-rDetailsMedia classes from an image and adds mrf-noResizeImage.

  • Type: string
  • Default: undefined
  • Format: DOMString

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "imageResizer": ".journalist-photo",
        ...
    },
    ...
}

It is used with the imageResizer($width, $height); mixin in custom.scss to set a new size. For example:

@include imageResizer(78px, 78px);

# imageRulerSizeAttribute

Custom attribute to get the image dimensions from. Useful for example with some lazy-loading strategies.

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "imageRulerSizeAttribute": "data-mrf-width,data-mrf-height",
        ...
    },
    ...
}
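
As an illustration of what reading dimensions from custom attributes means, the Python sketch below parses a lazy-loaded image tag. It assumes the flag value lists the width attribute first and the height attribute second, as the names in the example suggest; the class `ImageSizeReader` is hypothetical and unrelated to Marfeel's extractor.

```python
from html.parser import HTMLParser


class ImageSizeReader(HTMLParser):
    """Collect (width, height) from custom attributes on <img> tags."""

    def __init__(self, size_attributes: str):
        super().__init__()
        # Assumption: the value lists the width attribute, then the height attribute.
        self.width_attr, self.height_attr = [a.strip() for a in size_attributes.split(",")]
        self.sizes = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        width = attributes.get(self.width_attr)
        height = attributes.get(self.height_attr)
        if width and height:
            self.sizes.append((int(width), int(height)))


reader = ImageSizeReader("data-mrf-width,data-mrf-height")
reader.feed('<img data-src="a.jpg" data-mrf-width="640" data-mrf-height="360">')
print(reader.sizes)  # [(640, 360)]
```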

# imageSrcAttribute

The attribute to use to get image sources, instead of src. Useful for example with some lazy-loading strategies.

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "imageSrcAttribute": "href",
        ...
    },
    ...
}

# imageSrcSetAttribute

If a Tenant's images have invalid srcset links but valid links in another attribute, such as data-srcset, this flag names the attribute to read them from.

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "imageSrcSetAttribute": "data-lazy-srcset",
        ...
    },
    ...
}

# itemExtractorType

Chooses between premium (paid content) and Boilerpipe extractors.

  • Type: string
  • Default: boilerpipeExtractor
  • Format: One of:
    • boilerpipeExtractor
    • boilerpipePressExtractor

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "itemExtractorType": "boilerpipePressExtractor",
        ...
    },
    ...
}

# boilerpipeCharset

This flag sets the charset used to extract the details HTML. Use it when the Marfeel version shows wrong text or characters, which can happen when the tenant's desktop site uses a different charset than the one used for extraction.

  • Type: string
  • Default: UTF-8

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "boilerpipeCharset": "iso-8859-1",
        ...
    },
    ...
}

# jsoupCharset

Same as boilerpipeCharset, but it applies to the jsoup ripper.

  • Type: string

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "jsoupCharset": "iso-8859-1",
        ...
    },
    ...
}

# jsoupImageSrcAttribute

Same as imageSrcAttribute, but it is used when extracting with jsoup instead of the whiteCollar.

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "jsoupImageSrcAttribute": "src",
        ...
    },
    ...
}

# marfeelPressToken (Deprecated)

Deprecated

This flag is deprecated. There is no need to add it for new tenants, and it can be safely removed from any definition.json.

Used only on MarfeelPress Tenants. It defines the Marfeel API Token that is needed to authenticate with WordPress sites.

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "marfeelPressToken" : "12345678ABCDEFGH",
        ...
    },
    ...
}

# maxConcurrentExtractionRequests

Defines the maximum number of article pages extracted concurrently. Useful to throttle the extraction.

  • Type: number
  • Default: 3

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "maxConcurrentExtractionRequests": 1,
        ...
    },
    ...
}

# minImageSize

Defines the minimum height and width an image must have to be kept by the Boilerpipe MinSizeFilter.java.

  • Type: number
  • Default: 90

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "minImageSize": 40,
        ...
    },
    ...
}

# minWordsToConsiderFar

The minimum number of words required for an image in the article body that is also used as Top Media to be displayed a second time, within the body of the text.

  • Type: string
  • Default: 70

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "userInterface":{
        ...,
        "features":{
            ...,
            "minWordsToConsiderFar": "300",
            ...
        },
        ...
    },
    ...
}

# multipageGenerator

Defines the query selector multipage generator for the tenant.

  • Type: string
  • Default: undefined
  • Format: DOMString

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "multipageGenerator": ".md-item-media,.swiper-slide",
        ...
    },
    ...
}

# multipageTitleSelector

Defines the query selector for the multipage title.

  • Type: string
  • Default: undefined
  • Format: DOMString

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "multipageTitleSelector": ".titleRanking",
        ...
    },
    ...
}

# multipageUriGenerator

Defines a URI generator according to the string entered.

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "multipageUriGenerator": "pageIndexUriGenerator",
        ...
    },
    ...
}

# nextArticlesStrategy

Defines how the next articles are selected and filtered.

  • Type: string
  • Default: VALID_ITEM
  • Format: One of:
    • NO_FILTER
    • NO_WIDGET
    • HAS_DETAILS
    • VALID_ITEM
    • HAS_VALID_ITEMS
    • WIDGET_ITEM

Example: If the nextArticles were to only use specific widget items, it would resemble the following:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "userInterface":{
        ...,
        "features":{
            ...,
            "nextArticlesStrategy" : "WIDGET_ITEM,envivoIframe",
            ...
        },
        ...
    },
    ...
}

# nextPageBlacklist

Defines the elements to omit from the subsequent pages of an article.

Works like the blacklist.

"nextPageBlacklist": "next_pages_elements_to_remove"

# nextPageWhitelist

Defines the elements to include from subsequent pages of an article.

Works like the whitelist.

"nextPageWhitelist": "next_pages_elements_to_include"

# nextPageLimit

Defines the maximum number of next pages to be extracted.

  • Type: number
  • Default: 35

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "nextPageLimit": 100,
        ...
    },
    ...
}

# notSelectableImages

Defines the images not to be used as Top Media (for example, images used in a photo slider or avatars for authors).

  • Type: string
  • Default: undefined
  • Format: DOMString

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "notSelectableImages": ".rslides img",
        ...
    },
    ...
}

See a usage example in the embed Gallery guide.

# pagePattern

Sets up a global pagePattern that will be applied to all the sections. It needs to contain the path that defines a page and the matching group needed in order to find the page number.

This global pagePattern won't be applied on sections with "enablePagination": "false".

  • Type: string
  • Default: "/page/([0-9]+)/?"
  • Format: regular expression

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "pagePattern": "/p/([0-9]+)",
        ...
    },
    ...
}
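
The role of the matching group can be checked with a short script. This Python sketch assumes the pattern is searched anywhere in the URL path and that the first capture group holds the page number; the function name `extract_page_number` is hypothetical, and Gutenberg's own matching may differ in detail.

```python
import re
from typing import Optional

DEFAULT_PAGE_PATTERN = "/page/([0-9]+)/?"


def extract_page_number(path: str, page_pattern: str = DEFAULT_PAGE_PATTERN) -> Optional[int]:
    """Return the page number captured by the first group, or None if no match."""
    match = re.search(page_pattern, path)
    return int(match.group(1)) if match else None


print(extract_page_number("/sports/page/3/"))             # 3
print(extract_page_number("/sports/p/3", "/p/([0-9]+)"))  # 3
print(extract_page_number("/sports/p/3"))                 # None
```

The last call shows why a site paginating with `/p/3` needs a custom pagePattern: the default pattern finds no page number in that path.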

WARNING

The global pagePattern of a definition's configuration only applies to dynamic sections.

WARNING

The global pagePattern of a definition's configuration is not applied on sections that have more than one feedDefinition.

# quartzInvalidation

Enables or disables the invalidation scheduler (scheduleSectionInvalidationTasks).

  • Type: boolean
  • Default: true

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "quartzInvalidation": "false",
        ...
    },
    ...
}

TIP

This flag should be false if the tenant is using the invalidation API. Tenants using the MarfeelPress plugin use the invalidation API by default.

# queryParamsWhitelist

Defines the allowed, but not mandatory, query params for a URL on article extraction. The rest of the query params will be excluded.

For example, it can be used to extract article pages that use URL query params. In the following URL, the query param is page:

https://www.example.com/example-url.html?page=0%2C2

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "queryParamsWhitelist": "page",
        ...
    },
    ...
}
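
The effect of the whitelist on a URL can be sketched with the standard library. In this Python example the function name `filter_query_params` is hypothetical, and treating the flag value as a comma-separated list of names is an assumption, since the example above only shows a single name.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def filter_query_params(url: str, query_params_whitelist: str) -> str:
    """Drop every query parameter whose name is not in the whitelist."""
    # Assumption: several names would be separated by commas.
    allowed = {name.strip() for name in query_params_whitelist.split(",") if name.strip()}
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k in allowed]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = "https://www.example.com/example-url.html?page=0%2C2&utm_source=newsletter"
print(filter_query_params(url, "page"))
# https://www.example.com/example-url.html?page=0%2C2
```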

TIP

This flag can also be used to allow query params that Gutenberg blacklists, such as utm query params (for example, utm_source).

# respectTopMediaRatio

Forces Top Media to have the same ratio as the original image.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "userInterface":{
        ...,
        "features":{
            ...,
            "respectTopMediaRatio": true,
            ...
        },
        ...
    },
    ...
}

# sanitizeContent

When enabled, the HTMLDocumentProcessor class sanitizes HTML.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "sanitizeContent": true,
        ...
    },
    ...
}

# showCategoriesInDetails

If set to true, categories will show on article pages.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "showCategoriesInDetails": true,
        ...
    },
    ...
}

MarfeelPress specific

This flag is only active with the MarfeelPressFetcher.

# showBreadcrumbsInDetails (MarfeelPress-specific)

If set to true, breadcrumbs will be generated on article pages.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "showBreadcrumbsInDetails": true,
        ...
    },
    ...
}

MarfeelPress specific

This flag is only active with the MarfeelPressFetcher.

# skipAmpCssCheck

Deactivates the AMP file size check (AMP has a 50,000-byte limit).

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "userInterface":{
        ...,
        "features":{
            ...,
            "skipAmpCssCheck": true,
            ...
        },
        ...
    },
    ...
}

Invalid AMP Pages

This flag leads to invalid AMP pages.
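As a rough sketch of what a size check like this involves, the Python snippet below compares the UTF-8 byte length of a stylesheet against the 50,000-byte limit quoted above. The function name `css_within_amp_limit` and the `skip_check` parameter are hypothetical; this is an assumption about the general shape of such a check, not Marfeel's actual code.

```python
AMP_CSS_LIMIT_BYTES = 50_000  # limit quoted in this section


def css_within_amp_limit(css: str, skip_check: bool = False) -> bool:
    """Return True if the CSS passes the size check.

    Hypothetical helper: `skip_check` stands in for skipAmpCssCheck.
    """
    if skip_check:
        # The check is bypassed, so oversized CSS is let through.
        return True
    # The limit is expressed in bytes, so measure the encoded length.
    return len(css.encode("utf-8")) <= AMP_CSS_LIMIT_BYTES


oversized_css = "a{}" * 20_000  # 60,000 bytes
print(css_within_amp_limit(oversized_css))
print(css_within_amp_limit(oversized_css, skip_check=True))
```

Skipping the check does not shrink the CSS; it only stops the oversized stylesheet from being rejected, which is why the resulting page is invalid AMP.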

# skipDate (MarfeelPress-specific)

Used only on MarfeelPress Tenants. When set to true, the date does not appear in the article details.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title" : "Title of the awesome example site",
    "uri" : "www.example.com",
    "configuration" : {
        ...,
        "skipDate" : true,
        ...
    },
    ...
}

MarfeelPress specific

This flag is only active with the MarfeelPressFetcher.

# skipSubtitle (MarfeelPress-specific)

By default, MarfeelPress always displays article tags as subtitles. If this flag is on, article tags are not extracted and never displayed in an article.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "skipSubtitle": true,
        ...
    },
    ...
}

# useLegacyAlibaba

When enabled, the old Alibaba version is used.

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "useLegacyAlibaba": true,
        ...
    },
    ...
}

# useSniVerifier

Enables Server Name Indication verifications (that is, it uses the HTMLfetcher SNI verification).

  • Type: boolean
  • Default: false

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "useSniVerifier": true,
        ...
    },
    ...
}

# videoProviders

List of the video providers useful for the current tenant.

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "videoProviders": "brightcoveAllYou,brightcoveAds",
        ...
    },
    ...
}

# whiteCollarUserAgent

Specifies the User-Agent that whiteCollar uses to browse the site, so that it receives the HTML as rendered on a specific device.

  • Type: string
  • Default: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 MarfeelMan
  • Format: One of:
    • mobile: translates to "Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4 MarfeelMan".
    • marfeel: translates to "Marfeel-crawler".

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "whiteCollarUserAgent": "mobile",
        ...
    },
    ...
}

Different values

Some existing definitions set other values for this flag.

Those values always fall back to the default.
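The mapping and fallback described above can be summarized with a small Python sketch. The User-Agent strings are copied from this section; the names `USER_AGENTS` and `resolve_user_agent` are hypothetical and only illustrate the behavior, they are not Marfeel code.

```python
# Default User-Agent, as listed in this section.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 MarfeelMan"
)

# Recognized flag values and the User-Agent each one translates to.
USER_AGENTS = {
    "mobile": (
        "Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) "
        "AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 "
        "Mobile/12A4345d Safari/600.1.4 MarfeelMan"
    ),
    "marfeel": "Marfeel-crawler",
}


def resolve_user_agent(flag_value=None):
    """Map a whiteCollarUserAgent value to a User-Agent string.

    Unset or unrecognized values fall back to the default.
    """
    return USER_AGENTS.get(flag_value, DEFAULT_USER_AGENT)


print(resolve_user_agent("marfeel"))
print(resolve_user_agent("desktop") == DEFAULT_USER_AGENT)
```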

# whitelist

Enables the extraction of elements from article pages. See more information in the documentation about blacklist and whitelist.

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "whitelist": "[href=/author/],slideshow-subtitle",
        ...
    },
    ...
}

# whiteCollarScript

Establishes the path of the default WhiteCollar file to be used by the sections on sectionDefinitions.

Needs to be placed in the configuration of the definition.json.

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "whiteCollarScript": "index/src/whiteCollar/main.js",
        ...
    },
    ...
}

TIP

To configure it for a specific section, refer to this article.

# widgets

Defines the widgets to be used.

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "userInterface":{
        ...,
        "features":{
            ...,
            "widgets": "mostRead",
            ...
        },
        ...
    },
    ...
}

# validArticleQueryParams

Some Tenants have articles whose URLs are built with query parameters. To replicate these articles on the customer's Marfeel PWA, set this flag to the query parameters that identify a valid article.

  • Type: string
  • Default: undefined

Example:

{
    ...,
    "title":"Title of the awesome example site",
    "uri":"www.example.com",
    "configuration":{
        ...,
        "validArticleQueryParams": "&aid=,&MAID=,&MFID=",
        ...
    },
    ...
}