Marfeel crawlers

xavi.marti · June 27, 2022, 2:53pm

The Marfeel tracker does not send page metadata with every request; it sends only the essentials. This keeps it lightweight and minimizes the bandwidth it consumes. All the remaining data is obtained through our crawlers.

Many URLs may point to the same content. Marfeel crawlers only crawl canonical URLs and their amphtml counterparts; all URLs pointing to the same canonical are stored as aliases.

Make sure both the canonical and amphtml link rel elements are set correctly on all your content so that Marfeel crawling works properly.
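
To check a page before Marfeel crawls it, you can extract both link elements from its HTML. The following is a minimal sketch using only the Python standard library; the helper name find_link_rels and the example.com URLs are hypothetical and not part of any Marfeel tooling.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collects the href of <link rel="canonical"> and <link rel="amphtml">."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel may hold several space-separated tokens; compare case-insensitively.
        for rel in (attributes.get("rel") or "").lower().split():
            if rel in ("canonical", "amphtml") and href:
                # Keep the first occurrence if the element is repeated.
                self.links.setdefault(rel, href)


def find_link_rels(html):
    """Return a dict with the 'canonical' and 'amphtml' hrefs found in html."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.links


# Hypothetical page used only for illustration.
sample = """
<html><head>
<link rel="canonical" href="https://example.com/news/story">
<link rel="amphtml" href="https://example.com/amp/news/story">
</head><body></body></html>
"""
print(find_link_rels(sample))
```

A page that returns an empty dict, or only one of the two keys when an AMP version exists, is worth reviewing.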

Learn more about how Marfeel reads your pages’ metadata.

Good citizen practices

All Marfeel bots follow these rules to be good web citizens:

Sites are not proactively crawled to identify new content. Marfeel only crawls URLs with active users.
Marfeel limits the number of concurrent requests to each of our clients’ servers. Re-crawls are rate-limited to 1000 requests every 5 minutes.
All assets are centrally cached so different bots may reuse them without having to fetch them separately.
Redirects are not followed unless necessary.
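
To make the re-crawl limit concrete, here is a sketch of a sliding-window limiter set to 1000 requests per 300 seconds. It only illustrates what such a limit means; it is not Marfeel's implementation, and the class name SlidingWindowLimiter and its parameters are assumptions.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allows at most max_requests within any window_seconds interval."""

    def __init__(self, max_requests=1000, window_seconds=300):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def allow(self, now):
        """Return True and record the request if it fits in the window."""
        # Drop requests that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_requests:
            self._timestamps.append(now)
            return True
        return False


limiter = SlidingWindowLimiter()
# Simulate 1200 requests arriving within the same second.
allowed = sum(limiter.allow(now=0.0) for _ in range(1200))
print(allowed)
```

At this limit the sustained average is roughly 3.3 requests per second.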

When a domain starts using Marfeel, crawling may be more intense during the first days, as there is a lot of content to discover. It will, however, respect the pace of the servers and slow down over time.

Marfeel crawlers

Marfeel currently uses 3 types of crawlers.

Editorial crawler

The Marfeel Editorial Crawler crawls a URL and builds the editorial profile of the page from its metadata. It crawls a URL when it first gets a hit and every time its content is modified.

The user agent used by the editorial crawler is:

Mozilla/5.0 (compatible; NewsRoom.BI/0.1; +http://www.newsroom.bi/bot.html)

Audits crawler

To detect issues with structured data, meta tags and many other aspects of our clients’ HTML, Marfeel periodically crawls all relevant URLs (the ones that have traffic) using the following user agents:

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA51N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Mobile Safari/537.36 (compatible; mrfCompass-Booldog/1.0)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 (compatible; mrfCompass-Booldog/1.0)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; mrfCompass-Marshall/1.0)

mrfCompass-Booldog first crawls each URL using a mobile user agent. If the response includes a Vary: User-Agent header, it crawls the URL again using a desktop user agent.
mrfCompass-Marshall will crawl all amphtml links found by mrfCompass-Booldog.
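
The mobile-then-desktop decision described above can be sketched as follows. This is an illustration of the rule as stated, not Marfeel's code; the function names and the "mobile"/"desktop" labels are assumptions.

```python
def varies_on_user_agent(response_headers):
    """Return True if the response declares Vary: User-Agent.

    Header names and Vary tokens are compared case-insensitively, and Vary
    may list several comma-separated header names.
    """
    for name, value in response_headers.items():
        if name.lower() != "vary":
            continue
        tokens = [token.strip().lower() for token in value.split(",")]
        if "user-agent" in tokens:
            return True
    return False


def crawl_variants(mobile_response_headers):
    """Return which variants get crawled, given the mobile crawl's headers."""
    variants = ["mobile"]  # the first crawl always uses the mobile user agent
    if varies_on_user_agent(mobile_response_headers):
        variants.append("desktop")
    return variants


print(crawl_variants({"Vary": "Accept-Encoding, User-Agent"}))
print(crawl_variants({"Vary": "Accept-Encoding"}))
```

If your site serves different HTML to mobile and desktop from the same URL, sending Vary: User-Agent is what triggers the audit of both versions.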

Flowcards crawler

Flowcards that load content directly from specific URLs also use a bot to fetch that content. This bot identifies itself with the following user agent:

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA51N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Mobile Safari/537.36 (compatible; mrfCompass-Jukebox/1.0)

The crawl frequency respects the Cache-Control header returned in the response.
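
The source does not say which Cache-Control directives are taken into account. As one plausible reading, the sketch below assumes that a fetched copy is reused until its max-age expires and is fetched again afterwards. The function names are hypothetical, and other directives such as no-store are not handled here.

```python
def max_age_seconds(cache_control):
    """Return the max-age value in seconds, or None if absent or invalid."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age":
            value = value.strip().strip('"')
            if value.isdigit():
                return int(value)
    return None


def is_fresh(fetched_at, now, cache_control):
    """Return True while a copy fetched at fetched_at is within max-age.

    Assumption: without a usable max-age the copy is treated as stale,
    so the content would be fetched again.
    """
    max_age = max_age_seconds(cache_control)
    if max_age is None:
        return False
    return now - fetched_at < max_age


print(max_age_seconds("public, max-age=600"))
print(is_fresh(fetched_at=0, now=300, cache_control="public, max-age=600"))
```

Under this assumption, a longer max-age on the URLs your Flowcards load from means fewer requests from the bot.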

Social experiences

Social experiences such as Facebook, Twitter (X), Telegram, Reddit and LinkedIn use the following user agent:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36 (compatible; mrfCompass-Social/1.0)

These experiences and services use Marfeel’s public IPs when crawling your site.

Whitelisting Marfeel crawlers

Many hosting and CDN providers include WAF services that may consider Marfeel bots to be potentially malicious and block them.
To make sure Marfeel can access and monitor your website, you can either whitelist the user agents listed above or whitelist our static IPs, available here.

If you have content behind a hard paywall, make sure that requests from these IPs can access it. Otherwise, modules such as content metrics, recommender and audits might lack the information they need to function properly.
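
If you choose the user-agent option, the product tokens listed in this article can be matched with a single pattern. The sketch below is an assumption about how such a rule might be written, not an official Marfeel pattern; the name marfeel_bot_name is hypothetical. User agents can be spoofed, so treat a match as a hint rather than proof.

```python
import re

# Product tokens taken from the user agents listed in this article.
MARFEEL_BOT_PATTERN = re.compile(
    r"compatible; (NewsRoom\.BI|mrfCompass-(?:Booldog|Marshall|Jukebox|Social))/[\d.]+"
)


def marfeel_bot_name(user_agent):
    """Return the Marfeel bot token found in user_agent, or None."""
    match = MARFEEL_BOT_PATTERN.search(user_agent)
    return match.group(1) if match else None


print(marfeel_bot_name(
    "Mozilla/5.0 (compatible; NewsRoom.BI/0.1; +http://www.newsroom.bi/bot.html)"
))
print(marfeel_bot_name(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/98.0.4758.102 Safari/537.36"
))
```

Most WAF products accept an equivalent regular expression or a "contains" rule on the same tokens.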

Cloudflare

If you are using Cloudflare as your CDN provider, you can whitelist Marfeel crawlers’ IPs following these steps:

1. In your Cloudflare console, click the firewall icon and go to the Tools tab.
2. Add each Marfeel crawler IP address under IP Access Rules:
a. Enter the IP address.
b. Choose Whitelist as the action to apply.
c. Choose the website the whitelisting rule applies to.
3. Click Add.
4. Repeat for each IP.

Verifying Marfeel Crawlers

All Marfeel crawler IP addresses have a reverse DNS record pointing to crawler.marfeel.com.
You can use this to verify the authenticity of Marfeel bots by following these steps:

1. Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
2. Verify that the domain name is crawler.marfeel.com.
3. Run a forward DNS lookup on the domain name retrieved in step 1, using the host command.
4. Verify that one of the returned addresses matches the original accessing IP address from your logs.

$ host 162.55.235.182
182.235.55.162.in-addr.arpa domain name pointer crawler.marfeel.com.

$ host crawler.marfeel.com
crawler.marfeel.com is an alias for vampiresquid.het.mrf.io.
vampiresquid.het.mrf.io has address 162.55.235.186
vampiresquid.het.mrf.io has address 162.55.235.182
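
The same two-step check can be scripted. The sketch below uses Python's standard socket module. The function name is_marfeel_crawler and the injectable lookup parameters are assumptions added for illustration and testing. Note that socket.gethostbyname_ex only resolves IPv4 addresses.

```python
import socket

EXPECTED_HOST = "crawler.marfeel.com"


def is_marfeel_crawler(ip,
                       reverse_lookup=socket.gethostbyaddr,
                       forward_lookup=socket.gethostbyname_ex):
    """Return True if ip passes the reverse and forward DNS checks."""
    # Steps 1 and 2: the reverse lookup must point to crawler.marfeel.com.
    try:
        hostname, _aliases, _addresses = reverse_lookup(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    if hostname.rstrip(".").lower() != EXPECTED_HOST:
        return False

    # Steps 3 and 4: the forward lookup must include the original IP.
    try:
        _name, _aliases, addresses = forward_lookup(EXPECTED_HOST)
    except OSError:
        return False
    return ip in addresses


# Offline demonstration with stand-in resolvers that mirror the host output above.
def demo_reverse(ip):
    return ("crawler.marfeel.com", [], [ip])


def demo_forward(hostname):
    return ("vampiresquid.het.mrf.io", ["crawler.marfeel.com"],
            ["162.55.235.186", "162.55.235.182"])


print(is_marfeel_crawler("162.55.235.182", demo_reverse, demo_forward))
```

Calling is_marfeel_crawler(ip) without the extra arguments performs real DNS lookups, so results depend on your resolver and on Marfeel's current DNS records.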