# Crawling

The Compass crawler always uses the same user agent:

Mozilla/5.0 (compatible; NewsRoom.BI/0.1; +http://www.newsroom.bi/bot.html

Make sure you whitelist it in every firewall and WAF that sits in front of your site.
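If you also want to recognize the crawler in your own access logs or middleware, a minimal sketch could look like the following. The helper name `is_compass_crawler` is hypothetical, and matching on the `NewsRoom.BI/` product token (taken from the string above) is an assumption, not an official Compass rule.

```python
from typing import Optional

# Product token taken from the user-agent string documented above.
# Matching on this token alone is an assumption made for illustration.
COMPASS_UA_TOKEN = "NewsRoom.BI/"


def is_compass_crawler(user_agent: Optional[str]) -> bool:
    """Return True if a User-Agent header value looks like the Compass crawler."""
    if not user_agent:
        return False
    # Case-insensitive substring match on the product token.
    return COMPASS_UA_TOKEN.lower() in user_agent.lower()
```

Keep in mind that a user-agent header can be sent by any client, so a check like this is useful for log filtering but is not proof of identity.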

# Crawling Logic

When you activate Compass, our crawler does not go through your entire site. It discovers each page only once a reader has visited it.

As soon as a page receives its first visit after Compass is activated, the crawler starts working. It detects the canonical URL of the page and uses that URL to gather all the information Compass needs.
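To check what canonical URL a page exposes, you can look for its `<link rel="canonical">` tag. The sketch below uses only the Python standard library. The names `CanonicalLinkParser` and `find_canonical_url` are hypothetical, and this is not necessarily how the Compass crawler detects the canonical URL internally.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first canonical link is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical_url(html: str, page_url: str) -> Optional[str]:
    """Return the absolute canonical URL declared in html, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical is None:
        return None
    # Resolve relative hrefs against the URL of the visited page.
    return urljoin(page_url, parser.canonical)
```

For example, calling `find_canonical_url(html, "https://example.com/amp/story")` on a page whose head contains `<link rel="canonical" href="/story">` returns `https://example.com/story`.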

This information includes:

  • Publication date,
  • Title,
  • Main image,
  • Author(s),
  • Section(s),
  • ...

Once the canonical URL of a page has been successfully registered, it is not crawled again. All the page's information always comes from the canonical URL.

TIP

Compass analyzes only the canonical version of a page to retrieve its information. However, every time a page of your site receives a visit, our crawler looks at that page and searches for its canonical URL.

Google AMP pages, Facebook Instant Articles, and native apps must all have a valid canonical link pointing to the original HTML page. Compass never extracts information directly from those kinds of pages.
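The behavior described in this section can be sketched as a small registry: every visit resolves to a canonical URL, and a canonical URL is crawled only if it has not been registered yet. The class `CanonicalRegistry` and its callbacks `resolve_canonical` and `crawl` are hypothetical names used for illustration only. They do not describe the actual Compass implementation.

```python
from typing import Callable, Dict, Optional

PageInfo = Dict[str, str]


class CanonicalRegistry:
    """Illustrative model: one crawl per successfully registered canonical URL."""

    def __init__(
        self,
        resolve_canonical: Callable[[str], Optional[str]],
        crawl: Callable[[str], PageInfo],
    ) -> None:
        self._resolve = resolve_canonical
        self._crawl = crawl
        self._pages: Dict[str, PageInfo] = {}

    def on_visit(self, visited_url: str) -> Optional[PageInfo]:
        # The canonical URL is looked up on every visit.
        canonical = self._resolve(visited_url)
        if canonical is None:
            return None
        # The canonical page is crawled only if it is not registered yet.
        # If crawl() raises, nothing is stored and a later visit tries again.
        if canonical not in self._pages:
            self._pages[canonical] = self._crawl(canonical)
        return self._pages[canonical]
```

In this model, a visit to an AMP variant and a visit to the original HTML page both resolve to the same canonical URL, so the page is crawled once and both visits share the same information.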

# In case of error

If the canonical URL of a page is not accessible, the crawler automatically retries extracting the page's information one hour later.

If the canonical URL is still not available after 10 attempts, the crawler stops trying to read that page.
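The retry policy above can be sketched as a small scheduling function. The names `RETRY_DELAY`, `MAX_ATTEMPTS`, and `next_attempt_time` are hypothetical. The sketch also assumes that the limit of 10 counts every attempt, including the first one, which the text above does not specify.

```python
from datetime import datetime, timedelta
from typing import Optional

RETRY_DELAY = timedelta(hours=1)  # one hour between attempts
MAX_ATTEMPTS = 10                 # assumption: includes the first attempt


def next_attempt_time(failed_attempts: int, last_attempt: datetime) -> Optional[datetime]:
    """Return when to retry a page, or None once the attempt limit is reached."""
    if failed_attempts >= MAX_ATTEMPTS:
        # The page is given up on and not read again.
        return None
    return last_attempt + RETRY_DELAY
```

Under these assumptions, a page whose canonical URL stays unreachable is abandoned about nine hours after the first failed attempt.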