# Crawling

The Compass crawler always uses the same user agent:

Mozilla/5.0 (compatible; NewsRoom.BI/0.1; +http://www.newsroom.bi/bot.html)

Make sure you whitelist it in any firewall and WAF you have.
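As an illustration of what an allowlist rule could look like, here is a minimal Python sketch that recognizes the crawler from the `User-Agent` header. The function name `is_compass_crawler` is hypothetical, and matching on the `NewsRoom.BI` product token rather than on the full string is an assumption made for this example, not a documented Compass recommendation.

```python
# Product token taken from the user-agent string documented above.
# Matching on the token alone (instead of the full string) is an assumption.
COMPASS_TOKEN = "NewsRoom.BI"


def is_compass_crawler(user_agent: str) -> bool:
    """Return True when the User-Agent header carries the Compass product token."""
    return COMPASS_TOKEN in (user_agent or "")


ua = "Mozilla/5.0 (compatible; NewsRoom.BI/0.1; +http://www.newsroom.bi/bot.html)"
print(is_compass_crawler(ua))
```

Most firewalls and WAFs express the same check as a header-match rule, so the Python function only serves to make the condition explicit.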

# Crawling Logic

When you activate Compass, the crawler does not go through your entire site. It discovers each page only once a reader has visited it.

As soon as a page receives its first visit after Compass activation, the crawler starts working. It detects the canonical URL of the page and uses it to gather all the information Compass needs.

This information includes:

  • Publication date
  • Title
  • Main image
  • Author(s)
  • Section(s)
  • ...

Once the canonical URL of a page has been successfully registered, it is not crawled again. All the page's information always comes from the canonical URL.

TIP

Compass analyzes only the canonical version of a page to retrieve its information. However, the crawler looks at every page of your site that receives visits and searches for its canonical URL each time.

Google AMP pages, Facebook Instant Articles, and native apps must all have a valid canonical link pointing to the original HTML page. Compass never extracts information directly from those kinds of pages.
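The discovery step described above can be sketched in Python as follows. This is only an illustration using the standard library. The names `CanonicalFinder`, `find_canonical`, and `crawl_target`, as well as the in-memory `registered` set, are hypothetical and do not describe how Compass is actually implemented.

```python
from html.parser import HTMLParser
from typing import Optional, Set


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared by a page, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonical


def crawl_target(visited_html: str, registered: Set[str]) -> Optional[str]:
    """Return the canonical URL to crawl for a visited page.

    Returns None when the page declares no canonical URL or when that
    canonical URL has already been registered (assumed bookkeeping).
    """
    canonical = find_canonical(visited_html)
    if canonical is None or canonical in registered:
        return None
    return canonical
```

In this sketch, an AMP page and its original HTML page that declare the same canonical URL resolve to a single crawl target, and a page without a canonical link yields nothing to crawl.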

# Publication date

The publication date is extracted from the following sources, in this order. Note that if multiple publication dates exist, no publication date is extracted.

  1. JSON-LD (for more details, visit https://schema.org/datePublished)
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "datePublished": "2021-08-01"
}
</script>
  2. Meta article type article:published_time
<meta property="article:published_time" content="2021-08-01T17:41:45+00:00" />
  3. Meta item property type
<meta itemprop="datePublished" content="2021-08-01" id="date">
  4. Time item property type as datetime
<time itemprop="datePublished" datetime="2021-08-01T09:00Z">
  5. Time item property type as content
<time itemprop="datePublished" content="2021-08-01T09:00Z">
  6. Time item property type as node value
<time itemprop="datePublished">2021-08-01T09:00Z</time>
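To make the fallback order concrete, the following Python sketch walks the same list of sources and returns the first value it finds. It is an approximation built on the standard library. The class `DateSources`, the function `publication_date`, and the source labels in `ORDER` are hypothetical names, and the sketch deliberately does not model the rule about multiple publication dates, since the text does not specify how conflicts are detected.

```python
import json
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Hypothetical labels for the sources listed above, in priority order.
ORDER = ["jsonld", "meta_article", "meta_itemprop",
         "time_datetime", "time_content", "time_text"]


class DateSources(HTMLParser):
    """Gathers the first candidate value found for each source."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Dict[str, str] = {}
        self._capture: Optional[str] = None   # "jsonld" or "time" while inside one
        self._buffer: List[str] = []

    def _keep(self, source: str, value: Optional[str]) -> None:
        value = (value or "").strip()
        if value and source not in self.found:
            self.found[source] = value

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and a.get("type") == "application/ld+json":
            self._capture, self._buffer = "jsonld", []
        elif tag == "meta" and a.get("property") == "article:published_time":
            self._keep("meta_article", a.get("content"))
        elif tag == "meta" and a.get("itemprop") == "datePublished":
            self._keep("meta_itemprop", a.get("content"))
        elif tag == "time" and a.get("itemprop") == "datePublished":
            self._keep("time_datetime", a.get("datetime"))
            self._keep("time_content", a.get("content"))
            self._capture, self._buffer = "time", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buffer)
        if tag == "script" and self._capture == "jsonld":
            try:
                data = json.loads(text)
            except ValueError:
                data = None  # malformed JSON-LD is ignored in this sketch
            if isinstance(data, dict) and isinstance(data.get("datePublished"), str):
                self._keep("jsonld", data["datePublished"])
            self._capture = None
        elif tag == "time" and self._capture == "time":
            self._keep("time_text", text)
            self._capture = None


def publication_date(html: str) -> Optional[str]:
    """Return the raw date string from the highest-priority source present."""
    sources = DateSources()
    sources.feed(html)
    sources.close()
    for name in ORDER:
        if name in sources.found:
            return sources.found[name]
    return None
```

The function returns the date string exactly as written in the page. Parsing it into a timestamp is left out because the accepted date formats are not described here.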

# Title

The title attribute is extracted from the HTML title tag:

<title>Article title</title>

# Main image

The main article image is extracted either from JSON-LD data (for more details, visit https://schema.org/image):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "image": "mainImage.jpg"
}
</script>

Or from Meta OG types:

<meta property="og:image" content="https://mywebsite.com/images/mainImage.jpg" />

# Author(s)

Authors are extracted from multiple tags in the following order.

  1. Meta tag nrbi:authors
<meta property="nrbi:authors" content="Author One;Author Two">
  2. JSON-LD (for more details, visit https://schema.org/author)
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "author": [
    {
      "@type":"Person",
      "name":"Author One"
    },
    {
      "@type":"Person",
      "name":"Author Two"
    }
  ]
}
</script>
<script type="application/ld+json">
{
  "author": "Author One"
}
</script>
  3. Meta tag article:author
<meta property="article:author" content="Author One">
  4. Meta tag name="author"
<meta name="author" content="Author One">
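The author chain can be sketched as a small Python function that takes the raw value of each source and returns the names from the first source that yields any. The function names `split_names`, `jsonld_authors`, and `pick_authors` are hypothetical. Treating `article:author` and `name="author"` as single names, without splitting on semicolons, is an assumption based only on the examples above.

```python
from typing import Any, List, Optional


def split_names(value: Optional[str]) -> List[str]:
    """Split a semicolon-separated nrbi:authors value into trimmed names."""
    return [part.strip() for part in (value or "").split(";") if part.strip()]


def single_name(value: Optional[str]) -> List[str]:
    """Wrap a single-author value in a list (assumed to hold one name)."""
    name = (value or "").strip()
    return [name] if name else []


def jsonld_authors(author: Any) -> List[str]:
    """Normalize the JSON-LD author field: a string, an object, or a list of either."""
    items = author if isinstance(author, list) else [author]
    names = []
    for item in items:
        if isinstance(item, str):
            names.append(item.strip())
        elif isinstance(item, dict) and isinstance(item.get("name"), str):
            names.append(item["name"].strip())
    return [name for name in names if name]


def pick_authors(nrbi: Optional[str] = None, jsonld: Any = None,
                 article_author: Optional[str] = None,
                 meta_author: Optional[str] = None) -> List[str]:
    """Return the authors from the first source in the chain that yields any."""
    candidates = [
        split_names(nrbi),
        jsonld_authors(jsonld),
        single_name(article_author),
        single_name(meta_author),
    ]
    return next((names for names in candidates if names), [])


print(pick_authors(nrbi="Author One;Author Two"))
```

Because the `nrbi:authors` tag comes first in the chain, it can be used to override author data that other tags on the page already expose.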

# Section(s)

Sections are extracted from multiple tags in the following order.

  1. Meta tag nrbi:sections
<meta property="nrbi:sections" content="Parent section;Child Section">
  2. JSON-LD (for more details, visit https://schema.org/articleSection)
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "articleSection": "News"
}
</script>
  3. Meta tag article:section
<meta property="article:section" content="Parent section;Child Section">

# In case of error

If the canonical URL of a page is not accessible, the crawler automatically retries extracting information from the page one hour later.

If the canonical URL is still not available after 10 attempts, the crawler stops trying to read that page.
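The retry policy can be expressed as a small scheduling function. The sketch below rests on two assumptions that the text does not settle: that the 10 attempts include the first one, and that each retry is scheduled one hour after the previous failed attempt. The names `MAX_ATTEMPTS`, `RETRY_DELAY`, and `next_attempt` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_ATTEMPTS = 10                  # limit stated above; counting the first attempt is an assumption
RETRY_DELAY = timedelta(hours=1)   # delay stated above


def next_attempt(failed_attempts: int, last_attempt: datetime) -> Optional[datetime]:
    """Return when the page should be retried, or None once the crawler gives up."""
    if failed_attempts >= MAX_ATTEMPTS:
        return None
    return last_attempt + RETRY_DELAY


first_failure = datetime(2021, 8, 1, 9, 0)
print(next_attempt(1, first_failure))
```

Under these assumptions, a page whose canonical URL stays unreachable is abandoned roughly nine hours after the first failed attempt.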