
Load data exports from Google Cloud Storage into BigQuery

This guide walks through loading Marfeel data exports from Google Cloud Storage into BigQuery. It covers creating a compatible table, defining the schema with JSON or DDL, and configuring recurring transfers through the BigQuery console.

Before starting, make sure the data stream to Google Cloud Storage is already configured and exporting CSV files to your bucket.

In the dataset that will receive the data, create a table with “Empty table” as the source. Edit the schema as text and make it compatible with the following definition:

[
{
"mode": "REQUIRED",
"name": "user_id",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "registered_user_id",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "user_status",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "rfv",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "has_consent",
"type": "BOOLEAN"
},
{
"mode": "REQUIRED",
"name": "session_id",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "page_id",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "site_id",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "editorial_id",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "editorial_title",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "editorial_publish_time",
"type": "TIMESTAMP"
},
{
"mode": "REQUIRED",
"name": "editorial_update_time",
"type": "TIMESTAMP"
},
{
"mode": "REQUIRED",
"name": "sections_array",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "authors_array",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "entities_array",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "tags_array",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "page_type",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "start_time",
"type": "TIMESTAMP"
},
{
"mode": "REQUIRED",
"name": "end_time",
"type": "TIMESTAMP"
},
{
"mode": "REQUIRED",
"name": "event_date",
"type": "DATE"
},
{
"mode": "REQUIRED",
"name": "event_local_time",
"type": "TIMESTAMP"
},
{
"mode": "REQUIRED",
"name": "ip_v4",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "ip_v6",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "user_country",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "user_region",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "user_city",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "operating_system",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "user_agent",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "full_url",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "canonical_url",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "referrer",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "browser",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "browser_version",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "device_category",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "url_params",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "search_phrase",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "client_event_time",
"type": "TIMESTAMP"
},
{
"mode": "REQUIRED",
"name": "utm_source",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "utm_medium",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "utm_campaign",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "utm_content",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "utm_term",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "duration",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "scroll",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "dns_timing",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "connect_timing",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "response_start_timing",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "fetch_timing",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "dom_interactive_timing",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "dom_content_loaded_timing",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "dom_complete_timing",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "fcp_timing",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "PageVars",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "SessionVars",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "UserVars",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "has_web_vitals",
"type": "BOOLEAN"
},
{
"mode": "REQUIRED",
"name": "lcp",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "fid",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "cls",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "recirculation_source",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "r",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "f",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "v",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "engagement_time",
"type": "INTEGER"
}
]

The corresponding DDL (table name omitted) is:

CREATE TABLE `...`
(
user_id STRING NOT NULL,
registered_user_id STRING NOT NULL,
user_status INT64 NOT NULL,
rfv INT64 NOT NULL,
has_consent BOOL NOT NULL,
session_id STRING NOT NULL,
page_id STRING NOT NULL,
site_id INT64 NOT NULL,
editorial_id INT64 NOT NULL,
editorial_title STRING NOT NULL,
editorial_publish_time TIMESTAMP NOT NULL,
editorial_update_time TIMESTAMP NOT NULL,
sections_array STRING NOT NULL,
authors_array STRING NOT NULL,
entities_array STRING NOT NULL,
tags_array STRING NOT NULL,
page_type INT64 NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL,
event_date DATE NOT NULL,
event_local_time TIMESTAMP NOT NULL,
ip_v4 STRING NOT NULL,
ip_v6 STRING NOT NULL,
user_country STRING NOT NULL,
user_region STRING NOT NULL,
user_city STRING NOT NULL,
operating_system STRING NOT NULL,
user_agent STRING NOT NULL,
full_url STRING NOT NULL,
canonical_url STRING NOT NULL,
referrer STRING NOT NULL,
browser STRING NOT NULL,
browser_version STRING NOT NULL,
device_category INT64 NOT NULL,
url_params STRING NOT NULL,
search_phrase STRING NOT NULL,
client_event_time TIMESTAMP NOT NULL,
utm_source STRING NOT NULL,
utm_medium STRING NOT NULL,
utm_campaign STRING NOT NULL,
utm_content STRING NOT NULL,
utm_term STRING NOT NULL,
duration INT64 NOT NULL,
scroll INT64 NOT NULL,
dns_timing INT64 NOT NULL,
connect_timing INT64 NOT NULL,
response_start_timing INT64 NOT NULL,
fetch_timing INT64 NOT NULL,
dom_interactive_timing INT64 NOT NULL,
dom_content_loaded_timing INT64 NOT NULL,
dom_complete_timing INT64 NOT NULL,
fcp_timing INT64 NOT NULL,
PageVars STRING NOT NULL,
SessionVars STRING NOT NULL,
UserVars STRING NOT NULL,
has_web_vitals BOOL NOT NULL,
lcp INT64 NOT NULL,
fid INT64 NOT NULL,
cls INT64 NOT NULL,
recirculation_source STRING NOT NULL,
r INT64 NOT NULL,
f INT64 NOT NULL,
v INT64 NOT NULL,
engagement_time INT64 NOT NULL
);
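The DDL above can also be derived mechanically from the JSON schema, which is a useful cross-check if either definition is edited later. A minimal sketch (the type mapping covers only the types that appear in this schema, and the `schema_to_ddl` helper is illustrative, not part of any BigQuery SDK):

```python
import json

# Legacy JSON schema types -> GoogleSQL DDL types (only those used in this schema).
TYPE_MAP = {
    "STRING": "STRING",
    "INTEGER": "INT64",
    "BOOLEAN": "BOOL",
    "TIMESTAMP": "TIMESTAMP",
    "DATE": "DATE",
}

def schema_to_ddl(schema_json: str, table_name: str = "`...`") -> str:
    """Render a CREATE TABLE statement from a BigQuery JSON schema."""
    fields = json.loads(schema_json)
    columns = []
    for field in fields:
        ddl_type = TYPE_MAP[field["type"]]
        # REQUIRED mode in the JSON schema corresponds to NOT NULL in DDL.
        not_null = " NOT NULL" if field.get("mode") == "REQUIRED" else ""
        columns.append(f'{field["name"]} {ddl_type}{not_null}')
    return f"CREATE TABLE {table_name}\n(\n" + ",\n".join(columns) + "\n);"

example = '[{"mode": "REQUIRED", "name": "user_id", "type": "STRING"}]'
print(schema_to_ddl(example))
# CREATE TABLE `...`
# (
# user_id STRING NOT NULL
# );
```

Running this over the full schema shown earlier should reproduce the DDL statement above column for column.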

Recurring loads of data from Cloud Storage to BigQuery can be configured in several ways as described in the documentation.

Since this configuration only needs to be done once, programmatic approaches such as the API or the Java SDK are worthwhile only for users already comfortable with those tools.

The bq command-line tool does not allow configuring the frequency of transfers; they default to daily. The procedure also requires opening a separate browser window for authorization and copying a code, which can be confusing for non-technical users.

The recommended approach is to create the transfer through the BigQuery console:

  1. In the BigQuery menu, click on “Data transfers”.
  2. Click on “+ CREATE TRANSFER”.
  3. Select “Google Cloud Storage” as the source.
  4. Enter a descriptive name for the transfer.
  5. In the schedule options, choose how often the data should be transferred. For transfers more frequent than daily, choose “Custom” in the “Repeats” field. For example, to run every half hour, enter “every 30 minutes” in the “Custom Schedule” field.
  6. Select the destination dataset and table.
  7. In the Cloud Storage URI, click on “BROWSE”, select the bucket where the CSV files are being exported, and type “*.csv” in the “Filename” field. Click on “SELECT”.
  8. Choose whether to delete or keep the files after they have been transferred.
  9. Make sure that the file format is “CSV”.
  10. In the CSV transfer options, set “Header rows to skip” to 1 and check the “Allow quoted newlines” option, since data fields like titles may contain line feeds.
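Step 10 matters because quoted fields such as `editorial_title` can contain embedded line feeds, and a record then spans several physical lines. Python's `csv` module parses quoted newlines the same way, so a quick local sanity check on an exported file can catch malformed rows before the transfer runs. This is a sketch, not part of the export tooling; the 64-column count is taken from the schema above:

```python
import csv
import io

EXPECTED_COLUMNS = 64  # number of fields in the schema above

def check_export(csv_text: str, expected_columns: int = EXPECTED_COLUMNS) -> int:
    """Parse a CSV export, skipping one header row, and verify that every
    record has the expected column count even when quoted fields contain
    line feeds. Returns the number of data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # corresponds to "Header rows to skip" = 1
    if len(header) != expected_columns:
        raise ValueError(f"header has {len(header)} columns, expected {expected_columns}")
    rows = 0
    for row in reader:
        if len(row) != expected_columns:
            raise ValueError(f"record {rows + 1} has {len(row)} columns")
        rows += 1
    return rows

# A quoted newline is parsed as part of the field, not as a record boundary.
sample = 'a,b,c\n1,"title with\na line feed",3\n'
print(check_export(sample, expected_columns=3))  # -> 1
```

If the check raises on a real export, the usual cause is a writer that did not quote fields containing commas or newlines.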

If you run into issues during the transfer process, see troubleshooting raw data exports for common problems and solutions.