logo

When, why and how you should use nested Maps in BigQuery (GA4 custom dimensions as example).

No… Nested maps don’t exist by default in BigQuery but we can create
  • simple schema conventions to implement nested maps
  • and powerful functions to exploit them.
This article aims at describing what is a nested maps, why, when and how you should use them.

What is a nested map ?

GA4 custom dimensions and metrics are great examples of nested maps.
GA4 enables you to send events with parameters that are not originally provided by Google. For example, the author and the number_of_pages in the following event can be used to create respectively a custom dimension and a custom metric
javascript
gtag('event','read_article', { "author":"Bill Q", "title":"How to Build a Backpack", "number_of_pages":2, });
You can create this way, up to 50 custom dimensions and metrics.
When you extract GA4 events to BigQuery, each line represents an event with the same schema.
Here are the two columns that host custom dimensions and metrics
  • the event_params column has the following schema
    • Image without caption
  • and user_properties
    • Image without caption
They both have more or less the same shema. The schema of a nested map !
We define a nested map as a nested array containing a “key” and a “value” column with no duplicated keys in the array.
sql
SELECT 'read_article' as event_name, [struct("author" as key, struct("Bill Q" as string_value, cast(null as INT64) as int_value, cast(null as FLOAT64) as float_value) as value), struct("title" as key, struct("How to Build a Backpack" as string_value, cast(null as INT64) as int_value, cast(null as FLOAT64) as float_value) as value), struct("number_of_page" as key, struct(cast(null as string) as string_value, 2 as int_value, cast(null as FLOAT64) as float_value) as value) ] as event_params UNION ALL SELECT 'potions' as event_name, [struct("experience_id" as key, struct("X23" as string_value, cast(null as INT64) as int_value, cast(null as FLOAT64) as float_value) as value), struct("action" as key, struct("click" as string_value, cast(null as INT64) as int_value, cast(null as FLOAT64) as float_value) as value), struct("sku" as key, struct("202ADFH5" as string_value, cast(null as INT64) as int_value, cast(null as FLOAT64) as float_value) as value), struct("price" as key, struct(cast(null as string) as string_value, cast(null as INT64) as int_value, 25 as float_value ) as value) ] as event_params

Why nested maps ?

Nested maps offer many benefits :
  • schema simplicity
    • Nested map enables to store two objects with various keys without creating as may columns as keys .
  • values indexation
Keys are unique by convention which provides an additional useful information compared to regular nested arrays
  • many utility
    • functions
      • isMap(map ARRAY<STRUCT<key STRING, value ANY TYPE>>)
        • returns true if map column contains a key property and unique values in it
      • get(map ARRAY<STRUCT<key STRING, value ANY TYPE>>, key)
    • procedures
      • toMap(table, [columnsToMap])
      • keys(table, map_column)
        • returns a list of keys observed in the map_columnn
      • pivot(table, map_column , aggregateFunction STRING, forKeysIn ARRAY<STRING> = [])
        • returns the table with one column per key in forKeysIn
      • toTree(table, map_column, [keysOrder])
        • returns a json listing the values of each keys in the given order It can be useful for instance to get all the options of a given parent product, given all its variants.
It can also be very powerful when linked to a common type of value like
STRUCT<string_value STRING, float_value FLOAT64>
It enables using various aggregate functions.
javascript
{event_timestamp:XXX, [...], event_name:'read_article', event_params:{ "author":"Bill Q", "title":"How to Build a Backpack", "number_of_pages":2} } {event_timestamp:XXX, [...], event_name:'potions', event_params:{ "experience_id":"X23", "action":"click", "sku":"202ADFH5", "price":25} }

When and how to use Map

One common problem is that arrays can represent many different concepts.
There are in fact two categories of arrays :
  • regular arrays where we expect possible duplicated lines
  • maps where we expect only one line per “key”
There is no easy way to distinguish the two in Bigquery, although it can be very useful to know that you are manipulating a map and not a regular array.
That is why at Potions we added the convention that if a nested array represents a map, one of the record property should be named key
We suggest the following decision tree …
graph TD simple -->|No| array{Array ?} simple{Simple type ?} -->|Yes| column[COLUMN WITH TYPE FLOAT STRING...] array -->|No| complexity{Hide complexity ?} complexity -->|No| columns[Multiple columns] complexity -->|Yes| record[NULLABLE RECORD] array -->|Yes| repeated[REPEATED RECORD]
graph TD repeated[REPEATED RECORD] --> ap{Map ?} ap -->|Yes| key[One property named 'key'] ap -->|No| column[Avoid using 'key' property] key --> sameschema{Same schema\n for every key?} sameschema --> |Yes| schema[Use the schema for the other keys] sameschema --> |No| simple{Exclusive simple types ?} simple -->|Yes| name[Name the properties valueString, valueFloat...] simple -->|No| common{Common properties ?} common -->|No| repeatedMap[name the property value and set it as key/Value]

Why repeated records

Very often it is a way for your schema to remain very flexible and avoid changing it when you add a new property for instance.

Why conventions ?

Because it unlocks amazing features !
For instance using repeated records with key, valueString, valueFloat properties enables :
  • creating and using many helper functions like
    • sql
      CREATE TEMP FUNCTION getValueStringByKey(properties ARRAY<STRUCT<key STRING, valueString STRING, valueFloat FLOAT64>>, key STRING) AS ( ( SELECT valueString FROM UNNEST(properties) WHERE key=key LIMIT 1 ) ); WITH events AS( SELECT 1 as id, [ struct("purchase_id" as key, "X24" as valueString, cast(null as float64) as valueFloat), struct("revenue" as key, cast(null as string) as valueString, 1.8 as valueFloat) ] as params ) SELECT *, getValueStringByKey(params, "purchase_id") AS purchase_id FROM events
  • checking automatically and asserting that keys are unique in nested maps