Database Schema Reference

This document describes the schema of the tables within the DuckDB database located at input_data/conversion_flow.db. The schema is based on the columns present in the source CSV files used by the data processing pipeline.

Database Overview

The database contains two primary tables that store the essential data for the conversion flow analysis:

  1. ga4_events: Contains event data from Google Analytics 4

  2. backend_leads: Contains information about leads generated, including their final status

Table: ga4_events

Purpose: Stores individual events recorded by Google Analytics 4, linked to a user identifier. Used to determine which journey steps a user has taken.

Source CSV: input_data/ga4_selected_events_leads.csv

Column Name

Data Type

Description

Notes

EVENT_TIMESTAMP

TIMESTAMP / TEXT

Timestamp of the GA4 event.

USER_PSEUDO_ID

VARCHAR / TEXT

The unique identifier for the user/client.

Used for grouping and joining.

SESSION_ID

VARCHAR / TEXT

Identifier for the specific session.

GA_SESSION_ID

VARCHAR / TEXT

Google Analytics specific session identifier.

EVENT_NAME

VARCHAR / TEXT

The name of the GA4 event recorded (e.g., ‘session_start’, ‘download’, ‘car_configuration’, etc.).

Used to create binary flags for model nodes.

PAGE_LOCATION

VARCHAR / TEXT

The URL of the page where the event occurred.

MODEL_NAME

VARCHAR / TEXT

Name of the vehicle model associated with the event (if applicable).

MODEL_ID

VARCHAR / TEXT

ID of the vehicle model associated with the event (if applicable).

FORM_NAME

VARCHAR / TEXT

Name of the form interacted with (if applicable).

FORM_STEP

VARCHAR / TEXT

Step within a form interaction (if applicable).

FORM_CONVERSION

BOOLEAN / INTEGER

Flag indicating form conversion (if applicable).

FLOW_NAME

VARCHAR / TEXT

Name of a defined user flow (if applicable).

FLOW_STEP

VARCHAR / TEXT

Step within a defined user flow (if applicable).

FLOW_CONVERSION

BOOLEAN / INTEGER

Flag indicating flow conversion (if applicable).

CAMPAIGN_NAME

VARCHAR / TEXT

Name of the marketing campaign associated with the session.

MEDIUM

VARCHAR / TEXT

Marketing medium (e.g., ‘cpc’, ‘organic’).

SOURCE

VARCHAR / TEXT

Marketing source (e.g., ‘google’, ‘direct’).

HOSTNAME

VARCHAR / TEXT

Hostname from which the event was triggered.

DEVICE_CATEGORY

VARCHAR / TEXT

Category of the device used (e.g., ‘desktop’, ‘mobile’).

SESSION_CHANNEL_GROUP

VARCHAR / TEXT

Default channel grouping assigned by GA4 to the session.

Table: backend_leads

Purpose: Stores information about leads generated, including their final status (used to determine purchase outcome) and the associated user identifier.

Source CSV: input_data/backend_all_leads.csv

Column Name

Data Type

Description

Notes

USER_PSEUDO_ID

VARCHAR / TEXT

The unique identifier for the user/client.

Used to link leads back to GA4 events.

SID

VARCHAR / TEXT

An identifier, possibly related to the lead or system.

BRAND

VARCHAR / TEXT

Brand associated with the lead (e.g., ‘Toyota’).

MODELNAME

VARCHAR / TEXT

Name of the vehicle model associated with the lead.

MODELCODE

VARCHAR / TEXT

Code of the vehicle model associated with the lead.

SALES_FUNNEL

VARCHAR / TEXT

The status of the lead in the sales funnel.

Used: Value 'ClosedSuccessfully' determines purchase outcome.

LEADRELATIONSHIP

VARCHAR / TEXT

Type or nature of the lead relationship.

MAINLEADTYPE

VARCHAR / TEXT

Primary classification of the lead type.

LEADCATEGORY

VARCHAR / TEXT

Category assigned to the lead.

GA4_EVENT_TIMESTAMP

TIMESTAMP / TEXT

Timestamp potentially related to the originating GA4 event.

GA4_HOSTNAME

VARCHAR / TEXT

Hostname potentially related to the originating GA4 event.

GA4_DEVICE_CATEGORY

VARCHAR / TEXT

Device category potentially related to the originating GA4 event.

GA4_SOURCE

VARCHAR / TEXT

Source potentially related to the originating GA4 session.

GA4_MEDIUM

VARCHAR / TEXT

Medium potentially related to the originating GA4 session.

GA4_SESSION_CHANNEL_GROUP

VARCHAR / TEXT

Channel group potentially related to the originating GA4 session.

LEAD_JOURNEY_START

TIMESTAMP / TEXT

Timestamp indicating the start of the lead journey.

LEAD_TRANSFERRED

TIMESTAMP / TEXT

Timestamp indicating when the lead was transferred.

Schema Usage in the Model

  • The final DataFrame used as input for the Bayesian Network model primarily uses USER_PSEUDO_ID, EVENT_NAME, and SALES_FUNNEL to create binary flags for each model node.

  • Columns marked Used are explicitly referenced in the data processing pipeline.

  • The exact data types (VARCHAR, INTEGER, TIMESTAMP, BOOLEAN etc.) are automatically inferred by DuckDB during the data loading process.

Data Processing Flow

  1. Raw CSV files are loaded into the DuckDB database using the load_to_duckdb.py script.

  2. The load_data.py script then queries these tables to create a unified dataset for model training.

  3. Events from ga4_events are transformed into binary flags indicating whether each user has performed specific actions.

  4. Purchase outcomes from backend_leads are merged based on USER_PSEUDO_ID.

  5. The resulting dataset is used to train the Bayesian Network model.

Data Considerations

  • The USER_PSEUDO_ID is the key linking field between the two tables.

  • The EVENT_NAME values determine which nodes are created in the Bayesian Network.

  • The SALES_FUNNEL value 'ClosedSuccessfully' is used as a proxy for purchase completion.

  • Time-based analysis can be performed using the timestamp columns, although the current model does not explicitly incorporate temporal information.

Extended Schema for Cross-Country Analysis

When adapting the model for different country websites (e.g., from Toyota GB to Toyota Germany), ensure that the data schema remains consistent while accounting for country-specific variations in event names and sales funnel states.