Database Schema Reference
This document describes the schema of the tables within the DuckDB database located at input_data/conversion_flow.db. The schema is based on the columns present in the source CSV files used by the data processing pipeline.
Database Overview
The database contains two primary tables that store the essential data for the conversion flow analysis:
ga4_events: Contains event data from Google Analytics 4backend_leads: Contains information about leads generated, including their final status
Table: ga4_events
Purpose: Stores individual events recorded by Google Analytics 4, linked to a user identifier. Used to determine which journey steps a user has taken.
Source CSV: input_data/ga4_selected_events_leads.csv
Column Name |
Data Type |
Description |
Notes |
|---|---|---|---|
|
|
Timestamp of the GA4 event. |
|
|
|
The unique identifier for the user/client. |
Used for grouping and joining. |
|
|
Identifier for the specific session. |
|
|
|
Google Analytics specific session identifier. |
|
|
|
The name of the GA4 event recorded (e.g., ‘session_start’, ‘download’, ‘car_configuration’, etc.). |
Used to create binary flags for model nodes. |
|
|
The URL of the page where the event occurred. |
|
|
|
Name of the vehicle model associated with the event (if applicable). |
|
|
|
ID of the vehicle model associated with the event (if applicable). |
|
|
|
Name of the form interacted with (if applicable). |
|
|
|
Step within a form interaction (if applicable). |
|
|
|
Flag indicating form conversion (if applicable). |
|
|
|
Name of a defined user flow (if applicable). |
|
|
|
Step within a defined user flow (if applicable). |
|
|
|
Flag indicating flow conversion (if applicable). |
|
|
|
Name of the marketing campaign associated with the session. |
|
|
|
Marketing medium (e.g., ‘cpc’, ‘organic’). |
|
|
|
Marketing source (e.g., ‘google’, ‘direct’). |
|
|
|
Hostname from which the event was triggered. |
|
|
|
Category of the device used (e.g., ‘desktop’, ‘mobile’). |
|
|
|
Default channel grouping assigned by GA4 to the session. |
Table: backend_leads
Purpose: Stores information about leads generated, including their final status (used to determine purchase outcome) and the associated user identifier.
Source CSV: input_data/backend_all_leads.csv
Column Name |
Data Type |
Description |
Notes |
|---|---|---|---|
|
|
The unique identifier for the user/client. |
Used to link leads back to GA4 events. |
|
|
An identifier, possibly related to the lead or system. |
|
|
|
Brand associated with the lead (e.g., ‘Toyota’). |
|
|
|
Name of the vehicle model associated with the lead. |
|
|
|
Code of the vehicle model associated with the lead. |
|
|
|
The status of the lead in the sales funnel. |
Used: Value |
|
|
Type or nature of the lead relationship. |
|
|
|
Primary classification of the lead type. |
|
|
|
Category assigned to the lead. |
|
|
|
Timestamp potentially related to the originating GA4 event. |
|
|
|
Hostname potentially related to the originating GA4 event. |
|
|
|
Device category potentially related to the originating GA4 event. |
|
|
|
Source potentially related to the originating GA4 session. |
|
|
|
Medium potentially related to the originating GA4 session. |
|
|
|
Channel group potentially related to the originating GA4 session. |
|
|
|
Timestamp indicating the start of the lead journey. |
|
|
|
Timestamp indicating when the lead was transferred. |
Schema Usage in the Model
The final DataFrame used as input for the Bayesian Network model primarily uses
USER_PSEUDO_ID,EVENT_NAME, andSALES_FUNNELto create binary flags for each model node.Columns marked Used are explicitly referenced in the data processing pipeline.
The exact data types (
VARCHAR,INTEGER,TIMESTAMP,BOOLEANetc.) are automatically inferred by DuckDB during the data loading process.
Data Processing Flow
Raw CSV files are loaded into the DuckDB database using the
load_to_duckdb.pyscript.The
load_data.pyscript then queries these tables to create a unified dataset for model training.Events from
ga4_eventsare transformed into binary flags indicating whether each user has performed specific actions.Purchase outcomes from
backend_leadsare merged based onUSER_PSEUDO_ID.The resulting dataset is used to train the Bayesian Network model.
Data Considerations
The
USER_PSEUDO_IDis the key linking field between the two tables.The
EVENT_NAMEvalues determine which nodes are created in the Bayesian Network.The
SALES_FUNNELvalue'ClosedSuccessfully'is used as a proxy for purchase completion.Time-based analysis can be performed using the timestamp columns, although the current model does not explicitly incorporate temporal information.
Extended Schema for Cross-Country Analysis
When adapting the model for different country websites (e.g., from Toyota GB to Toyota Germany), ensure that the data schema remains consistent while accounting for country-specific variations in event names and sales funnel states.