# Database Schema Reference This document describes the schema of the tables within the DuckDB database located at `input_data/conversion_flow.db`. The schema is based on the columns present in the source CSV files used by the data processing pipeline. ## Database Overview The database contains two primary tables that store the essential data for the conversion flow analysis: 1. `ga4_events`: Contains event data from Google Analytics 4 2. `backend_leads`: Contains information about leads generated, including their final status ## Table: `ga4_events` **Purpose:** Stores individual events recorded by Google Analytics 4, linked to a user identifier. Used to determine which journey steps a user has taken. **Source CSV:** `input_data/ga4_selected_events_leads.csv` | Column Name | Data Type | Description | Notes | |-------------|-----------|-------------|-------| | `EVENT_TIMESTAMP` | `TIMESTAMP` / `TEXT` | Timestamp of the GA4 event. | | | `USER_PSEUDO_ID` | `VARCHAR` / `TEXT` | The unique identifier for the user/client. | **Used** for grouping and joining. | | `SESSION_ID` | `VARCHAR` / `TEXT` | Identifier for the specific session. | | | `GA_SESSION_ID` | `VARCHAR` / `TEXT` | Google Analytics specific session identifier. | | | `EVENT_NAME` | `VARCHAR` / `TEXT` | The name of the GA4 event recorded (e.g., 'session_start', 'download', 'car_configuration', etc.). | **Used** to create binary flags for model nodes. | | `PAGE_LOCATION` | `VARCHAR` / `TEXT` | The URL of the page where the event occurred. | | | `MODEL_NAME` | `VARCHAR` / `TEXT` | Name of the vehicle model associated with the event (if applicable). | | | `MODEL_ID` | `VARCHAR` / `TEXT` | ID of the vehicle model associated with the event (if applicable). | | | `FORM_NAME` | `VARCHAR` / `TEXT` | Name of the form interacted with (if applicable). | | | `FORM_STEP` | `VARCHAR` / `TEXT` | Step within a form interaction (if applicable). | | | `FORM_CONVERSION` | `BOOLEAN` / `INTEGER` | Flag indicating form conversion (if applicable). | | | `FLOW_NAME` | `VARCHAR` / `TEXT` | Name of a defined user flow (if applicable). | | | `FLOW_STEP` | `VARCHAR` / `TEXT` | Step within a defined user flow (if applicable). | | | `FLOW_CONVERSION` | `BOOLEAN` / `INTEGER` | Flag indicating flow conversion (if applicable). | | | `CAMPAIGN_NAME` | `VARCHAR` / `TEXT` | Name of the marketing campaign associated with the session. | | | `MEDIUM` | `VARCHAR` / `TEXT` | Marketing medium (e.g., 'cpc', 'organic'). | | | `SOURCE` | `VARCHAR` / `TEXT` | Marketing source (e.g., 'google', 'direct'). | | | `HOSTNAME` | `VARCHAR` / `TEXT` | Hostname from which the event was triggered. | | | `DEVICE_CATEGORY` | `VARCHAR` / `TEXT` | Category of the device used (e.g., 'desktop', 'mobile'). | | | `SESSION_CHANNEL_GROUP` | `VARCHAR` / `TEXT` | Default channel grouping assigned by GA4 to the session. | | ## Table: `backend_leads` **Purpose:** Stores information about leads generated, including their final status (used to determine purchase outcome) and the associated user identifier. **Source CSV:** `input_data/backend_all_leads.csv` | Column Name | Data Type | Description | Notes | |-------------|-----------|-------------|-------| | `USER_PSEUDO_ID` | `VARCHAR` / `TEXT` | The unique identifier for the user/client. | **Used** to link leads back to GA4 events. | | `SID` | `VARCHAR` / `TEXT` | An identifier, possibly related to the lead or system. | | | `BRAND` | `VARCHAR` / `TEXT` | Brand associated with the lead (e.g., 'Toyota'). | | | `MODELNAME` | `VARCHAR` / `TEXT` | Name of the vehicle model associated with the lead. | | | `MODELCODE` | `VARCHAR` / `TEXT` | Code of the vehicle model associated with the lead. | | | `SALES_FUNNEL` | `VARCHAR` / `TEXT` | The status of the lead in the sales funnel. | **Used**: Value `'ClosedSuccessfully'` determines purchase outcome. | | `LEADRELATIONSHIP` | `VARCHAR` / `TEXT` | Type or nature of the lead relationship. | | | `MAINLEADTYPE` | `VARCHAR` / `TEXT` | Primary classification of the lead type. | | | `LEADCATEGORY` | `VARCHAR` / `TEXT` | Category assigned to the lead. | | | `GA4_EVENT_TIMESTAMP` | `TIMESTAMP` / `TEXT` | Timestamp potentially related to the originating GA4 event. | | | `GA4_HOSTNAME` | `VARCHAR` / `TEXT` | Hostname potentially related to the originating GA4 event. | | | `GA4_DEVICE_CATEGORY` | `VARCHAR` / `TEXT` | Device category potentially related to the originating GA4 event. | | | `GA4_SOURCE` | `VARCHAR` / `TEXT` | Source potentially related to the originating GA4 session. | | | `GA4_MEDIUM` | `VARCHAR` / `TEXT` | Medium potentially related to the originating GA4 session. | | | `GA4_SESSION_CHANNEL_GROUP` | `VARCHAR` / `TEXT` | Channel group potentially related to the originating GA4 session. | | | `LEAD_JOURNEY_START` | `TIMESTAMP` / `TEXT` | Timestamp indicating the start of the lead journey. | | | `LEAD_TRANSFERRED` | `TIMESTAMP` / `TEXT` | Timestamp indicating when the lead was transferred. | | ## Schema Usage in the Model - The final DataFrame used as input for the Bayesian Network model primarily uses `USER_PSEUDO_ID`, `EVENT_NAME`, and `SALES_FUNNEL` to create binary flags for each model node. - Columns marked **Used** are explicitly referenced in the data processing pipeline. - The exact data types (`VARCHAR`, `INTEGER`, `TIMESTAMP`, `BOOLEAN` etc.) are automatically inferred by DuckDB during the data loading process. ## Data Processing Flow 1. Raw CSV files are loaded into the DuckDB database using the `load_to_duckdb.py` script. 2. The `load_data.py` script then queries these tables to create a unified dataset for model training. 3. Events from `ga4_events` are transformed into binary flags indicating whether each user has performed specific actions. 4. Purchase outcomes from `backend_leads` are merged based on `USER_PSEUDO_ID`. 5. The resulting dataset is used to train the Bayesian Network model. ## Data Considerations - The `USER_PSEUDO_ID` is the key linking field between the two tables. - The `EVENT_NAME` values determine which nodes are created in the Bayesian Network. - The `SALES_FUNNEL` value `'ClosedSuccessfully'` is used as a proxy for purchase completion. - Time-based analysis can be performed using the timestamp columns, although the current model does not explicitly incorporate temporal information. ## Extended Schema for Cross-Country Analysis When adapting the model for different country websites (e.g., from Toyota GB to Toyota Germany), ensure that the data schema remains consistent while accounting for country-specific variations in event names and sales funnel states.