Project Architecture

Overview

This document describes the high-level architecture of the ConversionFlow library, which combines Bayesian networks with statistical and AI-powered analysis to model and interpret user behavior and conversion paths.

Key Components

1. Bayesian Network Model

Located in src/models/bayesian_network.py:

  • Implements the BayesianNetworkModel class, which dynamically builds a PyMC model based on the DAG structure defined in the configuration file (config.yml).

  • The current active model structure is refined_model_v3.

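  • Uses Half-Cauchy priors by default for the edge weights (betas). Because the Half-Cauchy distribution only has support on non-negative values, this default assumes that each parent has a positive effect on its child. The prior is configurable.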

  • Models the probability of each node occurring conditional on its parents using a logistic link function and Bernoulli likelihood.
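
As a minimal sketch of the conditional probability described above, the following plain-Python example computes P(node = 1 | parents) with a logistic link and scores a binary observation with a Bernoulli log-likelihood. It does not use PyMC, and the names node_probability, intercept, and the example weights are assumptions made for illustration, not identifiers from src/models/bayesian_network.py.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function: maps a real score to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def node_probability(intercept: float,
                     betas: Sequence[float],
                     parent_values: Sequence[int]) -> float:
    """P(node = 1 | parents) under a logistic link.

    `intercept` and `betas` are hypothetical parameter names. In the real
    model they would be posterior draws rather than fixed numbers.
    """
    if len(betas) != len(parent_values):
        raise ValueError("one beta is required per parent")
    linear = intercept + sum(b * v for b, v in zip(betas, parent_values))
    return sigmoid(linear)


def bernoulli_log_likelihood(p: float, observed: int) -> float:
    """Log-likelihood of one binary observation given probability p."""
    return math.log(p) if observed == 1 else math.log(1.0 - p)


# Example: a node with two parents and positive edge weights
p_none = node_probability(-2.0, [1.5, 0.8], [0, 0])  # no parent occurred
p_both = node_probability(-2.0, [1.5, 0.8], [1, 1])  # both parents occurred
```

With positive betas and binary parents, each additional active parent can only raise the node's probability, which is the assumption the Half-Cauchy default encodes.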

2. Inference Engine

Located in src/models/inference.py:

  • Configures Markov Chain Monte Carlo (MCMC) sampling using the NUTS sampler for efficient Bayesian inference.

  • Runs chains in parallel, with the number of chains set in config.yml (default: 4).

  • Generates posterior distributions for model parameters and posterior predictive samples.

  • Includes diagnostics and error handling:

    • R-hat convergence metric to assess chain convergence.

    • Effective sample size (ESS) for bulk and tail statistics to evaluate sampling efficiency.

    • Chain acceptance rates to monitor sampler performance.

    • Divergence detection to identify problematic samples.

    • Automatic retry mechanism on inference failures with exponential backoff.

    • Preprocessing of input data to handle potential issues such as infinite values and poorly scaled inputs.
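
The retry behavior listed above could look roughly like the following sketch. The helper name retry_with_backoff, its parameters, and the doubling schedule are assumptions; the actual retry limits and delays used in src/models/inference.py are not documented here.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(run: Callable[[], T],
                       max_attempts: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `run` until it succeeds, doubling the wait after each failure.

    Catching Exception is a simplification for this sketch. A real
    implementation would likely catch only sampling-related errors.
    """
    for attempt in range(max_attempts):
        try:
            return run()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # base, 2x base, 4x base, ...
    raise ValueError("max_attempts must be at least 1")


# Usage: a stand-in for a sampling call that fails twice, then succeeds.
delays: List[float] = []
calls = {"n": 0}


def flaky_sampling() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("sampling failed")
    return "trace"


# Passing delays.append as `sleep` records the waits instead of sleeping.
result = retry_with_backoff(flaky_sampling, max_attempts=4,
                            base_delay=0.5, sleep=delays.append)
```

Injecting the sleep function keeps the helper testable without real delays. In this run the recorded waits are 0.5 and 1.0 seconds before the third attempt succeeds.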

3. Data Processing Pipeline

Located in src/data/processor.py:

  • Implements a modular data processing workflow with configurable preprocessing steps.

  • Handles missing value imputation with configurable strategies.

  • Performs feature scaling with options for standardization, min-max scaling, or robust scaling.

  • Manages train-test splitting for model validation.

  • Implements efficient data transformations using pandas and NumPy operations.
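
To make the imputation and scaling options above concrete, here is a small standard-library sketch operating on a single column. The function names impute and scale and the strategy labels are assumptions for illustration; the real pipeline works on pandas and NumPy structures, and its option names may differ.

```python
import statistics
from typing import List, Optional, Sequence


def impute(values: Sequence[Optional[float]], strategy: str = "median") -> List[float]:
    """Replace missing entries (None) with the median or mean of observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mean":
        fill = statistics.fmean(observed)
    else:
        raise ValueError(f"unknown imputation strategy: {strategy}")
    return [fill if v is None else v for v in values]


def scale(values: Sequence[float], method: str = "standard") -> List[float]:
    """Scale one column by subtracting a center and dividing by a spread."""
    if method == "standard":      # zero mean, unit (population) std deviation
        center = statistics.fmean(values)
        spread = statistics.pstdev(values)
    elif method == "minmax":      # map the observed range onto [0, 1]
        center = min(values)
        spread = max(values) - min(values)
    elif method == "robust":      # median and interquartile range
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        center = statistics.median(values)
        spread = q3 - q1
    else:
        raise ValueError(f"unknown scaling method: {method}")
    if spread == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - center) / spread for v in values]


column = impute([1.0, None, 3.0, 4.0, 5.0])   # None is filled with the median
scaled = scale(column, method="minmax")
```

Robust scaling is the usual choice when a column contains outliers, because the median and interquartile range are much less sensitive to extreme values than the mean and standard deviation.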

4. Optimization Engine

Located in src/optimization/optimizer.py:

  • Implements a genetic algorithm in the GeneticOptimizer class for budget allocation optimization.

  • Uses parameter summaries from the Bayesian model to inform optimization decisions.

  • Includes fitness evaluation based on expected conversion value.

  • Implements constraint handling for budget limits and business rules.

  • Provides alternative optimization methods through SciPy’s optimization library.
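
The following is a minimal sketch of how a genetic algorithm can allocate a fixed budget across channels. It is not the GeneticOptimizer API: the function names, the log-shaped diminishing-returns fitness, and the effectiveness weights (standing in for parameter summaries from the Bayesian model) are all assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def fitness(allocation: Sequence[float], effectiveness: Sequence[float]) -> float:
    """Assumed expected conversion value with diminishing returns per channel."""
    return sum(e * math.log1p(a) for e, a in zip(effectiveness, allocation))


def normalize(allocation: Sequence[float], budget: float) -> List[float]:
    """Enforce the constraints: non-negative spends that sum to `budget`."""
    clipped = [max(a, 0.0) for a in allocation]
    total = sum(clipped)
    if total == 0:
        return [budget / len(clipped)] * len(clipped)
    return [a * budget / total for a in clipped]


def optimize_budget(effectiveness: Sequence[float], budget: float,
                    population_size: int = 30, generations: int = 60,
                    mutation_scale: float = 0.05, seed: int = 0) -> List[float]:
    if population_size < 4:
        raise ValueError("population_size must be at least 4")
    rng = random.Random(seed)
    n = len(effectiveness)

    def score(allocation: Sequence[float]) -> float:
        return fitness(allocation, effectiveness)

    # Initial population: an even split plus random feasible allocations.
    population = [normalize([1.0] * n, budget)]
    population += [normalize([rng.random() for _ in range(n)], budget)
                   for _ in range(population_size - 1)]

    for _ in range(generations):
        # Truncation selection: the best half survives unchanged (elitism).
        population.sort(key=score, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            # Arithmetic crossover followed by Gaussian mutation.
            child = [(x + y) / 2 + rng.gauss(0.0, mutation_scale * budget)
                     for x, y in zip(a, b)]
            children.append(normalize(child, budget))
        population = parents + children

    return max(population, key=score)


best = optimize_budget([3.0, 1.0, 0.5], budget=100.0)
```

Repairing each child with normalize keeps every candidate feasible, so the budget constraint never has to be expressed as a penalty in the fitness function. Business rules such as per-channel minimums would need an additional repair or penalty step that this sketch omits.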

5. Configuration Management

Located in src/config/config_manager.py:

  • Handles loading and validation of configuration from YAML files.

  • Provides a singleton configuration object accessible throughout the application.

  • Implements schema validation to ensure configuration correctness.

  • Supports environment-specific configuration overrides.
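
Environment-specific overrides and schema validation can be sketched on plain dictionaries, as they would appear after the YAML file has been parsed. The keys active_model, sampling, chains, and draws, and the shape of SCHEMA, are hypothetical; the real layout of config.yml and the validation approach in config_manager.py may differ.

```python
from typing import Any, Dict, List

# Hypothetical schema: each key maps to an expected type or a nested schema.
SCHEMA: Dict[str, Any] = {
    "active_model": str,
    "sampling": {"chains": int, "draws": int},
}


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def validate(config: Dict[str, Any], schema: Dict[str, Any], path: str = "") -> List[str]:
    """Return a list of problems. An empty list means the config is valid."""
    errors: List[str] = []
    for key, expected in schema.items():
        where = f"{path}{key}"
        if key not in config:
            errors.append(f"missing key: {where}")
        elif isinstance(expected, dict):
            if isinstance(config[key], dict):
                errors.extend(validate(config[key], expected, where + "."))
            else:
                errors.append(f"{where} must be a mapping")
        elif not isinstance(config[key], expected):
            errors.append(f"{where} must be {expected.__name__}")
    return errors


base = {"active_model": "refined_model_v3", "sampling": {"chains": 4, "draws": 1000}}
production_override = {"sampling": {"draws": 4000}}  # hypothetical environment override

config = deep_merge(base, production_override)
problems = validate(config, SCHEMA)
```

Collecting every problem in a list, rather than raising on the first one, lets a user fix all configuration mistakes in one pass.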

6. Visualization Module

Located in src/visualization/visualizer.py:

  • Generates DAG visualizations of the Bayesian Network structure.

  • Creates diagnostic plots for MCMC convergence and parameter distributions.

  • Produces optimization result visualizations for budget allocation.

  • Includes interactive plotting capabilities using Plotly.

  • Produces high-quality static figures using Matplotlib.

7. Logging and Error Handling

Located in src/utils/logger.py and src/utils/error_handler.py:

  • Implements a centralized logging system with configurable verbosity.

  • Provides structured error handling with appropriate error classes.

  • Includes error recovery mechanisms where possible.

  • Generates detailed error reports for debugging.
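
One way to combine a small error hierarchy with verbosity-controlled logging is sketched below using the standard logging module. The class names ConversionFlowError, ConfigError, and InferenceError, the get_logger helper, and the mapping from verbosity to log level are assumptions, not the contents of logger.py or error_handler.py.

```python
import logging


class ConversionFlowError(Exception):
    """Hypothetical base class, so callers can catch all library errors at once."""


class ConfigError(ConversionFlowError):
    """Raised for invalid or missing configuration."""


class InferenceError(ConversionFlowError):
    """Raised when sampling fails or diagnostics are unacceptable."""


def get_logger(name: str, verbosity: int = 0) -> logging.Logger:
    """Map a verbosity count (0, 1, 2 or more) to WARNING, INFO, DEBUG."""
    levels = [logging.WARNING, logging.INFO, logging.DEBUG]
    logger = logging.getLogger(name)
    logger.setLevel(levels[min(max(verbosity, 0), 2)])
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


log = get_logger("conversionflow.example", verbosity=1)
try:
    raise InferenceError("chains did not converge")
except ConversionFlowError as exc:
    # Structured handling: one except clause covers every library error type.
    log.warning("recoverable failure: %s", exc)
```

A shared base class keeps calling code short, while the specific subclasses still let a caller treat configuration and inference problems differently when needed.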

Application Flow

  1. Configuration Loading: The application begins by loading configuration parameters from YAML files.

  2. Data Processing: Raw data is loaded and preprocessed according to configuration settings.

  3. Model Building: A Bayesian Network model is constructed based on the DAG structure defined in the configuration.

  4. Model Inference: MCMC sampling is performed to estimate model parameters.

  5. Results Analysis: Diagnostic checks are performed, and parameter summaries are generated.

  6. Optimization: Budget allocation is optimized based on model parameters.

  7. Output Generation: Reports, visualizations, and data files are produced.

Directory Structure

conversionflow/
├── src/
│   ├── data/              # Data processing components
│   ├── models/            # Bayesian Network model and inference engine
│   ├── optimization/      # Optimization algorithms
│   ├── visualization/     # Plotting and visualization tools
│   ├── config/            # Configuration management
│   └── utils/             # Utility functions and helpers
├── tests/                 # Test suite
├── examples/              # Example scripts and notebooks
├── documentation/         # Documentation files
└── assets/                # Static assets and configuration files

Dependencies

The project relies on several key libraries:

  • PyMC: For Bayesian modeling and MCMC inference

  • ArviZ: For Bayesian diagnostics and visualization

  • NumPy/pandas: For data manipulation

  • Matplotlib/Plotly: For visualization

  • PyYAML: For configuration management

  • SciPy: For optimization algorithms (alternative to genetic algorithm)

Extension Points

The architecture is designed with several extension points:

  • Custom prior distributions in the Bayesian model

  • Additional preprocessing steps in the data pipeline

  • Alternative optimization algorithms

  • Custom visualization styles and formats

  • New node types in the Bayesian Network model