Project Architecture

Overview

This document describes the high-level architecture of the ConversionFlow library. The project combines Bayesian Networks with statistical and AI-powered analysis to model and interpret user behavior and conversion paths.

Key Components

1. Bayesian Network Model

Located in src/models/bayesian_network.py:

Implements the BayesianNetworkModel class, which dynamically builds a PyMC model based on the DAG structure defined in the configuration file (config.yml).
The current active model structure is refined_model_v3.
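Uses Half-Cauchy priors by default for edge weights (betas). Because the Half-Cauchy distribution has support only on non-negative values, this default assumes that each parent has a positive effect on its child. The prior is configurable.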
Models the probability of each node occurring conditional on its parents using a logistic link function and Bernoulli likelihood.
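
As a rough illustration of the logistic link and Bernoulli likelihood described above, the conditional probability of a node given its parents can be sketched in plain Python, without PyMC. The DAG, node names, weights, and intercept below are hypothetical and are not taken from config.yml. The weights are fixed here only for illustration; in the model they would be inferred from data.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def node_probability(parent_values, betas, intercept=0.0):
    """P(node = 1 | parents) under a logistic link.

    The intercept term is an assumption of this sketch.
    """
    linear = intercept + sum(b * v for b, v in zip(betas, parent_values))
    return sigmoid(linear)


def bernoulli_log_likelihood(observed: int, p: float) -> float:
    """Log-likelihood of one binary observation with success probability p."""
    eps = 1e-12  # keep log() away from 0 when p saturates
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p) if observed == 1 else math.log(1.0 - p)


# Hypothetical DAG: each node maps to the list of its parents.
dag = {
    "ad_click": [],
    "site_visit": ["ad_click"],
    "conversion": ["ad_click", "site_visit"],
}
# Hypothetical positive edge weights, one per parent of "conversion".
betas = {"conversion": [0.8, 1.5]}

# Probability of conversion with both parents active vs. neither active.
p_both = node_probability([1, 1], betas["conversion"], intercept=-2.0)
p_none = node_probability([0, 0], betas["conversion"], intercept=-2.0)
```

With positive weights, activating a parent can only raise the child's probability, which is the assumption encoded by the default Half-Cauchy prior.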

2. Inference Engine

Located in src/models/inference.py:

Configures Markov Chain Monte Carlo (MCMC) sampling using the NUTS sampler for efficient Bayesian inference.
Implements parallel chain execution as configured in config.yml (default is 4 chains).
Generates posterior distributions for model parameters and posterior predictive samples.
Includes diagnostics and error handling:
- R-hat convergence metric to assess chain convergence.
- Effective sample size (ESS) for bulk and tail statistics to evaluate sampling efficiency.
- Chain acceptance rates to monitor sampler performance.
- Divergence detection to identify problematic samples.
- Automatic retry mechanism on inference failures with exponential backoff.
- Preprocessing of input data to handle potential issues such as infinite values and poorly scaled features.
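
The retry behavior in the list above could look roughly like the following. This is a minimal sketch, not the code in src/models/inference.py: the function name, parameters, and default delays are assumptions, and the sleep function is injectable only so the example can run without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or max_attempts is exhausted.

    After the k-th failure (k = 0, 1, ...) wait base_delay * factor**k seconds.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * factor ** attempt)
    raise AssertionError("unreachable")


# Hypothetical sampler that fails twice before succeeding.
calls = {"n": 0}
delays = []


def flaky_sampler():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("sampling failed")
    return "trace"


# Record the delays instead of sleeping.
result = retry_with_backoff(
    flaky_sampler, max_attempts=4, base_delay=0.5, sleep=delays.append
)
```

The delay doubles after each failure, so repeated failures back off quickly instead of retrying at a fixed rate.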

3. Data Processing Pipeline

Located in src/data/processor.py:

Implements a modular data processing workflow with configurable preprocessing steps.
Handles missing value imputation with configurable strategies.
Performs feature scaling with options for standardization, min-max scaling, or robust scaling.
Manages train-test splitting for model validation.
Implements efficient data transformations using pandas and numpy operations.
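
The three scaling options can be illustrated with a small sketch. It uses only the standard library so that it is self-contained, whereas the pipeline itself is described as using pandas and numpy. The strategy keys ("standard", "minmax", "robust") and the function names are hypothetical, not the actual configuration values.

```python
import statistics


def standardize(xs):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    if sd == 0:
        return [0.0 for _ in xs]
    return [(x - mean) / sd for x in xs]


def min_max_scale(xs):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    """Center on the median and scale by the interquartile range."""
    median = statistics.median(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in xs]
    return [(x - median) / iqr for x in xs]


# Hypothetical strategy names, as they might appear in a configuration file.
SCALERS = {
    "standard": standardize,
    "minmax": min_max_scale,
    "robust": robust_scale,
}


def scale(xs, strategy="standard"):
    """Dispatch to the scaler selected by name."""
    try:
        scaler = SCALERS[strategy]
    except KeyError:
        raise ValueError(f"unknown scaling strategy: {strategy!r}") from None
    return scaler(xs)


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # one large outlier
robust = scale(values, "robust")
minmax = scale(values, "minmax")
```

In this example the outlier compresses the other four min-max values into a narrow band near 0, while robust scaling keeps them evenly spread around the median. That difference is the usual reason to offer robust scaling as an option.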

4. Optimization Engine

Located in src/optimization/optimizer.py:

Implements a genetic algorithm in the GeneticOptimizer class for budget allocation optimization.
Uses parameter summaries from the Bayesian model to inform optimization decisions.
Includes fitness evaluation based on expected conversion value.
Implements constraint handling for budget limitations and business rules.
Provides alternative optimization methods through SciPy’s optimization module (scipy.optimize).
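
To make the genetic-algorithm approach concrete, here is a minimal sketch of budget allocation under a total-budget constraint. It does not reproduce the GeneticOptimizer class. The fitness function (a diminishing-returns response with per-channel weights standing in for posterior parameter summaries), the operators, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def normalize(alloc, budget):
    """Enforce the constraints: non-negative spends that sum to the budget."""
    clipped = [max(a, 0.0) for a in alloc]
    total = sum(clipped)
    if total == 0:
        return [budget / len(alloc)] * len(alloc)
    return [budget * a / total for a in clipped]


def fitness(alloc, weights):
    """Hypothetical expected conversion value with diminishing returns."""
    return sum(w * math.log1p(a) for w, a in zip(weights, alloc))


def optimize_budget(weights, budget, pop_size=40, generations=60,
                    mutation_scale=0.1, seed=0):
    """Elitist genetic algorithm: keep the best half, breed the rest."""
    rng = random.Random(seed)  # seeded for reproducible runs
    n = len(weights)
    population = [
        normalize([rng.random() for _ in range(n)], budget)
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=lambda a: fitness(a, weights), reverse=True)
        elite = population[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            # Uniform crossover, then Gaussian mutation on every gene.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            child = [g + rng.gauss(0.0, mutation_scale * budget) for g in child]
            # Repair the child so it satisfies the budget constraint again.
            children.append(normalize(child, budget))
        population = elite + children
    return max(population, key=lambda a: fitness(a, weights))


weights = [1.0, 2.0, 3.0]  # hypothetical per-channel effect sizes
best = optimize_budget(weights, budget=100.0)
```

Repairing each child with normalize keeps every candidate feasible, which is simpler than penalizing constraint violations in the fitness function. Business rules such as per-channel minimums would need additional repair or penalty logic.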

5. Configuration Management

Located in src/config/config_manager.py:

Handles loading and validation of configuration from YAML files.
Provides a singleton configuration object accessible throughout the application.
Implements schema validation to ensure configuration correctness.
Supports environment-specific configuration overrides.
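
Environment-specific overrides and schema validation might work along the following lines. This is a sketch under assumptions: plain dictionaries stand in for parsed YAML, and the key names (other than the 4-chain default and refined_model_v3 mentioned in this document), the schema format, and the function names are hypothetical.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which override values replace base values,
    recursing into nested dicts so unrelated keys are preserved."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical schema: dotted key path -> expected type.
SCHEMA = {
    "model.structure": str,
    "sampling.chains": int,
    "sampling.draws": int,
}


def validate(config: dict, schema: dict = SCHEMA) -> list:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for path, expected in schema.items():
        node = config
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                errors.append(f"missing key: {path}")
                break
            node = node[part]
        else:  # only runs when the full path was found
            if not isinstance(node, expected):
                errors.append(
                    f"{path}: expected {expected.__name__}, "
                    f"got {type(node).__name__}"
                )
    return errors


# Base configuration plus a hypothetical development override.
base = {
    "model": {"structure": "refined_model_v3"},
    "sampling": {"chains": 4, "draws": 1000},
}
dev_override = {"sampling": {"draws": 200}}

config = deep_merge(base, dev_override)
problems = validate(config)
```

Merging recursively lets an override file list only the values that differ per environment, and validating after the merge checks the configuration the application will actually use.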

6. Visualization Module

Located in src/visualization/visualizer.py:

Generates DAG visualizations of the Bayesian Network structure.
Creates diagnostic plots for MCMC convergence and parameter distributions.
Produces optimization result visualizations for budget allocation.
Includes interactive plotting capabilities using Plotly.
Implements static high-quality visualization using Matplotlib.

7. Logging and Error Handling

Located in src/utils/logger.py and src/utils/error_handler.py:

Implements a centralized logging system with configurable verbosity.
Provides structured error handling with appropriate error classes.
Includes error recovery mechanisms where possible.
Generates detailed error reports for debugging.

Application Flow

Configuration Loading: The application begins by loading configuration parameters from YAML files.
Data Processing: Raw data is loaded and preprocessed according to configuration settings.
Model Building: A Bayesian Network model is constructed based on the DAG structure defined in the configuration.
Model Inference: MCMC sampling is performed to estimate model parameters.
Results Analysis: Diagnostic checks are performed, and parameter summaries are generated.
Optimization: Budget allocation is optimized based on model parameters.
Output Generation: Reports, visualizations, and data files are produced.

Directory Structure

conversionflow/
├── src/
│   ├── data/              # Data processing components
│   ├── models/            # Bayesian Network model and inference engine
│   ├── optimization/      # Optimization algorithms
│   ├── visualization/     # Plotting and visualization tools
│   ├── config/            # Configuration management
│   └── utils/             # Utility functions and helpers
├── tests/                 # Test suite
├── examples/              # Example scripts and notebooks
├── documentation/         # Documentation files
└── assets/                # Static assets and configuration files

Dependencies

The project relies on several key libraries:

PyMC: For Bayesian modeling and MCMC inference
ArviZ: For Bayesian diagnostics and visualization
NumPy/Pandas: For data manipulation
Matplotlib/Plotly: For visualization
PyYAML: For configuration management
SciPy: For optimization algorithms (alternative to the genetic algorithm)

Extension Points

The architecture is designed with several extension points:

Custom prior distributions in the Bayesian model
Additional preprocessing steps in the data pipeline
Alternative optimization algorithms
Custom visualization styles and formats
New node types in the Bayesian Network model