# Project Architecture

## Overview

This document describes the high-level architecture of the ConversionFlow library. The project combines Bayesian Networks with statistical and AI-powered analysis to model and interpret user behaviour and conversion paths.

## Key Components

### 1. Bayesian Network Model

Located in `src/models/bayesian_network.py`:

- Implements the `BayesianNetworkModel` class, which dynamically builds a PyMC model based on the DAG structure defined in the configuration file (`config.yml`).
- The current active model structure is `refined_model_v3`.
- Uses Half-Cauchy priors by default for edge weights (betas), assuming positive relationships, but this is configurable.
- Models the probability of each node occurring conditional on its parents using a logistic link function and Bernoulli likelihood.

### 2. Inference Engine

Located in `src/models/inference.py`:

- Configures Markov Chain Monte Carlo (MCMC) sampling using the NUTS sampler for efficient Bayesian inference.
- Implements parallel chain execution as configured in `config.yml` (default is 4 chains).
- Generates posterior distributions for model parameters and posterior predictive samples.
- Includes diagnostics and error handling:
  * R-hat convergence metric to assess chain convergence.
  * Effective sample size (ESS) for bulk and tail statistics to evaluate sampling efficiency.
  * Chain acceptance rates to monitor sampler performance.
  * Divergence detection to identify problematic samples.
  * Automatic retry mechanism on inference failures with exponential backoff.
  * Preprocessing of input data to handle potential issues like infinite values and scaling.

### 3. Data Processing Pipeline

Located in `src/data/processor.py`:

- Implements a modular data processing workflow with configurable preprocessing steps.
- Handles missing value imputation with configurable strategies.
- Performs feature scaling with options for standardization, min-max scaling, or robust scaling.
- Manages train-test splitting for model validation.
- Implements efficient data transformations using pandas and numpy operations.

### 4. Optimization Engine

Located in `src/optimization/optimizer.py`:

- Implements a genetic algorithm in the `GeneticOptimizer` class for budget allocation optimization.
- Uses parameter summaries from the Bayesian model to inform optimization decisions.
- Includes fitness evaluation based on expected conversion value.
- Implements constraints handling for budget limitations and business rules.
- Provides alternative optimization methods through SciPy's optimization library.

### 5. Configuration Management

Located in `src/config/config_manager.py`:

- Handles loading and validation of configuration from YAML files.
- Provides a singleton configuration object accessible throughout the application.
- Implements schema validation to ensure configuration correctness.
- Supports environment-specific configuration overrides.

### 6. Visualization Module

Located in `src/visualization/visualizer.py`:

- Generates DAG visualizations of the Bayesian Network structure.
- Creates diagnostic plots for MCMC convergence and parameter distributions.
- Produces optimization result visualizations for budget allocation.
- Includes interactive plotting capabilities using Plotly.
- Implements static high-quality visualization using Matplotlib.

### 7. Logging and Error Handling

Located in `src/utils/logger.py` and `src/utils/error_handler.py`:

- Implements a centralized logging system with configurable verbosity.
- Provides structured error handling with appropriate error classes.
- Includes error recovery mechanisms where possible.
- Generates detailed error reports for debugging.

## Application Flow

1. **Configuration Loading**: The application begins by loading configuration parameters from YAML files.
2. **Data Processing**: Raw data is loaded and preprocessed according to configuration settings.
3. **Model Building**: A Bayesian Network model is constructed based on the DAG structure defined in the configuration.
4. **Model Inference**: MCMC sampling is performed to estimate model parameters.
5. **Results Analysis**: Diagnostic checks are performed, and parameter summaries are generated.
6. **Optimization**: Budget allocation is optimized based on model parameters.
7. **Output Generation**: Reports, visualizations, and data files are produced.

## Directory Structure

```
conversionflow/
├── src/
│   ├── data/           # Data processing components
│   ├── models/         # Bayesian Network implementation
│   ├── optimization/   # Optimization algorithms
│   ├── visualization/  # Plotting and visualization tools
│   ├── config/         # Configuration management
│   └── utils/          # Utility functions and helpers
├── tests/              # Test suite
├── examples/           # Example scripts and notebooks
├── documentation/      # Documentation files
└── assets/             # Static assets and configuration files
```

## Dependencies

The project relies on several key libraries:

- PyMC: For Bayesian modeling and MCMC inference
- ArviZ: For Bayesian diagnostics and visualization
- NumPy/Pandas: For data manipulation
- Matplotlib/Plotly: For visualization
- PyYAML: For configuration management
- SciPy: For optimization algorithms (alternative to genetic algorithm)

## Extension Points

The architecture is designed with several extension points:

- Custom prior distributions in the Bayesian model
- Additional preprocessing steps in the data pipeline
- Alternative optimization algorithms
- Custom visualization styles and formats
- New node types in the Bayesian Network model
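
To make the node model from the Bayesian Network Model component more concrete, the following is a minimal standard-library sketch of a logistic link with a Bernoulli likelihood over a DAG. It is not the library's implementation, which builds a PyMC model and samples the parameters. The node names, the `DAG` mapping, the `INTERCEPTS` and `BETAS` values, and the presence of a per-node intercept are all assumptions made for illustration.

```python
import math
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical DAG: each node maps to the list of its parents.
DAG = {
    "ad_click": [],
    "site_visit": ["ad_click"],
    "conversion": ["ad_click", "site_visit"],
}

# Hypothetical fixed parameters, standing in for posterior summaries.
# Positive betas mirror the "positive relationships" assumption.
INTERCEPTS = {"ad_click": -1.0, "site_visit": -0.5, "conversion": -2.0}
BETAS = {
    ("ad_click", "site_visit"): 1.2,
    ("ad_click", "conversion"): 0.4,
    ("site_visit", "conversion"): 1.5,
}


def sigmoid(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def node_probability(node, state):
    """P(node = 1 | parents), where state holds 0/1 values for the parents."""
    eta = INTERCEPTS[node] + sum(
        BETAS[(parent, node)] * state[parent] for parent in DAG[node]
    )
    return sigmoid(eta)


def log_likelihood(state):
    """Bernoulli log-likelihood of one fully observed 0/1 record."""
    total = 0.0
    # Parents-first order is not needed for the sum itself; it mirrors how
    # a model builder would declare parents before their children.
    for node in TopologicalSorter(DAG).static_order():
        p = node_probability(node, state)
        total += math.log(p) if state[node] == 1 else math.log(1.0 - p)
    return total
```

Calling `log_likelihood({"ad_click": 1, "site_visit": 1, "conversion": 0})` scores a single record. In a Bayesian treatment, the intercepts and betas would be random variables with priors rather than fixed numbers.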