# Project Architecture

## Overview

This document describes the high-level architecture of the ConversionFlow library. The project combines Bayesian Networks with statistical and AI-powered analysis to model and interpret user behaviour and conversion paths.

## Key Components

### 1. Bayesian Network Model

Located in `src/models/bayesian_network.py`:
- Implements the `BayesianNetworkModel` class, which dynamically builds a PyMC model based on the DAG structure defined in the configuration file (`config.yml`).
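- The currently active model structure is `refined_model_v3`.
- Uses Half-Cauchy priors by default for edge weights (betas). The Half-Cauchy distribution has support only on non-negative values, so this default assumes each parent has a non-negative effect on its child; the prior is configurable.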
- Models the probability of each node occurring conditional on its parents using a logistic link function and Bernoulli likelihood.
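The logistic link and Bernoulli likelihood described above can be sketched in plain Python. This is a minimal illustration, not the `BayesianNetworkModel` code: the node names, the intercept term, and the numeric values are assumptions made for the example, and the real model estimates the betas with PyMC rather than fixing them.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def node_probability(intercept: float, betas: dict, parent_states: dict) -> float:
    """P(node = 1 | parents): sigmoid of intercept plus weighted parent states."""
    linear = intercept + sum(beta * parent_states[name] for name, beta in betas.items())
    return sigmoid(linear)


def bernoulli_log_likelihood(observed: int, p: float) -> float:
    """Log-probability of a single 0/1 observation given P(node = 1) = p."""
    return math.log(p) if observed == 1 else math.log(1.0 - p)


# Hypothetical node "purchase" with two parents and non-negative edge weights.
betas = {"ad_click": 1.2, "cart_add": 2.0}
p_none = node_probability(-3.0, betas, {"ad_click": 0, "cart_add": 0})
p_both = node_probability(-3.0, betas, {"ad_click": 1, "cart_add": 1})
```

With non-negative betas, switching a parent on can only raise the child's probability, which is the assumption the default prior encodes.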

### 2. Inference Engine

Located in `src/models/inference.py`:
- Configures Markov Chain Monte Carlo (MCMC) sampling using the NUTS sampler for efficient Bayesian inference.
- Runs chains in parallel, with the number of chains set in `config.yml` (default: 4).
- Generates posterior distributions for model parameters and posterior predictive samples.
- Includes diagnostics and error handling:
  * R-hat convergence metric to assess chain convergence.
  * Effective sample size (ESS) for bulk and tail statistics to evaluate sampling efficiency.
  * Chain acceptance rates to monitor sampler performance.
  * Divergence detection to identify problematic samples.
  * Automatic retry mechanism on inference failures with exponential backoff.
  * Preprocessing of input data to handle infinite values and rescale features before sampling.
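The retry-with-exponential-backoff behaviour listed above might look roughly like the sketch below. The function name, its parameters, and the choice of `RuntimeError` as the retryable exception are all assumptions for illustration; the actual logic in `src/models/inference.py` may differ.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RuntimeError with delays of base_delay * 2**attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Usage: a stand-in for a sampling call that fails twice, then succeeds.
attempts = {"count": 0}
delays = []  # records requested delays instead of actually sleeping


def flaky_sampling() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("sampling failed")
    return "ok"


result = retry_with_backoff(flaky_sampling, sleep=delays.append)
```

Passing the sleep function as a parameter keeps the helper testable without real waiting.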

### 3. Data Processing Pipeline

Located in `src/data/processor.py`:
- Implements a modular data processing workflow with configurable preprocessing steps.
- Handles missing value imputation with configurable strategies.
- Performs feature scaling with options for standardization, min-max scaling, or robust scaling.
- Manages train-test splitting for model validation.
- Implements efficient data transformations using pandas and NumPy operations.
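As a rough illustration of the imputation and scaling steps, the sketch below uses only the standard library so that it runs anywhere; the pipeline itself works on pandas and NumPy structures. The function names and the choice of median imputation are assumptions, since the imputation strategy is configurable.

```python
import math
import statistics


def _is_valid(value) -> bool:
    """True for real, finite numbers; False for None, NaN, and infinities."""
    return value is not None and math.isfinite(value)


def impute_median(values: list) -> list:
    """Replace missing or non-finite entries with the median of the valid ones."""
    valid = [v for v in values if _is_valid(v)]
    if not valid:
        raise ValueError("no valid values to impute from")
    fill = statistics.median(valid)
    return [v if _is_valid(v) else fill for v in values]


def standardize(values: list) -> list:
    """Scale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def min_max_scale(values: list) -> list:
    """Scale linearly onto the interval [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


raw = [1.0, None, 3.0, float("inf"), 5.0]
cleaned = impute_median(raw)          # [1.0, 3.0, 3.0, 3.0, 5.0]
scaled = min_max_scale(cleaned)       # [0.0, 0.5, 0.5, 0.5, 1.0]
standardized = standardize(cleaned)
```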

### 4. Optimization Engine

Located in `src/optimization/optimizer.py`:
- Implements a genetic algorithm in the `GeneticOptimizer` class for budget allocation optimization.
- Uses parameter summaries from the Bayesian model to inform optimization decisions.
- Includes fitness evaluation based on expected conversion value.
- Handles constraints such as budget limits and business rules.
- Provides alternative optimization methods through SciPy's optimization library.
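To make the genetic-algorithm idea concrete, here is a small self-contained sketch of budget allocation across channels. It is not the `GeneticOptimizer` implementation: the fitness function (diminishing returns via `log1p`), the `effects` values standing in for posterior parameter summaries, and all hyperparameters are assumptions. The budget constraint is enforced by rescaling every candidate so that it spends exactly the total budget.

```python
import math
import random


def normalize(weights: list, budget: float) -> list:
    """Rescale positive weights so the allocation sums to the budget."""
    total = sum(weights)
    return [budget * w / total for w in weights]


def fitness(allocation: list, effects: list) -> float:
    """Hypothetical expected conversion value with diminishing returns."""
    return sum(e * math.log1p(a) for e, a in zip(effects, allocation))


def optimize_budget(effects, budget, pop_size=30, generations=60,
                    mutation_scale=0.1, seed=0):
    """Evolve allocations with elitism, averaging crossover, and mutation."""
    if pop_size < 4:
        raise ValueError("pop_size must be at least 4")
    rng = random.Random(seed)
    n = len(effects)
    # Start from an equal split plus random allocations.
    population = [[budget / n] * n]
    while len(population) < pop_size:
        population.append(normalize([rng.random() + 1e-9 for _ in range(n)], budget))
    for _ in range(generations):
        population.sort(key=lambda a: fitness(a, effects), reverse=True)
        elite = population[: pop_size // 2]  # best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            child = [(x + y) / 2 for x, y in zip(p1, p2)]  # averaging crossover
            child = [max(1e-9, c * (1 + rng.gauss(0, mutation_scale))) for c in child]
            children.append(normalize(child, budget))  # repair budget constraint
        population = elite + children
    return max(population, key=lambda a: fitness(a, effects))


effects = [0.5, 1.0, 2.0]  # stand-ins for per-channel effect estimates
best = optimize_budget(effects, budget=100.0)
```

Because the best half of each generation is carried over unchanged and the equal split is in the initial population, the returned allocation is never worse than an equal split under this fitness function.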

### 5. Configuration Management

Located in `src/config/config_manager.py`:
- Handles loading and validation of configuration from YAML files.
- Provides a singleton configuration object accessible throughout the application.
- Implements schema validation to ensure configuration correctness.
- Supports environment-specific configuration overrides.
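One possible shape for validated, environment-aware configuration is sketched below. Plain dictionaries stand in for parsed YAML, and every key name, the `draws` value, and the `test` environment are hypothetical; only the default of 4 chains and the `refined_model_v3` structure name come from this document.

```python
from copy import deepcopy
from functools import lru_cache

# Stand-ins for parsed YAML content (key names are hypothetical).
BASE_CONFIG = {
    "inference": {"chains": 4, "draws": 1000},
    "model": {"structure": "refined_model_v3"},
}
ENV_OVERRIDES = {"test": {"inference": {"draws": 50}}}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def validate(config: dict) -> None:
    """Minimal schema check: required keys exist and have sensible values."""
    chains = config["inference"]["chains"]
    if not isinstance(chains, int) or chains < 1:
        raise ValueError("inference.chains must be a positive integer")
    if not isinstance(config["model"]["structure"], str):
        raise ValueError("model.structure must be a string")


@lru_cache(maxsize=None)
def get_config(env: str = "production") -> dict:
    """Build the configuration once per environment and reuse the same object."""
    config = deep_merge(BASE_CONFIG, ENV_OVERRIDES.get(env, {}))
    validate(config)
    return config


test_config = get_config("test")
```

Caching the result gives singleton-like access: repeated calls for the same environment return the same object, so callers should treat it as read-only.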

### 6. Visualization Module

Located in `src/visualization/visualizer.py`:
- Generates DAG visualizations of the Bayesian Network structure.
- Creates diagnostic plots for MCMC convergence and parameter distributions.
- Produces optimization result visualizations for budget allocation.
- Includes interactive plotting capabilities using Plotly.
- Produces high-quality static figures using Matplotlib.
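This document does not say how the DAG is drawn, so the following is only one possible approach, swapped in for illustration: turning an edge list into Graphviz DOT text, which many tools can render. The function name and the node names are hypothetical.

```python
def dag_to_dot(edges: list, name: str = "bayesian_network") -> str:
    """Serialize (parent, child) edges as a Graphviz DOT digraph."""
    nodes = sorted({node for edge in edges for node in edge})
    lines = [f"digraph {name} {{", "  rankdir=LR;"]
    lines += [f'  "{node}";' for node in nodes]
    lines += [f'  "{parent}" -> "{child}";' for parent, child in edges]
    lines.append("}")
    return "\n".join(lines)


# Hypothetical structure: two parents feeding one conversion node.
edges = [("ad_click", "purchase"), ("cart_add", "purchase")]
dot_text = dag_to_dot(edges)
```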

### 7. Logging and Error Handling

Located in `src/utils/logger.py` and `src/utils/error_handler.py`:
- Implements a centralized logging system with configurable verbosity.
- Provides structured error handling with appropriate error classes.
- Includes error recovery mechanisms where possible.
- Generates detailed error reports for debugging.
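A structured error hierarchy combined with a central logger could be sketched as follows. The class names, the `divergences` attribute, and the helper functions are hypothetical and are not taken from `src/utils/`.

```python
import logging


class ConversionFlowError(Exception):
    """Hypothetical base class for all library-specific errors."""


class ConfigurationError(ConversionFlowError):
    """Raised when configuration is missing or invalid."""


class InferenceError(ConversionFlowError):
    """Raised when sampling fails; carries context for error reports."""

    def __init__(self, message: str, *, divergences: int = 0):
        super().__init__(message)
        self.divergences = divergences


def get_logger(name: str = "conversionflow", level: int = logging.INFO) -> logging.Logger:
    """Return a named logger, attaching a single stream handler on first use."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


def run_safely(step, logger: logging.Logger):
    """Run a step; log and swallow library errors, let anything else propagate."""
    try:
        return step()
    except ConversionFlowError as exc:
        logger.error("%s: %s", type(exc).__name__, exc)
        return None
```

Catching only the library's own base class keeps recovery targeted: unexpected exceptions such as programming errors still surface with a full traceback.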

## Application Flow

1. **Configuration Loading**: The application begins by loading configuration parameters from YAML files.
2. **Data Processing**: Raw data is loaded and preprocessed according to configuration settings.
3. **Model Building**: A Bayesian Network model is constructed based on the DAG structure defined in the configuration.
4. **Model Inference**: MCMC sampling is performed to estimate model parameters.
5. **Results Analysis**: Diagnostic checks are performed, and parameter summaries are generated.
6. **Optimization**: Budget allocation is optimized based on model parameters.
7. **Output Generation**: Reports, visualizations, and data files are produced.
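The seven steps above can be pictured as a chain of stages that each take the accumulated state and return an extended copy. The sketch below uses placeholder stages only; the stage names mirror the list, but `run_pipeline`, `placeholder`, and the state keys are hypothetical and do not reflect the library's entry point.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]
Stage = Callable[[State], State]


def run_pipeline(stages: List[Tuple[str, Stage]], initial: Optional[State] = None) -> State:
    """Run stages in order, passing each one the state produced so far."""
    state: State = dict(initial or {})
    completed = []
    for name, stage in stages:
        state = stage(state)
        completed.append(name)
    state["completed"] = completed
    return state


def placeholder(key: str, value: Any) -> Stage:
    """Build a dummy stage that adds one entry to the state."""
    def stage(state: State) -> State:
        return {**state, key: value}
    return stage


STAGES = [
    ("configuration_loading", placeholder("config", {"chains": 4})),
    ("data_processing", placeholder("data", [0, 1, 1, 0])),
    ("model_building", placeholder("model", "refined_model_v3")),
    ("model_inference", placeholder("trace", "posterior samples")),
    ("results_analysis", placeholder("summary", "parameter summaries")),
    ("optimization", placeholder("allocation", "budget allocation")),
    ("output_generation", placeholder("outputs", "reports and plots")),
]

final_state = run_pipeline(STAGES)
```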

## Directory Structure

```
conversionflow/
├── src/
│   ├── data/              # Data processing components
│   ├── models/            # Bayesian Network implementation
│   ├── optimization/      # Optimization algorithms
│   ├── visualization/     # Plotting and visualization tools
│   ├── config/            # Configuration management
│   └── utils/             # Utility functions and helpers
├── tests/                 # Test suite
├── examples/              # Example scripts and notebooks
├── documentation/         # Documentation files
└── assets/                # Static assets and configuration files
```

## Dependencies

The project relies on several key libraries:
- PyMC: For Bayesian modeling and MCMC inference
- ArviZ: For Bayesian diagnostics and visualization
- NumPy/Pandas: For data manipulation
- Matplotlib/Plotly: For visualization
- PyYAML: For configuration management
- SciPy: For optimization algorithms (alternative to genetic algorithm)

## Extension Points

The architecture is designed with several extension points:
- Custom prior distributions in the Bayesian model
- Additional preprocessing steps in the data pipeline
- Alternative optimization algorithms
- Custom visualization styles and formats
- New node types in the Bayesian Network model