# Codebase Explanation & Walkthrough

This document provides a detailed technical explanation of the "AI Stock Prediction System" (Stockker). Use it to understand the underlying logic and architecture.

## 1. System Architecture (The Big Picture)

The system operates as a **Microservices-based Pipeline** with four distinct stages:

1. **Ingestion Layer**: Fetches raw market data (Open, High, Low, Close, Volume) from Alpha Vantage.
2. **Processing Layer**: Transforms the raw data into technical indicators (features).
3. **Training Orchestration**: A Prefect pipeline that trains, evaluates, and saves models.
4. **Inference API**: A FastAPI server that loads these models to provide real-time predictions.

## 2. Key Components Explained

### A. Data Ingestion & Processing

**File:** `src/processing/features.py`

Feature engineering is critical for financial ML. We treat the market data as a time-series problem.

- **SMA (Simple Moving Average)**: Captures the trend over 20- and 50-day windows.
- **RSI (Relative Strength Index)**: A momentum oscillator (0-100) used to identify overbought/oversold conditions.
- **MACD (Moving Average Convergence Divergence)**: Tracks momentum changes.
- **Target Variables**:
  - `target_price`: The next day's closing price (regression).
  - `target_direction`: 1 (up) or 0 (down) for the next day (classification).

(A minimal feature-engineering sketch appears at the end of this section.)

### B. Machine Learning Models (Ensemble Approach)

**File:** `src/models/train.py`

Instead of relying on a single algorithm, we use **Ensemble Learning** for robustness.

#### 1. Regression (Predicting Price)

We use a **Voting Regressor**, which averages the predictions of three models:

- **Linear Regression**: Captures simple linear trends.
- **Random Forest Regressor**: Captures complex, non-linear patterns (100 decision trees).
- **SVR (Support Vector Regressor)**: Uses an RBF kernel to model non-linear relationships in a high-dimensional feature space.

*Why?* Combining these reduces the variance and error of any single model.

#### 2. Classification (Predicting Direction)

We use a **Voting Classifier** (soft voting) combining:

- **Random Forest Classifier**: Robust against overfitting.
- **SVC (Support Vector Classifier)**: Good at separating classes with clear margins.

*Soft voting* means we average the class *probabilities* of each model rather than just counting their final votes, which leads to more nuanced predictions. (See the ensemble sketch at the end of this section.)

#### 3. Unsupervised Learning (Market Analysis)

- **K-Means Clustering**: Groups market days into 3 clusters (e.g., low volatility vs. high volatility) based on volatility and RSI.
- **PCA**: Reduces our 4 feature dimensions (SMA20, SMA50, RSI, MACD) to 2 principal components for 2D visualization.

### C. Orchestration (The Pipeline)

**File:** `src/orchestration/flows.py`

We use **Prefect** to manage the workflow.

- **`main_pipeline`**: Loops through our stock list (`AAPL`, `GOOGL`, `MSFT`, `AMZN`, `TSLA`, `NVDA`).
- For each stock, it runs sequentially: Fetch -> Process -> Train -> Evaluate -> Notify Discord.
- **Error Handling**: If one stock fails, the pipeline logs the error (via Discord) and continues to the next, ensuring resilience. (See the flow sketch at the end of this section.)

### D. The API (Model Serving)

**File:** `src/api/main.py`

- **Dynamic Loading**: On startup (`@app.on_event("startup")`), the API scans the `models/` directory and loads whatever models it finds (e.g., `models/NVDA/regression_model.pkl`), making the system easily extensible to new stocks without changing code. (See the loading sketch below.)
- **Endpoints**: Exposes REST endpoints (`/predict/price`, `/predict/direction`) that the frontend consumes.
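To make the indicators in Section 2A concrete, here is a minimal, hypothetical pandas sketch. Column names, window lengths for RSI/MACD, and the function name are assumptions and may differ from the actual `features.py`; the RSI shown is the common rolling-mean formulation.

```python
import pandas as pd


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the indicator/target construction in features.py.

    Assumes `df` has a 'close' column of daily closing prices, oldest first.
    """
    out = df.copy()

    # Simple moving averages over 20- and 50-day windows.
    out["sma_20"] = out["close"].rolling(window=20).mean()
    out["sma_50"] = out["close"].rolling(window=50).mean()

    # RSI: average gain vs. average loss over a 14-day window (illustrative default).
    delta = out["close"].diff()
    gain = delta.clip(lower=0).rolling(window=14).mean()
    loss = (-delta.clip(upper=0)).rolling(window=14).mean()
    out["rsi"] = 100 - 100 / (1 + gain / loss)

    # MACD: difference between the 12- and 26-day exponential moving averages.
    ema_12 = out["close"].ewm(span=12, adjust=False).mean()
    ema_26 = out["close"].ewm(span=26, adjust=False).mean()
    out["macd"] = ema_12 - ema_26

    # Targets: next day's close (regression) and next-day direction (classification).
    out["target_price"] = out["close"].shift(-1)
    out["target_direction"] = (out["target_price"] > out["close"]).astype(int)

    # Drop rows left incomplete by the rolling windows and the shift.
    return out.dropna()
```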
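The ensembles described in Section 2B map directly onto scikit-learn's voting estimators. The sketch below assumes `train.py` uses the standard scikit-learn API; apart from the 100-tree forests, the hyperparameters are illustrative.

```python
from sklearn.ensemble import (
    RandomForestClassifier,
    RandomForestRegressor,
    VotingClassifier,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC, SVR

# Regression ensemble: averages the predictions of three regressors.
price_model = VotingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=42)),
        ("svr", SVR(kernel="rbf")),
    ]
)

# Classification ensemble: soft voting averages class probabilities,
# so SVC needs probability=True to expose predict_proba.
direction_model = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(kernel="rbf", probability=True)),
    ],
    voting="soft",
)

# Usage (X = indicator columns; y_price / y_direction = the two targets):
# price_model.fit(X_train, y_price_train)
# direction_model.fit(X_train, y_direction_train)
```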
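The per-stock loop and error handling in Section 2C could look roughly like the following Prefect sketch. The task names and the `notify_discord` helper are assumptions; only the overall shape (sequential stages, try/except per stock, Discord notification) comes from the description above.

```python
from prefect import flow, task

DEFAULT_TICKERS = ["AAPL", "GOOGL", "MSFT", "AMZN", "TSLA", "NVDA"]


@task
def fetch_data(ticker: str): ...

@task
def process_features(raw): ...

@task
def train_models(features): ...

@task
def evaluate_models(models, features): ...


def notify_discord(message: str) -> None:
    """Hypothetical webhook helper; the real project posts to a Discord channel."""
    ...


@flow
def main_pipeline(tickers: list[str] | None = None):
    for ticker in tickers or DEFAULT_TICKERS:
        try:
            raw = fetch_data(ticker)
            features = process_features(raw)
            models = train_models(features)
            metrics = evaluate_models(models, features)
            notify_discord(f"{ticker}: training finished, metrics={metrics}")
        except Exception as exc:
            # One failing stock must not stop the others.
            notify_discord(f"{ticker}: pipeline failed with {exc!r}")
            continue
```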
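Finally, the startup-time model discovery from Section 2D can be sketched as below. The directory layout follows the `models/NVDA/regression_model.pkl` example above; the serialization format (pickle), the request schema, and the response shape are assumptions.

```python
import pickle
from pathlib import Path

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
MODELS: dict[str, dict[str, object]] = {}  # e.g. MODELS["NVDA"]["regression_model"]


class Features(BaseModel):
    sma_20: float
    sma_50: float
    rsi: float
    macd: float


@app.on_event("startup")
def load_models() -> None:
    # Scan models/<TICKER>/ and load every pickled model found there, so adding
    # a new stock only requires dropping its files into models/ -- no code change.
    models_dir = Path("models")
    if not models_dir.exists():
        return
    for ticker_dir in models_dir.iterdir():
        if not ticker_dir.is_dir():
            continue
        loaded = {}
        for pkl in ticker_dir.glob("*.pkl"):
            with pkl.open("rb") as fh:
                loaded[pkl.stem] = pickle.load(fh)
        if loaded:
            MODELS[ticker_dir.name] = loaded


@app.post("/predict/price")
def predict_price(ticker: str, features: Features):
    model = MODELS.get(ticker, {}).get("regression_model")
    if model is None:
        raise HTTPException(status_code=404, detail=f"No model loaded for {ticker}")
    prediction = model.predict(
        [[features.sma_20, features.sma_50, features.rsi, features.macd]]
    )
    return {"ticker": ticker, "predicted_price": float(prediction[0])}
```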
## 3. Infrastructure & DevOps

### Docker

**Files:** `Dockerfile` & `docker-compose.yml`

- We containerize the application so that it runs identically on a laptop and in the cloud.
- The `docker-compose` setup includes a Postgres service, which is used **exclusively by Prefect** to store flow run history. The main app uses a **file-based system** (CSVs/PKLs) for simplicity and portability.

### CI/CD (GitHub Actions)

**File:** `.github/workflows/deploy_to_hf.yml`

- On every push to `main`, GitHub Actions automatically:
  1. Runs `pytest` to verify code correctness.
  2. Pushes the code to the Hugging Face Space, triggering a new deployment.

## 4. Why This Architecture?

- **Modularity**: Separation of concerns (ingestion vs. training vs. serving) makes debugging easier.
- **Scalability**: Adding a new stock (like NVDA) only required adding its ticker to the stock list; the pipeline handled the rest.
- **Reliability**: Ensembles avoid putting all of our eggs in a single model's basket.