# Codebase Explanation & Walkthrough
This document provides a detailed technical explanation of the "AI Stock Prediction System" (Stockker). Use this to understand the underlying logic and architecture.
## 1. System Architecture (The Big Picture)
The system operates as a **Microservices-based Pipeline** with four distinct stages:
1. **Ingestion Layer**: Fetches raw market data (Open, High, Low, Close, Volume) from Alpha Vantage.
2. **Processing Layer**: Transforms raw data into technical indicators (Features).
3. **Training Orchestration**: A Prefect pipeline that trains, evaluates, and saves models.
4. **Inference API**: A FastAPI server that loads these models to provide real-time predictions.
## 2. Key Components Explained
### A. Data Ingestion & Processing
**File:** `src/processing/features.py`
Feature engineering is critical for financial ML. We treat the market data as a time-series problem.
- **SMA (Simple Moving Average)**: Calculates the trend over 20 and 50 days.
- **RSI (Relative Strength Index)**: A momentum oscillator (0-100) to identify overbought/oversold conditions.
- **MACD (Moving Average Convergence Divergence)**: Tracks momentum changes.
- **Target Variables**:
- `target_price`: Next day's closing price (Regression).
- `target_direction`: 1 (Up) or 0 (Down) for next day (Classification).
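The indicator and target construction above can be sketched in pandas. This is a minimal illustration rather than the project's actual code: the `close` column name is assumed, and the 14-day RSI window and 12/26-day MACD EMAs are conventional defaults, since the source only pins down the 20- and 50-day SMA windows.

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the indicators and targets described above from a frame with a 'close' column."""
    out = df.copy()

    # Trend: simple moving averages over 20 and 50 days.
    out["sma_20"] = out["close"].rolling(20).mean()
    out["sma_50"] = out["close"].rolling(50).mean()

    # Momentum: 14-day RSI from average gains vs. average losses.
    delta = out["close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["rsi"] = 100 - 100 / (1 + gain / loss)

    # MACD: 12-day EMA minus 26-day EMA of the close.
    ema_12 = out["close"].ewm(span=12, adjust=False).mean()
    ema_26 = out["close"].ewm(span=26, adjust=False).mean()
    out["macd"] = ema_12 - ema_26

    # Targets: next day's close (regression) and next day's direction (classification).
    out["target_price"] = out["close"].shift(-1)
    out["target_direction"] = (out["target_price"] > out["close"]).astype(int)

    # Drop warm-up rows (long rolling windows) and the final row (no next-day target).
    return out.dropna()
```

Note that the `shift(-1)` targets make the last row unusable, which is why it is dropped along with the rolling-window warm-up rows.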
### B. Machine Learning Models (Ensemble Approach)
**File:** `src/models/train.py`
Instead of relying on a single algorithm, we use **Ensemble Learning** for robustness.
#### 1. Regression (Predicting Price)
We use a **Voting Regressor**, which averages predictions from three models:
- **Linear Regression**: Captures simple linear trends.
- **Random Forest Regressor**: Captures complex, non-linear patterns (100 Decision Trees).
- **SVR (Support Vector Regressor)**: Uses an RBF kernel to find the optimal hyperplane in high-dimensional space.
*Why?* Combining these reduces the variance and error of any single model.
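A minimal scikit-learn sketch of this averaging ensemble; hyperparameters beyond the 100 trees and the RBF kernel named above are library defaults, not values taken from the project, and the standardization step is added here because SVR is scale-sensitive.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def build_price_model() -> VotingRegressor:
    """Average the predictions of the three regressors described above."""
    return VotingRegressor(
        estimators=[
            ("linear", LinearRegression()),
            ("forest", RandomForestRegressor(n_estimators=100, random_state=42)),
            # SVR is sensitive to feature scale, so standardize before fitting.
            ("svr", make_pipeline(StandardScaler(), SVR(kernel="rbf"))),
        ]
    )
```

`VotingRegressor.predict` returns the unweighted mean of the three base predictions, which is exactly the variance-reduction effect described above.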
#### 2. Classification (Predicting Direction)
We use a **Voting Classifier** (Soft Voting) combining:
- **Random Forest Classifier**: Robust against overfitting.
- **SVC (Support Vector Classifier)**: Good at separating classes with clear margins.
*Soft Voting* means we average each model's predicted class *probabilities* rather than counting their final votes, which yields more nuanced predictions.
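The soft-voting classifier can be sketched the same way. One detail worth noting: SVC must be constructed with `probability=True`, otherwise it cannot supply the class probabilities that soft voting averages. Hyperparameters are again library defaults, not the project's.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_direction_model() -> VotingClassifier:
    """Soft-voting ensemble over the two classifiers described above."""
    return VotingClassifier(
        estimators=[
            ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
            # probability=True lets SVC contribute calibrated-ish probabilities
            # to the soft vote (it enables internal Platt scaling).
            ("svc", make_pipeline(StandardScaler(), SVC(probability=True))),
        ],
        voting="soft",  # average class probabilities, not hard votes
    )
```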
#### 3. Unsupervised Learning (Market Analysis)
- **K-Means Clustering**: Groups market days into 3 clusters (e.g., Low Volatility, High Volatility) based on volatility and RSI.
- **PCA**: Reduces our 4 dimensions (SMA20, SMA50, RSI, MACD) into 2 principal components for 2D visualization.
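The unsupervised step might look like the following sketch. Assumptions to note: it clusters and projects the same four-indicator matrix for brevity, whereas the source says clustering uses volatility and RSI specifically, and the scaling step is added because both K-Means and PCA are scale-sensitive.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def analyze_market(features: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """features: one row per market day, columns [sma_20, sma_50, rsi, macd]."""
    scaled = StandardScaler().fit_transform(features)

    # Group market days into 3 regimes (e.g., low/medium/high volatility).
    labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(scaled)

    # Project the 4 indicator dimensions down to 2 components for plotting.
    coords = PCA(n_components=2).fit_transform(scaled)
    return labels, coords
```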
### C. Orchestration (The Pipeline)
**File:** `src/orchestration/flows.py`
We use **Prefect** to manage the workflow.
- **`main_pipeline`**: Loops through our stock list (`AAPL`, `GOOGL`, `MSFT`, `AMZN`, `TSLA`, `NVDA`).
- For each stock, it sequentially runs: Fetch -> Process -> Train -> Evaluate -> Notify Discord.
- **Error Handling**: If one stock fails, the pipeline logs the error (via Discord) and continues to the next, ensuring resilience.
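The per-stock loop with its fail-and-continue behaviour reduces to the pattern below. This is a plain-Python sketch: the real flow wraps each step in Prefect `@task`/`@flow` decorators, and the step functions here are hypothetical stand-ins passed in as arguments.

```python
def run_pipeline(tickers, fetch, process, train, evaluate, notify):
    """Run Fetch -> Process -> Train -> Evaluate -> Notify for each ticker.

    A failure for one ticker is reported (e.g., to Discord) and the loop
    moves on, so one bad stock never halts the whole pipeline.
    """
    results = {}
    for ticker in tickers:
        try:
            raw = fetch(ticker)
            features = process(raw)
            model = train(features)
            results[ticker] = evaluate(model, features)
            notify(f"{ticker}: pipeline succeeded")
        except Exception as exc:  # resilience: log and continue
            notify(f"{ticker}: pipeline failed: {exc}")
    return results
```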
### D. The API (Model Serving)
**File:** `src/api/main.py`
- **Dynamic Loading**: On startup (`@app.on_event("startup")`), the API scans the `models/` directory. It dynamically loads whatever models it finds (e.g., `models/NVDA/regression_model.pkl`), making the system easily extensible to new stocks without changing code.
- **Endpoints**: Exposes REST endpoints (`/predict/price`, `/predict/direction`) that the frontend consumes.
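The model-scanning part of that startup hook could look like this sketch. The one-folder-per-ticker layout is inferred from the `models/NVDA/regression_model.pkl` example above, and the real API would register this function via its startup event and keep the returned registry for the endpoints to use.

```python
import pickle
from pathlib import Path

def load_models(models_dir: str = "models") -> dict:
    """Scan models/<TICKER>/*.pkl and return {ticker: {model_name: model}}.

    New stocks are picked up automatically: dropping a models/XYZ/ folder
    with pickled models in it requires no code change.
    """
    registry: dict = {}
    for pkl_path in Path(models_dir).glob("*/*.pkl"):
        ticker = pkl_path.parent.name          # e.g. "NVDA"
        model_name = pkl_path.stem             # e.g. "regression_model"
        with pkl_path.open("rb") as fh:
            registry.setdefault(ticker, {})[model_name] = pickle.load(fh)
    return registry
```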
## 3. Infrastructure & DevOps
### Docker
**File:** `Dockerfile` & `docker-compose.yml`
- We containerize the application to ensure it runs identically on your laptop and the cloud.
- The `docker-compose` setup includes a Postgres service, which is used **exclusively by Prefect** to store flow run history. The main app uses a **file-based system** (CSVs/PKLs) for simplicity and portability.
### CI/CD (GitHub Actions)
**File:** `.github/workflows/deploy_to_hf.yml`
- On every push to `main`, GitHub Actions automatically:
1. Runs `pytest` to verify code correctness.
2. Pushes the code to the Hugging Face Space, triggering a new deployment.
## 4. Why This Architecture?
- **Modularity**: Separation of concerns (Ingestion vs Training vs Serving) makes debugging easy.
- **Scalability**: Adding a new stock (like NVDA) only required adding a string to the list; the pipeline handled the rest.
- **Reliability**: Ensembles prevent "putting all eggs in one basket" model-wise.