# Codebase Explanation & Walkthrough

This document provides a detailed technical explanation of the "AI Stock Prediction System" (Stockker). Use this to understand the underlying logic and architecture.

## 1. System Architecture (The Big Picture)
The system operates as a **Microservices-based Pipeline** with four distinct stages:
1.  **Ingestion Layer**: Fetches raw market data (Open, High, Low, Close, Volume) from Alpha Vantage.
2.  **Processing Layer**: Transforms raw data into technical indicators (Features).
3.  **Training Orchestration**: A Prefect pipeline that trains, evaluates, and saves models.
4.  **Inference API**: A FastAPI server that loads these models to provide real-time predictions.

## 2. Key Components Explained

### A. Data Ingestion & Processing
**File:** `src/processing/features.py`
Feature engineering is critical for financial ML. We treat the market data as a time-series problem.
-   **SMA (Simple Moving Average)**: Calculates the trend over 20 and 50 days.
-   **RSI (Relative Strength Index)**: A momentum oscillator (0-100) to identify overbought/oversold conditions.
-   **MACD (Moving Average Convergence Divergence)**: Tracks momentum changes.
-   **Target Variables**:
    -   `target_price`: Next day's closing price (Regression).
    -   `target_direction`: 1 (Up) or 0 (Down) for next day (Classification).
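
As a rough illustration of the moving-average and target definitions above, here is a minimal pure-Python sketch. The function names `sma` and `next_day_targets` are hypothetical and not taken from `src/processing/features.py`, and counting an unchanged close as 0 (Down) is an assumption.

```python
from typing import List, Optional, Tuple


def sma(closes: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until a full window of closes is available."""
    out: List[Optional[float]] = []
    for i in range(len(closes)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(closes[i + 1 - window : i + 1]) / window)
    return out


def next_day_targets(
    closes: List[float],
) -> Tuple[List[Optional[float]], List[Optional[int]]]:
    """Build (target_price, target_direction) for each day.

    The last day has no "next day", so both targets are None there.
    Assumption: an unchanged close counts as 0 (Down).
    """
    prices: List[Optional[float]] = []
    directions: List[Optional[int]] = []
    for i, today in enumerate(closes):
        if i + 1 < len(closes):
            tomorrow = closes[i + 1]
            prices.append(tomorrow)
            directions.append(1 if tomorrow > today else 0)
        else:
            prices.append(None)
            directions.append(None)
    return prices, directions


closes = [10.0, 11.0, 12.0, 11.0]
print(sma(closes, 2))            # [None, 10.5, 11.5, 11.5]
print(next_day_targets(closes))  # ([11.0, 12.0, 11.0, None], [1, 1, 0, None])
```

Rows with a `None` feature or target (the warm-up window and the final day) would typically be dropped before training.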

### B. Machine Learning Models (Ensemble Approach)
**File:** `src/models/train.py`
Instead of relying on a single algorithm, we use **Ensemble Learning** for robustness.

#### 1. Regression (Predicting Price)
We use a **Voting Regressor**, which averages predictions from three models:
-   **Linear Regression**: Captures simple linear trends.
-   **Random Forest Regressor**: Captures complex, non-linear patterns (100 Decision Trees).
-   **SVR (Support Vector Regressor)**: Uses an RBF kernel to find the optimal hyperplane in high-dimensional space.
*Why?* Averaging the three predictions tends to reduce variance and dampen the errors of any single model.

#### 2. Classification (Predicting Direction)
We use a **Voting Classifier** (Soft Voting) combining:
-   **Random Forest Classifier**: Relatively robust against overfitting, since it averages many decorrelated trees.
-   **SVC (Support Vector Classifier)**: Good at separating classes with clear margins.
*Soft Voting* means we average each model's predicted class *probabilities*, rather than counting only their final votes, which leads to more nuanced predictions.
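
To make the soft-voting arithmetic concrete, here is a small pure-Python sketch. It is not the project's training code: `soft_vote` is a hypothetical helper and the probabilities are made up for illustration. Each row is one model's `[P(Down), P(Up)]` for a single day.

```python
from typing import List, Sequence


def soft_vote(probas: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across models (equal weights)."""
    n_models = len(probas)
    n_classes = len(probas[0])
    return [sum(p[c] for p in probas) / n_models for c in range(n_classes)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Illustrative numbers only: columns are [P(Down = 0), P(Up = 1)].
rf_proba = [0.375, 0.625]   # Random Forest leans Up, weakly
svc_proba = [0.875, 0.125]  # SVC leans Down, strongly

avg = soft_vote([rf_proba, svc_proba])
print(avg)          # [0.625, 0.375]
print(argmax(avg))  # 0 (Down)
```

The two models disagree on the label, so a plain vote count would tie. Averaging probabilities lets the more confident model decide the outcome.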

#### 3. Unsupervised Learning (Market Analysis)
-   **K-Means Clustering**: Groups market days into 3 clusters (e.g., Low Volatility, High Volatility) based on volatility and RSI.
-   **PCA**: Reduces our 4 dimensions (SMA20, SMA50, RSI, MACD) into 2 principal components for 2D visualization.

### C. Orchestration (The Pipeline)
**File:** `src/orchestration/flows.py`
We use **Prefect** to manage the workflow.
-   **`main_pipeline`**: Loops through our stock list (`AAPL`, `GOOGL`, `MSFT`, `AMZN`, `TSLA`, `NVDA`).
-   For each stock, it sequentially runs: Fetch -> Process -> Train -> Evaluate -> Notify Discord.
-   **Error Handling**: If one stock fails, the pipeline logs the error (via Discord) and continues to the next, ensuring resilience.
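
The per-stock error isolation described above can be sketched in plain Python as follows. This leaves out Prefect entirely. `run_all`, `run_one`, and `notify` are hypothetical stand-ins for the flow, the per-stock steps, and the Discord notifier.

```python
from typing import Callable, Dict, Iterable, List


def run_all(
    tickers: Iterable[str],
    run_one: Callable[[str], None],
    notify: Callable[[str], None],
) -> Dict[str, List[str]]:
    """Run the per-stock steps for every ticker, isolating failures.

    A failure for one ticker is reported through `notify` and recorded,
    then the loop moves on to the next ticker.
    """
    succeeded: List[str] = []
    failed: List[str] = []
    for ticker in tickers:
        try:
            run_one(ticker)  # Fetch -> Process -> Train -> Evaluate
            succeeded.append(ticker)
        except Exception as exc:  # broad on purpose: one bad stock must not stop the rest
            notify(f"{ticker} failed: {exc}")
            failed.append(ticker)
    return {"succeeded": succeeded, "failed": failed}


def demo_run_one(ticker: str) -> None:
    """Stand-in for the real steps; fails for one ticker to show the behavior."""
    if ticker == "GOOGL":
        raise ValueError("no data")


print(run_all(["AAPL", "GOOGL", "MSFT"], demo_run_one, print))
```

Returning the succeeded and failed lists is a design choice of this sketch. It makes the outcome easy to assert on in tests and to summarize in a final notification.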

### D. The API (Model Serving)
**File:** `src/api/main.py`
-   **Dynamic Loading**: On startup (`@app.on_event("startup")`), the API scans the `models/` directory. It dynamically loads whatever models it finds (e.g., `models/NVDA/regression_model.pkl`), making the system easily extensible to new stocks without changing code.
-   **Endpoints**: Exposes REST endpoints (`/predict/price`, `/predict/direction`) that the frontend consumes.
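
The directory scan behind dynamic loading might look roughly like the sketch below. It assumes the `models/<TICKER>/<name>.pkl` layout suggested by the example path above. `load_models` is a hypothetical helper, not the actual code in `src/api/main.py`, and the FastAPI startup wiring is left out.

```python
import pickle
from pathlib import Path
from typing import Any, Dict


def load_models(models_dir: str) -> Dict[str, Dict[str, Any]]:
    """Load every `<models_dir>/<TICKER>/<name>.pkl` into a nested dict.

    Result shape: {"NVDA": {"regression_model": <object>, ...}, ...}.
    Only unpickle files you trust: pickle can execute arbitrary code.
    """
    registry: Dict[str, Dict[str, Any]] = {}
    root = Path(models_dir)
    if not root.is_dir():
        return registry  # nothing trained yet
    for ticker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for pkl_path in sorted(ticker_dir.glob("*.pkl")):
            with pkl_path.open("rb") as fh:
                registry.setdefault(ticker_dir.name, {})[pkl_path.stem] = pickle.load(fh)
    return registry
```

Because the registry is keyed by directory name, training a new ticker and writing its `.pkl` files is enough for it to appear on the next startup.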

## 3. Infrastructure & DevOps

### Docker
**Files:** `Dockerfile` & `docker-compose.yml`
-   We containerize the application to ensure it runs identically on your laptop and the cloud.
-   The `docker-compose` setup includes a Postgres service, which is used **exclusively by Prefect** to store flow run history. The main app uses a **file-based system** (CSVs/PKLs) for simplicity and portability.

### CI/CD (GitHub Actions)
**File:** `.github/workflows/deploy_to_hf.yml`
-   On every push to `main`, GitHub Actions automatically:
    1.  Runs `pytest` to verify code correctness.
    2.  Pushes the code to the Hugging Face Space, triggering a new deployment.

## 4. Why This Architecture?
-   **Modularity**: Separation of concerns (Ingestion vs Training vs Serving) makes debugging easy.
-   **Scalability**: Adding a new stock (like NVDA) only required adding a string to the list; the pipeline handled the rest.
-   **Reliability**: Ensembles prevent "putting all eggs in one basket" model-wise.