# AI Stock Prediction & Analysis System - Project Report
## 1. Introduction
The **AI Stock Prediction & Analysis System** is an end-to-end machine learning solution designed to predict stock market prices and analyze market regimes in real-time. By leveraging a combination of ensemble machine learning models and unsupervised learning techniques, the system provides users with actionable insights into stock trends and volatility.
### 1.1 Problem Statement
Stock market prediction is inherently challenging due to the stochastic nature of financial data. Traditional methods often fail to capture complex non-linear patterns or adapt to changing market conditions. This project aims to address these challenges by building a robust, automated pipeline that integrates real-time data ingestion, advanced feature engineering, and ensemble modeling to improve prediction accuracy and market understanding.
### 1.2 Objectives
* Develop an automated pipeline for fetching and processing daily stock data.
* Implement an ensemble of models (Linear Regression, Random Forest, SVM) for price prediction.
* Apply unsupervised learning (Clustering, PCA) to identify market volatility regimes.
* Deploy a user-friendly interactive dashboard using Streamlit.
* Ensure system reliability through CI/CD pipelines and automated testing.
---
## 2. System Architecture
The system follows a modular microservices-like architecture, ensuring scalability and maintainability.
### 2.1 Core Components
* **Frontend (User Interface):** Built with **Streamlit**, providing an interactive dashboard for users to select stocks, view real-time metrics, and visualize predictions.
* **Backend & Orchestration:**
    * **Prefect:** Orchestrates the entire ML workflow, from data ingestion to model inference, ensuring reproducible and scheduled runs.
    * **FastAPI:** (Integrated) Serves as the backend framework for handling API requests and model serving.
* **Data Layer:**
    * **Alpha Vantage API:** The primary source for real-time and historical stock market data (Daily Time Series).
    * **Local Storage/Database:** Stores raw CSVs and processed datasets for training and inference.
* **Notification Service:** A custom Discord notification module supports system observability by alerting administrators to pipeline status and errors. It includes a custom DNS bypass for restricted network environments.
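
The report does not say how the notification module talks to Discord. As a minimal sketch, assuming it posts to a Discord incoming webhook (which accepts a JSON body with a `content` field), a status alert could be built as follows. The function names and the `User-Agent` string are hypothetical, and the custom DNS bypass is not reproduced here.

```python
import json
import urllib.request


def build_discord_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Discord webhook."""
    # A webhook message is a JSON object whose "content" field holds the text.
    payload = json.dumps({"content": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            # Assumption: an explicit User-Agent avoids servers that reject
            # the default urllib one. The value itself is made up.
            "User-Agent": "stock-pipeline-notifier/0.1",
        },
        method="POST",
    )


def notify(webhook_url: str, message: str, timeout: float = 10.0) -> int:
    """Send the message and return the HTTP status code."""
    request = build_discord_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending keeps the payload logic testable without network access.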
### 2.2 Infrastructure & DevOps
* **Docker:** The entire application is containerized using Docker to ensure consistent environments across development and production.
* **CI/CD Pipeline:** Hosted on **GitHub Actions**, the pipeline automatically tests the code (pytest, ruff) and deploys changes.
* **Deployment:** The application is deployed on **Hugging Face Spaces**, providing a publicly accessible interface.
---
## 3. Methodology
### 3.1 Data Ingestion
The system uses the Alpha Vantage API to fetch daily historical data.
* **Source:** `src/ingestion/ingest.py`
* **Process:** The `fetch_daily_data` function retrieves `TIME_SERIES_DAILY` data in CSV format, capturing open, high, low, close, and volume for either the most recent 100 data points (compact mode) or the full history.
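
The body of `fetch_daily_data` is not shown in this report. The sketch below illustrates the two steps such a function would likely need: building the request URL and parsing the CSV response. The helper names `build_daily_url` and `parse_daily_csv` are hypothetical, and the parser assumes the CSV header is `timestamp,open,high,low,close,volume`.

```python
import csv
import io
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"


def build_daily_url(symbol: str, api_key: str, full_history: bool = False) -> str:
    """Return the query URL for daily data in CSV format."""
    params = {
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        # "compact" returns the latest 100 points, "full" the whole history.
        "outputsize": "full" if full_history else "compact",
        "datatype": "csv",
        "apikey": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def parse_daily_csv(text: str) -> list[dict]:
    """Convert the CSV response into typed rows (assumed column names)."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "timestamp": record["timestamp"],
                "open": float(record["open"]),
                "high": float(record["high"]),
                "low": float(record["low"]),
                "close": float(record["close"]),
                "volume": int(record["volume"]),
            }
        )
    return rows
```

The response body can be downloaded with any HTTP client and passed to `parse_daily_csv` as text.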
### 3.2 Feature Engineering
Raw data is transformed into meaningful features to capture market momentum and trends.
* **Source:** `src/processing/features.py`
* **Key Indicators:**
    * **Simple Moving Average (SMA):** Calculated for 20-day and 50-day windows to identify trend direction.
    * **Relative Strength Index (RSI):** A 14-day momentum oscillator to detect overbought or oversold conditions.
    * **MACD (Moving Average Convergence Divergence):** Captures changes in the strength, direction, momentum, and duration of a trend.
    * **Lagged Features:** Past values of the series, implicit in time-series modeling, used to predict future values.
* **Target Variables:**
    * `target_direction`: Binary classification target (1 if the price goes up, 0 if it goes down).
    * `target_price`: Regression target (the next day's closing price).
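
The indicators and targets above can be sketched in plain Python as follows. This is an illustration, not the contents of `src/processing/features.py`: the MACD spans (12, 26, 9) are the conventional defaults and are an assumption here, the RSI uses simple averages of gains and losses rather than any particular smoothing, and all inputs are assumed to be closing prices ordered oldest first.

```python
def sma(values, window):
    """Simple moving average; None until `window` values are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for value in values[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out


def macd(values, fast=12, slow=26, signal=9):
    """Return (MACD line, signal line): fast EMA minus slow EMA, then its EMA."""
    macd_line = [f - s for f, s in zip(ema(values, fast), ema(values, slow))]
    return macd_line, ema(macd_line, signal)


def rsi(values, period=14):
    """RSI from simple averages of gains and losses over `period` changes."""
    out = [None] * len(values)
    for i in range(period, len(values)):
        changes = [values[j] - values[j - 1] for j in range(i - period + 1, i + 1)]
        avg_gain = sum(c for c in changes if c > 0) / period
        avg_loss = sum(-c for c in changes if c < 0) / period
        if avg_loss == 0:
            out[i] = 100.0
        else:
            out[i] = 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
    return out


def targets(closes):
    """Next-day close and up/down label; the last row has no next day."""
    target_price = closes[1:] + [None]
    target_direction = [
        None if nxt is None else int(nxt > cur)
        for cur, nxt in zip(closes, target_price)
    ]
    return target_price, target_direction
```

Rows whose indicators or targets are `None` (the warm-up period and the final day) would need to be dropped before training.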
### 3.3 Machine Learning Models
The system employs an **Ensemble Learning** strategy to improve generalization and reduce overfitting.
* **Regression Models:** Predict the exact future price.
    * *Linear Regression:* Captures linear relationships.
    * *Random Forest Regressor:* Handles non-linearities and feature interactions.
    * *Support Vector Regressor (SVR):* Effective in high-dimensional spaces.
* **Classification Models:** Predict the directional movement (Up/Down).
* **Unsupervised Learning:**
    * **PCA & Clustering:** Used to analyze market regimes, grouping market states based on volatility and price action patterns (e.g., "High Volatility", "Bullish Trend").
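
The report does not state how the base models' outputs are combined. One common choice, assumed here purely for illustration, is an unweighted average for the regressors and a majority vote for the classifiers. In the sketch below each model is any callable that maps a feature vector to a prediction; the function names are hypothetical. With scikit-learn estimators, `sklearn.ensemble.VotingRegressor` provides the same averaging behavior.

```python
from statistics import mean
from typing import Callable, Sequence

Features = Sequence[float]


def ensemble_price(models: Sequence[Callable[[Features], float]], x: Features) -> float:
    """Average the price predictions of all base regressors."""
    return mean(model(x) for model in models)


def ensemble_direction(models: Sequence[Callable[[Features], int]], x: Features) -> int:
    """Majority vote over 0/1 direction predictions; ties resolve to 0 (down)."""
    votes = [model(x) for model in models]
    return int(sum(votes) * 2 > len(votes))


# Stand-ins for fitted models; real ones would wrap each model's predict call.
stub_regressors = [lambda x: 101.0, lambda x: 103.0, lambda x: 102.0]
print(ensemble_price(stub_regressors, [0.0]))
```

Averaging reduces the variance contributed by any single model, which is the usual argument for the generalization benefit mentioned above.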
### 3.4 Data Validation
To ensure data quality and model reliability, the system integrates **DeepChecks**.
* **Data Integrity:** Automated checks for missing values, duplicates, and conflicting labels.
* **Drift Detection:** Validates that the training and testing data distributions remain consistent (`train_test_validation`), alerting to potential data drift.
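
To make these checks concrete, the sketch below hand-rolls simplified versions of them in plain Python. It is a stand-in that shows what such checks compute, not the DeepChecks API: the function names are hypothetical, and the drift measure (mean shift in units of the training standard deviation) is a deliberately crude substitute for a proper distribution comparison.

```python
import math
from statistics import mean, pstdev


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def integrity_report(rows, label_key):
    """Count missing values, exact duplicate rows, and conflicting labels."""
    missing = sum(1 for row in rows for value in row.values() if is_missing(value))
    seen = set()
    duplicates = 0
    labels_by_features = {}
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
        # Rows with identical features but different labels are conflicts.
        features = tuple(sorted((k, v) for k, v in row.items() if k != label_key))
        labels_by_features.setdefault(features, set()).add(row[label_key])
    conflicts = sum(1 for labels in labels_by_features.values() if len(labels) > 1)
    return {"missing": missing, "duplicates": duplicates, "conflicting_labels": conflicts}


def mean_shift(train, test):
    """Absolute difference of means, in training standard deviations."""
    spread = pstdev(train)
    if spread == 0:
        return 0.0 if mean(train) == mean(test) else math.inf
    return abs(mean(test) - mean(train)) / spread


# Made-up rows: one duplicate, one label conflict, one missing value.
rows = [
    {"rsi": 50.0, "label": 1},
    {"rsi": 50.0, "label": 1},
    {"rsi": 50.0, "label": 0},
    {"rsi": None, "label": 1},
]
print(integrity_report(rows, "label"))
```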
---
## 4. Implementation & Testing
### 4.1 Development
The project is structured within the `src/` directory, separating concerns into `ingestion`, `processing`, `models`, and `orchestration`.
### 4.2 Quality Assurance
* **Unit Testing:** Implemented using `pytest` (located in `tests/`) to verify individual components.
* **Data Validation:** Integrated **DeepChecks** to perform automated integrity checks and detect data drift between training and testing datasets.
* **Linting:** Code quality is maintained using `ruff` to enforce PEP 8 standards.
* **Automated Workflows:**
    * `ci.yml`: Triggers on push/pull request to `main`, running tests and linter.
    * `deploy_to_hf.yml`: Automatically syncs the repository to Hugging Face Spaces upon successful merge.
---
## 5. Results & Conclusion
The system demonstrates a complete end-to-end ML lifecycle, from data ingestion through model inference to deployment. The Streamlit dashboard lets users select stocks, view current metrics, and visualize predictions, while Discord notifications alert administrators to pipeline status and errors.
### 5.1 Key Achievements
* Fully automated data pipeline, orchestrated with Prefect.
* Ensemble model implementation combining Linear Regression, Random Forest, and SVR.
* Containerized deployment on Hugging Face Spaces.
* Code quality standards enforced via CI/CD (`pytest`, `ruff`).

This project serves as a comprehensive template for scalable financial machine learning applications.