AI Stock Prediction & Analysis System - Project Report
1. Introduction
The AI Stock Prediction & Analysis System is an end-to-end machine learning solution designed to predict stock prices and analyze market regimes in real time. By combining ensemble machine learning models with unsupervised learning techniques, the system gives users actionable insights into stock trends and volatility.
1.1 Problem Statement
Stock market prediction is inherently challenging due to the stochastic nature of financial data. Traditional methods often fail to capture complex non-linear patterns or adapt to changing market conditions. This project aims to address these challenges by building a robust, automated pipeline that integrates real-time data ingestion, advanced feature engineering, and ensemble modeling to improve prediction accuracy and market understanding.
1.2 Objectives
- Develop an automated pipeline for fetching and processing daily stock data.
- Implement ensemble learning models (Linear Regression, Random Forest, SVM) for price prediction.
- Apply unsupervised learning (Clustering, PCA) to identify market volatility regimes.
- Deploy a user-friendly interactive dashboard using Streamlit.
- Ensure system reliability through CI/CD pipelines and automated testing.
2. System Architecture
The system follows a modular, microservices-like architecture intended to keep components scalable and maintainable.
2.1 Core Components
- Frontend (User Interface): Built with Streamlit, providing an interactive dashboard for users to select stocks, view real-time metrics, and visualize predictions.
- Backend & Orchestration:
  - Prefect: Orchestrates the entire ML workflow, from data ingestion to model inference, ensuring reproducible and scheduled runs.
  - FastAPI: (Integrated) Serves as the backend framework for handling API requests and model serving.
- Data Layer:
  - Alpha Vantage API: The primary source for real-time and historical stock market data (Daily Time Series).
  - Local Storage/Database: Stores raw CSVs and processed datasets for training and inference.
- Notification Service: A custom Discord notification module supports system observability by alerting administrators to pipeline status and errors. It includes a custom DNS bypass for restricted network environments.
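The notification step can be sketched as follows. This is a minimal illustration that assumes the module posts to a Discord webhook; the report does not say how messages are delivered, and the names build_payload and notify, the message layout, and the User-Agent string are hypothetical. The custom DNS bypass is not covered here.

```python
import json
import urllib.request

# Discord caps message content at 2000 characters.
DISCORD_CONTENT_LIMIT = 2000


def build_payload(stage: str, ok: bool, detail: str = "") -> bytes:
    """Build the JSON body for a webhook message (hypothetical message layout)."""
    status = "succeeded" if ok else "FAILED"
    content = f"[pipeline] {stage} {status}"
    if detail:
        content += f": {detail}"
    return json.dumps({"content": content[:DISCORD_CONTENT_LIMIT]}).encode("utf-8")


def notify(webhook_url: str, stage: str, ok: bool, detail: str = "") -> int:
    """POST the message to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(stage, ok, detail),
        headers={
            "Content-Type": "application/json",
            # Explicit agent, in case the default urllib agent is rejected.
            "User-Agent": "stock-pipeline-notifier/0.1",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping payload construction separate from the network call lets the message format be unit-tested without contacting Discord.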
2.2 Infrastructure & DevOps
- Docker: The entire application is containerized using Docker to ensure consistent environments across development and production.
- CI/CD Pipeline: Hosted on GitHub Actions, the pipeline automatically tests the code (pytest, ruff) and deploys changes.
- Deployment: The application is deployed on Hugging Face Spaces, providing a publicly accessible interface.
3. Methodology
3.1 Data Ingestion
The system utilizes the Alpha Vantage API to fetch daily historical data.
- Source: src/ingestion/ingest.py
- Process: The fetch_daily_data function retrieves TIME_SERIES_DAILY in CSV format, capturing open, high, low, close, and volume data for the last 100 data points (compact mode) or full history.
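The ingestion step described above might look roughly like the sketch below. Only the function name fetch_daily_data and the TIME_SERIES_DAILY CSV request come from the report; the signature, the helpers build_url and parse_daily_csv, and the error handling are assumptions made for illustration.

```python
import csv
import io
import urllib.parse
import urllib.request
from typing import Dict, List

BASE_URL = "https://www.alphavantage.co/query"


def build_url(symbol: str, api_key: str, full_history: bool = False) -> str:
    """Build a TIME_SERIES_DAILY request URL that asks for CSV output."""
    params = {
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        # "compact" returns the latest 100 points, "full" the whole history.
        "outputsize": "full" if full_history else "compact",
        "datatype": "csv",
        "apikey": api_key,
    }
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"


def parse_daily_csv(text: str) -> List[Dict[str, object]]:
    """Parse the CSV body into typed rows, sorted oldest first."""
    reader = csv.DictReader(io.StringIO(text))
    if not reader.fieldnames or "timestamp" not in reader.fieldnames:
        # Error or rate-limit responses are not CSV, so fail loudly.
        raise ValueError("unexpected response: " + text[:200])
    rows: List[Dict[str, object]] = []
    for record in reader:
        rows.append({
            "timestamp": record["timestamp"],
            "open": float(record["open"]),
            "high": float(record["high"]),
            "low": float(record["low"]),
            "close": float(record["close"]),
            "volume": int(record["volume"]),
        })
    # ISO dates sort correctly as strings.
    rows.sort(key=lambda row: row["timestamp"])
    return rows


def fetch_daily_data(symbol: str, api_key: str, full_history: bool = False):
    """Download and parse daily bars (hypothetical signature)."""
    url = build_url(symbol, api_key, full_history)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_daily_csv(response.read().decode("utf-8"))
```

Sorting the rows oldest first means later window calculations such as moving averages run forward in time regardless of the order in which the rows arrive.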
3.2 Feature Engineering
Raw data is transformed into meaningful features to capture market momentum and trends.
- Source: src/processing/features.py
- Key Indicators:
  - Simple Moving Average (SMA): Calculated for 20-day and 50-day windows to identify trend direction.
  - Relative Strength Index (RSI): A 14-day momentum oscillator to detect overbought or oversold conditions.
  - MACD (Moving Average Convergence Divergence): Captures changes in the strength, direction, momentum, and duration of a trend.
  - Lagged Features: Past values of the series, implicit in time-series modeling, used as inputs for predicting future values.
- Target Variables:
  - target_direction: Binary classification target (1 if the price goes up, 0 if it goes down).
  - target_price: Regression target (the next day's closing price).
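The indicators and targets listed above can be sketched in dependency-free Python as follows. This is an illustration, not the contents of features.py, which may well use pandas instead. Several details are assumptions: the RSI here uses simple averages of gains and losses rather than Wilder's smoothing, the MACD periods of 12, 26, and 9 are the conventional defaults rather than values stated in the report, and target_direction compares the next close with the current close and counts an unchanged price as 0.

```python
from typing import List, Optional, Tuple


def sma(values: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until a full window is available."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def ema(values: List[float], span: int) -> List[float]:
    """Exponential moving average seeded with the first value."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for value in values[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out


def rsi(closes: List[float], period: int = 14) -> List[Optional[float]]:
    """RSI from simple averages of gains and losses over `period` changes."""
    out: List[Optional[float]] = [None] * len(closes)
    changes = [b - a for a, b in zip(closes, closes[1:])]
    for i in range(period, len(closes)):
        window = changes[i - period : i]  # the `period` changes ending at day i
        avg_gain = sum(c for c in window if c > 0) / period
        avg_loss = sum(-c for c in window if c < 0) / period
        if avg_loss == 0:
            out[i] = 100.0
        else:
            out[i] = 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
    return out


def macd(closes: List[float], fast: int = 12, slow: int = 26,
         signal: int = 9) -> Tuple[List[float], List[float]]:
    """Return the MACD line (fast EMA minus slow EMA) and its signal line."""
    line = [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
    return line, ema(line, signal)


def make_targets(closes: List[float]) -> Tuple[List[float], List[int]]:
    """Targets for every day except the last, which has no next close."""
    target_price = closes[1:]
    target_direction = [1 if nxt > cur else 0
                        for cur, nxt in zip(closes, closes[1:])]
    return target_price, target_direction
```

Because the targets look one day ahead, the final row of any feature table has no label and would be dropped before training.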
3.3 Machine Learning Models
The system employs an Ensemble Learning strategy to improve generalization and reduce overfitting.
- Regression Models: Predict the exact future price.
  - Linear Regression: Captures linear relationships.
  - Random Forest Regressor: Handles non-linearities and feature interactions.
  - Support Vector Regressor (SVR): Effective in high-dimensional spaces.
- Classification Models: Predict the directional movement (Up/Down).
- Unsupervised Learning:
  - PCA & Clustering: Used to analyze market regimes, grouping market states based on volatility and price action patterns (e.g., "High Volatility", "Bullish Trend").
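One way to wire up the models listed above with scikit-learn is sketched below. The report does not say how the three regressors are combined or which clustering algorithm is used, so prediction averaging via VotingRegressor and K-means with three clusters are assumptions, as are the function names and hyperparameters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def build_price_ensemble(random_state: int = 0) -> VotingRegressor:
    """Average the predictions of the three regressors named in the report."""
    return VotingRegressor(
        estimators=[
            ("linear", LinearRegression()),
            ("forest", RandomForestRegressor(n_estimators=100,
                                             random_state=random_state)),
            # SVR is sensitive to feature scale, so it gets its own scaler.
            ("svr", make_pipeline(StandardScaler(), SVR())),
        ]
    )


def label_regimes(features: np.ndarray, n_regimes: int = 3,
                  random_state: int = 0) -> np.ndarray:
    """Scale, project onto two principal components, then cluster.

    Returns one integer cluster id per row. The ids are arbitrary, so
    names such as "High Volatility" must be assigned by inspecting each
    cluster afterwards.
    """
    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=2),
        KMeans(n_clusters=n_regimes, n_init=10, random_state=random_state),
    )
    return pipeline.fit_predict(features)
```

A typical call would be build_price_ensemble().fit(X_train, y_train).predict(X_test), with rows kept in chronological order so that the test period comes after the training period.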
3.4 Data Validation
To ensure data quality and model reliability, the system integrates DeepChecks.
- Data Integrity: Automated checks for missing values, duplicates, and conflicting labels.
- Drift Detection: Validates that the training and testing data distributions remain consistent (train_test_validation), alerting to potential data drift.
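To make the three integrity checks concrete, here is a hand-rolled sketch of what they measure. It is not the DeepChecks API and does not reproduce its behavior; the function name integrity_report and the row format (one dict per sample, with None marking a missing value) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def integrity_report(rows: List[Row], label: str) -> Dict[str, int]:
    """Count missing values, duplicate rows, and conflicting labels."""
    # Missing values: cells that are None (NaN handling is left out here).
    missing = sum(value is None for row in rows for value in row.values())

    seen = set()
    duplicates = 0
    labels_by_features = defaultdict(set)
    for row in rows:
        # Sort by column name so key order does not affect comparison.
        items = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if items in seen:
            duplicates += 1  # exact repeat of an earlier row
        seen.add(items)
        features = tuple(kv for kv in items if kv[0] != label)
        labels_by_features[features].add(row[label])

    # Conflicting labels: identical features mapped to different labels.
    conflicting = sum(len(labels) > 1 for labels in labels_by_features.values())
    return {
        "missing_values": missing,
        "duplicate_rows": duplicates,
        "conflicting_label_groups": conflicting,
    }
```

For example, two samples with identical indicator values but opposite target_direction labels would be reported as one conflicting group.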
4. Implementation & Testing
4.1 Development
The project is structured within the src/ directory, separating concerns into ingestion, processing, models, and orchestration.
4.2 Quality Assurance
- Unit Testing: Implemented using pytest (located in tests/) to verify individual components.
- Data Validation: Integrated DeepChecks to perform automated integrity checks and detect data drift between training and testing datasets.
- Linting: Code quality is maintained using ruff to enforce PEP 8 standards.
- Automated Workflows:
  - ci.yml: Triggers on push/pull request to main, running tests and linter.
  - deploy_to_hf.yml: Automatically syncs the repository to Hugging Face Spaces upon successful merge.
5. Results & Conclusion
The system successfully demonstrates a complete end-to-end ML lifecycle. The Streamlit dashboard provides a seamless user experience, allowing for real-time stock analysis. The integration of Discord notifications ensures that the system is monitored effectively.
5.1 Key Achievements
- Fully automated data pipeline.
- Robust ensemble model implementation.
- Resilient deployment on Hugging Face Spaces.
- High code quality standards enforced via CI/CD.
This project serves as a comprehensive template for scalable financial machine learning applications.