
Stockker / docs /project_report.md

umer6016

Upload folder using huggingface_hub

ce3d808 verified 17 days ago


	# AI Stock Prediction & Analysis System - Project Report

	## 1. Introduction
	The AI Stock Prediction & Analysis System is an end-to-end machine learning solution designed to predict stock prices and analyze market regimes in real time. By combining an ensemble of supervised models with unsupervised learning techniques, the system gives users insight into stock trends and volatility.

	### 1.1 Problem Statement
	Stock market prediction is inherently challenging due to the stochastic nature of financial data. Traditional methods often fail to capture complex non-linear patterns or adapt to changing market conditions. This project aims to address these challenges by building a robust, automated pipeline that integrates real-time data ingestion, advanced feature engineering, and ensemble modeling to improve prediction accuracy and market understanding.

	### 1.2 Objectives
	* Develop an automated pipeline for fetching and processing daily stock data.
	* Implement an ensemble of models (Linear Regression, Random Forest, SVM) for price prediction.
	* Apply unsupervised learning (Clustering, PCA) to identify market volatility regimes.
	* Deploy a user-friendly interactive dashboard using Streamlit.
	* Ensure system reliability through CI/CD pipelines and automated testing.

	---

	## 2. System Architecture
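	The system follows a modular, microservices-like architecture: the user interface, orchestration, data layer, and notifications are separate components, which keeps the system maintainable and easier to scale.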

	### 2.1 Core Components
	* Frontend (User Interface): Built with Streamlit, providing an interactive dashboard for users to select stocks, view real-time metrics, and visualize predictions.
	* Backend & Orchestration:
	  * Prefect: Orchestrates the entire ML workflow, from data ingestion to model inference, so that runs are reproducible and can be scheduled.
	  * FastAPI: Integrated as the backend framework for handling API requests and serving models.
	* Data Layer:
	  * Alpha Vantage API: The primary source for real-time and historical stock market data (Daily Time Series).
	  * Local Storage/Database: Stores raw CSVs and processed datasets for training and inference.
	* Notification Service: A custom Discord notification module provides system observability by alerting administrators to pipeline status and errors. It includes a custom DNS bypass for restricted network environments.
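
As a rough illustration of the notification service, the sketch below posts a status message to a Discord webhook using only the standard library. The function names, the placeholder webhook URL, and the `User-Agent` value are assumptions made for this example, not the project's actual module, and the custom DNS bypass mentioned above is not reproduced here.

```python
import json
import urllib.request


def build_discord_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Prepare a POST request carrying `message` as a Discord webhook payload."""
    # Discord limits the `content` field to 2000 characters, so truncate defensively.
    payload = json.dumps({"content": message[:2000]}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            # Hypothetical agent string; some services reject urllib's default one.
            "User-Agent": "stock-pipeline-notifier/0.1",
        },
        method="POST",
    )


def notify(webhook_url: str, message: str) -> int:
    """Send the message and return the HTTP status code."""
    request = build_discord_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Splitting request construction from sending keeps the payload logic testable without network access; a pipeline step would call `notify(...)` with the real webhook URL read from configuration.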

	### 2.2 Infrastructure & DevOps
	* Docker: The entire application is containerized using Docker to ensure consistent environments across development and production.
	* CI/CD Pipeline: Hosted on GitHub Actions, the pipeline automatically tests the code (pytest, ruff) and deploys changes.
	* Deployment: The application is deployed on Hugging Face Spaces, providing a publicly accessible interface.

	---

	## 3. Methodology

	### 3.1 Data Ingestion
	The system utilizes the `Alpha Vantage` API to fetch daily historical data.
	* Source: `src/ingestion/ingest.py`
	* Process: The `fetch_daily_data` function retrieves `TIME_SERIES_DAILY` in CSV format, capturing open, high, low, close, and volume data for the last 100 data points (compact mode) or full history.
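
The ingestion step described above might look roughly like the following. This is a minimal standard-library sketch, not the contents of `src/ingestion/ingest.py`: the signature of `fetch_daily_data`, the helper names, and the assumed CSV column names (`timestamp`, `open`, `high`, `low`, `close`, `volume`) are illustrative assumptions.

```python
import csv
import io
import urllib.parse
import urllib.request

ALPHA_VANTAGE_URL = "https://www.alphavantage.co/query"


def build_daily_url(symbol: str, api_key: str, full_history: bool = False) -> str:
    """Build a TIME_SERIES_DAILY request URL that asks for CSV output."""
    params = {
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        # "compact" returns the latest 100 data points, "full" the whole history.
        "outputsize": "full" if full_history else "compact",
        "datatype": "csv",
        "apikey": api_key,
    }
    return f"{ALPHA_VANTAGE_URL}?{urllib.parse.urlencode(params)}"


def parse_daily_csv(text: str) -> list[dict]:
    """Parse the CSV payload into rows with numeric OHLCV fields."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "timestamp": record["timestamp"],
            "open": float(record["open"]),
            "high": float(record["high"]),
            "low": float(record["low"]),
            "close": float(record["close"]),
            "volume": int(record["volume"]),
        })
    return rows


def fetch_daily_data(symbol: str, api_key: str, full_history: bool = False) -> list[dict]:
    """Download and parse daily bars for one symbol (hypothetical signature)."""
    url = build_daily_url(symbol, api_key, full_history)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_daily_csv(response.read().decode("utf-8"))
```

Keeping URL construction and CSV parsing separate from the network call lets unit tests exercise both without an API key.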

	### 3.2 Feature Engineering
	Raw data is transformed into meaningful features to capture market momentum and trends.
	* Source: `src/processing/features.py`
	* Key Indicators:
	  * Simple Moving Average (SMA): Calculated over 20-day and 50-day windows to identify trend direction.
	  * Relative Strength Index (RSI): A 14-day momentum oscillator used to detect overbought or oversold conditions.
	  * MACD (Moving Average Convergence Divergence): Captures changes in the strength, direction, momentum, and duration of a trend.
	  * Lagged Features: Past values of the series, implicit in time-series modeling, used to predict future values.
	* Target Variables:
	  * `target_direction`: Binary classification target (1 if the price goes up, 0 if it goes down).
	  * `target_price`: Regression target (the next day's closing price).
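
To make the indicators and targets concrete, here is a small pure-Python sketch. It is not the code in `src/processing/features.py`: the function names are hypothetical, the RSI uses simple averages of gains and losses (one common variant; the report does not say which smoothing the project uses), and MACD is left out for brevity.

```python
def sma(values, window):
    """Simple moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def rsi(closes, window=14):
    """RSI from simple averages of gains and losses over `window` price changes."""
    out = [None] * len(closes)
    changes = [closes[i] - closes[i - 1] for i in range(1, len(closes))]
    for i in range(window, len(closes)):
        recent = changes[i - window : i]  # the `window` changes ending at day i
        avg_gain = sum(c for c in recent if c > 0) / window
        avg_loss = sum(-c for c in recent if c < 0) / window
        if avg_loss == 0:
            out[i] = 100.0  # no losses in the window
        else:
            out[i] = 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
    return out


def make_targets(closes):
    """Next-day close and up/down label; the last day has no target."""
    target_price = closes[1:] + [None]
    target_direction = [
        1 if nxt > cur else 0 for cur, nxt in zip(closes, closes[1:])
    ] + [None]
    return target_price, target_direction
```

Rows where an indicator or target is `None` (the warm-up period and the final day) would need to be dropped before training.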

	### 3.3 Machine Learning Models
	The system employs an ensemble learning strategy to improve generalization and reduce overfitting.
	* Regression Models: Predict the next day's closing price (`target_price`).
	  * Linear Regression: Captures linear relationships.
	  * Random Forest Regressor: Handles non-linearities and feature interactions.
	  * Support Vector Regressor (SVR): Effective in high-dimensional spaces.
	* Classification Models: Predict the directional movement (Up/Down).
	* Unsupervised Learning:
	  * PCA & Clustering: Used to analyze market regimes by grouping market states according to volatility and price-action patterns (e.g., "High Volatility", "Bullish Trend").
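
The report does not state how the three regressors are combined. One plausible reading is a simple averaging ensemble, sketched below with scikit-learn's `VotingRegressor`. The helper name `build_price_ensemble`, the hyperparameters, and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def build_price_ensemble(random_state: int = 0) -> VotingRegressor:
    """Average the predictions of the three regressors named in the report."""
    return VotingRegressor(
        estimators=[
            ("linear", LinearRegression()),
            ("forest", RandomForestRegressor(n_estimators=100, random_state=random_state)),
            # SVR is sensitive to feature scale, so it gets its own scaler.
            ("svr", make_pipeline(StandardScaler(), SVR())),
        ]
    )


# Synthetic stand-in for engineered features (e.g. SMA, RSI, MACD columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 100.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Chronological split: time-series rows should not be shuffled.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = build_price_ensemble()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Averaging lets the linear model anchor the overall trend while the forest and SVR contribute non-linear corrections; weighted averaging or stacking would be alternative designs.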

	### 3.4 Data Validation
	To ensure data quality and model reliability, the system integrates DeepChecks.
	* Data Integrity: Automated checks for missing values, duplicates, and conflicting labels.
	* Drift Detection: Validates that the training and testing data distributions remain consistent (`train_test_validation`), flagging potential data drift.

	---

	## 4. Implementation & Testing

	### 4.1 Development
	The project is structured within the `src/` directory, separating concerns into `ingestion`, `processing`, `models`, and `orchestration`.

	### 4.2 Quality Assurance
	* Unit Testing: Implemented using `pytest` (located in `tests/`) to verify individual components.
	* Data Validation: Integrated DeepChecks to perform automated integrity checks and detect data drift between training and testing datasets.
	* Linting: Code quality is maintained using `ruff` to enforce PEP 8 standards.
	* Automated Workflows:
	  * `ci.yml`: Triggers on push/pull request to `main`, running tests and linter.
	  * `deploy_to_hf.yml`: Automatically syncs the repository to Hugging Face Spaces upon successful merge.

	---

	## 5. Results & Conclusion
	The system demonstrates a complete end-to-end ML lifecycle, from data ingestion through model inference to deployment. The Streamlit dashboard lets users analyze stocks interactively, and the Discord notifications keep administrators informed of pipeline status and errors.

	### 5.1 Key Achievements
	* Fully automated data pipeline.
	* Robust ensemble model implementation.
	* Resilient deployment on Hugging Face Spaces.
	* High code quality standards enforced via CI/CD.

	This project serves as a comprehensive template for scalable financial machine learning applications.