# AI Stock Prediction & Analysis System - Project Report

## 1. Introduction

The **AI Stock Prediction & Analysis System** is an end-to-end machine learning solution designed to predict stock market prices and analyze market regimes in real-time. By leveraging a combination of ensemble machine learning models and unsupervised learning techniques, the system provides users with actionable insights into stock trends and volatility.

### 1.1 Problem Statement

Stock market prediction is inherently challenging due to the stochastic nature of financial data. Traditional methods often fail to capture complex non-linear patterns or adapt to changing market conditions. This project aims to address these challenges by building a robust, automated pipeline that integrates real-time data ingestion, advanced feature engineering, and ensemble modeling to improve prediction accuracy and market understanding.

### 1.2 Objectives

* Develop an automated pipeline for fetching and processing daily stock data.
* Implement ensemble learning models (Linear Regression, Random Forest, SVM) for price prediction.
* Apply unsupervised learning (Clustering, PCA) to identify market volatility regimes.
* Deploy a user-friendly interactive dashboard using Streamlit.
* Ensure system reliability through CI/CD pipelines and automated testing.

---

## 2. System Architecture

The system follows a modular microservices-like architecture, ensuring scalability and maintainability.

### 2.1 Core Components

* **Frontend (User Interface):** Built with **Streamlit**, providing an interactive dashboard for users to select stocks, view real-time metrics, and visualize predictions.
* **Backend & Orchestration:**
    * **Prefect:** Orchestrates the entire ML workflow, from data ingestion to model inference, ensuring reproducible and scheduled runs.
    * **FastAPI:** Integrated as the backend framework for handling API requests and model serving.
* **Data Layer:**
    * **Alpha Vantage API:** The primary source for real-time and historical stock market data (Daily Time Series).
    * **Local Storage/Database:** Stores raw CSVs and processed datasets for training and inference.
* **Notification Service:** A custom Discord notification module maintains system observability by alerting administrators to pipeline status or errors. It includes a custom DNS bypass for restricted network environments.

### 2.2 Infrastructure & DevOps

* **Docker:** The entire application is containerized using Docker to ensure consistent environments across development and production.
* **CI/CD Pipeline:** Hosted on **GitHub Actions**, the pipeline automatically tests the code (pytest, ruff) and deploys changes.
* **Deployment:** The application is deployed on **Hugging Face Spaces**, providing a publicly accessible interface.

---

## 3. Methodology

### 3.1 Data Ingestion

The system uses the `Alpha Vantage` API to fetch daily historical data.

* **Source:** `src/ingestion/ingest.py`
* **Process:** The `fetch_daily_data` function retrieves `TIME_SERIES_DAILY` in CSV format, capturing open, high, low, close, and volume data for the last 100 data points (compact mode) or full history.

### 3.2 Feature Engineering

Raw data is transformed into meaningful features to capture market momentum and trends.

* **Source:** `src/processing/features.py`
* **Key Indicators:**
    * **Simple Moving Average (SMA):** Calculated for 20-day and 50-day windows to identify trend direction.
    * **Relative Strength Index (RSI):** A 14-day momentum oscillator to detect overbought or oversold conditions.
    * **MACD (Moving Average Convergence Divergence):** Captures changes in the strength, direction, momentum, and duration of a trend.
    * **Lagged Features:** Implicit in time-series modeling; used to predict future values.
* **Target Variables:**
    * `target_direction`: Binary classification target (1 if the price goes up, 0 if it goes down).
    * `target_price`: Regression target (next day's closing price).

### 3.3 Machine Learning Models

The system employs an **Ensemble Learning** strategy to improve generalization and reduce overfitting.

* **Regression Models:** Predict the exact future price.
    * *Linear Regression:* Captures linear relationships.
    * *Random Forest Regressor:* Handles non-linearities and feature interactions.
    * *Support Vector Regressor (SVR):* Effective in high-dimensional spaces.
* **Classification Models:** Predict the directional movement (Up/Down).
* **Unsupervised Learning:**
    * **PCA & Clustering:** Used to analyze market regimes, grouping market states based on volatility and price action patterns (e.g., "High Volatility", "Bullish Trend").

### 3.4 Data Validation

To ensure data quality and model reliability, the system integrates **DeepChecks**.

* **Data Integrity:** Automated checks for missing values, duplicates, and conflicting labels.
* **Drift Detection:** Validates that the training and testing data distributions remain consistent (`train_test_validation`), alerting to potential data drift.

---

## 4. Implementation & Testing

### 4.1 Development

The project is structured within the `src/` directory, separating concerns into `ingestion`, `processing`, `models`, and `orchestration`.

### 4.2 Quality Assurance

* **Unit Testing:** Implemented using `pytest` (located in `tests/`) to verify individual components.
* **Data Validation:** Integrated **DeepChecks** to perform automated integrity checks and detect data drift between training and testing datasets.
* **Linting:** Code quality is maintained using `ruff` to enforce PEP 8 standards.
* **Automated Workflows:**
    * `ci.yml`: Triggers on push/pull request to `main`, running tests and linter.
    * `deploy_to_hf.yml`: Automatically syncs the repository to Hugging Face Spaces upon successful merge.

---

## 5. Results & Conclusion

The system successfully demonstrates a complete end-to-end ML lifecycle.
The Streamlit dashboard provides a seamless user experience, allowing for real-time stock analysis. The integration of Discord notifications ensures that the system is monitored effectively.

### 5.1 Key Achievements

* Fully automated data pipeline.
* Robust ensemble model implementation.
* Resilient deployment on Hugging Face Spaces.
* High code quality standards enforced via CI/CD.

This project serves as a comprehensive template for scalable financial machine learning applications.
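
As an illustration of the feature engineering described in Section 3.2, the sketch below computes a simple moving average, an RSI, and the two target variables from a list of closing prices. It is not the code in `src/processing/features.py`: the function names `sma`, `rsi`, and `make_targets` are hypothetical, the RSI uses plain averages of gains and losses (Wilder's smoothing, which many libraries use, gives different values), and a flat day is labeled 0 as an assumption, since the report only defines the up and down cases.

```python
from typing import List, Optional, Tuple


def sma(values: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until a full window is available."""
    out: List[Optional[float]] = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out


def rsi(closes: List[float], period: int = 14) -> List[Optional[float]]:
    """RSI from plain averages of gains and losses over `period` price changes.

    Assumption: simple-average variant, not Wilder's smoothing.
    """
    out: List[Optional[float]] = [None] * len(closes)
    for i in range(period, len(closes)):
        changes = [closes[j] - closes[j - 1] for j in range(i - period + 1, i + 1)]
        avg_gain = sum(c for c in changes if c > 0) / period
        avg_loss = sum(-c for c in changes if c < 0) / period
        if avg_loss == 0:
            # No losses in the window: RSI is capped at 100 here.
            out[i] = 100.0
        else:
            out[i] = 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
    return out


def make_targets(closes: List[float]) -> Tuple[List[float], List[int]]:
    """Return (target_price, target_direction), aligned with closes[:-1].

    target_price is the next day's close; target_direction is 1 when the
    next close is higher than the current one, else 0 (assumption: flat = 0).
    """
    target_price = closes[1:]
    target_direction = [1 if nxt > cur else 0 for cur, nxt in zip(closes, closes[1:])]
    return target_price, target_direction
```

With the windows named in Section 3.2, the calls would be `sma(closes, 20)`, `sma(closes, 50)`, and `rsi(closes, 14)`. The last row of the series has no next-day close, so it has no target and would be dropped before training.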