\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote. if that is unneeded, please comment it out.
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}
\title{AI Stock Prediction \& Analysis System}
\author{\IEEEauthorblockN{Muhammad Umer Farooq}
\IEEEauthorblockA{\textit{Faculty of Computer Science and Engineering} \\
\textit{Ghulam Ishaq Khan Institute of Engineering Sciences and Technology}\\
Topi, Pakistan \\
u2023540@giki.edu.pk}
}
\maketitle
\begin{abstract}
The AI Stock Prediction \& Analysis System, also known as \textit{Stockker}, is an end-to-end machine learning solution for predicting stock market prices in real time. By combining ensemble machine learning models (Linear Regression, Random Forest, SVR) with unsupervised learning techniques (PCA, clustering), the system assists users in analyzing and investing in stocks. It comprises a lightweight Streamlit frontend, a Prefect-orchestrated backend built on FastAPI, and a robust CI/CD pipeline deployed on Hugging Face Spaces.
\end{abstract}
\begin{IEEEkeywords}
Stock Prediction, Machine Learning, Ensemble Learning, Real-time Analysis, MLOps, Streamlit, Prefect, DevOps
\end{IEEEkeywords}
\section{Introduction}
Stock market prediction is inherently challenging due to the stochastic nature of financial data. Traditional methods often fail to capture complex non-linear patterns or adapt to changing market conditions. This project aims to address these challenges by building a robust, automated pipeline that integrates real-time data ingestion, advanced feature engineering, and ensemble modeling to improve prediction accuracy and market understanding.
The primary objectives of this system are to:
\begin{itemize}
\item Develop an automated pipeline for fetching and processing daily stock data.
\item Implement ensemble learning models for robust price prediction.
\item Apply unsupervised learning to identify market volatility regimes.
\item Deploy a user-friendly interactive dashboard.
\item Ensure system reliability through automated testing and CI/CD pipelines.
\end{itemize}
\section{System Architecture}
The system follows a modular microservices-like architecture, ensuring scalability and maintainability.
\subsection{Core Components}
\subsubsection{Frontend (User Interface)}
Built with \textbf{Streamlit}, the frontend provides an interactive dashboard for users to select stocks, view real-time metrics, and visualize predictions. It serves as the primary consumption layer for the model's outputs.
\subsubsection{Backend \& Orchestration}
\begin{itemize}
\item \textbf{Prefect:} Orchestrates the entire ML workflow, from data ingestion to model inference, ensuring reproducible and scheduled runs.
\item \textbf{FastAPI:} Serves as the backend framework for handling API requests and serving the model predictions.
\end{itemize}
\subsubsection{Data Layer}
\begin{itemize}
\item \textbf{Alpha Vantage API:} The primary source for real-time and historical stock market data (Daily Time Series).
\item \textbf{Local Storage/Database:} Stores raw CSVs and processed datasets for training and inference, managing the data lifecycle.
\end{itemize}
\subsubsection{Notification Service}
A custom Discord notification module maintains system observability, alerting administrators of pipeline status or errors. It features a custom DNS bypass to ensure connectivity in restricted network environments such as Hugging Face Spaces.
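A minimal sketch of such a notifier is shown below, using only the standard library. The function names are illustrative, and the project's custom DNS-bypass logic is deliberately omitted; only the documented Discord webhook contract (a JSON body with a \texttt{content} field, capped at 2000 characters) is assumed.

```python
import json
import urllib.request

DISCORD_MESSAGE_LIMIT = 2000  # Discord rejects messages longer than 2000 chars

def build_payload(message: str) -> dict:
    """Discord webhooks accept a JSON body with a `content` field."""
    return {"content": message[:DISCORD_MESSAGE_LIMIT]}

def notify(webhook_url: str, message: str) -> None:
    """POST a status message to the webhook (DNS-bypass logic omitted here)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_payload(message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
```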
\subsection{Infrastructure \& DevOps}
\begin{itemize}
\item \textbf{Docker:} The entire application is containerized to ensure consistent environments across development and production.
\item \textbf{CI/CD Pipeline:} Hosted on \textbf{GitHub Actions}, the pipeline automatically runs unit tests (pytest), linting (ruff), and performs continuous deployment.
\item \textbf{Deployment:} The application is deployed on \textbf{Hugging Face Spaces}, providing a publicly accessible and scalable interface.
\end{itemize}
\section{Methodology}
\subsection{Data Ingestion}
The system utilizes the \texttt{Alpha Vantage} API to fetch daily historical data. The ingestion module (\texttt{src/ingestion/ingest.py}) retrieves \texttt{TIME\_SERIES\_DAILY} data in CSV format, capturing open, high, low, close, and volume metrics. It supports both compact mode (last 100 data points) and full historical fetch.
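The ingestion step can be sketched as follows using only the standard library. The helper names are illustrative (not taken from \texttt{src/ingestion/ingest.py}); the query parameters (\texttt{function}, \texttt{symbol}, \texttt{outputsize}, \texttt{datatype}, \texttt{apikey}) follow the public Alpha Vantage documentation.

```python
import csv
import io
import urllib.parse
import urllib.request

API_URL = "https://www.alphavantage.co/query"

def build_query(symbol: str, api_key: str, full: bool = False) -> str:
    """Assemble a TIME_SERIES_DAILY request URL with CSV output."""
    params = {
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        "outputsize": "full" if full else "compact",  # compact = last 100 points
        "datatype": "csv",
        "apikey": api_key,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def parse_daily_csv(text: str) -> list[dict]:
    """Parse the returned CSV (timestamp, open, high, low, close, volume)."""
    return list(csv.DictReader(io.StringIO(text)))

def fetch_daily(symbol: str, api_key: str, full: bool = False) -> list[dict]:
    with urllib.request.urlopen(build_query(symbol, api_key, full)) as resp:
        return parse_daily_csv(resp.read().decode("utf-8"))
```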
\subsection{Feature Engineering}
Raw data is transformed into meaningful features to capture market momentum and trends (\texttt{src/processing/features.py}). Key indicators include:
\begin{itemize}
\item \textbf{Simple Moving Average (SMA):} Calculated for 20-day and 50-day windows to identify trend direction.
\item \textbf{Relative Strength Index (RSI):} A 14-day momentum oscillator to detect overbought or oversold conditions.
\item \textbf{MACD:} Captures changes in trend strength, direction, momentum, and duration.
\item \textbf{Lagged Features:} Implicitly used in time-series modeling to predict future values based on past performance.
\end{itemize}
Target variables include \texttt{target\_direction} (Binary classification: Up/Down) and \texttt{target\_price} (Regression: Next day's closing price).
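The indicators and targets above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual \texttt{features.py}: the RSI here uses a simple 14-day rolling mean of gains and losses, and the MACD uses the conventional 12/26-day EMA difference.

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append SMA, RSI, MACD indicators and the two prediction targets."""
    out = df.copy()
    out["sma_20"] = out["close"].rolling(20).mean()
    out["sma_50"] = out["close"].rolling(50).mean()

    # RSI: ratio of average gain to average loss over a 14-day window
    delta = out["close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["rsi"] = 100 - 100 / (1 + gain / loss)

    # MACD: fast EMA minus slow EMA (standard 12/26 spans)
    ema12 = out["close"].ewm(span=12, adjust=False).mean()
    ema26 = out["close"].ewm(span=26, adjust=False).mean()
    out["macd"] = ema12 - ema26

    # Targets: next day's close (regression) and its direction (classification)
    out["target_price"] = out["close"].shift(-1)
    out["target_direction"] = (out["target_price"] > out["close"]).astype(int)
    return out
```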
\subsection{Machine Learning Models}
The system employs an \textbf{Ensemble Learning} strategy to improve generalization and reduce overfitting.
\subsubsection{Regression Models}
Used to predict the exact future price (\texttt{target\_price}). The system utilizes a \textbf{Voting Regressor} that combines the predictions of three distinct base learners:
\begin{itemize}
\item \textbf{Linear Regression:} Captures base linear relationships in the data.
\item \textbf{Random Forest Regressor:} Handles non-linearities and feature interactions effectively (100 estimators).
\item \textbf{Support Vector Regressor (SVR):} Effective in high-dimensional spaces using the RBF kernel.
\end{itemize}
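In scikit-learn terms, the ensemble above can be sketched as follows. The standardization step in front of the SVR is our assumption (the report does not state how features are scaled), added because SVR is scale-sensitive.

```python
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Three base learners whose predictions are averaged by the voting ensemble.
price_model = VotingRegressor(estimators=[
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=42)),
    # Scaling pipeline is an assumption, not stated in the report.
    ("svr", make_pipeline(StandardScaler(), SVR(kernel="rbf"))),
])
```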
\subsubsection{Classification Models}
Used to predict the directional movement of the stock price (\texttt{target\_direction}). A \textbf{Voting Classifier} (Soft Voting) aggregates the probabilities from:
\begin{itemize}
\item \textbf{Random Forest Classifier:} A robust ensemble method (100 estimators).
\item \textbf{Support Vector Classifier (SVC):} Configured with probability estimates to contribute to the soft voting mechanism.
\end{itemize}
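A minimal sketch of this soft-voting classifier in scikit-learn (hyperparameters other than those stated in the report are defaults):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

# Soft voting averages the predicted class probabilities of both members,
# so the SVC must be fitted with probability estimates enabled.
direction_model = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(kernel="rbf", probability=True, random_state=42)),
    ],
    voting="soft",
)
```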
\subsubsection{Unsupervised Learning}
\begin{itemize}
\item \textbf{Clustering (K-Means):} Applied to identifying market regimes based on \textit{volatility} (rolling standard deviation) and \textit{RSI}. This groups the market into 3 distinct clusters (e.g., Low, Medium, High Volatility).
\item \textbf{PCA (Principal Component Analysis):} Reduces the dimensionality of the feature set (\texttt{sma\_20}, \texttt{sma\_50}, \texttt{rsi}, \texttt{macd}) into 2 principal components for visualization and analysis.
\end{itemize}
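The two unsupervised steps can be sketched as below. The synthetic arrays stand in for the real engineered features and are purely illustrative; only the stated structure (K-Means with 3 clusters on volatility and RSI, PCA down to 2 components on the 4-indicator set) is taken from the report.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the real engineered features.
volatility = rng.gamma(2.0, 1.0, size=300)   # rolling std of returns
rsi = rng.uniform(10, 90, size=300)          # 14-day RSI values
regime_features = StandardScaler().fit_transform(
    np.column_stack([volatility, rsi])
)

# K-Means: three market regimes (e.g. low / medium / high volatility)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
regimes = kmeans.fit_predict(regime_features)

# PCA: project the 4 indicators (sma_20, sma_50, rsi, macd) onto 2 components
indicators = rng.normal(size=(300, 4))
components = PCA(n_components=2).fit_transform(indicators)
```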
\subsection{Data Validation}
To ensure data quality and model reliability, the system integrates \textbf{DeepChecks}, performing automated checks for missing values, duplicates, and conflicting labels. It also monitors for data drift between training and production environments.
\section{Implementation \& Testing}
\subsection{Development}
The project is organized under a structured \texttt{src/} directory, enforcing separation of concerns between the \texttt{ingestion}, \texttt{processing}, \texttt{models}, and \texttt{orchestration} modules.
\subsection{Quality Assurance}
\begin{itemize}
\item \textbf{Unit Testing:} Implemented using \texttt{pytest} in the \texttt{tests/} directory.
\item \textbf{Linting:} Enforced via \texttt{ruff} for PEP 8 compliance.
\item \textbf{Automated Workflows:}
\begin{itemize}
\item \texttt{ci.yml}: Triggers on push/pull requests to run tests and linters.
\item \texttt{deploy\_to\_hf.yml}: Syncs the repository to Hugging Face Spaces upon successful merge.
\end{itemize}
\end{itemize}
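A minimal sketch of what \texttt{ci.yml} might look like (action versions and file paths are illustrative, not taken from the repository):

```yaml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: ruff check src/ tests/   # linting
      - run: pytest tests/            # unit tests
```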
\section{Results \& Conclusion}
The AI Stock Prediction \& Analysis System successfully demonstrates a complete end-to-end ML lifecycle. The Streamlit dashboard offers a seamless user experience for real-time analysis, while the backend orchestration ensures data freshness and model reliability.
Key achievements include a fully automated data pipeline, robust ensemble model implementation (both regression and classification), and resilient deployment on Hugging Face Spaces. The integration of Discord notifications provides critical observability. This project serves as a comprehensive template for scalable financial machine learning applications.
\end{document}