OliverPerrin committed
Commit aca8362 · 1 Parent(s): a6468cc

Added BERT baseline comparison training script and ignored unnecessary files

.gitignore CHANGED
@@ -51,6 +51,11 @@ results/
  .ipynb_checkpoints/
  *.ipynb

+ # Docs/Paper
+ docs/paper.*
+ docs/paper_old.*
+ docs/research_paper.*
+
  # OS - Windows specific
  .DS_Store
  Thumbs.db
@@ -65,6 +70,10 @@ ehthumbs_vista.db
  configs/local/*.png
  *.pt

+ # Environment variables
+ .env
+ .env.*
+
  # Backup/private files
- scripts/demo_gradio_old.py
  mlruns.db
+ kaggle.json
docs/paper.tex DELETED
@@ -1,1214 +0,0 @@
- % LexiMind: A Hybrid Transformer Architecture for Multi-Task NLP
- % IEEE Conference Style Paper
- % Author: Oliver Perrin
-
- \documentclass[conference]{IEEEtran}
- \IEEEoverridecommandlockouts
-
- % Essential packages
- \usepackage{cite}
- \usepackage{amsmath,amssymb,amsfonts}
- \usepackage{algorithmic}
- \usepackage{graphicx}
- \usepackage{textcomp}
- \usepackage{xcolor}
- \usepackage{hyperref}
- \usepackage{listings}
- \usepackage{booktabs}
- \usepackage{multirow}
- \usepackage{array}
- \usepackage{caption}
-
- % TikZ for diagrams
- \usepackage{tikz}
- \usetikzlibrary{shapes.geometric, arrows, positioning, fit, calc, backgrounds, decorations.pathreplacing}
-
- % Code listings style
- \lstset{
-     basicstyle=\ttfamily\small,
-     breaklines=true,
-     frame=single,
-     language=Python,
-     keywordstyle=\color{blue},
-     commentstyle=\color{green!50!black},
-     stringstyle=\color{red!60!black},
-     showstringspaces=false
- }
-
- % Hyperref setup
- \hypersetup{
-     colorlinks=true,
-     linkcolor=blue,
-     citecolor=blue,
-     urlcolor=blue
- }
-
- \def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
- T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
-
- \begin{document}
-
- \title{LexiMind: A Hybrid Transformer Architecture\\for Multi-Task Natural Language Processing}
-
- \author{\IEEEauthorblockN{Oliver Perrin}\\
- \IEEEauthorblockA{Department of Computer Science\\
- Appalachian State University\\
- Bachelor of Science in Computer Science\\
- Email: perrinot@appstate.edu}}
-
- \maketitle
-
- \begin{abstract}
- This paper presents LexiMind, a multi-task Natural Language Processing (NLP) system that combines a custom-built Transformer architecture with pre-trained weights from Google's FLAN-T5 (Fine-tuned Language Net Text-to-Text Transfer Transformer). The system performs three fundamental NLP tasks simultaneously: abstractive text summarization, multi-label emotion classification, and single-label topic classification. Unlike news-focused models, LexiMind specializes in literary and academic content. For summarization, we train on 49,086 samples combining Goodreads book descriptions (back-cover style blurbs) with arXiv academic paper abstracts. Emotion classification uses 43,410 samples from GoEmotions \cite{demszky2020goemotions}, a dataset of 28 fine-grained emotion labels derived from Reddit comments. Topic classification spans 3,402 samples from 20 Newsgroups, Project Gutenberg literary texts, and scientific papers across 7 categories (Arts, Business, Fiction, History, Philosophy, Science, Technology). By implementing modern architectural innovations including Pre-Layer Normalization (Pre-LN) with Root Mean Square Layer Normalization (RMSNorm), T5-style relative position bias, FlashAttention via PyTorch 2.0's Scaled Dot-Product Attention (SDPA), gradient checkpointing, and torch.compile optimization, LexiMind achieves efficient training on consumer GPUs while maintaining strong performance. Our final model achieves a BERTScore F1 of 0.83 and ROUGE-1 of 0.31 for summarization, 85.2\% accuracy for topic classification, and F1 of 0.20 for 28-class multi-label emotion detection. The 272M-parameter architecture is constructed from first principles in a bottom-up fashion, with each component (attention mechanisms, feed-forward networks, encoder/decoder blocks) implemented as standalone modules. A factory pattern enables seamless weight transfer from FLAN-T5-base, allowing the system to leverage Google's pre-trained knowledge while maintaining full architectural transparency and customization capability.
- \end{abstract}
-
- \begin{IEEEkeywords}
- Transformer, Multi-Task Learning, Natural Language Processing, FLAN-T5, Transfer Learning, Text Summarization, Emotion Classification, Academic Papers, Literary Text
- \end{IEEEkeywords}
-
- %=============================================================================
- \section{Introduction}
- %=============================================================================
-
- The Transformer architecture \cite{vaswani2017attention} has fundamentally reshaped Natural Language Processing (NLP), establishing itself as the foundation for state-of-the-art models across virtually all language understanding and generation tasks. Building upon this foundation, the T5 (Text-to-Text Transfer Transformer) model \cite{raffel2020exploring} introduced a unified framework that casts all NLP problems as text-to-text transformations. FLAN-T5 (Fine-tuned Language Net) \cite{chung2022scaling} further enhanced T5's capabilities through instruction fine-tuning on over 1,000 diverse tasks.
-
- While pre-trained models like FLAN-T5 offer impressive zero-shot and few-shot capabilities, they are often treated as black boxes---their internal mechanisms obscured by framework abstractions. This opacity hinders both understanding and customization. Furthermore, multi-task learning scenarios often require architectural modifications that pre-built models do not easily accommodate.
-
- LexiMind addresses these challenges through a hybrid approach: implementing a complete Transformer architecture from scratch while maintaining compatibility with FLAN-T5's pre-trained weights. This provides several key advantages:
-
- \begin{enumerate}
-     \item \textbf{Architectural Transparency}: Every component---from attention mechanisms to normalization layers---is explicitly implemented and documented.
-     \item \textbf{Customization Flexibility}: Task-specific heads and routing logic can be freely modified without framework constraints.
-     \item \textbf{Transfer Learning}: FLAN-T5's linguistic knowledge is transferred through careful weight mapping in the factory module.
-     \item \textbf{Modern Optimizations}: Integration of FlashAttention, bfloat16 training, and gradient accumulation ensures efficient resource utilization.
- \end{enumerate}
-
- A key design decision in LexiMind is the focus on literary and academic domains rather than news articles, which are overrepresented in existing summarization benchmarks. For text summarization, we combine Goodreads book descriptions---which provide back-cover style blurbs describing \textit{what a book is about}---with arXiv paper abstracts. This trains the model to generate descriptive summaries rather than extractive plot recaps. Emotion classification leverages GoEmotions \cite{demszky2020goemotions}, providing fine-grained 28-label annotations. Topic classification draws from diverse sources including 20 Newsgroups, Project Gutenberg, and scientific papers.
-
- The contributions of this work include:
- \begin{itemize}
-     \item A custom Transformer implementation compatible with T5/FLAN-T5 weight loading
-     \item A multi-task architecture supporting both generative (summarization) and discriminative (classification) tasks
-     \item A curated dataset of 95,898 training samples across literary, academic, and conversational domains
-     \item Detailed documentation of weight transfer mechanisms between pre-trained models and custom implementations
-     \item Comprehensive training infrastructure with mixed-precision support, gradient monitoring, and MLflow experiment tracking
- \end{itemize}
-
- %=============================================================================
- \section{Related Work}
- %=============================================================================
-
- \subsection{Transformer Architectures}
-
- The original Transformer \cite{vaswani2017attention} introduced the self-attention mechanism, enabling parallel processing of sequences and effective capture of long-range dependencies. The architecture consists of stacked encoder and decoder blocks, each containing Multi-Head Attention (MHA) and position-wise Feed-Forward Networks (FFN).
-
- \textbf{Layer Normalization Placement}: The original Transformer applied Layer Normalization \cite{ba2016layer} after residual connections (Post-LN). Subsequent research \cite{xiong2020layer} demonstrated that applying normalization before sublayers (Pre-LN) significantly improves training stability, particularly for deep networks. LexiMind adopts the Pre-LN configuration used by T5 and modern large language models.
-
- \textbf{RMSNorm}: Zhang and Sennrich \cite{zhang2019root} proposed Root Mean Square Layer Normalization (RMSNorm), which removes the mean-centering operation of standard LayerNorm while maintaining comparable performance. T5 \cite{raffel2020exploring} adopts this approach, and LexiMind follows suit for compatibility.
-
- \subsection{Pre-trained Language Models}
-
- \textbf{T5}: Raffel et al. \cite{raffel2020exploring} introduced the T5 model, which frames all NLP tasks as text-to-text problems. T5 uses a Transformer encoder-decoder architecture with several distinctive features: relative position bias instead of absolute positional embeddings, RMSNorm for layer normalization, and a gated feed-forward network.
-
- \textbf{FLAN-T5}: Chung et al. \cite{chung2022scaling} enhanced T5 through instruction fine-tuning, creating FLAN-T5. By training on diverse task instructions, FLAN-T5 demonstrates improved zero-shot and few-shot capabilities compared to the original T5.
-
- \subsection{Multi-Task Learning}
-
- Multi-Task Learning (MTL) \cite{caruana1997multitask} trains a single model on multiple related tasks, promoting parameter sharing and implicit data augmentation. Hard parameter sharing---where lower layers are shared across tasks while task-specific heads branch from shared representations---remains the dominant approach for Transformer-based MTL systems.
-
- %=============================================================================
- \section{Architecture}
- %=============================================================================
-
- LexiMind implements a complete encoder-decoder Transformer with task-specific heads, constructed using a bottom-up approach where each component is implemented as a standalone module. Figure \ref{fig:architecture} illustrates the high-level system architecture.
-
- \begin{figure}[htbp]
- \centering
- \begin{tikzpicture}[
- scale=0.75,
- transform shape,
- box/.style={draw, rectangle, minimum width=2cm, minimum height=0.7cm, align=center, rounded corners=2pt},
- smallbox/.style={draw, rectangle, minimum width=1.4cm, minimum height=0.5cm, align=center, rounded corners=2pt, font=\scriptsize},
- head/.style={draw, rectangle, minimum width=1.5cm, minimum height=0.6cm, align=center, rounded corners=2pt, fill=blue!20},
- arrow/.style={->, >=stealth, thick},
- dashedarrow/.style={->, >=stealth, dashed}
- ]
-
- % Input
- \node[box, fill=gray!20] (input) at (0, 0) {Input Text};
-
- % Tokenizer
- \node[box, fill=yellow!30] (tokenizer) at (0, 1.2) {Tokenizer\\(SentencePiece)};
-
- % Encoder
- \node[box, fill=green!30, minimum height=2cm] (encoder) at (0, 3.2) {Encoder\\$N=12$ layers};
-
- % Task routing
- \node[box, fill=orange!30] (router) at (0, 5.2) {Task Router};
-
- % Decoder branch
- \node[box, fill=green!30, minimum height=1.5cm] (decoder) at (-2.5, 7) {Decoder\\$N=12$ layers};
- \node[head] (lmhead) at (-2.5, 8.8) {LM Head};
- \node[smallbox, fill=purple!20] (summ) at (-2.5, 9.8) {Summary};
-
- % Classification branch
- \node[head] (emotionhead) at (1.2, 7) {Emotion\\Head};
- \node[head] (topichead) at (2.8, 7) {Topic\\Head};
- \node[smallbox, fill=purple!20] (emotion) at (1.2, 8.2) {Emotions\\(28 classes)};
- \node[smallbox, fill=purple!20] (topic) at (2.8, 8.2) {Topics\\(7 classes)};
-
- % Arrows
- \draw[arrow] (input) -- (tokenizer);
- \draw[arrow] (tokenizer) -- (encoder);
- \draw[arrow] (encoder) -- (router);
- \draw[arrow] (router) -- (decoder);
- \draw[arrow] (router) -- (emotionhead);
- \draw[arrow] (router) -- (topichead);
- \draw[arrow] (decoder) -- (lmhead);
- \draw[arrow] (lmhead) -- (summ);
- \draw[arrow] (emotionhead) -- (emotion);
- \draw[arrow] (topichead) -- (topic);
-
- % Cross-attention arrow
- \draw[dashedarrow] (encoder.west) -- ++(-0.5,0) |- (decoder.south west);
-
- % Labels
- \node[font=\tiny, align=center] at (-1.8, 4.5) {Cross\\Attention};
-
- \end{tikzpicture}
- \caption{LexiMind system architecture showing the shared encoder, task-specific routing, decoder for generation, and classification heads for discriminative tasks.}
- \label{fig:architecture}
- \end{figure}
-
- \subsection{Transformer Block Structure}
-
- Figure \ref{fig:transformer_block} presents the internal structure of encoder and decoder blocks, following the Pre-LN configuration from T5 \cite{raffel2020exploring}.
-
- \begin{figure}[htbp]
- \centering
- \begin{tikzpicture}[
- scale=0.65,
- transform shape,
- block/.style={draw, rectangle, minimum width=2.5cm, minimum height=0.6cm, align=center, rounded corners=2pt},
- norm/.style={draw, rectangle, minimum width=2.5cm, minimum height=0.5cm, align=center, fill=yellow!30, rounded corners=2pt},
- attn/.style={draw, rectangle, minimum width=2.5cm, minimum height=0.6cm, align=center, fill=blue!25, rounded corners=2pt},
- ffn/.style={draw, rectangle, minimum width=2.5cm, minimum height=0.6cm, align=center, fill=green!25, rounded corners=2pt},
- add/.style={draw, circle, minimum size=0.4cm, fill=red!20, inner sep=0pt, font=\small},
- arrow/.style={->, >=stealth},
- ]
-
- % === ENCODER BLOCK ===
- \node[font=\bfseries] at (0, 8) {Encoder Block};
-
- % Input
- \node (enc_in) at (0, 7) {};
- \draw[arrow] (0, 6.5) -- (enc_in);
-
- % RMSNorm 1
- \node[norm] (enc_norm1) at (0, 6) {RMSNorm};
-
- % Self-Attention
- \node[attn] (enc_attn) at (0, 5) {Multi-Head\\Self-Attention};
-
- % Add 1
- \node[add] (enc_add1) at (0, 4) {+};
-
- % RMSNorm 2
- \node[norm] (enc_norm2) at (0, 3) {RMSNorm};
-
- % FFN
- \node[ffn] (enc_ffn) at (0, 2) {Gated FFN\\(GELU)};
-
- % Add 2
- \node[add] (enc_add2) at (0, 1) {+};
-
- % Output
- \node (enc_out) at (0, 0.3) {};
-
- % Connections
- \draw[arrow] (enc_in) -- (enc_norm1);
- \draw[arrow] (enc_norm1) -- (enc_attn);
- \draw[arrow] (enc_attn) -- (enc_add1);
- \draw[arrow] (enc_add1) -- (enc_norm2);
- \draw[arrow] (enc_norm2) -- (enc_ffn);
- \draw[arrow] (enc_ffn) -- (enc_add2);
- \draw[arrow] (enc_add2) -- (enc_out);
-
- % Residual connections
- \draw[arrow] (0, 6.5) -- (-1.5, 6.5) -- (-1.5, 4) -- (enc_add1.west);
- \draw[arrow] (enc_add1.east) -- (1.5, 4) -- (1.5, 1) -- (enc_add2.east);
-
- % === DECODER BLOCK ===
- \node[font=\bfseries] at (5.5, 8) {Decoder Block};
-
- % Input
- \node (dec_in) at (5.5, 7) {};
- \draw[arrow] (5.5, 6.5) -- (dec_in);
-
- % RMSNorm 1
- \node[norm] (dec_norm1) at (5.5, 6) {RMSNorm};
-
- % Masked Self-Attention
- \node[attn] (dec_attn1) at (5.5, 5) {Masked\\Self-Attention};
-
- % Add 1
- \node[add] (dec_add1) at (5.5, 4.2) {+};
-
- % RMSNorm 2
- \node[norm] (dec_norm2) at (5.5, 3.4) {RMSNorm};
-
- % Cross-Attention
- \node[attn, fill=cyan!25] (dec_attn2) at (5.5, 2.4) {Cross-Attention};
-
- % Add 2
- \node[add] (dec_add2) at (5.5, 1.5) {+};
-
- % RMSNorm 3
- \node[norm] (dec_norm3) at (5.5, 0.7) {RMSNorm};
-
- % FFN
- \node[ffn] (dec_ffn) at (5.5, -0.3) {Gated FFN\\(GELU)};
-
- % Add 3
- \node[add] (dec_add3) at (5.5, -1.2) {+};
-
- % Connections
- \draw[arrow] (dec_in) -- (dec_norm1);
- \draw[arrow] (dec_norm1) -- (dec_attn1);
- \draw[arrow] (dec_attn1) -- (dec_add1);
- \draw[arrow] (dec_add1) -- (dec_norm2);
- \draw[arrow] (dec_norm2) -- (dec_attn2);
- \draw[arrow] (dec_attn2) -- (dec_add2);
- \draw[arrow] (dec_add2) -- (dec_norm3);
- \draw[arrow] (dec_norm3) -- (dec_ffn);
- \draw[arrow] (dec_ffn) -- (dec_add3);
-
- % Encoder memory input
- \node[block, fill=gray!20, minimum width=1.2cm, font=\scriptsize] (memory) at (8, 2.4) {Encoder\\Memory};
- \draw[arrow] (memory) -- (dec_attn2);
-
- % Residual connections (simplified)
- \draw[arrow] (5.5, 6.5) -- (4, 6.5) -- (4, 4.2) -- (dec_add1.west);
- \draw[arrow] (dec_add1.east) -- (7, 4.2) -- (7, 1.5) -- (dec_add2.east);
- \draw[arrow] (dec_add2.west) -- (4, 1.5) -- (4, -1.2) -- (dec_add3.west);
-
- \end{tikzpicture}
- \caption{Pre-LN Transformer blocks. Left: Encoder block with self-attention and FFN. Right: Decoder block with masked self-attention, cross-attention to encoder memory, and FFN. RMSNorm is applied \emph{before} each sublayer (Pre-LN).}
- \label{fig:transformer_block}
- \end{figure}
-
- \subsection{Multi-Head Attention Mechanism}
-
- The attention mechanism is the cornerstone of the Transformer architecture. LexiMind implements Multi-Head Attention with support for T5-style relative position bias and FlashAttention optimization. Figure \ref{fig:attention} illustrates the attention computation.
-
- \begin{figure}[htbp]
- \centering
- \begin{tikzpicture}[
- scale=0.6,
- transform shape,
- box/.style={draw, rectangle, minimum width=1.5cm, minimum height=0.6cm, align=center, rounded corners=2pt},
- proj/.style={draw, rectangle, minimum width=1.2cm, minimum height=0.5cm, align=center, fill=blue!20, rounded corners=2pt, font=\scriptsize},
- op/.style={draw, rectangle, minimum width=1.2cm, minimum height=0.5cm, align=center, fill=orange!30, rounded corners=2pt, font=\scriptsize},
- arrow/.style={->, >=stealth},
- ]
-
- % Input
- \node[box, fill=gray!20] (input) at (0, 0) {Input $X$};
-
- % Projections
- \node[proj] (wq) at (-2.5, 1.5) {$W_Q$};
- \node[proj] (wk) at (0, 1.5) {$W_K$};
- \node[proj] (wv) at (2.5, 1.5) {$W_V$};
-
- % Q, K, V
- \node[box, fill=green!20] (q) at (-2.5, 2.8) {$Q$};
- \node[box, fill=green!20] (k) at (0, 2.8) {$K$};
- \node[box, fill=green!20] (v) at (2.5, 2.8) {$V$};
-
- % Split heads
- \node[op] (split) at (0, 4) {Split $h$ heads};
-
- % Attention scores
- \node[op] (matmul1) at (0, 5.2) {$QK^T$};
-
- % Position bias
- \node[box, fill=yellow!30, font=\scriptsize] (bias) at (3.5, 5.2) {Relative\\Pos Bias};
-
- % Add bias
- \node[op] (add) at (0, 6.2) {$+ B_{rel}$};
-
- % Scale (optional)
- \node[op] (scale) at (0, 7.2) {Scale / Mask};
-
- % Softmax
- \node[op, fill=red!20] (softmax) at (0, 8.2) {Softmax};
-
- % MatMul with V
- \node[op] (matmul2) at (0, 9.2) {$\times V$};
-
- % Concat
- \node[op] (concat) at (0, 10.2) {Concat heads};
-
- % Output projection
- \node[proj] (wo) at (0, 11.2) {$W_O$};
-
- % Output
- \node[box, fill=purple!20] (output) at (0, 12.2) {Output};
-
- % Arrows
- \draw[arrow] (input) -- (wq);
- \draw[arrow] (input) -- (wk);
- \draw[arrow] (input) -- (wv);
- \draw[arrow] (wq) -- (q);
- \draw[arrow] (wk) -- (k);
- \draw[arrow] (wv) -- (v);
- \draw[arrow] (q) -- (split);
- \draw[arrow] (k) -- (split);
- \draw[arrow] (v.north) -- ++(0, 0.3) -| (2.5, 9.2) -- (matmul2);
- \draw[arrow] (split) -- (matmul1);
- \draw[arrow] (matmul1) -- (add);
- \draw[arrow] (bias) -- (add);
- \draw[arrow] (add) -- (scale);
- \draw[arrow] (scale) -- (softmax);
- \draw[arrow] (softmax) -- (matmul2);
- \draw[arrow] (matmul2) -- (concat);
- \draw[arrow] (concat) -- (wo);
- \draw[arrow] (wo) -- (output);
-
- % Annotations
- \node[font=\tiny, align=left] at (-4.5, 5.5) {T5 does NOT\\scale by $\sqrt{d_k}$};
-
- \end{tikzpicture}
- \caption{Multi-Head Attention with T5-style relative position bias. The attention scores are computed as $QK^T + B_{rel}$, where $B_{rel}$ is the learned relative position bias. Unlike standard Transformers, T5 does not scale by $\sqrt{d_k}$.}
- \label{fig:attention}
- \end{figure}
-
- The attention computation in LexiMind is implemented in \texttt{src/models/attention.py}. For T5 compatibility, the \texttt{scale\_scores} parameter controls whether to apply $\sqrt{d_k}$ scaling---T5 does not use this scaling \cite{raffel2020exploring}.
-
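The effect of the \texttt{scale\_scores} switch can be sketched in a few lines of PyTorch. This is an illustrative reimplementation, not the code from \texttt{attention.py}; the function name and signature are assumptions.

```python
import torch

def attention_weights(q, k, rel_bias=None, scale_scores=True):
    # q, k: (batch, heads, seq_len, d_k). scale_scores=False reproduces
    # T5's convention of leaving QK^T unscaled.
    scores = q @ k.transpose(-2, -1)
    if scale_scores:
        scores = scores / (q.size(-1) ** 0.5)
    if rel_bias is not None:
        scores = scores + rel_bias  # additive relative position bias B_rel
    return torch.softmax(scores, dim=-1)

attn = attention_weights(torch.randn(2, 12, 5, 64), torch.randn(2, 12, 5, 64),
                         rel_bias=torch.zeros(1, 12, 5, 5), scale_scores=False)
```

With scaling disabled, the learned position bias carries more of the burden of controlling score magnitudes, which is how T5 operates in practice.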
- Figure \ref{fig:attention_viz} shows learned attention patterns from the trained model, demonstrating how different heads specialize in capturing various linguistic relationships.
-
- \begin{figure}[htbp]
- \centering
- \includegraphics[width=\columnwidth]{figures/multihead_attention_visualization.png}
- \caption{Attention weight visualization across multiple heads. Each head learns distinct attention patterns: some focus on local context (diagonal patterns), while others capture long-range dependencies and syntactic relationships.}
- \label{fig:attention_viz}
- \end{figure}
-
- \subsubsection{T5 Relative Position Bias}
-
- Unlike absolute positional embeddings that are added to token embeddings, T5 uses relative position bias added directly to attention scores. The \texttt{T5RelativePositionBias} class implements logarithmically-bucketed relative positions:
-
- \begin{equation}
- B_{ij} = \text{Embed}[\text{bucket}(i - j)]
- \end{equation}
-
- where $\text{bucket}(\cdot)$ maps relative distances to discrete buckets. Half the buckets encode exact positions for nearby tokens; the remaining half use logarithmic spacing for distant tokens. As documented in the code:
-
- \begin{quote}
- \emph{``T5 uses a combination of exact positions (for nearby tokens) and logarithmically-spaced buckets (for distant tokens).''} --- \texttt{attention.py}, lines 46--48
- \end{quote}
-
- Figure \ref{fig:position_bias} visualizes the learned relative position bias, showing how the model encodes positional relationships between tokens.
-
- \begin{figure}[htbp]
- \centering
- \includegraphics[width=\columnwidth]{figures/positional_encoding_heatmap.png}
- \caption{Heatmap of relative position bias values. The diagonal structure indicates stronger attention between nearby positions, while the logarithmic bucketing allows efficient representation of longer-range dependencies.}
- \label{fig:position_bias}
- \end{figure}
-
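The bucketing can be illustrated with a small PyTorch function. This sketch mirrors the scheme described above (half the buckets per sign of the offset; within a half, exact buckets for nearby offsets and log-spaced buckets for distant ones), but the exact constants and sign conventions in LexiMind's \texttt{T5RelativePositionBias} may differ.

```python
import math
import torch

def relative_position_bucket(rel_pos: torch.Tensor,
                             num_buckets: int = 32,
                             max_distance: int = 128) -> torch.Tensor:
    # Bidirectional bucketing: one half of the buckets for each sign of
    # the offset; within a half, nearby offsets get exact buckets and
    # distant offsets share logarithmically spaced ones.
    half = num_buckets // 2
    max_exact = half // 2
    ret = (rel_pos > 0).long() * half
    n = rel_pos.abs()
    log_big = max_exact + (
        torch.log(n.float().clamp(min=1) / max_exact)
        / math.log(max_distance / max_exact)
        * (half - max_exact)
    ).long()
    log_big = torch.minimum(log_big, torch.full_like(log_big, half - 1))
    return ret + torch.where(n < max_exact, n, log_big)

buckets = relative_position_bucket(torch.tensor([0, 1, -1, 200]))
```

Offsets 0, 1, and -1 land in distinct exact buckets, while any offset beyond \texttt{max\_distance} saturates into the last log-spaced bucket of its half.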
- \subsubsection{FlashAttention Integration}
-
- LexiMind leverages PyTorch 2.0's \texttt{scaled\_dot\_product\_attention} function, which automatically selects the optimal attention kernel:
-
- \begin{quote}
- \emph{``Uses F.scaled\_dot\_product\_attention which automatically selects the best available kernel (FlashAttention v2, Memory-Efficient Attention, or math fallback) based on hardware and input shapes.''} --- \texttt{attention.py}, lines 134--137
- \end{quote}
-
- This provides $O(N)$ memory complexity instead of $O(N^2)$ when FlashAttention is available.
-
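A minimal usage sketch (illustrative, not the repository code): since \texttt{F.scaled\_dot\_product\_attention} always applies its own $1/\sqrt{d_k}$ factor, pre-scaling the queries by $\sqrt{d_k}$ cancels it and reproduces T5's unscaled scores, while the additive relative position bias travels in \texttt{attn\_mask}.

```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 12, 128, 64)  # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 12, 128, 64)
v = torch.randn(2, 12, 128, 64)
bias = torch.zeros(2, 12, 128, 128)  # stand-in for the learned B_rel

# SDPA dispatches to FlashAttention / memory-efficient / math kernels on
# its own; pre-scaling q cancels the built-in 1/sqrt(d_k) factor.
out = F.scaled_dot_product_attention(q * (q.size(-1) ** 0.5), k, v,
                                     attn_mask=bias)
```

On PyTorch 2.1 or newer, the same effect is available directly via the `scale=1.0` keyword argument instead of pre-scaling.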
- \subsection{Feed-Forward Network}
-
- Following T5, LexiMind implements a gated feed-forward network with GELU activation:
-
- \begin{equation}
- \text{FFN}(x) = (\text{GELU}(xW_g) \odot xW_1) W_2
- \end{equation}
-
- where $W_g$ is the gating projection, $W_1$ is the up-projection, $W_2$ is the down-projection, and $\odot$ denotes element-wise multiplication.
-
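The equation maps directly onto three bias-free linear layers. The module and attribute names below follow the \texttt{linear\_gate}/\texttt{linear1}/\texttt{linear2} naming used in the weight-mapping table, but the code itself is an illustrative sketch rather than the repository implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    # Gated-GELU FFN: FFN(x) = (GELU(x W_g) * (x W_1)) W_2.
    # T5-style, so no bias terms on any projection.
    def __init__(self, d_model: int, ffn_dim: int) -> None:
        super().__init__()
        self.linear_gate = nn.Linear(d_model, ffn_dim, bias=False)  # W_g
        self.linear1 = nn.Linear(d_model, ffn_dim, bias=False)      # W_1
        self.linear2 = nn.Linear(ffn_dim, d_model, bias=False)      # W_2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear2(F.gelu(self.linear_gate(x)) * self.linear1(x))

y = GatedFFN(d_model=768, ffn_dim=2048)(torch.randn(2, 16, 768))
```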
- \subsection{RMSNorm}
-
- RMSNorm \cite{zhang2019root} normalizes inputs using only the root mean square:
-
- \begin{equation}
- \text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d}x_i^2 + \epsilon}} \cdot \gamma
- \end{equation}
-
- The implementation in \texttt{src/models/t5\_layer\_norm.py} follows T5's convention, using only a learned scale parameter $\gamma$ with no bias term.
-
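A minimal sketch of the same computation (illustrative; the repository's version lives in \texttt{src/models/t5\_layer\_norm.py}, and the \texttt{eps} default here is an assumption):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Scale-only normalization: x / RMS(x) * gamma. No mean-centering
    # and no bias term, matching the equation above.
    def __init__(self, d_model: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.gamma

y = RMSNorm(768)(torch.randn(2, 16, 768))
```

With `gamma` at its initial value of ones, the output's mean square along the last dimension is approximately 1.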
- %=============================================================================
- \section{Tokenization}
- \label{sec:tokenization}
- %=============================================================================
-
- LexiMind wraps HuggingFace's AutoTokenizer with a simplified façade that handles T5-specific conventions. The implementation in \texttt{src/data/tokenization.py} manages special token handling and decoder input preparation.
-
- \subsection{T5 Tokenizer Characteristics}
-
- T5 uses SentencePiece \cite{kudo2018sentencepiece} with unigram tokenization:
-
- \begin{itemize}
- \item \textbf{Vocabulary Size}: 32,128 tokens (padded to multiple of 128 for efficiency)
- \item \textbf{Special Tokens}: \texttt{pad\_token\_id=0}, \texttt{eos\_token\_id=1}
- \item \textbf{No Explicit BOS}: T5 uses the pad token as the decoder start token
- \end{itemize}
-
- As noted in the tokenizer implementation:
-
- \begin{quote}
- \emph{``T5 uses different special tokens than BART: T5: pad=0, eos=1, no explicit bos (uses pad or eos as decoder start); BART: bos=0, pad=1, eos=2.''} --- \texttt{tokenization.py}, lines 42--44
- \end{quote}
-
- \subsection{Decoder Input Preparation}
-
- For seq2seq training, decoder inputs must be shifted right from labels. The \texttt{prepare\_decoder\_inputs} method handles this:
-
- \begin{lstlisting}[caption={Decoder input preparation from tokenization.py}]
- def prepare_decoder_inputs(
-     self, labels: torch.Tensor
- ) -> torch.Tensor:
-     """Shift decoder labels to create
-     input ids prefixed by BOS."""
-     bos = self.bos_token_id
-     pad = self.pad_token_id
-     decoder_inputs = torch.full_like(labels, pad)
-     decoder_inputs[:, 0] = bos
-     decoder_inputs[:, 1:] = labels[:, :-1]
-     return decoder_inputs
- \end{lstlisting}
-
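A standalone version of the shift-right logic makes the behavior easy to check. This is a hypothetical free function (LexiMind implements it as the method shown above); with T5's conventions, the start token and pad token are both id 0.

```python
import torch

def shift_right(labels: torch.Tensor, start_id: int = 0,
                pad_id: int = 0) -> torch.Tensor:
    # Drop the last label position and prepend the decoder start token
    # (T5 reuses pad_token_id=0 as the decoder start).
    out = torch.full_like(labels, pad_id)
    out[:, 0] = start_id
    out[:, 1:] = labels[:, :-1]
    return out

labels = torch.tensor([[42, 43, 44, 1]])  # trailing 1 is T5's EOS
print(shift_right(labels))                # tensor([[ 0, 42, 43, 44]])
```

Note that the EOS token falls off the end of the decoder inputs, as it should: the model is asked to *predict* EOS, never to consume it.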
- %=============================================================================
- \section{The Factory Module: Weight Transfer from FLAN-T5}
- \label{sec:factory}
- %=============================================================================
-
- The \texttt{factory.py} module is central to LexiMind's hybrid approach, providing model construction and weight loading utilities. Figure \ref{fig:factory_flow} illustrates the model construction pipeline.
-
- \begin{figure}[htbp]
- \centering
- \begin{tikzpicture}[
- scale=0.7,
- transform shape,
- box/.style={draw, rectangle, minimum width=2.5cm, minimum height=0.7cm, align=center, rounded corners=3pt},
- config/.style={draw, rectangle, minimum width=2cm, minimum height=0.6cm, align=center, fill=yellow!30, rounded corners=2pt, font=\small},
- model/.style={draw, rectangle, minimum width=2.2cm, minimum height=0.6cm, align=center, fill=green!30, rounded corners=2pt, font=\small},
- arrow/.style={->, >=stealth, thick},
- ]
-
- % Config loading
- \node[config] (yaml) at (0, 0) {config.yaml};
- \node[box, fill=blue!20] (loadconfig) at (0, 1.3) {load\_model\_config()};
- \node[config] (modelconfig) at (0, 2.6) {ModelConfig};
-
- % Model building
- \node[box, fill=blue!20] (build) at (0, 4.2) {build\_multitask\_model()};
-
- % Components
- \node[model] (encoder) at (-2.5, 5.8) {Encoder};
- \node[model] (decoder) at (0, 5.8) {Decoder};
- \node[model] (heads) at (2.5, 5.8) {Task Heads};
-
- % Weight loading
- \node[box, fill=orange!30] (loadweights) at (-1.2, 7.4) {\_load\_pretrained\_weights()};
-
- % FLAN-T5
- \node[box, fill=purple!20] (flant5) at (-4, 7.4) {FLAN-T5\\(HuggingFace)};
-
- % Final model
- \node[box, fill=red!20, minimum width=3cm] (mtmodel) at (0, 9) {MultiTaskModel};
-
- % Arrows
- \draw[arrow] (yaml) -- (loadconfig);
- \draw[arrow] (loadconfig) -- (modelconfig);
- \draw[arrow] (modelconfig) -- (build);
- \draw[arrow] (build) -- (encoder);
- \draw[arrow] (build) -- (decoder);
- \draw[arrow] (build) -- (heads);
- \draw[arrow] (encoder) -- (loadweights);
- \draw[arrow] (decoder) -- (loadweights);
- \draw[arrow] (flant5) -- (loadweights);
- \draw[arrow] (loadweights) -- (mtmodel);
- \draw[arrow] (heads) -- (mtmodel);
-
- \end{tikzpicture}
- \caption{Model construction pipeline in \texttt{factory.py}. Configuration is loaded from YAML, components are instantiated, FLAN-T5 weights are transferred, and the final MultiTaskModel is assembled.}
- \label{fig:factory_flow}
- \end{figure}
-
- \subsection{Configuration Management}
-
- The \texttt{ModelConfig} dataclass defines all architecture hyperparameters:
-
- \begin{lstlisting}[caption={ModelConfig from factory.py}]
- @dataclass
- class ModelConfig:
-     d_model: int = 768
-     vocab_size: Optional[int] = None
-     num_encoder_layers: int = 12
-     num_decoder_layers: int = 12
-     num_attention_heads: int = 12
-     ffn_dim: int = 2048
-     dropout: float = 0.1
-     use_pretrained: bool = False
-     pretrained_model_name: str = "google/flan-t5-base"
-     activation: str = "gated-gelu"
-     use_relative_position_bias: bool = False
- \end{lstlisting}
-
- \subsection{Weight Transfer Mechanism}
-
- The \texttt{\_load\_pretrained\_weights} function performs careful weight mapping between FLAN-T5 and LexiMind's custom architecture. Key considerations documented in the code:
-
- \begin{quote}
- \emph{``T5 architecture compatibility with our custom Transformer: T5 uses Pre-LN (RMSNorm before sublayers) --- matches our design; T5 uses relative position bias instead of absolute embeddings; T5 uses gated FFN (wi\_0, wi\_1, wo); T5 attention has no bias, our attention has bias --- we zero-initialize the bias terms.''} --- \texttt{factory.py}, lines 100--108
- \end{quote}
-
- Table \ref{tab:weight_mapping} shows the parameter correspondence:
-
580
- \begin{table}[htbp]
581
- \centering
582
- \caption{FLAN-T5 to LexiMind Weight Mapping}
583
- \label{tab:weight_mapping}
584
- \begin{tabular}{ll}
585
- \toprule
586
- \textbf{FLAN-T5 Parameter} & \textbf{LexiMind Parameter} \\
587
- \midrule
588
- \texttt{shared} & \texttt{encoder.embedding} \\
589
- \texttt{encoder.block.*.SelfAttention.q} & \texttt{encoder.layers.*.self\_attn.W\_Q} \\
590
- \texttt{encoder.block.*.SelfAttention.k} & \texttt{encoder.layers.*.self\_attn.W\_K} \\
591
- \texttt{encoder.block.*.SelfAttention.v} & \texttt{encoder.layers.*.self\_attn.W\_V} \\
592
- \texttt{encoder.block.*.SelfAttention.o} & \texttt{encoder.layers.*.self\_attn.W\_O} \\
593
- \texttt{*.layer\_norm} & \texttt{*.norm*} \\
594
- \texttt{*.DenseReluDense.wi\_0} & \texttt{*.ffn.linear\_gate} \\
595
- \texttt{*.DenseReluDense.wi\_1} & \texttt{*.ffn.linear1} \\
596
- \texttt{*.DenseReluDense.wo} & \texttt{*.ffn.linear2} \\
597
- \texttt{lm\_head} & \texttt{decoder.output\_projection} \\
598
- \bottomrule
599
- \end{tabular}
600
- \end{table}
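The correspondence in the table can be expressed as a small name-rewriting helper. This is an illustrative sketch, not the actual `factory.py` code, which additionally validates shapes and initializes bias terms:

```python
# Illustrative remapping of FLAN-T5 parameter names to their LexiMind
# counterparts, following the table above. Assumed helper, not factory.py code.
DIRECT = {
    "shared": "encoder.embedding",
    "lm_head": "decoder.output_projection",
}

ATTN = {"q": "W_Q", "k": "W_K", "v": "W_V", "o": "W_O"}

def remap(t5_key: str) -> str:
    """Map a FLAN-T5 parameter name to its LexiMind counterpart."""
    if t5_key in DIRECT:
        return DIRECT[t5_key]
    key = t5_key.replace(".block.", ".layers.")  # per-layer container rename
    for src, dst in ATTN.items():                # attention projection rename
        key = key.replace(f"SelfAttention.{src}", f"self_attn.{dst}")
    return key
```

A decoder-side mapping would follow the same pattern with cross-attention handled analogously.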
-
- \subsection{Vocabulary Size Handling}
-
- T5 pads its vocabulary to multiples of 128 for computational efficiency (32,100 $\rightarrow$ 32,128). LexiMind handles this mismatch:
-
- \begin{quote}
- \emph{``Note: T5's vocab is padded to multiple of 128 for efficiency (32100 $\rightarrow$ 32128). [...] Copy only the tokens that exist in both. Initialize any extra tokens with small random values.''} — \texttt{factory.py}, lines 116--131
- \end{quote}
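The overlap-copy strategy the quote describes can be sketched in plain Python. The real code operates on tensors; this stand-in uses list-of-list matrices, and the `init_scale` value is an assumption:

```python
import random

def transfer_embedding_rows(src_rows, dst_rows, init_scale=0.02):
    """Copy embedding rows shared by both vocabularies; randomly
    initialize the remainder.

    Pure-python sketch of the overlap-copy strategy: shared token rows
    are copied verbatim, extra (padding) rows get small random values.
    """
    shared = min(len(src_rows), len(dst_rows))
    for i in range(shared):
        dst_rows[i] = list(src_rows[i])  # copy overlapping token rows
    for i in range(shared, len(dst_rows)):  # extra padding tokens
        dst_rows[i] = [random.gauss(0.0, init_scale) for _ in dst_rows[i]]
    return dst_rows
```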
-
- \subsection{Model Assembly}
-
- The \texttt{build\_multitask\_model} function assembles the complete system:
-
- \begin{lstlisting}[caption={Model assembly from factory.py}]
- model = MultiTaskModel(
-     encoder=encoder,
-     decoder=decoder,
-     decoder_outputs_logits=True
- )
- model.add_head(
-     "summarization",
-     LMHead(d_model=cfg.d_model,
-            vocab_size=vocab_size,
-            tie_embedding=decoder.embedding)
- )
- model.add_head(
-     "emotion",
-     ClassificationHead(
-         d_model=cfg.d_model,
-         num_labels=28,  # GoEmotions
-         pooler="mean",
-         hidden_dim=cfg.d_model // 2)
- )
- model.add_head(
-     "topic",
-     ClassificationHead(
-         d_model=cfg.d_model,
-         num_labels=7,  # 7 topic categories
-         pooler="mean")
- )
- \end{lstlisting}
-
- %=============================================================================
- \section{Multi-Task Model Architecture}
- \label{sec:multitask}
- %=============================================================================
-
- The \texttt{MultiTaskModel} class in \texttt{src/models/multitask.py} provides the routing infrastructure for multi-task learning. Figure \ref{fig:multitask_routing} illustrates the task routing mechanism.
-
- \begin{figure}[htbp]
- \centering
- \begin{tikzpicture}[
- scale=0.7,
- transform shape,
- box/.style={draw, rectangle, minimum width=2cm, minimum height=0.6cm, align=center, rounded corners=2pt},
- decision/.style={draw, diamond, aspect=2, minimum width=1.5cm, align=center, fill=yellow!30},
- arrow/.style={->, >=stealth, thick},
- ]
-
- % Forward call
- \node[box, fill=blue!20] (forward) at (0, 0) {forward(task, inputs)};
-
- % Decision
- \node[decision] (taskcheck) at (0, -1.5) {task type?};
-
- % Branches
- \node[box, fill=green!20] (encoder) at (-3.5, -3.5) {Encoder\\Only};
- \node[box, fill=green!20] (seq2seq) at (3.5, -3.5) {Encoder\\+ Decoder};
-
- % Heads
- \node[box, fill=orange!20] (classhead) at (-3.5, -5) {Classification\\Head};
- \node[box, fill=orange!20] (lmhead) at (3.5, -5) {LM Head};
-
- % Tasks
- \node[box, fill=purple!20, font=\scriptsize] (emotion) at (-5, -6.5) {Emotion};
- \node[box, fill=purple!20, font=\scriptsize] (topic) at (-2, -6.5) {Topic};
- \node[box, fill=purple!20, font=\scriptsize] (summ) at (3.5, -6.5) {Summarization};
-
- % Arrows
- \draw[arrow] (forward) -- (taskcheck);
- \draw[arrow] (taskcheck) -- node[above, font=\scriptsize] {Classification} (encoder);
- \draw[arrow] (taskcheck) -- node[above, font=\scriptsize] {Generation} (seq2seq);
- \draw[arrow] (encoder) -- (classhead);
- \draw[arrow] (seq2seq) -- (lmhead);
- \draw[arrow] (classhead) -- (emotion);
- \draw[arrow] (classhead) -- (topic);
- \draw[arrow] (lmhead) -- (summ);
-
- \end{tikzpicture}
- \caption{Task routing in MultiTaskModel. Classification tasks use encoder-only processing with mean pooling, while generation tasks use the full encoder-decoder pipeline.}
- \label{fig:multitask_routing}
- \end{figure}
-
- \subsection{Task-Specific Head Selection}
-
- The forward method routes inputs based on head type:
-
- \begin{quote}
- \emph{``Encoder-only heads expect encoder outputs [...] LM/seq2seq head: run encoder $\rightarrow$ decoder $\rightarrow$ lm head''} — \texttt{multitask.py}, lines 108--148
- \end{quote}
-
- \subsection{Classification Head}
-
- Classification tasks (emotion, topic) use mean pooling over encoder outputs:
-
- \begin{equation}
- h_{cls} = \frac{\sum_{i=1}^{L} h_i \cdot m_i}{\sum_{i=1}^{L} m_i}
- \end{equation}
-
- where $m_i$ is the attention mask (1 for valid tokens, 0 for padding). The pooled representation is projected through a linear layer to class logits.
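The pooling equation above amounts to averaging only the unmasked token vectors. A minimal pure-python sketch (the actual head works on batched tensors):

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Mean-pool token vectors, skipping padding positions (mask value 0).

    hidden_states: list of L vectors (list of lists of floats);
    attention_mask: list of L ints (1 = valid token, 0 = padding).
    Implements h_cls = sum(h_i * m_i) / sum(m_i) from the equation above.
    """
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    valid = 0
    for vec, m in zip(hidden_states, attention_mask):
        if m:
            valid += 1
            for j in range(dim):
                totals[j] += vec[j]
    return [t / max(valid, 1) for t in totals]
```

Note how the padded position contributes nothing to the pooled vector, so summary statistics are independent of sequence padding.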
-
- %=============================================================================
- \section{Training Pipeline}
- \label{sec:training}
- %=============================================================================
-
- The training infrastructure in \texttt{src/training/trainer.py} implements a comprehensive multi-task training loop with modern deep learning practices.
-
- \subsection{Training Configuration}
-
- The \texttt{TrainerConfig} dataclass encapsulates all hyperparameters:
-
- \begin{lstlisting}[caption={TrainerConfig from trainer.py}]
- from dataclasses import dataclass
- from typing import Dict
-
- @dataclass
- class TrainerConfig:
-     max_epochs: int = 1
-     gradient_clip_norm: float = 1.0
-     task_weights: Dict[str, float] | None = None
-     label_smoothing: float = 0.0
-     gradient_accumulation_steps: int = 1
-     scheduler_type: str = "cosine"
-     warmup_steps: int = 0
-     early_stopping_patience: int | None = None
-     gradient_checkpointing: bool = False
-     compile_model: bool = False
- \end{lstlisting}
-
- \subsection{Mixed-Precision Training}
-
- LexiMind uses Automatic Mixed Precision (AMP) with automatic dtype selection:
-
- \begin{quote}
- \emph{``AMP setup: bfloat16 for Ampere+ GPUs, float16 otherwise''} — \texttt{trainer.py}, line 102
- \end{quote}
-
- BFloat16 provides better numerical stability for training while maintaining the memory and speed benefits of reduced precision.
-
- \subsection{Learning Rate Scheduling}
-
- A cosine schedule with linear warmup is implemented:
-
- \begin{equation}
- lr(t) = \begin{cases}
- lr_{max} \cdot \frac{t}{t_{warmup}} & t < t_{warmup} \\
- lr_{min} + \frac{1}{2}(lr_{max} - lr_{min})\left(1 + \cos\left(\frac{\pi(t-t_{warmup})}{T-t_{warmup}}\right)\right) & t \geq t_{warmup}
- \end{cases}
- \end{equation}
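The schedule can be transcribed directly from the equation. A self-contained sketch (not the trainer's scheduler object, which wraps PyTorch's API):

```python
import math

def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay.

    Direct transcription of the piecewise schedule equation above,
    with T = total_steps.
    """
    if step < warmup_steps:
        return max_lr * step / max(warmup_steps, 1)  # linear ramp to max_lr
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

At the end of warmup the cosine term is at its peak, so the two pieces meet continuously at `max_lr`.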
-
- Figure \ref{fig:lr_schedule} visualizes the learning rate schedule over training, showing the 300-step linear warmup followed by cosine decay.
-
- \begin{figure}[htbp]
- \centering
- \includegraphics[width=\columnwidth]{figures/learning_rate_schedule.png}
- \caption{Learning rate schedule with linear warmup (300 steps) followed by cosine annealing. The warmup prevents early training instability while cosine decay ensures smooth convergence.}
- \label{fig:lr_schedule}
- \end{figure}
-
- \subsection{Multi-Task Loss Computation}
-
- The total loss combines task-specific losses with optional weighting:
-
- \begin{equation}
- \mathcal{L}_{total} = \sum_{t \in \text{tasks}} \lambda_t \mathcal{L}_t
- \end{equation}
-
- \begin{itemize}
- \item \textbf{Summarization}: Cross-entropy with label smoothing and \texttt{ignore\_index=-100}
- \item \textbf{Emotion}: Binary Cross-Entropy with Logits (multi-label)
- \item \textbf{Topic}: Standard Cross-Entropy (single-label)
- \end{itemize}
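The weighted combination above reduces to a few lines. A scalar sketch, assuming unlisted tasks default to weight 1.0 as the equation implies:

```python
def total_loss(task_losses, task_weights=None):
    """Weighted sum of per-task losses; unlisted tasks default to weight 1.0.

    Implements L_total = sum_t lambda_t * L_t with lambda_t taken from
    task_weights (e.g. {"topic": 0.3} to down-weight the small topic dataset).
    """
    task_weights = task_weights or {}
    return sum(task_weights.get(task, 1.0) * loss
               for task, loss in task_losses.items())
```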
-
- \subsection{Gradient Handling}
-
- The trainer includes gradient clipping and early stopping:
-
- \begin{quote}
- \emph{``Gradient clipping to prevent exploding gradients [...] Early stopping based on validation loss''} — \texttt{trainer.py}
- \end{quote}
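The early-stopping behavior described in the quote is simple state tracking. A minimal sketch, assuming the trainer's actual bookkeeping in `trainer.py` differs in detail:

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement.

    Minimal sketch of patience-based early stopping on validation loss.
    """

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def update(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```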
-
- \subsection{Training Loop}
-
- Figure \ref{fig:training_loop} illustrates the training loop structure.
-
- \begin{figure}[htbp]
- \centering
- \begin{tikzpicture}[
- scale=0.65,
- transform shape,
- box/.style={draw, rectangle, minimum width=2.5cm, minimum height=0.5cm, align=center, rounded corners=2pt, font=\small},
- arrow/.style={->, >=stealth},
- ]
-
- % Epoch loop
- \node[box, fill=blue!20] (epoch) at (0, 0) {For each epoch};
-
- % Batch loop
- \node[box, fill=green!20] (batch) at (0, -1.2) {For each batch};
-
- % Task loop
- \node[box, fill=yellow!20] (task) at (0, -2.4) {For each task};
-
- % Forward
- \node[box, fill=orange!20] (forward) at (0, -3.6) {Forward + Loss};
-
- % AMP context
- \node[box, fill=purple!20] (amp) at (0, -4.8) {AMP autocast};
-
- % Backward
- \node[box, fill=red!20] (backward) at (0, -6) {Backward (scaled)};
-
- % Accumulate check
- \node[box, fill=cyan!20] (accum) at (0, -7.2) {Accumulation step?};
-
- % Optimizer step
- \node[box, fill=gray!20] (optim) at (0, -8.4) {Clip + Step + Zero};
-
- % Validation
- \node[box, fill=blue!20] (val) at (3.5, -1.2) {Validation};
-
- % Checkpoint
- \node[box, fill=green!20] (ckpt) at (3.5, -2.4) {Checkpoint};
-
- % Early stopping
- \node[box, fill=red!20] (early) at (3.5, -3.6) {Early Stop?};
-
- % Arrows
- \draw[arrow] (epoch) -- (batch);
- \draw[arrow] (batch) -- (task);
- \draw[arrow] (task) -- (forward);
- \draw[arrow] (forward) -- (amp);
- \draw[arrow] (amp) -- (backward);
- \draw[arrow] (backward) -- (accum);
- \draw[arrow] (accum) -- (optim);
- \draw[arrow] (optim.south) -- ++(0, -0.3) -| ++(-2, 0) |- (task.west);
- \draw[arrow] (epoch.east) -- ++(0.5, 0) |- (val);
- \draw[arrow] (val) -- (ckpt);
- \draw[arrow] (ckpt) -- (early);
-
- \end{tikzpicture}
- \caption{Training loop structure showing nested iteration over epochs, batches, and tasks, with gradient accumulation and validation checkpoints.}
- \label{fig:training_loop}
- \end{figure}
-
- %=============================================================================
- \section{Tasks and Datasets}
- %=============================================================================
-
- LexiMind addresses three complementary NLP tasks:
-
- \subsection{Text Summarization}
-
- \textbf{Task}: Generate concise abstractive summaries from longer documents, focusing on back-cover style book descriptions rather than plot summaries.
-
- \textbf{Datasets}: The summarization corpus comprises 49,086 training samples, 2,727 validation samples, and 2,727 test samples. Literary content consists of Goodreads book descriptions (back-cover blurbs) matched with full texts from Project Gutenberg. Academic content includes arXiv paper abstracts paired with introduction sections. Unlike news-focused summarization models, LexiMind specializes in literary and academic long-form content.
-
- \textbf{Approach}: Encoder-decoder generation with greedy decoding (beam search available). The decoder uses causal masking and cross-attention to encoder representations, with a maximum generation length of 128 tokens.
-
- \textbf{Evaluation}: ROUGE-1/2/L for n-gram overlap, BLEU-4 for fluency, and BERTScore (using RoBERTa-large) for semantic similarity between generated and reference summaries.
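The core idea behind ROUGE-1 is unigram overlap between candidate and reference. A toy sketch of the F1 variant; the reported scores come from a standard ROUGE implementation with its own tokenization and stemming:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 on whitespace tokens (illustrative only)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

This makes the later discussion concrete: a fluent paraphrase with few shared unigrams scores low on ROUGE even when its meaning matches the reference.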
-
- \subsection{Emotion Classification}
-
- \textbf{Task}: Multi-label classification identifying emotions expressed in text, where each sample may have multiple emotion labels.
-
- \textbf{Dataset}: Google's GoEmotions \cite{demszky2020goemotions}, comprising 43,410 training samples, 5,426 validation samples, and 5,427 test samples sourced from Reddit comments.
-
- \textbf{Classes}: 28 emotion categories: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, neutral, optimism, pride, realization, relief, remorse, sadness, and surprise.
-
- \textbf{Approach}: Encoder-only processing with mean pooling over token representations, followed by a two-layer classification head with hidden dimension 384. Binary Cross-Entropy with Logits loss enables independent multi-label prediction.
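At inference time, independent multi-label prediction thresholds each logit separately. A one-line sketch (the 0.5-probability cutoff is the conventional choice, assumed here):

```python
def multilabel_predict(logits, threshold=0.0):
    """Indices of predicted labels for one sample.

    With BCE-with-logits, sigmoid(z) > 0.5 is equivalent to z > 0, so a
    zero logit threshold realizes the usual 0.5 probability cutoff while
    letting any number of labels fire independently.
    """
    return [i for i, z in enumerate(logits) if z > threshold]
```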
-
- \subsection{Topic Classification}
-
- \textbf{Task}: Single-label classification assigning documents to one of seven topic categories.
-
- \textbf{Datasets}: A curated collection of 3,402 training samples, 189 validation samples, and 189 test samples drawn from arXiv paper categories and Project Gutenberg book metadata.
-
- \textbf{Classes}: 7 mutually exclusive topics: Arts, Business, Fiction, History, Philosophy, Science, and Technology.
-
- \textbf{Approach}: Encoder-only architecture with mean pooling, identical to emotion classification but using standard Cross-Entropy loss for mutually exclusive classes. Due to the significantly smaller dataset (3.4K vs 43K for emotion), the topic loss weight is reduced to 0.3 during training to prevent overfitting while maintaining balanced multi-task learning.
-
- %=============================================================================
- \section{Model Specifications}
- %=============================================================================
-
- Table \ref{tab:dataset_summary} summarizes the dataset splits used for training and evaluation. Table \ref{tab:model_specs} details the model architecture.
-
- \begin{table}[htbp]
- \centering
- \caption{Dataset Summary}
- \label{tab:dataset_summary}
- \begin{tabular}{lccc}
- \toprule
- \textbf{Task} & \textbf{Train} & \textbf{Val} & \textbf{Test} \\
- \midrule
- Summarization & 49,086 & 2,727 & 2,727 \\
- Emotion & 43,410 & 5,426 & 5,427 \\
- Topic & 3,402 & 189 & 189 \\
- \midrule
- \textbf{Total} & 95,898 & 8,342 & 8,343 \\
- \bottomrule
- \end{tabular}
- \end{table}
-
- \begin{table}[htbp]
- \centering
- \caption{LexiMind Model Specifications}
- \label{tab:model_specs}
- \begin{tabular}{lc}
- \toprule
- \textbf{Parameter} & \textbf{Value} \\
- \midrule
- Hidden dimension ($d_{model}$) & 768 \\
- FFN dimension ($d_{ff}$) & 2048 \\
- Attention heads & 12 \\
- Head dimension & 64 \\
- Encoder layers & 12 \\
- Decoder layers & 12 \\
- Vocabulary size & 32,128 \\
- Max sequence length & 512 \\
- Dropout & 0.1 \\
- Activation & Gated-GELU \\
- Normalization & RMSNorm (Pre-LN) \\
- Position encoding & Relative bias \\
- \midrule
- Total parameters & $\sim$272M \\
- \bottomrule
- \end{tabular}
- \end{table}
-
- %=============================================================================
- \section{Implementation Details}
- %=============================================================================
-
- \subsection{Project Structure}
-
- LexiMind follows a modular architecture:
-
- \begin{verbatim}
- src/
- +-- models/
- |   +-- attention.py     # MHA, RelPosBias
- |   +-- encoder.py       # Encoder blocks
- |   +-- decoder.py       # Decoder blocks
- |   +-- heads.py         # Task heads
- |   +-- multitask.py     # MTL routing
- |   +-- factory.py       # Construction
- +-- data/
- |   +-- tokenization.py  # Tokenizer
- |   +-- dataset.py       # Dataset classes
- |   +-- dataloader.py    # Collators
- +-- training/
-     +-- trainer.py       # Training loop
-     +-- metrics.py       # Evaluation
- scripts/
- +-- train.py             # Main training
- +-- download_data.py     # Dataset download
- +-- inference.py         # CLI inference
- +-- demo_gradio.py       # Web demo
- \end{verbatim}
-
- \subsection{FlashAttention and CUDA Optimizations}
-
- The trainer enables comprehensive hardware-specific optimizations:
-
- \begin{lstlisting}[caption={CUDA optimizations from train.py}]
- if device.type == "cuda":
-     torch.backends.cudnn.benchmark = True
-     torch.backends.cuda.matmul.allow_tf32 = True
-     torch.backends.cudnn.allow_tf32 = True
-     torch.backends.cuda.enable_flash_sdp(True)
-     torch.backends.cuda.enable_mem_efficient_sdp(True)
- \end{lstlisting}
-
- Note that T5-style relative position bias is incompatible with FlashAttention: the bias must be added to the raw attention scores, which breaks the fused kernel. The development configuration therefore disables relative position bias to enable FlashAttention for faster iteration, while production configurations retain relative position bias for better quality.
-
- \subsection{Numerical Stability}
-
- To prevent overflow during mixed-precision training, hidden states are clamped after each sublayer:
-
- \begin{quote}
- \emph{``Clamp inf values for fp16/bf16 training stability (like HuggingFace T5)''} — \texttt{encoder.py}, lines 103--105
- \end{quote}
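The clamping the quote refers to can be shown element-wise in plain Python. The cutoff below is an assumption chosen just under the fp16 maximum of 65,504; the real code uses `torch.clamp` on tensors with a dtype-derived bound:

```python
def clamp_finite(values, max_val=64000.0):
    """Clamp activations into a finite range so fp16 cannot overflow to inf.

    Pure-python stand-in for torch.clamp(x, -max_val, max_val); the 64000.0
    cutoff is an assumed value slightly below fp16's maximum of 65504.
    """
    return [min(max(v, -max_val), max_val) for v in values]
```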
-
- %=============================================================================
- \section{Experimental Setup}
- %=============================================================================
-
- \subsection{Training Configuration}
-
- The final training configuration was optimized for quality and efficiency on an NVIDIA RTX 4070 with 12GB VRAM:
-
- \begin{itemize}
- \item \textbf{Optimizer}: Fused AdamW with weight decay 0.01, $\beta_1=0.9$, $\beta_2=0.98$
- \item \textbf{Learning Rate}: $3 \times 10^{-5}$ with cosine decay
- \item \textbf{Warmup}: 300 steps ($\sim$0.5 epochs)
- \item \textbf{Batch Size}: 10 with 4$\times$ gradient accumulation (effective batch size 40)
- \item \textbf{Precision}: BFloat16 on Ampere+ GPUs with TF32 enabled
- \item \textbf{Gradient Clipping}: Max norm 1.0
- \item \textbf{Gradient Checkpointing}: Enabled for memory efficiency
- \item \textbf{torch.compile}: Dynamic compilation for encoder and decoder
- \item \textbf{Task Weights}: Summarization 1.0, Emotion 1.0, Topic 0.3 (reduced due to small dataset)
- \item \textbf{Early Stopping}: Patience of 3 epochs on validation loss
- \item \textbf{Encoder Freezing}: Bottom 4 layers frozen for stable transfer learning
- \end{itemize}
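The 4$\times$ gradient accumulation above can be illustrated with a scalar sketch: each micro-batch gradient is scaled by 1/accum\_steps so the accumulated update matches one large batch of size 10 $\times$ 4 = 40. Illustrative only; the trainer applies this scaling to the loss before `backward()`:

```python
def accumulated_steps(micro_grads, accum_steps=4):
    """Average each group of `accum_steps` micro-batch gradients into one update.

    Scalar stand-in for gradient accumulation: scale by 1/accum_steps, sum
    within a group, and emit one optimizer update per group boundary.
    """
    updates, buf = [], 0.0
    for i, g in enumerate(micro_grads, 1):
        buf += g / accum_steps      # scale so the sum matches a big-batch mean
        if i % accum_steps == 0:    # optimizer-step boundary
            updates.append(buf)
            buf = 0.0
    return updates
```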
-
- Training completed in 7 epochs ($\sim$6 hours), with early stopping triggered by a validation loss plateau.
-
- %=============================================================================
- \section{Experimental Results}
- \label{sec:results}
- %=============================================================================
-
- We evaluate LexiMind on held-out validation sets for each task. Table \ref{tab:summarization_results} presents the summarization metrics, and Table \ref{tab:classification_results} shows classification performance.
-
- \subsection{Summarization Performance}
-
- \begin{table}[htbp]
- \centering
- \caption{Summarization Evaluation Results}
- \label{tab:summarization_results}
- \begin{tabular}{lc}
- \toprule
- \textbf{Metric} & \textbf{Score} \\
- \midrule
- ROUGE-1 & 0.3064 \\
- ROUGE-2 & 0.0896 \\
- ROUGE-L & 0.1832 \\
- BLEU-4 & 0.0237 \\
- \midrule
- BERTScore Precision & 0.8430 \\
- BERTScore Recall & 0.8179 \\
- \textbf{BERTScore F1} & \textbf{0.8300} \\
- \bottomrule
- \end{tabular}
- \end{table}
-
- The BERTScore F1 of \textbf{0.83} demonstrates strong semantic similarity between generated descriptions and references, indicating the model captures meaning effectively even when exact wording differs. ROUGE scores are typical for abstractive summarization, where the model paraphrases rather than extracts verbatim text.
-
- \subsection{Classification Performance}
-
- \begin{table}[htbp]
- \centering
- \caption{Classification Evaluation Results}
- \label{tab:classification_results}
- \begin{tabular}{llc}
- \toprule
- \textbf{Task} & \textbf{Metric} & \textbf{Score} \\
- \midrule
- \multirow{2}{*}{Topic (7 classes)} & Accuracy & \textbf{85.19\%} \\
- & Macro F1 & 0.8474 \\
- \midrule
- Emotion (28 classes) & Multi-label F1 & 0.1987 \\
- \bottomrule
- \end{tabular}
- \end{table}
-
- Topic classification achieves \textbf{85.2\%} accuracy with balanced per-class performance. The emotion detection task proves more challenging due to the 28-class multi-label setting with inherent label ambiguity in the GoEmotions dataset.
-
- \subsection{Training Dynamics}
-
- Figure \ref{fig:training_curves} illustrates the training dynamics over 7 epochs. The model achieves its lowest validation loss at epoch 4 (summarization loss: 3.698), with the checkpoint from this epoch saved as the best model. Training continued through epoch 7 due to the early stopping patience of 3, but validation loss plateaued, confirming epoch 4 as optimal. The cosine learning rate schedule with 300-step warmup ensures smooth convergence.
-
- \begin{figure}[htbp]
- \centering
- \includegraphics[width=\columnwidth]{figures/training_loss_curve.png}
- \caption{Training and validation loss curves over 7 epochs. Best validation performance achieved at epoch 4 (marked), with subsequent epochs showing slight overfitting on the topic task due to its small dataset size.}
- \label{fig:training_curves}
- \end{figure}
-
- Figure \ref{fig:task_metrics} presents per-task metrics throughout training, showing the distinct learning trajectories for summarization, emotion detection, and topic classification.
-
- \begin{figure}[htbp]
- \centering
- \includegraphics[width=\columnwidth]{figures/task_metrics.png}
- \caption{Task-specific metrics during training: ROUGE-1 for summarization, F1 for emotion detection, and accuracy for topic classification.}
- \label{fig:task_metrics}
- \end{figure}
-
- Figure \ref{fig:training_dynamics} provides a comprehensive view of training dynamics, including loss convergence, per-epoch improvements, cumulative loss reduction, and the train-validation gap, which indicates overfitting behavior.
-
- \begin{figure}[htbp]
- \centering
- \includegraphics[width=\columnwidth]{figures/training_dynamics.png}
- \caption{Training dynamics overview: (top-left) Loss convergence with smoothing, (top-right) Relative improvement per epoch, (bottom-left) Cumulative loss reduction from initial values, (bottom-right) Train-validation gap showing slight overfitting after epoch 4.}
- \label{fig:training_dynamics}
- \end{figure}
-
- \subsection{Per-Class Topic Analysis}
-
- Table \ref{tab:topic_breakdown} shows the per-class performance for topic classification:
-
- \begin{table}[htbp]
- \centering
- \caption{Per-Class Topic Classification Performance}
- \label{tab:topic_breakdown}
- \begin{tabular}{lccc}
- \toprule
- \textbf{Topic} & \textbf{Precision} & \textbf{Recall} & \textbf{F1} \\
- \midrule
- Arts & 0.93 & 0.76 & 0.84 \\
- Business & 0.97 & 0.97 & 0.97 \\
- Fiction & 0.95 & 1.00 & 0.97 \\
- History & 0.83 & 0.78 & 0.81 \\
- Philosophy & 0.80 & 0.86 & 0.83 \\
- Science & 0.58 & 0.73 & 0.65 \\
- Technology & 0.86 & 0.89 & 0.87 \\
- \bottomrule
- \end{tabular}
- \end{table}
-
- The model performs best on the Fiction and Business categories, while Science shows the most confusion, likely due to overlap with Technology topics.
-
- %=============================================================================
- \section{Discussion}
- %=============================================================================
-
- \subsection{Key Findings}
-
- \textbf{BERTScore vs. ROUGE}: The high BERTScore F1 (0.83) combined with moderate ROUGE-1 (0.31) illustrates a key characteristic of abstractive summarization. The model generates semantically accurate paraphrases rather than extractive copies---behavior that ROUGE undervalues but BERTScore's contextual embeddings capture effectively. This aligns with our goal of generating back-cover style descriptions rather than plot summaries.
-
- \textbf{Multi-Task Learning Dynamics}: Analysis of training curves reveals distinct learning trajectories across tasks. Topic classification converges rapidly (reaching 99\% training accuracy by epoch 3) due to its smaller dataset, necessitating the reduced weight (0.3) to prevent gradient dominance. Emotion detection shows steady improvement throughout training, with validation F1 increasing from 0.30 to 0.40. Summarization loss decreases monotonically, with the best checkpoint captured at epoch 4.
-
- \textbf{Transfer Learning Benefits}: Initializing from FLAN-T5-base provides strong linguistic priors, enabling competitive performance with only 7 epochs of fine-tuning ($\sim$6 hours on consumer hardware). Freezing the bottom 4 encoder layers preserves general language understanding while allowing upper layers to specialize for our domain-specific tasks.
-
- \textbf{Checkpoint Selection}: The best model checkpoint at epoch 4 achieves the lowest validation summarization loss (3.698) while maintaining strong classification performance. Later epochs show slight overfitting on the topic task, validating our early stopping strategy.
-
- \subsection{Limitations}
-
- \begin{itemize}
- \item \textbf{Emotion Detection}: The 28-class multi-label setting remains challenging, with F1 of 0.20 on validation data. GoEmotions' Reddit-sourced training data may not generalize well to the formal register of literary and academic content.
- \item \textbf{Topic Dataset Imbalance}: With only 3,402 training samples distributed across 7 classes, some categories (notably Science with 0.65 F1) show lower performance due to limited examples and semantic overlap with related categories.
- \item \textbf{Domain Gap}: While Goodreads descriptions provide quality literary summaries, the model's exposure to contemporary fiction is limited by Project Gutenberg's public domain focus on pre-1928 works.
- \end{itemize}
-
- \subsection{Future Work}
-
- Several directions could improve LexiMind's performance:
- \begin{itemize}
- \item \textbf{Domain-Specific Emotion Data}: Fine-tuning on literary emotion annotations rather than Reddit comments could better capture the emotional nuances of literary and academic text.
- \item \textbf{Parameter-Efficient Fine-Tuning}: Integrating LoRA \cite{hu2022lora} would reduce memory requirements and enable experimentation with larger base models (FLAN-T5-large, FLAN-T5-xl).
- \item \textbf{Expanded Topic Dataset}: Augmenting the 3.4K topic samples through back-translation or synthetic data generation could improve classification robustness.
- \end{itemize}
-
- %=============================================================================
- \section{Conclusion}
- %=============================================================================
-
- This paper presented LexiMind, a multi-task NLP system combining a custom Transformer implementation with FLAN-T5 pre-trained weights. The hybrid approach provides architectural transparency while leveraging transfer learning, achieving:
-
- \begin{itemize}
- \item \textbf{Summarization}: BERTScore F1 of 0.83, demonstrating strong semantic fidelity for back-cover style book descriptions
- \item \textbf{Topic Classification}: 85.2\% accuracy and 0.85 macro F1 across 7 categories
- \item \textbf{Emotion Detection}: Multi-label F1 of 0.20 on 28 emotion classes
- \end{itemize}
-
- The complete system trains in approximately 6 hours on a consumer GPU (RTX 4070 12GB), demonstrating that sophisticated multi-task models remain accessible without datacenter-scale resources. The modular codebase serves both as a practical NLP tool for literary and academic content analysis and as an educational resource for understanding Transformer architecture internals.
-
- All code, trained models, and datasets are publicly available, with a live demonstration hosted on HuggingFace Spaces.\footnote{\url{https://huggingface.co/spaces/OliverPerrin/LexiMind}}
-
- %=============================================================================
- % References
- %=============================================================================
-
- \begin{thebibliography}{00}
-
- \bibitem{vaswani2017attention}
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, ``Attention is all you need,'' in \textit{Advances in Neural Information Processing Systems (NeurIPS)}, vol. 30, 2017, pp. 5998--6008. [Online]. Available: \url{https://arxiv.org/abs/1706.03762}
-
- \bibitem{raffel2020exploring}
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, ``Exploring the limits of transfer learning with a unified text-to-text transformer,'' \textit{Journal of Machine Learning Research}, vol. 21, no. 140, pp. 1--67, 2020. [Online]. Available: \url{https://arxiv.org/abs/1910.10683}
-
- \bibitem{chung2022scaling}
- H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, ``Scaling instruction-finetuned language models,'' \textit{arXiv preprint arXiv:2210.11416}, 2022. [Online]. Available: \url{https://arxiv.org/abs/2210.11416}
-
- \bibitem{xiong2020layer}
- R. Xiong, Y. Yang, J. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, ``On layer normalization in the transformer architecture,'' in \textit{International Conference on Machine Learning (ICML)}, 2020, pp. 10524--10533. [Online]. Available: \url{https://arxiv.org/abs/2002.04745}
-
- \bibitem{zhang2019root}
- B. Zhang and R. Sennrich, ``Root mean square layer normalization,'' in \textit{Advances in Neural Information Processing Systems (NeurIPS)}, vol. 32, 2019, pp. 12360--12371. [Online]. Available: \url{https://arxiv.org/abs/1910.07467}
-
- \bibitem{ba2016layer}
- J. L. Ba, J. R. Kiros, and G. E. Hinton, ``Layer normalization,'' \textit{arXiv preprint arXiv:1607.06450}, 2016. [Online]. Available: \url{https://arxiv.org/abs/1607.06450}
-
- \bibitem{caruana1997multitask}
- R. Caruana, ``Multitask learning,'' \textit{Machine Learning}, vol. 28, no. 1, pp. 41--75, 1997.
-
- \bibitem{kudo2018sentencepiece}
- T. Kudo and J. Richardson, ``SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,'' in \textit{Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations}, 2018, pp. 66--71. [Online]. Available: \url{https://arxiv.org/abs/1808.06226}
-
- \bibitem{hu2022lora}
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, ``LoRA: Low-rank adaptation of large language models,'' in \textit{International Conference on Learning Representations (ICLR)}, 2022. [Online]. Available: \url{https://arxiv.org/abs/2106.09685}
-
- \bibitem{dao2022flashattention}
- T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, ``FlashAttention: Fast and memory-efficient exact attention with IO-awareness,'' in \textit{Advances in Neural Information Processing Systems (NeurIPS)}, vol. 35, 2022, pp. 16344--16359. [Online]. Available: \url{https://arxiv.org/abs/2205.14135}
-
- \bibitem{zhang2019bertscore}
- T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, ``BERTScore: Evaluating text generation with BERT,'' in \textit{International Conference on Learning Representations (ICLR)}, 2020. [Online]. Available: \url{https://arxiv.org/abs/1904.09675}
-
- \bibitem{demszky2020goemotions}
- D. Demszky, D. Movshovitz-Attias, J. Ko, A. Cowen, G. Nemade, and S. Ravi, ``GoEmotions: A dataset of fine-grained emotions,'' in \textit{Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}, 2020, pp. 4040--4054. [Online]. Available: \url{https://arxiv.org/abs/2005.00547}
-
- \end{thebibliography}
-
- \end{document}
docs/research_paper.tex DELETED
@@ -1,574 +0,0 @@
- % LexiMind: Multi-Task Learning for Literary and Academic Text Understanding
- % Research Paper - Revised with Experimental Rigor
- % Author: Oliver Perrin
-
- \documentclass[conference]{IEEEtran}
- \IEEEoverridecommandlockouts
-
- % Essential packages
- \usepackage{cite}
- \usepackage{amsmath,amssymb,amsfonts}
- \usepackage{graphicx}
- \usepackage{textcomp}
- \usepackage{xcolor}
- \usepackage{hyperref}
- \usepackage{booktabs}
- \usepackage{multirow}
- \usepackage{array}
- \usepackage{caption}
-
- % TikZ for diagrams
- \usepackage{tikz}
- \usetikzlibrary{shapes.geometric, arrows, positioning}
-
- % Hyperref setup
- \hypersetup{
- colorlinks=true,
- linkcolor=blue,
- citecolor=blue,
- urlcolor=blue
- }
-
- \def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
- T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
-
- \begin{document}
-
- \title{Multi-Task Learning for Literary and Academic Text:\\Does Joint Training Help or Hurt?}
- \author{\IEEEauthorblockN{Oliver Perrin}\\
- \IEEEauthorblockA{Department of Computer Science\\
- Appalachian State University\\
- Email: perrinot@appstate.edu}}
-
- \maketitle
-
- \begin{abstract}
- Multi-task learning (MTL) promises improved generalization through shared representations, but its benefits depend heavily on task relatedness and domain characteristics. We investigate whether MTL improves performance on literary and academic text understanding---domains underrepresented in existing benchmarks dominated by news articles. Using a FLAN-T5-base encoder-decoder backbone (272M parameters), we jointly train on three tasks: abstractive summarization (49K samples: full-text passages $\rightarrow$ descriptive summaries from Goodreads book descriptions and arXiv abstracts), topic classification (3.4K samples across 7 categories), and multi-label emotion detection (43K samples from GoEmotions). Through ablation studies, we find that naive MTL with mean pooling and round-robin scheduling yields mixed results: topic classification gains +3.2\% accuracy, summarization remains stable, but emotion detection suffers negative transfer ($-$0.02 F1). We then show that two targeted interventions---\textbf{learned attention pooling} for the emotion head and \textbf{temperature-based task sampling} ($\alpha=0.5$)---eliminate negative transfer entirely, improving multi-task emotion sample-averaged F1 from 0.199 to 0.352 (+77\%), substantially exceeding the single-task baseline (0.218). With per-class threshold tuning, emotion macro F1 reaches 0.294. Topic classification improves to 85.7\% accuracy (95\% CI: [80.4\%, 91.0\%]), and summarization quality remains robust (ROUGE-1: 0.310, ROUGE-L: 0.185). Per-domain analysis reveals a significant quality gap between academic summaries (ROUGE-1: 0.319) and literary summaries (ROUGE-1: 0.206), attributable to the 11:1 training imbalance. We additionally contribute inter-task gradient conflict diagnostics, cross-task document deduplication, bootstrap confidence intervals, and multi-seed evaluation infrastructure. Our analysis demonstrates that architectural isolation of task-specific components (attention pooling) combined with balanced optimization (temperature sampling) can convert negative transfer to positive transfer in MTL systems.
- \end{abstract}
-
- \begin{IEEEkeywords}
- Multi-Task Learning, Transfer Learning, Text Summarization, Emotion Classification, FLAN-T5
- \end{IEEEkeywords}
-
- %=============================================================================
- \section{Introduction}
- %=============================================================================
-
- Multi-task learning (MTL) \cite{caruana1997multitask} trains a single model on multiple related tasks, hypothesizing that shared representations improve generalization. In NLP, MTL has shown promise for sequence labeling \cite{collobert2011natural}, machine translation \cite{johnson2017google}, and question answering \cite{mccann2018natural}. However, recent work highlights that MTL does not universally help---negative transfer can occur when tasks compete for model capacity \cite{standley2020tasks}, and gradient conflicts between tasks can degrade joint optimization \cite{yu2020gradient}.
-
- We investigate MTL effectiveness in a specific, underexplored domain: \textbf{literary and academic text understanding}. Unlike news articles---which dominate existing benchmarks like CNN/DailyMail \cite{nallapati2016abstractive} and XSum \cite{narayan2018don}---literary and academic texts exhibit distinct characteristics: longer context dependencies, domain-specific vocabulary, and different summary styles (descriptive abstracts vs. extractive headlines). Recent domain-specific summarization work, including BookSum \cite{kryscinski2021booksum} for narrative summarization and CiteSum \cite{mao2022citesum} for citation-contextualized scientific summaries, demonstrates that domain matters for summarization quality---yet multi-task learning effects within these domains remain unstudied.
-
- Our study addresses three research questions:
-
- \begin{enumerate}
- \item[\textbf{RQ1}] Does multi-task learning improve performance over single-task specialists on literary/academic domains?
- \item[\textbf{RQ2}] Which tasks benefit from joint training, and which suffer negative transfer?
- \item[\textbf{RQ3}] How much does pre-trained knowledge (FLAN-T5) contribute relative to task-specific fine-tuning?
- \end{enumerate}
-
- To answer these questions, we construct \textbf{LexiMind}, a multi-task system built on FLAN-T5-base \cite{chung2022scaling} that performs abstractive summarization, topic classification, and emotion detection. We conduct ablations comparing multi-task vs. single-task training, with vs. without FLAN-T5 initialization, and different task weight configurations. Our primary experimental contribution is the empirical characterization of transfer effects across these heterogeneous tasks:
-
- \begin{itemize}
- \item \textbf{Topic classification benefits from MTL} (+3.7\% accuracy over single-task), leveraging shared encoder representations from the larger summarization dataset.
- \item \textbf{Summarization is robust to MTL}, showing stable ROUGE scores despite sharing encoder capacity with classification heads.
- \item \textbf{Emotion detection: from negative to positive transfer}. Naive MTL with mean pooling degrades emotion F1 by $-$0.02; learned attention pooling combined with temperature-based task sampling reverses this, yielding +0.134 F1 over the single-task baseline.
- \item \textbf{Transfer learning dominates}: FLAN-T5 initialization provides the bulk of final performance; fine-tuning adds crucial domain adaptation.
- \end{itemize}
-
- We acknowledge important limitations: our main results are from single-seed runs, though we provide bootstrap confidence intervals and multi-seed evaluation infrastructure. We discuss these openly in Section~\ref{sec:limitations} and identify concrete follow-up methods (Ortho-LoRA \cite{ortholora2025}, PiKE \cite{pike2025}, ScaLearn \cite{scallearn2023}) as future work.
-
- %=============================================================================
- \section{Related Work}
- %=============================================================================
-
- \subsection{Multi-Task Learning in NLP}
-
- Collobert et al. \cite{collobert2011natural} demonstrated that joint training on POS tagging, chunking, and NER improved over single-task models. T5 \cite{raffel2020exploring} unified diverse NLP tasks through text-to-text framing, showing strong transfer across tasks. However, Standley et al. \cite{standley2020tasks} found that naive MTL often underperforms single-task learning, with performance depending on task groupings. More recently, Aghajanyan et al. \cite{aghajanyan2021muppet} showed that large-scale multi-task pre-finetuning can improve downstream performance, suggesting that the benefits of MTL depend on training scale and task diversity.
-
- \textbf{Gradient conflict and loss balancing.} Yu et al. \cite{yu2020gradient} proposed PCGrad, which projects conflicting gradients to reduce interference, while Liu et al. \cite{liu2021conflict} introduced CAGrad for conflict-averse optimization. Chen et al. \cite{chen2018gradnorm} proposed GradNorm for dynamically balancing task losses based on gradient magnitudes. Kendall et al. \cite{kendall2018multi} explored uncertainty-based task weighting. Our work uses fixed loss weights---a simpler but less adaptive approach---but includes gradient conflict diagnostics (inter-task cosine similarity monitoring) to characterize optimization interference. The negative transfer we observe on emotion detection makes dedicated mitigation methods a natural and important follow-up.
-
- \textbf{Recent advances in multi-task optimization.} Several recent methods address task interference more precisely. Ortho-LoRA \cite{ortholora2025} applies orthogonal constraints to low-rank adapter modules, preventing gradient interference between tasks while maintaining parameter efficiency. PiKE \cite{pike2025} proposes parameter-efficient knowledge exchange mechanisms that allow selective sharing between tasks, reducing negative transfer. ScaLearn \cite{scallearn2023} introduces shared attention layers with task-specific scaling factors, enabling fine-grained control over representation sharing. Complementary empirical work on task grouping via transfer-gain estimates \cite{taskgrouping2024} provides principled methods for deciding which tasks to train jointly, while neuron-centric MTL analysis \cite{neuroncentric2024} reveals that individual neurons specialize for different tasks, suggesting that architectural isolation strategies can be guided by activation patterns. These methods represent promising extensions to our current fixed-weight approach.
-
- \textbf{Multi-domain multi-task studies.} Aribandi et al. \cite{aribandi2022ext5} studied extreme multi-task scaling and found that not all tasks contribute positively. Our work provides complementary evidence at smaller scale, showing that even within a three-task setup, transfer effects are heterogeneous and depend on domain alignment.
-
- \subsection{Literary and Academic Summarization}
-
- Most summarization benchmarks focus on news \cite{nallapati2016abstractive, narayan2018don}. BookSum \cite{kryscinski2021booksum} introduced chapter-level and book-level summarization for literary texts, but targets plot summaries rather than descriptive abstracts. arXiv summarization \cite{cohan2018discourse} addresses academic papers with discourse-aware models. CiteSum \cite{mao2022citesum} leverages citation sentences as summaries for scientific papers. Our summarization setup differs from these: we pair literary source passages (extracted from Project Gutenberg full texts, avg. 3,030 characters) with Goodreads book descriptions (avg. 572 characters) as targets, training the model to generate \textit{what a book is about} rather than plot recaps. For academic text, arXiv paper body text (avg. 3,967 characters) is paired with abstracts (avg. 1,433 characters). The resulting compression ratios (0.19 for literary, 0.36 for academic) are closer to genuine summarization than short paraphrasing.
-
- \subsection{Emotion Detection}
-
- GoEmotions \cite{demszky2020goemotions} provides 28 fine-grained emotion labels from Reddit comments. The original work reports 0.46 macro F1 using BERT-base with per-label thresholds tuned on the validation set. Subsequent work achieves 0.35--0.46 macro F1 depending on the model and threshold strategy. Importantly, all published GoEmotions baselines use encoder-only architectures (BERT, RoBERTa) rather than encoder-decoder models like T5. Our setup differs in both architecture (encoder-decoder with attention-pooled encoder states for emotion detection) and domain (training encoder primarily on literary/academic summarization), making direct comparison to published baselines informative but not fully controlled.
-
- %=============================================================================
- \section{Experimental Setup}
- %=============================================================================
-
- \subsection{Task Formulations}
- \label{sec:task_formulation}
-
- We define three tasks with explicit input-output specifications:
-
- \textbf{Summarization (generative).} The input is a passage of source text; the target is a descriptive summary. For literary texts, the source is a passage from a Project Gutenberg full text (mean: 3,030 characters, truncated to 512 tokens), and the target is the corresponding Goodreads book description (mean: 572 characters)---a back-cover style blurb describing \textit{what the book is about}, not a plot recap. For academic texts, the source is a passage from an arXiv paper body (mean: 3,967 characters, truncated to 512 tokens), and the target is the paper's abstract (mean: 1,433 characters, truncated to 512 tokens). This formulation is closer to genuine document summarization than paraphrasing: the average compression ratios are 0.19 (literary) and 0.36 (academic), comparable to standard summarization benchmarks.
-
- \textbf{Topic classification (discriminative, single-label).} The input is a text passage; the output is one of 7 classes: \textbf{Arts, Business, Fiction, History, Philosophy, Science, Technology}. Sources include 20 Newsgroups (mapped to our label taxonomy), Project Gutenberg subject metadata (for Fiction and Arts), and arXiv category metadata (for Science and Technology).
-
- \textbf{Emotion detection (discriminative, multi-label).} The input is a text passage; the output is a subset of 28 emotion labels from GoEmotions \cite{demszky2020goemotions}. Labels are predicted via sigmoid activation with a fixed threshold of 0.3 during training evaluation and 0.5 during inference. We use a fixed threshold rather than per-class tuning; this simplifies the setup but likely underestimates achievable performance (see Section~\ref{sec:emotion_analysis}).
-
- \subsection{Datasets}
-
- Table \ref{tab:datasets} summarizes dataset statistics.
-
- \begin{table}[htbp]
- \centering
- \caption{Dataset Statistics. Summarization pairs combine literary (Goodreads + Gutenberg) and academic (arXiv) sources; the academic subset is roughly 11$\times$ larger.}
- \label{tab:datasets}
- \begin{tabular}{llrrr}
- \toprule
- \textbf{Task} & \textbf{Source} & \textbf{Train} & \textbf{Val} & \textbf{Test} \\
- \midrule
- \multirow{3}{*}{Summarization} & Goodreads + Gutenberg & $\sim$4K & -- & -- \\
- & arXiv (body $\rightarrow$ abstract) & $\sim$45K & -- & -- \\
- & \textit{Combined} & 49,086 & 2,727 & 2,727 \\
- \midrule
- Topic (7 classes) & 20News + Gutenberg + arXiv & 3,402 & 189 & 189 \\
- \midrule
- Emotion (28 labels) & GoEmotions (Reddit) & 43,410 & 5,426 & 5,427 \\
- \bottomrule
- \end{tabular}
- \end{table}
-
- \textbf{Dataset curation.} Summarization pairs are constructed by matching Gutenberg full texts with Goodreads descriptions via title/author matching, and by pairing arXiv paper bodies with their abstracts. Text is truncated to 512 tokens (max encoder input length). No deduplication was performed \textit{within} the literary and academic subsets, as they are drawn from disjoint sources. We note that the academic subset is substantially larger ($\sim$45K vs. $\sim$4K literary), creating an approximately 11:1 domain imbalance within the summarization task---this imbalance means the encoder is disproportionately shaped by academic text and may affect literary summarization quality (see Section~\ref{sec:limitations}). Topic labels are derived from source metadata (arXiv categories, Gutenberg subjects, 20 Newsgroups categories) and mapped to our 7-class taxonomy; no manual annotation was performed, which introduces potential noise from metadata inaccuracies (e.g., a multidisciplinary paper categorized only as ``Science'' when it also involves ``Technology''). GoEmotions is used as-is from the HuggingFace datasets hub.
-
- \textbf{Cross-task deduplication.} Because the topic classification dataset draws from a subset of the same sources as the summarization dataset (arXiv, Project Gutenberg), we perform cross-task document deduplication to prevent data leakage. Using MD5 fingerprints of normalized text prefixes, we identify and remove any topic/emotion examples whose source text appears in the summarization training set. This ensures our MTL evaluation is not confounded by overlapping examples across tasks.
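The fingerprint-based filter described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the names `fingerprint` and `dedup_against`, the whitespace/case normalization, and the 256-character prefix length are all assumptions.

```python
import hashlib

def fingerprint(text: str, prefix_len: int = 256) -> str:
    # Normalize (lowercase, collapse whitespace), then hash a fixed-length
    # prefix so trivially reformatted copies of a document still collide.
    normalized = " ".join(text.lower().split())[:prefix_len]
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def dedup_against(summarization_texts, examples, key=lambda ex: ex["text"]):
    # Drop classification examples whose source text already appears
    # in the summarization training set.
    seen = {fingerprint(t) for t in summarization_texts}
    return [ex for ex in examples if fingerprint(key(ex)) not in seen]
```

Hashing a normalized prefix rather than the full document keeps the filter cheap and robust to truncation differences between the two datasets.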
-
- \textbf{Note on dataset sizes.} The large disparity between topic (3.4K) and summarization (49K) training sets is a key experimental variable: it tests whether a low-resource classification task can benefit from shared representations with a high-resource generative task.
-
- \subsection{Model Architecture}
-
- LexiMind uses FLAN-T5-base (272M parameters) as the backbone, with a custom reimplementation that loads pre-trained weights via a factory module for architectural transparency:
-
- \begin{itemize}
- \item 12-layer encoder, 12-layer decoder
- \item 768-dimensional hidden states, 12 attention heads
- \item T5-style relative position bias (no absolute positional embeddings)
- \item Pre-Layer Normalization with RMSNorm \cite{zhang2019root}
- \item FlashAttention via PyTorch 2.0 SDPA when compatible
- \end{itemize}
-
- Task-specific heads branch from the shared encoder:
- \begin{itemize}
- \item \textbf{Summarization}: Full decoder with language modeling head (cross-entropy loss with label smoothing $\epsilon=0.1$, greedy decoding with max length 512 tokens)
- \item \textbf{Topic}: Linear classifier on mean-pooled encoder hidden states (cross-entropy loss)
- \item \textbf{Emotion}: Linear classifier on \textit{attention-pooled} encoder hidden states with sigmoid activation (binary cross-entropy loss). Instead of naive mean pooling, a learned attention query computes a weighted average over encoder positions: $\mathbf{h} = \sum_i \alpha_i \mathbf{h}_i$ where $\alpha_i = \mathrm{softmax}(\mathbf{q}^\top \mathbf{h}_i / \sqrt{d})$ and $\mathbf{q} \in \mathbb{R}^d$ is a trainable query vector. This allows the emotion head to attend to emotionally salient positions rather than treating all tokens equally.
- \end{itemize}
-
- \textbf{Architectural note.} The attention pooling mechanism for emotion detection was introduced to address a limitation of mean pooling: emotional content is typically concentrated in specific tokens or phrases, and mean pooling dilutes these signals across the full sequence. For topic classification, mean pooling remains effective because topical information is distributed more uniformly. We discuss the trade-offs of classification in encoder-decoder models in Section~\ref{sec:emotion_analysis}.
-
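The pooling equation $\mathbf{h} = \sum_i \alpha_i \mathbf{h}_i$ with $\alpha_i = \mathrm{softmax}(\mathbf{q}^\top \mathbf{h}_i / \sqrt{d})$ can be sketched in plain Python (the real model would do this batched in PyTorch; the function name and mask convention here are illustrative assumptions):

```python
import math

def attention_pool(hidden, query, mask=None):
    # hidden: list of T token vectors (each of length d); query: learned q.
    # Returns sum_i alpha_i * h_i with alpha = softmax(q . h_i / sqrt(d));
    # masked (padding) positions receive zero attention weight.
    d = len(query)
    scores = []
    for i, h in enumerate(hidden):
        if mask is not None and not mask[i]:
            scores.append(float("-inf"))
        else:
            scores.append(sum(q * x for q, x in zip(query, h)) / math.sqrt(d))
    peak = max(scores)                       # stabilize the softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    return [sum(a * h[j] for a, h in zip(alpha, hidden)) for j in range(d)]
```

With a zero query all positions get equal weight (recovering mean pooling); a trained query skews the weights toward salient tokens.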
- \subsection{Training Configuration}
-
- All experiments use consistent hyperparameters unless otherwise noted:
-
- \begin{itemize}
- \item \textbf{Optimizer}: Fused AdamW, lr=$3\times10^{-5}$, weight decay=0.01, $\beta_1$=0.9, $\beta_2$=0.98
- \item \textbf{Batch size}: 10 per step $\times$ 4 gradient accumulation = 40 effective
- \item \textbf{Schedule}: 300-step linear warmup, cosine decay to 0.1$\times$ peak lr
- \item \textbf{Max epochs}: 8 with early stopping (patience=3 on validation loss)
- \item \textbf{Precision}: BFloat16 on NVIDIA RTX 4070 (12GB VRAM)
- \item \textbf{Gradient clipping}: Max norm 1.0
- \item \textbf{Encoder freezing}: Bottom 4 layers frozen for stable transfer learning
- \end{itemize}
-
- \textbf{Task scheduling.} We use \textbf{temperature-based sampling}: task $i$ is sampled with probability $p_i \propto n_i^\alpha$, where $n_i$ is the dataset size and $\alpha = 0.5$ (square-root scaling). This gives sampling probabilities of approximately 45\% summarization, 43\% emotion, and 12\% topic---ensuring the small topic dataset receives proportionally more gradient updates than pure proportional sampling would provide, while still exposing the model more frequently to larger datasets. We compared this against round-robin scheduling (equal update frequency regardless of dataset size) in preliminary experiments and found temperature sampling yields substantially better emotion detection performance.
-
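The sampling rule $p_i \propto n_i^\alpha$ is a one-liner; the sketch below (function name is an assumption) reproduces the quoted probabilities from the three dataset sizes:

```python
def sampling_probs(sizes, alpha=0.5):
    # p_i proportional to n_i ** alpha: alpha=1 is proportional sampling,
    # alpha=0 is uniform, alpha=0.5 is the square-root schedule used here.
    weights = [n ** alpha for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]
```

Plugging in the training-set sizes (49,086 summarization, 43,410 emotion, 3,402 topic) gives roughly 0.45 / 0.43 / 0.12, matching the percentages stated above.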
- \textbf{Loss weighting.} Task losses are combined with fixed weights: summarization=1.0, emotion=1.0, topic=0.3. The reduced topic weight was chosen to prevent the small topic dataset (3.4K samples, exhausted in $\sim$85 steps) from dominating gradients through rapid overfitting. We did not explore dynamic weighting methods such as GradNorm \cite{chen2018gradnorm} or uncertainty weighting \cite{kendall2018multi}; given the negative transfer observed on emotion, these methods could potentially improve results and are identified as future work.
-
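The fixed-weight combination amounts to a simple scalarization of the per-task losses; a minimal sketch (names and dict layout are illustrative assumptions):

```python
TASK_WEIGHTS = {"summarization": 1.0, "emotion": 1.0, "topic": 0.3}

def combined_loss(task_losses, weights=TASK_WEIGHTS):
    # Fixed-weight scalarization: L = 1.0*L_summ + 1.0*L_emotion + 0.3*L_topic.
    return sum(weights[task] * loss for task, loss in task_losses.items())
```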
- \textbf{Gradient conflict monitoring.} To characterize optimization interference between tasks, we implement periodic gradient conflict diagnostics. At configurable intervals during training, per-task gradients are computed independently and compared via cosine similarity: $\cos(\mathbf{g}_i, \mathbf{g}_j) = \mathbf{g}_i \cdot \mathbf{g}_j / (\|\mathbf{g}_i\| \|\mathbf{g}_j\|)$. Negative cosine similarity indicates a gradient conflict---tasks pulling the shared parameters in opposing directions. Conflict rates (fraction of measured steps with $\cos < 0$) are logged to MLflow for analysis. This diagnostic does not modify training dynamics (unlike PCGrad \cite{yu2020gradient} or CAGrad \cite{liu2021conflict}), but provides empirical evidence for whether gradient conflicts contribute to observed negative transfer.
-
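The diagnostic above reduces to a cosine similarity over flattened gradient vectors plus a running conflict rate; a minimal sketch under that reading (function names are assumptions, and a real implementation would operate on framework tensors):

```python
import math

def grad_cosine(g1, g2):
    # Cosine similarity between two flattened per-task gradient vectors;
    # a negative value flags a gradient conflict.
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    return dot / (norm1 * norm2)

def conflict_rate(gradient_pairs):
    # Fraction of measured steps whose task gradients oppose each other.
    conflicts = sum(1 for g1, g2 in gradient_pairs if grad_cosine(g1, g2) < 0)
    return conflicts / len(gradient_pairs)
```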
- \textbf{Early stopping.} Early stopping is based on the combined weighted validation loss (using the same task weights as training) with patience of 3 epochs. The best checkpoint is selected by minimum combined validation loss.
-
- \subsection{Baselines and Ablations}
-
- We compare five configurations:
-
- \begin{enumerate}
- \item \textbf{Random/Majority}: Random predictions for classification; summarization is not evaluated against random baselines (ROUGE of random text is near zero).
- \item \textbf{FLAN-T5-base (zero-shot)}: Pre-trained model with task-appropriate prompts, no fine-tuning.
- \item \textbf{Single-Task}: Separate models fine-tuned on each task individually with identical hyperparameters.
- \item \textbf{Multi-Task Baseline}: Joint training with mean pooling and round-robin scheduling.
- \item \textbf{Multi-Task Improved}: Joint training with attention pooling for emotion and temperature sampling ($\alpha=0.5$).
- \end{enumerate}
-
- We additionally ablate FLAN-T5 initialization vs. random initialization to isolate transfer learning contribution.
-
- \subsection{Evaluation Metrics}
-
- \begin{itemize}
- \item \textbf{Summarization}: ROUGE-1/2/L \cite{lin2004rouge} (lexical overlap) and BLEU-4 (n-gram precision with brevity penalty). ROUGE-1 serves as the primary metric for summarization quality. BERTScore \cite{zhang2019bertscore} is available as an optional semantic similarity metric but is not used in our primary evaluation due to its high computational cost and the difficulty of interpreting its absolute values. Per-domain breakdown (literary vs. academic) is provided to analyze domain-specific quality.
- \item \textbf{Topic}: Accuracy and Macro F1 (unweighted average across 7 classes).
- \item \textbf{Emotion}: We report three complementary F1 variants: (1) \textbf{Sample-averaged F1}---computed per-sample as the harmonic mean of per-sample precision and recall, then averaged across all samples; (2) \textbf{Macro F1}---averaged per-class F1 across all 28 emotion labels, treating each class equally regardless of frequency; (3) \textbf{Micro F1}---aggregated across all class predictions, weighting by class frequency. We additionally report per-class precision, recall, and F1 for all 28 emotions, enabling fine-grained error analysis. \textbf{Per-class threshold tuning}: instead of a fixed threshold (0.3 or 0.5), we optionally tune per-class sigmoid thresholds on the validation set by sweeping $\tau \in \{0.1, 0.2, \ldots, 0.9\}$ and selecting the threshold maximizing per-class F1.
- \end{itemize}
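The per-class threshold sweep described above is a small grid search per emotion label; a self-contained sketch for one class (function names are assumptions, not the repository's API):

```python
def f1(tp, fp, fn):
    # Harmonic mean of precision and recall, expressed via counts.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_threshold(probs, labels, grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    # Sweep tau over the grid on validation data for a single class and
    # return the threshold that maximizes that class's F1.
    best_tau, best_f1 = 0.5, -1.0
    for tau in grid:
        preds = [p >= tau for p in probs]
        tp = sum(1 for p, y in zip(preds, labels) if p and y)
        fp = sum(1 for p, y in zip(preds, labels) if p and not y)
        fn = sum(1 for p, y in zip(preds, labels) if not p and y)
        score = f1(tp, fp, fn)
        if score > best_f1:
            best_tau, best_f1 = tau, score
    return best_tau, best_f1
```

Running this independently for each of the 28 labels yields the per-class threshold vector used for the tuned macro-F1 numbers.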
-
- \textbf{Statistical rigor.} To address limitations of single-seed evaluation, we implement bootstrap confidence intervals (1,000 resamples, 95\% percentile CI) for all key metrics. For summarization, per-sample ROUGE-1 and ROUGE-L scores are bootstrapped; for emotion, per-sample F1 values; for topic, per-sample correctness indicators. We additionally provide \texttt{paired\_bootstrap\_test()} for comparing two system configurations on the same test set (null hypothesis: system B $\leq$ system A). Multi-seed evaluation infrastructure (\texttt{train\_multiseed.py}) automates training across $k$ seeds and reports mean $\pm$ standard deviation across runs, enabling variance-aware claims. Results in Table~\ref{tab:main_results} remain single-seed but should be validated with multi-seed runs before drawing strong conclusions.
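A percentile bootstrap over per-sample scores, as used here, can be sketched in a few lines (the function name, seeding, and index rounding are illustrative assumptions):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, level=0.95, seed=0):
    # Percentile bootstrap CI for the mean of per-sample metric scores:
    # resample with replacement, collect means, take the 2.5/97.5 percentiles.
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n for _ in range(n_resamples)
    )
    lower = means[int(round((1 - level) / 2 * (n_resamples - 1)))]
    upper = means[int(round((1 + level) / 2 * (n_resamples - 1)))]
    return lower, upper
```

The same resampling indices, applied to two systems' per-sample scores simultaneously, give the paired bootstrap test mentioned above.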
-
- %=============================================================================
- \section{Results}
- %=============================================================================
-
- \subsection{Main Results}
-
- Table \ref{tab:main_results} compares single-task specialists, baseline MTL (mean pooling, round-robin scheduling), and improved MTL (attention pooling, temperature sampling).
-
- \begin{table}[htbp]
- \centering
- \caption{Main Results. Single-Task and MTL Baseline use mean pooling and round-robin scheduling. MTL Improved uses attention pooling for emotion and temperature sampling ($\alpha=0.5$). All results are single-seed. Bold indicates best.}
- \label{tab:main_results}
- \begin{tabular}{llccc}
- \toprule
- \textbf{Task} & \textbf{Metric} & \textbf{Single} & \textbf{MTL Base} & \textbf{MTL Impr.} \\
- \midrule
- \multirow{3}{*}{Summ.} & ROUGE-1 & 0.298 & 0.306 & \textbf{0.310} \\
- & ROUGE-2 & 0.085 & 0.090 & \textbf{0.091} \\
- & ROUGE-L & 0.179 & 0.183 & \textbf{0.185} \\
- \midrule
- \multirow{2}{*}{Topic} & Accuracy & 82.0\% & 85.2\% & \textbf{85.7\%} \\
- & Macro F1 & 0.812 & 0.847 & \textbf{0.854} \\
- \midrule
- \multirow{3}{*}{Emotion} & Sample F1 & 0.218 & 0.199 & \textbf{0.352} \\
- & Macro F1 & --- & --- & 0.143 \\
- & Micro F1 & --- & --- & \textbf{0.443} \\
- \bottomrule
- \end{tabular}
- \end{table}
-
- \textbf{Key finding}: Attention pooling and temperature sampling yield improvements across \textit{all} tasks, with the largest impact on emotion detection:
245
-
246
- \begin{itemize}
247
- \item \textbf{Emotion detection: negative transfer eliminated.} Baseline MTL with mean pooling degraded emotion F1 by $-$0.019 vs. single-task. With attention pooling and temperature sampling, multi-task emotion F1 improves to 0.352---a +0.134 gain over single-task (0.218) and +0.153 over baseline MTL (0.199). The attention pooling mechanism allows the emotion head to focus on emotionally salient tokens rather than averaging over the full sequence, which is critical for the sparse multi-label task. Temperature sampling ensures the emotion task receives proportional gradient exposure ($\sim$43\% of steps).
248
-
249
- \item \textbf{Topic classification: +3.7\% accuracy} over single-task (85.7\% vs. 82.0\%, 95\% CI: [80.4\%, 91.0\%]). The small topic dataset (3.4K samples) benefits from shared encoder representations learned from the larger summarization corpus (49K samples). The bootstrap CI is wide due to the small validation set (189 samples), but the lower bound (80.4\%) still exceeds the single-task point estimate.
250
-
251
- \item \textbf{Summarization remains stable} across all configurations. ROUGE-1 improves slightly from 0.298 (single-task) to 0.310 (improved MTL). The decoder---which contains half the model's parameters---insulates summarization from classification interference. ROUGE-1 95\% CI: [0.306, 0.313].
252
- \end{itemize}
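The attention-pooling mechanism credited above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the learned query vector `q`, the shapes, and the masking constant are assumptions, not the repository's actual implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, q, mask):
    # H: (T, d) token states; q: (d,) learned query; mask: (T,) 1 for real tokens.
    # Padding positions get a large negative score so softmax assigns them ~0 weight.
    scores = np.where(mask > 0, H @ q, -1e9)
    w = softmax(scores)
    return w @ H, w  # weighted sum emphasizes salient tokens; w exposes the weights

rng = np.random.default_rng(0)
T, d = 6, 4
H = rng.normal(size=(T, d))
mask = np.array([1, 1, 1, 1, 0, 0], dtype=float)  # last two tokens are padding
q = rng.normal(size=d)
pooled, w = attention_pool(H, q, mask)
```

Unlike mean pooling, the weighted sum lets a single emotionally salient token dominate the pooled representation.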
253
-
254
- \subsection{Baseline Comparisons}
255
- \label{sec:baseline_discussion}
256
-
257
- Table \ref{tab:baselines} contextualizes our results against trivial and zero-shot baselines.
258
-
259
- \begin{table}[htbp]
260
- \centering
261
- \caption{Comparison with Baselines (Improved MTL Configuration)}
262
- \label{tab:baselines}
263
- \begin{tabular}{lccc}
264
- \toprule
265
- \textbf{Model} & \textbf{Summ (R-L)} & \textbf{Topic (Acc)} & \textbf{Emot (F1)} \\
266
- \midrule
267
- Random/Majority & --- & 14.3\% & 0.036 \\
268
- FLAN-T5 zero-shot & 0.121 & 58.2\% & 0.089 \\
269
- Single-Task & 0.179 & 82.0\% & 0.218 \\
270
- \textbf{Multi-Task (Impr.)} & \textbf{0.185} & \textbf{85.7\%} & \textbf{0.352} \\
271
- \bottomrule
272
- \end{tabular}
273
- \end{table}
274
-
275
- Fine-tuning provides substantial gains over zero-shot across all tasks (+0.064 ROUGE-L, +27\% topic accuracy, +0.13 emotion F1), demonstrating the importance of domain adaptation even with instruction-tuned models. The improved MTL configuration also surpasses the single-task baselines on all three tasks, showing that the combination of attention pooling and temperature sampling enables positive transfer even for the domain-mismatched emotion task.
276
-
277
- \subsection{Ablation: Transfer Learning Contribution}
278
-
279
- Table \ref{tab:transfer_ablation} isolates the contribution of FLAN-T5 pre-training by comparing against random initialization with identical architecture and training.
280
-
281
- \begin{table}[htbp]
282
- \centering
283
- \caption{Effect of Pre-trained Initialization (Improved MTL Setting)}
284
- \label{tab:transfer_ablation}
285
- \begin{tabular}{lccc}
286
- \toprule
287
- \textbf{Initialization} & \textbf{Summ (R-L)} & \textbf{Topic (Acc)} & \textbf{Emot (F1)} \\
288
- \midrule
289
- Random & 0.098 & 45.2\% & 0.082 \\
290
- FLAN-T5-base & \textbf{0.185} & \textbf{85.7\%} & \textbf{0.352} \\
291
- \midrule
292
- \textit{Absolute gain} & +0.087 & +40.5\% & +0.270 \\
293
- \bottomrule
294
- \end{tabular}
295
- \end{table}
296
-
297
- FLAN-T5 initialization provides large absolute gains across all tasks. \textbf{Pre-training is necessary for competitive performance}---random initialization produces substantially worse results on all tasks even with identical data and training budget. Fine-tuning provides the remaining domain adaptation that zero-shot pre-training alone cannot achieve.
298
-
299
- \subsection{Per-Class Topic Analysis}
300
-
301
- Table \ref{tab:topic_breakdown} reveals per-class patterns in topic classification across the 7 classes.
302
-
303
- \begin{table}[htbp]
304
- \centering
305
- \caption{Per-Class Topic Classification (Improved MTL)}
306
- \label{tab:topic_breakdown}
307
- \begin{tabular}{lccc}
308
- \toprule
309
- \textbf{Topic} & \textbf{Precision} & \textbf{Recall} & \textbf{F1} \\
310
- \midrule
311
- Arts & 0.93 & 0.79 & 0.86 \\
312
- Business & 0.97 & 1.00 & 0.98 \\
313
- Fiction & 0.95 & 1.00 & 0.97 \\
314
- History & 0.85 & 0.76 & 0.80 \\
315
- Philosophy & 0.79 & 0.82 & 0.81 \\
316
- Science & 0.57 & 0.80 & 0.67 \\
317
- Technology & 0.89 & 0.89 & 0.89 \\
318
- \midrule
319
- \textit{Macro Avg} & 0.85 & 0.87 & 0.85 \\
320
- \bottomrule
321
- \end{tabular}
322
- \end{table}
323
-
324
- Fiction and Business achieve near-perfect classification (F1 $\geq$ 0.97), while Science shows the most confusion (F1 = 0.67). Error analysis reveals Science samples are frequently misclassified as Technology---semantically plausible given that scientific research papers often describe technical methods. The Arts class shows lower recall (0.79), suggesting some arts-related texts are misclassified into adjacent categories.
325
-
326
- \subsection{Per-Domain Summarization Analysis}
327
-
328
- Table \ref{tab:domain_breakdown} reveals a substantial quality gap between academic and literary summarization, reflecting the 11:1 training imbalance.
329
-
330
- \begin{table}[htbp]
331
- \centering
332
- \caption{Per-Domain Summarization Performance (Improved MTL)}
333
- \label{tab:domain_breakdown}
334
- \begin{tabular}{lcccc}
335
- \toprule
336
- \textbf{Domain} & \textbf{N} & \textbf{ROUGE-1} & \textbf{ROUGE-L} & \textbf{BLEU-4} \\
337
- \midrule
338
- Academic & 2,493 & 0.319 & 0.189 & 0.026 \\
339
- Literary & 234 & 0.206 & 0.137 & 0.008 \\
340
- \midrule
341
- \textit{Overall} & 2,727 & 0.310 & 0.185 & 0.024 \\
342
- \bottomrule
343
- \end{tabular}
344
- \end{table}
345
-
346
- Academic summaries (ROUGE-1: 0.319) outperform literary summaries (ROUGE-1: 0.206) by +0.113, a large gap attributable to two factors: (1) the encoder is disproportionately trained on academic text ($\sim$45K academic vs. $\sim$4K literary), and (2) academic abstracts follow more predictable structural conventions (background-method-result) that are easier for the model to reproduce. Literary descriptions---which describe \textit{what a book is about} in narrative prose---require more creative generation.
347
-
348
- \subsection{Analysis: Emotion Detection Improvements}
349
- \label{sec:emotion_analysis}
350
-
351
- Our improved multi-task emotion sample-averaged F1 (0.352) represents a dramatic improvement over the baseline MTL configuration (0.199). With per-class threshold tuning, macro F1 reaches 0.294---approaching published GoEmotions baselines (0.46 macro F1 with BERT-base \cite{demszky2020goemotions}). We analyze the contributing factors:
352
-
353
- \begin{enumerate}
354
- \item \textbf{Attention pooling is critical.} Replacing mean pooling with a learned attention query allows the emotion head to focus on emotionally salient tokens. In our 28-class multi-label setting, emotional signals are typically concentrated in specific words or phrases (e.g., ``grateful,'' ``hilarious,'' ``heartbreaking''), which mean pooling dilutes across the full 512-token sequence. The top-performing classes---gratitude (F1: 0.888), amusement (0.751), love (0.740), admiration (0.653)---correspond to emotions with distinctive lexical markers that attention pooling can localize.
355
-
356
- \item \textbf{Temperature sampling improves optimization.} With round-robin scheduling, emotion receives equal update frequency as the other tasks, but the summarization decoder backpropagates much larger gradients through the encoder, skewing shared representations toward academic text style. Temperature sampling ($\alpha=0.5$) allocates $\sim$43\% of steps to emotion---proportional to its dataset size---ensuring the encoder maintains emotion-relevant features.
357
-
358
- \item \textbf{Remaining class-level gaps.} Despite overall improvement, 15 of 28 emotion classes still have zero F1 at the default 0.5 threshold (including approval, annoyance, disapproval, anger). These tend to be either rare classes ($<$100 support) or semantically subtle emotions that overlap with other classes. Per-class threshold tuning recovers non-zero performance for most of these classes, increasing macro F1 from 0.143 to 0.294.
359
-
360
- \item \textbf{Domain gap persists.} Despite improvements, the remaining gap vs. published GoEmotions baselines (0.46 macro F1) reflects the fundamental domain mismatch between Reddit comments and our literary/academic encoder. Encoder-only architectures (BERT) dedicate full model capacity to classification, whereas our encoder is optimized primarily for summarization decoding.
361
- \end{enumerate}
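The temperature-sampling rule from point 2 can be sketched directly: sampling probabilities are proportional to $n_i^{\alpha}$. The dataset sizes below are approximations inferred from the paper (49K summarization, ~43K GoEmotions, 3.4K topic), not exact counts.

```python
import numpy as np

def task_sampling_probs(sizes, alpha=0.5):
    # p_i ∝ n_i**alpha; alpha=1 recovers proportional sampling, alpha=0 is uniform
    p = np.asarray(sizes, dtype=float) ** alpha
    return p / p.sum()

# Approximate dataset sizes: (summarization, emotion, topic)
sizes = [49000, 43000, 3400]
probs = task_sampling_probs(sizes, alpha=0.5)
# With alpha=0.5, emotion receives roughly 43% of steps, matching the paper's figure,
# while the small topic dataset is boosted well above its ~3.6% proportional share.
```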
362
-
363
- \textbf{Per-class threshold tuning results.} Sweeping $\tau \in \{0.1, \ldots, 0.9\}$ per class on the validation set yields tuned sample-averaged F1 of 0.503, tuned macro F1 of 0.294, and tuned micro F1 of 0.486. The optimal thresholds vary widely: gratitude saturates at $\tau=0.65$ (high confidence predictions), while rare classes require $\tau \leq 0.2$ to achieve non-zero recall.
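A minimal version of the per-class threshold sweep reads as follows; the grid, the synthetic validation data, and the tie-breaking behavior are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def tune_threshold(probs, labels, grid=np.arange(0.1, 0.91, 0.05)):
    # Pick the decision threshold maximizing F1 for one class on validation data
    best_tau, best_f1 = 0.5, -1.0
    for tau in grid:
        pred = probs >= tau
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1

# Synthetic rare class with weak, low-confidence scores (hypothetical data)
rng = np.random.default_rng(1)
labels = (rng.random(500) < 0.05).astype(int)
probs = np.clip(0.15 * labels + 0.3 * rng.random(500), 0.0, 1.0)
tau, f1 = tune_threshold(probs, labels)
```

Repeating this sweep independently per class yields the widely varying optima reported above (e.g., high $\tau$ for gratitude, $\tau \leq 0.2$ for rare classes).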
364
-
365
- \textbf{Implication}: Architectural isolation of classification heads (attention pooling) combined with balanced optimization (temperature sampling) can overcome domain mismatch in MTL, converting negative transfer to substantial positive transfer.
366
-
367
- \subsection{Training Dynamics}
368
-
369
- Figure \ref{fig:training_curves} shows training progression over 8 epochs (approximately 9 hours on RTX 4070 with temperature sampling).
370
-
371
- \begin{figure}[htbp]
372
- \centering
373
- \includegraphics[width=\columnwidth]{figures/training_loss_curve.png}
374
- \caption{Training and validation loss with temperature sampling and attention pooling. Combined validation loss decreases from 4.298 to 3.925 over 8 epochs; best checkpoint at epoch 8.}
375
- \label{fig:training_curves}
376
- \end{figure}
377
-
378
- Key observations:
379
- \begin{itemize}
380
- \item Topic classification converges rapidly: 91\% training accuracy by epoch 3 (84\% validation), reaching 98\% by epoch 8. Validation accuracy plateaus near 86\% from epoch 2 onward, while training accuracy continues climbing---a sign of mild overfitting on the small (3.4K) topic dataset. The reduced task weight (0.3) limits gradient dominance.
381
- \item Summarization training loss decreases steadily (4.057 $\rightarrow$ 3.699), with validation loss flattening after epoch 5 (3.665 $\rightarrow$ 3.653). Training ROUGE-1 improves from 0.287 to 0.308.
382
- \item Emotion F1 improves steadily throughout training: validation F1 rises from 0.197 (epoch 1) to 0.459 (epoch 8), indicating the attention pooling mechanism continues refining its weights over the full training duration.
383
- \item Combined validation loss decreases from 4.298 (epoch 1) to 3.925 (epoch 8), though the decrease is marginal after epoch 5. Early stopping (patience=3) did not trigger because the combined loss continued improving slightly each epoch. Additional epochs could yield further modest gains, though the near-plateau after epoch 5 suggests diminishing returns.
384
- \end{itemize}
385
-
386
- %=============================================================================
387
- \section{Discussion}
388
- %=============================================================================
389
-
390
- \subsection{When Does MTL Help?}
391
-
392
- Our results demonstrate that MTL effectiveness depends on both task relatedness \textit{and} architectural/optimization choices:
393
-
394
- \textbf{MTL helps when}: (1) A small-dataset task (topic: 3.4K samples) shares domain with a large-dataset task (summarization: 49K literary/academic samples)---the topic classifier benefits from shared encoder representations tuned to literary and academic vocabulary. (2) Task-specific heads are architecturally isolated from shared representations---attention pooling for emotion allows task-specific feature extraction without interfering with the shared encoder.
395
-
396
- \textbf{MTL requires intervention when}: An auxiliary task's domain is misaligned with the primary training signal. With naive mean pooling, emotion detection suffered negative transfer because the encoder's representations were skewed toward summarization. Attention pooling and temperature sampling together overcame this: attention pooling provides architectural isolation, while temperature sampling ensures balanced optimization.
397
-
398
- \textbf{MTL is neutral for}: The primary task (summarization) with sufficient data and a dedicated component (decoder, $\sim$136M parameters) that insulates it from interference. Classification heads are small and their gradients have limited impact relative to the decoder's backpropagation signal.
399
-
400
- \subsection{Comparison to MTL Literature}
401
-
402
- Our findings align qualitatively with several key results in the MTL literature. Standley et al. \cite{standley2020tasks} showed that task groupings critically affect MTL outcomes---our baseline results (positive transfer for topic, negative for emotion) confirmed this, but our improved configuration shows that \textit{architectural interventions can change these grouping dynamics}. Yu et al. \cite{yu2020gradient} demonstrated that gradient conflicts between tasks explain negative transfer; our gradient conflict diagnostics (Section~3.4) enable empirical measurement of inter-task gradient cosine similarity, and our temperature sampling partially addresses gradient imbalance by controlling task exposure frequency. Aribandi et al. \cite{aribandi2022ext5} found diminishing or negative returns from adding more tasks; our results suggest that per-task architectural isolation (attention pooling) can mitigate this.
403
-
404
- A key difference from the broader MTL literature is our use of an encoder-decoder architecture with mixed generative and discriminative tasks. Most MTL studies use encoder-only models for classification-only task sets. The encoder-decoder setup creates an asymmetry: the summarization task dominates the encoder through decoder backpropagation, while classification tasks receive shared representations as a secondary benefit or detriment. Our results show that task-specific pooling strategies can partially compensate for this asymmetry. Recent neuron-centric analysis \cite{neuroncentric2024} suggests that individual neurons specialize for different tasks, which could inform more targeted isolation strategies.
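The gradient-conflict diagnostic referenced here reduces to a cosine similarity between flattened per-task gradients of the shared encoder. A minimal sketch, with toy gradient vectors standing in for real encoder gradients:

```python
import numpy as np

def grad_cosine(g1, g2):
    # Cosine similarity between two tasks' flattened shared-parameter gradients;
    # values near -1 indicate directly conflicting update directions
    g1, g2 = np.ravel(g1), np.ravel(g2)
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2) + 1e-12))

# Toy example: exactly opposed gradients yield cosine ~ -1
g_summ = np.array([0.5, -0.2, 0.1])
g_emo = np.array([-0.5, 0.2, -0.1])
cos = grad_cosine(g_summ, g_emo)
```

Logging this quantity periodically during training costs one extra backward pass per monitored task pair and reveals whether interventions such as PCGrad are warranted.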
405
-
406
- \subsection{Implications for Practitioners}
407
-
408
- Based on our findings:
409
-
410
- \begin{enumerate}
411
- \item \textbf{Audit domain alignment} before combining tasks in MTL. If auxiliary tasks draw from different text domains (e.g., social media vs. academic), negative transfer is likely unless mitigated by gradient-conflict methods or per-task adapters.
412
-
413
- \item \textbf{Task weighting matters} for preventing small-dataset overfitting. Our reduced weight (0.3) for topic classification prevented gradient dominance while still enabling positive transfer. Dynamic methods (GradNorm \cite{chen2018gradnorm}) may yield better balance automatically.
414
-
415
- \item \textbf{Architectural isolation protects high-priority tasks}. Summarization's dedicated decoder shielded it from classification interference. For classification tasks, per-task adapter layers \cite{houlsby2019parameter} or LoRA modules \cite{hu2022lora} could provide analogous isolation. Learned attention pooling (replacing mean pooling) is a lightweight isolation strategy for multi-label heads that improves focus on task-relevant tokens.
416
-
417
- \item \textbf{Monitor gradient conflicts} before deploying MTL. Inter-task gradient cosine similarity monitoring (at negligible computational cost) reveals whether tasks interfere at the optimization level, informing the choice between simple fixed weights and more sophisticated methods (PCGrad, Ortho-LoRA).
418
-
419
- \item \textbf{Use temperature-based sampling} when dataset sizes vary widely. Square-root temperature ($\alpha=0.5$) balances exposure across tasks without starving small-dataset tasks.
420
-
421
- \item \textbf{Validate with multiple seeds} before drawing conclusions from MTL comparisons, especially with small validation sets. Bootstrap confidence intervals provide within-run uncertainty estimates; multi-seed runs capture cross-run variance.
422
- \end{enumerate}
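The bootstrap confidence intervals recommended in point 6 can be computed with a simple percentile bootstrap over per-sample metric scores; the resample count and the synthetic score distribution below are illustrative choices.

```python
import numpy as np

def bootstrap_ci(scores, n_boot=2000, level=0.95, seed=0):
    # Percentile bootstrap over per-example scores (e.g., per-example ROUGE-L)
    rng = np.random.default_rng(seed)
    n = len(scores)
    means = np.array([np.mean(rng.choice(scores, size=n, replace=True))
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return float(np.mean(scores)), float(lo), float(hi)

# Hypothetical per-example scores centered near the paper's ROUGE-1 of ~0.31
scores = np.random.default_rng(2).normal(0.31, 0.1, size=500)
mean, lo, hi = bootstrap_ci(scores)
```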
423
-
424
- \subsection{Limitations}
425
- \label{sec:limitations}
426
-
427
- We identify several limitations that constrain the generalizability of our findings:
428
-
429
- \begin{itemize}
430
- \item \textbf{Single-seed results}: Reported results are from single training runs. The +3.2\% topic accuracy gain (on 189 validation samples) could be within random variance. We provide bootstrap confidence intervals to partially address this, and multi-seed evaluation infrastructure (\texttt{train\_multiseed.py}) to enable variance estimation across seeds. Results should be validated with $\geq$3 seeds before drawing strong conclusions.
431
-
432
- \item \textbf{Gradient-conflict diagnostics but no mitigation}: We monitor inter-task gradient cosine similarity to characterize conflicts, but do not apply corrective methods such as PCGrad \cite{yu2020gradient}, CAGrad \cite{liu2021conflict}, GradNorm \cite{chen2018gradnorm}, or uncertainty weighting \cite{kendall2018multi}. These methods could provide additional gains beyond our attention pooling and temperature sampling improvements.
433
-
434
- \item \textbf{No encoder-only baseline}: We do not compare against BERT or RoBERTa fine-tuned on GoEmotions or topic classification. Such a comparison would disentangle architecture effects from MTL effects in our classification results. The remaining gap between our tuned macro F1 (0.294) and published GoEmotions baselines (0.46) likely reflects this architectural difference.
435
-
436
- \item \textbf{Cross-task data leakage}: Although topic and summarization datasets draw from overlapping sources (arXiv, Project Gutenberg), we implement cross-task deduplication via MD5 fingerprinting to prevent data leakage. However, residual near-duplicates (paraphrases, overlapping passages below the fingerprint threshold) may still exist and could inflate topic classification performance in the MTL setting.
437
-
438
- \item \textbf{Dataset construction noise}: Topic labels are derived from source metadata (arXiv categories, Gutenberg subjects) via automatic mapping to our 7-class taxonomy. No manual annotation or quality verification was performed. We conducted a manual inspection of 50 random topic samples and found $\sim$90\% accuracy in the automatic mapping, with errors concentrated in ambiguous categories (e.g., ``History of Science'' mapped to History rather than Science). This noise level is acceptable for our analysis but limits the precision of per-class findings.
439
-
440
- \item \textbf{No human evaluation}: ROUGE scores are imperfect proxies for summary quality, especially for creative/literary text where stylistic quality matters beyond semantic accuracy.
441
-
442
- \item \textbf{Single model scale}: We study only FLAN-T5-base (272M parameters). Transfer dynamics may differ at larger scales (T5-large, T5-xl), where increased capacity could reduce task interference.
443
-
444
- \item \textbf{Summarization domain imbalance}: The $\sim$11:1 ratio of academic to literary samples within the summarization task means the encoder is disproportionately shaped by academic text; the per-domain results in Table \ref{tab:domain_breakdown} quantify the resulting quality gap.
445
- \end{itemize}
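The MD5-fingerprint deduplication mentioned in the cross-task leakage limitation can be sketched as follows. The whitespace/case normalization shown is an assumption; as the limitation notes, any such exact-hash scheme misses paraphrases and partial overlaps.

```python
import hashlib

def fingerprint(text):
    # Normalize whitespace and case before hashing so trivial variants collide
    norm = " ".join(text.lower().split())
    return hashlib.md5(norm.encode("utf-8")).hexdigest()

def dedup_across_tasks(task_a, task_b):
    # Drop task_b examples whose fingerprint already appears in task_a
    seen = {fingerprint(t) for t in task_a}
    return [t for t in task_b if fingerprint(t) not in seen]

# Hypothetical texts: the first topic sample duplicates a summarization source
summ = ["The cat sat on the mat.", "Quantum computing basics."]
topic = ["the cat  sat on the mat.", "A novel about whales."]
kept = dedup_across_tasks(summ, topic)
```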
446
-
447
- \subsection{Future Work}
448
-
449
- \begin{itemize}
450
- \item \textbf{Gradient-conflict mitigation}: Our gradient conflict diagnostics provide the empirical foundation; the natural next step is applying Ortho-LoRA \cite{ortholora2025} for orthogonal gradient projection, PCGrad \cite{yu2020gradient} for gradient surgery, or CAGrad \cite{liu2021conflict} for conflict-averse optimization. These methods directly target the interference our diagnostics characterize.
451
-
452
- \item \textbf{Parameter-efficient multi-tasking}: PiKE \cite{pike2025} for selective knowledge exchange between tasks, per-task LoRA adapters \cite{hu2022lora}, ScaLearn \cite{scallearn2023} shared attention with task-specific scaling, or adapter layers \cite{houlsby2019parameter} to provide task-specific specialization while maintaining shared encoder representations. These methods offer a spectrum from minimal (LoRA) to moderate (PiKE, ScaLearn) additional parameters.
453
-
454
- \item \textbf{Principled task grouping}: Applying transfer-gain estimation methods \cite{taskgrouping2024} to determine whether emotion should be trained jointly with summarization and topic, or in a separate group. Neuron-centric analysis \cite{neuroncentric2024} could further guide which encoder layers to share vs. specialize.
455
-
456
- \item \textbf{Encoder-only comparison}: Fine-tuning BERT/RoBERTa on topic and emotion classification, with and without multi-task training, to disentangle encoder-decoder architecture effects from MTL effects.
457
-
458
- \item \textbf{Multi-seed evaluation with confidence intervals}: Our \texttt{train\_multiseed.py} infrastructure enables running $k$ seeds per configuration with automated aggregation. Running $\geq$5 seeds would establish statistical significance of observed transfer effects via bootstrap tests.
459
-
460
- \item \textbf{Domain-specific emotion annotation}: Collecting emotion annotations on literary and academic text to study whether in-domain emotion data eliminates the negative transfer.
461
-
462
- \item \textbf{Temperature sampling ablation}: Comparing round-robin vs. temperature-based sampling ($\alpha \in \{0.3, 0.5, 0.7, 1.0\}$) to quantify the effect of scheduling strategy on task-specific performance, particularly for the low-resource topic classification task.
463
- \end{itemize}
464
-
465
- %=============================================================================
466
- \section{Conclusion}
467
- %=============================================================================
468
-
469
- We investigated multi-task learning for literary and academic text understanding, combining abstractive summarization, topic classification, and multi-label emotion detection in an encoder-decoder architecture. Our key finding is that naive MTL with mean pooling produces heterogeneous transfer effects---positive for topic (+3.7\%), negative for emotion ($-$0.02 F1)---but that targeted interventions can eliminate negative transfer entirely. Learned attention pooling for the emotion head, combined with temperature-based task sampling ($\alpha=0.5$), improves multi-task emotion F1 from 0.199 to 0.352 (+77\%), surpassing the single-task baseline. With per-class threshold tuning, macro F1 reaches 0.294. Summarization quality remains robust across configurations (ROUGE-1: 0.310, ROUGE-L: 0.185), with per-domain analysis revealing a quality gap between academic (ROUGE-1: 0.319) and literary (ROUGE-1: 0.206) summaries driven by training data imbalance.
470
-
471
- These results demonstrate that negative transfer in MTL is not an inherent limitation but can be addressed through architectural isolation (task-specific pooling) and balanced optimization (temperature sampling). Pre-trained initialization (FLAN-T5) remains essential for competitive performance across all tasks. Promising follow-up directions include Ortho-LoRA \cite{ortholora2025} for gradient orthogonalization, PiKE \cite{pike2025} for parameter-efficient knowledge exchange, and principled task grouping \cite{taskgrouping2024} to guide which tasks to train jointly. We provide our code, trained models, and datasets to enable replication and extension.
472
-
473
- Code and models: \url{https://github.com/OliverPerrin/LexiMind}\\
474
- Live demo: \url{https://huggingface.co/spaces/OliverPerrin/LexiMind}
475
-
476
- %=============================================================================
477
- % References
478
- %=============================================================================
479
-
480
- \begin{thebibliography}{00}
481
-
482
- \bibitem{caruana1997multitask}
483
- R. Caruana, ``Multitask learning,'' \textit{Machine Learning}, vol. 28, no. 1, pp. 41--75, 1997.
484
-
485
- \bibitem{collobert2011natural}
486
- R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, ``Natural language processing (almost) from scratch,'' \textit{JMLR}, vol. 12, pp. 2493--2537, 2011.
487
-
488
- \bibitem{johnson2017google}
489
- M. Johnson et al., ``Google's multilingual neural machine translation system: Enabling zero-shot translation,'' \textit{TACL}, vol. 5, pp. 339--351, 2017.
490
-
491
- \bibitem{mccann2018natural}
492
- B. McCann, N. S. Keskar, C. Xiong, and R. Socher, ``The natural language decathlon: Multitask learning as question answering,'' \textit{arXiv:1806.08730}, 2018.
493
-
494
- \bibitem{standley2020tasks}
495
- T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, and S. Savarese, ``Which tasks should be learned together in multi-task learning?'' in \textit{ICML}, 2020.
496
-
497
- \bibitem{yu2020gradient}
498
- T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, ``Gradient surgery for multi-task learning,'' in \textit{NeurIPS}, 2020.
499
-
500
- \bibitem{liu2021conflict}
501
- B. Liu, X. Liu, X. Jin, P. Stone, and Q. Liu, ``Conflict-averse gradient descent for multi-task learning,'' in \textit{NeurIPS}, 2021.
502
-
503
- \bibitem{chen2018gradnorm}
504
- Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, ``GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks,'' in \textit{ICML}, 2018.
505
-
506
- \bibitem{kendall2018multi}
507
- A. Kendall, Y. Gal, and R. Cipolla, ``Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,'' in \textit{CVPR}, 2018.
508
-
509
- \bibitem{aghajanyan2021muppet}
510
- A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen, L. Zettlemoyer, and S. Gupta, ``Muppet: Massive multi-task representations with pre-finetuning,'' in \textit{EMNLP}, 2021.
511
-
512
- \bibitem{aribandi2022ext5}
513
- V. Aribandi et al., ``ExT5: Towards extreme multi-task scaling for transfer learning,'' in \textit{ICLR}, 2022.
514
-
515
- \bibitem{raffel2020exploring}
516
- C. Raffel et al., ``Exploring the limits of transfer learning with a unified text-to-text transformer,'' \textit{JMLR}, vol. 21, no. 140, pp. 1--67, 2020.
517
-
518
- \bibitem{chung2022scaling}
519
- H. W. Chung et al., ``Scaling instruction-finetuned language models,'' \textit{arXiv:2210.11416}, 2022.
520
-
521
- \bibitem{nallapati2016abstractive}
522
- R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang, ``Abstractive text summarization using sequence-to-sequence RNNs and beyond,'' in \textit{CoNLL}, 2016.
523
-
524
- \bibitem{narayan2018don}
525
- S. Narayan, S. B. Cohen, and M. Lapata, ``Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization,'' in \textit{EMNLP}, 2018.
526
-
527
- \bibitem{kryscinski2021booksum}
528
- W. Kryscinski, N. Rajani, D. Agarwal, C. Xiong, and D. Radev, ``BookSum: A collection of datasets for long-form narrative summarization,'' in \textit{Findings of EMNLP}, 2021.
529
-
530
- \bibitem{cohan2018discourse}
531
- A. Cohan et al., ``A discourse-aware attention model for abstractive summarization of long documents,'' in \textit{NAACL-HLT}, 2018.
532
-
533
- \bibitem{mao2022citesum}
534
- Y. Mao, M. Zhong, and J. Han, ``CiteSum: Citation text-guided scientific extreme summarization and domain adaptation with limited supervision,'' in \textit{EMNLP}, 2022.
535
-
536
- \bibitem{demszky2020goemotions}
537
- D. Demszky et al., ``GoEmotions: A dataset of fine-grained emotions,'' in \textit{ACL}, 2020.
538
-
539
- \bibitem{zhang2019root}
540
- B. Zhang and R. Sennrich, ``Root mean square layer normalization,'' in \textit{NeurIPS}, 2019.
541
-
542
- \bibitem{lin2004rouge}
543
- C.-Y. Lin, ``ROUGE: A package for automatic evaluation of summaries,'' in \textit{Text Summarization Branches Out}, 2004.
544
-
545
- \bibitem{zhang2019bertscore}
546
- T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, ``BERTScore: Evaluating text generation with BERT,'' in \textit{ICLR}, 2020.
547
-
548
- \bibitem{hu2022lora}
549
- E. J. Hu et al., ``LoRA: Low-rank adaptation of large language models,'' in \textit{ICLR}, 2022.
550
-
551
- \bibitem{houlsby2019parameter}
552
- N. Houlsby et al., ``Parameter-efficient transfer learning for NLP,'' in \textit{ICML}, 2019.
553
-
554
- \bibitem{lin2017focal}
555
- T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Doll\'{a}r, ``Focal loss for dense object detection,'' in \textit{ICCV}, 2017.
556
-
557
- \bibitem{ortholora2025}
558
- B. Li et al., ``Ortho-LoRA: Orthogonal low-rank adaptation for multi-task learning,'' \textit{arXiv:2601.09684}, 2025.
559
-
560
- \bibitem{pike2025}
561
- Y. Wang et al., ``PiKE: Parameter-efficient knowledge exchange for multi-task learning,'' \textit{arXiv:2502.06244}, 2025.
562
-
563
- \bibitem{scallearn2023}
564
- H. Sun et al., ``ScaLearn: Simple and highly parameter-efficient task transfer by learning to scale,'' \textit{arXiv:2310.01217}, 2023.
565
-
566
- \bibitem{taskgrouping2024}
567
- S. Chen et al., ``Multi-task learning with task grouping via transfer-gain estimates,'' \textit{arXiv:2402.15328}, 2024.
568
-
569
- \bibitem{neuroncentric2024}
570
- A. Foroutan et al., ``What do neurons in multi-task language models encode? A neuron-centric analysis,'' \textit{arXiv:2407.06488}, 2024.
571
-
572
- \end{thebibliography}
573
-
574
- \end{document}
 
kaggle.json DELETED
@@ -1 +0,0 @@
1
- {"username":"oliverperrin","key":"af2c73b5725d2839410a1f72cf84cf48"}
 
 
outputs/evaluation_report.json DELETED
@@ -1,260 +0,0 @@
1
- {
2
- "summarization": {
3
- "rouge1": 0.3094793058747055,
4
- "rouge2": 0.09069756722817666,
5
- "rougeL": 0.1847154828322755,
6
- "bleu4": 0.023982657019404153,
7
- "num_samples": 2727,
8
- "per_domain": {
9
- "academic": {
10
- "num_samples": 2493,
11
- "rouge1": 0.31919183681728475,
12
- "rouge2": 0.0968589097730544,
13
- "rougeL": 0.18921129182459423,
14
- "bleu4": 0.02551610700902003
15
- },
16
- "literary": {
17
- "num_samples": 234,
18
- "rouge1": 0.2060034954479976,
19
- "rouge2": 0.0250555716539014,
20
- "rougeL": 0.1368178254910352,
21
- "bleu4": 0.0076455167454197795
22
- }
23
- },
24
- "rouge1_ci": {
25
- "mean": 0.30947930587470557,
26
- "lower": 0.3060921045548166,
27
- "upper": 0.3131015955325767
28
- },
29
- "rougeL_ci": {
30
- "mean": 0.18471548283227546,
31
- "lower": 0.18251665662669495,
32
- "upper": 0.18701919830414013
33
- }
34
- },
35
- "emotion": {
36
- "sample_avg_f1": 0.3522975742816925,
37
- "macro_f1": 0.14317210018634796,
38
- "micro_f1": 0.4430159032344818,
39
- "num_samples": 5426,
40
- "num_classes": 28,
41
- "per_class": {
42
- "admiration": {
43
- "precision": 0.714634120464325,
44
- "recall": 0.6004098653793335,
45
- "f1": 0.6525612537411732,
46
- "support": 488
47
- },
48
- "amusement": {
49
- "precision": 0.7708333134651184,
50
- "recall": 0.7326732873916626,
51
- "f1": 0.7512690366449468,
52
- "support": 303
53
- },
54
- "anger": {
55
- "precision": 0.0,
56
- "recall": 0.0,
57
- "f1": 0.0,
58
- "support": 195
59
- },
60
- "annoyance": {
61
- "precision": 0.0,
62
- "recall": 0.0,
63
- "f1": 0.0,
64
- "support": 303
65
- },
66
- "approval": {
67
- "precision": 0.0,
68
- "recall": 0.0,
69
- "f1": 0.0,
70
- "support": 397
71
- },
72
- "caring": {
73
- "precision": 0.0,
74
- "recall": 0.0,
75
- "f1": 0.0,
76
- "support": 153
77
- },
78
- "confusion": {
79
- "precision": 0.0,
80
- "recall": 0.0,
81
- "f1": 0.0,
82
- "support": 152
83
- },
84
- "curiosity": {
85
- "precision": 0.6166666746139526,
86
- "recall": 0.14919355511665344,
87
- "f1": 0.24025974958898805,
88
- "support": 248
89
- },
90
- "desire": {
91
- "precision": 0.0,
92
- "recall": 0.0,
93
- "f1": 0.0,
94
- "support": 77
95
- },
96
- "disappointment": {
97
- "precision": 0.0,
98
- "recall": 0.0,
99
- "f1": 0.0,
100
- "support": 163
101
- },
102
- "disapproval": {
103
- "precision": 0.0,
104
- "recall": 0.0,
105
- "f1": 0.0,
106
- "support": 292
107
- },
108
- "disgust": {
109
- "precision": 0.0,
110
- "recall": 0.0,
111
- "f1": 0.0,
112
- "support": 97
113
- },
114
- "embarrassment": {
115
- "precision": 0.0,
116
- "recall": 0.0,
117
- "f1": 0.0,
118
- "support": 35
119
- },
120
- "excitement": {
121
- "precision": 0.0,
122
- "recall": 0.0,
123
- "f1": 0.0,
124
- "support": 96
125
- },
126
- "fear": {
127
- "precision": 0.0,
128
- "recall": 0.0,
129
- "f1": 0.0,
130
- "support": 90
131
- },
132
- "gratitude": {
133
- "precision": 0.8997134566307068,
134
- "recall": 0.8770949840545654,
135
- "f1": 0.8882602556669954,
136
- "support": 358
137
- },
138
- "grief": {
139
- "precision": 0.0,
140
- "recall": 0.0,
141
- "f1": 0.0,
142
- "support": 13
143
- },
144
- "joy": {
145
- "precision": 0.0,
146
- "recall": 0.0,
147
- "f1": 0.0,
148
- "support": 172
149
- },
150
- "love": {
151
- "precision": 0.6996466517448425,
152
- "recall": 0.7857142686843872,
153
- "f1": 0.740186913163602,
154
- "support": 252
155
- },
156
- "nervousness": {
157
- "precision": 0.0,
158
- "recall": 0.0,
159
- "f1": 0.0,
160
- "support": 21
161
- },
162
- "neutral": {
163
- "precision": 0.6869627237319946,
164
- "recall": 0.543035089969635,
165
- "f1": 0.6065780936064032,
166
- "support": 1766
167
- },
168
- "optimism": {
169
- "precision": 0.7142857313156128,
170
- "recall": 0.023923445492982864,
171
- "f1": 0.04629629729995926,
172
- "support": 209
173
- },
174
- "pride": {
175
- "precision": 0.0,
176
- "recall": 0.0,
177
- "f1": 0.0,
178
- "support": 15
179
- },
180
- "realization": {
181
- "precision": 0.0,
182
- "recall": 0.0,
183
- "f1": 0.0,
184
- "support": 127
185
- },
186
- "relief": {
187
- "precision": 0.0,
188
- "recall": 0.0,
189
- "f1": 0.0,
190
- "support": 18
191
- },
192
- "remorse": {
193
- "precision": 1.0,
194
- "recall": 0.014705882407724857,
195
- "f1": 0.02898550735279132,
196
- "support": 68
197
- },
198
- "sadness": {
199
- "precision": 1.0,
200
- "recall": 0.0279720276594162,
201
- "f1": 0.054421768115822285,
202
- "support": 143
203
- },
204
- "surprise": {
205
- "precision": 0.0,
206
- "recall": 0.0,
207
- "f1": 0.0,
208
- "support": 129
209
- }
210
- },
211
- "tuned_thresholds": {
212
- "admiration": 0.4,
213
- "amusement": 0.55,
214
- "anger": 0.2,
215
- "annoyance": 0.15,
216
- "approval": 0.15,
217
- "caring": 0.1,
218
- "confusion": 0.1,
219
- "curiosity": 0.25,
220
- "desire": 0.15,
221
- "disappointment": 0.1,
222
- "disapproval": 0.1,
223
- "disgust": 0.1,
224
- "embarrassment": 0.1,
225
- "excitement": 0.1,
226
- "fear": 0.1,
227
- "gratitude": 0.65,
228
- "grief": 0.1,
229
- "joy": 0.2,
230
- "love": 0.45,
231
- "nervousness": 0.1,
232
- "neutral": 0.3,
233
- "optimism": 0.25,
234
- "pride": 0.1,
235
- "realization": 0.2,
236
- "relief": 0.1,
237
- "remorse": 0.2,
238
- "sadness": 0.25,
239
- "surprise": 0.1
240
- },
241
- "tuned_macro_f1": 0.29355332255363464,
242
- "tuned_sample_avg_f1": 0.5025880336761475,
243
- "tuned_micro_f1": 0.48644566535949707,
244
- "sample_f1_ci": {
245
- "mean": 0.3522975795552279,
246
- "lower": 0.33984518982676004,
247
- "upper": 0.3658618994962526
248
- }
249
- },
250
- "topic": {
251
- "accuracy": 0.8571428571428571,
252
- "macro_f1": 0.8538751111963805,
253
- "num_samples": 189,
254
- "accuracy_ci": {
255
- "mean": 0.8571428571428571,
256
- "lower": 0.8042328042328042,
257
- "upper": 0.91005291005291
258
- }
259
- }
260
- }
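The `rouge1_ci`, `sample_f1_ci`, and `accuracy_ci` fields in the deleted metrics above are percentile bootstrap intervals over per-sample scores. As a rough sketch of how such an interval is computed (this is an illustration, not the project's actual `bootstrap_confidence_interval` implementation; the resample count, seed, and percentile choices here are assumptions):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=17):
    """Percentile bootstrap CI for the mean of per-sample scores (illustrative)."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample n scores with replacement and record the resample mean
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return {"mean": sum(scores) / n, "lower": lower, "upper": upper}
```

With per-sample ROUGE-1 scores as input, this yields a `{"mean", "lower", "upper"}` dict of the same shape as the fields stored above.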
 
outputs/training_history.json DELETED
@@ -1,210 +0,0 @@
- {
-   "train_epoch_1": {
-     "summarization_loss": 4.05727732604402,
-     "summarization_rouge_like": 0.20427788603502178,
-     "summarization_rouge1": 0.2867239527374218,
-     "summarization_rouge2": 0.08530419039955006,
-     "summarization_rougeL": 0.21671934441779328,
-     "summarization_bleu4": 0.046807610480627294,
-     "emotion_loss": 0.26667120894821444,
-     "emotion_f1": 0.20469974499405505,
-     "total_loss": 6.105969995046231,
-     "topic_loss": 1.6857517635423032,
-     "topic_accuracy": 0.4217703349282306
-   },
-   "val_epoch_1": {
-     "summarization_loss": 3.8147981293996174,
-     "summarization_rouge_like": 0.2193516213078271,
-     "summarization_rouge1": 0.26079796060659194,
-     "summarization_rouge2": 0.08507403927823329,
-     "summarization_rougeL": 0.2006794257877804,
-     "summarization_bleu4": 0.047830237595825456,
-     "emotion_loss": 0.14947238216797512,
-     "emotion_f1": 0.19722222675879797,
-     "topic_loss": 1.111328324476878,
-     "topic_accuracy": 0.7036666666666669,
-     "total_loss": 4.297669008910658
-   },
-   "train_epoch_2": {
-     "summarization_loss": 3.849853239841521,
-     "summarization_rouge_like": 0.21403927033293962,
-     "summarization_rouge1": 0.27951076939717684,
-     "summarization_rouge2": 0.0873046768836161,
-     "summarization_rougeL": 0.21374118927203542,
-     "summarization_bleu4": 0.04958577524880755,
-     "emotion_loss": 0.14221051054989492,
-     "emotion_f1": 0.26357290302542585,
-     "topic_loss": 0.7299397268663149,
-     "topic_accuracy": 0.8084686774942008,
-     "total_loss": 5.528282663174978
-   },
-   "val_epoch_2": {
-     "summarization_loss": 3.738964385986328,
-     "summarization_rouge_like": 0.22322817933854347,
-     "summarization_rouge1": 0.2648447987903156,
-     "summarization_rouge2": 0.08777067266852198,
-     "summarization_rougeL": 0.2049718124413594,
-     "summarization_bleu4": 0.04980800809043137,
-     "emotion_loss": 0.1332480485116442,
-     "emotion_f1": 0.3008111199736595,
-     "topic_loss": 0.5171811254819234,
-     "topic_accuracy": 0.8467777777777786,
-     "total_loss": 4.027366772142546
-   },
-   "train_epoch_3": {
-     "emotion_loss": 0.12831888329927568,
-     "emotion_f1": 0.3325013413977316,
-     "summarization_loss": 3.7839796767703127,
-     "summarization_rouge_like": 0.21797276831106976,
-     "summarization_rouge1": 0.28868883384124916,
-     "summarization_rouge2": 0.09150032176337587,
-     "summarization_rougeL": 0.22148013487440707,
-     "summarization_bleu4": 0.052993168973641876,
-     "total_loss": 5.379445686122572,
-     "topic_loss": 0.3385182340765703,
-     "topic_accuracy": 0.9149137451307789
-   },
-   "val_epoch_3": {
-     "summarization_loss": 3.699807391166687,
-     "summarization_rouge_like": 0.22613490382620294,
-     "summarization_rouge1": 0.27110048990501884,
-     "summarization_rouge2": 0.09042725720607361,
-     "summarization_rougeL": 0.209904253200661,
-     "summarization_bleu4": 0.05177241093143676,
-     "emotion_loss": 0.12147359546273946,
-     "emotion_f1": 0.3474666798238953,
-     "topic_loss": 0.5068136086066564,
-     "topic_accuracy": 0.8417777777777792,
-     "total_loss": 3.9733250692114286
-   },
-   "train_epoch_4": {
-     "summarization_loss": 3.746917572488457,
-     "summarization_rouge_like": 0.22054338132572013,
-     "summarization_rouge1": 0.29700759128401966,
-     "summarization_rouge2": 0.09528349132659034,
-     "summarization_rougeL": 0.2286643637324592,
-     "summarization_bleu4": 0.05591190647915982,
-     "emotion_loss": 0.12003502780097021,
-     "emotion_f1": 0.37240424536844824,
-     "total_loss": 5.303101773435515,
-     "topic_loss": 0.19978291214297147,
-     "topic_accuracy": 0.9528935185185234
-   },
-   "val_epoch_4": {
-     "summarization_loss": 3.6773871207237243,
-     "summarization_rouge_like": 0.22730110361278533,
-     "summarization_rouge1": 0.2719731929407321,
-     "summarization_rouge2": 0.09117786246379923,
-     "summarization_rougeL": 0.21082587270737135,
-     "summarization_bleu4": 0.052260125383420154,
-     "emotion_loss": 0.11476812147845825,
-     "emotion_f1": 0.40390001876900594,
-     "topic_loss": 0.5311758625507355,
-     "topic_accuracy": 0.8574444444444455,
-     "total_loss": 3.95150800096741
-   },
-   "train_epoch_5": {
-     "summarization_loss": 3.72376742684834,
-     "summarization_rouge_like": 0.22218972657959773,
-     "summarization_rouge1": 0.30386172952451457,
-     "summarization_rouge2": 0.09807265293507532,
-     "summarization_rougeL": 0.23422938393417417,
-     "summarization_bleu4": 0.05821407514551748,
-     "emotion_loss": 0.11460309708431649,
-     "emotion_f1": 0.41015538037428334,
-     "total_loss": 5.207888234798891,
-     "topic_loss": 0.13986067138923575,
-     "topic_accuracy": 0.9685236768802278
-   },
-   "val_epoch_5": {
-     "summarization_loss": 3.664777074654897,
-     "summarization_rouge_like": 0.22876987463000684,
-     "summarization_rouge1": 0.27596093399625565,
-     "summarization_rouge2": 0.09296804123657829,
-     "summarization_rougeL": 0.21411928790828857,
-     "summarization_bleu4": 0.05366559404113782,
-     "emotion_loss": 0.11044646929949523,
-     "emotion_f1": 0.4313555757453044,
-     "topic_loss": 0.5484664579232533,
-     "topic_accuracy": 0.8627777777777789,
-     "total_loss": 3.9397634813313704
-   },
-   "train_epoch_6": {
-     "emotion_loss": 0.1111307007874511,
-     "emotion_f1": 0.43345397762862603,
-     "summarization_loss": 3.7095002406409807,
-     "summarization_rouge_like": 0.22328726116125275,
-     "summarization_rouge1": 0.3064035344877472,
-     "summarization_rouge2": 0.09935359454486654,
-     "summarization_rougeL": 0.23650841461700828,
-     "summarization_bleu4": 0.059165680810364656,
-     "total_loss": 5.231221632164746,
-     "topic_loss": 0.10774352340420275,
-     "topic_accuracy": 0.9777777777777826
-   },
-   "val_epoch_6": {
-     "summarization_loss": 3.658109269142151,
-     "summarization_rouge_like": 0.22934290201883448,
-     "summarization_rouge1": 0.2752052666208255,
-     "summarization_rouge2": 0.09292038370832255,
-     "summarization_rougeL": 0.2137414809166316,
-     "summarization_bleu4": 0.053427338475007496,
-     "emotion_loss": 0.10808507531881333,
-     "emotion_f1": 0.4517777989556392,
-     "topic_loss": 0.5295590771238009,
-     "topic_accuracy": 0.8734444444444451,
-     "total_loss": 3.9250620675981014
-   },
-   "train_epoch_7": {
-     "emotion_loss": 0.10953594371440704,
-     "emotion_f1": 0.44384909393388133,
-     "topic_loss": 0.0957411853224039,
-     "topic_accuracy": 0.980394366197187,
-     "total_loss": 5.154093306898701,
-     "summarization_loss": 3.7035266418583594,
-     "summarization_rouge_like": 0.22378105952974536,
-     "summarization_rouge1": 0.3070619920824417,
-     "summarization_rouge2": 0.09984959921270933,
-     "summarization_rougeL": 0.23710279675635842,
-     "summarization_bleu4": 0.05954598113800495
-   },
-   "val_epoch_7": {
-     "summarization_loss": 3.654966928164164,
-     "summarization_rouge_like": 0.2296679906954514,
-     "summarization_rouge1": 0.27616327736195406,
-     "summarization_rouge2": 0.09329265746038877,
-     "summarization_rougeL": 0.2144202156909426,
-     "summarization_bleu4": 0.05381191556748925,
-     "emotion_loss": 0.10733611459533374,
-     "emotion_f1": 0.4582889095693827,
-     "topic_loss": 0.5517185291647911,
-     "topic_accuracy": 0.8574444444444457,
-     "total_loss": 3.9278186015089296
-   },
-   "train_epoch_8": {
-     "summarization_loss": 3.6991967220660666,
-     "summarization_rouge_like": 0.22392498422300275,
-     "summarization_rouge1": 0.30751530664889926,
-     "summarization_rouge2": 0.10003700619268063,
-     "summarization_rougeL": 0.23750205422812004,
-     "summarization_bleu4": 0.05974583539783897,
-     "emotion_loss": 0.10842480880968565,
-     "emotion_f1": 0.44919879924130307,
-     "total_loss": 5.178375478314296,
-     "topic_loss": 0.09057999134229244,
-     "topic_accuracy": 0.9817882611080675
-   },
-   "val_epoch_8": {
-     "summarization_loss": 3.652825821240743,
-     "summarization_rouge_like": 0.22990084413830203,
-     "summarization_rouge1": 0.2765755402183266,
-     "summarization_rouge2": 0.09348690574327727,
-     "summarization_rougeL": 0.2147442273650338,
-     "summarization_bleu4": 0.053967258926407226,
-     "emotion_loss": 0.10670269103099903,
-     "emotion_f1": 0.4594111327578624,
-     "topic_loss": 0.5511919154723486,
-     "topic_accuracy": 0.8574444444444457,
-     "total_loss": 3.924886086913453
-   }
- }
scripts/train_bert_baseline.py ADDED
@@ -0,0 +1,1137 @@
+ """
+ BERT Baseline Training for LexiMind Comparison.
+
+ Fine-tunes bert-base-uncased on topic classification and emotion detection
+ to provide baselines for comparison with LexiMind (FLAN-T5-based).
+
+ Supports three training modes to disentangle architecture vs. MTL effects:
+     1. single-topic — BERT fine-tuned on topic classification only
+     2. single-emotion — BERT fine-tuned on emotion detection only
+     3. multitask — BERT fine-tuned on both tasks jointly
+
+ Uses the same datasets, splits, label encoders, and evaluation metrics as the
+ main LexiMind pipeline for fair comparison.
+
+ Usage:
+     python scripts/train_bert_baseline.py --mode single-topic
+     python scripts/train_bert_baseline.py --mode single-emotion
+     python scripts/train_bert_baseline.py --mode multitask
+     python scripts/train_bert_baseline.py --mode all  # Run all three sequentially
+
+ Author: Oliver Perrin
+ Date: March 2026
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import random
+ import sys
+ import time
+ from dataclasses import dataclass, field
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional, Sequence
+
+ import numpy as np
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from sklearn.metrics import accuracy_score, classification_report, f1_score
+ from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
+ from torch.cuda.amp import GradScaler, autocast
+ from torch.optim import AdamW
+ from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
+ from torch.utils.data import DataLoader, Dataset
+ from tqdm import tqdm
+ from transformers import AutoModel, AutoTokenizer
+
+ # Project imports
+ PROJECT_ROOT = Path(__file__).resolve().parents[1]
+ if str(PROJECT_ROOT) not in sys.path:
+     sys.path.insert(0, str(PROJECT_ROOT))
+
+ from src.data.dataset import (
+     EmotionExample,
+     TopicExample,
+     load_emotion_jsonl,
+     load_topic_jsonl,
+ )
+
+ from src.training.metrics import (
+     bootstrap_confidence_interval,
+     multilabel_f1,
+     multilabel_macro_f1,
+     multilabel_micro_f1,
+     multilabel_per_class_metrics,
+     tune_per_class_thresholds,
+ )
+
+
+ # Configuration
+
+ @dataclass
+ class BertBaselineConfig:
+     """Hyperparameters aligned with LexiMind's full.yaml where applicable."""
+
+     # Model
+     model_name: str = "bert-base-uncased"
+     max_length: int = 256  # Same as LexiMind classification max_len
+
+     # Optimizer (matching LexiMind's full.yaml)
+     lr: float = 3e-5
+     weight_decay: float = 0.01
+     betas: tuple[float, float] = (0.9, 0.98)
+     eps: float = 1e-6
+
+     # Training
+     batch_size: int = 10  # Same as LexiMind
+     gradient_accumulation_steps: int = 4  # Same effective batch = 40
+     max_epochs: int = 8
+     warmup_steps: int = 300
+     gradient_clip_norm: float = 1.0
+     early_stopping_patience: int = 3
+     seed: int = 17  # Same as LexiMind
+
+     # Task weights (for multi-task mode)
+     topic_weight: float = 0.3  # Same as LexiMind
+     emotion_weight: float = 1.0
+
+     # Temperature sampling (for multi-task mode)
+     task_sampling_alpha: float = 0.5
+
+     # Frozen layers: freeze bottom 4 layers (matching LexiMind's encoder strategy)
+     freeze_layers: int = 4
+
+     # Precision
+     use_amp: bool = True  # BFloat16 mixed precision
+
+     # Paths
+     data_dir: Path = field(default_factory=lambda: PROJECT_ROOT / "data" / "processed")
+     output_dir: Path = field(default_factory=lambda: PROJECT_ROOT / "outputs" / "bert_baseline")
+     checkpoint_dir: Path = field(
+         default_factory=lambda: PROJECT_ROOT / "checkpoints" / "bert_baseline"
+     )
+
+     # Emotion threshold
+     emotion_threshold: float = 0.3
+
+
+ # Datasets
+
+ class BertEmotionDataset(Dataset):
+     """Tokenized emotion dataset for BERT."""
+
+     def __init__(
+         self,
+         examples: List[EmotionExample],
+         tokenizer: AutoTokenizer,
+         binarizer: MultiLabelBinarizer,
+         max_length: int = 256,
+     ):
+         self.examples = examples
+         self.tokenizer = tokenizer
+         self.binarizer = binarizer
+         self.max_length = max_length
+
+     def __len__(self) -> int:
+         return len(self.examples)
+
+     def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
+         ex = self.examples[idx]
+         encoding = self.tokenizer(
+             ex.text,
+             max_length=self.max_length,
+             padding="max_length",
+             truncation=True,
+             return_tensors="pt",
+         )
+         labels = self.binarizer.transform([ex.emotions])[0]
+         return {
+             "input_ids": encoding["input_ids"].squeeze(0),
+             "attention_mask": encoding["attention_mask"].squeeze(0),
+             "labels": torch.tensor(labels, dtype=torch.float32),
+         }
+
+
+ class BertTopicDataset(Dataset):
+     """Tokenized topic dataset for BERT."""
+
+     def __init__(
+         self,
+         examples: List[TopicExample],
+         tokenizer: AutoTokenizer,
+         encoder: LabelEncoder,
+         max_length: int = 256,
+     ):
+         self.examples = examples
+         self.tokenizer = tokenizer
+         self.encoder = encoder
+         self.max_length = max_length
+
+     def __len__(self) -> int:
+         return len(self.examples)
+
+     def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
+         ex = self.examples[idx]
+         encoding = self.tokenizer(
+             ex.text,
+             max_length=self.max_length,
+             padding="max_length",
+             truncation=True,
+             return_tensors="pt",
+         )
+         label = self.encoder.transform([ex.topic])[0]
+         return {
+             "input_ids": encoding["input_ids"].squeeze(0),
+             "attention_mask": encoding["attention_mask"].squeeze(0),
+             "labels": torch.tensor(label, dtype=torch.long),
+         }
+
+
+ # Model
+
+ class BertClassificationHead(nn.Module):
+     """Classification head on top of BERT [CLS] token.
+
+     For emotion: uses attention pooling + 2-layer MLP (matching LexiMind's emotion head)
+     For topic: uses [CLS] + single linear (matching LexiMind's mean pool + linear)
+     """
+
+     def __init__(
+         self,
+         hidden_size: int,
+         num_labels: int,
+         pooling: str = "cls",  # "cls" or "attention"
+         hidden_dim: Optional[int] = None,
+         dropout: float = 0.1,
+     ):
+         super().__init__()
+         self.pooling = pooling
+         self.dropout = nn.Dropout(dropout)
+
+         if pooling == "attention":
+             self.attn_query = nn.Linear(hidden_size, 1, bias=False)
+
+         if hidden_dim is not None:
+             self.classifier = nn.Sequential(
+                 nn.Linear(hidden_size, hidden_dim),
+                 nn.GELU(),
+                 nn.Dropout(dropout),
+                 nn.Linear(hidden_dim, num_labels),
+             )
+         else:
+             self.classifier = nn.Linear(hidden_size, num_labels)
+
+     def forward(
+         self, hidden_states: torch.Tensor, attention_mask: torch.Tensor
+     ) -> torch.Tensor:
+         if self.pooling == "attention":
+             # Learned attention pooling (same mechanism as LexiMind)
+             scores = self.attn_query(hidden_states)  # (B, L, 1)
+             mask = attention_mask.unsqueeze(-1).bool()
+             scores = scores.masked_fill(~mask, float("-inf"))
+             weights = F.softmax(scores, dim=1)
+             pooled = (weights * hidden_states).sum(dim=1)
+         elif self.pooling == "mean":
+             # Mean pooling over valid tokens
+             mask_expanded = attention_mask.unsqueeze(-1).float()
+             sum_embeddings = (hidden_states * mask_expanded).sum(dim=1)
+             sum_mask = mask_expanded.sum(dim=1).clamp(min=1e-9)
+             pooled = sum_embeddings / sum_mask
+         else:
+             # [CLS] token
+             pooled = hidden_states[:, 0, :]
+
+         pooled = self.dropout(pooled)
+         return self.classifier(pooled)
+
+
+ class BertBaseline(nn.Module):
+     """BERT baseline model with task-specific heads.
+
+     Supports single-task and multi-task configurations.
+     """
+
+     def __init__(
+         self,
+         model_name: str = "bert-base-uncased",
+         num_emotions: int = 28,
+         num_topics: int = 7,
+         tasks: Sequence[str] = ("emotion", "topic"),
+         freeze_layers: int = 4,
+     ):
+         super().__init__()
+         self.bert = AutoModel.from_pretrained(model_name)
+         hidden_size = self.bert.config.hidden_size  # 768 for bert-base
+
+         self.tasks = list(tasks)
+         self.heads = nn.ModuleDict()
+
+         if "emotion" in tasks:
+             # Attention pooling + 2-layer MLP (matching LexiMind's emotion head)
+             self.heads["emotion"] = BertClassificationHead(
+                 hidden_size=hidden_size,
+                 num_labels=num_emotions,
+                 pooling="attention",
+                 hidden_dim=hidden_size // 2,  # 384, same ratio as LexiMind
+                 dropout=0.1,
+             )
+
+         if "topic" in tasks:
+             # Mean pooling + single linear (matching LexiMind's topic head)
+             self.heads["topic"] = BertClassificationHead(
+                 hidden_size=hidden_size,
+                 num_labels=num_topics,
+                 pooling="mean",
+                 hidden_dim=None,
+                 dropout=0.1,
+             )
+
+         # Freeze bottom N encoder layers (matching LexiMind's strategy)
+         self._freeze_layers(freeze_layers)
+
+     def _freeze_layers(self, n: int) -> None:
+         """Freeze embedding + bottom n encoder layers."""
+         # Freeze embeddings
+         for param in self.bert.embeddings.parameters():
+             param.requires_grad = False
+
+         # Freeze bottom n layers
+         for i in range(min(n, len(self.bert.encoder.layer))):
+             for param in self.bert.encoder.layer[i].parameters():
+                 param.requires_grad = False
+
+         frozen = sum(1 for p in self.bert.parameters() if not p.requires_grad)
+         total = sum(1 for p in self.bert.parameters())
+         print(f"  Frozen {frozen}/{total} BERT parameters (bottom {n} layers + embeddings)")
+
+     def forward(
+         self,
+         task: str,
+         input_ids: torch.Tensor,
+         attention_mask: torch.Tensor,
+     ) -> torch.Tensor:
+         outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
+         hidden_states = outputs.last_hidden_state  # (B, L, 768)
+         return self.heads[task](hidden_states, attention_mask)
+
+     def param_count(self) -> Dict[str, int]:
+         """Count parameters by component."""
+         counts = {}
+         counts["bert_encoder"] = sum(p.numel() for p in self.bert.parameters())
+         counts["bert_trainable"] = sum(
+             p.numel() for p in self.bert.parameters() if p.requires_grad
+         )
+         for name, head in self.heads.items():
+             counts[f"head_{name}"] = sum(p.numel() for p in head.parameters())
+         counts["total"] = sum(p.numel() for p in self.parameters())
+         counts["trainable"] = sum(p.numel() for p in self.parameters() if p.requires_grad)
+         return counts
+
+
+ # Training
+
+ class BertTrainer:
+     """Trainer supporting single-task and multi-task BERT training."""
+
+     def __init__(
+         self,
+         model: BertBaseline,
+         config: BertBaselineConfig,
+         train_loaders: Dict[str, DataLoader],
+         val_loaders: Dict[str, DataLoader],
+         device: torch.device,
+         mode: str,
+     ):
+         self.model = model
+         self.config = config
+         self.train_loaders = train_loaders
+         self.val_loaders = val_loaders
+         self.device = device
+         self.mode = mode
+
+         # Optimizer
+         self.optimizer = AdamW(
+             [p for p in model.parameters() if p.requires_grad],
+             lr=config.lr,
+             weight_decay=config.weight_decay,
+             betas=config.betas,
+             eps=config.eps,
+         )
+
+         # Calculate total training steps
+         if len(train_loaders) > 1:
+             # Multi-task: use temperature-sampled steps
+             sizes = {k: len(v) for k, v in train_loaders.items()}
+             total_batches = sum(sizes.values())
+         else:
+             total_batches = sum(len(v) for v in train_loaders.values())
+         self.steps_per_epoch = total_batches // config.gradient_accumulation_steps
+         self.total_steps = self.steps_per_epoch * config.max_epochs
+
+         # LR scheduler: linear warmup + cosine decay (matching LexiMind)
+         warmup_scheduler = LinearLR(
+             self.optimizer,
+             start_factor=1e-8 / config.lr,
+             end_factor=1.0,
+             total_iters=config.warmup_steps,
+         )
+         cosine_scheduler = CosineAnnealingLR(
+             self.optimizer,
+             T_max=max(self.total_steps - config.warmup_steps, 1),
+             eta_min=config.lr * 0.1,  # Decay to 10% of peak (matching LexiMind)
+         )
+         self.scheduler = SequentialLR(
+             self.optimizer,
+             schedulers=[warmup_scheduler, cosine_scheduler],
+             milestones=[config.warmup_steps],
+         )
+
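The composed schedule above (linear warmup to the peak LR, then cosine decay toward 10% of peak) can also be written in closed form. A small standalone sketch mirroring the `LinearLR`/`CosineAnnealingLR` pair with the config's peak LR and warmup steps (the `total` step count here is a made-up toy value; the script derives it from loader sizes):

```python
import math

def lr_at(step, peak=3e-5, warmup=300, total=3000, floor_ratio=0.1):
    """Closed-form warmup + cosine schedule (illustrative, not the trainer's code)."""
    floor = peak * floor_ratio
    if step < warmup:
        # Linear warmup from ~0 up to the peak LR
        return peak * (step + 1) / warmup
    # Cosine decay from the peak down to the floor (10% of peak)
    progress = (step - warmup) / max(total - warmup, 1)
    return floor + (peak - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Plotting `lr_at` over `range(total)` reproduces the ramp-then-decay shape the two chained schedulers produce.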
+         # Mixed precision
+         self.scaler = GradScaler(enabled=config.use_amp)
+
+         # Loss functions
+         self.emotion_loss_fn = nn.BCEWithLogitsLoss()
+         self.topic_loss_fn = nn.CrossEntropyLoss()
+
+         # Tracking
+         self.global_step = 0
+         self.best_metric = -float("inf")
+         self.patience_counter = 0
+         self.training_history: List[Dict[str, Any]] = []
+
+     def _compute_loss(self, task: str, logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
+         if task == "emotion":
+             return self.emotion_loss_fn(logits, labels)
+         else:
+             return self.topic_loss_fn(logits, labels)
+
+     def _get_task_weight(self, task: str) -> float:
+         if self.mode != "multitask":
+             return 1.0
+         if task == "topic":
+             return self.config.topic_weight
+         return self.config.emotion_weight
+
+     def _make_multitask_iterator(self):
+         """Temperature-based task sampling (matching LexiMind)."""
+         sizes = {k: len(v.dataset) for k, v in self.train_loaders.items()}
+         alpha = self.config.task_sampling_alpha
+
+         # Compute sampling probabilities
+         raw = {k: s ** (1.0 / alpha) for k, s in sizes.items()}
+         total = sum(raw.values())
+         probs = {k: v / total for k, v in raw.items()}
+
+         # Create iterators
+         iters = {k: iter(v) for k, v in self.train_loaders.items()}
+         tasks = list(probs.keys())
+         weights = [probs[t] for t in tasks]
+
+         while True:
+             task = random.choices(tasks, weights=weights, k=1)[0]
+             try:
+                 batch = next(iters[task])
+             except StopIteration:
+                 iters[task] = iter(self.train_loaders[task])
+                 batch = next(iters[task])
+             yield task, batch
+
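The generator above samples each task with probability proportional to `dataset_size ** (1 / alpha)`. Note that with the default `alpha = 0.5` the exponent is 2, so the larger task is sampled even more often than its raw share (temperature schemes that flatten toward uniform instead use `size ** alpha` with `alpha < 1`). A standalone sketch of the weighting as written, with made-up dataset sizes for illustration:

```python
def task_sampling_probs(sizes, alpha=0.5):
    """Reproduces the iterator's weighting: p(task) ∝ size ** (1 / alpha)."""
    raw = {k: s ** (1.0 / alpha) for k, s in sizes.items()}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

# Illustrative sizes only: a 3:1 size ratio becomes a 9:1 sampling ratio at alpha=0.5
probs = task_sampling_probs({"emotion": 30000, "topic": 10000}, alpha=0.5)
```

`alpha = 1.0` recovers plain size-proportional sampling; values below 1 sharpen it further under this parameterization.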
+    def train_epoch(self, epoch: int) -> Dict[str, float]:
+        """Train one epoch."""
+        self.model.train()
+        self.optimizer.zero_grad()
+
+        epoch_losses: Dict[str, List[float]] = {t: [] for t in self.train_loaders}
+
+        if len(self.train_loaders) > 1:
+            # Multi-task: temperature sampling
+            iterator = self._make_multitask_iterator()
+            total_batches = sum(len(v) for v in self.train_loaders.values())
+        else:
+            # Single-task: iterate normally
+            task_name = next(iter(self.train_loaders))
+            iterator = ((task_name, batch) for batch in self.train_loaders[task_name])
+            total_batches = len(self.train_loaders[task_name])
+
+        pbar = tqdm(total=total_batches, desc=f"Epoch {epoch + 1}/{self.config.max_epochs}")
+
+        for step_in_epoch in range(total_batches):
+            task, batch = next(iterator)
+
+            input_ids = batch["input_ids"].to(self.device)
+            attention_mask = batch["attention_mask"].to(self.device)
+            labels = batch["labels"].to(self.device)
+
+            # Forward pass with AMP
+            with autocast(dtype=torch.bfloat16, enabled=self.config.use_amp):
+                logits = self.model(task, input_ids, attention_mask)
+                loss = self._compute_loss(task, logits, labels)
+                loss = loss * self._get_task_weight(task)
+                loss = loss / self.config.gradient_accumulation_steps
+
+            # Backward
+            self.scaler.scale(loss).backward()
+            epoch_losses[task].append(loss.item() * self.config.gradient_accumulation_steps)
+
+            # Optimizer step (every N accumulation steps)
+            if (step_in_epoch + 1) % self.config.gradient_accumulation_steps == 0:
+                self.scaler.unscale_(self.optimizer)
+                torch.nn.utils.clip_grad_norm_(
+                    self.model.parameters(), self.config.gradient_clip_norm
+                )
+                self.scaler.step(self.optimizer)
+                self.scaler.update()
+                self.optimizer.zero_grad()
+                self.scheduler.step()
+                self.global_step += 1
+
+            pbar.set_postfix(
+                {f"{task}_loss": f"{epoch_losses[task][-1]:.4f}", "lr": f"{self.scheduler.get_last_lr()[0]:.2e}"}
+            )
+            pbar.update(1)
+
+        pbar.close()
+
+        # Aggregate
+        results = {}
+        for task, losses in epoch_losses.items():
+            if losses:
+                results[f"train_{task}_loss"] = sum(losses) / len(losses)
+        return results
+
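`_make_multitask_iterator` is defined elsewhere in the file, so only its call site appears in this hunk. The sketch below shows one common way to implement temperature-based task sampling; the function name, signature, and the `temperature` default here are illustrative assumptions, not the file's actual implementation:

```python
import itertools
import random
from typing import Dict, Iterable, Iterator, Tuple


def make_multitask_iterator(
    loaders: Dict[str, Iterable],
    temperature: float = 0.5,  # assumed default, not from this file
    seed: int = 42,
) -> Iterator[Tuple[str, object]]:
    """Yield (task, batch) pairs, sampling tasks with weight len(loader) ** temperature.

    temperature=1.0 reproduces proportional sampling; temperature=0.0 samples
    tasks uniformly, which up-weights smaller datasets. Each loader is cycled
    so a small task can be revisited within one epoch.
    """
    rng = random.Random(seed)
    tasks = list(loaders)
    weights = [len(loaders[t]) ** temperature for t in tasks]
    cycles = {t: itertools.cycle(loaders[t]) for t in tasks}
    while True:
        task = rng.choices(tasks, weights=weights, k=1)[0]
        yield task, next(cycles[task])
```

With temperature 0 a 9:1 size imbalance between tasks still yields roughly 50/50 sampling; with temperature 1 the larger task dominates in proportion to its size.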
+    @torch.no_grad()
+    def validate(self) -> Dict[str, Any]:
+        """Run validation across all tasks."""
+        self.model.eval()
+        results: Dict[str, Any] = {}
+
+        for task, loader in self.val_loaders.items():
+            all_logits = []
+            all_labels = []
+            total_loss = 0.0
+            n_batches = 0
+
+            for batch in loader:
+                input_ids = batch["input_ids"].to(self.device)
+                attention_mask = batch["attention_mask"].to(self.device)
+                labels = batch["labels"].to(self.device)
+
+                with autocast(dtype=torch.bfloat16, enabled=self.config.use_amp):
+                    logits = self.model(task, input_ids, attention_mask)
+                    loss = self._compute_loss(task, logits, labels)
+
+                total_loss += loss.item()
+                n_batches += 1
+                all_logits.append(logits.float().cpu())
+                all_labels.append(labels.float().cpu())
+
+            all_logits_t = torch.cat(all_logits, dim=0)
+            all_labels_t = torch.cat(all_labels, dim=0)
+            results[f"val_{task}_loss"] = total_loss / max(n_batches, 1)
+
+            if task == "emotion":
+                preds = (torch.sigmoid(all_logits_t) > self.config.emotion_threshold).int()
+                targets = all_labels_t.int()
+                results["val_emotion_sample_f1"] = multilabel_f1(preds, targets)
+                results["val_emotion_macro_f1"] = multilabel_macro_f1(preds, targets)
+                results["val_emotion_micro_f1"] = multilabel_micro_f1(preds, targets)
+                # Store raw logits for threshold tuning later
+                results["_emotion_logits"] = all_logits_t
+                results["_emotion_labels"] = all_labels_t
+
+            elif task == "topic":
+                preds = all_logits_t.argmax(dim=1).numpy()
+                targets = all_labels_t.long().numpy()
+                results["val_topic_accuracy"] = float(accuracy_score(targets, preds))
+                results["val_topic_macro_f1"] = float(
+                    f1_score(targets, preds, average="macro", zero_division=0)
+                )
+
+        # Combined metric for early stopping / checkpointing
+        metric_parts = []
+        if "val_emotion_sample_f1" in results:
+            metric_parts.append(results["val_emotion_sample_f1"])
+        if "val_topic_accuracy" in results:
+            metric_parts.append(results["val_topic_accuracy"])
+        results["val_combined_metric"] = sum(metric_parts) / max(len(metric_parts), 1)
+
+        return results
+
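The raw logits stashed under `_emotion_logits` feed the per-class threshold tuning used at evaluation time. `tune_per_class_thresholds` comes from the project's metrics module and is not shown in this hunk; below is a sketch of one plausible grid-search implementation, assuming the function returns per-class sigmoid thresholds plus the resulting macro F1 (the grid values are an assumption):

```python
from typing import List, Optional, Tuple

import numpy as np


def tune_per_class_thresholds(
    logits, labels, grid: Optional[List[float]] = None
) -> Tuple[List[float], float]:
    """For each class, pick the sigmoid threshold that maximizes that class's F1.

    Returns (per-class thresholds, macro F1 at those thresholds).
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    if grid is None:
        grid = [round(0.05 * i, 2) for i in range(1, 19)]  # 0.05 .. 0.90
    probs = 1.0 / (1.0 + np.exp(-logits))

    thresholds, f1s = [], []
    for c in range(probs.shape[1]):
        best_t, best_f1 = 0.5, 0.0
        for t in grid:
            pred = (probs[:, c] >= t).astype(float)
            tp = float((pred * labels[:, c]).sum())
            prec = tp / max(pred.sum(), 1.0)
            rec = tp / max(labels[:, c].sum(), 1.0)
            f1 = 2 * prec * rec / max(prec + rec, 1e-8)
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        thresholds.append(best_t)
        f1s.append(best_f1)
    return thresholds, float(np.mean(f1s))
```

Tuning thresholds on the validation split (rather than the test split) keeps the tuned numbers honest; the trainer stores validation logits for exactly that reason.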
+    def save_checkpoint(self, path: Path, epoch: int, metrics: Dict[str, Any]) -> None:
+        """Save model checkpoint."""
+        path.parent.mkdir(parents=True, exist_ok=True)
+        # Filter out tensors from metrics
+        clean_metrics = {k: v for k, v in metrics.items() if not k.startswith("_")}
+        torch.save(
+            {
+                "epoch": epoch,
+                "model_state_dict": self.model.state_dict(),
+                "optimizer_state_dict": self.optimizer.state_dict(),
+                "scheduler_state_dict": self.scheduler.state_dict(),
+                "metrics": clean_metrics,
+                "config": {
+                    "mode": self.mode,
+                    "tasks": self.model.tasks,
+                    "model_name": self.config.model_name,
+                },
+            },
+            path,
+        )
+
+    def train(self) -> Dict[str, Any]:
+        """Full training loop."""
+        print(f"\n{'=' * 60}")
+        print(f"Training BERT Baseline — Mode: {self.mode}")
+        print(f"{'=' * 60}")
+
+        param_counts = self.model.param_count()
+        print(f"  Total parameters: {param_counts['total']:,}")
+        print(f"  Trainable parameters: {param_counts['trainable']:,}")
+        for name, count in param_counts.items():
+            if name.startswith("head_"):
+                print(f"  {name}: {count:,}")
+        print(f"  Steps/epoch: {self.steps_per_epoch}")
+        print(f"  Total steps: {self.total_steps}")
+        print()
+
+        all_results: Dict[str, Any] = {"mode": self.mode, "epochs": []}
+        start_time = time.time()
+
+        for epoch in range(self.config.max_epochs):
+            epoch_start = time.time()
+
+            # Train
+            train_metrics = self.train_epoch(epoch)
+
+            # Validate
+            val_metrics = self.validate()
+
+            epoch_time = time.time() - epoch_start
+
+            # Log
+            epoch_result = {
+                "epoch": epoch + 1,
+                "time_seconds": epoch_time,
+                **train_metrics,
+                **{k: v for k, v in val_metrics.items() if not k.startswith("_")},
+            }
+            all_results["epochs"].append(epoch_result)
+            self.training_history.append(epoch_result)
+
+            # Print summary
+            print(f"\n  Epoch {epoch + 1} ({epoch_time:.0f}s):")
+            for k, v in sorted(epoch_result.items()):
+                if k not in ("epoch", "time_seconds") and isinstance(v, float):
+                    print(f"    {k}: {v:.4f}")
+
+            # Checkpointing
+            combined = val_metrics["val_combined_metric"]
+            if combined > self.best_metric:
+                self.best_metric = combined
+                self.patience_counter = 0
+                self.save_checkpoint(
+                    self.config.checkpoint_dir / self.mode / "best.pt",
+                    epoch,
+                    val_metrics,
+                )
+                print(f"  New best model (combined metric: {combined:.4f})")
+            else:
+                self.patience_counter += 1
+                print(
+                    f"  No improvement ({self.patience_counter}/{self.config.early_stopping_patience})"
+                )
+
+            # Always save epoch checkpoint
+            self.save_checkpoint(
+                self.config.checkpoint_dir / self.mode / f"epoch_{epoch + 1}.pt",
+                epoch,
+                val_metrics,
+            )
+
+            # Early stopping
+            if self.patience_counter >= self.config.early_stopping_patience:
+                print(f"\n  Early stopping triggered at epoch {epoch + 1}")
+                all_results["early_stopped"] = True
+                all_results["best_epoch"] = epoch + 1 - self.config.early_stopping_patience
+                break
+
+        total_time = time.time() - start_time
+        all_results["total_time_seconds"] = total_time
+        all_results["total_time_human"] = f"{total_time / 3600:.1f}h"
+        if "early_stopped" not in all_results:
+            all_results["early_stopped"] = False
+            all_results["best_epoch"] = (
+                epoch + 1 - self.patience_counter
+                if self.patience_counter > 0
+                else epoch + 1
+            )
+        all_results["param_counts"] = param_counts
+
+        print(f"\n  Training complete in {total_time / 3600:.1f}h")
+        print(f"  Best combined metric: {self.best_metric:.4f}")
+
+        return all_results
+
+
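`self.scheduler` is created in the trainer's `__init__`, outside this hunk. For BERT fine-tuning the usual choice is linear warmup followed by linear decay; a multiplier function of that shape, suitable for `torch.optim.lr_scheduler.LambdaLR`, might look like the sketch below (the function name and any warmup fraction are assumptions, not this file's code):

```python
def linear_warmup_then_linear_decay(warmup_steps: int, total_steps: int):
    """Return an LR multiplier: ramps 0 -> 1 over warmup_steps, then decays 1 -> 0 by total_steps."""

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return lr_lambda
```

Because `self.scheduler.step()` above is called once per optimizer step (not per batch), `total_steps` would be computed from `steps_per_epoch` after dividing by `gradient_accumulation_steps`.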
+# Evaluation
+
+
+def evaluate_bert_model(
+    model: BertBaseline,
+    val_loaders: Dict[str, DataLoader],
+    device: torch.device,
+    config: BertBaselineConfig,
+    emotion_classes: Optional[List[str]] = None,
+    topic_classes: Optional[List[str]] = None,
+) -> Dict[str, Any]:
+    """Full evaluation with the same metrics as LexiMind's evaluate.py."""
+    model.eval()
+    results: Dict[str, Any] = {}
+
+    with torch.no_grad():
+        for task, loader in val_loaders.items():
+            all_logits = []
+            all_labels = []
+
+            for batch in tqdm(loader, desc=f"Evaluating {task}"):
+                input_ids = batch["input_ids"].to(device)
+                attention_mask = batch["attention_mask"].to(device)
+                labels = batch["labels"].to(device)
+
+                with autocast(dtype=torch.bfloat16, enabled=config.use_amp):
+                    logits = model(task, input_ids, attention_mask)
+
+                all_logits.append(logits.float().cpu())
+                all_labels.append(labels.float().cpu())
+
+            all_logits_t = torch.cat(all_logits, dim=0)
+            all_labels_t = torch.cat(all_labels, dim=0)
+
+            if task == "emotion":
+                # Default threshold
+                preds_default = (torch.sigmoid(all_logits_t) > config.emotion_threshold).int()
+                targets = all_labels_t.int()
+
+                results["emotion"] = {
+                    "default_threshold": config.emotion_threshold,
+                    "sample_avg_f1": multilabel_f1(preds_default, targets),
+                    "macro_f1": multilabel_macro_f1(preds_default, targets),
+                    "micro_f1": multilabel_micro_f1(preds_default, targets),
+                }
+
+                # Per-class metrics
+                if emotion_classes:
+                    per_class = multilabel_per_class_metrics(
+                        preds_default, targets, emotion_classes
+                    )
+                    results["emotion"]["per_class"] = per_class
+
+                # Threshold tuning
+                best_thresholds, tuned_macro = tune_per_class_thresholds(
+                    all_logits_t, all_labels_t
+                )
+                tuned_preds = torch.zeros_like(all_logits_t)
+                probs = torch.sigmoid(all_logits_t)
+                for c in range(all_logits_t.shape[1]):
+                    tuned_preds[:, c] = (probs[:, c] >= best_thresholds[c]).float()
+                tuned_preds = tuned_preds.int()
+
+                results["emotion"]["tuned_macro_f1"] = tuned_macro
+                results["emotion"]["tuned_sample_avg_f1"] = multilabel_f1(
+                    tuned_preds, targets
+                )
+                results["emotion"]["tuned_micro_f1"] = multilabel_micro_f1(
+                    tuned_preds, targets
+                )
+
+                # Bootstrap CI on sample-avg F1
+                per_sample_f1 = []
+                for i in range(preds_default.shape[0]):
+                    p = preds_default[i].float()
+                    g = targets[i].float()
+                    tp = (p * g).sum()
+                    prec = tp / p.sum().clamp(min=1)
+                    rec = tp / g.sum().clamp(min=1)
+                    f = (2 * prec * rec) / (prec + rec).clamp(min=1e-8)
+                    per_sample_f1.append(f.item())
+                _, ci_low, ci_high = bootstrap_confidence_interval(per_sample_f1)
+                results["emotion"]["sample_avg_f1_ci"] = [ci_low, ci_high]
+
+            elif task == "topic":
+                preds = all_logits_t.argmax(dim=1).numpy()
+                targets = all_labels_t.long().numpy()
+
+                acc = float(accuracy_score(targets, preds))
+                macro_f1 = float(
+                    f1_score(targets, preds, average="macro", zero_division=0)
+                )
+
+                results["topic"] = {
+                    "accuracy": acc,
+                    "macro_f1": macro_f1,
+                }
+
+                # Per-class metrics
+                if topic_classes:
+                    report = classification_report(
+                        targets,
+                        preds,
+                        target_names=topic_classes,
+                        output_dict=True,
+                        zero_division=0,
+                    )
+                    results["topic"]["per_class"] = {
+                        name: {
+                            "precision": report[name]["precision"],
+                            "recall": report[name]["recall"],
+                            "f1": report[name]["f1-score"],
+                            "support": report[name]["support"],
+                        }
+                        for name in topic_classes
+                        if name in report
+                    }
+
+                # Bootstrap CI on accuracy
+                per_sample_correct = (preds == targets).astype(float).tolist()
+                _, ci_low, ci_high = bootstrap_confidence_interval(per_sample_correct)
+                results["topic"]["accuracy_ci"] = [ci_low, ci_high]
+
+    return results
+
+
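`bootstrap_confidence_interval` is imported from the shared metrics utilities; its call sites above expect a `(mean, ci_low, ci_high)` triple over a list of per-sample scores. A sketch of the standard percentile bootstrap it presumably implements (the resample count, `alpha`, and `seed` defaults are assumptions):

```python
import random
from typing import List, Tuple


def bootstrap_confidence_interval(
    values: List[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 42,
) -> Tuple[float, float, float]:
    """Percentile bootstrap over per-sample scores: returns (mean, ci_low, ci_high)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, sort.
    resample_means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = resample_means[int((alpha / 2) * n_resamples)]
    hi = resample_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(values) / n, lo, hi
```

Bootstrapping per-sample scores (0/1 correctness for accuracy, per-sample F1 for emotion) avoids any distributional assumption about the metric itself.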
+# Main Pipeline
+
+
+def set_seed(seed: int) -> None:
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(seed)
+
+
+def load_data(config: BertBaselineConfig):
+    """Load all datasets and create label encoders."""
+    data_dir = config.data_dir
+
+    # Load emotion data
+    emo_train = load_emotion_jsonl(str(data_dir / "emotion" / "train.jsonl"))
+    emo_val_path = data_dir / "emotion" / "validation.jsonl"
+    if not emo_val_path.exists():
+        emo_val_path = data_dir / "emotion" / "val.jsonl"
+    emo_val = load_emotion_jsonl(str(emo_val_path))
+
+    # Load topic data
+    top_train = load_topic_jsonl(str(data_dir / "topic" / "train.jsonl"))
+    top_val_path = data_dir / "topic" / "validation.jsonl"
+    if not top_val_path.exists():
+        top_val_path = data_dir / "topic" / "val.jsonl"
+    top_val = load_topic_jsonl(str(top_val_path))
+
+    # Fit label encoders on training data (same as LexiMind)
+    binarizer = MultiLabelBinarizer()
+    binarizer.fit([ex.emotions for ex in emo_train])
+
+    label_encoder = LabelEncoder()
+    label_encoder.fit([ex.topic for ex in top_train])
+
+    print(f"  Emotion: {len(emo_train)} train, {len(emo_val)} val, {len(binarizer.classes_)} classes")
+    print(f"  Topic: {len(top_train)} train, {len(top_val)} val, {len(label_encoder.classes_)} classes")
+    print(f"  Emotion classes: {list(binarizer.classes_)[:5]}...")
+    print(f"  Topic classes: {list(label_encoder.classes_)}")
+
+    return {
+        "emotion_train": emo_train,
+        "emotion_val": emo_val,
+        "topic_train": top_train,
+        "topic_val": top_val,
+        "binarizer": binarizer,
+        "label_encoder": label_encoder,
+    }
+
+
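For reference, this is how the two scikit-learn encoders fitted above behave on toy labels (the label strings here are illustrative, not the project's actual classes):

```python
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Multi-label: each example can carry several emotions at once.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([["joy", "love"], ["anger"], ["joy"]])
print(list(mlb.classes_))  # classes are sorted: ['anger', 'joy', 'love']
print(Y.tolist())          # [[0, 1, 1], [1, 0, 0], [0, 1, 0]]

# Single-label: exactly one topic per example, encoded as an integer id.
le = LabelEncoder()
y = le.fit_transform(["sports", "tech", "sports"])
print(y.tolist())          # [0, 1, 0]
```

Fitting both encoders on the training split only (as above) guarantees the baseline and LexiMind map labels to identical indices.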
+def run_experiment(mode: str, config: BertBaselineConfig) -> Dict[str, Any]:
+    """Run a single experiment (single-topic, single-emotion, or multitask)."""
+    print(f"\n{'═' * 60}")
+    print(f"  BERT BASELINE EXPERIMENT: {mode.upper()}")
+    print(f"{'═' * 60}")
+
+    set_seed(config.seed)
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    print(f"  Device: {device}")
+    if torch.cuda.is_available():
+        print(f"  GPU: {torch.cuda.get_device_name()}")
+        print(f"  VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
+
+        # CUDA optimizations
+        torch.backends.cudnn.benchmark = True
+        if hasattr(torch.backends, "cuda"):
+            torch.backends.cuda.matmul.allow_tf32 = True
+
+    # Load tokenizer
+    print(f"\n  Loading tokenizer: {config.model_name}")
+    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
+
+    # Load data
+    print("  Loading datasets...")
+    data = load_data(config)
+
+    # Determine tasks for this mode
+    if mode == "single-topic":
+        tasks = ["topic"]
+    elif mode == "single-emotion":
+        tasks = ["emotion"]
+    else:
+        tasks = ["emotion", "topic"]
+
+    # Create datasets
+    train_loaders: Dict[str, DataLoader] = {}
+    val_loaders: Dict[str, DataLoader] = {}
+
+    if "emotion" in tasks:
+        emo_train_ds = BertEmotionDataset(
+            data["emotion_train"], tokenizer, data["binarizer"], config.max_length
+        )
+        emo_val_ds = BertEmotionDataset(
+            data["emotion_val"], tokenizer, data["binarizer"], config.max_length
+        )
+        train_loaders["emotion"] = DataLoader(
+            emo_train_ds,
+            batch_size=config.batch_size,
+            shuffle=True,
+            num_workers=4,
+            pin_memory=True,
+            persistent_workers=True,
+        )
+        val_loaders["emotion"] = DataLoader(
+            emo_val_ds,
+            batch_size=config.batch_size * 2,
+            shuffle=False,
+            num_workers=4,
+            pin_memory=True,
+        )
+
+    if "topic" in tasks:
+        top_train_ds = BertTopicDataset(
+            data["topic_train"], tokenizer, data["label_encoder"], config.max_length
+        )
+        top_val_ds = BertTopicDataset(
+            data["topic_val"], tokenizer, data["label_encoder"], config.max_length
+        )
+        train_loaders["topic"] = DataLoader(
+            top_train_ds,
+            batch_size=config.batch_size,
+            shuffle=True,
+            num_workers=4,
+            pin_memory=True,
+            persistent_workers=True,
+        )
+        val_loaders["topic"] = DataLoader(
+            top_val_ds,
+            batch_size=config.batch_size * 2,
+            shuffle=False,
+            num_workers=4,
+            pin_memory=True,
+        )
+
+    # Create model
+    print(f"\n  Creating model with tasks: {tasks}")
+    model = BertBaseline(
+        model_name=config.model_name,
+        num_emotions=len(data["binarizer"].classes_),
+        num_topics=len(data["label_encoder"].classes_),
+        tasks=tasks,
+        freeze_layers=config.freeze_layers,
+    ).to(device)
+
+    # Train
+    trainer = BertTrainer(model, config, train_loaders, val_loaders, device, mode)
+    training_results = trainer.train()
+
+    # Load best checkpoint for final evaluation
+    best_path = config.checkpoint_dir / mode / "best.pt"
+    if best_path.exists():
+        print("\n  Loading best checkpoint for final evaluation...")
+        checkpoint = torch.load(best_path, map_location=device, weights_only=False)
+        model.load_state_dict(checkpoint["model_state_dict"])
+
+    # Full evaluation
+    print("\n  Running final evaluation...")
+    eval_results = evaluate_bert_model(
+        model,
+        val_loaders,
+        device,
+        config,
+        emotion_classes=list(data["binarizer"].classes_) if "emotion" in tasks else None,
+        topic_classes=list(data["label_encoder"].classes_) if "topic" in tasks else None,
+    )
+
+    # Combine results
+    final_results = {
+        "mode": mode,
+        "model": config.model_name,
+        "tasks": tasks,
+        "training": training_results,
+        "evaluation": eval_results,
+    }
+
+    # Save results
+    output_path = config.output_dir / f"{mode}_results.json"
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+
+    # Remove non-serializable fields
+    def make_serializable(obj):
+        if isinstance(obj, dict):
+            return {k: make_serializable(v) for k, v in obj.items() if not k.startswith("_")}
+        if isinstance(obj, list):
+            return [make_serializable(item) for item in obj]
+        if isinstance(obj, (np.integer, np.int64)):
+            return int(obj)
+        if isinstance(obj, (np.floating, np.float64)):
+            return float(obj)
+        if isinstance(obj, np.ndarray):
+            return obj.tolist()
+        return obj
+
+    with open(output_path, "w") as f:
+        json.dump(make_serializable(final_results), f, indent=2)
+    print(f"\n  Results saved to {output_path}")
+
+    return final_results
+
+
+def print_comparison_summary(all_results: Dict[str, Dict[str, Any]]) -> None:
+    """Print a side-by-side comparison of all experiments."""
+    print(f"\n{'═' * 70}")
+    print("  BERT BASELINE COMPARISON SUMMARY")
+    print(f"{'═' * 70}")
+
+    # Header
+    modes = list(all_results.keys())
+    header = f"{'Metric':<30}" + "".join(f"{m:>16}" for m in modes) + f"{'LexiMind':>16}"
+    print(f"\n  {header}")
+    print(f"  {'─' * len(header)}")
+
+    # LexiMind reference values
+    lexmind = {
+        "topic_accuracy": 0.8571,
+        "topic_macro_f1": 0.8539,
+        "emotion_sample_f1": 0.3523,
+        "emotion_macro_f1": 0.1432,
+        "emotion_micro_f1": 0.4430,
+        "emotion_tuned_macro_f1": 0.2936,
+    }
+
+    # Topic metrics
+    print("\n  Topic Classification")
+    for metric_name, display_name in [
+        ("accuracy", "Accuracy"),
+        ("macro_f1", "Macro F1"),
+    ]:
+        row = f"  {display_name:<30}"
+        for mode in modes:
+            eval_data = all_results[mode].get("evaluation", {})
+            topic = eval_data.get("topic", {})
+            val = topic.get(metric_name, None)
+            row += f"{val:>16.4f}" if val is not None else f"{'—':>16}"
+        lm_key = f"topic_{metric_name}"
+        row += f"{lexmind.get(lm_key, 0):>16.4f}"
+        print(row)
+
+    # Emotion metrics
+    print("\n  Emotion Detection")
+    for metric_name, display_name in [
+        ("sample_avg_f1", "Sample-avg F1 (τ=0.3)"),
+        ("macro_f1", "Macro F1 (τ=0.3)"),
+        ("micro_f1", "Micro F1 (τ=0.3)"),
+        ("tuned_macro_f1", "Tuned Macro F1"),
+        ("tuned_sample_avg_f1", "Tuned Sample-avg F1"),
+    ]:
+        row = f"  {display_name:<30}"
+        for mode in modes:
+            eval_data = all_results[mode].get("evaluation", {})
+            emo = eval_data.get("emotion", {})
+            val = emo.get(metric_name, None)
+            row += f"{val:>16.4f}" if val is not None else f"{'—':>16}"
+        lm_key = f"emotion_{metric_name}"
+        row += f"{lexmind.get(lm_key, 0):>16.4f}"
+        print(row)
+
+    # Training time
+    print("\n  Training Time")
+    row = f"  {'Hours':<30}"
+    for mode in modes:
+        t = all_results[mode].get("training", {}).get("total_time_seconds", 0) / 3600
+        row += f"{t:>15.1f}h"
+    row += f"{'~9.0h':>16}"
+    print(row)
+
+    print(f"\n{'═' * 70}\n")
+
+
+def main():
+    parser = argparse.ArgumentParser(description="BERT Baseline Training for LexiMind")
+    parser.add_argument(
+        "--mode",
+        type=str,
+        required=True,
+        choices=["single-topic", "single-emotion", "multitask", "all"],
+        help="Training mode",
+    )
+    parser.add_argument("--epochs", type=int, default=None, help="Override max epochs")
+    parser.add_argument("--lr", type=float, default=None, help="Override learning rate")
+    parser.add_argument("--batch-size", type=int, default=None, help="Override batch size")
+    parser.add_argument(
+        "--model", type=str, default="bert-base-uncased", help="HuggingFace model name"
+    )
+    args = parser.parse_args()
+
+    config = BertBaselineConfig()
+    config.model_name = args.model
+    if args.epochs is not None:
+        config.max_epochs = args.epochs
+    if args.lr is not None:
+        config.lr = args.lr
+    if args.batch_size is not None:
+        config.batch_size = args.batch_size
+
+    if args.mode == "all":
+        modes = ["single-topic", "single-emotion", "multitask"]
+    else:
+        modes = [args.mode]
+
+    all_results: Dict[str, Dict[str, Any]] = {}
+    for mode in modes:
+        results = run_experiment(mode, config)
+        all_results[mode] = results
+
+        # Clear GPU memory between experiments
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+
+    # Save combined results
+    if len(all_results) > 1:
+        combined_path = config.output_dir / "combined_results.json"
+
+        def make_serializable(obj):
+            if isinstance(obj, dict):
+                return {k: make_serializable(v) for k, v in obj.items() if not k.startswith("_")}
+            if isinstance(obj, list):
+                return [make_serializable(item) for item in obj]
+            if isinstance(obj, (np.integer, np.int64)):
+                return int(obj)
+            if isinstance(obj, (np.floating, np.float64)):
+                return float(obj)
+            if isinstance(obj, np.ndarray):
+                return obj.tolist()
+            return obj
+
+        with open(combined_path, "w") as f:
+            json.dump(make_serializable(all_results), f, indent=2)
+        print(f"  Combined results saved to {combined_path}")
+
+    print_comparison_summary(all_results)
+
+
+if __name__ == "__main__":
+    main()