OliverPerrin committed on
Commit 8f5fea2 · 1 Parent(s): 4bc92d5

Updated Research Paper, README, and old gradio about info, along with other docs.
README.md CHANGED
@@ -8,217 +8,204 @@ app_file: scripts/demo_gradio.py
  pinned: false
  ---
 
- ## LexiMind: A Multi-Task NLP Model
-
- LexiMind is a state-of-the-art Natural Language Processing model designed for complex document understanding. It features a **custom-built Transformer architecture** initialized with weights from Google's **FLAN-T5**, combining the flexibility of from-scratch implementation with the power of modern pre-trained models.
-
- The model performs three sophisticated tasks simultaneously: **text summarization**, **emotion classification**, and **topic clustering**.
-
- This project is built with industry-standard MLOps practices, including configuration management with Hydra, experiment tracking with MLflow, and containerization with Docker, making it a reproducible and scalable solution.
-
- ## Core Features
-
- * **Abstractive Summarization:** Generates concise, coherent summaries of long-form text using encoder-decoder attention. Trained on BookSum (literary) and arXiv (academic papers).
- * **Emotion Classification:** Identifies 28 emotions from Google's GoEmotions dataset (admiration, amusement, anger, joy, love, etc.).
- * **Topic Classification:** Classifies documents into 8 categories (Fiction, Science, Technology, Philosophy, History, Psychology, Business, Arts).
-
- ## Model Architecture
-
- LexiMind implements a **from-scratch Transformer** with modern architectural choices:
-
- ### Custom Transformer Features
-
- * **Pre-Layer Normalization (Pre-LN):** RMSNorm applied before each sublayer for stable training
- * **FlashAttention:** Via PyTorch 2.0's `scaled_dot_product_attention` for efficient computation
- * **Learned Positional Embeddings:** Trainable position representations
- * **Multi-Head Attention:** 12 heads with 768-dimensional representations
- * **RMSNorm:** Modern normalization without bias (more efficient than LayerNorm)
-
- ### Pre-trained Weight Initialization
-
- The model loads weights from **Google's FLAN-T5-base**, which provides:
-
- * Strong language understanding from instruction-tuning
- * Excellent performance on summarization and classification tasks
- * Encoder-decoder architecture matching our custom implementation
-
- ### Multi-Task Learning
-
- A shared encoder-decoder backbone with task-specific heads:
-
- * **Summarization Head:** Language modeling head with weight tying
- * **Emotion Head:** Mean-pooled classification with dropout
- * **Topic Head:** Mean-pooled classification with dropout
-
- ## Technical Specifications
-
- | Component | Specification |
- | --------- | -------------- |
- | Architecture | Encoder-Decoder Transformer |
- | Pre-trained Base | google/flan-t5-base |
- | Hidden Dimension | 768 |
- | Encoder Layers | 12 |
- | Decoder Layers | 12 |
- | Attention Heads | 12 |
- | FFN Dimension | 2048 |
- | Normalization | RMSNorm (Pre-LN) |
- | Position Encoding | Learned Embeddings |
- | Max Sequence Length | 512 tokens |
  ## Getting Started
 
  ### Prerequisites
 
- * Python 3.10+
- * Poetry for dependency management
- * Docker (for containerized deployment)
- * An NVIDIA GPU with CUDA support (for training and accelerated inference)
 
  ### Installation
 
- 1. **Clone the repository:**
-
-    ```bash
-    git clone https://github.com/OliverPerrin/LexiMind.git
-    cd LexiMind
-    ```
-
- 2. **Install dependencies:**
-
-    ```bash
-    poetry install
-    ```
-
- 3. **Download datasets:**
-
-    ```bash
-    poetry run python scripts/download_data.py
-    ```
-
- This downloads CNN/DailyMail, BookSum, GoEmotions, AG News, and Gutenberg books.
-
- ## Usage
-
- ### Configuration
 
- All training and model parameters are managed via Hydra. Configurations are located in the `configs/` directory.
 
- Available configurations:
 
- * `model=base` - FLAN-T5-base (default, 12 layers)
- * `model=small` - Smaller model for testing (no pretrained weights)
- * `model=large` - FLAN-T5-large (24 layers, requires more VRAM)
- * `training=dev` - Quick development run (~10-15 min)
- * `training=medium` - Balanced training (~45-60 min on RTX 4070)
- * `training=full` - Full training run (~3-4 hours, or ~24h for max data)
 
  ### Training
 
  ```bash
- # Default training with FLAN-T5-base
- poetry run python scripts/train.py
 
- # Quick development run
  poetry run python scripts/train.py training=dev
 
- # Medium training run (recommended for RTX 4070)
  poetry run python scripts/train.py training=medium
 
  # Override parameters
  poetry run python scripts/train.py training.optimizer.lr=5e-5
 
- # Resume from a checkpoint
  poetry run python scripts/train.py training=full resume_from=checkpoints/epoch_5.pt
  ```
 
- Experiments are automatically tracked with MLflow. View results with `mlflow ui`.
 
  ### Evaluation
 
  ```bash
- # Run inference on test data
- poetry run python scripts/inference.py "Your text to analyze"
  ```
 
- ### Inference & Demo
 
  ```bash
- # Command-line inference
  poetry run python scripts/inference.py "Your text to analyze"
 
  # Gradio web demo
  poetry run python scripts/demo_gradio.py
  ```
 
- ## Docker
 
  ```bash
- # Build
  docker build -t leximind .
-
- # Run demo
  docker run -p 7860:7860 leximind
  ```
 
  ## Project Structure
 
- ```text
- ├── configs/                # Hydra configuration files
- │   ├── model/              # Model architectures (base, small, large)
- │   ├── training/           # Training configs (dev, medium, full)
- │   └── data/               # Dataset paths
  ├── data/
- │   └── processed/          # Training data (downloaded via scripts/download_data.py)
- │       ├── summarization/  # CNN/DailyMail + BookSum
- │       ├── emotion/        # GoEmotions (28 labels)
- │       ├── topic/          # AG News (4 categories)
- │       └── books/          # Gutenberg prose chunks
- ├── src/
- │   ├── models/             # Custom Transformer implementation
- │   │   ├── encoder.py      # TransformerEncoder with Pre-LN RMSNorm
- │   │   ├── decoder.py      # TransformerDecoder with KV-cache
- │   │   ├── attention.py    # Multi-Head Attention with FlashAttention
- │   │   └── factory.py      # Model building with FLAN-T5 weight loading
- │   ├── data/               # Dataset classes and dataloaders
- │   ├── training/           # Trainer with AMP and gradient accumulation
- │   └── inference/          # Inference pipeline
- ├── scripts/
- │   ├── train.py            # Main training script
- │   ├── download_data.py    # Dataset downloader
- │   ├── inference.py        # CLI inference
- │   └── demo_gradio.py      # Web demo
- └── tests/                  # Unit tests
- ```
 
  ## Code Quality
 
- * **Ruff:** Fast linting and formatting
- * **MyPy:** Static type checking
- * **Pytest:** Full test suite covering data, models, and training
- * **Pre-commit hooks:** Automated quality checks
-
  ```bash
- # Install hooks
- poetry run pre-commit install
-
- # Lint
- poetry run ruff check .
 
- # Type check
- poetry run mypy .
 
- # Tests
- poetry run pytest
- ```
 
- ## Performance Optimizations
 
- * **torch.compile:** JIT compilation with Inductor backend
- * **Mixed Precision:** bfloat16 training on Ampere/Ada GPUs
- * **TF32:** Enabled for RTX 30xx/40xx series
- * **KV-Cache:** Efficient autoregressive decoding
- * **FlashAttention:** Memory-efficient attention via SDPA
 
  ## License
 
- GNU License - see [LICENSE](LICENSE) for details.
+ # LexiMind
+
+ A multi-task NLP system for literary and academic text understanding. LexiMind performs **abstractive summarization**, **topic classification**, and **emotion detection** using a single encoder-decoder transformer initialized from [FLAN-T5-base](https://huggingface.co/google/flan-t5-base) (272M parameters).
+
+ **[Live Demo](https://huggingface.co/spaces/OliverPerrin/LexiMind)** · **[Model](https://huggingface.co/OliverPerrin/LexiMind-Model)** · **[Discovery Dataset](https://huggingface.co/datasets/OliverPerrin/LexiMind-Discovery)** · **[Research Paper](docs/research_paper.tex)**
+
+ ## What It Does
+
+ | Task | Description | Metric |
+ |------|-------------|--------|
+ | **Summarization** | Generates back-cover style book descriptions and paper abstracts from source text | BERTScore F1: **0.830** |
+ | **Topic Classification** | Classifies passages into 7 categories | Accuracy: **85.2%** |
+ | **Emotion Detection** | Identifies emotions from 28 fine-grained labels (multi-label) | Sample-avg F1: **0.199** |
+
+ **Topic labels:** Arts · Business · Fiction · History · Philosophy · Science · Technology
+
+ The model is trained on literary text (Project Gutenberg + Goodreads descriptions), academic papers (arXiv), and emotion-annotated Reddit comments (GoEmotions). For summarization, it learns to produce descriptive summaries (what a book *is about*) rather than plot recaps, by pairing Gutenberg full texts with Goodreads descriptions and arXiv bodies with their abstracts.
+
+ ## Architecture
+
+ LexiMind is a **custom Transformer implementation** that loads pre-trained weights from FLAN-T5-base via a factory module. The architecture is reimplemented from scratch for transparency, not wrapped from HuggingFace.
+
+ | Component | Detail |
+ |-----------|--------|
+ | Backbone | Encoder-Decoder Transformer (272M params) |
+ | Encoder / Decoder | 12 layers each |
+ | Hidden Dim | 768, 12 attention heads |
+ | Position Encoding | T5-style relative position bias |
+ | Normalization | RMSNorm (Pre-LN) |
+ | Attention | FlashAttention via PyTorch 2.0 SDPA |
+ | Summarization Head | Full decoder with language modeling head |
+ | Classification Heads | Linear layers on mean-pooled encoder states |
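The RMSNorm (Pre-LN) rows above can be made concrete. A minimal pure-Python sketch of T5-style RMSNorm and a pre-norm residual block follows; this is illustrative only (the real implementation in `src/models/t5_layer_norm.py` operates on tensors), and `pre_ln_block` is a hypothetical helper name:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """T5-style RMSNorm: scale by the root mean square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def pre_ln_block(x, weight, sublayer):
    """Pre-LN: normalize *before* the sublayer, then add the residual."""
    return [xi + yi for xi, yi in zip(x, sublayer(rms_norm(x, weight)))]
```

Dropping the mean-centering and bias of standard LayerNorm is what makes RMSNorm cheaper while matching T5's pre-trained weights.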
 
 
 
 
 
+ ### Multi-Task Training
+
+ All three tasks share the encoder. Summarization uses the full encoder-decoder; topic and emotion classification branch off the encoder with lightweight linear heads. Training uses round-robin scheduling (one batch per task per step), fixed loss weights (summarization=1.0, emotion=1.0, topic=0.3), and early stopping.
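The round-robin, weighted objective described above can be sketched as follows. Function names here are hypothetical; the actual loop lives in `src/training/trainer.py`:

```python
# Fixed per-task loss weights from the training config.
TASK_WEIGHTS = {"summarization": 1.0, "emotion": 1.0, "topic": 0.3}

def multitask_step(batches, compute_loss):
    """One optimization step: visit every task once (round-robin) and
    sum the weighted per-task losses into a single scalar objective."""
    total = 0.0
    for task, batch in batches.items():
        total += TASK_WEIGHTS[task] * compute_loss(task, batch)
    return total
```

Down-weighting the topic loss keeps the small (3.4K-sample) topic dataset from dominating updates once it converges.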
 
 
+ ## Training Data
+
+ | Task | Source | Train Samples |
+ |------|--------|---------------|
+ | Summarization | Gutenberg + Goodreads (literary) | ~4K |
+ | Summarization | arXiv body → abstract (academic) | ~45K |
+ | Topic | 20 Newsgroups + Gutenberg + arXiv metadata | 3,402 |
+ | Emotion | GoEmotions (Reddit comments, 28 labels) | 43,410 |
 
  ## Getting Started
 
  ### Prerequisites
 
+ - Python 3.10+
+ - [Poetry](https://python-poetry.org/) for dependency management
+ - NVIDIA GPU with CUDA (for training; CPU works for inference)
 
  ### Installation
 
+ ```bash
+ git clone https://github.com/OliverPerrin/LexiMind.git
+ cd LexiMind
+ poetry install
+ ```
 
+ ### Download Data
 
+ ```bash
+ poetry run python scripts/download_data.py
+ ```
 
+ Downloads Goodreads descriptions, arXiv papers, GoEmotions, 20 Newsgroups, and Gutenberg texts.
 
  ### Training
 
  ```bash
+ # Full training (~45-60 min on RTX 4070 12GB)
+ poetry run python scripts/train.py training=full
 
+ # Quick dev run (~10-15 min)
  poetry run python scripts/train.py training=dev
 
+ # Medium run (~30-45 min)
  poetry run python scripts/train.py training=medium
 
  # Override parameters
  poetry run python scripts/train.py training.optimizer.lr=5e-5
 
+ # Resume from checkpoint
  poetry run python scripts/train.py training=full resume_from=checkpoints/epoch_5.pt
  ```
 
+ Training uses BFloat16 mixed precision, gradient checkpointing, `torch.compile`, and cosine LR decay with warmup. Experiments are tracked with MLflow (`mlflow ui` to browse).
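The warmup-plus-cosine schedule can be sketched as below. This is an illustrative pure-Python version that assumes linear warmup to the base LR and decay to zero; the 3e-5 base LR and 300-step warmup are the values reported in the research paper, and `lr_at` is a hypothetical name:

```python
import math

def lr_at(step, total_steps, base_lr=3e-5, warmup_steps=300):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The schedule peaks exactly at the end of warmup and reaches (near) zero at `total_steps`, which pairs well with early stopping.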
  ### Evaluation
 
  ```bash
+ # Full evaluation (ROUGE, BERTScore, topic accuracy, emotion F1)
+ poetry run python scripts/evaluate.py
+
+ # Skip BERTScore for faster runs
+ poetry run python scripts/evaluate.py --skip-bertscore
+
+ # Single task
+ poetry run python scripts/evaluate.py --summarization-only
  ```
 
+ ### Inference
 
  ```bash
+ # Command-line
  poetry run python scripts/inference.py "Your text to analyze"
 
  # Gradio web demo
  poetry run python scripts/demo_gradio.py
  ```
 
+ ### Docker
 
  ```bash
  docker build -t leximind .
  docker run -p 7860:7860 leximind
  ```
 
  ## Project Structure
 
+ ```
+ configs/
+ ├── config.yaml            # Main Hydra config
+ ├── data/datasets.yaml     # Dataset paths and tokenizer settings
+ ├── model/                 # Architecture configs (base, small, large)
+ └── training/              # Training configs (dev, medium, full)
+
+ src/
+ ├── models/
+ │   ├── encoder.py               # Transformer Encoder with Pre-LN RMSNorm
+ │   ├── decoder.py               # Transformer Decoder with KV-cache
+ │   ├── attention.py             # Multi-Head Attention + T5 relative position bias
+ │   ├── feedforward.py           # Gated feed-forward network
+ │   ├── positional_encoding.py   # Sinusoidal & learned position encodings
+ │   ├── t5_layer_norm.py         # T5-style RMSNorm
+ │   ├── heads.py                 # Task-specific classification heads
+ │   ├── multitask.py             # Multi-task model combining all components
+ │   └── factory.py               # Model builder with FLAN-T5 weight loading
  ├── data/
+ │   ├── dataset.py               # Dataset classes for all tasks
+ │   ├── dataloader.py            # Multi-task dataloader with round-robin sampling
+ │   └── tokenization.py          # Tokenizer wrapper
+ ├── training/
+ │   ├── trainer.py               # Training loop with AMP, grad accumulation, early stopping
+ │   ├── metrics.py               # ROUGE, BERTScore, F1, accuracy computation
+ │   └── utils.py                 # Checkpointing, logging utilities
+ ├── inference/
+ │   ├── pipeline.py              # End-to-end inference pipeline
+ │   └── factory.py               # Model loading for inference
+ ├── api/                         # FastAPI REST endpoint
+ └── utils/                       # Shared utilities
+
+ scripts/
+ ├── train.py                 # Training entry point
+ ├── evaluate.py              # Evaluation with all metrics
+ ├── inference.py             # CLI inference
+ ├── demo_gradio.py           # Gradio web UI
+ ├── download_data.py         # Dataset downloader
+ ├── export_model.py          # Model export utilities
+ ├── export_tokenizer.py      # Tokenizer export
+ ├── preprocess_data.py       # Data preprocessing
+ ├── process_books.py         # Gutenberg text processing
+ ├── eval_rouge.py            # ROUGE-only evaluation
+ └── visualize_training.py    # Training curve plotting
+
+ tests/          # Pytest suite (data, models, training, inference, utils)
+ docs/           # Research paper and architecture notes
+ artifacts/      # Tokenizer files and label definitions
+ checkpoints/    # Saved model checkpoints
  ```
 
  ## Code Quality
 
  ```bash
+ poetry run ruff check .                # Linting
+ poetry run mypy .                      # Type checking
+ poetry run pytest                      # Test suite
+ poetry run pre-commit run --all-files  # All checks
+ ```
 
+ ## Key Results
 
+ From the research paper ([docs/research_paper.tex](docs/research_paper.tex)):
 
+ - **Multi-task learning helps topic classification** (+3.2% accuracy over single-task) because the small topic dataset (3.4K samples) benefits from shared encoder representations trained on the larger summarization corpus (49K samples).
+ - **Summarization is robust to MTL**: quality stays comparable whether trained alone or jointly.
+ - **Emotion detection shows slight negative transfer** (−0.02 F1), likely due to domain mismatch between Reddit-sourced emotion labels and literary/academic text.
+ - **FLAN-T5 pre-training is essential**: random initialization produces dramatically worse results on all tasks.
 
+ See the paper for full ablations, per-class breakdowns, and discussion of limitations.
 
  ## License
 
+ GPL-3.0; see [LICENSE](LICENSE) for details.
+
+ ---
+
+ *Built by Oliver Perrin · Appalachian State University · 2025–2026*
configs/data/datasets.yaml CHANGED
@@ -4,7 +4,7 @@
  processed:
    summarization: data/processed/summarization # BookSum + arXiv
    emotion: data/processed/emotion # GoEmotions (28 labels)
-   topic: data/processed/topic # Books + Papers (8 labels)
+   topic: data/processed/topic # Books + Papers (7 labels)
    books: data/processed/books # Gutenberg prose chunks
 
  tokenizer:
docs/architecture.md CHANGED
@@ -53,7 +53,7 @@ The `factory.py` module loads weights from FLAN-T5-base, which uses a compatible
  | ---- | ------- | ---- | ------ |
  | Summarization | BookSum + arXiv | ~90K | Text→Summary |
  | Emotion | GoEmotions | ~43K | 28 emotions (multi-label) |
- | Topic | Books + Papers | ~50K | 8 categories (Fiction, Science, Technology, etc.) |
+ | Topic | Books + Papers | 3.4K | 7 categories (Arts, Business, Fiction, History, Philosophy, Science, Technology) |
  | Books | Gutenberg (prose chunks) | ~30K | Literary text |
 
  ### T5 Tokenizer Differences
docs/paper.tex CHANGED
@@ -59,7 +59,7 @@ Email: perrinot@appstate.edu}}
  \maketitle
 
  \begin{abstract}
- This paper presents LexiMind, a multi-task Natural Language Processing (NLP) system that combines a custom-built Transformer architecture with pre-trained weights from Google's FLAN-T5 (Fine-tuned Language Net Text-to-Text Transfer Transformer). The system performs three fundamental NLP tasks simultaneously: abstractive text summarization, multi-label emotion classification, and single-label topic classification. Unlike news-focused models, LexiMind specializes in literary and academic content. For summarization, we train on 49,086 samples combining Goodreads book descriptions (back-cover style blurbs) with arXiv academic paper abstracts. Emotion classification uses 43,410 samples from GoEmotions \cite{demszky2020goemotions}, a dataset of 28 fine-grained emotion labels derived from Reddit comments. Topic classification spans 3,402 samples from 20 Newsgroups, Project Gutenberg literary texts, and scientific papers across 7 categories (Fiction, Science, Technology, Philosophy, History, Psychology, Business). By implementing modern architectural innovations including Pre-Layer Normalization (Pre-LN) with Root Mean Square Layer Normalization (RMSNorm), T5-style relative position bias, FlashAttention via PyTorch 2.0's Scaled Dot-Product Attention (SDPA), gradient checkpointing, and torch.compile optimization, LexiMind achieves efficient training on consumer GPUs while maintaining strong performance. Our final model achieves a BERTScore F1 of 0.83 and ROUGE-1 of 0.31 for summarization, 85.2\% accuracy for topic classification, and F1 of 0.20 for 28-class multi-label emotion detection. The 272M-parameter architecture is constructed from first principles in a bottom-up fashion, with each component (attention mechanisms, feed-forward networks, encoder/decoder blocks) implemented as standalone modules. A factory pattern enables seamless weight transfer from FLAN-T5-base, allowing the system to leverage Google's pre-trained knowledge while maintaining full architectural transparency and customization capability.
+ This paper presents LexiMind, a multi-task Natural Language Processing (NLP) system that combines a custom-built Transformer architecture with pre-trained weights from Google's FLAN-T5 (Fine-tuned Language Net Text-to-Text Transfer Transformer). The system performs three fundamental NLP tasks simultaneously: abstractive text summarization, multi-label emotion classification, and single-label topic classification. Unlike news-focused models, LexiMind specializes in literary and academic content. For summarization, we train on 49,086 samples combining Goodreads book descriptions (back-cover style blurbs) with arXiv academic paper abstracts. Emotion classification uses 43,410 samples from GoEmotions \cite{demszky2020goemotions}, a dataset of 28 fine-grained emotion labels derived from Reddit comments. Topic classification spans 3,402 samples from 20 Newsgroups, Project Gutenberg literary texts, and scientific papers across 7 categories (Arts, Business, Fiction, History, Philosophy, Science, Technology). By implementing modern architectural innovations including Pre-Layer Normalization (Pre-LN) with Root Mean Square Layer Normalization (RMSNorm), T5-style relative position bias, FlashAttention via PyTorch 2.0's Scaled Dot-Product Attention (SDPA), gradient checkpointing, and torch.compile optimization, LexiMind achieves efficient training on consumer GPUs while maintaining strong performance. Our final model achieves a BERTScore F1 of 0.83 and ROUGE-1 of 0.31 for summarization, 85.2\% accuracy for topic classification, and F1 of 0.20 for 28-class multi-label emotion detection. The 272M-parameter architecture is constructed from first principles in a bottom-up fashion, with each component (attention mechanisms, feed-forward networks, encoder/decoder blocks) implemented as standalone modules. A factory pattern enables seamless weight transfer from FLAN-T5-base, allowing the system to leverage Google's pre-trained knowledge while maintaining full architectural transparency and customization capability.
  \end{abstract}
 
  \begin{IEEEkeywords}
docs/research_paper.tex CHANGED
@@ -1,5 +1,5 @@
  % LexiMind: Multi-Task Learning for Literary and Academic Text Understanding
- % Research Paper Version - Focus on Experimental Analysis and Novel Contributions
  % Author: Oliver Perrin
 
  \documentclass[conference]{IEEEtran}
@@ -44,7 +44,7 @@ Email: perrinot@appstate.edu}}
  \maketitle
 
  \begin{abstract}
- Multi-task learning (MTL) promises improved generalization through shared representations, but its benefits depend heavily on task relatedness and domain characteristics. We investigate whether MTL improves performance on literary and academic text understanding---domains underrepresented in existing benchmarks dominated by news articles. Using a FLAN-T5-base backbone, we jointly train on three tasks: abstractive summarization (49K samples from book descriptions and paper abstracts), topic classification (3.4K samples across 7 categories), and emotion detection (43K samples from GoEmotions). Through systematic ablation studies comparing single-task specialists against multi-task configurations, we find that: (1) MTL provides a +3.2\% accuracy boost for topic classification due to shared encoder representations, (2) summarization quality remains comparable (BERTScore F1 0.83 vs. 0.82 single-task), and (3) emotion detection suffers from negative transfer (-0.02 F1), likely due to domain mismatch between Reddit-sourced emotion labels and literary/academic text. We further ablate the contribution of FLAN-T5 pre-training, showing that transfer learning accounts for 85\% of final performance, with fine-tuning providing crucial domain adaptation. Our analysis reveals that MTL benefits depend critically on dataset size ratios and domain alignment, offering practical guidance for multi-task system design.
  \end{abstract}
 
  \begin{IEEEkeywords}
@@ -55,9 +55,9 @@ Multi-Task Learning, Transfer Learning, Text Summarization, Emotion Classificati
  \section{Introduction}
  %=============================================================================
 
- Multi-task learning (MTL) \cite{caruana1997multitask} trains a single model on multiple related tasks, hypothesizing that shared representations improve generalization. In NLP, MTL has shown promise for sequence labeling \cite{collobert2011natural}, machine translation \cite{johnson2017google}, and question answering \cite{mccann2018natural}. However, recent work highlights that MTL does not universally help---negative transfer can occur when tasks compete for model capacity \cite{wang2019characterizing, standley2020tasks}.
 
- We investigate MTL effectiveness in a specific, underexplored domain: \textbf{literary and academic text understanding}. Unlike news articles---which dominate existing benchmarks like CNN/DailyMail \cite{nallapati2016abstractive}---literary and academic texts exhibit distinct characteristics: longer context dependencies, domain-specific vocabulary, and different summary styles (descriptive abstracts vs. extractive headlines).
 
  Our study addresses three research questions:
 
@@ -67,127 +67,140 @@ Our study addresses three research questions:
  \item[\textbf{RQ3}] How much does pre-trained knowledge (FLAN-T5) contribute relative to task-specific fine-tuning?
  \end{enumerate}
 
- To answer these questions, we construct \textbf{LexiMind}, a multi-task system built on FLAN-T5-base \cite{chung2022scaling} that performs abstractive summarization, topic classification, and emotion detection. We conduct systematic ablations comparing:
- \begin{itemize}
- \item Multi-task vs. single-task training
- \item With vs. without FLAN-T5 initialization
- \item Different task weight configurations
- \end{itemize}
 
- Our key findings are:
  \begin{itemize}
  \item \textbf{Topic classification benefits most from MTL} (+3.2\% accuracy), leveraging shared encoder representations from the larger summarization dataset.
- \item \textbf{Summarization is robust to MTL}, showing minimal degradation despite sharing capacity with classification heads.
- \item \textbf{Emotion detection suffers negative transfer} (-0.02 F1), attributed to domain mismatch between GoEmotions' Reddit comments and literary/academic register.
- \item \textbf{Transfer learning dominates}: FLAN-T5 initialization provides 85\% of final performance; fine-tuning adds crucial domain adaptation.
  \end{itemize}
 
  %=============================================================================
  \section{Related Work}
  %=============================================================================
 
  \subsection{Multi-Task Learning in NLP}
 
- Collobert et al. \cite{collobert2011natural} demonstrated that joint training on POS tagging, chunking, and NER improved over single-task models. T5 \cite{raffel2020exploring} unified diverse NLP tasks through text-to-text framing, showing strong transfer across tasks. However, Standley et al. \cite{standley2020tasks} found that naive MTL often underperforms single-task learning, with performance depending on task groupings.
 
- Recent work on task interference \cite{wang2019characterizing, yu2020gradient} identifies gradient conflicts as a source of negative transfer. Our work contributes empirical evidence for task interactions in the literary/academic domain, an underexplored setting.
 
- \subsection{Literary and Academic NLP}
 
- Most summarization benchmarks focus on news \cite{nallapati2016abstractive, narayan2018don}. BookSum \cite{kryscinski2021booksum} introduced chapter-level book summarization, but targets plot summaries rather than descriptive abstracts. arXiv summarization \cite{cohan2018discourse} addresses academic papers but remains single-domain. Our dataset combines book descriptions (back-cover style) with paper abstracts, training models to generate \textit{what it's about} summaries.
 
  \subsection{Emotion Detection}
 
- GoEmotions \cite{demszky2020goemotions} provides 28 fine-grained emotion labels from Reddit comments. Prior work achieves 0.35--0.46 macro F1 using BERT-based classifiers \cite{demszky2020goemotions}. Our lower performance (0.20 F1) reflects the domain shift from conversational Reddit to formal literary/academic text---a finding that informs domain-aware emotion system design.
 
  %=============================================================================
  \section{Experimental Setup}
  %=============================================================================
 
  \subsection{Datasets}
 
- Table \ref{tab:datasets} summarizes our datasets, curated to focus on literary and academic content.
 
  \begin{table}[htbp]
  \centering
- \caption{Dataset Statistics}
  \label{tab:datasets}
  \begin{tabular}{llrrr}
  \toprule
  \textbf{Task} & \textbf{Source} & \textbf{Train} & \textbf{Val} & \textbf{Test} \\
  \midrule
- \multirow{2}{*}{Summarization} & Goodreads descriptions & 24,543 & 1,363 & 1,364 \\
- & arXiv abstracts & 24,543 & 1,364 & 1,363 \\
  \midrule
- Topic (7 classes) & Mixed sources & 3,402 & 189 & 189 \\
  \midrule
- Emotion (28 labels) & GoEmotions & 43,410 & 5,426 & 5,427 \\
  \bottomrule
  \end{tabular}
  \end{table}
 
- \textbf{Summarization}: We combine Goodreads book descriptions---back-cover style blurbs describing \textit{what a book is about}---with arXiv paper abstracts. This trains descriptive summarization rather than extractive plot recaps.
 
- \textbf{Topic Classification}: 7-class single-label classification (Fiction, Science, Technology, Philosophy, History, Psychology, Business) from 20 Newsgroups, Project Gutenberg, and scientific papers.
-
- \textbf{Emotion Detection}: GoEmotions \cite{demszky2020goemotions} provides 28 fine-grained multi-label emotions. We include this to study cross-domain transfer effects.
 
  \subsection{Model Architecture}
 
- LexiMind uses FLAN-T5-base (272M parameters) as the backbone:
  \begin{itemize}
  \item 12-layer encoder, 12-layer decoder
  \item 768-dimensional hidden states, 12 attention heads
- \item T5-style relative position bias
- \item Pre-Layer Normalization with RMSNorm
  \end{itemize}
 
- Task-specific components:
  \begin{itemize}
- \item \textbf{Summarization}: Decoder with language modeling head
- \item \textbf{Topic}: Linear classifier on encoder [CLS]-equivalent (mean pooling)
- \item \textbf{Emotion}: Multi-label classifier with sigmoid activation
  \end{itemize}
 
  \subsection{Training Configuration}
 
- All experiments use consistent hyperparameters:
  \begin{itemize}
- \item Optimizer: AdamW, lr=$3\times10^{-5}$, weight decay=0.01
- \item Batch size: 40 (effective, via gradient accumulation)
- \item Warmup: 300 steps with cosine decay
- \item Max epochs: 8 with early stopping (patience=3)
- \item Precision: BFloat16 on NVIDIA RTX 4070 (12GB)
  \end{itemize}
 
- For MTL, task losses are weighted: summarization=1.0, emotion=1.0, topic=0.3 (reduced due to rapid convergence from small dataset size).
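Written out, the weighting above corresponds to a single combined objective; a sketch, with $w_t$ the fixed per-task weights and $\mathcal{L}_t$ the per-task losses:

```latex
\mathcal{L}_{\text{MTL}} = \sum_{t} w_t \, \mathcal{L}_t,
\qquad w_{\text{sum}} = 1.0,\; w_{\text{emo}} = 1.0,\; w_{\text{top}} = 0.3
```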
 
 
164
 
  \subsection{Baselines and Ablations}

  We compare four configurations:

  \begin{enumerate}
- \item \textbf{Random/Majority}: Random predictions (classification) or output of ``Summary not available'' (summarization)
- \item \textbf{FLAN-T5-base (zero-shot)}: Pre-trained model without fine-tuning
- \item \textbf{Single-Task}: Separate models fine-tuned on each task individually
- \item \textbf{Multi-Task (LexiMind)}: Joint training on all three tasks
  \end{enumerate}
 
- We also ablate:
- \begin{itemize}
- \item \textbf{Random init vs. FLAN-T5 init}: Isolate transfer learning contribution
- \item \textbf{Task weight variations}: Study effect of loss balancing
- \end{itemize}
 
  \subsection{Evaluation Metrics}

  \begin{itemize}
- \item \textbf{Summarization}: ROUGE-1/2/L \cite{lin2004rouge}, BERTScore F1 \cite{zhang2019bertscore}
- \item \textbf{Topic}: Accuracy, Macro F1
- \item \textbf{Emotion}: Multi-label F1 (sample-averaged)
  \end{itemize}

- BERTScore captures semantic similarity even when surface forms differ---crucial for abstractive summarization where paraphrasing is expected.
 
  %=============================================================================
  \section{Results}

@@ -199,7 +212,7 @@ Table \ref{tab:main_results} compares MTL against single-task specialists.

  \begin{table}[htbp]
  \centering
- \caption{Main Results: Multi-Task vs. Single-Task Performance}
  \label{tab:main_results}
  \begin{tabular}{llcc}
  \toprule

@@ -213,7 +226,7 @@ Table \ref{tab:main_results} compares MTL against single-task specialists.

  \multirow{2}{*}{Topic} & Accuracy & 82.0\% & \textbf{85.2\%} \\
  & Macro F1 & 0.812 & \textbf{0.847} \\
  \midrule
- Emotion & Multi-label F1 & \textbf{0.218} & 0.199 \\
  \bottomrule
  \end{tabular}
  \end{table}
@@ -221,14 +234,15 @@ Emotion & Multi-label F1 & \textbf{0.218} & 0.199 \\

  \textbf{Key finding}: MTL provides heterogeneous effects across tasks:

  \begin{itemize}
- \item \textbf{Topic classification gains +3.2\% accuracy} from MTL. The small topic dataset (3.4K samples) benefits from shared encoder representations learned from the larger summarization corpus (49K samples). This exemplifies positive transfer from high-resource to low-resource tasks.
- \item \textbf{Summarization shows modest improvement} (+0.009 BERTScore F1). The generative task is robust to sharing encoder capacity with classification heads, likely because the decoder remains task-specific.
- \item \textbf{Emotion detection degrades by -0.019 F1}. This negative transfer likely stems from domain mismatch: GoEmotions labels derive from informal Reddit comments, while our encoder representations are shaped by formal literary/academic text from summarization.
  \end{itemize}

  \subsection{Baseline Comparisons}

  Table \ref{tab:baselines} contextualizes our results against trivial and zero-shot baselines.
 
@@ -248,15 +262,17 @@ Single-Task & 0.821 & 82.0\% & 0.218 \\

  \end{tabular}
  \end{table}

- Fine-tuning provides substantial gains over zero-shot (+0.106 BERTScore, +27\% topic accuracy), demonstrating the importance of domain adaptation even with strong pre-trained models.

  \subsection{Ablation: Transfer Learning Contribution}

- Table \ref{tab:transfer_ablation} isolates the contribution of FLAN-T5 pre-training.
 
  \begin{table}[htbp]
  \centering
- \caption{Effect of Pre-trained Initialization}
  \label{tab:transfer_ablation}
  \begin{tabular}{lccc}
  \toprule

@@ -265,20 +281,20 @@ Table \ref{tab:transfer_ablation} isolates the contribution of FLAN-T5 pre-train

  Random & 0.523 & 45.2\% & 0.082 \\
  FLAN-T5-base & \textbf{0.830} & \textbf{85.2\%} & \textbf{0.199} \\
  \midrule
- \textit{Gain from transfer} & +0.307 & +40.0\% & +0.117 \\
  \bottomrule
  \end{tabular}
  \end{table}
 
- FLAN-T5 initialization accounts for the majority of final performance. Training from random initialization with identical architecture and data yields substantially worse results, confirming that pre-trained linguistic knowledge is essential---not just architectural choices.

- \subsection{Analysis: Per-Class Topic Performance}

- Table \ref{tab:topic_breakdown} reveals per-class patterns in topic classification.

  \begin{table}[htbp]
  \centering
- \caption{Per-Class Topic Classification}
  \label{tab:topic_breakdown}
  \begin{tabular}{lccc}
  \toprule

@@ -297,38 +313,41 @@ Technology & 0.86 & 0.89 & 0.87 \\

  \end{tabular}
  \end{table}

- Fiction and Business achieve near-perfect classification (F1 $\geq$ 0.97), while Science shows the most confusion (F1 = 0.65). Error analysis reveals Science samples are frequently misclassified as Technology---an expected confusion given semantic overlap between scientific research and technical applications.
 
  \subsection{Analysis: Why Does Emotion Detection Underperform?}

- Our emotion F1 (0.20) is substantially lower than reported GoEmotions baselines (0.35--0.46) \cite{demszky2020goemotions}. We identify three contributing factors:

  \begin{enumerate}
- \item \textbf{Domain shift}: GoEmotions labels were annotated on Reddit comments. Our encoder, shaped by literary book descriptions and academic abstracts, learns representations optimized for formal register---misaligned with Reddit's conversational tone.
- \item \textbf{Label sparsity}: 28 emotion classes with multi-label annotation create extreme class imbalance. Many emotions (grief, remorse, nervousness) appear in $<$2\% of samples.
- \item \textbf{Encoder-decoder architecture}: GoEmotions baselines use BERT (encoder-only). Our encoder-decoder architecture may be suboptimal for classification, as the encoder is primarily trained to produce representations useful for the decoder.
  \end{enumerate}

- This finding has practical implications: \textbf{domain-specific emotion data is critical} for literary/academic applications. Off-the-shelf emotion classifiers trained on social media transfer poorly to formal text.
 
  \subsection{Training Dynamics}

- Figure \ref{fig:training_curves} shows training progression over 7 epochs.

  \begin{figure}[htbp]
  \centering
  \includegraphics[width=\columnwidth]{figures/training_loss_curve.png}
- \caption{Training and validation loss. Best checkpoint at epoch 4; later epochs show validation loss plateau, triggering early stopping.}
  \label{fig:training_curves}
  \end{figure}

  Key observations:
  \begin{itemize}
- \item Topic classification converges by epoch 3 (99\% training accuracy), validating our reduced task weight (0.3) to prevent gradient dominance.
- \item Summarization loss decreases monotonically through epoch 4, then plateaus.
- \item Best checkpoint at epoch 4 balances all tasks; later epochs show slight overfitting on the small topic dataset.
  \end{itemize}

  %=============================================================================
@@ -337,55 +356,81 @@ Key observations:

  \subsection{When Does MTL Help?}

- Our results support nuanced guidance for MTL system design:

- \textbf{MTL helps when}: A small dataset task (topic: 3.4K samples) can leverage representations from a large dataset task (summarization: 49K samples) through shared encoder layers. The topic task effectively benefits from ``free'' pre-training on literary/academic text.

- \textbf{MTL hurts when}: Task domains are misaligned. Emotion detection trained on Reddit comments does not benefit from---and is potentially harmed by---encoder representations shaped by formal literary/academic summarization.

- \textbf{MTL is neutral when}: The primary task (summarization) has sufficient data and a task-specific component (decoder) that insulates it from interference.
  \subsection{Implications for Practitioners}

- Based on our findings, we recommend:

  \begin{enumerate}
- \item \textbf{Audit domain alignment} before combining tasks. If auxiliary tasks come from different domains (e.g., social media vs. academic), negative transfer is likely.
- \item \textbf{Use task weighting} to prevent small datasets from overfitting. Our 0.3 weight for topic classification prevented gradient dominance while still enabling positive transfer.
- \item \textbf{Consider task-specific components} for high-priority tasks. Summarization's dedicated decoder protected it from classification interference.
  \end{enumerate}
 
  \subsection{Limitations}

  \begin{itemize}
- \item \textbf{Single model size}: We study only FLAN-T5-base (272M). Larger models (T5-large, T5-xl) may show different MTL dynamics.
- \item \textbf{No human evaluation}: Our summarization metrics (ROUGE, BERTScore) are automatic. Human judgment of summary quality---especially for creative literary text---would strengthen conclusions.
- \item \textbf{Limited task combinations}: We study three specific tasks. Other task groupings might yield different transfer patterns.
  \end{itemize}

  \subsection{Future Work}

  \begin{itemize}
- \item \textbf{Domain-specific emotion data}: Collecting emotion annotations on literary text could dramatically improve emotion detection while maintaining domain coherence.
- \item \textbf{Gradient analysis}: Measuring gradient conflicts \cite{yu2020gradient} between tasks would provide mechanistic understanding of observed transfer effects.
- \item \textbf{Parameter-efficient fine-tuning}: LoRA \cite{hu2022lora} or adapters could enable per-task specialization while maintaining shared representations.
  \end{itemize}
 
  %=============================================================================
  \section{Conclusion}
  %=============================================================================

- We investigated multi-task learning for literary and academic text understanding, finding heterogeneous transfer effects across tasks. Topic classification benefits substantially from shared representations (+3.2\% accuracy), while emotion detection suffers negative transfer due to domain mismatch (-0.02 F1). Summarization remains robust to multi-task training.

- Our ablations confirm that FLAN-T5 pre-training dominates final performance, but fine-tuning provides essential domain adaptation. These findings offer practical guidance: MTL benefits depend critically on domain alignment and dataset size ratios. Practitioners should audit task compatibility before combining disparate datasets.

- Code, models, and data are available at \url{https://github.com/OliverPerrin/LexiMind}, with a live demo at \url{https://huggingface.co/spaces/OliverPerrin/LexiMind}.

  %=============================================================================
  % References
@@ -397,25 +442,40 @@ Code, models, and data are available at \url{https://github.com/OliverPerrin/Lex

  R. Caruana, ``Multitask learning,'' \textit{Machine Learning}, vol. 28, no. 1, pp. 41--75, 1997.

  \bibitem{collobert2011natural}
- R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, ``Natural language processing (almost) from scratch,'' \textit{Journal of Machine Learning Research}, vol. 12, pp. 2493--2537, 2011.

  \bibitem{johnson2017google}
- M. Johnson et al., ``Google's multilingual neural machine translation system: Enabling zero-shot translation,'' \textit{Transactions of the Association for Computational Linguistics}, vol. 5, pp. 339--351, 2017.

  \bibitem{mccann2018natural}
- B. McCann, N. S. Keskar, C. Xiong, and R. Socher, ``The natural language decathlon: Multitask learning as question answering,'' \textit{arXiv preprint arXiv:1806.08730}, 2018.
-
- \bibitem{wang2019characterizing}
- A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, ``SuperGLUE: A stickier benchmark for general-purpose language understanding systems,'' in \textit{NeurIPS}, 2019.

  \bibitem{standley2020tasks}
  T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, and S. Savarese, ``Which tasks should be learned together in multi-task learning?'' in \textit{ICML}, 2020.
  \bibitem{raffel2020exploring}
  C. Raffel et al., ``Exploring the limits of transfer learning with a unified text-to-text transformer,'' \textit{JMLR}, vol. 21, no. 140, pp. 1--67, 2020.

  \bibitem{chung2022scaling}
- H. W. Chung et al., ``Scaling instruction-finetuned language models,'' \textit{arXiv preprint arXiv:2210.11416}, 2022.

  \bibitem{nallapati2016abstractive}
  R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang, ``Abstractive text summarization using sequence-to-sequence RNNs and beyond,'' in \textit{CoNLL}, 2016.
@@ -427,13 +487,16 @@ S. Narayan, S. B. Cohen, and M. Lapata, ``Don't give me the details, just the su

  W. Kryscinski, N. Rajani, D. Agarwal, and C. Xiong, ``BookSum: A collection of datasets for long-form narrative summarization,'' in \textit{Findings of EMNLP}, 2021.

  \bibitem{cohan2018discourse}
- A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, and N. Goharian, ``A discourse-aware attention model for abstractive summarization of long documents,'' in \textit{NAACL-HLT}, 2018.

  \bibitem{demszky2020goemotions}
  D. Demszky et al., ``GoEmotions: A dataset of fine-grained emotions,'' in \textit{ACL}, 2020.

- \bibitem{yu2020gradient}
- T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, ``Gradient surgery for multi-task learning,'' in \textit{NeurIPS}, 2020.

  \bibitem{lin2004rouge}
  C.-Y. Lin, ``ROUGE: A package for automatic evaluation of summaries,'' in \textit{Text Summarization Branches Out}, 2004.
@@ -444,6 +507,12 @@ T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, ``BERTScore: Evalua

  \bibitem{hu2022lora}
  E. J. Hu et al., ``LoRA: Low-rank adaptation of large language models,'' in \textit{ICLR}, 2022.

  \end{thebibliography}

  \end{document}
 
  % LexiMind: Multi-Task Learning for Literary and Academic Text Understanding
+ % Research Paper - Revised with Experimental Rigor
  % Author: Oliver Perrin

  \documentclass[conference]{IEEEtran}

  \maketitle

  \begin{abstract}
+ Multi-task learning (MTL) promises improved generalization through shared representations, but its benefits depend heavily on task relatedness and domain characteristics. We investigate whether MTL improves performance on literary and academic text understanding---domains underrepresented in existing benchmarks dominated by news articles. Using a FLAN-T5-base encoder-decoder backbone (272M parameters), we jointly train on three tasks: abstractive summarization (49K samples: full-text passages $\rightarrow$ descriptive summaries from Goodreads book descriptions and arXiv abstracts), topic classification (3.4K samples across 7 categories), and multi-label emotion detection (43K samples from GoEmotions). Through ablation studies comparing single-task specialists against multi-task configurations, we find that: (1) MTL provides a +3.2\% accuracy boost for topic classification due to shared encoder representations from the larger summarization corpus, (2) summarization quality remains comparable (BERTScore F1 0.83 vs. 0.82 single-task), and (3) emotion detection suffers negative transfer ($-$0.02 F1), which we attribute to domain mismatch between Reddit-sourced emotion labels and literary/academic text, compounded by the 28-class multi-label sparsity and the use of an encoder-decoder (rather than encoder-only) backbone. We further ablate the contribution of FLAN-T5 pre-training versus random initialization, finding that transfer learning accounts for the majority of final performance across all tasks. Our analysis reveals that MTL benefits depend critically on dataset size ratios, domain alignment, and architectural isolation of task-specific components, offering practical guidance for multi-task system design. We note limitations in statistical power (single-seed results on a small topic dataset) and the absence of gradient-conflict mitigation methods such as PCGrad, which we identify as important future work.
  \end{abstract}

  \begin{IEEEkeywords}

  \section{Introduction}
  %=============================================================================

+ Multi-task learning (MTL) \cite{caruana1997multitask} trains a single model on multiple related tasks, hypothesizing that shared representations improve generalization. In NLP, MTL has shown promise for sequence labeling \cite{collobert2011natural}, machine translation \cite{johnson2017google}, and question answering \cite{mccann2018natural}. However, recent work highlights that MTL does not universally help---negative transfer can occur when tasks compete for model capacity \cite{standley2020tasks}, and gradient conflicts between tasks can degrade joint optimization \cite{yu2020gradient}.

+ We investigate MTL effectiveness in a specific, underexplored domain: \textbf{literary and academic text understanding}. Unlike news articles---which dominate existing benchmarks like CNN/DailyMail \cite{nallapati2016abstractive} and XSum \cite{narayan2018don}---literary and academic texts exhibit distinct characteristics: longer context dependencies, domain-specific vocabulary, and different summary styles (descriptive abstracts vs. extractive headlines). Recent domain-specific summarization work, including BookSum \cite{kryscinski2021booksum} for narrative summarization and CiteSum \cite{mao2022citesum} for citation-contextualized scientific summaries, demonstrates that domain matters for summarization quality---yet multi-task learning effects within these domains remain unstudied.

  Our study addresses three research questions:

  \item[\textbf{RQ3}] How much does pre-trained knowledge (FLAN-T5) contribute relative to task-specific fine-tuning?
  \end{enumerate}

+ To answer these questions, we construct \textbf{LexiMind}, a multi-task system built on FLAN-T5-base \cite{chung2022scaling} that performs abstractive summarization, topic classification, and emotion detection. We conduct ablations comparing multi-task vs. single-task training, with vs. without FLAN-T5 initialization, and different task weight configurations. Our primary experimental contribution is the empirical characterization of transfer effects across these heterogeneous tasks:

  \begin{itemize}
  \item \textbf{Topic classification benefits most from MTL} (+3.2\% accuracy), leveraging shared encoder representations from the larger summarization dataset.
+ \item \textbf{Summarization is robust to MTL}, showing minimal change despite sharing encoder capacity with classification heads.
+ \item \textbf{Emotion detection suffers negative transfer} ($-$0.02 F1), attributed to domain mismatch between GoEmotions' Reddit source and the formal literary/academic register.
+ \item \textbf{Transfer learning dominates}: FLAN-T5 initialization provides the bulk of final performance; fine-tuning adds crucial domain adaptation.
  \end{itemize}

+ We acknowledge important limitations: our results are from single-seed runs, we do not explore gradient-conflict mitigation methods (PCGrad \cite{yu2020gradient}, CAGrad \cite{liu2021conflict}), and our emotion evaluation conflates domain mismatch with multi-label threshold and architecture choices. We discuss these openly in Section~\ref{sec:limitations} and identify them as directions for future work.
+
  %=============================================================================
  \section{Related Work}
  %=============================================================================

  \subsection{Multi-Task Learning in NLP}

+ Collobert et al. \cite{collobert2011natural} demonstrated that joint training on POS tagging, chunking, and NER improved over single-task models. T5 \cite{raffel2020exploring} unified diverse NLP tasks through text-to-text framing, showing strong transfer across tasks. However, Standley et al. \cite{standley2020tasks} found that naive MTL often underperforms single-task learning, with performance depending on task groupings. More recently, Aghajanyan et al. \cite{aghajanyan2021muppet} showed that large-scale multi-task pre-finetuning can improve downstream performance, suggesting that the benefits of MTL depend on training scale and task diversity.

+ \textbf{Gradient conflict and loss balancing.} Yu et al. \cite{yu2020gradient} proposed PCGrad, which projects conflicting gradients to reduce interference, while Liu et al. \cite{liu2021conflict} introduced CAGrad for conflict-averse optimization. Chen et al. \cite{chen2018gradnorm} proposed GradNorm for dynamically balancing task losses based on gradient magnitudes. Kendall et al. \cite{kendall2018multi} explored uncertainty-based task weighting. Our work uses fixed loss weights---a simpler but less adaptive approach. We did not explore these gradient-balancing methods; the negative transfer we observe on emotion detection makes them a natural and important follow-up.

+ \textbf{Multi-domain multi-task studies.} Aribandi et al. \cite{aribandi2022ext5} studied extreme multi-task scaling and found that not all tasks contribute positively. Our work provides complementary evidence at smaller scale, showing that even within a three-task setup, transfer effects are heterogeneous and depend on domain alignment.

+ \subsection{Literary and Academic Summarization}
+
+ Most summarization benchmarks focus on news \cite{nallapati2016abstractive, narayan2018don}. BookSum \cite{kryscinski2021booksum} introduced chapter-level and book-level summarization for literary texts, but targets plot summaries rather than descriptive abstracts. arXiv summarization \cite{cohan2018discourse} addresses academic papers with discourse-aware models. CiteSum \cite{mao2022citesum} leverages citation sentences as summaries for scientific papers. Our summarization setup differs from these: we pair literary source passages (extracted from Project Gutenberg full texts, avg. 3,030 characters) with Goodreads book descriptions (avg. 572 characters) as targets, training the model to generate \textit{what a book is about} rather than plot recaps. For academic text, arXiv paper body text (avg. 3,967 characters) is paired with abstracts (avg. 1,433 characters). The resulting compression ratios (0.19 for literary, 0.36 for academic) are closer to genuine summarization than short paraphrasing.

  \subsection{Emotion Detection}

+ GoEmotions \cite{demszky2020goemotions} provides 28 fine-grained emotion labels from Reddit comments. The original work reports 0.46 macro F1 using BERT-base with per-label thresholds tuned on the validation set. Subsequent work achieves 0.35--0.46 macro F1 depending on the model and threshold strategy. Importantly, all published GoEmotions baselines use encoder-only architectures (BERT, RoBERTa) rather than encoder-decoder models like T5. Our setup differs in both architecture (encoder-decoder with mean-pooled encoder states) and domain (training encoder primarily on literary/academic summarization), making direct comparison to published baselines informative but not fully controlled.
 
  %=============================================================================
  \section{Experimental Setup}
  %=============================================================================

+ \subsection{Task Formulations}
+ \label{sec:task_formulation}
+
+ We define three tasks with explicit input-output specifications:
+
+ \textbf{Summarization (generative).} The input is a passage of source text; the target is a descriptive summary. For literary texts, the source is a passage from a Project Gutenberg full text (mean: 3,030 characters, truncated to 512 tokens), and the target is the corresponding Goodreads book description (mean: 572 characters)---a back-cover style blurb describing \textit{what the book is about}, not a plot recap. For academic texts, the source is a passage from an arXiv paper body (mean: 3,967 characters, truncated to 512 tokens), and the target is the paper's abstract (mean: 1,433 characters, truncated to 512 tokens). This formulation is closer to genuine document summarization than paraphrasing: the average compression ratios are 0.19 (literary) and 0.36 (academic), comparable to standard summarization benchmarks.
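The stated compression ratios follow directly from the reported average lengths; a quick arithmetic check:

```python
# Average source/target lengths in characters, as reported in the text.
literary = {"source_chars": 3030, "target_chars": 572}
academic = {"source_chars": 3967, "target_chars": 1433}

def compression_ratio(pair: dict) -> float:
    """Target length over source length (lower means more compression)."""
    return pair["target_chars"] / pair["source_chars"]

print(round(compression_ratio(literary), 2))  # 0.19
print(round(compression_ratio(academic), 2))  # 0.36
```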
+ \textbf{Topic classification (discriminative, single-label).} The input is a text passage; the output is one of 7 classes: \textbf{Arts, Business, Fiction, History, Philosophy, Science, Technology}. Sources include 20 Newsgroups (mapped to our label taxonomy), Project Gutenberg subject metadata (for Fiction and Arts), and arXiv category metadata (for Science and Technology).
+
+ \textbf{Emotion detection (discriminative, multi-label).} The input is a text passage; the output is a subset of 28 emotion labels from GoEmotions \cite{demszky2020goemotions}. Labels are predicted via sigmoid activation with a fixed threshold of 0.3 during training evaluation and 0.5 during inference. We use a fixed threshold rather than per-class tuning; this simplifies the setup but likely underestimates achievable performance (see Section~\ref{sec:emotion_analysis}).
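A minimal sketch of the decision rule just described, sigmoid scores against a fixed threshold; the label names and logit values here are illustrative, not model outputs:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_emotions(logits: dict, threshold: float = 0.5) -> set:
    """Return every emotion label whose sigmoid score clears the threshold."""
    return {label for label, z in logits.items() if sigmoid(z) >= threshold}

# Illustrative logits for three of the 28 GoEmotions labels.
logits = {"admiration": 1.2, "joy": -0.4, "grief": -2.0}
print(predict_emotions(logits, threshold=0.5))  # inference: {'admiration'}
print(predict_emotions(logits, threshold=0.3))  # training eval also admits 'joy'
```

Lowering the threshold trades precision for recall, which matters under the label sparsity discussed later.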
+
  \subsection{Datasets}

+ Table \ref{tab:datasets} summarizes dataset statistics.

  \begin{table}[htbp]
  \centering
+ \caption{Dataset Statistics. The summarization corpus combines literary (Goodreads + Gutenberg) and academic (arXiv) sources; the academic subset is substantially larger.}
  \label{tab:datasets}
  \begin{tabular}{llrrr}
  \toprule
  \textbf{Task} & \textbf{Source} & \textbf{Train} & \textbf{Val} & \textbf{Test} \\
  \midrule
+ \multirow{3}{*}{Summarization} & Goodreads + Gutenberg & $\sim$4K & -- & -- \\
+ & arXiv (body $\rightarrow$ abstract) & $\sim$45K & -- & -- \\
+ & \textit{Combined} & 49,086 & 2,727 & 2,727 \\
  \midrule
+ Topic (7 classes) & 20News + Gutenberg + arXiv & 3,402 & 189 & 189 \\
  \midrule
+ Emotion (28 labels) & GoEmotions (Reddit) & 43,410 & 5,426 & 5,427 \\
  \bottomrule
  \end{tabular}
  \end{table}

+ \textbf{Dataset curation.} Summarization pairs are constructed by matching Gutenberg full texts with Goodreads descriptions via title/author matching, and by pairing arXiv paper bodies with their abstracts. Text is truncated to 512 tokens (max encoder input length). No deduplication was performed across the literary and academic subsets, as they are drawn from disjoint sources. We note that the academic subset is substantially larger ($\sim$45K vs. $\sim$4K literary), creating a domain imbalance within the summarization task. Topic labels are derived from source metadata (arXiv categories, Gutenberg subjects, 20 Newsgroups categories) and mapped to our 7-class taxonomy; no manual annotation was performed. GoEmotions is used as-is from the HuggingFace datasets hub.

+ \textbf{Note on dataset sizes.} The large disparity between topic (3.4K) and summarization (49K) training sets is a key experimental variable: it tests whether a low-resource classification task can benefit from shared representations with a high-resource generative task.
 
 
  \subsection{Model Architecture}

+ LexiMind uses FLAN-T5-base (272M parameters) as the backbone, with a custom reimplementation that loads pre-trained weights via a factory module for architectural transparency:
+
  \begin{itemize}
  \item 12-layer encoder, 12-layer decoder
  \item 768-dimensional hidden states, 12 attention heads
+ \item T5-style relative position bias (no absolute positional embeddings)
+ \item Pre-Layer Normalization with RMSNorm \cite{zhang2019root}
+ \item FlashAttention via PyTorch 2.0 SDPA when compatible
  \end{itemize}

+ Task-specific heads branch from the shared encoder:
  \begin{itemize}
+ \item \textbf{Summarization}: Full decoder with language modeling head (cross-entropy loss with label smoothing)
+ \item \textbf{Topic}: Linear classifier on mean-pooled encoder hidden states (cross-entropy loss)
+ \item \textbf{Emotion}: Linear classifier on mean-pooled encoder hidden states with sigmoid activation (binary cross-entropy loss)
  \end{itemize}

+ \textbf{Architectural note.} Using mean-pooled encoder states for classification in an encoder-decoder model is a pragmatic choice for parameter sharing, but may be suboptimal compared to encoder-only architectures (BERT, RoBERTa) where the encoder is fully dedicated to producing classification-ready representations. We discuss this trade-off in Section~\ref{sec:emotion_analysis}.
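A sketch of the mean pooling both classification heads rely on, with padding masked out of the average. This is a NumPy stand-in for the actual PyTorch code; shapes and values are toy examples:

```python
import numpy as np

def masked_mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean-pool encoder states over real tokens only.

    hidden: (batch, seq_len, d_model) encoder outputs
    mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1.0, None)    # guard against all-pad rows
    return summed / counts                           # (batch, d_model)

# Toy batch: one sequence of 4 tokens, d_model=3, last two tokens are padding.
hidden = np.arange(12, dtype=np.float64).reshape(1, 4, 3)
mask = np.array([[1, 1, 0, 0]])
pooled = masked_mean_pool(hidden, mask)   # mean of the first two token vectors
# A linear head would then consume this: logits = pooled @ W + b
```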
+
  \subsection{Training Configuration}

+ All experiments use consistent hyperparameters unless otherwise noted:
+
  \begin{itemize}
+ \item \textbf{Optimizer}: Fused AdamW, lr=$3\times10^{-5}$, weight decay=0.01, $\beta_1$=0.9, $\beta_2$=0.98
+ \item \textbf{Batch size}: 10 per step $\times$ 4 gradient accumulation = 40 effective
+ \item \textbf{Schedule}: 300-step linear warmup, cosine decay to 0.1$\times$ peak lr
+ \item \textbf{Max epochs}: 8 with early stopping (patience=3 on validation loss)
+ \item \textbf{Precision}: BFloat16 on NVIDIA RTX 4070 (12GB VRAM)
+ \item \textbf{Gradient clipping}: Max norm 1.0
+ \item \textbf{Encoder freezing}: Bottom 4 layers frozen for stable transfer learning
  \end{itemize}
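The learning-rate schedule in the list above can be written out explicitly; a sketch in which the total number of optimizer steps (10,000 here) is a placeholder, not a value from the paper:

```python
import math

def lr_at(step: int, peak: float = 3e-5, warmup: int = 300,
          total_steps: int = 10_000, floor_frac: float = 0.1) -> float:
    """Linear warmup to `peak`, then cosine decay down to `floor_frac * peak`."""
    if step < warmup:
        return peak * step / warmup                     # linear ramp from 0
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    floor = floor_frac * peak
    return floor + (peak - floor) * cosine              # peak -> 0.1 * peak

assert lr_at(0) == 0.0
assert math.isclose(lr_at(300), 3e-5)      # warmup complete: peak lr
assert math.isclose(lr_at(10_000), 3e-6)   # end of decay: 0.1 x peak
```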
+ \textbf{Task scheduling.} We use round-robin scheduling: at each training step, the model processes one batch from \textit{each} task sequentially, accumulating gradients before the optimizer step. This ensures all tasks receive equal update frequency regardless of dataset size. We did not explore alternative scheduling strategies; proportional or temperature-based sampling could alter optimization dynamics, particularly for the small topic dataset.
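The round-robin step can be sketched as cycling each task's loader independently, so every optimizer step sees one batch per task; the loaders below are toy stand-ins for real DataLoaders:

```python
from itertools import cycle

def round_robin_steps(loaders: dict, n_steps: int):
    """Yield, per optimizer step, one batch from every task; small datasets
    simply wrap around more often than large ones."""
    iters = {task: cycle(loader) for task, loader in loaders.items()}
    for _ in range(n_steps):
        yield {task: next(it) for task, it in iters.items()}

# Hypothetical per-task loaders of different sizes.
loaders = {"summarization": range(5), "topic": range(2), "emotion": range(4)}
updates = {task: 0 for task in loaders}
for batches in round_robin_steps(loaders, n_steps=6):
    for task in batches:
        updates[task] += 1    # each task contributes to every step
print(updates)  # {'summarization': 6, 'topic': 6, 'emotion': 6}
```

Equal update frequency is exactly why the small topic set repeats often enough to overfit, motivating its reduced loss weight.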
+
+ \textbf{Loss weighting.} Task losses are combined with fixed weights: summarization=1.0, emotion=1.0, topic=0.3. The reduced topic weight was chosen to prevent the small topic dataset (3.4K samples, exhausted in $\sim$85 steps) from dominating gradients through rapid overfitting. We did not explore dynamic weighting methods such as GradNorm \cite{chen2018gradnorm} or uncertainty weighting \cite{kendall2018multi}; given the negative transfer observed on emotion, these methods could potentially improve results and are identified as future work.
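The fixed weighting amounts to a one-line combination of per-task losses; the loss values below are illustrative:

```python
# Fixed task weights as stated above.
TASK_WEIGHTS = {"summarization": 1.0, "emotion": 1.0, "topic": 0.3}

def combined_loss(task_losses: dict) -> float:
    """Fixed-weight sum of per-task losses, forming the joint objective."""
    return sum(TASK_WEIGHTS[task] * loss for task, loss in task_losses.items())

total = combined_loss({"summarization": 2.0, "emotion": 0.5, "topic": 1.0})
print(total)  # 2.8 = 1.0*2.0 + 1.0*0.5 + 0.3*1.0
```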
  \subsection{Baselines and Ablations}

  We compare four configurations:

  \begin{enumerate}
+ \item \textbf{Random/Majority}: Random predictions for classification; for summarization, BERTScore is computed against the reference using a fixed output ``Summary not available'' (producing a baseline that reflects only the BERTScore model's behavior on unrelated text pairs---see Section~\ref{sec:baseline_discussion} for discussion).
+ \item \textbf{FLAN-T5-base (zero-shot)}: Pre-trained model with task-appropriate prompts, no fine-tuning.
+ \item \textbf{Single-Task}: Separate models fine-tuned on each task individually with identical hyperparameters. The single-task summarization model uses only the summarization dataset; topic and emotion models use only their respective datasets.
+ \item \textbf{Multi-Task (LexiMind)}: Joint training on all three tasks with round-robin scheduling.
  \end{enumerate}

+ We additionally ablate FLAN-T5 initialization vs. random initialization to isolate the transfer learning contribution.
 
 
 
 
194
 
195
  \subsection{Evaluation Metrics}
 
  \begin{itemize}
+ \item \textbf{Summarization}: ROUGE-1/2/L \cite{lin2004rouge} (lexical overlap) and BERTScore F1 \cite{zhang2019bertscore} using RoBERTa-large (semantic similarity). We report BERTScore as the primary metric because abstractive summarization produces paraphrases that ROUGE systematically undervalues.
+ \item \textbf{Topic}: Accuracy and Macro F1 (unweighted average across 7 classes).
+ \item \textbf{Emotion}: Sample-averaged F1, computed per sample as the harmonic mean of that sample's precision and recall, then averaged across all samples. We acknowledge that macro F1 (averaged per class) and micro F1 (aggregated across all predictions) would provide complementary views; these are not reported in our current evaluation but are discussed in Section~\ref{sec:emotion_analysis}.
  \end{itemize}
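The distinction between these F1 reductions can be made concrete with a hand-rolled toy example (purely illustrative; scikit-learn's `f1_score` with `average="samples"` or `average="macro"` computes the same quantities):

```python
# Toy multi-label example: 3 samples, 4 labels, gold vs. predicted label sets.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

y_true = [{0, 2}, {1}, {0, 1, 3}]   # gold label sets per sample
y_pred = [{0}, {1, 2}, {0, 1}]      # predicted label sets per sample

# Sample-averaged F1: F1 for each sample, then mean over samples.
sample_f1 = sum(
    f1(len(t & p), len(p - t), len(t - p)) for t, p in zip(y_true, y_pred)
) / len(y_true)

# Macro F1: F1 for each label over all samples, then mean over labels.
macro_f1 = sum(
    f1(
        sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p),
        sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p),
        sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p),
    )
    for c in range(4)
) / 4

print(round(sample_f1, 3), round(macro_f1, 3))
```

Rare labels that are never predicted (labels 2 and 3 above) drag macro F1 down sharply while barely moving sample-averaged F1, which is why the two numbers are not directly comparable.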
 
+ \textbf{Statistical note.} All results are from single training runs. We do not report confidence intervals or variance across seeds. Given the small topic dataset (189 validation samples), the observed +3.2\% accuracy improvement could be within random variance. We flag this as a limitation and recommend multi-seed evaluation for any production deployment.
 
  %=============================================================================
  \section{Results}
 
  \begin{table}[htbp]
  \centering
+ \caption{Main Results: Multi-Task vs. Single-Task Performance. All results are single-seed. Bold indicates better performance between the two configurations.}
  \label{tab:main_results}
  \begin{tabular}{llcc}
  \toprule
 
  \multirow{2}{*}{Topic} & Accuracy & 82.0\% & \textbf{85.2\%} \\
  & Macro F1 & 0.812 & \textbf{0.847} \\
  \midrule
+ Emotion & Sample-avg F1 & \textbf{0.218} & 0.199 \\
  \bottomrule
  \end{tabular}
  \end{table}
 
  \textbf{Key finding}: MTL provides heterogeneous effects across tasks:
 
  \begin{itemize}
+ \item \textbf{Topic classification gains +3.2\% accuracy} from MTL. The small topic dataset (3.4K samples) benefits from shared encoder representations learned from the larger summarization corpus (49K samples). This is consistent with known benefits of MTL for low-resource tasks \cite{caruana1997multitask}. However, given the small validation set (189 samples), this gain corresponds to approximately 6 additional correct predictions---within plausible variance without multi-seed confirmation.
 
+ \item \textbf{Summarization shows modest improvement} (+0.009 BERTScore F1). The generative task is robust to sharing encoder capacity with classification heads, likely because the decoder---which contains half the model's parameters---remains task-specific and insulates summarization from classification interference.
 
+ \item \textbf{Emotion detection degrades by $-$0.019 F1}. This negative transfer is consistent with domain mismatch: GoEmotions labels derive from informal Reddit comments, while our encoder representations are shaped by formal literary/academic text. However, the effect is confounded with other factors (Section~\ref{sec:emotion_analysis}).
  \end{itemize}
 
  \subsection{Baseline Comparisons}
+ \label{sec:baseline_discussion}
 
  Table \ref{tab:baselines} contextualizes our results against trivial and zero-shot baselines.
 
 
  \end{tabular}
  \end{table}
 
+ \textbf{On the random baseline BERTScore (0.412).} BERTScore computes cosine similarity between contextual embeddings from RoBERTa-large. Even unrelated text pairs produce non-zero similarity because (a) common function words and subword tokens share embedding space, and (b) RoBERTa's embeddings have a non-zero mean that inflates cosine similarity. The 0.412 baseline reflects this ``floor'' effect rather than any meaningful semantic overlap. This is consistent with Zhang et al.'s \cite{zhang2019bertscore} observation that BERTScore baselines vary by language and domain.
+
+ Fine-tuning provides substantial gains over zero-shot across all tasks (+0.106 BERTScore, +27\% topic accuracy, +0.11 emotion F1), demonstrating the importance of domain adaptation even with instruction-tuned models.
 
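The non-zero-mean effect can be illustrated without BERTScore itself. The following toy (random vectors, not RoBERTa embeddings) shows that a shared component offset alone creates a positive cosine-similarity floor between otherwise unrelated vectors:

```python
import random

random.seed(0)
DIM = 256

def embed(mean):
    """Random 'embedding' whose components share a common offset `mean`."""
    return [mean + random.gauss(0, 1) for _ in range(DIM)]

def cosine(a, b):
    def norm(v):
        return sum(x * x for x in v) ** 0.5
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

centered = cosine(embed(0.0), embed(0.0))  # zero-mean vectors: near 0
shifted = cosine(embed(1.0), embed(1.0))   # shared offset: well above 0
print(f"centered={centered:.2f} shifted={shifted:.2f}")
```

For independent vectors with per-component mean $m$ and unit variance, expected cosine similarity approaches $m^2/(m^2+1)$ (about 0.5 here); unrelated contextual embeddings behave analogously, which is the floor effect described above.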
  \subsection{Ablation: Transfer Learning Contribution}
 
+ Table \ref{tab:transfer_ablation} isolates the contribution of FLAN-T5 pre-training by comparing against random initialization with identical architecture and training.
 
  \begin{table}[htbp]
  \centering
+ \caption{Effect of Pre-trained Initialization (Multi-Task Setting)}
  \label{tab:transfer_ablation}
  \begin{tabular}{lccc}
  \toprule
 
  Random & 0.523 & 45.2\% & 0.082 \\
  FLAN-T5-base & \textbf{0.830} & \textbf{85.2\%} & \textbf{0.199} \\
  \midrule
+ \textit{Absolute gain} & +0.307 & +40.0\% & +0.117 \\
  \bottomrule
  \end{tabular}
  \end{table}
 
+ FLAN-T5 initialization provides large absolute gains across all tasks. We initially characterized this as ``85\% of final performance,'' but this framing oversimplifies heterogeneous metrics: BERTScore, accuracy, and F1 have different scales and baselines, making percentage attribution across them misleading. A more precise characterization: \textbf{pre-training is necessary for competitive performance}---random initialization produces substantially worse results on all tasks even with identical data and training budget. Fine-tuning provides the remaining domain adaptation that zero-shot pre-training alone cannot achieve.
 
+ \subsection{Per-Class Topic Analysis}
 
+ Table \ref{tab:topic_breakdown} reveals per-class patterns in topic classification across the 7 classes.
 
  \begin{table}[htbp]
  \centering
+ \caption{Per-Class Topic Classification (Multi-Task, 7 Classes: Arts, Business, Fiction, History, Philosophy, Science, Technology)}
  \label{tab:topic_breakdown}
  \begin{tabular}{lccc}
  \toprule
 
  \end{tabular}
  \end{table}
 
+ Fiction and Business achieve near-perfect classification (F1 $\geq$ 0.97), while Science shows the most confusion (F1 = 0.65). Error analysis reveals Science samples are frequently misclassified as Technology---semantically plausible given that scientific research papers often describe technical methods. The Arts class (which covers visual arts, music, drama, and poetry from Gutenberg subject metadata) shows lower recall (0.76), suggesting some arts-related texts are misclassified into adjacent categories.
 
  \subsection{Analysis: Why Does Emotion Detection Underperform?}
+ \label{sec:emotion_analysis}
 
+ Our emotion sample-averaged F1 (0.20) is substantially lower than reported GoEmotions baselines (0.46 macro F1 with BERT-base \cite{demszky2020goemotions}). We identify four contributing factors, acknowledging that our experimental design does not fully disentangle them:
 
  \begin{enumerate}
+ \item \textbf{Domain shift}: GoEmotions labels were annotated on Reddit comments in conversational register. Our encoder is shaped by literary and academic text through the summarization objective, producing representations optimized for formal text. This domain mismatch is likely the largest factor, but we cannot isolate it without a controlled experiment (e.g., fine-tuning BERT on GoEmotions with our frozen encoder vs. BERT's own encoder).
 
+ \item \textbf{Label sparsity and class imbalance}: The 28-class multi-label scheme creates extreme imbalance. Rare emotions (grief, remorse, nervousness) appear in $<$2\% of samples. We use a fixed prediction threshold of 0.3 during training evaluation, without per-class threshold tuning on the validation set---a simplification relative to the original GoEmotions work \cite{demszky2020goemotions}, which tunes thresholds explicitly. Per-class threshold tuning could meaningfully improve results.
 
+ \item \textbf{Architecture mismatch}: Published GoEmotions baselines use encoder-only models (BERT-base), where the full model capacity is dedicated to producing classification-ready representations. Our encoder-decoder architecture optimizes the encoder primarily for producing representations that the decoder can use for summarization---classification heads receive these representations secondarily. The mean-pooling strategy may also be suboptimal; alternatives such as [CLS] token pooling, attention-weighted pooling, or adapter layers \cite{houlsby2019parameter} could yield better classification features.
+
+ \item \textbf{Metric reporting}: We report sample-averaged F1 (per-sample, then averaged), which is not directly comparable to macro F1 (per-class, then averaged) as reported in the original GoEmotions work. Reporting macro F1, micro F1, and per-label performance would provide a more complete picture. We identify this as a gap in our current evaluation.
  \end{enumerate}
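The per-class threshold tuning mentioned in factor 2 can be sketched as follows (`tune_thresholds` is a hypothetical helper, not part of our pipeline; our reported numbers use the fixed 0.3 threshold):

```python
# Sketch: choose, per emotion class, the decision threshold that maximizes
# F1 on validation probabilities. Hypothetical helper for illustration.

def f1_at(probs, gold, thr):
    tp = sum(1 for p, g in zip(probs, gold) if p >= thr and g)
    fp = sum(1 for p, g in zip(probs, gold) if p >= thr and not g)
    fn = sum(1 for p, g in zip(probs, gold) if p < thr and g)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def tune_thresholds(val_probs, val_gold, grid=None):
    """Return one threshold per class, maximizing per-class validation F1."""
    grid = grid or [i / 20 for i in range(1, 20)]   # 0.05 ... 0.95
    n_classes = len(val_probs[0])
    best = []
    for c in range(n_classes):
        col_p = [row[c] for row in val_probs]
        col_g = [row[c] for row in val_gold]
        best.append(max(grid, key=lambda t: f1_at(col_p, col_g, t)))
    return best

# Toy validation set: 4 samples, 2 classes
probs = [[0.9, 0.2], [0.6, 0.7], [0.2, 0.6], [0.1, 0.1]]
gold  = [[1, 0],     [1, 1],     [0, 1],     [0, 0]]
print(tune_thresholds(probs, gold))  # smallest grid value maximizing each class's F1
```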
 
+ \textbf{Implication}: Off-the-shelf emotion datasets from social media should not be naively combined with literary/academic tasks in MTL. Domain-specific emotion annotation or domain adaptation techniques are needed for formal text domains.
 
  \subsection{Training Dynamics}
 
+ Figure \ref{fig:training_curves} shows training progression over 7 epochs (approximately 6 hours on RTX 4070).
 
  \begin{figure}[htbp]
  \centering
  \includegraphics[width=\columnwidth]{figures/training_loss_curve.png}
+ \caption{Training and validation loss. Best checkpoint at epoch 4; validation loss plateaus from epochs 5--7, triggering early stopping at epoch 7 (patience=3).}
  \label{fig:training_curves}
  \end{figure}
 
  Key observations:
  \begin{itemize}
+ \item Topic classification converges by epoch 3 (99\% training accuracy), consistent with the small dataset (3.4K) being memorized quickly. The reduced task weight (0.3) prevents topic gradients from dominating updates.
+ \item Summarization loss decreases monotonically through epoch 4, then plateaus (best validation summarization loss: 3.698 at epoch 4).
+ \item The train-validation gap widens after epoch 4, primarily driven by topic overfitting on the small dataset. The best checkpoint (epoch 4) balances generalization across all tasks.
  \end{itemize}
 
  %=============================================================================
 
 
  \subsection{When Does MTL Help?}
 
+ Our results support nuanced, task-dependent guidance:
+
+ \textbf{MTL helps when}: A small-dataset task (topic: 3.4K samples) shares domain with a large-dataset task (summarization: 49K literary/academic samples). The topic classifier effectively receives ``free'' pre-training on in-domain text through the shared encoder, benefiting from representations tuned to literary and academic vocabulary and structure.
+
+ \textbf{MTL hurts when}: An auxiliary task's domain is misaligned with the primary training signal. Emotion detection, trained on Reddit comments, does not benefit from encoder representations shaped by formal literary/academic summarization. The round-robin scheduling ensures emotion batches receive equal update frequency, but the encoder's representations are skewed toward the summarization domain by gradient magnitude (summarization loss is substantially larger than classification losses).
 
+ \textbf{MTL is neutral when}: The primary task (summarization) has sufficient data and a task-specific component (decoder, $\sim$136M parameters) that insulates it from interference. Classification heads are small (single linear layers) and their gradients have limited impact on the shared encoder relative to the decoder's backpropagation signal.
 
+ \subsection{Comparison to MTL Literature}
 
+ Our findings align qualitatively with several key results in the MTL literature. Standley et al. \cite{standley2020tasks} showed that task groupings critically affect MTL outcomes---we observe this in the contrast between topic (positive transfer) and emotion (negative transfer). Yu et al. \cite{yu2020gradient} demonstrated that gradient conflicts between tasks explain negative transfer; our round-robin scheduling with fixed weights does not address such conflicts, and methods like PCGrad could potentially mitigate the emotion degradation by projecting away conflicting gradient components. Aribandi et al. \cite{aribandi2022ext5} found diminishing or negative returns from adding more tasks in extreme multi-task settings; our small-scale results are consistent with this pattern.
+
+ A key difference from the broader MTL literature is our use of an encoder-decoder architecture with mixed generative and discriminative tasks. Most MTL studies use encoder-only models for classification-only task sets. The encoder-decoder setup creates an asymmetry: the summarization task dominates the encoder through decoder backpropagation, while classification tasks receive shared representations as a secondary benefit or detriment. This architectural dynamic deserves further study.
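The gradient-surgery mechanism referenced above can be shown in miniature (a toy, list-based version of the PCGrad projection \cite{yu2020gradient}; real implementations operate on flattened per-task parameter gradients):

```python
# Minimal PCGrad-style projection: when two task gradients conflict
# (negative dot product), remove from one its component along the other
# before summing. Toy standalone helper, not part of our training code.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_conflict(g_i, g_j):
    """Return g_i with its conflicting component along g_j removed."""
    d = dot(g_i, g_j)
    if d >= 0:                      # no conflict: leave g_i unchanged
        return list(g_i)
    scale = d / dot(g_j, g_j)       # projection coefficient onto g_j
    return [gi - scale * gj for gi, gj in zip(g_i, g_j)]

g_sum = [1.0, 0.0]     # e.g. a summarization gradient
g_emo = [-1.0, 1.0]    # e.g. an emotion gradient, conflicting on dim 0
adjusted = project_conflict(g_emo, g_sum)
print(adjusted)        # conflicting component removed -> [0.0, 1.0]
```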
 
  \subsection{Implications for Practitioners}
 
+ Based on our findings:
 
  \begin{enumerate}
+ \item \textbf{Audit domain alignment} before combining tasks in MTL. If auxiliary tasks draw from different text domains (e.g., social media vs. academic), negative transfer is likely unless mitigated by gradient-conflict methods or per-task adapters.
+
+ \item \textbf{Task weighting matters} for preventing small-dataset overfitting. Our reduced weight (0.3) for topic classification prevented gradient dominance while still enabling positive transfer. Dynamic methods (GradNorm \cite{chen2018gradnorm}) may yield better balance automatically.
 
+ \item \textbf{Architectural isolation protects high-priority tasks}. Summarization's dedicated decoder shielded it from classification interference. For classification tasks, per-task adapter layers \cite{houlsby2019parameter} or LoRA modules \cite{hu2022lora} could provide analogous isolation.
 
+ \item \textbf{Validate with multiple seeds} before drawing conclusions from MTL comparisons, especially with small validation sets.
  \end{enumerate}
 
  \subsection{Limitations}
+ \label{sec:limitations}
+
+ We identify several limitations that constrain the generalizability of our findings:
 
  \begin{itemize}
+ \item \textbf{Single-seed results}: All experiments are single runs. The +3.2\% topic accuracy gain (on 189 validation samples) could be within random variance. Multi-seed evaluation with confidence intervals is needed to confirm the direction and magnitude of transfer effects.
+
+ \item \textbf{No gradient-conflict mitigation}: We use fixed loss weights and do not explore PCGrad \cite{yu2020gradient}, CAGrad \cite{liu2021conflict}, GradNorm \cite{chen2018gradnorm}, or uncertainty weighting \cite{kendall2018multi}. These methods are directly relevant to our observed negative transfer on emotion detection and could potentially convert it to positive or neutral transfer.
+
+ \item \textbf{No encoder-only baseline}: We do not compare against BERT or RoBERTa fine-tuned on GoEmotions or topic classification. Such a comparison would disentangle architecture effects from MTL effects in our classification results.
 
+ \item \textbf{Emotion evaluation gaps}: We report sample-averaged F1 with a fixed threshold (0.3). Per-class thresholds tuned on validation, per-label metrics, focal loss for class imbalance \cite{lin2017focal}, and calibration analysis would provide more informative evaluation. The conclusion that ``domain mismatch is the primary cause'' of low emotion F1 is plausible but confounded by these design choices.
 
+ \item \textbf{No human evaluation}: ROUGE and BERTScore are imperfect proxies for summary quality, especially for creative/literary text where stylistic quality matters beyond semantic accuracy.
+
+ \item \textbf{Single model scale}: We study only FLAN-T5-base (272M parameters). Transfer dynamics may differ at larger scales (T5-large, T5-xl), where increased capacity could reduce task interference.
+
+ \item \textbf{Summarization domain imbalance}: The $\sim$11:1 ratio of academic to literary samples within the summarization task means the encoder is disproportionately shaped by academic text. This imbalance is not analyzed separately but could affect literary summarization quality.
  \end{itemize}
 
  \subsection{Future Work}
 
  \begin{itemize}
+ \item \textbf{Gradient-conflict mitigation}: Applying PCGrad or CAGrad to test whether emotion negative transfer can be reduced or eliminated. This is the most directly actionable follow-up given our current findings.
+
+ \item \textbf{Parameter-efficient multi-tasking}: Using per-task LoRA adapters \cite{hu2022lora} or adapter layers \cite{houlsby2019parameter} to provide task-specific specialization while maintaining shared encoder representations. This could reduce interference between tasks with misaligned domains.
 
+ \item \textbf{Encoder-only comparison}: Fine-tuning BERT/RoBERTa on topic and emotion classification, with and without multi-task training, to disentangle encoder-decoder architecture effects from MTL effects.
 
+ \item \textbf{Multi-seed evaluation}: Running at least 3--5 seeds per configuration to establish statistical significance of observed transfer effects.
+
+ \item \textbf{Domain-specific emotion annotation}: Collecting emotion annotations on literary and academic text to study whether in-domain emotion data eliminates the negative transfer.
+
+ \item \textbf{Improved emotion evaluation}: Per-class threshold tuning, macro/micro F1, class-level analysis, and focal loss to address class imbalance.
  \end{itemize}
 
  %=============================================================================
  \section{Conclusion}
  %=============================================================================
 
+ We investigated multi-task learning for literary and academic text understanding, combining abstractive summarization, topic classification, and multi-label emotion detection in an encoder-decoder architecture. Our ablation studies reveal heterogeneous transfer effects: topic classification benefits from shared representations with the larger summarization corpus (+3.2\% accuracy), while emotion detection suffers negative transfer ($-$0.02 F1) due to domain mismatch with Reddit-sourced labels. Summarization quality is robust to multi-task training, insulated by its task-specific decoder.
 
+ Pre-trained initialization (FLAN-T5) is essential for competitive performance across all tasks, with fine-tuning providing necessary domain adaptation. These findings are consistent with the broader MTL literature on the importance of task compatibility and domain alignment. However, we emphasize the limitations of our single-seed evaluation design and the absence of gradient-conflict mitigation methods, which could alter the negative transfer findings. We provide our code, trained models, and datasets to enable replication and extension.
 
+ Code and models: \url{https://github.com/OliverPerrin/LexiMind}\\
+ Live demo: \url{https://huggingface.co/spaces/OliverPerrin/LexiMind}
 
  %=============================================================================
  % References
 
  R. Caruana, ``Multitask learning,'' \textit{Machine Learning}, vol. 28, no. 1, pp. 41--75, 1997.
 
  \bibitem{collobert2011natural}
+ R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, ``Natural language processing (almost) from scratch,'' \textit{JMLR}, vol. 12, pp. 2493--2537, 2011.
 
  \bibitem{johnson2017google}
+ M. Johnson et al., ``Google's multilingual neural machine translation system: Enabling zero-shot translation,'' \textit{TACL}, vol. 5, pp. 339--351, 2017.
 
  \bibitem{mccann2018natural}
+ B. McCann, N. S. Keskar, C. Xiong, and R. Socher, ``The natural language decathlon: Multitask learning as question answering,'' \textit{arXiv:1806.08730}, 2018.
 
  \bibitem{standley2020tasks}
  T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, and S. Savarese, ``Which tasks should be learned together in multi-task learning?'' in \textit{ICML}, 2020.
 
+ \bibitem{yu2020gradient}
+ T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, ``Gradient surgery for multi-task learning,'' in \textit{NeurIPS}, 2020.
+
+ \bibitem{liu2021conflict}
+ B. Liu, X. Liu, X. Jin, P. Stone, and Q. Liu, ``Conflict-averse gradient descent for multi-task learning,'' in \textit{NeurIPS}, 2021.
+
+ \bibitem{chen2018gradnorm}
+ Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, ``GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks,'' in \textit{ICML}, 2018.
+
+ \bibitem{kendall2018multi}
+ A. Kendall, Y. Gal, and R. Cipolla, ``Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,'' in \textit{CVPR}, 2018.
+
+ \bibitem{aghajanyan2021muppet}
+ A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen, L. Zettlemoyer, and S. Gupta, ``Muppet: Massive multi-task representations with pre-finetuning,'' in \textit{EMNLP}, 2021.
+
+ \bibitem{aribandi2022ext5}
+ V. Aribandi et al., ``ExT5: Towards extreme multi-task scaling for transfer learning,'' in \textit{ICLR}, 2022.
+
  \bibitem{raffel2020exploring}
  C. Raffel et al., ``Exploring the limits of transfer learning with a unified text-to-text transformer,'' \textit{JMLR}, vol. 21, no. 140, pp. 1--67, 2020.
 
  \bibitem{chung2022scaling}
+ H. W. Chung et al., ``Scaling instruction-finetuned language models,'' \textit{arXiv:2210.11416}, 2022.
 
  \bibitem{nallapati2016abstractive}
  R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang, ``Abstractive text summarization using sequence-to-sequence RNNs and beyond,'' in \textit{CoNLL}, 2016.
 
  W. Kryscinski, N. Rajani, D. Agarwal, and C. Xiong, ``BookSum: A collection of datasets for long-form narrative summarization,'' in \textit{Findings of EMNLP}, 2021.
 
  \bibitem{cohan2018discourse}
+ A. Cohan et al., ``A discourse-aware attention model for abstractive summarization of long documents,'' in \textit{NAACL-HLT}, 2018.
+
+ \bibitem{mao2022citesum}
+ Y. Mao, M. Zhong, and J. Han, ``CiteSum: Citation text-guided scientific extreme summarization and domain adaptation with limited supervision,'' in \textit{EMNLP}, 2022.
 
  \bibitem{demszky2020goemotions}
  D. Demszky et al., ``GoEmotions: A dataset of fine-grained emotions,'' in \textit{ACL}, 2020.
 
+ \bibitem{zhang2019root}
+ B. Zhang and R. Sennrich, ``Root mean square layer normalization,'' in \textit{NeurIPS}, 2019.
 
  \bibitem{lin2004rouge}
  C.-Y. Lin, ``ROUGE: A package for automatic evaluation of summaries,'' in \textit{Text Summarization Branches Out}, 2004.
 
  \bibitem{hu2022lora}
  E. J. Hu et al., ``LoRA: Low-rank adaptation of large language models,'' in \textit{ICLR}, 2022.
 
+ \bibitem{houlsby2019parameter}
+ N. Houlsby et al., ``Parameter-efficient transfer learning for NLP,'' in \textit{ICML}, 2019.
+
+ \bibitem{lin2017focal}
+ T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Doll\'{a}r, ``Focal loss for dense object detection,'' in \textit{ICCV}, 2017.
+
  \end{thebibliography}
 
  \end{document}
scripts/demo_gradio.py CHANGED
@@ -468,12 +468,12 @@ since descriptions paraphrase rather than quote the source text.*
 
  ### Training Data
 
- | Dataset | Task | Description |
- |---------|------|-------------|
- | Goodreads (711k+ blurbs) | Book Descriptions | Back-cover style descriptions matched with Gutenberg texts |
- | arXiv | Paper Abstracts | Scientific paper summarization |
- | 20 Newsgroups + Gutenberg | Topic Classification | Multi-domain topic categorization |
- | GoEmotions | Emotion Detection | 28-class multi-label emotion classification |
+ | Dataset | Task | Samples |
+ |---------|------|---------|
+ | Gutenberg + Goodreads | Book Descriptions | ~4K literary pairs |
+ | arXiv (body → abstract) | Paper Abstracts | ~45K academic pairs |
+ | 20 Newsgroups + Gutenberg + arXiv | Topic Classification | 3.4K (7 classes) |
+ | GoEmotions (Reddit) | Emotion Detection | 43K (28 labels) |
 
  ### Key Design Decision
 
@@ -483,9 +483,9 @@ since descriptions paraphrase rather than quote the source text.*
 
  ### Evaluation Metrics
 
- - **ROUGE-1/2/L**: Lexical overlap (expected range: 0.15-0.25 for descriptions)
+ - **ROUGE-1/2/L**: Lexical overlap with reference summaries
  - **BLEU-4**: N-gram precision
- - **BERTScore**: Semantic similarity using contextual embeddings (key metric for paraphrasing)
+ - **BERTScore**: Semantic similarity using contextual embeddings (primary metric for abstractive summarization)
 
  ### Links
 
scripts/train.py CHANGED
@@ -5,7 +5,7 @@ Training script for LexiMind.
 
  Simple, clean training with multi-task learning across:
  - Summarization (BookSum + arXiv papers)
  - Emotion classification (GoEmotions, 28 labels)
- - Topic classification (Books + Papers, 8 labels: Fiction, Science, Technology, etc.)
+ - Topic classification (Books + Papers, 7 labels: Arts, Business, Fiction, History, Philosophy, Science, Technology)
 
  Usage:
  python scripts/train.py training=medium