OliverPerrin committed on
Commit 8f5fea2 · 1 Parent(s): 4bc92d5

Updated Research Paper, README, and old gradio about info, along with other docs.
README.md CHANGED
@@ -8,217 +8,204 @@ app_file: scripts/demo_gradio.py
  pinned: false
  ---
 
- ## LexiMind: A Multi-Task NLP Model
-
- LexiMind is a state-of-the-art Natural Language Processing model designed for complex document understanding. It features a **custom-built Transformer architecture** initialized with weights from Google's **FLAN-T5**, combining the flexibility of from-scratch implementation with the power of modern pre-trained models.
-
- The model performs three sophisticated tasks simultaneously: **text summarization**, **emotion classification**, and **topic clustering**.
-
- This project is built with industry-standard MLOps practices, including configuration management with Hydra, experiment tracking with MLflow, and containerization with Docker, making it a reproducible and scalable solution.
-
- ## Core Features
-
- * **Abstractive Summarization:** Generates concise, coherent summaries of long-form text using encoder-decoder attention. Trained on BookSum (literary) and arXiv (academic papers).
- * **Emotion Classification:** Identifies 28 emotions from Google's GoEmotions dataset (admiration, amusement, anger, joy, love, etc.).
- * **Topic Classification:** Classifies documents into 8 categories (Fiction, Science, Technology, Philosophy, History, Psychology, Business, Arts).
-
- ## Model Architecture
-
- LexiMind implements a **from-scratch Transformer** with modern architectural choices:
-
- ### Custom Transformer Features
-
- * **Pre-Layer Normalization (Pre-LN):** RMSNorm applied before each sublayer for stable training
- * **FlashAttention:** Via PyTorch 2.0's `scaled_dot_product_attention` for efficient computation
- * **Learned Positional Embeddings:** Trainable position representations
- * **Multi-Head Attention:** 12 heads with 768-dimensional representations
- * **RMSNorm:** Modern normalization without bias (more efficient than LayerNorm)
-
- ### Pre-trained Weight Initialization
-
- The model loads weights from **Google's FLAN-T5-base**, which provides:
-
- * Strong language understanding from instruction-tuning
- * Excellent performance on summarization and classification tasks
- * Encoder-decoder architecture matching our custom implementation
-
- ### Multi-Task Learning
-
- A shared encoder-decoder backbone with task-specific heads:
-
- * **Summarization Head:** Language modeling head with weight tying
- * **Emotion Head:** Mean-pooled classification with dropout
- * **Topic Head:** Mean-pooled classification with dropout
-
- ## Technical Specifications
-
- | Component | Specification |
- | --------- | -------------- |
- | Architecture | Encoder-Decoder Transformer |
- | Pre-trained Base | google/flan-t5-base |
- | Hidden Dimension | 768 |
- | Encoder Layers | 12 |
- | Decoder Layers | 12 |
- | Attention Heads | 12 |
- | FFN Dimension | 2048 |
- | Normalization | RMSNorm (Pre-LN) |
- | Position Encoding | Learned Embeddings |
- | Max Sequence Length | 512 tokens |
  ## Getting Started
 
  ### Prerequisites
 
- * Python 3.10+
- * Poetry for dependency management
- * Docker (for containerized deployment)
- * An NVIDIA GPU with CUDA support (for training and accelerated inference)
 
  ### Installation
 
- 1. **Clone the repository:**
-
-    ```bash
-    git clone https://github.com/OliverPerrin/LexiMind.git
-    cd LexiMind
-    ```
-
- 2. **Install dependencies:**
-
-    ```bash
-    poetry install
-    ```
-
- 3. **Download datasets:**
-
-    ```bash
-    poetry run python scripts/download_data.py
-    ```
-
- This downloads CNN/DailyMail, BookSum, GoEmotions, AG News, and Gutenberg books.
-
- ## Usage
-
- ### Configuration
 
- All training and model parameters are managed via Hydra. Configurations are located in the `configs/` directory.
 
- Available configurations:
 
- * `model=base` - FLAN-T5-base (default, 12 layers)
- * `model=small` - Smaller model for testing (no pretrained weights)
- * `model=large` - FLAN-T5-large (24 layers, requires more VRAM)
- * `training=dev` - Quick development run (~10-15 min)
- * `training=medium` - Balanced training (~45-60 min on RTX 4070)
- * `training=full` - Full training run (~3-4 hours, or ~24h for max data)
 
  ### Training
 
  ```bash
- # Default training with FLAN-T5-base
- poetry run python scripts/train.py
 
- # Quick development run
  poetry run python scripts/train.py training=dev
 
- # Medium training run (recommended for RTX 4070)
  poetry run python scripts/train.py training=medium
 
  # Override parameters
  poetry run python scripts/train.py training.optimizer.lr=5e-5
 
- # Resume from a checkpoint
  poetry run python scripts/train.py training=full resume_from=checkpoints/epoch_5.pt
  ```
 
- Experiments are automatically tracked with MLflow. View results with `mlflow ui`.
 
  ### Evaluation
 
  ```bash
- # Run inference on test data
- poetry run python scripts/inference.py "Your text to analyze"
  ```
 
- ### Inference & Demo
 
  ```bash
- # Command-line inference
  poetry run python scripts/inference.py "Your text to analyze"
 
  # Gradio web demo
  poetry run python scripts/demo_gradio.py
  ```
 
- ## Docker
 
  ```bash
- # Build
  docker build -t leximind .
-
- # Run demo
  docker run -p 7860:7860 leximind
  ```
 
  ## Project Structure
 
- ```text
- ├── configs/                # Hydra configuration files
- │   ├── model/              # Model architectures (base, small, large)
- │   ├── training/           # Training configs (dev, medium, full)
- │   └── data/               # Dataset paths
  ├── data/
- │   └── processed/          # Training data (downloaded via scripts/download_data.py)
- │       ├── summarization/  # CNN/DailyMail + BookSum
- │       ├── emotion/        # GoEmotions (28 labels)
- │       ├── topic/          # AG News (4 categories)
- │       └── books/          # Gutenberg prose chunks
- ├── src/
- │   ├── models/             # Custom Transformer implementation
- │   │   ├── encoder.py      # TransformerEncoder with Pre-LN RMSNorm
- │   │   ├── decoder.py      # TransformerDecoder with KV-cache
- │   │   ├── attention.py    # Multi-Head Attention with FlashAttention
- │   │   └── factory.py      # Model building with FLAN-T5 weight loading
- │   ├── data/               # Dataset classes and dataloaders
- │   ├── training/           # Trainer with AMP and gradient accumulation
- │   └── inference/          # Inference pipeline
- ├── scripts/
- │   ├── train.py            # Main training script
- │   ├── download_data.py    # Dataset downloader
- │   ├── inference.py        # CLI inference
- │   └── demo_gradio.py      # Web demo
- └── tests/                  # Unit tests
- ```
 
  ## Code Quality
 
- * **Ruff:** Fast linting and formatting
- * **MyPy:** Static type checking
- * **Pytest:** Full test suite covering data, models, and training
- * **Pre-commit hooks:** Automated quality checks
-
  ```bash
- # Install hooks
- poetry run pre-commit install
-
- # Lint
- poetry run ruff check .
 
- # Type check
- poetry run mypy .
 
- # Tests
- poetry run pytest
- ```
 
- ## Performance Optimizations
 
- * **torch.compile:** JIT compilation with Inductor backend
- * **Mixed Precision:** bfloat16 training on Ampere/Ada GPUs
- * **TF32:** Enabled for RTX 30xx/40xx series
- * **KV-Cache:** Efficient autoregressive decoding
- * **FlashAttention:** Memory-efficient attention via SDPA
 
  ## License
 
- GNU License - see [LICENSE](LICENSE) for details.
+ # LexiMind
+
+ A multi-task NLP system for literary and academic text understanding. LexiMind performs **abstractive summarization**, **topic classification**, and **emotion detection** using a single encoder-decoder transformer initialized from [FLAN-T5-base](https://huggingface.co/google/flan-t5-base) (272M parameters).
+
+ **[Live Demo](https://huggingface.co/spaces/OliverPerrin/LexiMind)** · **[Model](https://huggingface.co/OliverPerrin/LexiMind-Model)** · **[Discovery Dataset](https://huggingface.co/datasets/OliverPerrin/LexiMind-Discovery)** · **[Research Paper](docs/research_paper.tex)**
+
+ ## What It Does
+
+ | Task | Description | Metric |
+ |------|-------------|--------|
+ | **Summarization** | Generates back-cover style book descriptions and paper abstracts from source text | BERTScore F1: **0.830** |
+ | **Topic Classification** | Classifies passages into 7 categories | Accuracy: **85.2%** |
+ | **Emotion Detection** | Identifies emotions from 28 fine-grained labels (multi-label) | Sample-avg F1: **0.199** |
+
+ **Topic labels:** Arts · Business · Fiction · History · Philosophy · Science · Technology
+
+ The model is trained on literary text (Project Gutenberg + Goodreads descriptions), academic papers (arXiv), and emotion-annotated Reddit comments (GoEmotions). For summarization, it learns to produce descriptive summaries (what a book *is about*) rather than plot recaps, by pairing Gutenberg full texts with Goodreads descriptions and arXiv bodies with their abstracts.
+
+ ## Architecture
+
+ LexiMind is a **custom Transformer implementation** that loads pre-trained weights from FLAN-T5-base via a factory module. The architecture is reimplemented from scratch for transparency, not wrapped from HuggingFace.
+
+ | Component | Detail |
+ |-----------|--------|
+ | Backbone | Encoder-Decoder Transformer (272M params) |
+ | Encoder / Decoder | 12 layers each |
+ | Hidden Dim | 768, 12 attention heads |
+ | Position Encoding | T5-style relative position bias |
+ | Normalization | RMSNorm (Pre-LN) |
+ | Attention | FlashAttention via PyTorch 2.0 SDPA |
+ | Summarization Head | Full decoder with language modeling head |
+ | Classification Heads | Linear layers on mean-pooled encoder states |
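The RMSNorm (Pre-LN) rows above can be made concrete. A minimal pure-Python sketch of T5-style RMSNorm and a pre-norm residual block follows; this is illustrative only (the real implementation in `src/models/t5_layer_norm.py` operates on tensors), and `pre_ln_block` is a hypothetical helper name:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """T5-style RMSNorm: scale by the root mean square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def pre_ln_block(x, weight, sublayer):
    """Pre-LN: normalize *before* the sublayer, then add the residual."""
    return [xi + yi for xi, yi in zip(x, sublayer(rms_norm(x, weight)))]
```

Dropping the mean-centering and bias of standard LayerNorm is what makes RMSNorm cheaper while matching T5's pre-trained weights.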
 
 
 
 
 
+ ### Multi-Task Training
+
+ All three tasks share the encoder. Summarization uses the full encoder-decoder; topic and emotion classification branch off the encoder with lightweight linear heads. Training uses round-robin scheduling (one batch per task per step), fixed loss weights (summarization=1.0, emotion=1.0, topic=0.3), and early stopping.
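The round-robin, weighted objective described above can be sketched as follows. Function names here are hypothetical; the actual loop lives in `src/training/trainer.py`:

```python
# Fixed per-task loss weights from the training config.
TASK_WEIGHTS = {"summarization": 1.0, "emotion": 1.0, "topic": 0.3}

def multitask_step(batches, compute_loss):
    """One optimization step: visit every task once (round-robin) and
    sum the weighted per-task losses into a single scalar objective."""
    total = 0.0
    for task, batch in batches.items():
        total += TASK_WEIGHTS[task] * compute_loss(task, batch)
    return total
```

Down-weighting the topic loss keeps the small (3.4K-sample) topic dataset from dominating updates once it converges.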
 
 
+ ## Training Data
+
+ | Task | Source | Train Samples |
+ |------|--------|---------------|
+ | Summarization | Gutenberg + Goodreads (literary) | ~4K |
+ | Summarization | arXiv body → abstract (academic) | ~45K |
+ | Topic | 20 Newsgroups + Gutenberg + arXiv metadata | 3,402 |
+ | Emotion | GoEmotions (Reddit comments, 28 labels) | 43,410 |
 
  ## Getting Started
 
  ### Prerequisites
 
+ - Python 3.10+
+ - [Poetry](https://python-poetry.org/) for dependency management
+ - NVIDIA GPU with CUDA (for training; CPU works for inference)
 
  ### Installation
 
+ ```bash
+ git clone https://github.com/OliverPerrin/LexiMind.git
+ cd LexiMind
+ poetry install
+ ```
 
+ ### Download Data
 
+ ```bash
+ poetry run python scripts/download_data.py
+ ```
 
+ Downloads Goodreads descriptions, arXiv papers, GoEmotions, 20 Newsgroups, and Gutenberg texts.
 
  ### Training
 
  ```bash
+ # Full training (~45-60 min on RTX 4070 12GB)
+ poetry run python scripts/train.py training=full
 
+ # Quick dev run (~10-15 min)
  poetry run python scripts/train.py training=dev
 
+ # Medium run (~30-45 min)
  poetry run python scripts/train.py training=medium
 
  # Override parameters
  poetry run python scripts/train.py training.optimizer.lr=5e-5
 
+ # Resume from checkpoint
  poetry run python scripts/train.py training=full resume_from=checkpoints/epoch_5.pt
  ```
 
+ Training uses BFloat16 mixed precision, gradient checkpointing, `torch.compile`, and cosine LR decay with warmup. Experiments are tracked with MLflow (`mlflow ui` to browse).
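The warmup-plus-cosine schedule can be sketched as below. This is an illustrative pure-Python version that assumes linear warmup to the base LR and decay to zero; the 3e-5 base LR and 300-step warmup are the values reported in the research paper, and `lr_at` is a hypothetical name:

```python
import math

def lr_at(step, total_steps, base_lr=3e-5, warmup_steps=300):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The schedule peaks exactly at the end of warmup and reaches (near) zero at `total_steps`, which pairs well with early stopping.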
  ### Evaluation
 
  ```bash
+ # Full evaluation (ROUGE, BERTScore, topic accuracy, emotion F1)
+ poetry run python scripts/evaluate.py
+
+ # Skip BERTScore for faster runs
+ poetry run python scripts/evaluate.py --skip-bertscore
+
+ # Single task
+ poetry run python scripts/evaluate.py --summarization-only
  ```
 
+ ### Inference
 
  ```bash
+ # Command-line
  poetry run python scripts/inference.py "Your text to analyze"
 
  # Gradio web demo
  poetry run python scripts/demo_gradio.py
  ```
 
+ ### Docker
 
  ```bash
  docker build -t leximind .
  docker run -p 7860:7860 leximind
  ```
 
  ## Project Structure
 
+ ```
+ configs/
+ ├── config.yaml            # Main Hydra config
+ ├── data/datasets.yaml     # Dataset paths and tokenizer settings
+ ├── model/                 # Architecture configs (base, small, large)
+ └── training/              # Training configs (dev, medium, full)
+
+ src/
+ ├── models/
+ │   ├── encoder.py               # Transformer Encoder with Pre-LN RMSNorm
+ │   ├── decoder.py               # Transformer Decoder with KV-cache
+ │   ├── attention.py             # Multi-Head Attention + T5 relative position bias
+ │   ├── feedforward.py           # Gated feed-forward network
+ │   ├── positional_encoding.py   # Sinusoidal & learned position encodings
+ │   ├── t5_layer_norm.py         # T5-style RMSNorm
+ │   ├── heads.py                 # Task-specific classification heads
+ │   ├── multitask.py             # Multi-task model combining all components
+ │   └── factory.py               # Model builder with FLAN-T5 weight loading
  ├── data/
+ │   ├── dataset.py               # Dataset classes for all tasks
+ │   ├── dataloader.py            # Multi-task dataloader with round-robin sampling
+ │   └── tokenization.py          # Tokenizer wrapper
+ ├── training/
+ │   ├── trainer.py               # Training loop with AMP, grad accumulation, early stopping
+ │   ├── metrics.py               # ROUGE, BERTScore, F1, accuracy computation
+ │   └── utils.py                 # Checkpointing, logging utilities
+ ├── inference/
+ │   ├── pipeline.py              # End-to-end inference pipeline
+ │   └── factory.py               # Model loading for inference
+ ├── api/                         # FastAPI REST endpoint
+ └── utils/                       # Shared utilities
+
+ scripts/
+ ├── train.py                 # Training entry point
+ ├── evaluate.py              # Evaluation with all metrics
+ ├── inference.py             # CLI inference
+ ├── demo_gradio.py           # Gradio web UI
+ ├── download_data.py         # Dataset downloader
+ ├── export_model.py          # Model export utilities
+ ├── export_tokenizer.py      # Tokenizer export
+ ├── preprocess_data.py       # Data preprocessing
+ ├── process_books.py         # Gutenberg text processing
+ ├── eval_rouge.py            # ROUGE-only evaluation
+ └── visualize_training.py    # Training curve plotting
+
+ tests/          # Pytest suite (data, models, training, inference, utils)
+ docs/           # Research paper and architecture notes
+ artifacts/      # Tokenizer files and label definitions
+ checkpoints/    # Saved model checkpoints
  ```
 
  ## Code Quality
 
  ```bash
+ poetry run ruff check .                # Linting
+ poetry run mypy .                      # Type checking
+ poetry run pytest                      # Test suite
+ poetry run pre-commit run --all-files  # All checks
+ ```
 
+ ## Key Results
 
+ From the research paper ([docs/research_paper.tex](docs/research_paper.tex)):
 
+ - **Multi-task learning helps topic classification** (+3.2% accuracy over single-task) because the small topic dataset (3.4K samples) benefits from shared encoder representations trained on the larger summarization corpus (49K samples).
+ - **Summarization is robust to MTL**: quality stays comparable whether trained alone or jointly.
+ - **Emotion detection shows slight negative transfer** (−0.02 F1), likely due to domain mismatch between Reddit-sourced emotion labels and literary/academic text.
+ - **FLAN-T5 pre-training is essential**: random initialization produces dramatically worse results on all tasks.
 
+ See the paper for full ablations, per-class breakdowns, and discussion of limitations.
 
  ## License
 
+ GPL-3.0; see [LICENSE](LICENSE) for details.
+
+ ---
+
+ *Built by Oliver Perrin · Appalachian State University · 2025–2026*
configs/data/datasets.yaml CHANGED
@@ -4,7 +4,7 @@
  processed:
    summarization: data/processed/summarization # BookSum + arXiv
    emotion: data/processed/emotion # GoEmotions (28 labels)
-   topic: data/processed/topic # Books + Papers (8 labels)
+   topic: data/processed/topic # Books + Papers (7 labels)
    books: data/processed/books # Gutenberg prose chunks
 
  tokenizer:
docs/architecture.md CHANGED
@@ -53,7 +53,7 @@ The `factory.py` module loads weights from FLAN-T5-base, which uses a compatible
  | ---- | ------- | ---- | ------ |
  | Summarization | BookSum + arXiv | ~90K | Text→Summary |
  | Emotion | GoEmotions | ~43K | 28 emotions (multi-label) |
- | Topic | Books + Papers | ~50K | 8 categories (Fiction, Science, Technology, etc.) |
+ | Topic | Books + Papers | 3.4K | 7 categories (Arts, Business, Fiction, History, Philosophy, Science, Technology) |
  | Books | Gutenberg (prose chunks) | ~30K | Literary text |
 
  ### T5 Tokenizer Differences
docs/paper.tex CHANGED
@@ -59,7 +59,7 @@ Email: perrinot@appstate.edu}}
  \maketitle
 
  \begin{abstract}
- This paper presents LexiMind, a multi-task Natural Language Processing (NLP) system that combines a custom-built Transformer architecture with pre-trained weights from Google's FLAN-T5 (Fine-tuned Language Net Text-to-Text Transfer Transformer). The system performs three fundamental NLP tasks simultaneously: abstractive text summarization, multi-label emotion classification, and single-label topic classification. Unlike news-focused models, LexiMind specializes in literary and academic content. For summarization, we train on 49,086 samples combining Goodreads book descriptions (back-cover style blurbs) with arXiv academic paper abstracts. Emotion classification uses 43,410 samples from GoEmotions \cite{demszky2020goemotions}, a dataset of 28 fine-grained emotion labels derived from Reddit comments. Topic classification spans 3,402 samples from 20 Newsgroups, Project Gutenberg literary texts, and scientific papers across 7 categories (Fiction, Science, Technology, Philosophy, History, Psychology, Business). By implementing modern architectural innovations including Pre-Layer Normalization (Pre-LN) with Root Mean Square Layer Normalization (RMSNorm), T5-style relative position bias, FlashAttention via PyTorch 2.0's Scaled Dot-Product Attention (SDPA), gradient checkpointing, and torch.compile optimization, LexiMind achieves efficient training on consumer GPUs while maintaining strong performance. Our final model achieves a BERTScore F1 of 0.83 and ROUGE-1 of 0.31 for summarization, 85.2\% accuracy for topic classification, and F1 of 0.20 for 28-class multi-label emotion detection. The 272M-parameter architecture is constructed from first principles in a bottom-up fashion, with each component (attention mechanisms, feed-forward networks, encoder/decoder blocks) implemented as standalone modules. A factory pattern enables seamless weight transfer from FLAN-T5-base, allowing the system to leverage Google's pre-trained knowledge while maintaining full architectural transparency and customization capability.
+ This paper presents LexiMind, a multi-task Natural Language Processing (NLP) system that combines a custom-built Transformer architecture with pre-trained weights from Google's FLAN-T5 (Fine-tuned Language Net Text-to-Text Transfer Transformer). The system performs three fundamental NLP tasks simultaneously: abstractive text summarization, multi-label emotion classification, and single-label topic classification. Unlike news-focused models, LexiMind specializes in literary and academic content. For summarization, we train on 49,086 samples combining Goodreads book descriptions (back-cover style blurbs) with arXiv academic paper abstracts. Emotion classification uses 43,410 samples from GoEmotions \cite{demszky2020goemotions}, a dataset of 28 fine-grained emotion labels derived from Reddit comments. Topic classification spans 3,402 samples from 20 Newsgroups, Project Gutenberg literary texts, and scientific papers across 7 categories (Arts, Business, Fiction, History, Philosophy, Science, Technology). By implementing modern architectural innovations including Pre-Layer Normalization (Pre-LN) with Root Mean Square Layer Normalization (RMSNorm), T5-style relative position bias, FlashAttention via PyTorch 2.0's Scaled Dot-Product Attention (SDPA), gradient checkpointing, and torch.compile optimization, LexiMind achieves efficient training on consumer GPUs while maintaining strong performance. Our final model achieves a BERTScore F1 of 0.83 and ROUGE-1 of 0.31 for summarization, 85.2\% accuracy for topic classification, and F1 of 0.20 for 28-class multi-label emotion detection. The 272M-parameter architecture is constructed from first principles in a bottom-up fashion, with each component (attention mechanisms, feed-forward networks, encoder/decoder blocks) implemented as standalone modules. A factory pattern enables seamless weight transfer from FLAN-T5-base, allowing the system to leverage Google's pre-trained knowledge while maintaining full architectural transparency and customization capability.
  \end{abstract}
 
  \begin{IEEEkeywords}
docs/research_paper.tex CHANGED
@@ -1,5 +1,5 @@
  % LexiMind: Multi-Task Learning for Literary and Academic Text Understanding
- % Research Paper Version - Focus on Experimental Analysis and Novel Contributions
  % Author: Oliver Perrin
 
  \documentclass[conference]{IEEEtran}
@@ -44,7 +44,7 @@ Email: perrinot@appstate.edu}}
  \maketitle
 
  \begin{abstract}
- Multi-task learning (MTL) promises improved generalization through shared representations, but its benefits depend heavily on task relatedness and domain characteristics. We investigate whether MTL improves performance on literary and academic text understanding---domains underrepresented in existing benchmarks dominated by news articles. Using a FLAN-T5-base backbone, we jointly train on three tasks: abstractive summarization (49K samples from book descriptions and paper abstracts), topic classification (3.4K samples across 7 categories), and emotion detection (43K samples from GoEmotions). Through systematic ablation studies comparing single-task specialists against multi-task configurations, we find that: (1) MTL provides a +3.2\% accuracy boost for topic classification due to shared encoder representations, (2) summarization quality remains comparable (BERTScore F1 0.83 vs. 0.82 single-task), and (3) emotion detection suffers from negative transfer (-0.02 F1), likely due to domain mismatch between Reddit-sourced emotion labels and literary/academic text. We further ablate the contribution of FLAN-T5 pre-training, showing that transfer learning accounts for 85\% of final performance, with fine-tuning providing crucial domain adaptation. Our analysis reveals that MTL benefits depend critically on dataset size ratios and domain alignment, offering practical guidance for multi-task system design.
  \end{abstract}
 
  \begin{IEEEkeywords}
@@ -55,9 +55,9 @@ Multi-Task Learning, Transfer Learning, Text Summarization, Emotion Classificati
  \section{Introduction}
  %=============================================================================
 
- Multi-task learning (MTL) \cite{caruana1997multitask} trains a single model on multiple related tasks, hypothesizing that shared representations improve generalization. In NLP, MTL has shown promise for sequence labeling \cite{collobert2011natural}, machine translation \cite{johnson2017google}, and question answering \cite{mccann2018natural}. However, recent work highlights that MTL does not universally help---negative transfer can occur when tasks compete for model capacity \cite{wang2019characterizing, standley2020tasks}.
 
- We investigate MTL effectiveness in a specific, underexplored domain: \textbf{literary and academic text understanding}. Unlike news articles---which dominate existing benchmarks like CNN/DailyMail \cite{nallapati2016abstractive}---literary and academic texts exhibit distinct characteristics: longer context dependencies, domain-specific vocabulary, and different summary styles (descriptive abstracts vs. extractive headlines).
 
  Our study addresses three research questions:
 
@@ -67,127 +67,140 @@ Our study addresses three research questions:
  \item[\textbf{RQ3}] How much does pre-trained knowledge (FLAN-T5) contribute relative to task-specific fine-tuning?
  \end{enumerate}
 
- To answer these questions, we construct \textbf{LexiMind}, a multi-task system built on FLAN-T5-base \cite{chung2022scaling} that performs abstractive summarization, topic classification, and emotion detection. We conduct systematic ablations comparing:
- \begin{itemize}
- \item Multi-task vs. single-task training
- \item With vs. without FLAN-T5 initialization
- \item Different task weight configurations
- \end{itemize}
 
- Our key findings are:
  \begin{itemize}
  \item \textbf{Topic classification benefits most from MTL} (+3.2\% accuracy), leveraging shared encoder representations from the larger summarization dataset.
- \item \textbf{Summarization is robust to MTL}, showing minimal degradation despite sharing capacity with classification heads.
- \item \textbf{Emotion detection suffers negative transfer} (-0.02 F1), attributed to domain mismatch between GoEmotions' Reddit comments and literary/academic register.
- \item \textbf{Transfer learning dominates}: FLAN-T5 initialization provides 85\% of final performance; fine-tuning adds crucial domain adaptation.
  \end{itemize}
 
  %=============================================================================
  \section{Related Work}
  %=============================================================================
 
  \subsection{Multi-Task Learning in NLP}
 
- Collobert et al. \cite{collobert2011natural} demonstrated that joint training on POS tagging, chunking, and NER improved over single-task models. T5 \cite{raffel2020exploring} unified diverse NLP tasks through text-to-text framing, showing strong transfer across tasks. However, Standley et al. \cite{standley2020tasks} found that naive MTL often underperforms single-task learning, with performance depending on task groupings.
 
- Recent work on task interference \cite{wang2019characterizing, yu2020gradient} identifies gradient conflicts as a source of negative transfer. Our work contributes empirical evidence for task interactions in the literary/academic domain, an underexplored setting.
 
- \subsection{Literary and Academic NLP}
 
- Most summarization benchmarks focus on news \cite{nallapati2016abstractive, narayan2018don}. BookSum \cite{kryscinski2021booksum} introduced chapter-level book summarization, but targets plot summaries rather than descriptive abstracts. arXiv summarization \cite{cohan2018discourse} addresses academic papers but remains single-domain. Our dataset combines book descriptions (back-cover style) with paper abstracts, training models to generate \textit{what it's about} summaries.
 
  \subsection{Emotion Detection}
 
- GoEmotions \cite{demszky2020goemotions} provides 28 fine-grained emotion labels from Reddit comments. Prior work achieves 0.35--0.46 macro F1 using BERT-based classifiers \cite{demszky2020goemotions}. Our lower performance (0.20 F1) reflects the domain shift from conversational Reddit to formal literary/academic text---a finding that informs domain-aware emotion system design.
 
  %=============================================================================
  \section{Experimental Setup}
  %=============================================================================
 
  \subsection{Datasets}
 
- Table \ref{tab:datasets} summarizes our datasets, curated to focus on literary and academic content.
 
  \begin{table}[htbp]
  \centering
- \caption{Dataset Statistics}
  \label{tab:datasets}
  \begin{tabular}{llrrr}
  \toprule
  \textbf{Task} & \textbf{Source} & \textbf{Train} & \textbf{Val} & \textbf{Test} \\
  \midrule
- \multirow{2}{*}{Summarization} & Goodreads descriptions & 24,543 & 1,363 & 1,364 \\
- & arXiv abstracts & 24,543 & 1,364 & 1,363 \\
  \midrule
- Topic (7 classes) & Mixed sources & 3,402 & 189 & 189 \\
  \midrule
- Emotion (28 labels) & GoEmotions & 43,410 & 5,426 & 5,427 \\
  \bottomrule
  \end{tabular}
  \end{table}
 
- \textbf{Summarization}: We combine Goodreads book descriptions---back-cover style blurbs describing \textit{what a book is about}---with arXiv paper abstracts. This trains descriptive summarization rather than extractive plot recaps.
 
- \textbf{Topic Classification}: 7-class single-label classification (Fiction, Science, Technology, Philosophy, History, Psychology, Business) from 20 Newsgroups, Project Gutenberg, and scientific papers.
-
- \textbf{Emotion Detection}: GoEmotions \cite{demszky2020goemotions} provides 28 fine-grained multi-label emotions. We include this to study cross-domain transfer effects.
 
  \subsection{Model Architecture}
 
- LexiMind uses FLAN-T5-base (272M parameters) as the backbone:
  \begin{itemize}
  \item 12-layer encoder, 12-layer decoder
  \item 768-dimensional hidden states, 12 attention heads
- \item T5-style relative position bias
- \item Pre-Layer Normalization with RMSNorm
  \end{itemize}
 
- Task-specific components:
  \begin{itemize}
- \item \textbf{Summarization}: Decoder with language modeling head
- \item \textbf{Topic}: Linear classifier on encoder [CLS]-equivalent (mean pooling)
- \item \textbf{Emotion}: Multi-label classifier with sigmoid activation
  \end{itemize}
 
  \subsection{Training Configuration}
 
- All experiments use consistent hyperparameters:
  \begin{itemize}
- \item Optimizer: AdamW, lr=$3\times10^{-5}$, weight decay=0.01
- \item Batch size: 40 (effective, via gradient accumulation)
- \item Warmup: 300 steps with cosine decay
- \item Max epochs: 8 with early stopping (patience=3)
- \item Precision: BFloat16 on NVIDIA RTX 4070 (12GB)
  \end{itemize}
 
- For MTL, task losses are weighted: summarization=1.0, emotion=1.0, topic=0.3 (reduced due to rapid convergence from small dataset size).
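Written out, the weighting above corresponds to a single combined objective; a sketch, with $w_t$ the fixed per-task weights and $\mathcal{L}_t$ the per-task losses:

```latex
\mathcal{L}_{\text{MTL}} = \sum_{t} w_t \, \mathcal{L}_t,
\qquad w_{\text{sum}} = 1.0,\; w_{\text{emo}} = 1.0,\; w_{\text{top}} = 0.3
```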
 
 
164
 
  \subsection{Baselines and Ablations}

  We compare four configurations:

  \begin{enumerate}
- \item \textbf{Random/Majority}: Random predictions (classification) or output of ``Summary not available'' (summarization)
- \item \textbf{FLAN-T5-base (zero-shot)}: Pre-trained model without fine-tuning
- \item \textbf{Single-Task}: Separate models fine-tuned on each task individually
- \item \textbf{Multi-Task (LexiMind)}: Joint training on all three tasks
  \end{enumerate}
 
- We also ablate:
- \begin{itemize}
- \item \textbf{Random init vs. FLAN-T5 init}: Isolate transfer learning contribution
- \item \textbf{Task weight variations}: Study effect of loss balancing
- \end{itemize}
 
  \subsection{Evaluation Metrics}

  \begin{itemize}
- \item \textbf{Summarization}: ROUGE-1/2/L \cite{lin2004rouge}, BERTScore F1 \cite{zhang2019bertscore}
- \item \textbf{Topic}: Accuracy, Macro F1
- \item \textbf{Emotion}: Multi-label F1 (sample-averaged)
  \end{itemize}

- BERTScore captures semantic similarity even when surface forms differ---crucial for abstractive summarization where paraphrasing is expected.
 
  %=============================================================================
  \section{Results}

@@ -199,7 +212,7 @@ Table \ref{tab:main_results} compares MTL against single-task specialists.

  \begin{table}[htbp]
  \centering
- \caption{Main Results: Multi-Task vs. Single-Task Performance}
  \label{tab:main_results}
  \begin{tabular}{llcc}
  \toprule

@@ -213,7 +226,7 @@ Table \ref{tab:main_results} compares MTL against single-task specialists.

  \multirow{2}{*}{Topic} & Accuracy & 82.0\% & \textbf{85.2\%} \\
  & Macro F1 & 0.812 & \textbf{0.847} \\
  \midrule
- Emotion & Multi-label F1 & \textbf{0.218} & 0.199 \\
  \bottomrule
  \end{tabular}
  \end{table}
@@ -221,14 +234,15 @@ Emotion & Multi-label F1 & \textbf{0.218} & 0.199 \\

  \textbf{Key finding}: MTL provides heterogeneous effects across tasks:

  \begin{itemize}
- \item \textbf{Topic classification gains +3.2\% accuracy} from MTL. The small topic dataset (3.4K samples) benefits from shared encoder representations learned from the larger summarization corpus (49K samples). This exemplifies positive transfer from high-resource to low-resource tasks.
- \item \textbf{Summarization shows modest improvement} (+0.009 BERTScore F1). The generative task is robust to sharing encoder capacity with classification heads, likely because the decoder remains task-specific.
- \item \textbf{Emotion detection degrades by -0.019 F1}. This negative transfer likely stems from domain mismatch: GoEmotions labels derive from informal Reddit comments, while our encoder representations are shaped by formal literary/academic text from summarization.
  \end{itemize}

  \subsection{Baseline Comparisons}

  Table \ref{tab:baselines} contextualizes our results against trivial and zero-shot baselines.
 
@@ -248,15 +262,17 @@ Single-Task & 0.821 & 82.0\% & 0.218 \\

  \end{tabular}
  \end{table}

- Fine-tuning provides substantial gains over zero-shot (+0.106 BERTScore, +27\% topic accuracy), demonstrating the importance of domain adaptation even with strong pre-trained models.

  \subsection{Ablation: Transfer Learning Contribution}

- Table \ref{tab:transfer_ablation} isolates the contribution of FLAN-T5 pre-training.
 
  \begin{table}[htbp]
  \centering
- \caption{Effect of Pre-trained Initialization}
  \label{tab:transfer_ablation}
  \begin{tabular}{lccc}
  \toprule

@@ -265,20 +281,20 @@ Table \ref{tab:transfer_ablation} isolates the contribution of FLAN-T5 pre-train

  Random & 0.523 & 45.2\% & 0.082 \\
  FLAN-T5-base & \textbf{0.830} & \textbf{85.2\%} & \textbf{0.199} \\
  \midrule
- \textit{Gain from transfer} & +0.307 & +40.0\% & +0.117 \\
  \bottomrule
  \end{tabular}
  \end{table}
 
- FLAN-T5 initialization accounts for the majority of final performance. Training from random initialization with identical architecture and data yields substantially worse results, confirming that pre-trained linguistic knowledge is essential---not just architectural choices.

- \subsection{Analysis: Per-Class Topic Performance}

- Table \ref{tab:topic_breakdown} reveals per-class patterns in topic classification.

  \begin{table}[htbp]
  \centering
- \caption{Per-Class Topic Classification}
  \label{tab:topic_breakdown}
  \begin{tabular}{lccc}
  \toprule

@@ -297,38 +313,41 @@ Technology & 0.86 & 0.89 & 0.87 \\

  \end{tabular}
  \end{table}

- Fiction and Business achieve near-perfect classification (F1 $\geq$ 0.97), while Science shows the most confusion (F1 = 0.65). Error analysis reveals Science samples are frequently misclassified as Technology---an expected confusion given semantic overlap between scientific research and technical applications.
 
  \subsection{Analysis: Why Does Emotion Detection Underperform?}

- Our emotion F1 (0.20) is substantially lower than reported GoEmotions baselines (0.35--0.46) \cite{demszky2020goemotions}. We identify three contributing factors:

  \begin{enumerate}
- \item \textbf{Domain shift}: GoEmotions labels were annotated on Reddit comments. Our encoder, shaped by literary book descriptions and academic abstracts, learns representations optimized for formal register---misaligned with Reddit's conversational tone.
- \item \textbf{Label sparsity}: 28 emotion classes with multi-label annotation create extreme class imbalance. Many emotions (grief, remorse, nervousness) appear in $<$2\% of samples.
- \item \textbf{Encoder-decoder architecture}: GoEmotions baselines use BERT (encoder-only). Our encoder-decoder architecture may be suboptimal for classification, as the encoder is primarily trained to produce representations useful for the decoder.
  \end{enumerate}

- This finding has practical implications: \textbf{domain-specific emotion data is critical} for literary/academic applications. Off-the-shelf emotion classifiers trained on social media transfer poorly to formal text.
 
  \subsection{Training Dynamics}

- Figure \ref{fig:training_curves} shows training progression over 7 epochs.

  \begin{figure}[htbp]
  \centering
  \includegraphics[width=\columnwidth]{figures/training_loss_curve.png}
- \caption{Training and validation loss. Best checkpoint at epoch 4; later epochs show validation loss plateau, triggering early stopping.}
  \label{fig:training_curves}
  \end{figure}

  Key observations:
  \begin{itemize}
- \item Topic classification converges by epoch 3 (99\% training accuracy), validating our reduced task weight (0.3) to prevent gradient dominance.
- \item Summarization loss decreases monotonically through epoch 4, then plateaus.
- \item Best checkpoint at epoch 4 balances all tasks; later epochs show slight overfitting on the small topic dataset.
  \end{itemize}

  %=============================================================================
@@ -337,55 +356,81 @@ Key observations:

  \subsection{When Does MTL Help?}

- Our results support nuanced guidance for MTL system design:

- \textbf{MTL helps when}: A small dataset task (topic: 3.4K samples) can leverage representations from a large dataset task (summarization: 49K samples) through shared encoder layers. The topic task effectively benefits from ``free'' pre-training on literary/academic text.

- \textbf{MTL hurts when}: Task domains are misaligned. Emotion detection trained on Reddit comments does not benefit from---and is potentially harmed by---encoder representations shaped by formal literary/academic summarization.

- \textbf{MTL is neutral when}: The primary task (summarization) has sufficient data and a task-specific component (decoder) that insulates it from interference.
  \subsection{Implications for Practitioners}

- Based on our findings, we recommend:

  \begin{enumerate}
- \item \textbf{Audit domain alignment} before combining tasks. If auxiliary tasks come from different domains (e.g., social media vs. academic), negative transfer is likely.
- \item \textbf{Use task weighting} to prevent small datasets from overfitting. Our 0.3 weight for topic classification prevented gradient dominance while still enabling positive transfer.
- \item \textbf{Consider task-specific components} for high-priority tasks. Summarization's dedicated decoder protected it from classification interference.
  \end{enumerate}
 
  \subsection{Limitations}

  \begin{itemize}
- \item \textbf{Single model size}: We study only FLAN-T5-base (272M). Larger models (T5-large, T5-xl) may show different MTL dynamics.
- \item \textbf{No human evaluation}: Our summarization metrics (ROUGE, BERTScore) are automatic. Human judgment of summary quality---especially for creative literary text---would strengthen conclusions.
- \item \textbf{Limited task combinations}: We study three specific tasks. Other task groupings might yield different transfer patterns.
  \end{itemize}

  \subsection{Future Work}

  \begin{itemize}
- \item \textbf{Domain-specific emotion data}: Collecting emotion annotations on literary text could dramatically improve emotion detection while maintaining domain coherence.
- \item \textbf{Gradient analysis}: Measuring gradient conflicts \cite{yu2020gradient} between tasks would provide mechanistic understanding of observed transfer effects.
- \item \textbf{Parameter-efficient fine-tuning}: LoRA \cite{hu2022lora} or adapters could enable per-task specialization while maintaining shared representations.
  \end{itemize}
 
  %=============================================================================
  \section{Conclusion}
  %=============================================================================

- We investigated multi-task learning for literary and academic text understanding, finding heterogeneous transfer effects across tasks. Topic classification benefits substantially from shared representations (+3.2\% accuracy), while emotion detection suffers negative transfer due to domain mismatch (-0.02 F1). Summarization remains robust to multi-task training.

- Our ablations confirm that FLAN-T5 pre-training dominates final performance, but fine-tuning provides essential domain adaptation. These findings offer practical guidance: MTL benefits depend critically on domain alignment and dataset size ratios. Practitioners should audit task compatibility before combining disparate datasets.

- Code, models, and data are available at \url{https://github.com/OliverPerrin/LexiMind}, with a live demo at \url{https://huggingface.co/spaces/OliverPerrin/LexiMind}.

  %=============================================================================
  % References
@@ -397,25 +442,40 @@ Code, models, and data are available at \url{https://github.com/OliverPerrin/Lex

  R. Caruana, ``Multitask learning,'' \textit{Machine Learning}, vol. 28, no. 1, pp. 41--75, 1997.

  \bibitem{collobert2011natural}
- R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, ``Natural language processing (almost) from scratch,'' \textit{Journal of Machine Learning Research}, vol. 12, pp. 2493--2537, 2011.

  \bibitem{johnson2017google}
- M. Johnson et al., ``Google's multilingual neural machine translation system: Enabling zero-shot translation,'' \textit{Transactions of the Association for Computational Linguistics}, vol. 5, pp. 339--351, 2017.

  \bibitem{mccann2018natural}
- B. McCann, N. S. Keskar, C. Xiong, and R. Socher, ``The natural language decathlon: Multitask learning as question answering,'' \textit{arXiv preprint arXiv:1806.08730}, 2018.
-
- \bibitem{wang2019characterizing}
- A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, ``SuperGLUE: A stickier benchmark for general-purpose language understanding systems,'' in \textit{NeurIPS}, 2019.

  \bibitem{standley2020tasks}
  T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, and S. Savarese, ``Which tasks should be learned together in multi-task learning?'' in \textit{ICML}, 2020.
  \bibitem{raffel2020exploring}
  C. Raffel et al., ``Exploring the limits of transfer learning with a unified text-to-text transformer,'' \textit{JMLR}, vol. 21, no. 140, pp. 1--67, 2020.

  \bibitem{chung2022scaling}
- H. W. Chung et al., ``Scaling instruction-finetuned language models,'' \textit{arXiv preprint arXiv:2210.11416}, 2022.

  \bibitem{nallapati2016abstractive}
  R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang, ``Abstractive text summarization using sequence-to-sequence RNNs and beyond,'' in \textit{CoNLL}, 2016.
@@ -427,13 +487,16 @@ S. Narayan, S. B. Cohen, and M. Lapata, ``Don't give me the details, just the su

  W. Kryscinski, N. Rajani, D. Agarwal, and C. Xiong, ``BookSum: A collection of datasets for long-form narrative summarization,'' in \textit{Findings of EMNLP}, 2021.

  \bibitem{cohan2018discourse}
- A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, and N. Goharian, ``A discourse-aware attention model for abstractive summarization of long documents,'' in \textit{NAACL-HLT}, 2018.

  \bibitem{demszky2020goemotions}
  D. Demszky et al., ``GoEmotions: A dataset of fine-grained emotions,'' in \textit{ACL}, 2020.

- \bibitem{yu2020gradient}
- T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, ``Gradient surgery for multi-task learning,'' in \textit{NeurIPS}, 2020.

  \bibitem{lin2004rouge}
  C.-Y. Lin, ``ROUGE: A package for automatic evaluation of summaries,'' in \textit{Text Summarization Branches Out}, 2004.
@@ -444,6 +507,12 @@ T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, ``BERTScore: Evalua

  \bibitem{hu2022lora}
  E. J. Hu et al., ``LoRA: Low-rank adaptation of large language models,'' in \textit{ICLR}, 2022.

  \end{thebibliography}

  \end{document}
 
  % LexiMind: Multi-Task Learning for Literary and Academic Text Understanding
+ % Research Paper - Revised with Experimental Rigor
  % Author: Oliver Perrin

  \documentclass[conference]{IEEEtran}

  \maketitle

  \begin{abstract}
+ Multi-task learning (MTL) promises improved generalization through shared representations, but its benefits depend heavily on task relatedness and domain characteristics. We investigate whether MTL improves performance on literary and academic text understanding---domains underrepresented in existing benchmarks dominated by news articles. Using a FLAN-T5-base encoder-decoder backbone (272M parameters), we jointly train on three tasks: abstractive summarization (49K samples: full-text passages $\rightarrow$ descriptive summaries from Goodreads book descriptions and arXiv abstracts), topic classification (3.4K samples across 7 categories), and multi-label emotion detection (43K samples from GoEmotions). Through ablation studies comparing single-task specialists against multi-task configurations, we find that: (1) MTL provides a +3.2\% accuracy boost for topic classification due to shared encoder representations from the larger summarization corpus, (2) summarization quality remains comparable (BERTScore F1 0.83 vs. 0.82 single-task), and (3) emotion detection suffers negative transfer ($-$0.02 F1), which we attribute to domain mismatch between Reddit-sourced emotion labels and literary/academic text, compounded by the 28-class multi-label sparsity and the use of an encoder-decoder (rather than encoder-only) backbone. We further ablate the contribution of FLAN-T5 pre-training versus random initialization, finding that transfer learning accounts for the majority of final performance across all tasks. Our analysis reveals that MTL benefits depend critically on dataset size ratios, domain alignment, and architectural isolation of task-specific components, offering practical guidance for multi-task system design. We note limitations in statistical power (single-seed results on a small topic dataset) and the absence of gradient-conflict mitigation methods such as PCGrad, which we identify as important future work.
  \end{abstract}

  \begin{IEEEkeywords}

  \section{Introduction}
  %=============================================================================

+ Multi-task learning (MTL) \cite{caruana1997multitask} trains a single model on multiple related tasks, hypothesizing that shared representations improve generalization. In NLP, MTL has shown promise for sequence labeling \cite{collobert2011natural}, machine translation \cite{johnson2017google}, and question answering \cite{mccann2018natural}. However, recent work highlights that MTL does not universally help---negative transfer can occur when tasks compete for model capacity \cite{standley2020tasks}, and gradient conflicts between tasks can degrade joint optimization \cite{yu2020gradient}.

+ We investigate MTL effectiveness in a specific, underexplored domain: \textbf{literary and academic text understanding}. Unlike news articles---which dominate existing benchmarks like CNN/DailyMail \cite{nallapati2016abstractive} and XSum \cite{narayan2018don}---literary and academic texts exhibit distinct characteristics: longer context dependencies, domain-specific vocabulary, and different summary styles (descriptive abstracts vs. extractive headlines). Recent domain-specific summarization work, including BookSum \cite{kryscinski2021booksum} for narrative summarization and CiteSum \cite{mao2022citesum} for citation-contextualized scientific summaries, demonstrates that domain matters for summarization quality---yet multi-task learning effects within these domains remain unstudied.

  Our study addresses three research questions:

  \item[\textbf{RQ3}] How much does pre-trained knowledge (FLAN-T5) contribute relative to task-specific fine-tuning?
  \end{enumerate}

+ To answer these questions, we construct \textbf{LexiMind}, a multi-task system built on FLAN-T5-base \cite{chung2022scaling} that performs abstractive summarization, topic classification, and emotion detection. We conduct ablations comparing multi-task vs. single-task training, with vs. without FLAN-T5 initialization, and different task weight configurations. Our primary experimental contribution is the empirical characterization of transfer effects across these heterogeneous tasks:

  \begin{itemize}
  \item \textbf{Topic classification benefits most from MTL} (+3.2\% accuracy), leveraging shared encoder representations from the larger summarization dataset.
+ \item \textbf{Summarization is robust to MTL}, showing minimal change despite sharing encoder capacity with classification heads.
+ \item \textbf{Emotion detection suffers negative transfer} ($-$0.02 F1), attributed to domain mismatch between GoEmotions' Reddit source and the formal literary/academic register.
+ \item \textbf{Transfer learning dominates}: FLAN-T5 initialization provides the bulk of final performance; fine-tuning adds crucial domain adaptation.
  \end{itemize}

+ We acknowledge important limitations: our results are from single-seed runs, we do not explore gradient-conflict mitigation methods (PCGrad \cite{yu2020gradient}, CAGrad \cite{liu2021conflict}), and our emotion evaluation conflates domain mismatch with multi-label threshold and architecture choices. We discuss these openly in Section~\ref{sec:limitations} and identify them as directions for future work.
+
  %=============================================================================
  \section{Related Work}
  %=============================================================================

  \subsection{Multi-Task Learning in NLP}

+ Collobert et al. \cite{collobert2011natural} demonstrated that joint training on POS tagging, chunking, and NER improved over single-task models. T5 \cite{raffel2020exploring} unified diverse NLP tasks through text-to-text framing, showing strong transfer across tasks. However, Standley et al. \cite{standley2020tasks} found that naive MTL often underperforms single-task learning, with performance depending on task groupings. More recently, Aghajanyan et al. \cite{aghajanyan2021muppet} showed that large-scale multi-task pre-finetuning can improve downstream performance, suggesting that the benefits of MTL depend on training scale and task diversity.

+ \textbf{Gradient conflict and loss balancing.} Yu et al. \cite{yu2020gradient} proposed PCGrad, which projects conflicting gradients to reduce interference, while Liu et al. \cite{liu2021conflict} introduced CAGrad for conflict-averse optimization. Chen et al. \cite{chen2018gradnorm} proposed GradNorm for dynamically balancing task losses based on gradient magnitudes. Kendall et al. \cite{kendall2018multi} explored uncertainty-based task weighting. Our work uses fixed loss weights---a simpler but less adaptive approach. We did not explore these gradient-balancing methods; the negative transfer we observe on emotion detection makes them a natural and important follow-up.

+ \textbf{Multi-domain multi-task studies.} Aribandi et al. \cite{aribandi2022ext5} studied extreme multi-task scaling and found that not all tasks contribute positively. Our work provides complementary evidence at smaller scale, showing that even within a three-task setup, transfer effects are heterogeneous and depend on domain alignment.

+ \subsection{Literary and Academic Summarization}
+
+ Most summarization benchmarks focus on news \cite{nallapati2016abstractive, narayan2018don}. BookSum \cite{kryscinski2021booksum} introduced chapter-level and book-level summarization for literary texts, but targets plot summaries rather than descriptive abstracts. arXiv summarization \cite{cohan2018discourse} addresses academic papers with discourse-aware models. CiteSum \cite{mao2022citesum} leverages citation sentences as summaries for scientific papers. Our summarization setup differs from these: we pair literary source passages (extracted from Project Gutenberg full texts, avg. 3,030 characters) with Goodreads book descriptions (avg. 572 characters) as targets, training the model to generate \textit{what a book is about} rather than plot recaps. For academic text, arXiv paper body text (avg. 3,967 characters) is paired with abstracts (avg. 1,433 characters). The resulting compression ratios (0.19 for literary, 0.36 for academic) are closer to genuine summarization than short paraphrasing.

  \subsection{Emotion Detection}

+ GoEmotions \cite{demszky2020goemotions} provides 28 fine-grained emotion labels from Reddit comments. The original work reports 0.46 macro F1 using BERT-base with per-label thresholds tuned on the validation set. Subsequent work achieves 0.35--0.46 macro F1 depending on the model and threshold strategy. Importantly, all published GoEmotions baselines use encoder-only architectures (BERT, RoBERTa) rather than encoder-decoder models like T5. Our setup differs in both architecture (encoder-decoder with mean-pooled encoder states) and domain (training encoder primarily on literary/academic summarization), making direct comparison to published baselines informative but not fully controlled.
 
  %=============================================================================
  \section{Experimental Setup}
  %=============================================================================

+ \subsection{Task Formulations}
+ \label{sec:task_formulation}
+
+ We define three tasks with explicit input-output specifications:
+
+ \textbf{Summarization (generative).} The input is a passage of source text; the target is a descriptive summary. For literary texts, the source is a passage from a Project Gutenberg full text (mean: 3,030 characters, truncated to 512 tokens), and the target is the corresponding Goodreads book description (mean: 572 characters)---a back-cover style blurb describing \textit{what the book is about}, not a plot recap. For academic texts, the source is a passage from an arXiv paper body (mean: 3,967 characters, truncated to 512 tokens), and the target is the paper's abstract (mean: 1,433 characters, truncated to 512 tokens). This formulation is closer to genuine document summarization than paraphrasing: the average compression ratios are 0.19 (literary) and 0.36 (academic), comparable to standard summarization benchmarks.
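The stated compression ratios follow directly from the reported average lengths; a quick arithmetic check:

```python
# Average source/target lengths in characters, as reported in the text.
literary = {"source_chars": 3030, "target_chars": 572}
academic = {"source_chars": 3967, "target_chars": 1433}

def compression_ratio(pair: dict) -> float:
    """Target length over source length (lower means more compression)."""
    return pair["target_chars"] / pair["source_chars"]

print(round(compression_ratio(literary), 2))  # 0.19
print(round(compression_ratio(academic), 2))  # 0.36
```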
+ \textbf{Topic classification (discriminative, single-label).} The input is a text passage; the output is one of 7 classes: \textbf{Arts, Business, Fiction, History, Philosophy, Science, Technology}. Sources include 20 Newsgroups (mapped to our label taxonomy), Project Gutenberg subject metadata (for Fiction and Arts), and arXiv category metadata (for Science and Technology).
+
+ \textbf{Emotion detection (discriminative, multi-label).} The input is a text passage; the output is a subset of 28 emotion labels from GoEmotions \cite{demszky2020goemotions}. Labels are predicted via sigmoid activation with a fixed threshold of 0.3 during training evaluation and 0.5 during inference. We use a fixed threshold rather than per-class tuning; this simplifies the setup but likely underestimates achievable performance (see Section~\ref{sec:emotion_analysis}).
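A minimal sketch of the decision rule just described, sigmoid scores against a fixed threshold; the label names and logit values here are illustrative, not model outputs:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_emotions(logits: dict, threshold: float = 0.5) -> set:
    """Return every emotion label whose sigmoid score clears the threshold."""
    return {label for label, z in logits.items() if sigmoid(z) >= threshold}

# Illustrative logits for three of the 28 GoEmotions labels.
logits = {"admiration": 1.2, "joy": -0.4, "grief": -2.0}
print(predict_emotions(logits, threshold=0.5))  # inference: {'admiration'}
print(predict_emotions(logits, threshold=0.3))  # training eval also admits 'joy'
```

Lowering the threshold trades precision for recall, which matters under the label sparsity discussed later.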
+
  \subsection{Datasets}

+ Table \ref{tab:datasets} summarizes dataset statistics.

  \begin{table}[htbp]
  \centering
+ \caption{Dataset Statistics. The summarization corpus combines literary (Goodreads + Gutenberg) and academic (arXiv) sources; the academic subset is substantially larger.}
  \label{tab:datasets}
  \begin{tabular}{llrrr}
  \toprule
  \textbf{Task} & \textbf{Source} & \textbf{Train} & \textbf{Val} & \textbf{Test} \\
  \midrule
+ \multirow{3}{*}{Summarization} & Goodreads + Gutenberg & $\sim$4K & -- & -- \\
+ & arXiv (body $\rightarrow$ abstract) & $\sim$45K & -- & -- \\
+ & \textit{Combined} & 49,086 & 2,727 & 2,727 \\
  \midrule
+ Topic (7 classes) & 20News + Gutenberg + arXiv & 3,402 & 189 & 189 \\
  \midrule
+ Emotion (28 labels) & GoEmotions (Reddit) & 43,410 & 5,426 & 5,427 \\
  \bottomrule
  \end{tabular}
  \end{table}

+ \textbf{Dataset curation.} Summarization pairs are constructed by matching Gutenberg full texts with Goodreads descriptions via title/author matching, and by pairing arXiv paper bodies with their abstracts. Text is truncated to 512 tokens (max encoder input length). No deduplication was performed across the literary and academic subsets, as they are drawn from disjoint sources. We note that the academic subset is substantially larger ($\sim$45K vs. $\sim$4K literary), creating a domain imbalance within the summarization task. Topic labels are derived from source metadata (arXiv categories, Gutenberg subjects, 20 Newsgroups categories) and mapped to our 7-class taxonomy; no manual annotation was performed. GoEmotions is used as-is from the HuggingFace datasets hub.

+ \textbf{Note on dataset sizes.} The large disparity between topic (3.4K) and summarization (49K) training sets is a key experimental variable: it tests whether a low-resource classification task can benefit from shared representations with a high-resource generative task.
 
 
  \subsection{Model Architecture}

+ LexiMind uses FLAN-T5-base (272M parameters) as the backbone, with a custom reimplementation that loads pre-trained weights via a factory module for architectural transparency:
+
  \begin{itemize}
  \item 12-layer encoder, 12-layer decoder
  \item 768-dimensional hidden states, 12 attention heads
+ \item T5-style relative position bias (no absolute positional embeddings)
+ \item Pre-Layer Normalization with RMSNorm \cite{zhang2019root}
+ \item FlashAttention via PyTorch 2.0 SDPA when compatible
  \end{itemize}

+ Task-specific heads branch from the shared encoder:
  \begin{itemize}
+ \item \textbf{Summarization}: Full decoder with language modeling head (cross-entropy loss with label smoothing)
+ \item \textbf{Topic}: Linear classifier on mean-pooled encoder hidden states (cross-entropy loss)
+ \item \textbf{Emotion}: Linear classifier on mean-pooled encoder hidden states with sigmoid activation (binary cross-entropy loss)
  \end{itemize}

+ \textbf{Architectural note.} Using mean-pooled encoder states for classification in an encoder-decoder model is a pragmatic choice for parameter sharing, but may be suboptimal compared to encoder-only architectures (BERT, RoBERTa) where the encoder is fully dedicated to producing classification-ready representations. We discuss this trade-off in Section~\ref{sec:emotion_analysis}.
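A sketch of the mean pooling both classification heads rely on, with padding masked out of the average. This is a NumPy stand-in for the actual PyTorch code; shapes and values are toy examples:

```python
import numpy as np

def masked_mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean-pool encoder states over real tokens only.

    hidden: (batch, seq_len, d_model) encoder outputs
    mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1.0, None)    # guard against all-pad rows
    return summed / counts                           # (batch, d_model)

# Toy batch: one sequence of 4 tokens, d_model=3, last two tokens are padding.
hidden = np.arange(12, dtype=np.float64).reshape(1, 4, 3)
mask = np.array([[1, 1, 0, 0]])
pooled = masked_mean_pool(hidden, mask)   # mean of the first two token vectors
# A linear head would then consume this: logits = pooled @ W + b
```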
+
  \subsection{Training Configuration}

+ All experiments use consistent hyperparameters unless otherwise noted:
+
  \begin{itemize}
+ \item \textbf{Optimizer}: Fused AdamW, lr=$3\times10^{-5}$, weight decay=0.01, $\beta_1$=0.9, $\beta_2$=0.98
+ \item \textbf{Batch size}: 10 per step $\times$ 4 gradient accumulation = 40 effective
+ \item \textbf{Schedule}: 300-step linear warmup, cosine decay to 0.1$\times$ peak lr
+ \item \textbf{Max epochs}: 8 with early stopping (patience=3 on validation loss)
+ \item \textbf{Precision}: BFloat16 on NVIDIA RTX 4070 (12GB VRAM)
+ \item \textbf{Gradient clipping}: Max norm 1.0
+ \item \textbf{Encoder freezing}: Bottom 4 layers frozen for stable transfer learning
  \end{itemize}
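The learning-rate schedule in the list above can be written out explicitly; a sketch in which the total number of optimizer steps (10,000 here) is a placeholder, not a value from the paper:

```python
import math

def lr_at(step: int, peak: float = 3e-5, warmup: int = 300,
          total_steps: int = 10_000, floor_frac: float = 0.1) -> float:
    """Linear warmup to `peak`, then cosine decay down to `floor_frac * peak`."""
    if step < warmup:
        return peak * step / warmup                     # linear ramp from 0
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    floor = floor_frac * peak
    return floor + (peak - floor) * cosine              # peak -> 0.1 * peak

assert lr_at(0) == 0.0
assert math.isclose(lr_at(300), 3e-5)      # warmup complete: peak lr
assert math.isclose(lr_at(10_000), 3e-6)   # end of decay: 0.1 x peak
```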
+ \textbf{Task scheduling.} We use round-robin scheduling: at each training step, the model processes one batch from \textit{each} task sequentially, accumulating gradients before the optimizer step. This ensures all tasks receive equal update frequency regardless of dataset size. We did not explore alternative scheduling strategies; proportional or temperature-based sampling could alter optimization dynamics, particularly for the small topic dataset.
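The round-robin step can be sketched as cycling each task's loader independently, so every optimizer step sees one batch per task; the loaders below are toy stand-ins for real DataLoaders:

```python
from itertools import cycle

def round_robin_steps(loaders: dict, n_steps: int):
    """Yield, per optimizer step, one batch from every task; small datasets
    simply wrap around more often than large ones."""
    iters = {task: cycle(loader) for task, loader in loaders.items()}
    for _ in range(n_steps):
        yield {task: next(it) for task, it in iters.items()}

# Hypothetical per-task loaders of different sizes.
loaders = {"summarization": range(5), "topic": range(2), "emotion": range(4)}
updates = {task: 0 for task in loaders}
for batches in round_robin_steps(loaders, n_steps=6):
    for task in batches:
        updates[task] += 1    # each task contributes to every step
print(updates)  # {'summarization': 6, 'topic': 6, 'emotion': 6}
```

Equal update frequency is exactly why the small topic set repeats often enough to overfit, motivating its reduced loss weight.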
+
+ \textbf{Loss weighting.} Task losses are combined with fixed weights: summarization=1.0, emotion=1.0, topic=0.3. The reduced topic weight was chosen to prevent the small topic dataset (3.4K samples, exhausted in $\sim$85 steps) from dominating gradients through rapid overfitting. We did not explore dynamic weighting methods such as GradNorm \cite{chen2018gradnorm} or uncertainty weighting \cite{kendall2018multi}; given the negative transfer observed on emotion, these methods could potentially improve results and are identified as future work.
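The fixed weighting amounts to a one-line combination of per-task losses; the loss values below are illustrative:

```python
# Fixed task weights as stated above.
TASK_WEIGHTS = {"summarization": 1.0, "emotion": 1.0, "topic": 0.3}

def combined_loss(task_losses: dict) -> float:
    """Fixed-weight sum of per-task losses, forming the joint objective."""
    return sum(TASK_WEIGHTS[task] * loss for task, loss in task_losses.items())

total = combined_loss({"summarization": 2.0, "emotion": 0.5, "topic": 1.0})
print(total)  # 2.8 = 1.0*2.0 + 1.0*0.5 + 0.3*1.0
```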
  \subsection{Baselines and Ablations}

  We compare four configurations:

  \begin{enumerate}
+ \item \textbf{Random/Majority}: Random predictions for classification; for summarization, BERTScore is computed against the reference using a fixed output ``Summary not available'' (producing a baseline that reflects only the BERTScore model's behavior on unrelated text pairs---see Section~\ref{sec:baseline_discussion} for discussion).
+ \item \textbf{FLAN-T5-base (zero-shot)}: Pre-trained model with task-appropriate prompts, no fine-tuning.
+ \item \textbf{Single-Task}: Separate models fine-tuned on each task individually with identical hyperparameters. The single-task summarization model uses only the summarization dataset; topic and emotion models use only their respective datasets.
+ \item \textbf{Multi-Task (LexiMind)}: Joint training on all three tasks with round-robin scheduling.
  \end{enumerate}

+ We additionally ablate FLAN-T5 initialization vs. random initialization to isolate the transfer learning contribution.
 
 
 
 
194
 
195
  \subsection{Evaluation Metrics}
 
  \begin{itemize}
+ \item \textbf{Summarization}: ROUGE-1/2/L \cite{lin2004rouge} (lexical overlap) and BERTScore F1 \cite{zhang2019bertscore} using RoBERTa-large (semantic similarity). We report BERTScore as the primary metric because abstractive summarization produces paraphrases that ROUGE systematically undervalues.
+ \item \textbf{Topic}: Accuracy and Macro F1 (unweighted average across 7 classes).
+ \item \textbf{Emotion}: Sample-averaged F1, computed per sample as the harmonic mean of that sample's precision and recall, then averaged across all samples. We acknowledge that macro F1 (averaged per class) and micro F1 (aggregated across all predictions) would provide complementary views; these are not reported in our current evaluation but are discussed in Section~\ref{sec:emotion_analysis}.
  \end{itemize}
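The distinction between these F1 reductions can be made concrete with a hand-rolled toy example (purely illustrative; scikit-learn's `f1_score` with `average="samples"` or `average="macro"` computes the same quantities):

```python
# Toy multi-label example: 3 samples, 4 labels, gold vs. predicted label sets.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

y_true = [{0, 2}, {1}, {0, 1, 3}]   # gold label sets per sample
y_pred = [{0}, {1, 2}, {0, 1}]      # predicted label sets per sample

# Sample-averaged F1: F1 for each sample, then mean over samples.
sample_f1 = sum(
    f1(len(t & p), len(p - t), len(t - p)) for t, p in zip(y_true, y_pred)
) / len(y_true)

# Macro F1: F1 for each label over all samples, then mean over labels.
macro_f1 = sum(
    f1(
        sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p),
        sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p),
        sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p),
    )
    for c in range(4)
) / 4

print(round(sample_f1, 3), round(macro_f1, 3))
```

Rare labels that are never predicted (labels 2 and 3 above) drag macro F1 down sharply while barely moving sample-averaged F1, which is why the two numbers are not directly comparable.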
 
+ \textbf{Statistical note.} All results are from single training runs. We do not report confidence intervals or variance across seeds. Given the small topic dataset (189 validation samples), the observed +3.2\% accuracy improvement could be within random variance. We flag this as a limitation and recommend multi-seed evaluation for any production deployment.
 
  %=============================================================================
  \section{Results}
 
  \begin{table}[htbp]
  \centering
+ \caption{Main Results: Multi-Task vs. Single-Task Performance. All results are single-seed. Bold indicates better performance between the two configurations.}
  \label{tab:main_results}
  \begin{tabular}{llcc}
  \toprule
 
  \multirow{2}{*}{Topic} & Accuracy & 82.0\% & \textbf{85.2\%} \\
  & Macro F1 & 0.812 & \textbf{0.847} \\
  \midrule
+ Emotion & Sample-avg F1 & \textbf{0.218} & 0.199 \\
  \bottomrule
  \end{tabular}
  \end{table}
 
  \textbf{Key finding}: MTL provides heterogeneous effects across tasks:
 
  \begin{itemize}
+ \item \textbf{Topic classification gains +3.2\% accuracy} from MTL. The small topic dataset (3.4K samples) benefits from shared encoder representations learned from the larger summarization corpus (49K samples). This is consistent with known benefits of MTL for low-resource tasks \cite{caruana1997multitask}. However, given the small validation set (189 samples), this gain corresponds to approximately 6 additional correct predictions---within plausible variance without multi-seed confirmation.
 
+ \item \textbf{Summarization shows modest improvement} (+0.009 BERTScore F1). The generative task is robust to sharing encoder capacity with classification heads, likely because the decoder---which contains half the model's parameters---remains task-specific and insulates summarization from classification interference.
 
+ \item \textbf{Emotion detection degrades by $-$0.019 F1}. This negative transfer is consistent with domain mismatch: GoEmotions labels derive from informal Reddit comments, while our encoder representations are shaped by formal literary/academic text. However, the effect is confounded with other factors (Section~\ref{sec:emotion_analysis}).
  \end{itemize}
 
  \subsection{Baseline Comparisons}
+ \label{sec:baseline_discussion}
 
  Table \ref{tab:baselines} contextualizes our results against trivial and zero-shot baselines.
 
 
  \end{tabular}
  \end{table}
 
+ \textbf{On the random baseline BERTScore (0.412).} BERTScore computes cosine similarity between contextual embeddings from RoBERTa-large. Even unrelated text pairs produce non-zero similarity because (a) common function words and subword tokens share embedding space, and (b) RoBERTa's embeddings have a non-zero mean that inflates cosine similarity. The 0.412 baseline reflects this ``floor'' effect rather than any meaningful semantic overlap. This is consistent with Zhang et al.'s \cite{zhang2019bertscore} observation that BERTScore baselines vary by language and domain.
+
+ Fine-tuning provides substantial gains over zero-shot across all tasks (+0.106 BERTScore, +27\% topic accuracy, +0.11 emotion F1), demonstrating the importance of domain adaptation even with instruction-tuned models.
 
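The non-zero-mean effect can be illustrated without BERTScore itself. The following toy (random vectors, not RoBERTa embeddings) shows that a shared component offset alone creates a positive cosine-similarity floor between otherwise unrelated vectors:

```python
import random

random.seed(0)
DIM = 256

def embed(mean):
    """Random 'embedding' whose components share a common offset `mean`."""
    return [mean + random.gauss(0, 1) for _ in range(DIM)]

def cosine(a, b):
    def norm(v):
        return sum(x * x for x in v) ** 0.5
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

centered = cosine(embed(0.0), embed(0.0))  # zero-mean vectors: near 0
shifted = cosine(embed(1.0), embed(1.0))   # shared offset: well above 0
print(f"centered={centered:.2f} shifted={shifted:.2f}")
```

For independent vectors with per-component mean $m$ and unit variance, expected cosine similarity approaches $m^2/(m^2+1)$ (about 0.5 here); unrelated contextual embeddings behave analogously, which is the floor effect described above.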
  \subsection{Ablation: Transfer Learning Contribution}
 
+ Table \ref{tab:transfer_ablation} isolates the contribution of FLAN-T5 pre-training by comparing against random initialization with identical architecture and training.
 
  \begin{table}[htbp]
  \centering
+ \caption{Effect of Pre-trained Initialization (Multi-Task Setting)}
  \label{tab:transfer_ablation}
  \begin{tabular}{lccc}
  \toprule
 
  Random & 0.523 & 45.2\% & 0.082 \\
  FLAN-T5-base & \textbf{0.830} & \textbf{85.2\%} & \textbf{0.199} \\
  \midrule
+ \textit{Absolute gain} & +0.307 & +40.0\% & +0.117 \\
  \bottomrule
  \end{tabular}
  \end{table}
 
+ FLAN-T5 initialization provides large absolute gains across all tasks. We initially characterized this as ``85\% of final performance,'' but this framing oversimplifies heterogeneous metrics: BERTScore, accuracy, and F1 have different scales and baselines, making percentage attribution across them misleading. A more precise characterization: \textbf{pre-training is necessary for competitive performance}---random initialization produces substantially worse results on all tasks even with identical data and training budget. Fine-tuning provides the remaining domain adaptation that zero-shot pre-training alone cannot achieve.
 
+ \subsection{Per-Class Topic Analysis}
 
+ Table \ref{tab:topic_breakdown} reveals per-class patterns in topic classification across the 7 classes.
 
  \begin{table}[htbp]
  \centering
+ \caption{Per-Class Topic Classification (Multi-Task, 7 Classes: Arts, Business, Fiction, History, Philosophy, Science, Technology)}
  \label{tab:topic_breakdown}
  \begin{tabular}{lccc}
  \toprule
 
  \end{tabular}
  \end{table}
 
+ Fiction and Business achieve near-perfect classification (F1 $\geq$ 0.97), while Science shows the most confusion (F1 = 0.65). Error analysis reveals Science samples are frequently misclassified as Technology---semantically plausible given that scientific research papers often describe technical methods. The Arts class (which covers visual arts, music, drama, and poetry from Gutenberg subject metadata) shows lower recall (0.76), suggesting some arts-related texts are misclassified into adjacent categories.
 
  \subsection{Analysis: Why Does Emotion Detection Underperform?}
+ \label{sec:emotion_analysis}
 
+ Our emotion sample-averaged F1 (0.20) is substantially lower than reported GoEmotions baselines (0.46 macro F1 with BERT-base \cite{demszky2020goemotions}). We identify four contributing factors, acknowledging that our experimental design does not fully disentangle them:
 
  \begin{enumerate}
+ \item \textbf{Domain shift}: GoEmotions labels were annotated on Reddit comments in conversational register. Our encoder is shaped by literary and academic text through the summarization objective, producing representations optimized for formal text. This domain mismatch is likely the largest factor, but we cannot isolate it without a controlled experiment (e.g., fine-tuning BERT on GoEmotions with our frozen encoder vs. BERT's own encoder).
 
+ \item \textbf{Label sparsity and class imbalance}: The 28-class multi-label scheme creates extreme imbalance. Rare emotions (grief, remorse, nervousness) appear in $<$2\% of samples. We use a fixed prediction threshold of 0.3 during training evaluation, without per-class threshold tuning on the validation set---a simplification relative to the original GoEmotions work \cite{demszky2020goemotions}, which tunes thresholds explicitly. Per-class threshold tuning could meaningfully improve results.
 
+ \item \textbf{Architecture mismatch}: Published GoEmotions baselines use encoder-only models (BERT-base), where the full model capacity is dedicated to producing classification-ready representations. Our encoder-decoder architecture optimizes the encoder primarily for producing representations that the decoder can use for summarization---classification heads receive these representations secondarily. The mean-pooling strategy may also be suboptimal; alternatives such as [CLS] token pooling, attention-weighted pooling, or adapter layers \cite{houlsby2019parameter} could yield better classification features.
+
+ \item \textbf{Metric reporting}: We report sample-averaged F1 (per-sample, then averaged), which is not directly comparable to macro F1 (per-class, then averaged) as reported in the original GoEmotions work. Reporting macro F1, micro F1, and per-label performance would provide a more complete picture. We identify this as a gap in our current evaluation.
  \end{enumerate}
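The per-class threshold tuning mentioned in factor 2 can be sketched as follows (`tune_thresholds` is a hypothetical helper, not part of our pipeline; our reported numbers use the fixed 0.3 threshold):

```python
# Sketch: choose, per emotion class, the decision threshold that maximizes
# F1 on validation probabilities. Hypothetical helper for illustration.

def f1_at(probs, gold, thr):
    tp = sum(1 for p, g in zip(probs, gold) if p >= thr and g)
    fp = sum(1 for p, g in zip(probs, gold) if p >= thr and not g)
    fn = sum(1 for p, g in zip(probs, gold) if p < thr and g)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def tune_thresholds(val_probs, val_gold, grid=None):
    """Return one threshold per class, maximizing per-class validation F1."""
    grid = grid or [i / 20 for i in range(1, 20)]   # 0.05 ... 0.95
    n_classes = len(val_probs[0])
    best = []
    for c in range(n_classes):
        col_p = [row[c] for row in val_probs]
        col_g = [row[c] for row in val_gold]
        best.append(max(grid, key=lambda t: f1_at(col_p, col_g, t)))
    return best

# Toy validation set: 4 samples, 2 classes
probs = [[0.9, 0.2], [0.6, 0.7], [0.2, 0.6], [0.1, 0.1]]
gold  = [[1, 0],     [1, 1],     [0, 1],     [0, 0]]
print(tune_thresholds(probs, gold))  # smallest grid value maximizing each class's F1
```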
 
+ \textbf{Implication}: Off-the-shelf emotion datasets from social media should not be naively combined with literary/academic tasks in MTL. Domain-specific emotion annotation or domain adaptation techniques are needed for formal text domains.
 
  \subsection{Training Dynamics}
 
+ Figure \ref{fig:training_curves} shows training progression over 7 epochs (approximately 6 hours on RTX 4070).
 
  \begin{figure}[htbp]
  \centering
  \includegraphics[width=\columnwidth]{figures/training_loss_curve.png}
+ \caption{Training and validation loss. Best checkpoint at epoch 4; validation loss plateaus from epochs 5--7, triggering early stopping at epoch 7 (patience=3).}
  \label{fig:training_curves}
  \end{figure}
 
  Key observations:
  \begin{itemize}
+ \item Topic classification converges by epoch 3 (99\% training accuracy), consistent with the small dataset (3.4K) being memorized quickly. The reduced task weight (0.3) prevents topic gradients from dominating updates.
+ \item Summarization loss decreases monotonically through epoch 4, then plateaus (best validation summarization loss: 3.698 at epoch 4).
+ \item The train-validation gap widens after epoch 4, primarily driven by topic overfitting on the small dataset. The best checkpoint (epoch 4) balances generalization across all tasks.
  \end{itemize}
 
  %=============================================================================
 
 
  \subsection{When Does MTL Help?}
 
+ Our results support nuanced, task-dependent guidance:
+
+ \textbf{MTL helps when}: A small-dataset task (topic: 3.4K samples) shares domain with a large-dataset task (summarization: 49K literary/academic samples). The topic classifier effectively receives ``free'' pre-training on in-domain text through the shared encoder, benefiting from representations tuned to literary and academic vocabulary and structure.
+
+ \textbf{MTL hurts when}: An auxiliary task's domain is misaligned with the primary training signal. Emotion detection, trained on Reddit comments, does not benefit from encoder representations shaped by formal literary/academic summarization. The round-robin scheduling ensures emotion batches receive equal update frequency, but the encoder's representations are skewed toward the summarization domain by gradient magnitude (summarization loss is substantially larger than classification losses).
 
+ \textbf{MTL is neutral when}: The primary task (summarization) has sufficient data and a task-specific component (decoder, $\sim$136M parameters) that insulates it from interference. Classification heads are small (single linear layers) and their gradients have limited impact on the shared encoder relative to the decoder's backpropagation signal.
 
+ \subsection{Comparison to MTL Literature}
 
+ Our findings align qualitatively with several key results in the MTL literature. Standley et al. \cite{standley2020tasks} showed that task groupings critically affect MTL outcomes---we observe this in the contrast between topic (positive transfer) and emotion (negative transfer). Yu et al. \cite{yu2020gradient} demonstrated that gradient conflicts between tasks explain negative transfer; our round-robin scheduling with fixed weights does not address such conflicts, and methods like PCGrad could potentially mitigate the emotion degradation by projecting away conflicting gradient components. Aribandi et al. \cite{aribandi2022ext5} found diminishing or negative returns from adding more tasks in extreme multi-task settings; our small-scale results are consistent with this pattern.
+
+ A key difference from the broader MTL literature is our use of an encoder-decoder architecture with mixed generative and discriminative tasks. Most MTL studies use encoder-only models for classification-only task sets. The encoder-decoder setup creates an asymmetry: the summarization task dominates the encoder through decoder backpropagation, while classification tasks receive shared representations as a secondary benefit or detriment. This architectural dynamic deserves further study.
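The gradient-surgery mechanism referenced above can be shown in miniature (a toy, list-based version of the PCGrad projection \cite{yu2020gradient}; real implementations operate on flattened per-task parameter gradients):

```python
# Minimal PCGrad-style projection: when two task gradients conflict
# (negative dot product), remove from one its component along the other
# before summing. Toy standalone helper, not part of our training code.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_conflict(g_i, g_j):
    """Return g_i with its conflicting component along g_j removed."""
    d = dot(g_i, g_j)
    if d >= 0:                      # no conflict: leave g_i unchanged
        return list(g_i)
    scale = d / dot(g_j, g_j)       # projection coefficient onto g_j
    return [gi - scale * gj for gi, gj in zip(g_i, g_j)]

g_sum = [1.0, 0.0]     # e.g. a summarization gradient
g_emo = [-1.0, 1.0]    # e.g. an emotion gradient, conflicting on dim 0
adjusted = project_conflict(g_emo, g_sum)
print(adjusted)        # conflicting component removed -> [0.0, 1.0]
```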
 
  \subsection{Implications for Practitioners}
 
+ Based on our findings:
 
  \begin{enumerate}
+ \item \textbf{Audit domain alignment} before combining tasks in MTL. If auxiliary tasks draw from different text domains (e.g., social media vs. academic), negative transfer is likely unless mitigated by gradient-conflict methods or per-task adapters.
+
+ \item \textbf{Task weighting matters} for preventing small-dataset overfitting. Our reduced weight (0.3) for topic classification prevented gradient dominance while still enabling positive transfer. Dynamic methods (GradNorm \cite{chen2018gradnorm}) may yield better balance automatically.
 
+ \item \textbf{Architectural isolation protects high-priority tasks}. Summarization's dedicated decoder shielded it from classification interference. For classification tasks, per-task adapter layers \cite{houlsby2019parameter} or LoRA modules \cite{hu2022lora} could provide analogous isolation.
 
+ \item \textbf{Validate with multiple seeds} before drawing conclusions from MTL comparisons, especially with small validation sets.
  \end{enumerate}
 
  \subsection{Limitations}
+ \label{sec:limitations}
+
+ We identify several limitations that constrain the generalizability of our findings:
 
  \begin{itemize}
+ \item \textbf{Single-seed results}: All experiments are single runs. The +3.2\% topic accuracy gain (on 189 validation samples) could be within random variance. Multi-seed evaluation with confidence intervals is needed to confirm the direction and magnitude of transfer effects.
+
+ \item \textbf{No gradient-conflict mitigation}: We use fixed loss weights and do not explore PCGrad \cite{yu2020gradient}, CAGrad \cite{liu2021conflict}, GradNorm \cite{chen2018gradnorm}, or uncertainty weighting \cite{kendall2018multi}. These methods are directly relevant to our observed negative transfer on emotion detection and could potentially convert it to positive or neutral transfer.
+
+ \item \textbf{No encoder-only baseline}: We do not compare against BERT or RoBERTa fine-tuned on GoEmotions or topic classification. Such a comparison would disentangle architecture effects from MTL effects in our classification results.
 
+ \item \textbf{Emotion evaluation gaps}: We report sample-averaged F1 with a fixed threshold (0.3). Per-class thresholds tuned on validation, per-label metrics, focal loss for class imbalance \cite{lin2017focal}, and calibration analysis would provide more informative evaluation. The conclusion that ``domain mismatch is the primary cause'' of low emotion F1 is plausible but confounded by these design choices.
 
+ \item \textbf{No human evaluation}: ROUGE and BERTScore are imperfect proxies for summary quality, especially for creative/literary text where stylistic quality matters beyond semantic accuracy.
+
+ \item \textbf{Single model scale}: We study only FLAN-T5-base (272M parameters). Transfer dynamics may differ at larger scales (T5-large, T5-xl), where increased capacity could reduce task interference.
+
+ \item \textbf{Summarization domain imbalance}: The $\sim$11:1 ratio of academic to literary samples within the summarization task means the encoder is disproportionately shaped by academic text. This imbalance is not analyzed separately but could affect literary summarization quality.
  \end{itemize}
 
  \subsection{Future Work}
 
  \begin{itemize}
+ \item \textbf{Gradient-conflict mitigation}: Applying PCGrad or CAGrad to test whether emotion negative transfer can be reduced or eliminated. This is the most directly actionable follow-up given our current findings.
+
+ \item \textbf{Parameter-efficient multi-tasking}: Using per-task LoRA adapters \cite{hu2022lora} or adapter layers \cite{houlsby2019parameter} to provide task-specific specialization while maintaining shared encoder representations. This could reduce interference between tasks with misaligned domains.
 
+ \item \textbf{Encoder-only comparison}: Fine-tuning BERT/RoBERTa on topic and emotion classification, with and without multi-task training, to disentangle encoder-decoder architecture effects from MTL effects.
 
+ \item \textbf{Multi-seed evaluation}: Running at least 3--5 seeds per configuration to establish statistical significance of observed transfer effects.
+
+ \item \textbf{Domain-specific emotion annotation}: Collecting emotion annotations on literary and academic text to study whether in-domain emotion data eliminates the negative transfer.
+
+ \item \textbf{Improved emotion evaluation}: Per-class threshold tuning, macro/micro F1, class-level analysis, and focal loss to address class imbalance.
  \end{itemize}
 
  %=============================================================================
  \section{Conclusion}
  %=============================================================================
 
+ We investigated multi-task learning for literary and academic text understanding, combining abstractive summarization, topic classification, and multi-label emotion detection in an encoder-decoder architecture. Our ablation studies reveal heterogeneous transfer effects: topic classification benefits from shared representations with the larger summarization corpus (+3.2\% accuracy), while emotion detection suffers negative transfer ($-$0.02 F1) due to domain mismatch with Reddit-sourced labels. Summarization quality is robust to multi-task training, insulated by its task-specific decoder.
 
+ Pre-trained initialization (FLAN-T5) is essential for competitive performance across all tasks, with fine-tuning providing necessary domain adaptation. These findings are consistent with the broader MTL literature on the importance of task compatibility and domain alignment. However, we emphasize the limitations of our single-seed evaluation design and the absence of gradient-conflict mitigation methods, which could alter the negative transfer findings. We provide our code, trained models, and datasets to enable replication and extension.
 
+ Code and models: \url{https://github.com/OliverPerrin/LexiMind}\\
+ Live demo: \url{https://huggingface.co/spaces/OliverPerrin/LexiMind}
 
  %=============================================================================
  % References
 
  R. Caruana, ``Multitask learning,'' \textit{Machine Learning}, vol. 28, no. 1, pp. 41--75, 1997.
 
  \bibitem{collobert2011natural}
+ R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, ``Natural language processing (almost) from scratch,'' \textit{JMLR}, vol. 12, pp. 2493--2537, 2011.
 
  \bibitem{johnson2017google}
+ M. Johnson et al., ``Google's multilingual neural machine translation system: Enabling zero-shot translation,'' \textit{TACL}, vol. 5, pp. 339--351, 2017.
 
  \bibitem{mccann2018natural}
+ B. McCann, N. S. Keskar, C. Xiong, and R. Socher, ``The natural language decathlon: Multitask learning as question answering,'' \textit{arXiv:1806.08730}, 2018.
 
  \bibitem{standley2020tasks}
  T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, and S. Savarese, ``Which tasks should be learned together in multi-task learning?'' in \textit{ICML}, 2020.
 
+ \bibitem{yu2020gradient}
+ T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, ``Gradient surgery for multi-task learning,'' in \textit{NeurIPS}, 2020.
+
+ \bibitem{liu2021conflict}
+ B. Liu, X. Liu, X. Jin, P. Stone, and Q. Liu, ``Conflict-averse gradient descent for multi-task learning,'' in \textit{NeurIPS}, 2021.
+
+ \bibitem{chen2018gradnorm}
+ Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, ``GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks,'' in \textit{ICML}, 2018.
+
+ \bibitem{kendall2018multi}
+ A. Kendall, Y. Gal, and R. Cipolla, ``Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,'' in \textit{CVPR}, 2018.
+
+ \bibitem{aghajanyan2021muppet}
+ A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen, L. Zettlemoyer, and S. Gupta, ``Muppet: Massive multi-task representations with pre-finetuning,'' in \textit{EMNLP}, 2021.
+
+ \bibitem{aribandi2022ext5}
+ V. Aribandi et al., ``ExT5: Towards extreme multi-task scaling for transfer learning,'' in \textit{ICLR}, 2022.
+
  \bibitem{raffel2020exploring}
  C. Raffel et al., ``Exploring the limits of transfer learning with a unified text-to-text transformer,'' \textit{JMLR}, vol. 21, no. 140, pp. 1--67, 2020.
 
  \bibitem{chung2022scaling}
+ H. W. Chung et al., ``Scaling instruction-finetuned language models,'' \textit{arXiv:2210.11416}, 2022.
 
  \bibitem{nallapati2016abstractive}
  R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang, ``Abstractive text summarization using sequence-to-sequence RNNs and beyond,'' in \textit{CoNLL}, 2016.
 
  W. Kryscinski, N. Rajani, D. Agarwal, and C. Xiong, ``BookSum: A collection of datasets for long-form narrative summarization,'' in \textit{Findings of EMNLP}, 2021.
 
  \bibitem{cohan2018discourse}
+ A. Cohan et al., ``A discourse-aware attention model for abstractive summarization of long documents,'' in \textit{NAACL-HLT}, 2018.
+
+ \bibitem{mao2022citesum}
+ Y. Mao, M. Zhong, and J. Han, ``CiteSum: Citation text-guided scientific extreme summarization and domain adaptation with limited supervision,'' in \textit{EMNLP}, 2022.
 
  \bibitem{demszky2020goemotions}
  D. Demszky et al., ``GoEmotions: A dataset of fine-grained emotions,'' in \textit{ACL}, 2020.
 
+ \bibitem{zhang2019root}
+ B. Zhang and R. Sennrich, ``Root mean square layer normalization,'' in \textit{NeurIPS}, 2019.
 
  \bibitem{lin2004rouge}
  C.-Y. Lin, ``ROUGE: A package for automatic evaluation of summaries,'' in \textit{Text Summarization Branches Out}, 2004.
 
  \bibitem{hu2022lora}
  E. J. Hu et al., ``LoRA: Low-rank adaptation of large language models,'' in \textit{ICLR}, 2022.
 
+ \bibitem{houlsby2019parameter}
+ N. Houlsby et al., ``Parameter-efficient transfer learning for NLP,'' in \textit{ICML}, 2019.
+
+ \bibitem{lin2017focal}
+ T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Doll\'{a}r, ``Focal loss for dense object detection,'' in \textit{ICCV}, 2017.
+
  \end{thebibliography}
 
  \end{document}
scripts/demo_gradio.py CHANGED
@@ -468,12 +468,12 @@ since descriptions paraphrase rather than quote the source text.*
 
  ### Training Data
 
- | Dataset | Task | Description |
- |---------|------|-------------|
- | Goodreads (711k+ blurbs) | Book Descriptions | Back-cover style descriptions matched with Gutenberg texts |
- | arXiv | Paper Abstracts | Scientific paper summarization |
- | 20 Newsgroups + Gutenberg | Topic Classification | Multi-domain topic categorization |
- | GoEmotions | Emotion Detection | 28-class multi-label emotion classification |
+ | Dataset | Task | Samples |
+ |---------|------|---------|
+ | Gutenberg + Goodreads | Book Descriptions | ~4K literary pairs |
+ | arXiv (body → abstract) | Paper Abstracts | ~45K academic pairs |
+ | 20 Newsgroups + Gutenberg + arXiv | Topic Classification | 3.4K (7 classes) |
+ | GoEmotions (Reddit) | Emotion Detection | 43K (28 labels) |
 
  ### Key Design Decision
 
@@ -483,9 +483,9 @@ since descriptions paraphrase rather than quote the source text.*
 
  ### Evaluation Metrics
 
- - **ROUGE-1/2/L**: Lexical overlap (expected range: 0.15-0.25 for descriptions)
+ - **ROUGE-1/2/L**: Lexical overlap with reference summaries
  - **BLEU-4**: N-gram precision
- - **BERTScore**: Semantic similarity using contextual embeddings (key metric for paraphrasing)
+ - **BERTScore**: Semantic similarity using contextual embeddings (primary metric for abstractive summarization)
 
  ### Links
 
scripts/train.py CHANGED
@@ -5,7 +5,7 @@ Training script for LexiMind.
 
  Simple, clean training with multi-task learning across:
  - Summarization (BookSum + arXiv papers)
  - Emotion classification (GoEmotions, 28 labels)
- - Topic classification (Books + Papers, 8 labels: Fiction, Science, Technology, etc.)
+ - Topic classification (Books + Papers, 7 labels: Arts, Business, Fiction, History, Philosophy, Science, Technology)
 
  Usage:
  python scripts/train.py training=medium