OliverPerrin committed on Feb 27

Commit df3ebbd

1 Parent(s): c472a19

updated readme, ruff formatted all files

Files changed (19)

README.md +88 -114
scripts/build_discovery_dataset.py +72 -67
scripts/demo_gradio.py +63 -58
scripts/download_data.py +542 -326
scripts/evaluate.py +103 -83
scripts/profile_training.py +44 -17
scripts/train.py +122 -76
scripts/train_multiseed.py +42 -23
scripts/visualize_training.py +406 -193
src/data/dataset.py +19 -13
src/models/decoder.py +10 -3
src/models/encoder.py +11 -3
src/models/factory.py +12 -6
src/models/heads.py +4 -2
src/training/metrics.py +57 -59
src/training/trainer.py +73 -50
src/utils/__init__.py +12 -5
src/utils/core.py +9 -7
tests/test_training/test_trainer.py +10 -10

README.md CHANGED

@@ -11,56 +11,76 @@ pinned: false
 <!-- markdownlint-disable MD025 -->
 # LexiMind
-A multi-task NLP system for literary and academic text understanding. LexiMind performs **abstractive summarization**, **topic classification**, and **emotion detection** using a single encoder-decoder transformer initialized from [FLAN-T5-base](https://huggingface.co/google/flan-t5-base) (272M parameters).
 **[Live Demo](https://huggingface.co/spaces/OliverPerrin/LexiMind)** · **[Model](https://huggingface.co/OliverPerrin/LexiMind-Model)** · **[Discovery Dataset](https://huggingface.co/datasets/OliverPerrin/LexiMind-Discovery)** · **[Research Paper](docs/research_paper.tex)**
-## What It Does
-| Task | Description | Metric |
-| ------ | ------------- | -------- |
-| **Summarization** | Generates back-cover style book descriptions and paper abstracts from source text | BERTScore F1: **0.830** |
-| **Topic Classification** | Classifies passages into 7 categories | Accuracy: **85.2%** |
-| **Emotion Detection** | Identifies emotions from 28 fine-grained labels (multi-label) | Sample-avg F1: **0.199** |
-**Topic labels:** Arts · Business · Fiction · History · Philosophy · Science · Technology
-The model is trained on literary text (Project Gutenberg + Goodreads descriptions), academic papers (arXiv), and emotion-annotated Reddit comments (GoEmotions). For summarization, it learns to produce descriptive summaries—what a book *is about*—rather than plot recaps, by pairing Gutenberg full texts with Goodreads descriptions and arXiv bodies with their abstracts.
 ## Architecture
-LexiMind is a **custom Transformer implementation** that loads pre-trained weights from FLAN-T5-base via a factory module. The architecture is reimplemented from scratch for transparency, not wrapped from HuggingFace.
 | Component | Detail |
-| ----------- | -------- |
 | Backbone | Encoder-Decoder Transformer (272M params) |
-| Encoder / Decoder | 12 layers each |
-| Hidden Dim | 768, 12 attention heads |
-| Position Encoding | T5-style relative position bias |
-| Normalization | RMSNorm (Pre-LN) |
-| Attention | FlashAttention via PyTorch 2.0 SDPA |
-| Summarization Head | Full decoder with language modeling head |
-| Classification Heads | Linear layers on mean-pooled encoder states |
 ### Multi-Task Training
-All three tasks share the encoder. Summarization uses the full encoder-decoder; topic and emotion classification branch off the encoder with lightweight linear heads. Training uses round-robin scheduling (one batch per task per step), fixed loss weights (summarization=1.0, emotion=1.0, topic=0.3), and early stopping.
 ## Training Data
-| Task | Source | Train Samples |
-| ------ | -------- | --------------- |
-| Summarization | Gutenberg + Goodreads (literary) | ~4K |
 | Summarization | arXiv body → abstract (academic) | ~45K |
-| Topic | 20 Newsgroups + Gutenberg + arXiv metadata | 3,402 |
-| Emotion | GoEmotions (Reddit comments, 28 labels) | 43,410 |
 ## Getting Started
 ### Prerequisites
 - Python 3.10+
-- [Poetry](https://python-poetry.org/) for dependency management
 - NVIDIA GPU with CUDA (for training; CPU works for inference)
 ### Installation
@@ -68,59 +88,50 @@ All three tasks share the encoder. Summarization uses the full encoder-decoder;
 ```bash
 git clone https://github.com/OliverPerrin/LexiMind.git
 cd LexiMind
-poetry install
 ```
-### Download Data
-```bash
-poetry run python scripts/download_data.py
-```
-Downloads Goodreads descriptions, arXiv papers, GoEmotions, 20 Newsgroups, and Gutenberg texts.
 ### Training
 ```bash
-# Full training (~45-60 min on RTX 4070 12GB)
-poetry run python scripts/train.py training=full
-# Quick dev run (~10-15 min)
-poetry run python scripts/train.py training=dev
-# Medium run (~30-45 min)
-poetry run python scripts/train.py training=medium
 # Override parameters
-poetry run python scripts/train.py training.optimizer.lr=5e-5
 # Resume from checkpoint
-poetry run python scripts/train.py training=full resume_from=checkpoints/epoch_5.pt
 ```
-Training uses BFloat16 mixed precision, gradient checkpointing, `torch.compile`, and cosine LR decay with warmup. Experiments are tracked with MLflow (`mlflow ui` to browse).
 ### Evaluation
 ```bash
-# Full evaluation (ROUGE, BERTScore, topic accuracy, emotion F1)
-poetry run python scripts/evaluate.py
-# Skip BERTScore for faster runs
-poetry run python scripts/evaluate.py --skip-bertscore
-# Single task
-poetry run python scripts/evaluate.py --summarization-only
 ```
 ### Inference
 ```bash
 # Command-line
-poetry run python scripts/inference.py "Your text to analyze"
 # Gradio web demo
-poetry run python scripts/demo_gradio.py
 ```
 ### Docker
@@ -133,76 +144,39 @@ docker run -p 7860:7860 leximind
 ## Project Structure
 ```text
-configs/
-├── config.yaml              # Main Hydra config
-├── data/datasets.yaml       # Dataset paths and tokenizer settings
-├── model/                   # Architecture configs (base, small, large)
-└── training/                # Training configs (dev, medium, full)
 src/
-├── models/
-│   ├── encoder.py           # Transformer Encoder with Pre-LN RMSNorm
-│   ├── decoder.py           # Transformer Decoder with KV-cache
-│   ├── attention.py         # Multi-Head Attention + T5 relative position bias
-│   ├── feedforward.py       # Gated feed-forward network
-│   ├── positional_encoding.py  # Sinusoidal & learned position encodings
-│   ├── t5_layer_norm.py     # T5-style RMSNorm
-│   ├── heads.py             # Task-specific classification heads
-│   ├── multitask.py         # Multi-task model combining all components
-│   └── factory.py           # Model builder with FLAN-T5 weight loading
-├── data/
-│   ├── dataset.py           # Dataset classes for all tasks
-│   ├── dataloader.py        # Multi-task dataloader with round-robin sampling
-│   └── tokenization.py      # Tokenizer wrapper
-├── training/
-│   ├── trainer.py           # Training loop with AMP, grad accumulation, early stopping
-│   ├── metrics.py           # ROUGE, BERTScore, F1, accuracy computation
-│   └── utils.py             # Checkpointing, logging utilities
-├── inference/
-│   ├── pipeline.py          # End-to-end inference pipeline
-│   └── factory.py           # Model loading for inference
-├── api/                     # FastAPI REST endpoint
-└── utils/                   # Shared utilities
 scripts/
-├── train.py                 # Training entry point
-├── evaluate.py              # Evaluation with all metrics
-├── inference.py             # CLI inference
-├── demo_gradio.py           # Gradio web UI
-├── download_data.py         # Dataset downloader
-├── export_model.py          # Model export utilities
-├── export_tokenizer.py      # Tokenizer export
-├── preprocess_data.py       # Data preprocessing
-├── process_books.py         # Gutenberg text processing
-├── eval_rouge.py            # ROUGE-only evaluation
-└── visualize_training.py    # Training curve plotting
-tests/                       # Pytest suite (data, models, training, inference, utils)
-docs/                        # Research paper and architecture notes
-artifacts/                   # Tokenizer files and label definitions
-checkpoints/                 # Saved model checkpoints
 ```
 ## Code Quality
 ```bash
-poetry run ruff check .     # Linting
-poetry run mypy .           # Type checking
-poetry run pytest           # Test suite
-poetry run pre-commit run --all-files  # All checks
 ```
-## Key Results
-From the research paper ([docs/research_paper.tex](docs/research_paper.tex)):
-- **Multi-task learning helps topic classification** (+3.2% accuracy over single-task) because the small topic dataset (3.4K) benefits from shared encoder representations trained on the larger summarization corpus (49K).
-- **Summarization is robust to MTL**—quality stays comparable whether trained alone or jointly.
-- **Emotion detection shows slight negative transfer** (−0.02 F1), likely due to domain mismatch between Reddit-sourced emotion labels and literary/academic text.
-- **FLAN-T5 pre-training is essential**—random initialization produces dramatically worse results on all tasks.
-See the paper for full ablations, per-class breakdowns, and discussion of limitations.
 ## License
 GPL-3.0 — see [LICENSE](LICENSE) for details.

 <!-- markdownlint-disable MD025 -->
 # LexiMind
+A multi-task NLP system for literary and academic text understanding. LexiMind jointly performs **abstractive summarization**, **topic classification**, and **multi-label emotion detection** using a single encoder-decoder transformer initialized from [FLAN-T5-base](https://huggingface.co/google/flan-t5-base) (272M parameters).
 **[Live Demo](https://huggingface.co/spaces/OliverPerrin/LexiMind)** · **[Model](https://huggingface.co/OliverPerrin/LexiMind-Model)** · **[Discovery Dataset](https://huggingface.co/datasets/OliverPerrin/LexiMind-Discovery)** · **[Research Paper](docs/research_paper.tex)**
+## Results
+| Task | Metric | Score |
+| ---- | ------ | ----- |
+| Summarization | ROUGE-1 / ROUGE-L | 0.309 / 0.185 |
+| Summarization (academic) | ROUGE-1 | 0.319 |
+| Summarization (literary) | ROUGE-1 | 0.206 |
+| Topic Classification | Accuracy (95% CI) | 85.7% (80.4–91.0%) |
+| Emotion Detection | Sample-avg F1 | 0.352 |
+| Emotion Detection (tuned thresholds) | Sample-avg F1 / Macro F1 | 0.503 / 0.294 |
+Trained for 8 epochs on an RTX 4070 12GB (~9 hours) with BFloat16 mixed precision, `torch.compile`, and cosine LR decay.
+## Key Findings
+From the [research paper](docs/research_paper.tex):
+- **Naive MTL produces mixed results**: topic classification benefits (+3.7% accuracy), but emotion detection suffers negative transfer (−0.02 F1) under mean pooling with round-robin scheduling.
+- **Learned attention pooling + temperature sampling eliminates negative transfer entirely**: emotion F1 improves from 0.199 → 0.352 (+77%), surpassing the single-task baseline (0.218).
+- **Summarization is robust to MTL** — quality remains stable across configurations.
+- **FLAN-T5 pre-training is essential** — random initialization produces dramatically worse results on all tasks.
+- **Domain gap matters**: academic summaries (ROUGE-1: 0.319) substantially outperform literary (0.206), driven by an 11:1 training data imbalance.
 ## Architecture
+LexiMind is a **from-scratch PyTorch Transformer** that loads pre-trained FLAN-T5-base weights layer by layer via a custom factory module — no HuggingFace model wrappers.
 | Component | Detail |
+| --------- | ------ |
 | Backbone | Encoder-Decoder Transformer (272M params) |
+| Encoder / Decoder | 12 layers each, 768d, 12 attention heads |
+| Normalization | RMSNorm (Pre-LN, T5-style) |
+| Attention | FlashAttention via PyTorch SDPA + T5 relative position bias |
+| FFN | Gated-GELU (wi\_0, wi\_1, wo) |
+| Summarization | Full decoder → language modeling head |
+| Emotion (28-class multi-label) | Learned attention pooling → linear head |
+| Topic (7-class) | Mean pooling → linear head |
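
The two pooling strategies named in the table rows above can be contrasted with a small, dependency-free sketch. It is an illustration only, not the repository's `heads.py`: the function names, the fixed `query` vector (learned during training in the real model), and the `1/sqrt(dim)` score scaling are all assumptions made here.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax; a score of -inf yields weight 0."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(states: list[list[float]], mask: list[int]) -> list[float]:
    """Average the hidden states of all non-padding tokens."""
    kept = [h for h, keep in zip(states, mask) if keep]
    dim = len(states[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]


def attention_pool(
    states: list[list[float]], mask: list[int], query: list[float]
) -> tuple[list[float], list[float]]:
    """Weight each token by softmax(query . h / sqrt(dim)), then sum.

    Assumes at least one unmasked token. Padding positions get a score
    of -inf so they receive zero weight.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, h)) / math.sqrt(dim) if keep else float("-inf")
        for h, keep in zip(states, mask)
    ]
    weights = softmax(scores)
    pooled = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return pooled, weights


# Toy encoder output: three tokens of width 2; the last token is padding.
states = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mask = [1, 1, 0]
query = [2.0, 0.0]  # hypothetical stand-in for the learned query vector

mean_vec = mean_pool(states, mask)
attn_vec, attn_weights = attention_pool(states, mask, query)
```

Mean pooling gives every real token the same weight, while the attention variant shifts weight toward tokens that align with the query. That is the behaviour the README credits for the emotion head.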
 ### Multi-Task Training
+All three tasks share the encoder. Summarization uses the full encoder-decoder; classification heads branch off the encoder output. Key training details:
+- **Temperature-based task sampling** (α=0.5): allocates training steps proportional to dataset size, preventing large tasks from dominating
+- **Attention pooling** for emotion: a learned query attends over encoder outputs, focusing on emotionally salient tokens rather than averaging the full sequence
+- **Fixed loss weights**: summarization=1.0, emotion=1.0, topic=0.3 (reduced to prevent overfitting on the small topic dataset)
+- **Frozen encoder layers 0–3**: preserves FLAN-T5's language understanding in lower layers
+- **Gradient conflict diagnostics**: optional inter-task gradient cosine similarity monitoring
+See [docs/architecture.md](docs/architecture.md) for full implementation details, weight loading tables, and training configuration rationale.
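
A rough sketch of the temperature-based task sampling listed in the README bullets above. It assumes the common formulation in which task `i` is drawn with probability proportional to `n_i ** alpha`; the diff does not show the trainer's actual formula, so the helper name and the formulation are assumptions. The dataset sizes are the approximate figures from the README's Training Data table.

```python
import random


def temperature_sampling_probs(sizes: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    """Return per-task sampling probabilities p_i proportional to n_i ** alpha.

    alpha = 1.0 reproduces size-proportional sampling; alpha = 0.0 is uniform.
    """
    if not sizes:
        raise ValueError("sizes must contain at least one task")
    weights = {task: n**alpha for task, n in sizes.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}


# Approximate training-set sizes (summarization = ~4K literary + ~45K academic).
sizes = {"summarization": 49_000, "emotion": 43_410, "topic": 3_402}

probs = temperature_sampling_probs(sizes, alpha=0.5)
proportional = temperature_sampling_probs(sizes, alpha=1.0)

# Draw a short task schedule; a trainer would pick one task per step this way.
rng = random.Random(0)
tasks = list(probs)
schedule = rng.choices(tasks, weights=[probs[t] for t in tasks], k=10)
```

Under this assumption, `alpha=0.5` gives the small topic dataset a larger share of steps than size-proportional sampling would, while still giving summarization and emotion most of the steps.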
 ## Training Data
+| Task | Source | Samples |
+| ---- | ------ | ------- |
+| Summarization | Gutenberg + Goodreads descriptions (literary) | ~4K |
 | Summarization | arXiv body → abstract (academic) | ~45K |
+| Topic | Gutenberg + arXiv metadata → 7 categories | 3,402 |
+| Emotion | GoEmotions — Reddit comments, 28 labels | 43,410 |
+For summarization, the model learns to produce descriptive summaries — what a book *is about* — rather than plot recaps, by pairing Gutenberg full texts with Goodreads descriptions and arXiv papers with their abstracts.
 ## Getting Started
 ### Prerequisites
 - Python 3.10+
 - NVIDIA GPU with CUDA (for training; CPU works for inference)
 ### Installation
 ```bash
 git clone https://github.com/OliverPerrin/LexiMind.git
 cd LexiMind
+pip install -r requirements.txt
 ```
 ### Training
 ```bash
+# Full training (~9 hours on RTX 4070 12GB)
+python scripts/train.py training=full
+# Quick dev run
+python scripts/train.py training=dev
 # Override parameters
+python scripts/train.py training=full training.optimizer.lr=5e-5
 # Resume from checkpoint
+python scripts/train.py training=full resume_from=checkpoints/epoch_5.pt
 ```
+Experiments are tracked with MLflow (`mlflow ui` to browse).
 ### Evaluation
 ```bash
+python scripts/evaluate.py
+python scripts/evaluate.py --skip-bertscore    # faster
+python scripts/evaluate.py --tune-thresholds   # per-class threshold tuning
 ```
 ### Inference
 ```bash
 # Command-line
+python scripts/inference.py "Your text to analyze"
 # Gradio web demo
+python scripts/demo_gradio.py
+```
+### Profiling
+```bash
+# Profile GPU usage (CUDA kernels, memory, Chrome trace)
+python scripts/profile_training.py
 ```
 ### Docker
 ## Project Structure
 ```text
 src/
+├── models/          # Encoder, decoder, attention, FFN, heads, factory
+├── data/            # Datasets, dataloaders, tokenization, cross-task dedup
+├── training/        # Trainer (AMP, grad accum, temperature sampling), metrics
+├── inference/       # Pipeline + factory for checkpoint loading
+├── api/             # FastAPI REST endpoint
+└── utils/           # Device detection, checkpointing, label I/O
 scripts/
+├── train.py                    # Hydra training entry point
+├── evaluate.py                 # Full evaluation suite
+├── inference.py                # CLI inference
+├── demo_gradio.py              # Gradio discovery demo
+├── profile_training.py         # PyTorch profiler
+├── train_multiseed.py          # Multi-seed training with aggregation
+├── visualize_training.py       # Training curve visualization
+├── download_data.py            # Dataset downloader
+└── build_discovery_dataset.py  # Pre-compute discovery dataset
+configs/             # Hydra configs (model, training, data)
+docs/                # Research paper + architecture documentation
+tests/               # Pytest suite
 ```
 ## Code Quality
 ```bash
+ruff check .                     # Linting
+mypy src/ scripts/ tests/        # Type checking
+pytest                           # Tests
+pre-commit run --all-files       # All checks
 ```
 ## License
 GPL-3.0 — see [LICENSE](LICENSE) for details.

scripts/build_discovery_dataset.py CHANGED

@@ -29,134 +29,140 @@ from src.inference.factory import create_inference_pipeline  # noqa: E402
 # --------------- Data Loading ---------------
 def load_academic_papers(data_dir: Path, max_samples: int = 300) -> list[dict]:
     """Load academic paper samples from the training data."""
     summ_file = data_dir / "summarization" / "train.jsonl"
     if not summ_file.exists():
         print(f"  Warning: {summ_file} not found")
         return []
     academic = []
     with open(summ_file) as f:
         for line in f:
             item = json.loads(line)
             if item.get("type") != "academic":
                 continue
             text = item.get("source", "")
             if len(text) < 500:
                 continue
             # Use title from data
             title = item.get("title", "Research Paper")
-            academic.append({
-                "text": text[:2000],
-                "title": title,
-                "reference_summary": item.get("summary", "")[:500]
-            })
     random.seed(42)
     samples = random.sample(academic, min(max_samples, len(academic)))
     results = []
     for i, item in enumerate(samples):
-        results.append({
-            "id": f"paper_{i}",
-            "title": item["title"],
-            "text": item["text"],
-            "source_type": "academic",
-            "dataset": "arxiv",
-            "reference_summary": item["reference_summary"]
-        })
     print(f"  Loaded {len(results)} academic papers")
     return results
 def load_literary(data_dir: Path, max_samples: int = 300) -> list[dict]:
     """Load literary samples from the training data.
     Training data now contains Goodreads descriptions (back-cover style)
     instead of plot summaries.
     """
     summ_file = data_dir / "summarization" / "train.jsonl"
     if not summ_file.exists():
         print(f"  Warning: {summ_file} not found")
         return []
     literary = []
     seen_titles = set()
     with open(summ_file) as f:
         for line in f:
             item = json.loads(line)
             if item.get("type") != "literary":
                 continue
             title = item.get("title", "")
             if not title or title in seen_titles:
                 continue
             text = item.get("source", "")
             summary = item.get("summary", "")
             if len(text) < 500 or len(summary) < 50:
                 continue
             seen_titles.add(title)
-            literary.append({
-                "text": text[:2000],
-                "title": title,
-                "reference_summary": summary[:600]
-            })
     random.seed(42)
     samples = random.sample(literary, min(max_samples, len(literary)))
     results = []
     for i, item in enumerate(samples):
-        results.append({
-            "id": f"literary_{i}",
-            "title": item["title"],
-            "text": item["text"],
-            "source_type": "literary",
-            "dataset": "goodreads",
-            "reference_summary": item["reference_summary"],
-        })
     print(f"  Loaded {len(results)} literary works (unique titles)")
     return results
 # --------------- Inference ---------------
 def run_inference(pipeline: Any, samples: list[dict]) -> list[dict]:
     """Run model inference on all samples."""
     results = []
     for sample in tqdm(samples, desc="Running inference"):
         text = sample["text"]
         # Get model predictions using correct pipeline methods
         summaries = pipeline.summarize([text])
         topics = pipeline.predict_topics([text])
         emotions = pipeline.predict_emotions([text])
         # Extract first result from each list
         summary = summaries[0] if summaries else ""
         topic = topics[0] if topics else None
         emotion = emotions[0] if emotions else None
         # Get primary emotion (highest confidence if any detected)
         primary_emotion = "neutral"
         emotion_confidence = 0.0
         if emotion and emotion.labels:
             primary_emotion = emotion.labels[0]
             emotion_confidence = emotion.scores[0]
         result = {
             "id": sample["id"],
             "title": sample["title"],
@@ -170,24 +176,25 @@ def run_inference(pipeline: Any, samples: list[dict]) -> list[dict]:
             "generated_summary": summary,
             "reference_summary": sample.get("reference_summary", ""),
         }
         results.append(result)
     # Print distribution stats
     topic_dist: dict[str, int] = defaultdict(int)
     emotion_dist: dict[str, int] = defaultdict(int)
     for r in results:
         topic_dist[r["topic"]] += 1
         emotion_dist[r["emotion"]] += 1
     print(f"\nTopic distribution: {dict(topic_dist)}")
     print(f"Emotion distribution: {dict(emotion_dist)}")
     return results
 def main():
     import argparse
     parser = argparse.ArgumentParser(description="Build discovery dataset for HuggingFace Space")
     parser.add_argument("--data-dir", type=Path, default=Path("data/processed"))
     parser.add_argument("--checkpoint", type=Path, default=Path("checkpoints/best.pt"))
@@ -197,41 +204,39 @@ def main():
     parser.add_argument("--push-to-hub", action="store_true", help="Push to HuggingFace Hub")
     parser.add_argument("--hub-repo", type=str, default="OliverPerrin/LexiMind-Discovery")
     args = parser.parse_args()
     print("Loading data samples from training data...")
     print("(Data has already been filtered by download_data.py)")
     # Load samples from training data
     papers = load_academic_papers(args.data_dir, args.num_papers)
     literary = load_literary(args.data_dir, args.num_literary)
     all_samples = papers + literary
     print(f"\nTotal samples: {len(all_samples)} ({len(papers)} papers, {len(literary)} literary)")
     if not all_samples:
         print("ERROR: No samples loaded! Check if data/processed exists and has data.")
         print("Run: python scripts/download_data.py --task summarization")
         return
     # Load model and run inference
     print(f"\nLoading model from {args.checkpoint}...")
     labels_path = Path("artifacts/labels.json")
     pipeline, labels = create_inference_pipeline(
-        args.checkpoint,
-        labels_path,
-        device="cuda" if torch.cuda.is_available() else "cpu"
     )
     print("Running inference on all samples...")
     results = run_inference(pipeline, all_samples)
     # Save locally
     print(f"\nSaving to {args.output}...")
     args.output.parent.mkdir(parents=True, exist_ok=True)
     with open(args.output, "w") as f:
         for item in results:
             f.write(json.dumps(item) + "\n")
     # Push to HuggingFace Hub
     if args.push_to_hub:
         print(f"\nPushing to HuggingFace Hub: {args.hub_repo}")
@@ -239,10 +244,10 @@ def main():
         dataset.push_to_hub(
             args.hub_repo,
             private=False,
-            commit_message="Rebuild with Goodreads descriptions (back-cover style)"
         )
         print(f"Dataset available at: https://huggingface.co/datasets/{args.hub_repo}")
     print("\nDone!")

 # --------------- Data Loading ---------------
 def load_academic_papers(data_dir: Path, max_samples: int = 300) -> list[dict]:
     """Load academic paper samples from the training data."""
     summ_file = data_dir / "summarization" / "train.jsonl"
     if not summ_file.exists():
         print(f"  Warning: {summ_file} not found")
         return []
     academic = []
     with open(summ_file) as f:
         for line in f:
             item = json.loads(line)
             if item.get("type") != "academic":
                 continue
             text = item.get("source", "")
             if len(text) < 500:
                 continue
             # Use title from data
             title = item.get("title", "Research Paper")
+            academic.append(
+                {
+                    "text": text[:2000],
+                    "title": title,
+                    "reference_summary": item.get("summary", "")[:500],
+                }
+            )
     random.seed(42)
     samples = random.sample(academic, min(max_samples, len(academic)))
     results = []
     for i, item in enumerate(samples):
+        results.append(
+            {
+                "id": f"paper_{i}",
+                "title": item["title"],
+                "text": item["text"],
+                "source_type": "academic",
+                "dataset": "arxiv",
+                "reference_summary": item["reference_summary"],
+            }
+        )
     print(f"  Loaded {len(results)} academic papers")
     return results
 def load_literary(data_dir: Path, max_samples: int = 300) -> list[dict]:
     """Load literary samples from the training data.
     Training data now contains Goodreads descriptions (back-cover style)
     instead of plot summaries.
     """
     summ_file = data_dir / "summarization" / "train.jsonl"
     if not summ_file.exists():
         print(f"  Warning: {summ_file} not found")
         return []
     literary = []
     seen_titles = set()
     with open(summ_file) as f:
         for line in f:
             item = json.loads(line)
             if item.get("type") != "literary":
                 continue
             title = item.get("title", "")
             if not title or title in seen_titles:
                 continue
             text = item.get("source", "")
             summary = item.get("summary", "")
             if len(text) < 500 or len(summary) < 50:
                 continue
             seen_titles.add(title)
+            literary.append(
+                {"text": text[:2000], "title": title, "reference_summary": summary[:600]}
+            )
     random.seed(42)
     samples = random.sample(literary, min(max_samples, len(literary)))
     results = []
     for i, item in enumerate(samples):
+        results.append(
+            {
+                "id": f"literary_{i}",
+                "title": item["title"],
+                "text": item["text"],
+                "source_type": "literary",
+                "dataset": "goodreads",
+                "reference_summary": item["reference_summary"],
+            }
+        )
     print(f"  Loaded {len(results)} literary works (unique titles)")
     return results
 # --------------- Inference ---------------
 def run_inference(pipeline: Any, samples: list[dict]) -> list[dict]:
     """Run model inference on all samples."""
     results = []
     for sample in tqdm(samples, desc="Running inference"):
         text = sample["text"]
         # Get model predictions using correct pipeline methods
         summaries = pipeline.summarize([text])
         topics = pipeline.predict_topics([text])
         emotions = pipeline.predict_emotions([text])
         # Extract first result from each list
         summary = summaries[0] if summaries else ""
         topic = topics[0] if topics else None
         emotion = emotions[0] if emotions else None
         # Get primary emotion (highest confidence if any detected)
         primary_emotion = "neutral"
         emotion_confidence = 0.0
         if emotion and emotion.labels:
             primary_emotion = emotion.labels[0]
             emotion_confidence = emotion.scores[0]
         result = {
             "id": sample["id"],
             "title": sample["title"],
             "generated_summary": summary,
             "reference_summary": sample.get("reference_summary", ""),
         }
         results.append(result)
     # Print distribution stats
     topic_dist: dict[str, int] = defaultdict(int)
     emotion_dist: dict[str, int] = defaultdict(int)
     for r in results:
         topic_dist[r["topic"]] += 1
         emotion_dist[r["emotion"]] += 1
     print(f"\nTopic distribution: {dict(topic_dist)}")
     print(f"Emotion distribution: {dict(emotion_dist)}")
     return results
 def main():
     import argparse
     parser = argparse.ArgumentParser(description="Build discovery dataset for HuggingFace Space")
     parser.add_argument("--data-dir", type=Path, default=Path("data/processed"))
     parser.add_argument("--checkpoint", type=Path, default=Path("checkpoints/best.pt"))
     parser.add_argument("--push-to-hub", action="store_true", help="Push to HuggingFace Hub")
     parser.add_argument("--hub-repo", type=str, default="OliverPerrin/LexiMind-Discovery")
     args = parser.parse_args()
     print("Loading data samples from training data...")
     print("(Data has already been filtered by download_data.py)")
     # Load samples from training data
     papers = load_academic_papers(args.data_dir, args.num_papers)
     literary = load_literary(args.data_dir, args.num_literary)
     all_samples = papers + literary
     print(f"\nTotal samples: {len(all_samples)} ({len(papers)} papers, {len(literary)} literary)")
     if not all_samples:
         print("ERROR: No samples loaded! Check if data/processed exists and has data.")
         print("Run: python scripts/download_data.py --task summarization")
         return
     # Load model and run inference
     print(f"\nLoading model from {args.checkpoint}...")
     labels_path = Path("artifacts/labels.json")
     pipeline, labels = create_inference_pipeline(
+        args.checkpoint, labels_path, device="cuda" if torch.cuda.is_available() else "cpu"
     )
     print("Running inference on all samples...")
     results = run_inference(pipeline, all_samples)
     # Save locally
     print(f"\nSaving to {args.output}...")
     args.output.parent.mkdir(parents=True, exist_ok=True)
     with open(args.output, "w") as f:
         for item in results:
             f.write(json.dumps(item) + "\n")
     # Push to HuggingFace Hub
     if args.push_to_hub:
         print(f"\nPushing to HuggingFace Hub: {args.hub_repo}")
         dataset.push_to_hub(
             args.hub_repo,
             private=False,
+            commit_message="Rebuild with Goodreads descriptions (back-cover style)",
         )
         print(f"Dataset available at: https://huggingface.co/datasets/{args.hub_repo}")
     print("\nDone!")

scripts/demo_gradio.py CHANGED

@@ -27,8 +27,12 @@ print(f"Loaded {len(_dataset)} items")
 ALL_ITEMS: list[dict[str, Any]] = [dict(row) for row in _dataset]
 # Extract unique topics and emotions FROM THE DATASET (what model predicted)
-DATASET_TOPICS: list[str] = sorted(set(str(item["topic"]) for item in ALL_ITEMS if item.get("topic")))
-DATASET_EMOTIONS: list[str] = sorted(set(str(item["emotion"]) for item in ALL_ITEMS if item.get("emotion")))
 # Load ALL possible labels from labels.json (what the model CAN predict)
 _labels_path = Path(__file__).parent.parent / "artifacts" / "labels.json"
@@ -90,19 +94,19 @@ def format_item_card(item: dict) -> str:
     title = item.get("title", "Unknown")
     source_type = item.get("source_type", "unknown")
     dataset_name = item.get("dataset", "").title()
     # Icon based on type
     if source_type == "academic":
         type_label = "Research Paper"
     else:
         type_label = "Literature"
     # Topic and emotion with confidence
     topic = item.get("topic", "Unknown")
     topic_conf = item.get("topic_confidence", 0)
     emotion = item.get("emotion", "Unknown")
     emotion_conf = item.get("emotion_confidence", 0)
     # Summary - check if using reference or generated
     use_reference = item.get("use_reference_summary", False)
     if use_reference or source_type == "literary":
@@ -111,17 +115,21 @@ def format_item_card(item: dict) -> str:
     else:
         summary = item.get("generated_summary", "")
         summary_label = "**AI-Generated Description:**"
     if not summary:
         summary = "No summary available."
     # Truncate summary if too long
     if len(summary) > 400:
-        summary = summary[:400].rsplit(' ', 1)[0] + "..."
     # Preview of original text
-    text_preview = item.get("text", "")[:400] + "..." if len(item.get("text", "")) > 400 else item.get("text", "")
     return f"""### **{title}**
 <small>*{type_label}* from {dataset_name}</small>
@@ -147,24 +155,24 @@ def browse_by_topic(topic: str) -> str:
     items = get_items_by_topic(topic)
     if not items:
         return "No items found for this topic."
     # Group by type
     literary = [i for i in items if i.get("source_type") == "literary"]
     academic = [i for i in items if i.get("source_type") == "academic"]
     result = f"## {topic if topic != 'All' else 'All Topics'}\n\n"
     result += f"*Found {len(items)} items ({len(literary)} literary, {len(academic)} academic)*\n\n"
     if literary:
         result += "### Literary Works\n\n"
         for item in literary[:25]:  # Limit to avoid huge pages
             result += format_item_card(item)
     if academic:
         result += "### Academic Papers\n\n"
         for item in academic[:25]:
             result += format_item_card(item)
     return result
@@ -173,23 +181,23 @@ def browse_by_emotion(emotion: str) -> str:
     items = get_items_by_emotion(emotion)
     if not items:
         return "No items found for this emotion."
     literary = [i for i in items if i.get("source_type") == "literary"]
     academic = [i for i in items if i.get("source_type") == "academic"]
     result = f"## Feeling {emotion.title() if emotion != 'All' else 'All Emotions'}?\n\n"
     result += f"*Found {len(items)} items ({len(literary)} literary, {len(academic)} academic)*\n\n"
     if literary:
         result += "### Literary Works\n\n"
         for item in literary[:25]:
             result += format_item_card(item)
     if academic:
         result += "### Academic Papers\n\n"
         for item in academic[:25]:
             result += format_item_card(item)
     return result
@@ -197,24 +205,25 @@ def search_items(query: str) -> str:
     """Search items by text content."""
     if not query or len(query) < 3:
         return "Enter at least 3 characters to search."
     query_lower = query.lower()
     matches = [
-        item for item in ALL_ITEMS
         if query_lower in item.get("text", "").lower()
         or query_lower in item.get("generated_summary", "").lower()
         or query_lower in item.get("title", "").lower()
     ]
     if not matches:
         return f"No results found for '{query}'."
     result = f"## Search Results for '{query}'\n\n"
     result += f"*Found {len(matches)} matching items*\n\n"
     for item in matches[:30]:
         result += format_item_card(item)
     return result
@@ -226,9 +235,8 @@ with gr.Blocks(
     css="""
     .result-box { max-height: 700px; overflow-y: auto; }
     h3 { margin-top: 0.5em !important; }
-    """
 ) as demo:
     gr.Markdown(
         """
         # LexiMind
@@ -237,79 +245,75 @@ with gr.Blocks(
         Browse **{total_count}** texts — {lit_count} classic books and {paper_count} research papers — analyzed by a multi-task transformer.
         ---
-        """.format(
-            total_count=len(ALL_ITEMS),
-            lit_count=len(BOOKS),
-            paper_count=len(PAPERS)
-        )
     )
     with gr.Tabs():
         # ===================== TAB 1: BROWSE BY TOPIC =====================
         with gr.Tab("By Topic"):
             gr.Markdown("*Select a topic to explore related books and papers*")
             topic_dropdown = gr.Dropdown(
                 choices=["All"] + TOPICS,
                 value="All",
                 label="Select Topic",
                 interactive=True,
             )
             topic_results = gr.Markdown(
                 value=browse_by_topic("All"),
                 elem_classes=["result-box"],
             )
             topic_dropdown.change(
                 fn=browse_by_topic,
                 inputs=[topic_dropdown],
                 outputs=[topic_results],
             )
         # ===================== TAB 2: BROWSE BY EMOTION =====================
         with gr.Tab("By Emotion"):
             gr.Markdown("*Find books and papers that evoke specific emotions*")
             emotion_dropdown = gr.Dropdown(
                 choices=["All"] + [e.title() for e in EMOTIONS],
                 value="All",
                 label="Select Emotion",
                 interactive=True,
             )
             emotion_results = gr.Markdown(
                 value=browse_by_emotion("All"),
                 elem_classes=["result-box"],
             )
             emotion_dropdown.change(
                 fn=lambda e: browse_by_emotion(e.lower() if e != "All" else "All"),
                 inputs=[emotion_dropdown],
                 outputs=[emotion_results],
             )
         # ===================== TAB 3: SEARCH =====================
         with gr.Tab("Search"):
             gr.Markdown("*Search through all books and papers by keyword*")
             search_input = gr.Textbox(
                 placeholder="Enter keywords to search...",
                 label="Search",
                 interactive=True,
             )
             search_results = gr.Markdown(
                 value="Enter at least 3 characters to search.",
                 elem_classes=["result-box"],
             )
             search_input.change(
                 fn=search_items,
                 inputs=[search_input],
                 outputs=[search_results],
             )
         # ===================== TAB 4: METRICS =====================
         with gr.Tab("Metrics"):
             gr.Markdown(
@@ -319,10 +323,10 @@ with gr.Blocks(
                 Computed on held-out validation data.
                 """
             )
             # Summarization Metrics
             gr.Markdown("#### Summarization")
             if METRICS.get("summarization"):
                 summ = METRICS["summarization"]
                 summ_md = """
@@ -341,10 +345,10 @@ with gr.Blocks(
                 gr.Markdown(summ_md)
             else:
                 gr.Markdown("*Summarization metrics not available. Run evaluation script.*")
             # Topic Classification Metrics
             gr.Markdown("#### Topic Classification")
             if METRICS.get("topic"):
                 topic = METRICS["topic"]
                 topic_md = """
@@ -359,10 +363,10 @@ with gr.Blocks(
                 gr.Markdown(topic_md)
             else:
                 gr.Markdown("*Topic classification metrics not available.*")
             # Emotion Detection Metrics
             gr.Markdown("#### Emotion Detection")
             if METRICS.get("emotion"):
                 emotion = METRICS["emotion"]
                 emotion_md = """
@@ -374,17 +378,19 @@ with gr.Blocks(
 *28-label multi-label classification from GoEmotions.*
 """.format(
-                    sample_f1=emotion.get("sample_avg_f1", emotion.get("f1", emotion.get("multilabel_f1", 0))),
                     macro_f1=emotion.get("macro_f1", 0),
                     micro_f1=emotion.get("micro_f1", 0),
                 )
                 gr.Markdown(emotion_md)
             else:
                 gr.Markdown("*Emotion detection metrics not available.*")
             # Dataset Statistics
             gr.Markdown("#### Dataset Statistics")
             gr.Markdown(f"""
 | Statistic | Value |
 |-----------|-------|
@@ -394,7 +400,7 @@ with gr.Blocks(
 | Topics | {len(TOPICS)} |
 | Emotions | {len(EMOTIONS)} |
 """)
         # ===================== TAB 5: ABOUT =====================
         with gr.Tab("About"):
             gr.Markdown(
@@ -420,4 +426,3 @@ with gr.Blocks(
 if __name__ == "__main__":
     demo.launch(server_name="0.0.0.0", server_port=7860)

 ALL_ITEMS: list[dict[str, Any]] = [dict(row) for row in _dataset]
 # Extract unique topics and emotions FROM THE DATASET (what model predicted)
+DATASET_TOPICS: list[str] = sorted(
+    set(str(item["topic"]) for item in ALL_ITEMS if item.get("topic"))
+)
+DATASET_EMOTIONS: list[str] = sorted(
+    set(str(item["emotion"]) for item in ALL_ITEMS if item.get("emotion"))
+)
 # Load ALL possible labels from labels.json (what the model CAN predict)
 _labels_path = Path(__file__).parent.parent / "artifacts" / "labels.json"
     title = item.get("title", "Unknown")
     source_type = item.get("source_type", "unknown")
     dataset_name = item.get("dataset", "").title()
     # Icon based on type
     if source_type == "academic":
         type_label = "Research Paper"
     else:
         type_label = "Literature"
     # Topic and emotion with confidence
     topic = item.get("topic", "Unknown")
     topic_conf = item.get("topic_confidence", 0)
     emotion = item.get("emotion", "Unknown")
     emotion_conf = item.get("emotion_confidence", 0)
     # Summary - check if using reference or generated
     use_reference = item.get("use_reference_summary", False)
     if use_reference or source_type == "literary":
     else:
         summary = item.get("generated_summary", "")
         summary_label = "**AI-Generated Description:**"
     if not summary:
         summary = "No summary available."
     # Truncate summary if too long
     if len(summary) > 400:
+        summary = summary[:400].rsplit(" ", 1)[0] + "..."
     # Preview of original text
+    text_preview = (
+        item.get("text", "")[:400] + "..."
+        if len(item.get("text", "")) > 400
+        else item.get("text", "")
+    )
     return f"""### **{title}**
 <small>*{type_label}* from {dataset_name}</small>
     items = get_items_by_topic(topic)
     if not items:
         return "No items found for this topic."
     # Group by type
     literary = [i for i in items if i.get("source_type") == "literary"]
     academic = [i for i in items if i.get("source_type") == "academic"]
     result = f"## {topic if topic != 'All' else 'All Topics'}\n\n"
     result += f"*Found {len(items)} items ({len(literary)} literary, {len(academic)} academic)*\n\n"
     if literary:
         result += "### Literary Works\n\n"
         for item in literary[:25]:  # Limit to avoid huge pages
             result += format_item_card(item)
     if academic:
         result += "### Academic Papers\n\n"
         for item in academic[:25]:
             result += format_item_card(item)
     return result
     items = get_items_by_emotion(emotion)
     if not items:
         return "No items found for this emotion."
     literary = [i for i in items if i.get("source_type") == "literary"]
     academic = [i for i in items if i.get("source_type") == "academic"]
     result = f"## Feeling {emotion.title() if emotion != 'All' else 'All Emotions'}?\n\n"
     result += f"*Found {len(items)} items ({len(literary)} literary, {len(academic)} academic)*\n\n"
     if literary:
         result += "### Literary Works\n\n"
         for item in literary[:25]:
             result += format_item_card(item)
     if academic:
         result += "### Academic Papers\n\n"
         for item in academic[:25]:
             result += format_item_card(item)
     return result
     """Search items by text content."""
     if not query or len(query) < 3:
         return "Enter at least 3 characters to search."
     query_lower = query.lower()
     matches = [
+        item
+        for item in ALL_ITEMS
         if query_lower in item.get("text", "").lower()
         or query_lower in item.get("generated_summary", "").lower()
         or query_lower in item.get("title", "").lower()
     ]
     if not matches:
         return f"No results found for '{query}'."
     result = f"## Search Results for '{query}'\n\n"
     result += f"*Found {len(matches)} matching items*\n\n"
     for item in matches[:30]:
         result += format_item_card(item)
     return result
     css="""
     .result-box { max-height: 700px; overflow-y: auto; }
     h3 { margin-top: 0.5em !important; }
+    """,
 ) as demo:
     gr.Markdown(
         """
         # LexiMind
         Browse **{total_count}** texts — {lit_count} classic books and {paper_count} research papers — analyzed by a multi-task transformer.
         ---
+        """.format(total_count=len(ALL_ITEMS), lit_count=len(BOOKS), paper_count=len(PAPERS))
     )
     with gr.Tabs():
         # ===================== TAB 1: BROWSE BY TOPIC =====================
         with gr.Tab("By Topic"):
             gr.Markdown("*Select a topic to explore related books and papers*")
             topic_dropdown = gr.Dropdown(
                 choices=["All"] + TOPICS,
                 value="All",
                 label="Select Topic",
                 interactive=True,
             )
             topic_results = gr.Markdown(
                 value=browse_by_topic("All"),
                 elem_classes=["result-box"],
             )
             topic_dropdown.change(
                 fn=browse_by_topic,
                 inputs=[topic_dropdown],
                 outputs=[topic_results],
             )
         # ===================== TAB 2: BROWSE BY EMOTION =====================
         with gr.Tab("By Emotion"):
             gr.Markdown("*Find books and papers that evoke specific emotions*")
             emotion_dropdown = gr.Dropdown(
                 choices=["All"] + [e.title() for e in EMOTIONS],
                 value="All",
                 label="Select Emotion",
                 interactive=True,
             )
             emotion_results = gr.Markdown(
                 value=browse_by_emotion("All"),
                 elem_classes=["result-box"],
             )
             emotion_dropdown.change(
                 fn=lambda e: browse_by_emotion(e.lower() if e != "All" else "All"),
                 inputs=[emotion_dropdown],
                 outputs=[emotion_results],
             )
         # ===================== TAB 3: SEARCH =====================
         with gr.Tab("Search"):
             gr.Markdown("*Search through all books and papers by keyword*")
             search_input = gr.Textbox(
                 placeholder="Enter keywords to search...",
                 label="Search",
                 interactive=True,
             )
             search_results = gr.Markdown(
                 value="Enter at least 3 characters to search.",
                 elem_classes=["result-box"],
             )
             search_input.change(
                 fn=search_items,
                 inputs=[search_input],
                 outputs=[search_results],
             )
         # ===================== TAB 4: METRICS =====================
         with gr.Tab("Metrics"):
             gr.Markdown(
                 Computed on held-out validation data.
                 """
             )
             # Summarization Metrics
             gr.Markdown("#### Summarization")
             if METRICS.get("summarization"):
                 summ = METRICS["summarization"]
                 summ_md = """
                 gr.Markdown(summ_md)
             else:
                 gr.Markdown("*Summarization metrics not available. Run evaluation script.*")
             # Topic Classification Metrics
             gr.Markdown("#### Topic Classification")
             if METRICS.get("topic"):
                 topic = METRICS["topic"]
                 topic_md = """
                 gr.Markdown(topic_md)
             else:
                 gr.Markdown("*Topic classification metrics not available.*")
             # Emotion Detection Metrics
             gr.Markdown("#### Emotion Detection")
             if METRICS.get("emotion"):
                 emotion = METRICS["emotion"]
                 emotion_md = """
 *28-label multi-label classification from GoEmotions.*
 """.format(
+                    sample_f1=emotion.get(
+                        "sample_avg_f1", emotion.get("f1", emotion.get("multilabel_f1", 0))
+                    ),
                     macro_f1=emotion.get("macro_f1", 0),
                     micro_f1=emotion.get("micro_f1", 0),
                 )
                 gr.Markdown(emotion_md)
             else:
                 gr.Markdown("*Emotion detection metrics not available.*")
             # Dataset Statistics
             gr.Markdown("#### Dataset Statistics")
             gr.Markdown(f"""
 | Statistic | Value |
 |-----------|-------|
 | Topics | {len(TOPICS)} |
 | Emotions | {len(EMOTIONS)} |
 """)
         # ===================== TAB 5: ABOUT =====================
         with gr.Tab("About"):
             gr.Markdown(
 if __name__ == "__main__":
     demo.launch(server_name="0.0.0.0", server_port=7860)
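
Note on the truncation in format_item_card above: the summary is cut at 400 characters and then backed up to the last space so it does not end mid-word, while the text preview is cut at a hard 400-character limit. A minimal sketch of the word-boundary variant as a standalone helper is shown here. The name `truncate_at_word` and its parameters are hypothetical and do not appear in the commit.

```python
def truncate_at_word(text: str, limit: int = 400, suffix: str = "...") -> str:
    """Shorten text to at most `limit` characters plus `suffix`.

    Hypothetical helper (not part of the commit). It mirrors the
    expression used for the summary in format_item_card:
    summary[:400].rsplit(" ", 1)[0] + "..."
    """
    if len(text) <= limit:
        # Short enough: return unchanged, with no suffix.
        return text
    # Hard cut at the limit, then drop the trailing partial word (if any)
    # by splitting once on the last space. If the cut contains no space,
    # rsplit returns the whole cut, so the hard limit still applies.
    return text[:limit].rsplit(" ", 1)[0] + suffix


if __name__ == "__main__":
    print(truncate_at_word("The quick brown fox jumps over the lazy dog", limit=18))
```

Routing both the summary and the text preview through one helper like this would make them truncate the same way. Whether the preview should also stop at a word boundary is a design choice the commit does not address.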

scripts/download_data.py CHANGED

@@ -45,63 +45,128 @@ OUTPUT_DIR = Path(__file__).parent.parent / "data" / "processed"
 # 28 emotions from GoEmotions - works for all text types
 EMOTION_LABELS = [
-    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
-    "confusion", "curiosity", "desire", "disappointment", "disapproval",
-    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
-    "joy", "love", "nervousness", "optimism", "pride", "realization",
-    "relief", "remorse", "sadness", "surprise", "neutral",
 ]
 # New topic labels for books + papers + blogs
 TOPIC_LABELS = [
-    "Fiction",           # Novels, short stories, literary fiction
-    "Science",           # Physics, chemistry, biology, nature
-    "Technology",        # CS, engineering, programming, AI/ML
-    "Philosophy",        # Ethics, logic, metaphysics, epistemology
-    "History",           # Historical texts, biographies, memoirs
-    "Psychology",        # Mind, behavior, self-help, mental health
-    "Business",          # Economics, finance, entrepreneurship
-    "Arts",              # Music, visual arts, film, architecture
 ]
 # arXiv category → our topic mapping
 ARXIV_CATEGORY_MAP = {
     # Computer Science
-    "cs.AI": "Technology", "cs.CL": "Technology", "cs.CV": "Technology",
-    "cs.LG": "Technology", "cs.NE": "Technology", "cs.RO": "Technology",
-    "cs.SE": "Technology", "cs.PL": "Technology", "cs.DB": "Technology",
-    "cs.DS": "Technology", "cs.CR": "Technology", "cs.DC": "Technology",
-    "cs.HC": "Technology", "cs.IR": "Technology", "cs.IT": "Technology",
-    "cs.MA": "Technology", "cs.MM": "Technology", "cs.NI": "Technology",
-    "cs.OS": "Technology", "cs.PF": "Technology", "cs.SY": "Technology",
     # Physics
-    "physics": "Science", "astro-ph": "Science", "cond-mat": "Science",
-    "gr-qc": "Science", "hep-ex": "Science", "hep-lat": "Science",
-    "hep-ph": "Science", "hep-th": "Science", "math-ph": "Science",
-    "nlin": "Science", "nucl-ex": "Science", "nucl-th": "Science",
     "quant-ph": "Science",
     # Math
     "math": "Science",
     # Biology/Medicine
-    "q-bio": "Science", "stat": "Science",
     # Economics/Finance
-    "econ": "Business", "q-fin": "Business",
     # Electrical Engineering
     "eess": "Technology",
 }
 # Gutenberg subject → our topic mapping
 GUTENBERG_SUBJECT_MAP = {
-    "fiction": "Fiction", "novel": "Fiction", "stories": "Fiction",
-    "poetry": "Arts", "drama": "Arts", "plays": "Arts",
-    "science": "Science", "physics": "Science", "chemistry": "Science",
-    "biology": "Science", "nature": "Science", "astronomy": "Science",
-    "philosophy": "Philosophy", "ethics": "Philosophy", "logic": "Philosophy",
-    "history": "History", "biography": "History", "memoir": "History",
-    "psychology": "Psychology", "mind": "Psychology",
-    "economics": "Business", "business": "Business", "finance": "Business",
-    "art": "Arts", "music": "Arts", "architecture": "Arts",
-    "technology": "Technology", "engineering": "Technology",
 }
@@ -118,12 +183,69 @@ def write_jsonl(records: list[dict[str, Any]], path: Path, desc: str = "Writing"
 # Common English words for detection
 ENGLISH_WORDS = {
-    "the", "and", "of", "to", "a", "in", "that", "is", "was", "he", "she", "it",
-    "for", "with", "as", "his", "her", "they", "be", "at", "on", "have", "had",
-    "this", "but", "not", "from", "by", "or", "an", "said", "were", "been",
-    "would", "could", "which", "their", "there", "what", "when", "who", "will",
-    "more", "if", "no", "out", "so", "up", "into", "than", "them", "can", "only",
-    "other", "new", "some", "very", "just", "over", "such", "also", "its", "then",
 }
 # Non-English language patterns
@@ -144,72 +266,126 @@ NON_ENGLISH_PATTERNS = [
 # Patterns that indicate garbage/metadata text
 GARBAGE_PATTERNS = [
-    r"^Page \d+:",           # Page corrections
-    r"changed to",           # Errata
-    r"Punctuation has been", # Editorial notes
-    r"^\[.*\]$",             # Bracketed notes
-    r"^Note\.?[-—]",         # Notes
-    r"^follows:",            # "as follows:"
-    r"CHAPTER [IVXLC]+\.",   # Chapter headers only
-    r"^\*\*\*",              # Project Gutenberg markers
-    r"^End of.*Project",     # End markers
-    r"^Produced by",         # Production credits
-    r"transcriber",          # Transcriber notes
-    r"eBook",                # eBook references
-    r"©|copyright",          # Copyright notices
-    r"^INDEX",               # Index pages
     r"^\d+\.\s+\w+,\s+\d+",  # Index entries like "1. Name, 234"
-    r"(syn\.|var\.|sp\.)",   # Botanical abbreviations
-    r"[A-Z][a-z]+aceae",     # Botanical family names
-    r"\(\s*syn\s+",          # Synonym references
 ]
 # Patterns that indicate technical manuals/instructions (not narrative)
 TECHNICAL_PATTERNS = [
     r"\d+\.\s+It\s+(is|has|can)",  # Numbered features "1. It is a..."
-    r"^\d+(st|nd|rd|th)\.",        # "1st. 2nd. 3rd."
-    r"Mesh\.?\s*\d+",              # Mesh sizes (pottery)
     r"\d+\s*(oz|lb|kg|g|ml|mm|cm|inch)",  # Measurements
-    r"Parts?\s*:?\s*\d+",          # "Parts: 50"
-    r"Method of Using",            # Instructions
-    r"How to\s+\w+",               # How-to guides
-    r"Step\s+\d+",                 # Step-by-step
-    r"wire.*address",              # Business instructions
-    r"orders?\s+should\s+be",      # Order instructions
-    r"specifications?",            # Technical specs
-    r"(Front|Back)\s+Focus",       # Camera terms
-    r"Rack and Pinion",            # Mechanical terms
 ]
 # Shakespeare and plays to exclude (model hallucinates on Early Modern English)
 EXCLUDED_TITLES = {
     # Shakespeare
-    "King Lear", "Hamlet", "Macbeth", "Othello", "Romeo and Juliet",
-    "A Midsummer Night's Dream", "The Tempest", "Julius Caesar",
-    "The Merchant of Venice", "Twelfth Night", "Much Ado About Nothing",
-    "As You Like It", "The Taming of the Shrew", "Antony and Cleopatra",
-    "Coriolanus", "Cymbeline", "Timon of Athens", "Troilus and Cressida",
-    "Measure for Measure", "All's Well That Ends Well", "Pericles",
-    "The Winter's Tale", "The Comedy of Errors", "Two Gentlemen of Verona",
-    "Love's Labour's Lost", "The Merry Wives of Windsor", "Henry IV",
-    "Henry V", "Henry VI", "Henry VIII", "Richard II", "Richard III",
-    "King John", "Titus Andronicus",
     # French plays
-    "Tartuffe", "Phaedra", "Cyrano de Bergerac", "Cyrano De Bergerac",
-    "Le Misanthrope", "The School for Wives", "The Miser", "The Imaginary Invalid",
-    "Andromaque", "Britannicus", "Bérénice", "Le Cid",
     # Greek/Roman plays
-    "Oedipus Rex", "Oedipus the King", "Antigone", "Electra", "Medea",
-    "The Bacchae", "The Oresteia", "Agamemnon", "Prometheus Bound",
     # Other classic plays
-    "The Importance of Being Earnest", "Pygmalion", "Doctor Faustus",
-    "Waiting for Godot", "Death of a Salesman", "A Streetcar Named Desire",
-    "The Glass Menagerie", "Our Town", "Long Day's Journey Into Night",
-    "Who's Afraid of Virginia Woolf", "The Crucible", "Cat on a Hot Tin Roof",
     # Verse/poetic epics
-    "Idylls of the King", "Paradise Lost", "Paradise Regained",
-    "The Divine Comedy", "Inferno", "Purgatorio", "Paradiso",
-    "The Faerie Queene", "Beowulf",
 }
@@ -227,25 +403,25 @@ def is_quality_text(text: str) -> bool:
     for pattern in GARBAGE_PATTERNS:
         if re.search(pattern, text, re.IGNORECASE | re.MULTILINE):
             return False
     # Reject technical manuals/instructions
     if is_technical_manual(text):
         return False
     # Must have reasonable length
     if len(text) < 300:
         return False
     # Must have sentences (not just fragments)
-    sentences = re.split(r'[.!?]+', text)
     if len(sentences) < 4:
         return False
     # Check for too many special characters
     special_ratio = len(re.findall(r'[^\w\s.,!?\'"()-]', text)) / max(len(text), 1)
     if special_ratio > 0.08:
         return False
     return True
@@ -263,7 +439,7 @@ def is_play_text(text: str) -> bool:
         r"^[A-Z]{2,}\.\s",  # Character names like "HAMLET."
         r"Alarum|Flourish|Sennet",  # Stage directions
     ]
-    lines = text.split('\n')[:10]
     play_indicators = 0
     for line in lines:
         for pattern in play_patterns:
@@ -275,182 +451,182 @@ def is_play_text(text: str) -> bool:
 def is_english_text(text: str, min_ratio: float = 0.08, max_foreign: int = 5) -> bool:
     """
     Check if text is primarily English.
     Args:
         text: Text to check
         min_ratio: Minimum ratio of common English words
         max_foreign: Maximum number of foreign word matches before rejecting
     Returns:
         True if text appears to be English
     """
     if not text or len(text) < 100:
         return False
     text_lower = text.lower()
     words = text_lower.split()
     if len(words) < 20:
         return False
     # Check for excessive non-English words
     for pattern in NON_ENGLISH_PATTERNS:
         matches = len(re.findall(pattern, text_lower))
         if matches > max_foreign:
             return False
     # Check for sufficient English words
     english_count = sum(1 for w in words if w.strip(".,!?;:'\"") in ENGLISH_WORDS)
     ratio = english_count / len(words)
     return ratio >= min_ratio
 def normalize_title(title: str) -> str:
     """Normalize a book title for matching."""
     # Remove common prefixes/suffixes
-    title = re.sub(r'^(The|A|An)\s+', '', title, flags=re.IGNORECASE)
-    title = re.sub(r'\s*\([^)]*\)\s*', '', title)  # Remove parentheticals
-    title = re.sub(r'\s*:.+$', '', title)  # Remove subtitles
-    title = re.sub(r'[^\w\s]', '', title)  # Remove punctuation
     return title.lower().strip()
 # -------- SUMMARIZATION: BOOKS + ARXIV ----------
 def download_goodreads_descriptions() -> dict[str, dict]:
     """
     Download Goodreads book descriptions - back-cover style blurbs.
     These are "what the book is about" descriptions, not plot summaries.
     Returns dict mapping normalized title -> {title, description}
     """
     print("\nLoading Goodreads book descriptions...")
     descriptions = {}
     # Try multiple sources
     datasets_to_try = [
         "booksouls/goodreads-book-descriptions",
         "Skelebor/book_titles_and_descriptions_en_clean",
     ]
     for ds_name in datasets_to_try:
         try:
             print(f"    Loading {ds_name}...")
             ds = load_dataset(ds_name, split="train")
             for item in tqdm(ds, desc="Goodreads", leave=False):
                 title = item.get("title", "")
                 description = item.get("description", "")
                 if not title or not description:
                     continue
                 # Skip very short descriptions (not useful for training)
                 if len(description) < 100:
                     continue
                 # Skip very long descriptions (truncate later)
                 if len(description) > 2000:
                     description = description[:2000]
                 # Skip plays and excluded titles
                 if is_excluded_title(title):
                     continue
                 # Skip non-English descriptions
                 if not is_english_text(description):
                     continue
                 norm_title = normalize_title(title)
                 if norm_title and norm_title not in descriptions:
                     descriptions[norm_title] = {
                         "title": title,
                         "description": description,
                     }
             print(f"    Loaded {len(descriptions):,} descriptions from {ds_name}")
         except Exception as e:
             print(f"    {ds_name} failed: {e}")
     print(f"    Total: {len(descriptions):,} unique book descriptions")
     return descriptions
 def download_book_descriptions(
-    goodreads_descriptions: dict[str, dict],
-    max_samples: int = 20000
 ) -> list[dict[str, Any]]:
     """
     Download book description data by matching Gutenberg texts with Goodreads descriptions.
     This gives us (book_excerpt, book_description) training pairs where descriptions
     are back-cover style "what is this book about" blurbs, not plot summaries.
     """
     print("\nMatching Gutenberg books with Goodreads descriptions...")
     try:
         gutenberg = load_dataset("sedthh/gutenberg_english", split="train")
     except Exception:
         gutenberg = load_dataset("pg19", split="train")
     records: list[dict[str, Any]] = []
     matched_titles = set()
     skipped_quality = 0
     skipped_play = 0
     indices = list(range(len(gutenberg)))
     random.shuffle(indices)
     for i in tqdm(indices, desc="Matching books", leave=False):
         if len(records) >= max_samples:
             break
         item = gutenberg[i]
         text = item.get("TEXT", "") or item.get("text", "")
         metadata_raw = item.get("METADATA", "") or "{}"
         # Parse metadata
         try:
             metadata = json.loads(metadata_raw) if isinstance(metadata_raw, str) else metadata_raw
         except (json.JSONDecodeError, TypeError):
             metadata = {}
         # Get title
         title = metadata.get("title", "") if isinstance(metadata, dict) else ""
         if not title:
             continue
         # Check if we have a Goodreads description for this book
         norm_title = normalize_title(title)
         if norm_title not in goodreads_descriptions:
             continue
         # Skip if already matched this book
         if norm_title in matched_titles:
             continue
         goodreads_data = goodreads_descriptions[norm_title]
         # Skip plays and excluded titles
         if is_excluded_title(title):
             skipped_play += 1
             continue
         if not text or len(text) < 2000:
             continue
         # Get a clean excerpt from the book (skip front matter)
-        paragraphs = re.split(r'\n\s*\n', text)
         excerpt_parts = []
         total_len = 0
         for para in paragraphs[10:]:  # Skip front matter
             para = para.strip()
             if len(para) < 100:
                 continue
             # Quality check on paragraph
             if not is_english_text(para):
                 continue
@@ -460,112 +636,119 @@ def download_book_descriptions(
             if not is_quality_text(para) and len(para) > 300:
                 skipped_quality += 1
                 continue
             excerpt_parts.append(para)
             total_len += len(para)
             if total_len >= 3000:
                 break
         if total_len < 1000:
             continue
         book_excerpt = "\n\n".join(excerpt_parts)[:4000]
         matched_titles.add(norm_title)
-        records.append({
-            "source": book_excerpt,
-            "summary": goodreads_data["description"][:800],  # Back-cover blurbs are shorter
-            "type": "literary",
-            "title": goodreads_data["title"],
-        })
     print(f"    Matched {len(records):,} books with descriptions")
     print(f"    Skipped: {skipped_quality} quality, {skipped_play} plays")
     return records
 # Keep BookSum for additional literary training (chapter summaries are still useful)
 def download_booksum(max_samples: int = 20000) -> list[dict[str, Any]]:
     """Download BookSum - literary chapter summarization (English only, quality filtered).
     Note: These are chapter-level plot summaries, useful as supplementary training data.
     The primary book training comes from Goodreads descriptions (back-cover style).
     """
     print("\nLoading BookSum (supplementary literary data)...")
     all_records: list[dict[str, Any]] = []
     booksum = load_dataset("kmfoda/booksum")
     for split_name in booksum.keys():
         split = str(split_name)
         data = booksum[split_name]
         limit = max_samples if "train" in split else max_samples // 10
         indices = random.sample(range(len(data)), min(len(data), limit))
         records = []
         skipped_language = 0
         skipped_excluded = 0
         skipped_play = 0
         for i in tqdm(indices, desc=f"BookSum {split}", leave=False):
             item = data[i]
             chapter = item.get("chapter", "")
             summary = item.get("summary_text") or item.get("summary", "")
             # Extract book title from book_id (e.g., "The Last of the Mohicans.chapters 1-2")
             book_id = item.get("book_id", "")
             book_title = book_id.split(".")[0] if "." in book_id else book_id
             chapter_name = item.get("summary_id", "") or item.get("summary_name", "")
             if not (chapter and summary and len(chapter) > 300):
                 continue
             # Filter: excluded titles (Shakespeare, plays, etc.)
             if is_excluded_title(book_title):
                 skipped_excluded += 1
                 continue
             # Filter: play text format
             if is_play_text(chapter):
                 skipped_play += 1
                 continue
             # Filter: English only
             if not is_english_text(chapter):
                 skipped_language += 1
                 continue
             # Filter: quality text
             if not is_quality_text(chapter):
                 continue
-            records.append({
-                "source": chapter[:4000],
-                "summary": summary,
-                "type": "literary",
-                "split": split,
-                "title": book_title,
-                "chapter": chapter_name,
-            })
         all_records.extend(records)
-        print(f"    {split}: {len(records):,} (skipped {skipped_language} non-English, {skipped_excluded} excluded, {skipped_play} plays)")
     return all_records
 def clean_arxiv_text(text: str) -> str:
     """Clean arXiv LaTeX-style text to make it more readable."""
     import re
     # Remove LaTeX math placeholders
-    text = re.sub(r'@xmath\d+', '', text)
-    text = re.sub(r'@xcite', '', text)
     # Remove excessive whitespace
-    text = re.sub(r'\s+', ' ', text)
     # Remove LaTeX commands
-    text = re.sub(r'\\[a-zA-Z]+\{[^}]*\}', '', text)
-    text = re.sub(r'\\[a-zA-Z]+', '', text)
     return text.strip()
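The order of the substitutions above has a visible side effect: whitespace is collapsed before the LaTeX commands are stripped, so removing a command can leave a double space behind. The sketch below repeats the same five substitutions and adds a hypothetical `clean_arxiv_text_tidy` wrapper (an illustration, not part of the repository) that collapses whitespace once more at the end.

```python
import re


def clean_arxiv_text(text: str) -> str:
    """Same substitutions, in the same order, as the function above."""
    text = re.sub(r"@xmath\d+", "", text)  # math placeholders
    text = re.sub(r"@xcite", "", text)  # citation placeholders
    text = re.sub(r"\s+", " ", text)  # collapse whitespace
    text = re.sub(r"\\[a-zA-Z]+\{[^}]*\}", "", text)  # \cmd{arg}, argument included
    text = re.sub(r"\\[a-zA-Z]+", "", text)  # bare \cmd
    return text.strip()


def clean_arxiv_text_tidy(text: str) -> str:
    """Hypothetical variant: collapse whitespace again after stripping commands."""
    return re.sub(r"\s+", " ", clean_arxiv_text(text)).strip()


sample = r"we study @xmath0 in \textbf{bold} models @xcite ."
cleaned = clean_arxiv_text(sample)
tidied = clean_arxiv_text_tidy(sample)
print(repr(cleaned))
print(repr(tidied))
```

Note that the `\cmd{arg}` pattern removes the argument along with the command, so the word inside `\textbf{...}` is dropped from the output rather than kept as plain text.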
@@ -573,19 +756,19 @@ def extract_paper_title(abstract: str) -> str:
     """Extract a meaningful title from the first sentence of an abstract."""
     # Clean the abstract first
     abstract = clean_arxiv_text(abstract)
     # Get the first sentence (up to the first period, exclamation mark, question mark, or newline)
-    first_sentence = re.split(r'[.!?\n]', abstract)[0].strip()
     # Truncate if too long
     if len(first_sentence) > 100:
         # Try to cut at a natural word boundary
-        first_sentence = first_sentence[:100].rsplit(' ', 1)[0] + '...'
     # Capitalize first letter
     if first_sentence:
         first_sentence = first_sentence[0].upper() + first_sentence[1:]
     return first_sentence or "Untitled Paper"
@@ -593,202 +776,222 @@ def download_arxiv_summarization(max_samples: int = 50000) -> list[dict[str, Any
     """
     Download arXiv papers for academic summarization only (English only).
     Note: This dataset doesn't have categories, so can't be used for topic classification.
     Returns: summarization_records
     """
     print("\nLoading arXiv (academic papers for summarization)...")
     print("  Loading dataset (this may take a minute)...")
     arxiv = load_dataset("ccdv/arxiv-summarization", split="train")
     summ_records: list[dict[str, Any]] = []
     skipped_language = 0
     indices = list(range(len(arxiv)))
     random.shuffle(indices)
     print("  Processing papers...")
-    for i in tqdm(indices[:max_samples * 2], desc="arXiv", leave=False):
         if len(summ_records) >= max_samples:
             break
         item = arxiv[i]
         # Get abstract and article
         abstract = item.get("abstract", "")
         article = item.get("article", "")
         if not abstract or len(abstract) < 100:
             continue
         # Clean LaTeX artifacts
         abstract = clean_arxiv_text(abstract)
         article = clean_arxiv_text(article)
         # Skip if an '@' placeholder survives cleaning (abstract, or first 500 chars of the article)
-        if '@' in abstract or '@' in article[:500]:
             continue
         # Filter: English only
         if not is_english_text(article[:1000]):
             skipped_language += 1
             continue
         # Summarization: article → abstract
         if article and len(article) > 500:
             # Extract title from abstract
             paper_title = extract_paper_title(abstract)
-            summ_records.append({
-                "source": article[:4000],
-                "summary": abstract,
-                "type": "academic",
-                "title": paper_title,
-            })
     print(f"    Summarization: {len(summ_records):,} (skipped {skipped_language} non-English)")
     return summ_records
 def download_topics_from_datasets(max_samples: int = 50000) -> list[dict[str, Any]]:
     """
     Download topic classification data from multiple sources with real categories.
     Sources:
     - 20 Newsgroups (classic topic classification)
     - Wikipedia (article categories)
     """
     print("\nLoading topic classification datasets...")
     records: list[dict[str, Any]] = []
     # 20 Newsgroups - classic topic dataset
     print("  Loading 20 Newsgroups...")
     try:
         newsgroups = load_dataset("SetFit/20_newsgroups", split="train")
         # Map 20 newsgroups categories to our 8 topics
         newsgroup_map = {
             # Science
-            "sci.crypt": "Science", "sci.electronics": "Science",
-            "sci.med": "Science", "sci.space": "Science",
-            # Technology
-            "comp.graphics": "Technology", "comp.os.ms-windows.misc": "Technology",
-            "comp.sys.ibm.pc.hardware": "Technology", "comp.sys.mac.hardware": "Technology",
             "comp.windows.x": "Technology",
             # Philosophy/Religion
-            "alt.atheism": "Philosophy", "soc.religion.christian": "Philosophy",
             "talk.religion.misc": "Philosophy",
             # History/Politics
-            "talk.politics.guns": "History", "talk.politics.mideast": "History",
             "talk.politics.misc": "History",
             # Business
             "misc.forsale": "Business",
             # Sports/Recreation
-            "rec.autos": "Arts", "rec.motorcycles": "Arts",
-            "rec.sport.baseball": "Arts", "rec.sport.hockey": "Arts",
         }
         for item in tqdm(newsgroups, desc="20 Newsgroups", leave=False):
             if len(records) >= max_samples:
                 break
             label_name = item.get("label_text", "")
             text = item.get("text", "")
             if label_name in newsgroup_map and text and len(text) > 100:
-                records.append({
-                    "text": text[:1500],
-                    "topic": newsgroup_map[label_name],
-                    "source": "newsgroups",
-                })
         print(f"    20 Newsgroups: {len(records):,}")
     except Exception as e:
         print(f"    20 Newsgroups failed: {e}")
     # Add from Gutenberg for Fiction
     gutenberg_topics = download_gutenberg_topics(max_samples // 4)
     records.extend(gutenberg_topics)
     # Add from scientific papers abstract dataset for more Science/Tech
     print("  Loading scientific papers...")
     try:
         sci_papers = load_dataset("scientific_papers", "arxiv", split="train", streaming=True)
         sci_count = 0
-        for item in tqdm(sci_papers, desc="Scientific papers", leave=False, total=max_samples//4):
             if sci_count >= max_samples // 4:
                 break
             abstract = item.get("abstract", "")
             if abstract and len(abstract) > 100:
                 # Alternate between Science and Technology
                 topic = "Science" if sci_count % 2 == 0 else "Technology"
-                records.append({
-                    "text": abstract[:1500],
-                    "topic": topic,
-                    "source": "scientific_papers",
-                })
                 sci_count += 1
         print(f"    Scientific papers: {sci_count:,}")
     except Exception as e:
         print(f"    Scientific papers failed: {e}")
     return records
 def download_summarization(max_books: int = 20000, max_arxiv: int = 50000) -> None:
     """Download all summarization data (books + arxiv, NO news).
     Book data now uses Goodreads descriptions (back-cover blurbs) instead of
     plot summaries. This trains the model to describe "what the book is about"
     rather than summarizing the plot.
     """
     print("\nDownloading Summarization Data...")
     out_dir = OUTPUT_DIR / "summarization"
     all_records: list[dict[str, Any]] = []
     # Goodreads descriptions - primary book training data (back-cover style)
     goodreads_descriptions = download_goodreads_descriptions()
     book_records = download_book_descriptions(goodreads_descriptions, max_books)
     all_records.extend(book_records)
     # Optional: Add some BookSum for additional literary variety
     # These are chapter summaries, not back-cover style, so keep limited
     # booksum_records = download_booksum(max_books // 4)
     # all_records.extend(booksum_records)
     # arXiv - academic (abstracts are already "what is this paper about")
     arxiv_summ = download_arxiv_summarization(max_arxiv)
     all_records.extend(arxiv_summ)
     # Shuffle and split
     random.shuffle(all_records)
     # Split by original split if available, else 90/5/5
-    train_records = [r for r in all_records if r.get("split", "train") == "train" or "split" not in r]
     val_records = [r for r in all_records if r.get("split") == "validation"]
     test_records = [r for r in all_records if r.get("split") == "test"]
     # If no split info, do 90/5/5
     if len(val_records) < 100:
         n = len(train_records)
         random.shuffle(train_records)
-        val_records = train_records[int(n*0.9):int(n*0.95)]
-        test_records = train_records[int(n*0.95):]
-        train_records = train_records[:int(n*0.9)]
     # Remove split key before saving
     for r in train_records + val_records + test_records:
         r.pop("split", None)
     write_jsonl(train_records, out_dir / "train.jsonl", "train")
     write_jsonl(val_records, out_dir / "validation.jsonl", "val")
     write_jsonl(test_records, out_dir / "test.jsonl", "test")
     # Print breakdown
-    literary_count = sum(1 for r in train_records + val_records + test_records if r.get("type") == "literary")
-    academic_count = sum(1 for r in train_records + val_records + test_records if r.get("type") == "academic")
     print(f"\n  Total summarization: {len(train_records) + len(val_records) + len(test_records):,}")
     print(f"    Literary (book descriptions): {literary_count:,}")
     print(f"    Academic (paper abstracts): {academic_count:,}")
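The 90/5/5 fallback used when no split information is present can be pulled out into a small helper. This is a minimal sketch: the name `split_90_5_5` is hypothetical, and it shuffles with a local `random.Random(seed)` instead of the module-level `random.seed(...)` that the script relies on.

```python
import random
from typing import Any


def split_90_5_5(
    records: list[Any], seed: int = 42
) -> tuple[list[Any], list[Any], list[Any]]:
    """Shuffle a copy of records and cut it at the 90% and 95% marks."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end, val_end = int(n * 0.9), int(n * 0.95)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


train, val, test = split_90_5_5(list(range(1000)))
print(len(train), len(val), len(test))
```

Because the three slices share their boundaries, every record lands in exactly one split, whatever the input size.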
@@ -796,10 +999,11 @@ def download_summarization(max_books: int = 20000, max_arxiv: int = 50000) -> No
 # ------------ TOPIC CLASSIFICATION ------------
 def download_topics(max_samples: int = 50000) -> None:
     """
     Download topic classification data from multiple sources.
     Sources:
     - 20 Newsgroups (classic topic dataset)
     - Gutenberg books (Fiction)
@@ -807,49 +1011,49 @@ def download_topics(max_samples: int = 50000) -> None:
     """
     print("\nDownloading Topic Classification...")
     out_dir = OUTPUT_DIR / "topic"
     # Get topic records from various sources
     all_records = download_topics_from_datasets(max_samples)
     # Balance topics
     topic_counts: dict[str, list] = {t: [] for t in TOPIC_LABELS}
     for r in all_records:
         topic = r.get("topic")
         if topic in topic_counts:
             topic_counts[topic].append(r)
     # Print distribution before balancing
     print("\n  Topic distribution (before balancing):")
     for topic, records in topic_counts.items():
         print(f"    {topic}: {len(records):,}")
     # Balance to min count (with some tolerance) - only from topics that have data
     counts_with_data = [len(v) for v in topic_counts.values() if v]
     if not counts_with_data:
         print("  Warning: No topic data found!")
         return
     min_count = min(counts_with_data)
     target_count = min(min_count, max_samples // len(TOPIC_LABELS))
     balanced: list[dict[str, Any]] = []
     for _topic, records in topic_counts.items():
         if records:
             random.shuffle(records)
             balanced.extend(records[:target_count])
     random.shuffle(balanced)
     # Split 90/5/5
     n = len(balanced)
-    train_records = balanced[:int(n*0.9)]
-    val_records = balanced[int(n*0.9):int(n*0.95)]
-    test_records = balanced[int(n*0.95):]
     write_jsonl(train_records, out_dir / "train.jsonl", "train")
     write_jsonl(val_records, out_dir / "validation.jsonl", "val")
     write_jsonl(test_records, out_dir / "test.jsonl", "test")
     # Save labels - only labels that have data
     used_labels = [t for t in TOPIC_LABELS if topic_counts.get(t)]
     (out_dir / "labels.json").write_text(json.dumps(used_labels, indent=2))
@@ -859,82 +1063,85 @@ def download_topics(max_samples: int = 50000) -> None:
 def download_gutenberg_topics(max_samples: int = 30000) -> list[dict[str, Any]]:
     """Extract topic-labeled samples from Gutenberg books (English only)."""
     print("\nLoading Gutenberg for topic classification...")
     try:
         gutenberg = load_dataset("sedthh/gutenberg_english", split="train")
     except Exception:
         print("  Trying pg19...")
         gutenberg = load_dataset("pg19", split="train")
     records: list[dict[str, Any]] = []
     skipped_language = 0
     indices = list(range(len(gutenberg)))
     random.shuffle(indices)
     for i in tqdm(indices, desc="Gutenberg topics", leave=False):
         if len(records) >= max_samples:
             break
         item = gutenberg[i]
         text = item.get("TEXT", "") or item.get("text", "")
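         metadata = item.get("METADATA", {}) or {}
         # METADATA may arrive as a JSON string (download_gutenberg parses it the same way)
         if isinstance(metadata, str):
             try:
                 metadata = json.loads(metadata)
             except json.JSONDecodeError:
                 metadata = {}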
         if not text or len(text) < 1000:
             continue
         # Try to determine topic from metadata
         subjects = ""
         if isinstance(metadata, dict):
             subjects = str(metadata.get("subjects", "")).lower()
             subjects += " " + str(metadata.get("subject", "")).lower()
             subjects += " " + str(metadata.get("category", "")).lower()
         topic = None
         for keyword, mapped_topic in GUTENBERG_SUBJECT_MAP.items():
             if keyword in subjects:
                 topic = mapped_topic
                 break
         # Default fiction for novels without clear subject
         if not topic and ("novel" in subjects or not subjects.strip()):
             topic = "Fiction"
         if topic:
             # Get a clean paragraph as sample
-            paragraphs = re.split(r'\n\s*\n', text)
             for para in paragraphs[5:]:  # Skip front matter
                 para = para.strip()
-                if 200 < len(para) < 1500 and para.count('.') >= 2:
                     # Filter: English only
                     if not is_english_text(para):
                         skipped_language += 1
                         break
-                    records.append({
-                        "text": para,
-                        "topic": topic,
-                        "source": "gutenberg",
-                    })
                     break
     print(f"    Gutenberg topics: {len(records):,} (skipped {skipped_language} non-English)")
     return records
 # ------------ EMOTIONS (unchanged) -------------
 def download_emotions() -> None:
     """Download GoEmotions for emotion classification."""
     print("\nDownloading Emotions (GoEmotions)...")
     out_dir = OUTPUT_DIR / "emotion"
     ds = load_dataset("google-research-datasets/go_emotions", "simplified")
     for split_name in ds.keys():
         split = str(split_name)
         data = ds[split_name]
         records: list[dict[str, Any]] = []
         for item in tqdm(data, desc=split, leave=False):
             text = item.get("text", "")
@@ -944,7 +1151,7 @@ def download_emotions() -> None:
                 if emotions:
                     records.append({"text": text, "emotions": emotions})
         write_jsonl(records, out_dir / f"{split}.jsonl", split)
     (out_dir / "labels.json").write_text(json.dumps(EMOTION_LABELS, indent=2))
     print(f"  {len(EMOTION_LABELS)} emotion labels saved")
@@ -952,12 +1159,23 @@ def download_emotions() -> None:
 # --------------- GUTENBERG BOOKS (for language modeling) ---------------
 GUTENBERG_JUNK_PATTERNS = [
-    r"Project Gutenberg", r"www\.gutenberg\.org", r"This ebook is for",
-    r"Gutenberg License", r"^\*\*\* START OF", r"^\*\*\* END OF",
-    r"Produced by", r"Transcriber's Note", r"TABLE OF CONTENTS",
-    r"^\s*CHAPTER\s+[IVXLC\d]+", r"^\s*Chapter\s+[IVXLC\d]+",
-    r"^\s*BOOK\s+[IVXLC\d]+", r"^\s*PREFACE\s*$", r"^\s*INTRODUCTION\s*$",
-    r"E-text prepared by", r"Internet Archive", r"Distributed Proofreaders",
 ]
 GUTENBERG_JUNK_REGEX = re.compile("|".join(GUTENBERG_JUNK_PATTERNS), re.IGNORECASE)
@@ -968,7 +1186,7 @@ def is_clean_prose(text: str) -> bool:
         return False
     if GUTENBERG_JUNK_REGEX.search(text):
         return False
-    if text.count('.') < 2:
         return False
     uppercase_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
     if uppercase_ratio > 0.3:
@@ -987,68 +1205,66 @@ def download_gutenberg(max_samples: int = 30000) -> None:
     print("\nDownloading Gutenberg Books (English only)...")
     out_dir = OUTPUT_DIR / "books"
     out_dir.mkdir(parents=True, exist_ok=True)
     try:
         gutenberg = load_dataset("sedthh/gutenberg_english", split="train")
     except Exception:
         gutenberg = load_dataset("pg19", split="train")
     records: list[dict[str, Any]] = []
     indices = list(range(len(gutenberg)))
     random.shuffle(indices)
     for i in tqdm(indices, desc="Books", leave=False):
         if len(records) >= max_samples:
             break
         item = gutenberg[i]
         text = item.get("TEXT", "") or item.get("text", "")
         metadata_raw = item.get("METADATA", "") or "{}"
         # Parse metadata - it's stored as JSON string
         try:
             metadata = json.loads(metadata_raw) if isinstance(metadata_raw, str) else metadata_raw
         except (json.JSONDecodeError, TypeError):
             metadata = {}
         # Extract title and author
         title = metadata.get("title", "") if isinstance(metadata, dict) else ""
         author = metadata.get("author", "") if isinstance(metadata, dict) else ""
         if not title:
             title = item.get("title", f"Unknown Book #{i}")
         if not text or len(text) < 1000:
             continue
-        paragraphs = re.split(r'\n\s*\n', text)
         for para in paragraphs:
             para = para.strip()
             if is_clean_prose(para):
-                records.append({
-                    "text": para,
-                    "title": title,
-                    "author": author,
-                    "type": "gutenberg"
-                })
                 if len(records) >= max_samples:
                     break
     random.shuffle(records)
     n = len(records)
-    write_jsonl(records[:int(n*0.9)], out_dir / "train.jsonl", "train")
-    write_jsonl(records[int(n*0.9):int(n*0.95)], out_dir / "validation.jsonl", "val")
-    write_jsonl(records[int(n*0.95):], out_dir / "test.jsonl", "test")
 # ------------ MAIN ------------
 def main() -> None:
     parser = argparse.ArgumentParser(description="Download LexiMind datasets")
     parser.add_argument(
         "--task",
         choices=["all", "summarization", "emotion", "topic", "gutenberg"],
         default="all",
-        help="Dataset to download"
     )
     parser.add_argument("--max-books", type=int, default=40000, help="Max book description samples")
     parser.add_argument("--max-arxiv", type=int, default=50000, help="Max arXiv samples")
@@ -1056,14 +1272,14 @@ def main() -> None:
     parser.add_argument("--max-topics", type=int, default=50000, help="Max topic samples")
     parser.add_argument("--seed", type=int, default=42, help="Random seed")
     args = parser.parse_args()
     random.seed(args.seed)
     print("=" * 60)
     print("LexiMind Dataset Download")
     print("Books + Academic Papers + Topic Classification")
     print("=" * 60)
     if args.task in ["all", "summarization"]:
         download_summarization(args.max_books, args.max_arxiv)
     if args.task in ["all", "emotion"]:
@@ -1072,7 +1288,7 @@ def main() -> None:
         download_topics(args.max_topics)
     if args.task in ["all", "gutenberg"]:
         download_gutenberg(args.max_gutenberg)
     print("\n" + "=" * 60)
     print("Download complete!")
     print("=" * 60)

 # 28 emotions from GoEmotions - works for all text types
 EMOTION_LABELS = [
+    "admiration",
+    "amusement",
+    "anger",
+    "annoyance",
+    "approval",
+    "caring",
+    "confusion",
+    "curiosity",
+    "desire",
+    "disappointment",
+    "disapproval",
+    "disgust",
+    "embarrassment",
+    "excitement",
+    "fear",
+    "gratitude",
+    "grief",
+    "joy",
+    "love",
+    "nervousness",
+    "optimism",
+    "pride",
+    "realization",
+    "relief",
+    "remorse",
+    "sadness",
+    "surprise",
+    "neutral",
 ]
 # New topic labels for books + papers + blogs
 TOPIC_LABELS = [
+    "Fiction",  # Novels, short stories, literary fiction
+    "Science",  # Physics, chemistry, biology, nature
+    "Technology",  # CS, engineering, programming, AI/ML
+    "Philosophy",  # Ethics, logic, metaphysics, epistemology
+    "History",  # Historical texts, biographies, memoirs
+    "Psychology",  # Mind, behavior, self-help, mental health
+    "Business",  # Economics, finance, entrepreneurship
+    "Arts",  # Music, visual arts, film, architecture
 ]
 # arXiv category → our topic mapping
 ARXIV_CATEGORY_MAP = {
     # Computer Science
+    "cs.AI": "Technology",
+    "cs.CL": "Technology",
+    "cs.CV": "Technology",
+    "cs.LG": "Technology",
+    "cs.NE": "Technology",
+    "cs.RO": "Technology",
+    "cs.SE": "Technology",
+    "cs.PL": "Technology",
+    "cs.DB": "Technology",
+    "cs.DS": "Technology",
+    "cs.CR": "Technology",
+    "cs.DC": "Technology",
+    "cs.HC": "Technology",
+    "cs.IR": "Technology",
+    "cs.IT": "Technology",
+    "cs.MA": "Technology",
+    "cs.MM": "Technology",
+    "cs.NI": "Technology",
+    "cs.OS": "Technology",
+    "cs.PF": "Technology",
+    "cs.SY": "Technology",
     # Physics
+    "physics": "Science",
+    "astro-ph": "Science",
+    "cond-mat": "Science",
+    "gr-qc": "Science",
+    "hep-ex": "Science",
+    "hep-lat": "Science",
+    "hep-ph": "Science",
+    "hep-th": "Science",
+    "math-ph": "Science",
+    "nlin": "Science",
+    "nucl-ex": "Science",
+    "nucl-th": "Science",
     "quant-ph": "Science",
     # Math
     "math": "Science",
     # Biology/Medicine
+    "q-bio": "Science",
+    "stat": "Science",
     # Economics/Finance
+    "econ": "Business",
+    "q-fin": "Business",
     # Electrical Engineering
     "eess": "Technology",
 }
 # Gutenberg subject → our topic mapping
 GUTENBERG_SUBJECT_MAP = {
+    "fiction": "Fiction",
+    "novel": "Fiction",
+    "stories": "Fiction",
+    "poetry": "Arts",
+    "drama": "Arts",
+    "plays": "Arts",
+    "science": "Science",
+    "physics": "Science",
+    "chemistry": "Science",
+    "biology": "Science",
+    "nature": "Science",
+    "astronomy": "Science",
+    "philosophy": "Philosophy",
+    "ethics": "Philosophy",
+    "logic": "Philosophy",
+    "history": "History",
+    "biography": "History",
+    "memoir": "History",
+    "psychology": "Psychology",
+    "mind": "Psychology",
+    "economics": "Business",
+    "business": "Business",
+    "finance": "Business",
+    "art": "Arts",
+    "music": "Arts",
+    "architecture": "Arts",
+    "technology": "Technology",
+    "engineering": "Technology",
 }
 # Common English words for detection
 ENGLISH_WORDS = {
+    "the",
+    "and",
+    "of",
+    "to",
+    "a",
+    "in",
+    "that",
+    "is",
+    "was",
+    "he",
+    "she",
+    "it",
+    "for",
+    "with",
+    "as",
+    "his",
+    "her",
+    "they",
+    "be",
+    "at",
+    "on",
+    "have",
+    "had",
+    "this",
+    "but",
+    "not",
+    "from",
+    "by",
+    "or",
+    "an",
+    "said",
+    "were",
+    "been",
+    "would",
+    "could",
+    "which",
+    "their",
+    "there",
+    "what",
+    "when",
+    "who",
+    "will",
+    "more",
+    "if",
+    "no",
+    "out",
+    "so",
+    "up",
+    "into",
+    "than",
+    "them",
+    "can",
+    "only",
+    "other",
+    "new",
+    "some",
+    "very",
+    "just",
+    "over",
+    "such",
+    "also",
+    "its",
+    "then",
 }
 # Non-English language patterns
 # Patterns that indicate garbage/metadata text
 GARBAGE_PATTERNS = [
+    r"^Page \d+:",  # Page corrections
+    r"changed to",  # Errata
+    r"Punctuation has been",  # Editorial notes
+    r"^\[.*\]$",  # Bracketed notes
+    r"^Note\.?[-—]",  # Notes
+    r"^follows:",  # "as follows:"
+    r"CHAPTER [IVXLC]+\.",  # Chapter headers only
+    r"^\*\*\*",  # Project Gutenberg markers
+    r"^End of.*Project",  # End markers
+    r"^Produced by",  # Production credits
+    r"transcriber",  # Transcriber notes
+    r"eBook",  # eBook references
+    r"©|copyright",  # Copyright notices
+    r"^INDEX",  # Index pages
     r"^\d+\.\s+\w+,\s+\d+",  # Index entries like "1. Name, 234"
+    r"(syn\.|var\.|sp\.)",  # Botanical abbreviations
+    r"[A-Z][a-z]+aceae",  # Botanical family names
+    r"\(\s*syn\s+",  # Synonym references
 ]
 # Patterns that indicate technical manuals/instructions (not narrative)
 TECHNICAL_PATTERNS = [
     r"\d+\.\s+It\s+(is|has|can)",  # Numbered features "1. It is a..."
+    r"^\d+(st|nd|rd|th)\.",  # "1st. 2nd. 3rd."
+    r"Mesh\.?\s*\d+",  # Mesh sizes (pottery)
     r"\d+\s*(oz|lb|kg|g|ml|mm|cm|inch)",  # Measurements
+    r"Parts?\s*:?\s*\d+",  # "Parts: 50"
+    r"Method of Using",  # Instructions
+    r"How to\s+\w+",  # How-to guides
+    r"Step\s+\d+",  # Step-by-step
+    r"wire.*address",  # Business instructions
+    r"orders?\s+should\s+be",  # Order instructions
+    r"specifications?",  # Technical specs
+    r"(Front|Back)\s+Focus",  # Camera terms
+    r"Rack and Pinion",  # Mechanical terms
 ]
 # Shakespeare and plays to exclude (model hallucinates on Early Modern English)
 EXCLUDED_TITLES = {
     # Shakespeare
+    "King Lear",
+    "Hamlet",
+    "Macbeth",
+    "Othello",
+    "Romeo and Juliet",
+    "A Midsummer Night's Dream",
+    "The Tempest",
+    "Julius Caesar",
+    "The Merchant of Venice",
+    "Twelfth Night",
+    "Much Ado About Nothing",
+    "As You Like It",
+    "The Taming of the Shrew",
+    "Antony and Cleopatra",
+    "Coriolanus",
+    "Cymbeline",
+    "Timon of Athens",
+    "Troilus and Cressida",
+    "Measure for Measure",
+    "All's Well That Ends Well",
+    "Pericles",
+    "The Winter's Tale",
+    "The Comedy of Errors",
+    "Two Gentlemen of Verona",
+    "Love's Labour's Lost",
+    "The Merry Wives of Windsor",
+    "Henry IV",
+    "Henry V",
+    "Henry VI",
+    "Henry VIII",
+    "Richard II",
+    "Richard III",
+    "King John",
+    "Titus Andronicus",
     # French plays
+    "Tartuffe",
+    "Phaedra",
+    "Cyrano de Bergerac",
+    "Cyrano De Bergerac",
+    "Le Misanthrope",
+    "The School for Wives",
+    "The Miser",
+    "The Imaginary Invalid",
+    "Andromaque",
+    "Britannicus",
+    "Bérénice",
+    "Le Cid",
     # Greek/Roman plays
+    "Oedipus Rex",
+    "Oedipus the King",
+    "Antigone",
+    "Electra",
+    "Medea",
+    "The Bacchae",
+    "The Oresteia",
+    "Agamemnon",
+    "Prometheus Bound",
     # Other classic plays
+    "The Importance of Being Earnest",
+    "Pygmalion",
+    "Doctor Faustus",
+    "Waiting for Godot",
+    "Death of a Salesman",
+    "A Streetcar Named Desire",
+    "The Glass Menagerie",
+    "Our Town",
+    "Long Day's Journey Into Night",
+    "Who's Afraid of Virginia Woolf",
+    "The Crucible",
+    "Cat on a Hot Tin Roof",
     # Verse/poetic epics
+    "Idylls of the King",
+    "Paradise Lost",
+    "Paradise Regained",
+    "The Divine Comedy",
+    "Inferno",
+    "Purgatorio",
+    "Paradiso",
+    "The Faerie Queene",
+    "Beowulf",
 }
     for pattern in GARBAGE_PATTERNS:
         if re.search(pattern, text, re.IGNORECASE | re.MULTILINE):
             return False
     # Reject technical manuals/instructions
     if is_technical_manual(text):
         return False
     # Must have reasonable length
     if len(text) < 300:
         return False
     # Must have sentences (not just fragments)
+    sentences = re.split(r"[.!?]+", text)
     if len(sentences) < 4:
         return False
     # Check for too many special characters
     special_ratio = len(re.findall(r'[^\w\s.,!?\'"()-]', text)) / max(len(text), 1)
     if special_ratio > 0.08:
         return False
     return True
         r"^[A-Z]{2,}\.\s",  # Character names like "HAMLET."
         r"Alarum|Flourish|Sennet",  # Stage directions
     ]
+    lines = text.split("\n")[:10]
     play_indicators = 0
     for line in lines:
         for pattern in play_patterns:
 def is_english_text(text: str, min_ratio: float = 0.08, max_foreign: int = 5) -> bool:
     """
     Check if text is primarily English.
     Args:
         text: Text to check
         min_ratio: Minimum ratio of common English words
         max_foreign: Maximum number of foreign word matches before rejecting
     Returns:
         True if text appears to be English
     """
     if not text or len(text) < 100:
         return False
     text_lower = text.lower()
     words = text_lower.split()
     if len(words) < 20:
         return False
     # Check for excessive non-English words
     for pattern in NON_ENGLISH_PATTERNS:
         matches = len(re.findall(pattern, text_lower))
         if matches > max_foreign:
             return False
     # Check for sufficient English words
     english_count = sum(1 for w in words if w.strip(".,!?;:'\"") in ENGLISH_WORDS)
     ratio = english_count / len(words)
     return ratio >= min_ratio
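The core of the check above is the ratio of common English function words to all tokens. The sketch below isolates that step. It is a simplification under stated assumptions: `COMMON_WORDS` is a ten-word stand-in for `ENGLISH_WORDS`, `english_ratio` is a hypothetical helper, and the length gates and the `NON_ENGLISH_PATTERNS` scan are left out.

```python
# Small stand-in for ENGLISH_WORDS (assumption: the real set is much larger)
COMMON_WORDS = {"the", "and", "of", "to", "a", "in", "that", "is", "was", "it"}


def english_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are common English words."""
    words = text.lower().split()
    if not words:
        return 0.0
    # Strip the same surrounding punctuation as is_english_text before lookup
    hits = sum(1 for w in words if w.strip(".,!?;:'\"") in COMMON_WORDS)
    return hits / len(words)


english = "The cat sat in the garden and it was happy that the sun was out."
german = "Der Hund lief durch den Garten und bellte laut, als die Sonne unterging."
print(english_ratio(english), english_ratio(german))
```

With the default `min_ratio` of 0.08, ordinary English prose clears the threshold comfortably, which is why the function can afford such a low cut-off.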
 def normalize_title(title: str) -> str:
     """Normalize a book title for matching."""
     # Remove common prefixes/suffixes
+    title = re.sub(r"^(The|A|An)\s+", "", title, flags=re.IGNORECASE)
+    title = re.sub(r"\s*\([^)]*\)\s*", "", title)  # Remove parentheticals
+    title = re.sub(r"\s*:.+$", "", title)  # Remove subtitles
+    title = re.sub(r"[^\w\s]", "", title)  # Remove punctuation
     return title.lower().strip()
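Title normalization is what lets a Goodreads entry and a Gutenberg entry for the same book share a dictionary key. The sketch below copies the function so it runs on its own; the two sample titles are made up for illustration.

```python
import re


def normalize_title(title: str) -> str:
    """Copy of the function above, repeated so the example is self-contained."""
    title = re.sub(r"^(The|A|An)\s+", "", title, flags=re.IGNORECASE)
    title = re.sub(r"\s*\([^)]*\)\s*", "", title)  # Remove parentheticals
    title = re.sub(r"\s*:.+$", "", title)  # Remove subtitles
    title = re.sub(r"[^\w\s]", "", title)  # Remove punctuation
    return title.lower().strip()


goodreads_title = "The Great Gatsby (Annotated): A Novel"
gutenberg_title = "the great gatsby"
print(normalize_title(goodreads_title))
print(normalize_title(gutenberg_title))
```

One behavior to be aware of: the parenthetical pattern also consumes the whitespace on both sides, so a title with text after the closing parenthesis (for example `"Dracula (Illustrated) Edition"`) has its remaining words joined without a space.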
 # -------- SUMMARIZATION: BOOKS + ARXIV ----------
 def download_goodreads_descriptions() -> dict[str, dict]:
     """
     Download Goodreads book descriptions - back-cover style blurbs.
     These are "what the book is about" descriptions, not plot summaries.
     Returns dict mapping normalized title -> {title, description}
     """
     print("\nLoading Goodreads book descriptions...")
     descriptions = {}
     # Try multiple sources
     datasets_to_try = [
         "booksouls/goodreads-book-descriptions",
         "Skelebor/book_titles_and_descriptions_en_clean",
     ]
     for ds_name in datasets_to_try:
         try:
             print(f"    Loading {ds_name}...")
             ds = load_dataset(ds_name, split="train")
             for item in tqdm(ds, desc="Goodreads", leave=False):
                 title = item.get("title", "")
                 description = item.get("description", "")
                 if not title or not description:
                     continue
                 # Skip very short descriptions (not useful for training)
                 if len(description) < 100:
                     continue
                 # Truncate very long descriptions to 2,000 characters
                 if len(description) > 2000:
                     description = description[:2000]
                 # Skip plays and excluded titles
                 if is_excluded_title(title):
                     continue
                 # Skip non-English descriptions
                 if not is_english_text(description):
                     continue
                 norm_title = normalize_title(title)
                 if norm_title and norm_title not in descriptions:
                     descriptions[norm_title] = {
                         "title": title,
                         "description": description,
                     }
             print(f"    Loaded {len(descriptions):,} descriptions from {ds_name}")
         except Exception as e:
             print(f"    {ds_name} failed: {e}")
     print(f"    Total: {len(descriptions):,} unique book descriptions")
     return descriptions
 def download_book_descriptions(
+    goodreads_descriptions: dict[str, dict], max_samples: int = 20000
 ) -> list[dict[str, Any]]:
     """
     Download book description data by matching Gutenberg texts with Goodreads descriptions.
     This gives us (book_excerpt, book_description) training pairs where descriptions
     are back-cover style "what is this book about" blurbs, not plot summaries.
     """
     print("\nMatching Gutenberg books with Goodreads descriptions...")
     try:
         gutenberg = load_dataset("sedthh/gutenberg_english", split="train")
     except Exception:
         gutenberg = load_dataset("pg19", split="train")
     records: list[dict[str, Any]] = []
     matched_titles = set()
     skipped_quality = 0
     skipped_play = 0
     indices = list(range(len(gutenberg)))
     random.shuffle(indices)
     for i in tqdm(indices, desc="Matching books", leave=False):
         if len(records) >= max_samples:
             break
         item = gutenberg[i]
         text = item.get("TEXT", "") or item.get("text", "")
         metadata_raw = item.get("METADATA", "") or "{}"
         # Parse metadata
         try:
             metadata = json.loads(metadata_raw) if isinstance(metadata_raw, str) else metadata_raw
         except (json.JSONDecodeError, TypeError):
             metadata = {}
         # Get title
         title = metadata.get("title", "") if isinstance(metadata, dict) else ""
         if not title:
             continue
         # Check if we have a Goodreads description for this book
         norm_title = normalize_title(title)
         if norm_title not in goodreads_descriptions:
             continue
         # Skip if already matched this book
         if norm_title in matched_titles:
             continue
         goodreads_data = goodreads_descriptions[norm_title]
         # Skip plays and excluded titles
         if is_excluded_title(title):
             skipped_play += 1
             continue
         if not text or len(text) < 2000:
             continue
         # Get a clean excerpt from the book (skip front matter)
+        paragraphs = re.split(r"\n\s*\n", text)
         excerpt_parts = []
         total_len = 0
         for para in paragraphs[10:]:  # Skip front matter
             para = para.strip()
             if len(para) < 100:
                 continue
             # Quality check on paragraph
             if not is_english_text(para):
                 continue
             if not is_quality_text(para) and len(para) > 300:
                 skipped_quality += 1
                 continue
             excerpt_parts.append(para)
             total_len += len(para)
             if total_len >= 3000:
                 break
         if total_len < 1000:
             continue
         book_excerpt = "\n\n".join(excerpt_parts)[:4000]
         matched_titles.add(norm_title)
+        records.append(
+            {
+                "source": book_excerpt,
+                "summary": goodreads_data["description"][:800],  # Back-cover blurbs are shorter
+                "type": "literary",
+                "title": goodreads_data["title"],
+            }
+        )
     print(f"    Matched {len(records):,} books with descriptions")
     print(f"    Skipped: {skipped_quality} quality, {skipped_play} plays")
     return records
 # Keep BookSum for additional literary training (chapter summaries are still useful)
 def download_booksum(max_samples: int = 20000) -> list[dict[str, Any]]:
     """Download BookSum - literary chapter summarization (English only, quality filtered).
     Note: These are chapter-level plot summaries, useful as supplementary training data.
     The primary book training comes from Goodreads descriptions (back-cover style).
     """
     print("\nLoading BookSum (supplementary literary data)...")
     all_records: list[dict[str, Any]] = []
     booksum = load_dataset("kmfoda/booksum")
     for split_name in booksum.keys():
         split = str(split_name)
         data = booksum[split_name]
         limit = max_samples if "train" in split else max_samples // 10
         indices = random.sample(range(len(data)), min(len(data), limit))
         records = []
         skipped_language = 0
         skipped_excluded = 0
         skipped_play = 0
         for i in tqdm(indices, desc=f"BookSum {split}", leave=False):
             item = data[i]
             chapter = item.get("chapter", "")
             summary = item.get("summary_text") or item.get("summary", "")
             # Extract book title from book_id (e.g., "The Last of the Mohicans.chapters 1-2")
             book_id = item.get("book_id", "")
             book_title = book_id.split(".")[0] if "." in book_id else book_id
             chapter_name = item.get("summary_id", "") or item.get("summary_name", "")
             if not (chapter and summary and len(chapter) > 300):
                 continue
             # Filter: excluded titles (Shakespeare, plays, etc.)
             if is_excluded_title(book_title):
                 skipped_excluded += 1
                 continue
             # Filter: play text format
             if is_play_text(chapter):
                 skipped_play += 1
                 continue
             # Filter: English only
             if not is_english_text(chapter):
                 skipped_language += 1
                 continue
             # Filter: quality text
             if not is_quality_text(chapter):
                 continue
+            records.append(
+                {
+                    "source": chapter[:4000],
+                    "summary": summary,
+                    "type": "literary",
+                    "split": split,
+                    "title": book_title,
+                    "chapter": chapter_name,
+                }
+            )
         all_records.extend(records)
+        print(
+            f"    {split}: {len(records):,} (skipped {skipped_language} non-English, {skipped_excluded} excluded, {skipped_play} plays)"
+        )
     return all_records
 def clean_arxiv_text(text: str) -> str:
     """Clean arXiv LaTeX-style text to make it more readable."""
     import re
     # Remove LaTeX math placeholders
+    text = re.sub(r"@xmath\d+", "", text)
+    text = re.sub(r"@xcite", "", text)
     # Remove excessive whitespace
+    text = re.sub(r"\s+", " ", text)
     # Remove LaTeX commands
+    text = re.sub(r"\\[a-zA-Z]+\{[^}]*\}", "", text)
+    text = re.sub(r"\\[a-zA-Z]+", "", text)
     return text.strip()
 def extract_paper_title(abstract: str) -> str:
     """Extract a meaningful title from the first sentence of an abstract."""
     # Clean the abstract first
     abstract = clean_arxiv_text(abstract)
     # Get the first sentence (up to first period, exclamation mark, question mark, or newline)
+    first_sentence = re.split(r"[.!?\n]", abstract)[0].strip()
     # Truncate if too long
     if len(first_sentence) > 100:
         # Try to cut at a natural word boundary
+        first_sentence = first_sentence[:100].rsplit(" ", 1)[0] + "..."
     # Capitalize first letter
     if first_sentence:
         first_sentence = first_sentence[0].upper() + first_sentence[1:]
     return first_sentence or "Untitled Paper"
 def download_arxiv_summarization(max_samples: int) -> list[dict[str, Any]]:
     """
     Download arXiv papers for academic summarization only (English only).
     Note: This dataset doesn't have categories, so it can't be used for topic classification.
     Returns: summarization_records
     """
     print("\nLoading arXiv (academic papers for summarization)...")
     print("  Loading dataset (this may take a minute)...")
     arxiv = load_dataset("ccdv/arxiv-summarization", split="train")
     summ_records: list[dict[str, Any]] = []
     skipped_language = 0
     indices = list(range(len(arxiv)))
     random.shuffle(indices)
     print("  Processing papers...")
+    for i in tqdm(indices[: max_samples * 2], desc="arXiv", leave=False):
         if len(summ_records) >= max_samples:
             break
         item = arxiv[i]
         # Get abstract and article
         abstract = item.get("abstract", "")
         article = item.get("article", "")
         if not abstract or len(abstract) < 100:
             continue
         # Clean LaTeX artifacts
         abstract = clean_arxiv_text(abstract)
         article = clean_arxiv_text(article)
         # Skip if any "@" placeholder remnants survive cleaning
+        if "@" in abstract or "@" in article[:500]:
             continue
         # Filter: English only
         if not is_english_text(article[:1000]):
             skipped_language += 1
             continue
         # Summarization: article → abstract
         if article and len(article) > 500:
             # Extract title from abstract
             paper_title = extract_paper_title(abstract)
+            summ_records.append(
+                {
+                    "source": article[:4000],
+                    "summary": abstract,
+                    "type": "academic",
+                    "title": paper_title,
+                }
+            )
     print(f"    Summarization: {len(summ_records):,} (skipped {skipped_language} non-English)")
     return summ_records
 def download_topics_from_datasets(max_samples: int = 50000) -> list[dict[str, Any]]:
     """
     Download topic classification data from multiple sources with real categories.
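     Sources:
     - 20 Newsgroups (classic topic classification)
     - Gutenberg books (subject metadata mapped to topics, Fiction by default)
     - scientific_papers arXiv abstracts (labeled alternately Science/Technology)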
     """
     print("\nLoading topic classification datasets...")
     records: list[dict[str, Any]] = []
     # 20 Newsgroups - classic topic dataset
     print("  Loading 20 Newsgroups...")
     try:
         newsgroups = load_dataset("SetFit/20_newsgroups", split="train")
         # Map 20 Newsgroups categories to our topic labels
         newsgroup_map = {
             # Science
+            "sci.crypt": "Science",
+            "sci.electronics": "Science",
+            "sci.med": "Science",
+            "sci.space": "Science",
+            # Technology
+            "comp.graphics": "Technology",
+            "comp.os.ms-windows.misc": "Technology",
+            "comp.sys.ibm.pc.hardware": "Technology",
+            "comp.sys.mac.hardware": "Technology",
             "comp.windows.x": "Technology",
             # Philosophy/Religion
+            "alt.atheism": "Philosophy",
+            "soc.religion.christian": "Philosophy",
             "talk.religion.misc": "Philosophy",
             # History/Politics
+            "talk.politics.guns": "History",
+            "talk.politics.mideast": "History",
             "talk.politics.misc": "History",
             # Business
             "misc.forsale": "Business",
             # Sports/Recreation
+            "rec.autos": "Arts",
+            "rec.motorcycles": "Arts",
+            "rec.sport.baseball": "Arts",
+            "rec.sport.hockey": "Arts",
         }
         for item in tqdm(newsgroups, desc="20 Newsgroups", leave=False):
             if len(records) >= max_samples:
                 break
             label_name = item.get("label_text", "")
             text = item.get("text", "")
             if label_name in newsgroup_map and text and len(text) > 100:
+                records.append(
+                    {
+                        "text": text[:1500],
+                        "topic": newsgroup_map[label_name],
+                        "source": "newsgroups",
+                    }
+                )
         print(f"    20 Newsgroups: {len(records):,}")
     except Exception as e:
         print(f"    20 Newsgroups failed: {e}")
     # Add from Gutenberg for Fiction
     gutenberg_topics = download_gutenberg_topics(max_samples // 4)
     records.extend(gutenberg_topics)
     # Add from scientific papers abstract dataset for more Science/Tech
     print("  Loading scientific papers...")
     try:
         sci_papers = load_dataset("scientific_papers", "arxiv", split="train", streaming=True)
         sci_count = 0
+        for item in tqdm(sci_papers, desc="Scientific papers", leave=False, total=max_samples // 4):
             if sci_count >= max_samples // 4:
                 break
             abstract = item.get("abstract", "")
             if abstract and len(abstract) > 100:
                 # Alternate between Science and Technology
                 topic = "Science" if sci_count % 2 == 0 else "Technology"
+                records.append(
+                    {
+                        "text": abstract[:1500],
+                        "topic": topic,
+                        "source": "scientific_papers",
+                    }
+                )
                 sci_count += 1
         print(f"    Scientific papers: {sci_count:,}")
     except Exception as e:
         print(f"    Scientific papers failed: {e}")
     return records
 def download_summarization(max_books: int = 20000, max_arxiv: int = 50000) -> None:
     """Download all summarization data (books + arxiv, NO news).
     Book data now uses Goodreads descriptions (back-cover blurbs) instead of
     plot summaries. This trains the model to describe "what the book is about"
     rather than summarizing the plot.
     """
     print("\nDownloading Summarization Data...")
     out_dir = OUTPUT_DIR / "summarization"
     all_records: list[dict[str, Any]] = []
     # Goodreads descriptions - primary book training data (back-cover style)
     goodreads_descriptions = download_goodreads_descriptions()
     book_records = download_book_descriptions(goodreads_descriptions, max_books)
     all_records.extend(book_records)
     # Optional: Add some BookSum for additional literary variety
     # These are chapter summaries, not back-cover style, so keep limited
     # booksum_records = download_booksum(max_books // 4)
     # all_records.extend(booksum_records)
     # arXiv - academic (abstracts are already "what is this paper about")
     arxiv_summ = download_arxiv_summarization(max_arxiv)
     all_records.extend(arxiv_summ)
     # Shuffle and split
     random.shuffle(all_records)
     # Split by original split if available, else 90/5/5
+    train_records = [
+        r for r in all_records if r.get("split", "train") == "train" or "split" not in r
+    ]
     val_records = [r for r in all_records if r.get("split") == "validation"]
     test_records = [r for r in all_records if r.get("split") == "test"]
     # If no split info, do 90/5/5
     if len(val_records) < 100:
         n = len(train_records)
         random.shuffle(train_records)
+        val_records = train_records[int(n * 0.9) : int(n * 0.95)]
+        test_records = train_records[int(n * 0.95) :]
+        train_records = train_records[: int(n * 0.9)]
     # Remove split key before saving
     for r in train_records + val_records + test_records:
         r.pop("split", None)
     write_jsonl(train_records, out_dir / "train.jsonl", "train")
     write_jsonl(val_records, out_dir / "validation.jsonl", "val")
     write_jsonl(test_records, out_dir / "test.jsonl", "test")
     # Print breakdown
+    literary_count = sum(
+        1 for r in train_records + val_records + test_records if r.get("type") == "literary"
+    )
+    academic_count = sum(
+        1 for r in train_records + val_records + test_records if r.get("type") == "academic"
+    )
     print(f"\n  Total summarization: {len(train_records) + len(val_records) + len(test_records):,}")
     print(f"    Literary (book descriptions): {literary_count:,}")
     print(f"    Academic (paper abstracts): {academic_count:,}")
 # ------------ TOPIC CLASSIFICATION ------------
 def download_topics(max_samples: int = 50000) -> None:
     """
     Download topic classification data from multiple sources.
     Sources:
     - 20 Newsgroups (classic topic dataset)
     - Gutenberg books (Fiction)
     - scientific_papers arXiv abstracts (Science/Technology)
     """
     print("\nDownloading Topic Classification...")
     out_dir = OUTPUT_DIR / "topic"
     # Get topic records from various sources
     all_records = download_topics_from_datasets(max_samples)
     # Balance topics
     topic_counts: dict[str, list] = {t: [] for t in TOPIC_LABELS}
     for r in all_records:
         topic = r.get("topic")
         if topic in topic_counts:
             topic_counts[topic].append(r)
     # Print distribution before balancing
     print("\n  Topic distribution (before balancing):")
     for topic, records in topic_counts.items():
         print(f"    {topic}: {len(records):,}")
     # Balance to min count (with some tolerance) - only from topics that have data
     counts_with_data = [len(v) for v in topic_counts.values() if v]
     if not counts_with_data:
         print("  Warning: No topic data found!")
         return
     min_count = min(counts_with_data)
     target_count = min(min_count, max_samples // len(TOPIC_LABELS))
     balanced: list[dict[str, Any]] = []
     for _topic, records in topic_counts.items():
         if records:
             random.shuffle(records)
             balanced.extend(records[:target_count])
     random.shuffle(balanced)
     # Split 90/5/5
     n = len(balanced)
+    train_records = balanced[: int(n * 0.9)]
+    val_records = balanced[int(n * 0.9) : int(n * 0.95)]
+    test_records = balanced[int(n * 0.95) :]
     write_jsonl(train_records, out_dir / "train.jsonl", "train")
     write_jsonl(val_records, out_dir / "validation.jsonl", "val")
     write_jsonl(test_records, out_dir / "test.jsonl", "test")
     # Save labels - only labels that have data
     used_labels = [t for t in TOPIC_LABELS if topic_counts.get(t)]
     (out_dir / "labels.json").write_text(json.dumps(used_labels, indent=2))
 def download_gutenberg_topics(max_samples: int = 30000) -> list[dict[str, Any]]:
     """Extract topic-labeled samples from Gutenberg books (English only)."""
     print("\nLoading Gutenberg for topic classification...")
     try:
         gutenberg = load_dataset("sedthh/gutenberg_english", split="train")
     except Exception:
         print("  Trying pg19...")
         gutenberg = load_dataset("pg19", split="train")
     records: list[dict[str, Any]] = []
     skipped_language = 0
     indices = list(range(len(gutenberg)))
     random.shuffle(indices)
     for i in tqdm(indices, desc="Gutenberg topics", leave=False):
         if len(records) >= max_samples:
             break
         item = gutenberg[i]
         text = item.get("TEXT", "") or item.get("text", "")
         metadata = item.get("METADATA", {}) or {}
         if not text or len(text) < 1000:
             continue
         # Try to determine topic from metadata
         subjects = ""
         if isinstance(metadata, dict):
             subjects = str(metadata.get("subjects", "")).lower()
             subjects += " " + str(metadata.get("subject", "")).lower()
             subjects += " " + str(metadata.get("category", "")).lower()
         topic = None
         for keyword, mapped_topic in GUTENBERG_SUBJECT_MAP.items():
             if keyword in subjects:
                 topic = mapped_topic
                 break
         # Default fiction for novels without clear subject
         if not topic and ("novel" in subjects or not subjects.strip()):
             topic = "Fiction"
         if topic:
             # Get a clean paragraph as sample
+            paragraphs = re.split(r"\n\s*\n", text)
             for para in paragraphs[5:]:  # Skip front matter
                 para = para.strip()
+                if 200 < len(para) < 1500 and para.count(".") >= 2:
                     # Filter: English only
                     if not is_english_text(para):
                         skipped_language += 1
                         break
+                    records.append(
+                        {
+                            "text": para,
+                            "topic": topic,
+                            "source": "gutenberg",
+                        }
+                    )
                     break
     print(f"    Gutenberg topics: {len(records):,} (skipped {skipped_language} non-English)")
     return records
 # ------------ EMOTIONS (unchanged) -------------
 def download_emotions() -> None:
     """Download GoEmotions for emotion classification."""
     print("\nDownloading Emotions (GoEmotions)...")
     out_dir = OUTPUT_DIR / "emotion"
     ds = load_dataset("google-research-datasets/go_emotions", "simplified")
     for split_name in ds.keys():
         split = str(split_name)
         data = ds[split_name]
         records: list[dict[str, Any]] = []
         for item in tqdm(data, desc=split, leave=False):
             text = item.get("text", "")
                 if emotions:
                     records.append({"text": text, "emotions": emotions})
         write_jsonl(records, out_dir / f"{split}.jsonl", split)
     (out_dir / "labels.json").write_text(json.dumps(EMOTION_LABELS, indent=2))
     print(f"  {len(EMOTION_LABELS)} emotion labels saved")
 # --------------- GUTENBERG BOOKS (for language modeling) ---------------
 GUTENBERG_JUNK_PATTERNS = [
+    r"Project Gutenberg",
+    r"www\.gutenberg\.org",
+    r"This ebook is for",
+    r"Gutenberg License",
+    r"^\*\*\* START OF",
+    r"^\*\*\* END OF",
+    r"Produced by",
+    r"Transcriber's Note",
+    r"TABLE OF CONTENTS",
+    r"^\s*CHAPTER\s+[IVXLC\d]+",
+    r"^\s*Chapter\s+[IVXLC\d]+",
+    r"^\s*BOOK\s+[IVXLC\d]+",
+    r"^\s*PREFACE\s*$",
+    r"^\s*INTRODUCTION\s*$",
+    r"E-text prepared by",
+    r"Internet Archive",
+    r"Distributed Proofreaders",
 ]
 GUTENBERG_JUNK_REGEX = re.compile("|".join(GUTENBERG_JUNK_PATTERNS), re.IGNORECASE)
         return False
     if GUTENBERG_JUNK_REGEX.search(text):
         return False
+    if text.count(".") < 2:
         return False
     uppercase_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
     if uppercase_ratio > 0.3:
     print("\nDownloading Gutenberg Books (English only)...")
     out_dir = OUTPUT_DIR / "books"
     out_dir.mkdir(parents=True, exist_ok=True)
     try:
         gutenberg = load_dataset("sedthh/gutenberg_english", split="train")
     except Exception:
         gutenberg = load_dataset("pg19", split="train")
     records: list[dict[str, Any]] = []
     indices = list(range(len(gutenberg)))
     random.shuffle(indices)
     for i in tqdm(indices, desc="Books", leave=False):
         if len(records) >= max_samples:
             break
         item = gutenberg[i]
         text = item.get("TEXT", "") or item.get("text", "")
         metadata_raw = item.get("METADATA", "") or "{}"
         # Parse metadata - it's stored as JSON string
         try:
             metadata = json.loads(metadata_raw) if isinstance(metadata_raw, str) else metadata_raw
         except (json.JSONDecodeError, TypeError):
             metadata = {}
         # Extract title and author
         title = metadata.get("title", "") if isinstance(metadata, dict) else ""
         author = metadata.get("author", "") if isinstance(metadata, dict) else ""
         if not title:
             title = item.get("title", f"Unknown Book #{i}")
         if not text or len(text) < 1000:
             continue
+        paragraphs = re.split(r"\n\s*\n", text)
         for para in paragraphs:
             para = para.strip()
             if is_clean_prose(para):
+                records.append(
+                    {"text": para, "title": title, "author": author, "type": "gutenberg"}
+                )
                 if len(records) >= max_samples:
                     break
     random.shuffle(records)
     n = len(records)
+    write_jsonl(records[: int(n * 0.9)], out_dir / "train.jsonl", "train")
+    write_jsonl(records[int(n * 0.9) : int(n * 0.95)], out_dir / "validation.jsonl", "val")
+    write_jsonl(records[int(n * 0.95) :], out_dir / "test.jsonl", "test")
 # ------------ MAIN ------------
 def main() -> None:
     parser = argparse.ArgumentParser(description="Download LexiMind datasets")
     parser.add_argument(
         "--task",
         choices=["all", "summarization", "emotion", "topic", "gutenberg"],
         default="all",
+        help="Dataset to download",
     )
     parser.add_argument("--max-books", type=int, default=40000, help="Max book samples")
     parser.add_argument("--max-arxiv", type=int, default=50000, help="Max arXiv samples")
     parser.add_argument("--max-topics", type=int, default=50000, help="Max topic samples")
     parser.add_argument("--seed", type=int, default=42, help="Random seed")
     args = parser.parse_args()
     random.seed(args.seed)
     print("=" * 60)
     print("LexiMind Dataset Download")
     print("Books + Academic Papers + Topic Classification")
     print("=" * 60)
     if args.task in ["all", "summarization"]:
         download_summarization(args.max_books, args.max_arxiv)
     if args.task in ["all", "emotion"]:
         download_emotions()
     if args.task in ["all", "topic"]:
         download_topics(args.max_topics)
     if args.task in ["all", "gutenberg"]:
         download_gutenberg(args.max_gutenberg)
     print("\n" + "=" * 60)
     print("Download complete!")
     print("=" * 60)
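
The title matching in `download_book_descriptions` depends on `normalize_title` producing the same key for a Gutenberg title and its Goodreads counterpart. The sketch below copies the four regex steps from the diff above and runs them on a few hypothetical titles (the titles are illustrative only, not taken from either dataset):

```python
import re


def normalize_title(title: str) -> str:
    """Normalize a book title for matching (same regex steps as in the diff)."""
    title = re.sub(r"^(The|A|An)\s+", "", title, flags=re.IGNORECASE)  # leading article
    title = re.sub(r"\s*\([^)]*\)\s*", "", title)  # parentheticals
    title = re.sub(r"\s*:.+$", "", title)  # subtitle after a colon
    title = re.sub(r"[^\w\s]", "", title)  # remaining punctuation
    return title.lower().strip()


# Hypothetical titles, chosen only to exercise each step.
examples = [
    "The Time Machine",
    "Frankenstein: Or, the Modern Prometheus",
    "A Tale of Two Cities (Illustrated)",
]
normalized = [normalize_title(t) for t in examples]
print(normalized)
```

Because the parenthetical pattern also consumes the whitespace on both sides, a parenthetical in the middle of a title fuses its neighbours ("Emma (Annotated) Edition" becomes "emmaedition"). That may be worth keeping in mind when match rates look low.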

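The summarization, topic, and Gutenberg writers above all use the same 90/5/5 slicing after a shuffle. A minimal sketch of that pattern follows. The helper name `split_90_5_5` and the local `random.Random(seed)` are assumptions made for illustration; the script itself seeds the module-level `random` once in `main()`.

```python
import random


def split_90_5_5(records: list, seed: int = 42) -> tuple[list, list, list]:
    """Shuffle a copy of records and slice it into train/val/test (90/5/5)."""
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # local RNG (assumption, see lead-in)
    n = len(shuffled)
    train_end, val_end = int(n * 0.9), int(n * 0.95)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


train, val, test = split_90_5_5(list(range(1000)))
print(len(train), len(val), len(test))
```

Since the boundaries are computed with `int()`, the three slices always partition the input exactly, with any rounding remainder landing in the test slice.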
scripts/evaluate.py CHANGED

@@ -65,34 +65,34 @@ def evaluate_summarization(
     print("\n" + "=" * 60)
     print("SUMMARIZATION EVALUATION")
     print("=" * 60)
     # Load data - try to get domain info from the raw JSONL
     raw_data = []
     with open(data_path) as f:
         for line in f:
             if line.strip():
                 raw_data.append(json.loads(line))
     data = load_summarization_jsonl(str(data_path))
     if max_samples:
         data = data[:max_samples]
         raw_data = raw_data[:max_samples]
     print(f"Evaluating on {len(data)} samples...")
     # Generate summaries
     predictions = []
     references = []
     domains = []  # Track domain for per-domain breakdown
     for i in tqdm(range(0, len(data), batch_size), desc="Generating summaries"):
-        batch = data[i:i + batch_size]
+        batch = data[i : i + batch_size]
         sources = [ex.source for ex in batch]
         refs = [ex.summary for ex in batch]
         preds = pipeline.summarize(sources)
         predictions.extend(preds)
         references.extend(refs)
         # Track domain if available
         for j in range(len(batch)):
             idx = i + j
@@ -101,14 +101,14 @@ def evaluate_summarization(
                 domains.append(domain)
             else:
                 domains.append("unknown")
     # Calculate overall metrics
     print("\nCalculating ROUGE scores...")
     rouge_scores = calculate_rouge(predictions, references)
     print("Calculating BLEU score...")
     bleu = calculate_bleu(predictions, references)
     metrics: dict = {
         "rouge1": rouge_scores["rouge1"],
         "rouge2": rouge_scores["rouge2"],
@@ -116,14 +116,14 @@ def evaluate_summarization(
         "bleu4": bleu,
         "num_samples": len(predictions),
     }
     if include_bertscore:
         print("Calculating BERTScore (this may take a few minutes)...")
         bert_scores = calculate_bertscore(predictions, references)
         metrics["bertscore_precision"] = bert_scores["precision"]
         metrics["bertscore_recall"] = bert_scores["recall"]
         metrics["bertscore_f1"] = bert_scores["f1"]
     # Per-domain breakdown
     unique_domains = sorted(set(domains))
     if len(unique_domains) > 1:
@@ -150,25 +150,26 @@ def evaluate_summarization(
                 dm["bertscore_f1"] = d_bert["f1"]
             domain_metrics[domain] = dm
         metrics["per_domain"] = domain_metrics
     # Bootstrap confidence intervals
     if compute_bootstrap:
         try:
             from rouge_score import rouge_scorer
-            scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
+            scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
             per_sample_r1 = []
             per_sample_rL = []
             for pred, ref in zip(predictions, references, strict=True):
                 scores = scorer.score(ref, pred)
-                per_sample_r1.append(scores['rouge1'].fmeasure)
-                per_sample_rL.append(scores['rougeL'].fmeasure)
+                per_sample_r1.append(scores["rouge1"].fmeasure)
+                per_sample_rL.append(scores["rougeL"].fmeasure)
             r1_mean, r1_lo, r1_hi = bootstrap_confidence_interval(per_sample_r1)
             rL_mean, rL_lo, rL_hi = bootstrap_confidence_interval(per_sample_rL)
             metrics["rouge1_ci"] = {"mean": r1_mean, "lower": r1_lo, "upper": r1_hi}
             metrics["rougeL_ci"] = {"mean": rL_mean, "lower": rL_lo, "upper": rL_hi}
         except ImportError:
             pass
     # Print results
     print("\n" + "-" * 40)
     print("SUMMARIZATION RESULTS:")
@@ -181,27 +182,29 @@ def evaluate_summarization(
         print(f"  BERTScore P: {metrics['bertscore_precision']:.4f}")
         print(f"  BERTScore R: {metrics['bertscore_recall']:.4f}")
         print(f"  BERTScore F: {metrics['bertscore_f1']:.4f}")
     if "per_domain" in metrics:
         print("\n  Per-Domain Breakdown:")
         for domain, dm in metrics["per_domain"].items():
             bs_str = f", BS-F1={dm['bertscore_f1']:.4f}" if "bertscore_f1" in dm else ""
-            print(f"    {domain} (n={dm['num_samples']}): R1={dm['rouge1']:.4f}, RL={dm['rougeL']:.4f}, B4={dm['bleu4']:.4f}{bs_str}")
+            print(
+                f"    {domain} (n={dm['num_samples']}): R1={dm['rouge1']:.4f}, RL={dm['rougeL']:.4f}, B4={dm['bleu4']:.4f}{bs_str}"
+            )
     if "rouge1_ci" in metrics:
         ci = metrics["rouge1_ci"]
         print(f"\n  ROUGE-1 95% CI: [{ci['lower']:.4f}, {ci['upper']:.4f}]")
     # Show examples
     print("\n" + "-" * 40)
     print("SAMPLE OUTPUTS:")
     print("-" * 40)
     for i in range(min(3, len(predictions))):
-        print(f"\nExample {i+1}:")
+        print(f"\nExample {i + 1}:")
         print(f"  Source:    {data[i].source[:100]}...")
         print(f"  Generated: {predictions[i][:150]}...")
         print(f"  Reference: {references[i][:150]}...")
     return metrics
@@ -214,62 +217,64 @@ def evaluate_emotion(
     compute_bootstrap: bool = False,
 ) -> dict:
     """Evaluate emotion detection with comprehensive multi-label metrics.
     Reports sample-averaged F1, macro F1, micro F1, and per-class breakdown.
     Optionally tunes per-class thresholds on the evaluation set.
     """
     print("\n" + "=" * 60)
     print("EMOTION DETECTION EVALUATION")
     print("=" * 60)
     # Load data (returns EmotionExample dataclass objects)
     data = load_emotion_jsonl(str(data_path))
     if max_samples:
         data = data[:max_samples]
     print(f"Evaluating on {len(data)} samples...")
     # Get predictions - collect raw logits for threshold tuning
     all_preds = []
     all_refs = []
     all_logits_list = []
     for i in tqdm(range(0, len(data), batch_size), desc="Predicting emotions"):
-        batch = data[i:i + batch_size]
+        batch = data[i : i + batch_size]
         texts = [ex.text for ex in batch]
         refs = [set(ex.emotions) for ex in batch]
         preds = pipeline.predict_emotions(texts)
         pred_sets = [set(p.labels) for p in preds]
         all_preds.extend(pred_sets)
         all_refs.extend(refs)
         # Also get raw logits for threshold tuning
         if tune_thresholds:
             encoded = pipeline.tokenizer.batch_encode(texts)
             input_ids = encoded["input_ids"].to(pipeline.device)
             attention_mask = encoded["attention_mask"].to(pipeline.device)
             with torch.inference_mode():
-                logits = pipeline.model.forward("emotion", {"input_ids": input_ids, "attention_mask": attention_mask})
                 all_logits_list.append(logits.cpu())
     # Calculate metrics
     all_emotions = sorted(pipeline.emotion_labels)
     def to_binary(emotion_sets, labels):
         return [[1 if e in es else 0 for e in labels] for es in emotion_sets]
     pred_binary = torch.tensor(to_binary(all_preds, all_emotions))
     ref_binary = torch.tensor(to_binary(all_refs, all_emotions))
     # Core metrics: sample-avg F1, macro F1, micro F1
     sample_f1 = multilabel_f1(pred_binary, ref_binary)
     macro_f1 = multilabel_macro_f1(pred_binary, ref_binary)
     micro_f1 = multilabel_micro_f1(pred_binary, ref_binary)
     # Per-class metrics
     per_class = multilabel_per_class_metrics(pred_binary, ref_binary, class_names=all_emotions)
     metrics: dict = {
         "sample_avg_f1": sample_f1,
         "macro_f1": macro_f1,
@@ -278,7 +283,7 @@ def evaluate_emotion(
         "num_classes": len(all_emotions),
         "per_class": per_class,
     }
     # Per-class threshold tuning
     if tune_thresholds and all_logits_list:
         print("\nTuning per-class thresholds...")
@@ -288,7 +293,7 @@ def evaluate_emotion(
             name: thresh for name, thresh in zip(all_emotions, best_thresholds, strict=True)
         }
         metrics["tuned_macro_f1"] = tuned_macro_f1
         # Also compute tuned sample-avg F1
         probs = torch.sigmoid(all_logits)
         tuned_preds = torch.zeros_like(probs)
@@ -296,7 +301,7 @@ def evaluate_emotion(
             tuned_preds[:, c] = (probs[:, c] >= t).float()
         metrics["tuned_sample_avg_f1"] = multilabel_f1(tuned_preds, ref_binary)
         metrics["tuned_micro_f1"] = multilabel_micro_f1(tuned_preds, ref_binary)
     # Bootstrap confidence intervals
     if compute_bootstrap:
         # Compute per-sample F1 for bootstrapping
@@ -313,7 +318,7 @@ def evaluate_emotion(
                 per_sample_f1s.append(2 * p * r / (p + r) if (p + r) > 0 else 0.0)
         mean, lo, hi = bootstrap_confidence_interval(per_sample_f1s)
         metrics["sample_f1_ci"] = {"mean": mean, "lower": lo, "upper": hi}
     # Print results
     print("\n" + "-" * 40)
     print("EMOTION DETECTION RESULTS:")
@@ -322,23 +327,25 @@ def evaluate_emotion(
     print(f"  Macro F1:      {metrics['macro_f1']:.4f}")
     print(f"  Micro F1:      {metrics['micro_f1']:.4f}")
     print(f"  Num Classes:   {metrics['num_classes']}")
     if "tuned_macro_f1" in metrics:
         print("\n  After per-class threshold tuning:")
         print(f"    Tuned Macro F1:      {metrics['tuned_macro_f1']:.4f}")
         print(f"    Tuned Sample-avg F1: {metrics['tuned_sample_avg_f1']:.4f}")
         print(f"    Tuned Micro F1:      {metrics['tuned_micro_f1']:.4f}")
     if "sample_f1_ci" in metrics:
         ci = metrics["sample_f1_ci"]
         print(f"\n  Sample F1 95% CI: [{ci['lower']:.4f}, {ci['upper']:.4f}]")
     # Print top-10 per-class performance
     print("\n  Per-class F1 (top 10 by support):")
     sorted_classes = sorted(per_class.items(), key=lambda x: x[1]["support"], reverse=True)
     for name, m in sorted_classes[:10]:
-        print(f"    {name:20s}: P={m['precision']:.3f} R={m['recall']:.3f} F1={m['f1']:.3f} (n={m['support']})")
     return metrics
@@ -353,61 +360,63 @@ def evaluate_topic(
     print("\n" + "=" * 60)
     print("TOPIC CLASSIFICATION EVALUATION")
     print("=" * 60)
     # Load data (returns TopicExample dataclass objects)
     data = load_topic_jsonl(str(data_path))
     if max_samples:
         data = data[:max_samples]
     print(f"Evaluating on {len(data)} samples...")
     # Get predictions
     all_preds = []
     all_refs = []
     for i in tqdm(range(0, len(data), batch_size), desc="Predicting topics"):
-        batch = data[i:i + batch_size]
         texts = [ex.text for ex in batch]
         refs = [ex.topic for ex in batch]
         preds = pipeline.predict_topics(texts)
         pred_labels = [p.label for p in preds]
         all_preds.extend(pred_labels)
         all_refs.extend(refs)
     # Calculate metrics
     accuracy = accuracy_score(all_refs, all_preds)
     macro_f1 = f1_score(all_refs, all_preds, average="macro", zero_division=0)
     metrics: dict = {
         "accuracy": accuracy,
         "macro_f1": macro_f1,
         "num_samples": len(all_preds),
     }
     # Bootstrap confidence intervals for accuracy
     if compute_bootstrap:
-        per_sample_correct = [1.0 if p == r else 0.0 for p, r in zip(all_preds, all_refs, strict=True)]
         mean, lo, hi = bootstrap_confidence_interval(per_sample_correct)
         metrics["accuracy_ci"] = {"mean": mean, "lower": lo, "upper": hi}
     # Print results
     print("\n" + "-" * 40)
     print("TOPIC CLASSIFICATION RESULTS:")
     print("-" * 40)
-    print(f"  Accuracy:  {metrics['accuracy']:.4f} ({metrics['accuracy']*100:.1f}%)")
     print(f"  Macro F1:  {metrics['macro_f1']:.4f}")
     if "accuracy_ci" in metrics:
         ci = metrics["accuracy_ci"]
         print(f"  Accuracy 95% CI: [{ci['lower']:.4f}, {ci['upper']:.4f}]")
     # Classification report
     print("\n" + "-" * 40)
     print("PER-CLASS METRICS:")
     print("-" * 40)
     print(classification_report(all_refs, all_preds, zero_division=0))
     return metrics
@@ -418,20 +427,28 @@ def main():
     parser.add_argument("--data-dir", type=Path, default=Path("data/processed"))
     parser.add_argument("--output", type=Path, default=Path("outputs/evaluation_report.json"))
     parser.add_argument("--max-samples", type=int, default=None, help="Limit samples per task")
-    parser.add_argument("--include-bertscore", action="store_true", help="Include BERTScore (slow, optional)")
-    parser.add_argument("--tune-thresholds", action="store_true", help="Tune per-class emotion thresholds on val set")
-    parser.add_argument("--bootstrap", action="store_true", help="Compute bootstrap confidence intervals")
     parser.add_argument("--summarization-only", action="store_true")
     parser.add_argument("--emotion-only", action="store_true")
     parser.add_argument("--topic-only", action="store_true")
     args = parser.parse_args()
     print("=" * 60)
     print("LexiMind Evaluation")
     print("=" * 60)
     start_time = time.perf_counter()
     # Load model
     print(f"\nLoading model from {args.checkpoint}...")
     device = "cuda" if torch.cuda.is_available() else "cpu"
@@ -443,12 +460,12 @@ def main():
     print(f"  Device: {device}")
     print(f"  Topics: {labels.topic}")
     print(f"  Emotions: {len(labels.emotion)} classes")
     results = {}
     # Determine which tasks to evaluate
     eval_all = not (args.summarization_only or args.emotion_only or args.topic_only)
     # Evaluate summarization
     if eval_all or args.summarization_only:
         val_path = args.data_dir / "summarization" / "validation.jsonl"
@@ -456,14 +473,15 @@ def main():
             val_path = args.data_dir / "summarization" / "val.jsonl"
         if val_path.exists():
             results["summarization"] = evaluate_summarization(
-                pipeline, val_path,
                 max_samples=args.max_samples,
                 include_bertscore=args.include_bertscore,
                 compute_bootstrap=args.bootstrap,
             )
         else:
             print("Warning: summarization validation data not found, skipping")
     # Evaluate emotion
     if eval_all or args.emotion_only:
         val_path = args.data_dir / "emotion" / "validation.jsonl"
@@ -471,14 +489,15 @@ def main():
             val_path = args.data_dir / "emotion" / "val.jsonl"
         if val_path.exists():
             results["emotion"] = evaluate_emotion(
-                pipeline, val_path,
                 max_samples=args.max_samples,
                 tune_thresholds=args.tune_thresholds,
                 compute_bootstrap=args.bootstrap,
             )
         else:
             print("Warning: emotion validation data not found, skipping")
     # Evaluate topic
     if eval_all or args.topic_only:
         val_path = args.data_dir / "topic" / "validation.jsonl"
@@ -486,30 +505,31 @@ def main():
             val_path = args.data_dir / "topic" / "val.jsonl"
         if val_path.exists():
             results["topic"] = evaluate_topic(
-                pipeline, val_path,
                 max_samples=args.max_samples,
                 compute_bootstrap=args.bootstrap,
             )
         else:
             print("Warning: topic validation data not found, skipping")
     # Save results
     print("\n" + "=" * 60)
     print("SAVING RESULTS")
     print("=" * 60)
     args.output.parent.mkdir(parents=True, exist_ok=True)
     with open(args.output, "w") as f:
         json.dump(results, f, indent=2)
     print(f"  Saved to: {args.output}")
     # Final summary
     elapsed = time.perf_counter() - start_time
     print("\n" + "=" * 60)
     print("EVALUATION COMPLETE")
     print("=" * 60)
-    print(f"  Time: {elapsed/60:.1f} minutes")
     if "summarization" in results:
         s = results["summarization"]
         print("\n  Summarization:")
@@ -519,14 +539,14 @@ def main():
         print(f"    BLEU-4:  {s['bleu4']:.4f}")
         if "bertscore_f1" in s:
             print(f"    BERTScore F1: {s['bertscore_f1']:.4f}")
     if "emotion" in results:
         e = results["emotion"]
         print("\n  Emotion:")
         print(f"    Sample-avg F1: {e['sample_avg_f1']:.4f}")
         print(f"    Macro F1:      {e['macro_f1']:.4f}")
         print(f"    Micro F1:      {e['micro_f1']:.4f}")
     if "topic" in results:
         print("\n  Topic:")
         print(f"    Accuracy: {results['topic']['accuracy']:.2%}")
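Note on the emotion metrics above: the evaluation turns each set of emotion labels into a multi-hot row (`to_binary`) and, for bootstrapping, computes a per-sample F1 of the form `2 * p * r / (p + r)`. The sketch below shows that computation on plain Python sets. The helper names `per_sample_f1` and `sample_avg_f1` are hypothetical, and the repository's `multilabel_f1` may treat edge cases (for example, empty prediction and reference sets) differently.

```python
from typing import Iterable, List, Sequence, Set


def to_binary(label_sets: Iterable[Set[str]], labels: Sequence[str]) -> List[List[int]]:
    """Encode each label set as a multi-hot row ordered like `labels`."""
    return [[1 if name in s else 0 for name in labels] for s in label_sets]


def per_sample_f1(pred: Set[str], ref: Set[str]) -> float:
    """F1 between one predicted label set and one reference label set (hypothetical helper)."""
    true_pos = len(pred & ref)
    p = true_pos / len(pred) if pred else 0.0  # precision
    r = true_pos / len(ref) if ref else 0.0    # recall
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def sample_avg_f1(preds: Sequence[Set[str]], refs: Sequence[Set[str]]) -> float:
    """Mean of the per-sample F1 scores (hypothetical helper)."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    scores = [per_sample_f1(p, r) for p, r in zip(preds, refs)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    preds = [{"joy"}, {"anger", "fear"}, set()]
    refs = [{"joy", "love"}, {"anger"}, {"sadness"}]
    print(to_binary(preds, ["anger", "fear", "joy"]))
    print(round(sample_avg_f1(preds, refs), 4))
```

Because every sample contributes equally, this average can stay low when many samples have no correct prediction, even if frequent classes score well on their own.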

     print("\n" + "=" * 60)
     print("SUMMARIZATION EVALUATION")
     print("=" * 60)
     # Load data - try to get domain info from the raw JSONL
     raw_data = []
     with open(data_path) as f:
         for line in f:
             if line.strip():
                 raw_data.append(json.loads(line))
     data = load_summarization_jsonl(str(data_path))
     if max_samples:
         data = data[:max_samples]
         raw_data = raw_data[:max_samples]
     print(f"Evaluating on {len(data)} samples...")
     # Generate summaries
     predictions = []
     references = []
     domains = []  # Track domain for per-domain breakdown
     for i in tqdm(range(0, len(data), batch_size), desc="Generating summaries"):
+        batch = data[i : i + batch_size]
         sources = [ex.source for ex in batch]
         refs = [ex.summary for ex in batch]
         preds = pipeline.summarize(sources)
         predictions.extend(preds)
         references.extend(refs)
         # Track domain if available
         for j in range(len(batch)):
             idx = i + j
                 domains.append(domain)
             else:
                 domains.append("unknown")
     # Calculate overall metrics
     print("\nCalculating ROUGE scores...")
     rouge_scores = calculate_rouge(predictions, references)
     print("Calculating BLEU score...")
     bleu = calculate_bleu(predictions, references)
     metrics: dict = {
         "rouge1": rouge_scores["rouge1"],
         "rouge2": rouge_scores["rouge2"],
         "bleu4": bleu,
         "num_samples": len(predictions),
     }
     if include_bertscore:
         print("Calculating BERTScore (this may take a few minutes)...")
         bert_scores = calculate_bertscore(predictions, references)
         metrics["bertscore_precision"] = bert_scores["precision"]
         metrics["bertscore_recall"] = bert_scores["recall"]
         metrics["bertscore_f1"] = bert_scores["f1"]
     # Per-domain breakdown
     unique_domains = sorted(set(domains))
     if len(unique_domains) > 1:
                 dm["bertscore_f1"] = d_bert["f1"]
             domain_metrics[domain] = dm
         metrics["per_domain"] = domain_metrics
     # Bootstrap confidence intervals
     if compute_bootstrap:
         try:
             from rouge_score import rouge_scorer
+            scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
             per_sample_r1 = []
             per_sample_rL = []
             for pred, ref in zip(predictions, references, strict=True):
                 scores = scorer.score(ref, pred)
+                per_sample_r1.append(scores["rouge1"].fmeasure)
+                per_sample_rL.append(scores["rougeL"].fmeasure)
             r1_mean, r1_lo, r1_hi = bootstrap_confidence_interval(per_sample_r1)
             rL_mean, rL_lo, rL_hi = bootstrap_confidence_interval(per_sample_rL)
             metrics["rouge1_ci"] = {"mean": r1_mean, "lower": r1_lo, "upper": r1_hi}
             metrics["rougeL_ci"] = {"mean": rL_mean, "lower": rL_lo, "upper": rL_hi}
         except ImportError:
             pass
     # Print results
     print("\n" + "-" * 40)
     print("SUMMARIZATION RESULTS:")
         print(f"  BERTScore P: {metrics['bertscore_precision']:.4f}")
         print(f"  BERTScore R: {metrics['bertscore_recall']:.4f}")
         print(f"  BERTScore F: {metrics['bertscore_f1']:.4f}")
     if "per_domain" in metrics:
         print("\n  Per-Domain Breakdown:")
         for domain, dm in metrics["per_domain"].items():
             bs_str = f", BS-F1={dm['bertscore_f1']:.4f}" if "bertscore_f1" in dm else ""
+            print(
+                f"    {domain} (n={dm['num_samples']}): R1={dm['rouge1']:.4f}, RL={dm['rougeL']:.4f}, B4={dm['bleu4']:.4f}{bs_str}"
+            )
     if "rouge1_ci" in metrics:
         ci = metrics["rouge1_ci"]
         print(f"\n  ROUGE-1 95% CI: [{ci['lower']:.4f}, {ci['upper']:.4f}]")
     # Show examples
     print("\n" + "-" * 40)
     print("SAMPLE OUTPUTS:")
     print("-" * 40)
     for i in range(min(3, len(predictions))):
+        print(f"\nExample {i + 1}:")
         print(f"  Source:    {data[i].source[:100]}...")
         print(f"  Generated: {predictions[i][:150]}...")
         print(f"  Reference: {references[i][:150]}...")
     return metrics
     compute_bootstrap: bool = False,
 ) -> dict:
     """Evaluate emotion detection with comprehensive multi-label metrics.
     Reports sample-averaged F1, macro F1, micro F1, and per-class breakdown.
     Optionally tunes per-class thresholds on the evaluation set.
     """
     print("\n" + "=" * 60)
     print("EMOTION DETECTION EVALUATION")
     print("=" * 60)
     # Load data (returns EmotionExample dataclass objects)
     data = load_emotion_jsonl(str(data_path))
     if max_samples:
         data = data[:max_samples]
     print(f"Evaluating on {len(data)} samples...")
     # Get predictions - collect raw logits for threshold tuning
     all_preds = []
     all_refs = []
     all_logits_list = []
     for i in tqdm(range(0, len(data), batch_size), desc="Predicting emotions"):
+        batch = data[i : i + batch_size]
         texts = [ex.text for ex in batch]
         refs = [set(ex.emotions) for ex in batch]
         preds = pipeline.predict_emotions(texts)
         pred_sets = [set(p.labels) for p in preds]
         all_preds.extend(pred_sets)
         all_refs.extend(refs)
         # Also get raw logits for threshold tuning
         if tune_thresholds:
             encoded = pipeline.tokenizer.batch_encode(texts)
             input_ids = encoded["input_ids"].to(pipeline.device)
             attention_mask = encoded["attention_mask"].to(pipeline.device)
             with torch.inference_mode():
+                logits = pipeline.model.forward(
+                    "emotion", {"input_ids": input_ids, "attention_mask": attention_mask}
+                )
                 all_logits_list.append(logits.cpu())
     # Calculate metrics
     all_emotions = sorted(pipeline.emotion_labels)
     def to_binary(emotion_sets, labels):
         return [[1 if e in es else 0 for e in labels] for es in emotion_sets]
     pred_binary = torch.tensor(to_binary(all_preds, all_emotions))
     ref_binary = torch.tensor(to_binary(all_refs, all_emotions))
     # Core metrics: sample-avg F1, macro F1, micro F1
     sample_f1 = multilabel_f1(pred_binary, ref_binary)
     macro_f1 = multilabel_macro_f1(pred_binary, ref_binary)
     micro_f1 = multilabel_micro_f1(pred_binary, ref_binary)
     # Per-class metrics
     per_class = multilabel_per_class_metrics(pred_binary, ref_binary, class_names=all_emotions)
     metrics: dict = {
         "sample_avg_f1": sample_f1,
         "macro_f1": macro_f1,
         "num_classes": len(all_emotions),
         "per_class": per_class,
     }
     # Per-class threshold tuning
     if tune_thresholds and all_logits_list:
         print("\nTuning per-class thresholds...")
             name: thresh for name, thresh in zip(all_emotions, best_thresholds, strict=True)
         }
         metrics["tuned_macro_f1"] = tuned_macro_f1
         # Also compute tuned sample-avg F1
         probs = torch.sigmoid(all_logits)
         tuned_preds = torch.zeros_like(probs)
             tuned_preds[:, c] = (probs[:, c] >= t).float()
         metrics["tuned_sample_avg_f1"] = multilabel_f1(tuned_preds, ref_binary)
         metrics["tuned_micro_f1"] = multilabel_micro_f1(tuned_preds, ref_binary)
     # Bootstrap confidence intervals
     if compute_bootstrap:
         # Compute per-sample F1 for bootstrapping
                 per_sample_f1s.append(2 * p * r / (p + r) if (p + r) > 0 else 0.0)
         mean, lo, hi = bootstrap_confidence_interval(per_sample_f1s)
         metrics["sample_f1_ci"] = {"mean": mean, "lower": lo, "upper": hi}
     # Print results
     print("\n" + "-" * 40)
     print("EMOTION DETECTION RESULTS:")
     print(f"  Macro F1:      {metrics['macro_f1']:.4f}")
     print(f"  Micro F1:      {metrics['micro_f1']:.4f}")
     print(f"  Num Classes:   {metrics['num_classes']}")
     if "tuned_macro_f1" in metrics:
         print("\n  After per-class threshold tuning:")
         print(f"    Tuned Macro F1:      {metrics['tuned_macro_f1']:.4f}")
         print(f"    Tuned Sample-avg F1: {metrics['tuned_sample_avg_f1']:.4f}")
         print(f"    Tuned Micro F1:      {metrics['tuned_micro_f1']:.4f}")
     if "sample_f1_ci" in metrics:
         ci = metrics["sample_f1_ci"]
         print(f"\n  Sample F1 95% CI: [{ci['lower']:.4f}, {ci['upper']:.4f}]")
     # Print top-10 per-class performance
     print("\n  Per-class F1 (top 10 by support):")
     sorted_classes = sorted(per_class.items(), key=lambda x: x[1]["support"], reverse=True)
     for name, m in sorted_classes[:10]:
+        print(
+            f"    {name:20s}: P={m['precision']:.3f} R={m['recall']:.3f} F1={m['f1']:.3f} (n={m['support']})"
+        )
     return metrics
     print("\n" + "=" * 60)
     print("TOPIC CLASSIFICATION EVALUATION")
     print("=" * 60)
     # Load data (returns TopicExample dataclass objects)
     data = load_topic_jsonl(str(data_path))
     if max_samples:
         data = data[:max_samples]
     print(f"Evaluating on {len(data)} samples...")
     # Get predictions
     all_preds = []
     all_refs = []
     for i in tqdm(range(0, len(data), batch_size), desc="Predicting topics"):
+        batch = data[i : i + batch_size]
         texts = [ex.text for ex in batch]
         refs = [ex.topic for ex in batch]
         preds = pipeline.predict_topics(texts)
         pred_labels = [p.label for p in preds]
         all_preds.extend(pred_labels)
         all_refs.extend(refs)
     # Calculate metrics
     accuracy = accuracy_score(all_refs, all_preds)
     macro_f1 = f1_score(all_refs, all_preds, average="macro", zero_division=0)
     metrics: dict = {
         "accuracy": accuracy,
         "macro_f1": macro_f1,
         "num_samples": len(all_preds),
     }
     # Bootstrap confidence intervals for accuracy
     if compute_bootstrap:
+        per_sample_correct = [
+            1.0 if p == r else 0.0 for p, r in zip(all_preds, all_refs, strict=True)
+        ]
         mean, lo, hi = bootstrap_confidence_interval(per_sample_correct)
         metrics["accuracy_ci"] = {"mean": mean, "lower": lo, "upper": hi}
     # Print results
     print("\n" + "-" * 40)
     print("TOPIC CLASSIFICATION RESULTS:")
     print("-" * 40)
+    print(f"  Accuracy:  {metrics['accuracy']:.4f} ({metrics['accuracy'] * 100:.1f}%)")
     print(f"  Macro F1:  {metrics['macro_f1']:.4f}")
     if "accuracy_ci" in metrics:
         ci = metrics["accuracy_ci"]
         print(f"  Accuracy 95% CI: [{ci['lower']:.4f}, {ci['upper']:.4f}]")
     # Classification report
     print("\n" + "-" * 40)
     print("PER-CLASS METRICS:")
     print("-" * 40)
     print(classification_report(all_refs, all_preds, zero_division=0))
     return metrics
     parser.add_argument("--data-dir", type=Path, default=Path("data/processed"))
     parser.add_argument("--output", type=Path, default=Path("outputs/evaluation_report.json"))
     parser.add_argument("--max-samples", type=int, default=None, help="Limit samples per task")
+    parser.add_argument(
+        "--include-bertscore", action="store_true", help="Include BERTScore (slow, optional)"
+    )
+    parser.add_argument(
+        "--tune-thresholds",
+        action="store_true",
+        help="Tune per-class emotion thresholds on val set",
+    )
+    parser.add_argument(
+        "--bootstrap", action="store_true", help="Compute bootstrap confidence intervals"
+    )
     parser.add_argument("--summarization-only", action="store_true")
     parser.add_argument("--emotion-only", action="store_true")
     parser.add_argument("--topic-only", action="store_true")
     args = parser.parse_args()
     print("=" * 60)
     print("LexiMind Evaluation")
     print("=" * 60)
     start_time = time.perf_counter()
     # Load model
     print(f"\nLoading model from {args.checkpoint}...")
     device = "cuda" if torch.cuda.is_available() else "cpu"
     print(f"  Device: {device}")
     print(f"  Topics: {labels.topic}")
     print(f"  Emotions: {len(labels.emotion)} classes")
     results = {}
     # Determine which tasks to evaluate
     eval_all = not (args.summarization_only or args.emotion_only or args.topic_only)
     # Evaluate summarization
     if eval_all or args.summarization_only:
         val_path = args.data_dir / "summarization" / "validation.jsonl"
             val_path = args.data_dir / "summarization" / "val.jsonl"
         if val_path.exists():
             results["summarization"] = evaluate_summarization(
+                pipeline,
+                val_path,
                 max_samples=args.max_samples,
                 include_bertscore=args.include_bertscore,
                 compute_bootstrap=args.bootstrap,
             )
         else:
             print("Warning: summarization validation data not found, skipping")
     # Evaluate emotion
     if eval_all or args.emotion_only:
         val_path = args.data_dir / "emotion" / "validation.jsonl"
             val_path = args.data_dir / "emotion" / "val.jsonl"
         if val_path.exists():
             results["emotion"] = evaluate_emotion(
+                pipeline,
+                val_path,
                 max_samples=args.max_samples,
                 tune_thresholds=args.tune_thresholds,
                 compute_bootstrap=args.bootstrap,
             )
         else:
             print("Warning: emotion validation data not found, skipping")
     # Evaluate topic
     if eval_all or args.topic_only:
         val_path = args.data_dir / "topic" / "validation.jsonl"
             val_path = args.data_dir / "topic" / "val.jsonl"
         if val_path.exists():
             results["topic"] = evaluate_topic(
+                pipeline,
+                val_path,
                 max_samples=args.max_samples,
                 compute_bootstrap=args.bootstrap,
             )
         else:
             print("Warning: topic validation data not found, skipping")
     # Save results
     print("\n" + "=" * 60)
     print("SAVING RESULTS")
     print("=" * 60)
     args.output.parent.mkdir(parents=True, exist_ok=True)
     with open(args.output, "w") as f:
         json.dump(results, f, indent=2)
     print(f"  Saved to: {args.output}")
     # Final summary
     elapsed = time.perf_counter() - start_time
     print("\n" + "=" * 60)
     print("EVALUATION COMPLETE")
     print("=" * 60)
+    print(f"  Time: {elapsed / 60:.1f} minutes")
     if "summarization" in results:
         s = results["summarization"]
         print("\n  Summarization:")
         print(f"    BLEU-4:  {s['bleu4']:.4f}")
         if "bertscore_f1" in s:
             print(f"    BERTScore F1: {s['bertscore_f1']:.4f}")
     if "emotion" in results:
         e = results["emotion"]
         print("\n  Emotion:")
         print(f"    Sample-avg F1: {e['sample_avg_f1']:.4f}")
         print(f"    Macro F1:      {e['macro_f1']:.4f}")
         print(f"    Micro F1:      {e['micro_f1']:.4f}")
     if "topic" in results:
         print("\n  Topic:")
         print(f"    Accuracy: {results['topic']['accuracy']:.2%}")
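Note on `--bootstrap`: the three evaluators pass a list of per-sample scores to `bootstrap_confidence_interval` and unpack `(mean, lo, hi)`. Its implementation is not part of this diff, so the following is only a minimal sketch of a percentile bootstrap that returns the same shape. The name `bootstrap_ci` and the parameters `n_resamples`, `confidence`, and `seed` are assumptions, not the repository's signature.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 1000,   # assumed default, not taken from the repository
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap for the mean; returns (mean, lower, upper)."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lo_idx = int(alpha * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    return sum(values) / n, means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Per-sample correctness, as in the topic evaluation: 1.0 if correct else 0.0.
    per_sample_correct = [1.0] * 80 + [0.0] * 20
    mean, lo, hi = bootstrap_ci(per_sample_correct)
    print(f"accuracy={mean:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

The interval narrows as the number of evaluation samples grows, so `--max-samples` and `--bootstrap` together will typically report wider intervals than a full run.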

scripts/profile_training.py CHANGED

@@ -96,10 +96,12 @@ def main(cfg: DictConfig) -> None:
     tok_cfg = data_cfg.get("tokenizer", {})
     max_len = int(cfg.training.get("tokenizer_max_length") or tok_cfg.get("max_length", 512))
-    tokenizer = Tokenizer(TokenizerConfig(
-        pretrained_model_name=tok_cfg.get("pretrained_model_name", "google/flan-t5-base"),
-        max_length=max_len,
-    ))
     summ_train = SummarizationDataset(summ_splits["train"])
     emot_train = EmotionDataset(emot_splits["train"])
@@ -112,23 +114,42 @@ def main(cfg: DictConfig) -> None:
     train_loaders = {
         "summarization": build_summarization_dataloader(
-            summ_train, tokenizer, shuffle=True,
-            max_source_length=max_len, max_target_length=max_len,
-            batch_size=batch_size, num_workers=num_workers, pin_memory=True,
         ),
         "emotion": build_emotion_dataloader(
-            emot_train, tokenizer, shuffle=True, max_length=classification_max_len,
-            batch_size=batch_size, num_workers=num_workers, pin_memory=True,
         ),
         "topic": build_topic_dataloader(
-            topic_train, tokenizer, shuffle=True, max_length=classification_max_len,
-            batch_size=batch_size, num_workers=num_workers, pin_memory=True,
         ),
     }
     # Build model
-    grad_ckpt = cfg.training.get("gradient_checkpointing", cfg.model.get("gradient_checkpointing", False))
-    use_rel_pos = cfg.training.get("use_relative_position_bias", cfg.model.get("use_relative_position_bias", False))
     model_cfg = ModelConfig(
         d_model=cfg.model.d_model,
@@ -202,8 +223,10 @@ def main(cfg: DictConfig) -> None:
         except StopIteration:
             iterators[task] = iter(train_loaders[task])
             batch = next(iterators[task])
-        return {k: v.to(device, non_blocking=True) if isinstance(v, torch.Tensor) else v
-                for k, v in batch.items()}
     def training_step(step):
         """One training step across all tasks."""
@@ -219,7 +242,8 @@ def main(cfg: DictConfig) -> None:
                     loss = torch.nn.functional.cross_entropy(
                         logits.view(-1, logits.size(-1)),
                         batch["labels"].view(-1),
-                        ignore_index=-100, label_smoothing=0.1,
                     )
                 elif task == "emotion":
                     inputs = {"input_ids": batch["input_ids"]}
@@ -262,7 +286,10 @@ def main(cfg: DictConfig) -> None:
             torch.profiler.ProfilerActivity.CUDA,
         ],
         schedule=torch.profiler.schedule(
-            wait=1, warmup=2, active=active_steps - 3, repeat=1,
         ),
         on_trace_ready=torch.profiler.tensorboard_trace_handler(trace_path),
         record_shapes=True,
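Note on the `StopIteration` handling in the batch-fetching hunk above: when a task's dataloader is exhausted, the iterator is rebuilt and the next batch is taken from the fresh one, so short datasets keep cycling while longer ones continue. A minimal sketch of that pattern on plain lists follows; the class name `TaskBatchCycler` is hypothetical and it leaves out the device transfer done in the script.

```python
from typing import Any, Dict, Iterable, Iterator


class TaskBatchCycler:
    """Yield batches per task, restarting a task's iterable when it runs out (hypothetical helper)."""

    def __init__(self, loaders: Dict[str, Iterable[Any]]) -> None:
        self.loaders = loaders
        self.iterators: Dict[str, Iterator[Any]] = {
            task: iter(loader) for task, loader in loaders.items()
        }

    def next_batch(self, task: str) -> Any:
        try:
            return next(self.iterators[task])
        except StopIteration:
            # Exhausted: start a new pass over this task's data.
            self.iterators[task] = iter(self.loaders[task])
            return next(self.iterators[task])


if __name__ == "__main__":
    cycler = TaskBatchCycler({"emotion": [1, 2], "topic": [10]})
    print([cycler.next_batch("emotion") for _ in range(5)])
    print([cycler.next_batch("topic") for _ in range(3)])
```

An empty loader would still raise `StopIteration` on the second `next`, so callers are assumed to pass non-empty datasets.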

     tok_cfg = data_cfg.get("tokenizer", {})
     max_len = int(cfg.training.get("tokenizer_max_length") or tok_cfg.get("max_length", 512))
+    tokenizer = Tokenizer(
+        TokenizerConfig(
+            pretrained_model_name=tok_cfg.get("pretrained_model_name", "google/flan-t5-base"),
+            max_length=max_len,
+        )
+    )
     summ_train = SummarizationDataset(summ_splits["train"])
     emot_train = EmotionDataset(emot_splits["train"])
     train_loaders = {
         "summarization": build_summarization_dataloader(
+            summ_train,
+            tokenizer,
+            shuffle=True,
+            max_source_length=max_len,
+            max_target_length=max_len,
+            batch_size=batch_size,
+            num_workers=num_workers,
+            pin_memory=True,
         ),
         "emotion": build_emotion_dataloader(
+            emot_train,
+            tokenizer,
+            shuffle=True,
+            max_length=classification_max_len,
+            batch_size=batch_size,
+            num_workers=num_workers,
+            pin_memory=True,
         ),
         "topic": build_topic_dataloader(
+            topic_train,
+            tokenizer,
+            shuffle=True,
+            max_length=classification_max_len,
+            batch_size=batch_size,
+            num_workers=num_workers,
+            pin_memory=True,
         ),
     }
     # Build model
+    grad_ckpt = cfg.training.get(
+        "gradient_checkpointing", cfg.model.get("gradient_checkpointing", False)
+    )
+    use_rel_pos = cfg.training.get(
+        "use_relative_position_bias", cfg.model.get("use_relative_position_bias", False)
+    )
     model_cfg = ModelConfig(
         d_model=cfg.model.d_model,
         except StopIteration:
             iterators[task] = iter(train_loaders[task])
             batch = next(iterators[task])
+        return {
+            k: v.to(device, non_blocking=True) if isinstance(v, torch.Tensor) else v
+            for k, v in batch.items()
+        }
     def training_step(step):
         """One training step across all tasks."""
                     loss = torch.nn.functional.cross_entropy(
                         logits.view(-1, logits.size(-1)),
                         batch["labels"].view(-1),
+                        ignore_index=-100,
+                        label_smoothing=0.1,
                     )
                 elif task == "emotion":
                     inputs = {"input_ids": batch["input_ids"]}
             torch.profiler.ProfilerActivity.CUDA,
         ],
         schedule=torch.profiler.schedule(
+            wait=1,
+            warmup=2,
+            active=active_steps - 3,
+            repeat=1,
         ),
         on_trace_ready=torch.profiler.tensorboard_trace_handler(trace_path),
         record_shapes=True,
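Note on the profiler schedule above: `torch.profiler.schedule` skips `wait` steps, warms up for `warmup` steps, then records `active` steps. With `wait=1` and `warmup=2`, setting `active=active_steps - 3` presumably makes the whole cycle fit in `active_steps` total steps. The pure-Python sketch below only labels the phases and does not call the profiler. The function `profiler_phases` and the value `active_steps = 8` are illustrative assumptions.

```python
from typing import List


def profiler_phases(total_steps: int, wait: int, warmup: int, active: int) -> List[str]:
    """Label each step of a single wait/warmup/active cycle (repeat=1)."""
    phases = []
    for step in range(total_steps):
        if step < wait:
            phases.append("wait")      # profiler idle
        elif step < wait + warmup:
            phases.append("warmup")    # tracing on, results discarded
        elif step < wait + warmup + active:
            phases.append("active")    # steps that end up in the trace
        else:
            phases.append("done")      # cycle finished, nothing recorded
    return phases


if __name__ == "__main__":
    active_steps = 8  # hypothetical value; the script reads its own setting
    print(profiler_phases(active_steps, wait=1, warmup=2, active=active_steps - 3))
```

If `active_steps` were 3 or less, `active` would be zero or negative, so the script presumably expects a larger value.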

scripts/train.py CHANGED

@@ -56,6 +56,7 @@ def set_seed(seed: int) -> None:
     import random
     import numpy as np
     random.seed(seed)
     np.random.seed(seed)
     torch.manual_seed(seed)
@@ -78,20 +79,20 @@ def load_splits(data_dir: Path, loader_fn) -> Dict[str, list]:
 def main(cfg: DictConfig) -> None:
     """Main training entry point."""
     start_time = time.perf_counter()
     print("=" * 60)
     print("LexiMind Training")
     print("=" * 60)
     print(OmegaConf.to_yaml(cfg))
     set_seed(cfg.seed)
     device = torch.device(cfg.device)
     # GPU optimizations for Ampere+
     if device.type == "cuda":
         # Enable cudnn benchmark for fixed-size inputs (10-20% speedup)
         torch.backends.cudnn.benchmark = True
         if torch.cuda.get_device_capability()[0] >= 8:
             torch.set_float32_matmul_precision("high")
             torch.backends.cuda.matmul.allow_tf32 = True
@@ -99,18 +100,18 @@ def main(cfg: DictConfig) -> None:
             print("  TF32 + cudnn.benchmark enabled (Ampere GPU)")
         else:
             print("  cudnn.benchmark enabled")
     # --------------- Load Data ---------------
     print("\nLoading datasets...")
     data_cfg = cfg.data
     trainer_cfg = cfg.training.get("trainer", {})
     # Load splits
     summ_splits = load_splits(Path(data_cfg.processed.summarization), load_summarization_jsonl)
     emot_splits = load_splits(Path(data_cfg.processed.emotion), load_emotion_jsonl)
     topic_splits = load_splits(Path(data_cfg.processed.topic), load_topic_jsonl)
     # Apply sample limits for dev runs
     max_train = trainer_cfg.get("max_train_samples")
     max_val = trainer_cfg.get("max_val_samples")
@@ -121,86 +122,130 @@ def main(cfg: DictConfig) -> None:
         for splits in [summ_splits, emot_splits, topic_splits]:
             if "val" in splits:
                 splits["val"] = splits["val"][:max_val]
-    print(f"  Summarization: {len(summ_splits['train']):,} train, {len(summ_splits.get('val', [])):,} val")
-    print(f"  Emotion: {len(emot_splits['train']):,} train, {len(emot_splits.get('val', [])):,} val")
-    print(f"  Topic: {len(topic_splits['train']):,} train, {len(topic_splits.get('val', [])):,} val")
     # --------------- Tokenizer ---------------
     tok_cfg = data_cfg.get("tokenizer", {})
     max_len = int(cfg.training.get("tokenizer_max_length") or tok_cfg.get("max_length", 512))
-    tokenizer = Tokenizer(TokenizerConfig(
-        pretrained_model_name=tok_cfg.get("pretrained_model_name", "google/flan-t5-base"),
-        max_length=max_len,
-    ))
     print(f"  Tokenizer: {tokenizer.vocab_size:,} vocab, max_len={max_len}")
     # --------------- Datasets ---------------
     summ_train = SummarizationDataset(summ_splits["train"])
     summ_val = SummarizationDataset(summ_splits.get("val", []))
     emot_train = EmotionDataset(emot_splits["train"])
     emot_val = EmotionDataset(emot_splits.get("val", []), binarizer=emot_train.binarizer)
     topic_train = TopicDataset(topic_splits["train"])
     topic_val = TopicDataset(topic_splits.get("val", []), encoder=topic_train.encoder)
     print(f"  Emotions: {len(emot_train.emotion_classes)} classes")
-    print(f"  Topics: {len(topic_train.topic_classes)} classes → {list(map(str, topic_train.topic_classes))}")
     # --------------- DataLoaders ---------------
     dl_cfg = cfg.training.get("dataloader", {})
     batch_size = int(dl_cfg.get("batch_size", 8))
     num_workers = int(dl_cfg.get("num_workers", 4))
     # Classification tasks don't need full 512 tokens - 256 is sufficient
     # This speeds up emotion/topic forward passes significantly
     classification_max_len = min(256, max_len)
     train_loaders = {
         "summarization": build_summarization_dataloader(
-            summ_train, tokenizer, shuffle=True,
-            max_source_length=max_len, max_target_length=max_len,
-            batch_size=batch_size, num_workers=num_workers, pin_memory=True,
         ),
         "emotion": build_emotion_dataloader(
-            emot_train, tokenizer, shuffle=True, max_length=classification_max_len,
-            batch_size=batch_size, num_workers=num_workers, pin_memory=True,
         ),
         "topic": build_topic_dataloader(
-            topic_train, tokenizer, shuffle=True, max_length=classification_max_len,
-            batch_size=batch_size, num_workers=num_workers, pin_memory=True,
         ),
     }
     val_loaders = {}
     if summ_val:
         val_loaders["summarization"] = build_summarization_dataloader(
-            summ_val, tokenizer, shuffle=False,
-            max_source_length=max_len, max_target_length=max_len,
-            batch_size=batch_size, num_workers=num_workers, pin_memory=True,
         )
     if emot_val:
         val_loaders["emotion"] = build_emotion_dataloader(
-            emot_val, tokenizer, shuffle=False, max_length=classification_max_len,
-            batch_size=batch_size, num_workers=num_workers, pin_memory=True,
         )
     if topic_val:
         val_loaders["topic"] = build_topic_dataloader(
-            topic_val, tokenizer, shuffle=False, max_length=classification_max_len,
-            batch_size=batch_size, num_workers=num_workers, pin_memory=True,
         )
     # --------------- Model ---------------
     print("\nBuilding model...")
     # Check for overrides in training config
-    grad_ckpt = cfg.training.get("gradient_checkpointing", cfg.model.get("gradient_checkpointing", False))
-    use_rel_pos = cfg.training.get("use_relative_position_bias", cfg.model.get("use_relative_position_bias", False))
     model_cfg = ModelConfig(
         d_model=cfg.model.d_model,
         vocab_size=getattr(cfg.model, "vocab_size", None),
@@ -215,42 +260,42 @@ def main(cfg: DictConfig) -> None:
         use_relative_position_bias=use_rel_pos,
         gradient_checkpointing=grad_ckpt,
     )
     if grad_ckpt:
         print("  Gradient checkpointing: on")
     if not use_rel_pos:
         print("  FlashAttention: on (no relative position bias)")
     model = build_multitask_model(
         tokenizer,
         num_emotions=len(emot_train.emotion_classes),
         num_topics=len(topic_train.topic_classes),
         config=model_cfg,
     ).to(device)
     param_count = sum(p.numel() for p in model.parameters())
-    print(f"  Parameters: {param_count:,} ({param_count/1e6:.1f}M)")
     # Freeze lower encoder layers (keeps pretrained language understanding, adapts upper layers)
     freeze_layers = cfg.training.get("freeze_encoder_layers", 0)
     if freeze_layers > 0:
         frozen_params = 0
         # Freeze embedding layer
-        if hasattr(model.encoder, 'embed_tokens'):
             for p in model.encoder.embed_tokens.parameters():
                 p.requires_grad = False
                 frozen_params += p.numel()
         # Freeze specified number of encoder layers
-        if hasattr(model.encoder, 'layers'):
             for i, layer in enumerate(model.encoder.layers):
                 if i < freeze_layers:
                     for p in layer.parameters():
                         p.requires_grad = False
                         frozen_params += p.numel()
         trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
-        print(f"  Frozen layers: 0-{freeze_layers-1} ({frozen_params/1e6:.1f}M params)")
-        print(f"  Trainable: {trainable:,} ({trainable/1e6:.1f}M)")
     # Resume from checkpoint?
     start_epoch = 1
     resume_path = cfg.get("resume_from")
@@ -258,10 +303,11 @@ def main(cfg: DictConfig) -> None:
         print(f"  Resuming from: {resume_path}")
         load_state(model, str(resume_path))
         import re
         digits = re.findall(r"\d+", Path(resume_path).stem)
         if digits:
             start_epoch = int(digits[-1]) + 1
     # Compile model for speed
     # Note: "reduce-overhead" mode uses CUDA graphs which conflicts with gradient checkpointing
     # Use "default" mode when checkpointing is enabled
@@ -272,13 +318,13 @@ def main(cfg: DictConfig) -> None:
     if cfg.training.get("compile_decoder", True):
         model.decoder = torch.compile(model.decoder, mode=compile_mode)  # type: ignore[assignment]
         print(f"  Decoder compiled ({compile_mode})")
     # --------------- Train ---------------
     print("\nStarting training...")
     opt_cfg = cfg.training.get("optimizer", {})
     sched_cfg = cfg.training.get("scheduler", {})
     # Use fused AdamW on CUDA for ~5-10% speedup
     use_fused = device.type == "cuda" and "fused" in torch.optim.AdamW.__init__.__code__.co_varnames
     optimizer = torch.optim.AdamW(
@@ -289,7 +335,7 @@ def main(cfg: DictConfig) -> None:
     )
     if use_fused:
         print("  Fused AdamW: on")
     trainer = Trainer(
         model=model,
         optimizer=optimizer,
@@ -309,38 +355,38 @@ def main(cfg: DictConfig) -> None:
         device=device,
         tokenizer=tokenizer,
     )
     # Checkpoint callback
     ckpt_dir = Path(cfg.checkpoint_out).parent
-    best_val_loss = float('inf')
     def save_checkpoint(epoch: int, model: torch.nn.Module, history: Dict) -> None:
         nonlocal best_val_loss
         ckpt_dir.mkdir(parents=True, exist_ok=True)
         # Save epoch checkpoint
         save_state(model, str(ckpt_dir / f"epoch_{epoch}.pt"))
         # Track best
         val_key = f"val_epoch_{epoch}"
         if val_key in history:
-            val_loss = history[val_key].get("total_loss", float('inf'))
             if val_loss < best_val_loss:
                 best_val_loss = val_loss
                 save_state(model, str(ckpt_dir / "best.pt"))
                 print(f"  New best model saved (val_loss={val_loss:.4f})")
     history = trainer.fit(
         train_loaders,
         val_loaders if val_loaders else None,
         checkpoint_callback=save_checkpoint,
         start_epoch=start_epoch,
     )
     # --------------- Save Outputs ---------------
     print("\nSaving outputs...")
     # Labels
     labels_path = Path(cfg.labels_out)
     save_label_metadata(
@@ -348,17 +394,17 @@ def main(cfg: DictConfig) -> None:
         labels_path,
     )
     print(f"  Labels: {labels_path}")
     # History
     history_path = Path(cfg.history_out)
     history_path.parent.mkdir(parents=True, exist_ok=True)
     with history_path.open("w") as f:
         json.dump(history, f, indent=2)
     print(f"  History: {history_path}")
     total_time = time.perf_counter() - start_time
     print(f"\n{'=' * 60}")
-    print(f"Training complete in {total_time/60:.1f} minutes")
     print(f"  Best checkpoint: {ckpt_dir / 'best.pt'}")
     print(f"{'=' * 60}")
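
Review note on the resume logic in `main`: the start epoch is recovered by taking the last run of digits in the checkpoint file stem and adding one. The sketch below isolates that rule so its edge cases are easy to see. `next_epoch_from_checkpoint` and the example paths are hypothetical; only the regular expression and the `+ 1` come from the script.

```python
import re
from pathlib import Path


def next_epoch_from_checkpoint(path: str, default: int = 1) -> int:
    """Mirror the script's rule: last digit run in the file stem, plus one."""
    digits = re.findall(r"\d+", Path(path).stem)
    return int(digits[-1]) + 1 if digits else default


resumed = next_epoch_from_checkpoint("checkpoints/epoch_3.pt")
no_digits = next_epoch_from_checkpoint("checkpoints/best.pt")
# Any trailing number in the stem wins, even if it is not the epoch.
misleading = next_epoch_from_checkpoint("checkpoints/epoch_3_lr1e-4.pt")
```

Resuming from `best.pt` falls back to epoch 1 because the stem has no digits, and a stem with extra numbers after the epoch picks up the wrong value. Storing the epoch inside the checkpoint would avoid both cases, though that would be a behavior change beyond this formatting commit.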

     import random
     import numpy as np
     random.seed(seed)
     np.random.seed(seed)
     torch.manual_seed(seed)
 def main(cfg: DictConfig) -> None:
     """Main training entry point."""
     start_time = time.perf_counter()
     print("=" * 60)
     print("LexiMind Training")
     print("=" * 60)
     print(OmegaConf.to_yaml(cfg))
     set_seed(cfg.seed)
     device = torch.device(cfg.device)
     # GPU optimizations for Ampere+
     if device.type == "cuda":
         # Enable cudnn benchmark for fixed-size inputs (10-20% speedup)
         torch.backends.cudnn.benchmark = True
         if torch.cuda.get_device_capability()[0] >= 8:
             torch.set_float32_matmul_precision("high")
             torch.backends.cuda.matmul.allow_tf32 = True
             print("  TF32 + cudnn.benchmark enabled (Ampere GPU)")
         else:
             print("  cudnn.benchmark enabled")
     # --------------- Load Data ---------------
     print("\nLoading datasets...")
     data_cfg = cfg.data
     trainer_cfg = cfg.training.get("trainer", {})
     # Load splits
     summ_splits = load_splits(Path(data_cfg.processed.summarization), load_summarization_jsonl)
     emot_splits = load_splits(Path(data_cfg.processed.emotion), load_emotion_jsonl)
     topic_splits = load_splits(Path(data_cfg.processed.topic), load_topic_jsonl)
     # Apply sample limits for dev runs
     max_train = trainer_cfg.get("max_train_samples")
     max_val = trainer_cfg.get("max_val_samples")
         for splits in [summ_splits, emot_splits, topic_splits]:
             if "val" in splits:
                 splits["val"] = splits["val"][:max_val]
+    print(
+        f"  Summarization: {len(summ_splits['train']):,} train, {len(summ_splits.get('val', [])):,} val"
+    )
+    print(
+        f"  Emotion: {len(emot_splits['train']):,} train, {len(emot_splits.get('val', [])):,} val"
+    )
+    print(
+        f"  Topic: {len(topic_splits['train']):,} train, {len(topic_splits.get('val', [])):,} val"
+    )
     # --------------- Tokenizer ---------------
     tok_cfg = data_cfg.get("tokenizer", {})
     max_len = int(cfg.training.get("tokenizer_max_length") or tok_cfg.get("max_length", 512))
+    tokenizer = Tokenizer(
+        TokenizerConfig(
+            pretrained_model_name=tok_cfg.get("pretrained_model_name", "google/flan-t5-base"),
+            max_length=max_len,
+        )
+    )
     print(f"  Tokenizer: {tokenizer.vocab_size:,} vocab, max_len={max_len}")
     # --------------- Datasets ---------------
     summ_train = SummarizationDataset(summ_splits["train"])
     summ_val = SummarizationDataset(summ_splits.get("val", []))
     emot_train = EmotionDataset(emot_splits["train"])
     emot_val = EmotionDataset(emot_splits.get("val", []), binarizer=emot_train.binarizer)
     topic_train = TopicDataset(topic_splits["train"])
     topic_val = TopicDataset(topic_splits.get("val", []), encoder=topic_train.encoder)
     print(f"  Emotions: {len(emot_train.emotion_classes)} classes")
+    print(
+        f"  Topics: {len(topic_train.topic_classes)} classes → {list(map(str, topic_train.topic_classes))}"
+    )
     # --------------- DataLoaders ---------------
     dl_cfg = cfg.training.get("dataloader", {})
     batch_size = int(dl_cfg.get("batch_size", 8))
     num_workers = int(dl_cfg.get("num_workers", 4))
     # Classification tasks don't need full 512 tokens - 256 is sufficient
     # This speeds up emotion/topic forward passes significantly
     classification_max_len = min(256, max_len)
     train_loaders = {
         "summarization": build_summarization_dataloader(
+            summ_train,
+            tokenizer,
+            shuffle=True,
+            max_source_length=max_len,
+            max_target_length=max_len,
+            batch_size=batch_size,
+            num_workers=num_workers,
+            pin_memory=True,
         ),
         "emotion": build_emotion_dataloader(
+            emot_train,
+            tokenizer,
+            shuffle=True,
+            max_length=classification_max_len,
+            batch_size=batch_size,
+            num_workers=num_workers,
+            pin_memory=True,
         ),
         "topic": build_topic_dataloader(
+            topic_train,
+            tokenizer,
+            shuffle=True,
+            max_length=classification_max_len,
+            batch_size=batch_size,
+            num_workers=num_workers,
+            pin_memory=True,
         ),
     }
     val_loaders = {}
     if summ_val:
         val_loaders["summarization"] = build_summarization_dataloader(
+            summ_val,
+            tokenizer,
+            shuffle=False,
+            max_source_length=max_len,
+            max_target_length=max_len,
+            batch_size=batch_size,
+            num_workers=num_workers,
+            pin_memory=True,
         )
     if emot_val:
         val_loaders["emotion"] = build_emotion_dataloader(
+            emot_val,
+            tokenizer,
+            shuffle=False,
+            max_length=classification_max_len,
+            batch_size=batch_size,
+            num_workers=num_workers,
+            pin_memory=True,
         )
     if topic_val:
         val_loaders["topic"] = build_topic_dataloader(
+            topic_val,
+            tokenizer,
+            shuffle=False,
+            max_length=classification_max_len,
+            batch_size=batch_size,
+            num_workers=num_workers,
+            pin_memory=True,
         )
     # --------------- Model ---------------
     print("\nBuilding model...")
     # Check for overrides in training config
+    grad_ckpt = cfg.training.get(
+        "gradient_checkpointing", cfg.model.get("gradient_checkpointing", False)
+    )
+    use_rel_pos = cfg.training.get(
+        "use_relative_position_bias", cfg.model.get("use_relative_position_bias", False)
+    )
     model_cfg = ModelConfig(
         d_model=cfg.model.d_model,
         vocab_size=getattr(cfg.model, "vocab_size", None),
         use_relative_position_bias=use_rel_pos,
         gradient_checkpointing=grad_ckpt,
     )
     if grad_ckpt:
         print("  Gradient checkpointing: on")
     if not use_rel_pos:
         print("  FlashAttention: on (no relative position bias)")
     model = build_multitask_model(
         tokenizer,
         num_emotions=len(emot_train.emotion_classes),
         num_topics=len(topic_train.topic_classes),
         config=model_cfg,
     ).to(device)
     param_count = sum(p.numel() for p in model.parameters())
+    print(f"  Parameters: {param_count:,} ({param_count / 1e6:.1f}M)")
     # Freeze lower encoder layers (keeps pretrained language understanding, adapts upper layers)
     freeze_layers = cfg.training.get("freeze_encoder_layers", 0)
     if freeze_layers > 0:
         frozen_params = 0
         # Freeze embedding layer
+        if hasattr(model.encoder, "embed_tokens"):
             for p in model.encoder.embed_tokens.parameters():
                 p.requires_grad = False
                 frozen_params += p.numel()
         # Freeze specified number of encoder layers
+        if hasattr(model.encoder, "layers"):
             for i, layer in enumerate(model.encoder.layers):
                 if i < freeze_layers:
                     for p in layer.parameters():
                         p.requires_grad = False
                         frozen_params += p.numel()
         trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
+        print(f"  Frozen layers: 0-{freeze_layers - 1} ({frozen_params / 1e6:.1f}M params)")
+        print(f"  Trainable: {trainable:,} ({trainable / 1e6:.1f}M)")
     # Resume from checkpoint?
     start_epoch = 1
     resume_path = cfg.get("resume_from")
         print(f"  Resuming from: {resume_path}")
         load_state(model, str(resume_path))
         import re
         digits = re.findall(r"\d+", Path(resume_path).stem)
         if digits:
             start_epoch = int(digits[-1]) + 1
     # Compile model for speed
     # Note: "reduce-overhead" mode uses CUDA graphs which conflicts with gradient checkpointing
     # Use "default" mode when checkpointing is enabled
     if cfg.training.get("compile_decoder", True):
         model.decoder = torch.compile(model.decoder, mode=compile_mode)  # type: ignore[assignment]
         print(f"  Decoder compiled ({compile_mode})")
     # --------------- Train ---------------
     print("\nStarting training...")
     opt_cfg = cfg.training.get("optimizer", {})
     sched_cfg = cfg.training.get("scheduler", {})
     # Use fused AdamW on CUDA for ~5-10% speedup
     use_fused = device.type == "cuda" and "fused" in torch.optim.AdamW.__init__.__code__.co_varnames
     optimizer = torch.optim.AdamW(
     )
     if use_fused:
         print("  Fused AdamW: on")
     trainer = Trainer(
         model=model,
         optimizer=optimizer,
         device=device,
         tokenizer=tokenizer,
     )
     # Checkpoint callback
     ckpt_dir = Path(cfg.checkpoint_out).parent
+    best_val_loss = float("inf")
     def save_checkpoint(epoch: int, model: torch.nn.Module, history: Dict) -> None:
         nonlocal best_val_loss
         ckpt_dir.mkdir(parents=True, exist_ok=True)
         # Save epoch checkpoint
         save_state(model, str(ckpt_dir / f"epoch_{epoch}.pt"))
         # Track best
         val_key = f"val_epoch_{epoch}"
         if val_key in history:
+            val_loss = history[val_key].get("total_loss", float("inf"))
             if val_loss < best_val_loss:
                 best_val_loss = val_loss
                 save_state(model, str(ckpt_dir / "best.pt"))
                 print(f"  New best model saved (val_loss={val_loss:.4f})")
     history = trainer.fit(
         train_loaders,
         val_loaders if val_loaders else None,
         checkpoint_callback=save_checkpoint,
         start_epoch=start_epoch,
     )
     # --------------- Save Outputs ---------------
     print("\nSaving outputs...")
     # Labels
     labels_path = Path(cfg.labels_out)
     save_label_metadata(
         labels_path,
     )
     print(f"  Labels: {labels_path}")
     # History
     history_path = Path(cfg.history_out)
     history_path.parent.mkdir(parents=True, exist_ok=True)
     with history_path.open("w") as f:
         json.dump(history, f, indent=2)
     print(f"  History: {history_path}")
     total_time = time.perf_counter() - start_time
     print(f"\n{'=' * 60}")
+    print(f"Training complete in {total_time / 60:.1f} minutes")
     print(f"  Best checkpoint: {ckpt_dir / 'best.pt'}")
     print(f"{'=' * 60}")
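
Review note on `save_checkpoint`: the callback writes one file per epoch and overwrites `best.pt` only when the validation `total_loss` improves. The sketch below reproduces that control flow with a list standing in for the filesystem, so it runs without a model. `make_best_tracker`, `on_epoch_end`, and the sample history are hypothetical; the key names `val_epoch_{epoch}` and `total_loss` are taken from the script.

```python
from typing import Callable, Dict, List


def make_best_tracker(saved: List[str]) -> Callable[[int, Dict], None]:
    """Build a callback that records which checkpoint files would be written."""
    best_val_loss = float("inf")

    def on_epoch_end(epoch: int, history: Dict) -> None:
        nonlocal best_val_loss
        saved.append(f"epoch_{epoch}.pt")  # every epoch gets its own file
        val_key = f"val_epoch_{epoch}"
        if val_key in history:
            val_loss = history[val_key].get("total_loss", float("inf"))
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                saved.append("best.pt")  # overwritten on each improvement

    return on_epoch_end


saved_files: List[str] = []
callback = make_best_tracker(saved_files)
history = {
    "val_epoch_1": {"total_loss": 2.0},
    "val_epoch_2": {"total_loss": 1.5},
    "val_epoch_3": {"total_loss": 1.7},
}
for epoch in (1, 2, 3):
    callback(epoch, history)
```

If the history never contains a `val_epoch_N` entry, which is presumably the case when no validation loaders are passed, `best.pt` is never written even though the final summary prints its path.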

scripts/train_multiseed.py CHANGED

@@ -30,7 +30,8 @@ def run_single_seed(seed: int, config_overrides: str, base_dir: Path) -> Dict:
     seed_dir.mkdir(parents=True, exist_ok=True)
     cmd = [
-        sys.executable, "scripts/train.py",
         f"seed={seed}",
         f"checkpoint_out={seed_dir}/checkpoints/best.pt",
         f"history_out={seed_dir}/training_history.json",
@@ -39,9 +40,9 @@ def run_single_seed(seed: int, config_overrides: str, base_dir: Path) -> Dict:
     if config_overrides:
         cmd.extend(config_overrides.split())
-    print(f"\n{'='*60}")
     print(f"Training seed {seed}")
-    print(f"{'='*60}")
     print(f"  Command: {' '.join(cmd)}")
     result = subprocess.run(cmd, capture_output=False)
@@ -69,7 +70,8 @@ def run_evaluation(seed: int, base_dir: Path, extra_args: List[str] | None = Non
         return {}
     cmd = [
-        sys.executable, "scripts/evaluate.py",
         f"--checkpoint={checkpoint}",
         f"--labels={labels}",
         f"--output={output}",
@@ -105,7 +107,11 @@ def aggregate_results(all_results: Dict[int, Dict]) -> Dict:
             if not isinstance(task_metrics, dict):
                 continue
             for metric_name, value in task_metrics.items():
-                if isinstance(value, (int, float)) and metric_name != "num_samples" and metric_name != "num_classes":
                     key = f"{task}/{metric_name}"
                     metric_values.setdefault(key, []).append(float(value))
@@ -125,9 +131,9 @@ def aggregate_results(all_results: Dict[int, Dict]) -> Dict:
 def print_summary(aggregated: Dict, seeds: List[int]) -> None:
     """Print human-readable summary of multi-seed results."""
-    print(f"\n{'='*70}")
     print(f"MULTI-SEED RESULTS SUMMARY ({len(seeds)} seeds: {seeds})")
-    print(f"{'='*70}")
     # Group by task
     tasks: Dict[str, Dict[str, Dict]] = {}
@@ -142,23 +148,32 @@ def print_summary(aggregated: Dict, seeds: List[int]) -> None:
             std = stats["std"]
             # Format based on metric type
             if "accuracy" in metric:
-                print(f"    {metric:25s}: {mean*100:.1f}% ± {std*100:.1f}%")
             else:
                 print(f"    {metric:25s}: {mean:.4f} ± {std:.4f}")
 def main():
     parser = argparse.ArgumentParser(description="Multi-seed training for LexiMind")
-    parser.add_argument("--seeds", nargs="+", type=int, default=[17, 42, 123],
-                        help="Random seeds to train with")
-    parser.add_argument("--config", type=str, default="",
-                        help="Hydra config overrides (e.g., 'training=full')")
-    parser.add_argument("--output-dir", type=Path, default=Path("outputs/multiseed"),
-                        help="Base output directory")
-    parser.add_argument("--skip-training", action="store_true",
-                        help="Skip training, only aggregate existing results")
-    parser.add_argument("--skip-eval", action="store_true",
-                        help="Skip evaluation, only aggregate training histories")
     args = parser.parse_args()
     args.output_dir.mkdir(parents=True, exist_ok=True)
@@ -184,11 +199,15 @@ def main():
         # Save aggregated results
         output_path = args.output_dir / "aggregated_results.json"
         with open(output_path, "w") as f:
-            json.dump({
-                "seeds": args.seeds,
-                "per_seed": {str(k): v for k, v in all_eval_results.items()},
-                "aggregated": aggregated,
-            }, f, indent=2)
         print(f"\n  Saved to: {output_path}")
     else:
         print("\nNo evaluation results to aggregate.")
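
Review note on `run_single_seed`: overrides are split with `str.split()`, which breaks any override whose value contains quoted whitespace. The sketch below contrasts that with `shlex.split` from the standard library. The override string is a hypothetical example, and whether such values ever occur in this project is an assumption.

```python
import shlex

# Hypothetical override string with a quoted value containing a space.
config_overrides = "training=full 'run_name=my run'"

naive_args = config_overrides.split()        # splits inside the quotes
quoted_args = shlex.split(config_overrides)  # honors shell-style quoting
```

For overrides without spaces the two calls give the same result, so swapping one for the other should not affect existing invocations.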

     seed_dir.mkdir(parents=True, exist_ok=True)
     cmd = [
+        sys.executable,
+        "scripts/train.py",
         f"seed={seed}",
         f"checkpoint_out={seed_dir}/checkpoints/best.pt",
         f"history_out={seed_dir}/training_history.json",
     if config_overrides:
         cmd.extend(config_overrides.split())
+    print(f"\n{'=' * 60}")
     print(f"Training seed {seed}")
+    print(f"{'=' * 60}")
     print(f"  Command: {' '.join(cmd)}")
     result = subprocess.run(cmd, capture_output=False)
         return {}
     cmd = [
+        sys.executable,
+        "scripts/evaluate.py",
         f"--checkpoint={checkpoint}",
         f"--labels={labels}",
         f"--output={output}",
             if not isinstance(task_metrics, dict):
                 continue
             for metric_name, value in task_metrics.items():
+                if (
+                    isinstance(value, (int, float))
+                    and metric_name != "num_samples"
+                    and metric_name != "num_classes"
+                ):
                     key = f"{task}/{metric_name}"
                     metric_values.setdefault(key, []).append(float(value))
 def print_summary(aggregated: Dict, seeds: List[int]) -> None:
     """Print human-readable summary of multi-seed results."""
+    print(f"\n{'=' * 70}")
     print(f"MULTI-SEED RESULTS SUMMARY ({len(seeds)} seeds: {seeds})")
+    print(f"{'=' * 70}")
     # Group by task
     tasks: Dict[str, Dict[str, Dict]] = {}
             std = stats["std"]
             # Format based on metric type
             if "accuracy" in metric:
+                print(f"    {metric:25s}: {mean * 100:.1f}% ± {std * 100:.1f}%")
             else:
                 print(f"    {metric:25s}: {mean:.4f} ± {std:.4f}")
 def main():
     parser = argparse.ArgumentParser(description="Multi-seed training for LexiMind")
+    parser.add_argument(
+        "--seeds", nargs="+", type=int, default=[17, 42, 123], help="Random seeds to train with"
+    )
+    parser.add_argument(
+        "--config", type=str, default="", help="Hydra config overrides (e.g., 'training=full')"
+    )
+    parser.add_argument(
+        "--output-dir", type=Path, default=Path("outputs/multiseed"), help="Base output directory"
+    )
+    parser.add_argument(
+        "--skip-training",
+        action="store_true",
+        help="Skip training, only aggregate existing results",
+    )
+    parser.add_argument(
+        "--skip-eval",
+        action="store_true",
+        help="Skip evaluation, only aggregate training histories",
+    )
     args = parser.parse_args()
     args.output_dir.mkdir(parents=True, exist_ok=True)
         # Save aggregated results
         output_path = args.output_dir / "aggregated_results.json"
         with open(output_path, "w") as f:
+            json.dump(
+                {
+                    "seeds": args.seeds,
+                    "per_seed": {str(k): v for k, v in all_eval_results.items()},
+                    "aggregated": aggregated,
+                },
+                f,
+                indent=2,
+            )
         print(f"\n  Saved to: {output_path}")
     else:
         print("\nNo evaluation results to aggregate.")
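
Review note on `aggregate_results`: the visible hunk collects every numeric metric except `num_samples` and `num_classes` under a `task/metric` key, and `print_summary` later reads `mean` and `std` for each key. The sketch below shows one way those pieces could fit together. The function name `aggregate`, the `n` field, the use of the sample standard deviation, and the example numbers are assumptions; the diff does not show how the script computes its statistics.

```python
import statistics
from typing import Dict, List

SKIPPED_METRICS = {"num_samples", "num_classes"}


def aggregate(all_results: Dict[int, Dict]) -> Dict[str, Dict[str, float]]:
    """Collect numeric metrics per task across seeds and summarize them."""
    metric_values: Dict[str, List[float]] = {}
    for results in all_results.values():
        for task, task_metrics in results.items():
            if not isinstance(task_metrics, dict):
                continue
            for name, value in task_metrics.items():
                if isinstance(value, (int, float)) and name not in SKIPPED_METRICS:
                    metric_values.setdefault(f"{task}/{name}", []).append(float(value))
    return {
        key: {
            "mean": statistics.fmean(values),
            # Sample standard deviation (assumed); undefined for a single seed.
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
            "n": len(values),
        }
        for key, values in metric_values.items()
    }


# Made-up numbers for two seeds, for illustration only.
example = {
    17: {"topic": {"accuracy": 0.84, "num_samples": 100}},
    42: {"topic": {"accuracy": 0.86, "num_samples": 100}},
}
summary = aggregate(example)
```

With only three seeds by default, the choice between sample and population standard deviation changes the reported spread noticeably, so stating which one is used next to the `±` figures would help readers.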

scripts/visualize_training.py CHANGED

@@ -81,31 +81,33 @@ ARTIFACTS_DIR = PROJECT_ROOT / "artifacts"
 # Professional color palette (accessible + publication-ready)
 COLORS = {
-    "primary": "#2E86AB",     # Deep blue - training
-    "secondary": "#E94F37",   # Coral red - validation
-    "accent": "#28A745",      # Green - best points
-    "highlight": "#F7B801",   # Gold - highlights
-    "dark": "#1E3A5F",        # Navy - text
-    "light": "#F5F5F5",       # Light gray - background
-    "topic": "#8338EC",       # Purple
-    "emotion": "#FF6B6B",     # Salmon
-    "summary": "#06D6A0",     # Teal
 }
 # Style configuration
 plt.style.use("seaborn-v0_8-whitegrid")
-plt.rcParams.update({
-    "font.family": "sans-serif",
-    "font.size": 11,
-    "axes.titlesize": 14,
-    "axes.titleweight": "bold",
-    "axes.labelsize": 12,
-    "legend.fontsize": 10,
-    "figure.titlesize": 16,
-    "figure.titleweight": "bold",
-    "savefig.dpi": 150,
-    "savefig.bbox": "tight",
-})
 # Custom colormap for heatmaps
 HEATMAP_CMAP = LinearSegmentedColormap.from_list(
@@ -115,12 +117,14 @@ HEATMAP_CMAP = LinearSegmentedColormap.from_list(
 # MLflow Utilities
 def get_mlflow_client():
     """Get MLflow client with correct tracking URI."""
     if not HAS_MLFLOW:
         raise ImportError("MLflow not installed. Install with: pip install mlflow")
     import mlflow
     import mlflow.tracking
     # Use SQLite database (same as trainer.py)
     mlflow.set_tracking_uri("sqlite:///mlruns.db")
     return mlflow.tracking.MlflowClient()
@@ -153,6 +157,7 @@ def get_metric_history(run, metric_name: str) -> tuple[list, list]:
 # Core Training Visualizations
 def plot_loss_curves(run, interactive: bool = False) -> None:
     """
     Plot training and validation loss over time.
@@ -164,37 +169,49 @@ def plot_loss_curves(run, interactive: bool = False) -> None:
     if interactive and HAS_PLOTLY:
         import plotly.graph_objects as go
         fig = go.Figure()
         if train_values:
-            fig.add_trace(go.Scatter(
-                x=train_steps, y=train_values,
-                name="Training Loss", mode="lines",
-                line=dict(color=COLORS["primary"], width=3)
-            ))
         if val_values:
-            fig.add_trace(go.Scatter(
-                x=val_steps, y=val_values,
-                name="Validation Loss", mode="lines",
-                line=dict(color=COLORS["secondary"], width=3)
-            ))
             # Best point
             best_idx = int(np.argmin(val_values))
-            fig.add_trace(go.Scatter(
-                x=[val_steps[best_idx]], y=[val_values[best_idx]],
-                name=f"Best: {val_values[best_idx]:.3f}",
-                mode="markers",
-                marker=dict(color=COLORS["accent"], size=15, symbol="star")
-            ))
         fig.update_layout(
             title="Training Progress: Multi-Task Loss",
             xaxis_title="Epoch",
             yaxis_title="Loss",
             template="plotly_white",
-            hovermode="x unified"
         )
         output_path = OUTPUTS_DIR / "training_loss_curve.html"
@@ -206,32 +223,62 @@ def plot_loss_curves(run, interactive: bool = False) -> None:
     fig, ax = plt.subplots(figsize=(12, 6))
     if not train_values:
-        ax.text(0.5, 0.5, "No training data yet\n\nWaiting for first epoch...",
-                ha="center", va="center", fontsize=14, color="gray")
         ax.set_xlim(0, 1)
         ax.set_ylim(0, 1)
     else:
         # Training curve
-        ax.plot(train_steps, train_values, label="Training Loss", linewidth=2.5,
-                color=COLORS["primary"], alpha=0.9)
         # Validation curve with best point
         if val_values:
-            ax.plot(val_steps, val_values, label="Validation Loss", linewidth=2.5,
-                    color=COLORS["secondary"], alpha=0.9)
             best_idx = int(np.argmin(val_values))
-            ax.scatter([val_steps[best_idx]], [val_values[best_idx]],
-                       s=200, c=COLORS["accent"], zorder=5, marker="*",
-                       edgecolors="white", linewidth=2,
-                       label=f"Best: {val_values[best_idx]:.3f}")
             # Annotate best point
-            ax.annotate(f"Epoch {val_steps[best_idx]}",
-                        xy=(val_steps[best_idx], val_values[best_idx]),
-                        xytext=(10, 20), textcoords="offset points",
-                        fontsize=10, color=COLORS["accent"],
-                        arrowprops=dict(arrowstyle="->", color=COLORS["accent"]))
         ax.legend(fontsize=11, loc="upper right", framealpha=0.9)
         ax.set_ylim(bottom=0)
@@ -265,11 +312,22 @@ def plot_task_metrics(run, interactive: bool = False) -> None:
     val_sum = client.get_metric_history(run.info.run_id, "val_summarization_loss")
     if train_sum:
-        ax.plot([m.step for m in train_sum], [m.value for m in train_sum],
-                label="Train", linewidth=2.5, color=COLORS["summary"])
     if val_sum:
-        ax.plot([m.step for m in val_sum], [m.value for m in val_sum],
-                label="Validation", linewidth=2.5, color=COLORS["secondary"], linestyle="--")
     ax.set_title("Summarization Loss")
     ax.set_xlabel("Epoch")
@@ -286,20 +344,43 @@ def plot_task_metrics(run, interactive: bool = False) -> None:
     val_f1 = client.get_metric_history(run.info.run_id, "val_emotion_f1")
     if train_emo:
-        ax.plot([m.step for m in train_emo], [m.value for m in train_emo],
-                label="Train Loss", linewidth=2.5, color=COLORS["emotion"])
     if val_emo:
-        ax.plot([m.step for m in val_emo], [m.value for m in val_emo],
-                label="Val Loss", linewidth=2.5, color=COLORS["secondary"], linestyle="--")
     # Secondary axis for F1
     ax2 = ax.twinx()
     if train_f1:
-        ax2.plot([m.step for m in train_f1], [m.value for m in train_f1],
-                 label="Train F1", linewidth=2, color=COLORS["accent"], alpha=0.7)
     if val_f1:
-        ax2.plot([m.step for m in val_f1], [m.value for m in val_f1],
-                 label="Val F1", linewidth=2, color=COLORS["highlight"], alpha=0.7)
         ax2.set_ylim(0, 1)
     ax.set_title("Emotion Detection (28 classes)")
@@ -320,19 +401,42 @@ def plot_task_metrics(run, interactive: bool = False) -> None:
     val_acc = client.get_metric_history(run.info.run_id, "val_topic_accuracy")
     if train_topic:
-        ax.plot([m.step for m in train_topic], [m.value for m in train_topic],
-                label="Train Loss", linewidth=2.5, color=COLORS["topic"])
     if val_topic:
-        ax.plot([m.step for m in val_topic], [m.value for m in val_topic],
-                label="Val Loss", linewidth=2.5, color=COLORS["secondary"], linestyle="--")
     ax2 = ax.twinx()
     if train_acc:
-        ax2.plot([m.step for m in train_acc], [m.value for m in train_acc],
-                 label="Train Acc", linewidth=2, color=COLORS["accent"], alpha=0.7)
     if val_acc:
-        ax2.plot([m.step for m in val_acc], [m.value for m in val_acc],
-                 label="Val Acc", linewidth=2, color=COLORS["highlight"], alpha=0.7)
         ax2.set_ylim(0, 1)
     ax.set_title("Topic Classification (4 classes)")
@@ -350,9 +454,11 @@ def plot_task_metrics(run, interactive: bool = False) -> None:
     ax.axis("off")
     # Get final metrics
-    summary_lines = ["+--------------------------------------+",
-                     "|     FINAL METRICS (Last Epoch)       |",
-                     "+--------------------------------------+"]
     if val_topic and val_acc:
         summary_lines.append(f"|  Topic Accuracy:    {val_acc[-1].value:>6.1%}         |")
@@ -363,8 +469,15 @@ def plot_task_metrics(run, interactive: bool = False) -> None:
     summary_lines.append("+--------------------------------------+")
-    ax.text(0.1, 0.6, "\n".join(summary_lines), fontsize=11, family="monospace",
-            verticalalignment="center", bbox=dict(boxstyle="round", facecolor=COLORS["light"]))
     # Add model info
     run_params = run.data.params
@@ -372,8 +485,7 @@ def plot_task_metrics(run, interactive: bool = False) -> None:
     model_info += f"Batch Size: {run_params.get('batch_size', 'N/A')}\n"
     model_info += f"Learning Rate: {run_params.get('learning_rate', 'N/A')}"
-    ax.text(0.1, 0.15, model_info, fontsize=10, color="gray",
-            verticalalignment="center")
     plt.tight_layout()
     output_path = OUTPUTS_DIR / "task_metrics.png"
@@ -392,13 +504,13 @@ def plot_learning_rate(run) -> None:
     if not lr_metrics or len(lr_metrics) < 2:
         # No LR data logged - generate theoretical schedule from config
         logger.info("  No LR metrics found - generating theoretical schedule...")
         # Get config from run params
         params = run.data.params
         lr_max = float(params.get("learning_rate", params.get("lr", 5e-5)))
         warmup_steps = int(params.get("warmup_steps", 500))
         max_epochs = int(params.get("max_epochs", 5))
         # Estimate total steps from training loss history
         train_loss = client.get_metric_history(run.info.run_id, "train_total_loss")
         if train_loss:
@@ -407,7 +519,7 @@ def plot_learning_rate(run) -> None:
             total_steps = max_epochs * estimated_steps_per_epoch
         else:
             total_steps = 4000  # Default fallback
         # Generate cosine schedule with warmup
         steps = np.arange(0, total_steps)
         values = []
@@ -418,25 +530,43 @@ def plot_learning_rate(run) -> None:
                 progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
                 lr = lr_max * max(0.1, 0.5 * (1 + np.cos(np.pi * progress)))
             values.append(lr)
         ax.fill_between(steps, values, alpha=0.3, color=COLORS["primary"])
         ax.plot(steps, values, linewidth=2.5, color=COLORS["primary"], label="Cosine + Warmup")
         # Mark warmup region
-        ax.axvline(warmup_steps, color=COLORS["secondary"], linestyle="--",
-                   alpha=0.7, linewidth=2, label=f"Warmup End ({warmup_steps})")
         ax.axvspan(0, warmup_steps, alpha=0.1, color=COLORS["highlight"])
         # Add annotation
-        ax.annotate(f"Peak LR: {lr_max:.1e}", xy=(warmup_steps, lr_max),
-                    xytext=(warmup_steps + 200, lr_max * 0.9),
-                    fontsize=10, color=COLORS["dark"],
-                    arrowprops=dict(arrowstyle="->", color=COLORS["dark"], alpha=0.5))
         ax.legend(loc="upper right")
-        ax.text(0.98, 0.02, "(Theoretical - actual LR not logged)",
-                transform=ax.transAxes, ha="right", va="bottom",
-                fontsize=9, color="gray", style="italic")
     else:
         steps = np.array([m.step for m in lr_metrics])
         values = [m.value for m in lr_metrics]
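Reviewer note: the fallback schedule in `plot_learning_rate` is only partly visible in this hunk, so the sketch below reconstructs it as a standalone function. The cosine branch and the 0.1 floor come from the lines shown. The linear warmup branch, the function name `theoretical_lr_schedule`, and the `floor` parameter are assumptions made for illustration and are not taken from the repository.

```python
import numpy as np


def theoretical_lr_schedule(lr_max, warmup_steps, total_steps, floor=0.1):
    """Return per-step learning rates: warmup followed by cosine decay."""
    values = []
    for step in range(total_steps):
        if step < warmup_steps:
            # Assumption: linear ramp from 0 to lr_max (this branch is not shown in the diff).
            lr = lr_max * step / max(1, warmup_steps)
        else:
            # Cosine decay as in the diff, clamped so the LR never drops below floor * lr_max.
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            lr = lr_max * max(floor, 0.5 * (1 + np.cos(np.pi * progress)))
        values.append(float(lr))
    return np.array(values)


# Defaults mirror the fallbacks read from run params in the diff.
schedule = theoretical_lr_schedule(lr_max=5e-5, warmup_steps=500, total_steps=4000)
```

Because of the `max(0.1, ...)` clamp, the plotted curve flattens at 10% of the peak rate instead of decaying to zero, which is worth keeping in mind when reading the "Theoretical" plot.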
@@ -449,10 +579,15 @@ def plot_learning_rate(run) -> None:
         params = run.data.params
         warmup_steps = int(params.get("warmup_steps", 500))
         if warmup_steps < max(steps):
-            ax.axvline(warmup_steps, color=COLORS["secondary"], linestyle="--",
-                       alpha=0.7, linewidth=2, label="Warmup End")
-            ax.axvspan(0, warmup_steps, alpha=0.1, color=COLORS["highlight"],
-                       label="Warmup Phase")
             ax.legend(loc="upper right")
     # Scientific notation for y-axis if needed
@@ -471,6 +606,7 @@ def plot_learning_rate(run) -> None:
 # Advanced Visualizations
 def plot_confusion_matrix(run, task: str = "topic") -> None:
     """
     Plot confusion matrix for classification tasks.
@@ -482,8 +618,16 @@ def plot_confusion_matrix(run, task: str = "topic") -> None:
     if task == "topic":
         default_labels = ["World", "Sports", "Business", "Sci/Tech"]
     else:  # emotion - top 8 for visibility
-        default_labels = ["admiration", "amusement", "anger", "annoyance",
-                          "approval", "caring", "curiosity", "desire"]
     if labels_path.exists():
         with open(labels_path) as f:
@@ -516,9 +660,16 @@ def plot_confusion_matrix(run, task: str = "topic") -> None:
     # Plot
     fig, ax = plt.subplots(figsize=(10, 8))
-    sns.heatmap(cm_normalized, annot=True, fmt=".2f", cmap=HEATMAP_CMAP,
-                xticklabels=labels[:n_classes], yticklabels=labels[:n_classes],
-                ax=ax, cbar_kws={"label": "Proportion"})
     ax.set_title(f"Confusion Matrix: {task.title()} Classification")
     ax.set_xlabel("Predicted Label")
@@ -570,7 +721,7 @@ def plot_3d_loss_landscape(run) -> None:
     # Synthetic loss surface (bowl shape with some local minima)
     min_loss = min(val_loss) if val_loss else min(train_loss)
-    Z = min_loss + 0.3 * (X**2 + Y**2) + 0.1 * np.sin(3*X) * np.cos(3*Y)
     # Add noise for realism
     Z += np.random.normal(0, 0.02, Z.shape)
@@ -584,41 +735,57 @@ def plot_3d_loss_landscape(run) -> None:
     fig = go.Figure()
     # Loss surface
-    fig.add_trace(go.Surface(
-        x=X, y=Y, z=Z,
-        colorscale=[[0, COLORS["accent"]], [0.5, COLORS["primary"]], [1, COLORS["secondary"]]],
-        opacity=0.8,
-        showscale=True,
-        colorbar=dict(title="Loss", x=1.02)
-    ))
     # Training trajectory
-    fig.add_trace(go.Scatter3d(
-        x=trajectory_x, y=trajectory_y, z=trajectory_z,
-        mode="lines+markers",
-        line=dict(color=COLORS["highlight"], width=5),
-        marker=dict(size=4, color=COLORS["highlight"]),
-        name="Training Path"
-    ))
     # Mark start and end
-    fig.add_trace(go.Scatter3d(
-        x=[trajectory_x[0]], y=[trajectory_y[0]], z=[trajectory_z[0]],
-        mode="markers+text",
-        marker=dict(size=10, color="red", symbol="circle"),
-        text=["Start"],
-        textposition="top center",
-        name="Start"
-    ))
-    fig.add_trace(go.Scatter3d(
-        x=[trajectory_x[-1]], y=[trajectory_y[-1]], z=[trajectory_z[-1]],
-        mode="markers+text",
-        marker=dict(size=10, color="green", symbol="diamond"),
-        text=["Converged"],
-        textposition="top center",
-        name="Converged"
-    ))
     fig.update_layout(
         title="Loss Landscape & Optimization Trajectory",
@@ -626,7 +793,7 @@ def plot_3d_loss_landscape(run) -> None:
             xaxis_title="Parameter Direction 1",
             yaxis_title="Parameter Direction 2",
             zaxis_title="Loss",
-            camera=dict(eye=dict(x=1.5, y=1.5, z=0.8))
         ),
         width=900,
         height=700,
@@ -658,26 +825,46 @@ def plot_3d_loss_landscape_static(run) -> None:
     X, Y = np.meshgrid(x, y)
     min_loss = min(train_loss)
-    Z = min_loss + 0.3 * (X**2 + Y**2) + 0.08 * np.sin(3*X) * np.cos(3*Y)
     fig = plt.figure(figsize=(12, 8))
     ax = fig.add_subplot(111, projection="3d")
     # Surface
-    surf = ax.plot_surface(X, Y, Z, cmap="viridis", alpha=0.7,
-                           linewidth=0, antialiased=True)
     # Training path
     path_x = np.linspace(-1.5, 0, len(train_loss))
     path_y = np.linspace(1.2, 0, len(train_loss))
-    ax.plot(path_x, path_y, train_loss, color=COLORS["secondary"],
-            linewidth=3, label="Training Path", zorder=10)
     # Start/end markers
-    ax.scatter([path_x[0]], [path_y[0]], train_loss[0],  # type: ignore[arg-type]
-               c="red", s=100, marker="o", label="Start")
-    ax.scatter([path_x[-1]], [path_y[-1]], train_loss[-1],  # type: ignore[arg-type]
-               c="green", s=100, marker="*", label="Converged")
     ax.set_xlabel("θ₁ Direction")
     ax.set_ylabel("θ₂ Direction")
@@ -722,7 +909,7 @@ def plot_embedding_space(run) -> None:
     for i in range(n_clusters):
         # Create cluster center
         center = np.random.randn(64) * 0.5
-        center[i*16:(i+1)*16] += 3  # Make clusters separable
         # Add samples around center
         samples = center + np.random.randn(n_samples // n_clusters, 64) * 0.5
@@ -742,8 +929,14 @@ def plot_embedding_space(run) -> None:
     for i in range(n_clusters):
         mask = cluster_labels == i
-        ax.scatter(embeddings_2d[mask, 0], embeddings_2d[mask, 1],
-                   c=colors[i], label=labels[i], alpha=0.6, s=30)
     ax.set_xlabel("t-SNE Dimension 1")
     ax.set_ylabel("t-SNE Dimension 2")
@@ -787,14 +980,18 @@ def plot_training_dynamics(run) -> None:
     # Smoothed loss (simple moving average)
     if len(train_loss) > 5:
         window = min(5, len(train_loss) // 2)
-        smoothed = np.convolve(train_loss, np.ones(window)/window, mode="valid")
-        smoothed_steps = train_steps[window-1:]
-        ax.plot(smoothed_steps, smoothed, color=COLORS["primary"],
-                linewidth=2.5, label="Training (smoothed)")
     if val_loss:
-        ax.plot(val_steps, val_loss, color=COLORS["secondary"],
-                linewidth=2.5, label="Validation")
     ax.set_title("Loss Convergence")
     ax.set_xlabel("Epoch")
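Reviewer note: the smoothing in this hunk convolves the loss with a uniform kernel, which is a simple moving average. An exponential moving average would instead weight recent epochs more heavily. The sketch below contrasts the two. The helper names, the `alpha` value, and the sample losses are hypothetical and chosen only for illustration.

```python
import numpy as np


def simple_moving_average(values, window):
    """Uniform-window mean, equivalent to the np.convolve call in the diff."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


def exponential_moving_average(values, alpha=0.3):
    """EMA: each output blends the new value with the previous smoothed value."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return np.array(smoothed)


# Hypothetical per-epoch losses (six epochs, so the len > 5 guard in the diff passes).
train_loss = [4.0, 3.0, 2.0, 2.0, 1.0, 1.0]
train_steps = list(range(1, len(train_loss) + 1))

window = min(5, len(train_loss) // 2)
sma = simple_moving_average(train_loss, window)
sma_steps = train_steps[window - 1 :]  # "valid" mode drops the first window - 1 points
ema = exponential_moving_average(train_loss)
```

The moving average shortens the series by `window - 1` points, which is why the diff slices `train_steps` before plotting. The EMA keeps the full length.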
@@ -806,8 +1003,10 @@ def plot_training_dynamics(run) -> None:
     ax = axes[0, 1]
     if len(train_loss) > 1:
-        improvements = [-(train_loss[i] - train_loss[i-1])/train_loss[i-1] * 100
-                        for i in range(1, len(train_loss))]
         colors_bar = [COLORS["accent"] if imp > 0 else COLORS["secondary"] for imp in improvements]
         ax.bar(train_steps[1:], improvements, color=colors_bar, alpha=0.7)
         ax.axhline(y=0, color="gray", linestyle="--", alpha=0.5)
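Reviewer note: the bar chart plots the per-epoch relative improvement, i.e. the percentage drop in training loss from one epoch to the next (positive means the loss fell). A minimal sketch of that computation follows. The function name and the zero-loss guard are assumptions added here, since the expression in the diff would raise `ZeroDivisionError` if a previous loss were exactly 0.

```python
def relative_improvements(losses):
    """Percentage change between consecutive losses; positive values mean improvement."""
    improvements = []
    for i in range(1, len(losses)):
        prev = losses[i - 1]
        if prev == 0:
            # Assumption: report no change rather than divide by zero.
            improvements.append(0.0)
            continue
        improvements.append(-(losses[i] - prev) / prev * 100)
    return improvements


# Hypothetical losses: a 50% drop followed by a 50% regression.
example = relative_improvements([2.0, 1.0, 1.5])
```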
@@ -862,6 +1061,7 @@ def plot_training_dynamics(run) -> None:
 # Dashboard Generator
 def generate_dashboard(run) -> None:
     """
     Generate an interactive HTML dashboard with all visualizations.
@@ -883,63 +1083,73 @@ def generate_dashboard(run) -> None:
     # Create subplots
     fig = make_subplots(
-        rows=2, cols=2,
         subplot_titles=("Total Loss", "Task Losses", "Learning Rate", "Metrics"),
-        specs=[[{}, {}], [{}, {}]]
     )
     # Total loss
     if train_loss:
         fig.add_trace(
-            go.Scatter(x=train_steps, y=train_loss, name="Train Loss",
-                       line=dict(color=COLORS["primary"])),
-            row=1, col=1
         )
     if val_loss:
         fig.add_trace(
-            go.Scatter(x=val_steps, y=val_loss, name="Val Loss",
-                       line=dict(color=COLORS["secondary"])),
-            row=1, col=1
         )
     # Per-task losses
-    for task, color in [("summarization", COLORS["summary"]),
-                        ("emotion", COLORS["emotion"]),
-                        ("topic", COLORS["topic"])]:
         steps, values = get_metric_history(run, f"val_{task}_loss")
         if values:
             fig.add_trace(
-                go.Scatter(x=steps, y=values, name=f"{task.title()} Loss",
-                           line=dict(color=color)),
-                row=1, col=2
             )
     # Learning rate
     lr_metrics = client.get_metric_history(run.info.run_id, "learning_rate")
     if lr_metrics:
         fig.add_trace(
-            go.Scatter(x=[m.step for m in lr_metrics], y=[m.value for m in lr_metrics],
-                       name="Learning Rate", fill="tozeroy",
-                       line=dict(color=COLORS["primary"])),
-            row=2, col=1
         )
     # Accuracy metrics
-    for metric, color in [("topic_accuracy", COLORS["topic"]),
-                          ("emotion_f1", COLORS["emotion"])]:
         steps, values = get_metric_history(run, f"val_{metric}")
         if values:
             fig.add_trace(
-                go.Scatter(x=steps, y=values, name=metric.replace("_", " ").title(),
-                           line=dict(color=color)),
-                row=2, col=2
             )
     fig.update_layout(
-        title="LexiMind Training Dashboard",
-        height=800,
-        template="plotly_white",
-        showlegend=True
     )
     output_path = OUTPUTS_DIR / "training_dashboard.html"
@@ -949,17 +1159,20 @@ def generate_dashboard(run) -> None:
 # Main Entry Point
 def main():
     """Generate all training visualizations."""
     parser = argparse.ArgumentParser(description="LexiMind Visualization Suite")
-    parser.add_argument("--interactive", action="store_true",
-                        help="Generate interactive HTML plots (requires plotly)")
-    parser.add_argument("--landscape", action="store_true",
-                        help="Include 3D loss landscape visualization")
-    parser.add_argument("--dashboard", action="store_true",
-                        help="Generate interactive dashboard")
-    parser.add_argument("--all", action="store_true",
-                        help="Generate all visualizations")
     args = parser.parse_args()
     logger.info("=" * 60)

 # Professional color palette (accessible + publication-ready)
 COLORS = {
+    "primary": "#2E86AB",  # Deep blue - training
+    "secondary": "#E94F37",  # Coral red - validation
+    "accent": "#28A745",  # Green - best points
+    "highlight": "#F7B801",  # Gold - highlights
+    "dark": "#1E3A5F",  # Navy - text
+    "light": "#F5F5F5",  # Light gray - background
+    "topic": "#8338EC",  # Purple
+    "emotion": "#FF6B6B",  # Salmon
+    "summary": "#06D6A0",  # Teal
 }
 # Style configuration
 plt.style.use("seaborn-v0_8-whitegrid")
+plt.rcParams.update(
+    {
+        "font.family": "sans-serif",
+        "font.size": 11,
+        "axes.titlesize": 14,
+        "axes.titleweight": "bold",
+        "axes.labelsize": 12,
+        "legend.fontsize": 10,
+        "figure.titlesize": 16,
+        "figure.titleweight": "bold",
+        "savefig.dpi": 150,
+        "savefig.bbox": "tight",
+    }
+)
 # Custom colormap for heatmaps
 HEATMAP_CMAP = LinearSegmentedColormap.from_list(
 # MLflow Utilities
 def get_mlflow_client():
     """Get MLflow client with correct tracking URI."""
     if not HAS_MLFLOW:
         raise ImportError("MLflow not installed. Install with: pip install mlflow")
     import mlflow
     import mlflow.tracking
     # Use SQLite database (same as trainer.py)
     mlflow.set_tracking_uri("sqlite:///mlruns.db")
     return mlflow.tracking.MlflowClient()
 # Core Training Visualizations
 def plot_loss_curves(run, interactive: bool = False) -> None:
     """
     Plot training and validation loss over time.
     if interactive and HAS_PLOTLY:
         import plotly.graph_objects as go
         fig = go.Figure()
         if train_values:
+            fig.add_trace(
+                go.Scatter(
+                    x=train_steps,
+                    y=train_values,
+                    name="Training Loss",
+                    mode="lines",
+                    line=dict(color=COLORS["primary"], width=3),
+                )
+            )
         if val_values:
+            fig.add_trace(
+                go.Scatter(
+                    x=val_steps,
+                    y=val_values,
+                    name="Validation Loss",
+                    mode="lines",
+                    line=dict(color=COLORS["secondary"], width=3),
+                )
+            )
             # Best point
             best_idx = int(np.argmin(val_values))
+            fig.add_trace(
+                go.Scatter(
+                    x=[val_steps[best_idx]],
+                    y=[val_values[best_idx]],
+                    name=f"Best: {val_values[best_idx]:.3f}",
+                    mode="markers",
+                    marker=dict(color=COLORS["accent"], size=15, symbol="star"),
+                )
+            )
         fig.update_layout(
             title="Training Progress: Multi-Task Loss",
             xaxis_title="Epoch",
             yaxis_title="Loss",
             template="plotly_white",
+            hovermode="x unified",
         )
         output_path = OUTPUTS_DIR / "training_loss_curve.html"
     fig, ax = plt.subplots(figsize=(12, 6))
     if not train_values:
+        ax.text(
+            0.5,
+            0.5,
+            "No training data yet\n\nWaiting for first epoch...",
+            ha="center",
+            va="center",
+            fontsize=14,
+            color="gray",
+        )
         ax.set_xlim(0, 1)
         ax.set_ylim(0, 1)
     else:
         # Training curve
+        ax.plot(
+            train_steps,
+            train_values,
+            label="Training Loss",
+            linewidth=2.5,
+            color=COLORS["primary"],
+            alpha=0.9,
+        )
         # Validation curve with best point
         if val_values:
+            ax.plot(
+                val_steps,
+                val_values,
+                label="Validation Loss",
+                linewidth=2.5,
+                color=COLORS["secondary"],
+                alpha=0.9,
+            )
             best_idx = int(np.argmin(val_values))
+            ax.scatter(
+                [val_steps[best_idx]],
+                [val_values[best_idx]],
+                s=200,
+                c=COLORS["accent"],
+                zorder=5,
+                marker="*",
+                edgecolors="white",
+                linewidth=2,
+                label=f"Best: {val_values[best_idx]:.3f}",
+            )
             # Annotate best point
+            ax.annotate(
+                f"Epoch {val_steps[best_idx]}",
+                xy=(val_steps[best_idx], val_values[best_idx]),
+                xytext=(10, 20),
+                textcoords="offset points",
+                fontsize=10,
+                color=COLORS["accent"],
+                arrowprops=dict(arrowstyle="->", color=COLORS["accent"]),
+            )
         ax.legend(fontsize=11, loc="upper right", framealpha=0.9)
         ax.set_ylim(bottom=0)
     val_sum = client.get_metric_history(run.info.run_id, "val_summarization_loss")
     if train_sum:
+        ax.plot(
+            [m.step for m in train_sum],
+            [m.value for m in train_sum],
+            label="Train",
+            linewidth=2.5,
+            color=COLORS["summary"],
+        )
     if val_sum:
+        ax.plot(
+            [m.step for m in val_sum],
+            [m.value for m in val_sum],
+            label="Validation",
+            linewidth=2.5,
+            color=COLORS["secondary"],
+            linestyle="--",
+        )
     ax.set_title("Summarization Loss")
     ax.set_xlabel("Epoch")
     val_f1 = client.get_metric_history(run.info.run_id, "val_emotion_f1")
     if train_emo:
+        ax.plot(
+            [m.step for m in train_emo],
+            [m.value for m in train_emo],
+            label="Train Loss",
+            linewidth=2.5,
+            color=COLORS["emotion"],
+        )
     if val_emo:
+        ax.plot(
+            [m.step for m in val_emo],
+            [m.value for m in val_emo],
+            label="Val Loss",
+            linewidth=2.5,
+            color=COLORS["secondary"],
+            linestyle="--",
+        )
     # Secondary axis for F1
     ax2 = ax.twinx()
     if train_f1:
+        ax2.plot(
+            [m.step for m in train_f1],
+            [m.value for m in train_f1],
+            label="Train F1",
+            linewidth=2,
+            color=COLORS["accent"],
+            alpha=0.7,
+        )
     if val_f1:
+        ax2.plot(
+            [m.step for m in val_f1],
+            [m.value for m in val_f1],
+            label="Val F1",
+            linewidth=2,
+            color=COLORS["highlight"],
+            alpha=0.7,
+        )
         ax2.set_ylim(0, 1)
     ax.set_title("Emotion Detection (28 classes)")
     val_acc = client.get_metric_history(run.info.run_id, "val_topic_accuracy")
     if train_topic:
+        ax.plot(
+            [m.step for m in train_topic],
+            [m.value for m in train_topic],
+            label="Train Loss",
+            linewidth=2.5,
+            color=COLORS["topic"],
+        )
     if val_topic:
+        ax.plot(
+            [m.step for m in val_topic],
+            [m.value for m in val_topic],
+            label="Val Loss",
+            linewidth=2.5,
+            color=COLORS["secondary"],
+            linestyle="--",
+        )
     ax2 = ax.twinx()
     if train_acc:
+        ax2.plot(
+            [m.step for m in train_acc],
+            [m.value for m in train_acc],
+            label="Train Acc",
+            linewidth=2,
+            color=COLORS["accent"],
+            alpha=0.7,
+        )
     if val_acc:
+        ax2.plot(
+            [m.step for m in val_acc],
+            [m.value for m in val_acc],
+            label="Val Acc",
+            linewidth=2,
+            color=COLORS["highlight"],
+            alpha=0.7,
+        )
         ax2.set_ylim(0, 1)
     ax.set_title("Topic Classification (4 classes)")
     ax.axis("off")
     # Get final metrics
+    summary_lines = [
+        "+--------------------------------------+",
+        "|     FINAL METRICS (Last Epoch)       |",
+        "+--------------------------------------+",
+    ]
     if val_topic and val_acc:
         summary_lines.append(f"|  Topic Accuracy:    {val_acc[-1].value:>6.1%}         |")
     summary_lines.append("+--------------------------------------+")
+    ax.text(
+        0.1,
+        0.6,
+        "\n".join(summary_lines),
+        fontsize=11,
+        family="monospace",
+        verticalalignment="center",
+        bbox=dict(boxstyle="round", facecolor=COLORS["light"]),
+    )
     # Add model info
     run_params = run.data.params
     model_info += f"Batch Size: {run_params.get('batch_size', 'N/A')}\n"
     model_info += f"Learning Rate: {run_params.get('learning_rate', 'N/A')}"
+    ax.text(0.1, 0.15, model_info, fontsize=10, color="gray", verticalalignment="center")
     plt.tight_layout()
     output_path = OUTPUTS_DIR / "task_metrics.png"
     if not lr_metrics or len(lr_metrics) < 2:
         # No LR data logged - generate theoretical schedule from config
         logger.info("  No LR metrics found - generating theoretical schedule...")
         # Get config from run params
         params = run.data.params
         lr_max = float(params.get("learning_rate", params.get("lr", 5e-5)))
         warmup_steps = int(params.get("warmup_steps", 500))
         max_epochs = int(params.get("max_epochs", 5))
         # Estimate total steps from training loss history
         train_loss = client.get_metric_history(run.info.run_id, "train_total_loss")
         if train_loss:
             total_steps = max_epochs * estimated_steps_per_epoch
         else:
             total_steps = 4000  # Default fallback
         # Generate cosine schedule with warmup
         steps = np.arange(0, total_steps)
         values = []
                 progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
                 lr = lr_max * max(0.1, 0.5 * (1 + np.cos(np.pi * progress)))
             values.append(lr)
         ax.fill_between(steps, values, alpha=0.3, color=COLORS["primary"])
         ax.plot(steps, values, linewidth=2.5, color=COLORS["primary"], label="Cosine + Warmup")
         # Mark warmup region
+        ax.axvline(
+            warmup_steps,
+            color=COLORS["secondary"],
+            linestyle="--",
+            alpha=0.7,
+            linewidth=2,
+            label=f"Warmup End ({warmup_steps})",
+        )
         ax.axvspan(0, warmup_steps, alpha=0.1, color=COLORS["highlight"])
         # Add annotation
+        ax.annotate(
+            f"Peak LR: {lr_max:.1e}",
+            xy=(warmup_steps, lr_max),
+            xytext=(warmup_steps + 200, lr_max * 0.9),
+            fontsize=10,
+            color=COLORS["dark"],
+            arrowprops=dict(arrowstyle="->", color=COLORS["dark"], alpha=0.5),
+        )
         ax.legend(loc="upper right")
+        ax.text(
+            0.98,
+            0.02,
+            "(Theoretical - actual LR not logged)",
+            transform=ax.transAxes,
+            ha="right",
+            va="bottom",
+            fontsize=9,
+            color="gray",
+            style="italic",
+        )
     else:
         steps = np.array([m.step for m in lr_metrics])
         values = [m.value for m in lr_metrics]
         params = run.data.params
         warmup_steps = int(params.get("warmup_steps", 500))
         if warmup_steps < max(steps):
+            ax.axvline(
+                warmup_steps,
+                color=COLORS["secondary"],
+                linestyle="--",
+                alpha=0.7,
+                linewidth=2,
+                label="Warmup End",
+            )
+            ax.axvspan(0, warmup_steps, alpha=0.1, color=COLORS["highlight"], label="Warmup Phase")
             ax.legend(loc="upper right")
     # Scientific notation for y-axis if needed
 # Advanced Visualizations
 def plot_confusion_matrix(run, task: str = "topic") -> None:
     """
     Plot confusion matrix for classification tasks.
     if task == "topic":
         default_labels = ["World", "Sports", "Business", "Sci/Tech"]
     else:  # emotion - top 8 for visibility
+        default_labels = [
+            "admiration",
+            "amusement",
+            "anger",
+            "annoyance",
+            "approval",
+            "caring",
+            "curiosity",
+            "desire",
+        ]
     if labels_path.exists():
         with open(labels_path) as f:
     # Plot
     fig, ax = plt.subplots(figsize=(10, 8))
+    sns.heatmap(
+        cm_normalized,
+        annot=True,
+        fmt=".2f",
+        cmap=HEATMAP_CMAP,
+        xticklabels=labels[:n_classes],
+        yticklabels=labels[:n_classes],
+        ax=ax,
+        cbar_kws={"label": "Proportion"},
+    )
     ax.set_title(f"Confusion Matrix: {task.title()} Classification")
     ax.set_xlabel("Predicted Label")
     # Synthetic loss surface (bowl shape with some local minima)
     min_loss = min(val_loss) if val_loss else min(train_loss)
+    Z = min_loss + 0.3 * (X**2 + Y**2) + 0.1 * np.sin(3 * X) * np.cos(3 * Y)
     # Add noise for realism
     Z += np.random.normal(0, 0.02, Z.shape)
     fig = go.Figure()
     # Loss surface
+    fig.add_trace(
+        go.Surface(
+            x=X,
+            y=Y,
+            z=Z,
+            colorscale=[[0, COLORS["accent"]], [0.5, COLORS["primary"]], [1, COLORS["secondary"]]],
+            opacity=0.8,
+            showscale=True,
+            colorbar=dict(title="Loss", x=1.02),
+        )
+    )
     # Training trajectory
+    fig.add_trace(
+        go.Scatter3d(
+            x=trajectory_x,
+            y=trajectory_y,
+            z=trajectory_z,
+            mode="lines+markers",
+            line=dict(color=COLORS["highlight"], width=5),
+            marker=dict(size=4, color=COLORS["highlight"]),
+            name="Training Path",
+        )
+    )
     # Mark start and end
+    fig.add_trace(
+        go.Scatter3d(
+            x=[trajectory_x[0]],
+            y=[trajectory_y[0]],
+            z=[trajectory_z[0]],
+            mode="markers+text",
+            marker=dict(size=10, color="red", symbol="circle"),
+            text=["Start"],
+            textposition="top center",
+            name="Start",
+        )
+    )
+    fig.add_trace(
+        go.Scatter3d(
+            x=[trajectory_x[-1]],
+            y=[trajectory_y[-1]],
+            z=[trajectory_z[-1]],
+            mode="markers+text",
+            marker=dict(size=10, color="green", symbol="diamond"),
+            text=["Converged"],
+            textposition="top center",
+            name="Converged",
+        )
+    )
     fig.update_layout(
         title="Loss Landscape & Optimization Trajectory",
             xaxis_title="Parameter Direction 1",
             yaxis_title="Parameter Direction 2",
             zaxis_title="Loss",
+            camera=dict(eye=dict(x=1.5, y=1.5, z=0.8)),
         ),
         width=900,
         height=700,
     X, Y = np.meshgrid(x, y)
     min_loss = min(train_loss)
+    Z = min_loss + 0.3 * (X**2 + Y**2) + 0.08 * np.sin(3 * X) * np.cos(3 * Y)
     fig = plt.figure(figsize=(12, 8))
     ax = fig.add_subplot(111, projection="3d")
     # Surface
+    surf = ax.plot_surface(X, Y, Z, cmap="viridis", alpha=0.7, linewidth=0, antialiased=True)
     # Training path
     path_x = np.linspace(-1.5, 0, len(train_loss))
     path_y = np.linspace(1.2, 0, len(train_loss))
+    ax.plot(
+        path_x,
+        path_y,
+        train_loss,
+        color=COLORS["secondary"],
+        linewidth=3,
+        label="Training Path",
+        zorder=10,
+    )
     # Start/end markers
+    ax.scatter(
+        [path_x[0]],
+        [path_y[0]],
+        train_loss[0],  # type: ignore[arg-type]
+        c="red",
+        s=100,
+        marker="o",
+        label="Start",
+    )
+    ax.scatter(
+        [path_x[-1]],
+        [path_y[-1]],
+        train_loss[-1],  # type: ignore[arg-type]
+        c="green",
+        s=100,
+        marker="*",
+        label="Converged",
+    )
     ax.set_xlabel("θ₁ Direction")
     ax.set_ylabel("θ₂ Direction")
     for i in range(n_clusters):
         # Create cluster center
         center = np.random.randn(64) * 0.5
+        center[i * 16 : (i + 1) * 16] += 3  # Make clusters separable
         # Add samples around center
         samples = center + np.random.randn(n_samples // n_clusters, 64) * 0.5
     for i in range(n_clusters):
         mask = cluster_labels == i
+        ax.scatter(
+            embeddings_2d[mask, 0],
+            embeddings_2d[mask, 1],
+            c=colors[i],
+            label=labels[i],
+            alpha=0.6,
+            s=30,
+        )
     ax.set_xlabel("t-SNE Dimension 1")
     ax.set_ylabel("t-SNE Dimension 2")
     # Smoothed loss (simple moving average over a uniform window)
     if len(train_loss) > 5:
         window = min(5, len(train_loss) // 2)
+        smoothed = np.convolve(train_loss, np.ones(window) / window, mode="valid")
+        smoothed_steps = train_steps[window - 1 :]
+        ax.plot(
+            smoothed_steps,
+            smoothed,
+            color=COLORS["primary"],
+            linewidth=2.5,
+            label="Training (smoothed)",
+        )
     if val_loss:
+        ax.plot(val_steps, val_loss, color=COLORS["secondary"], linewidth=2.5, label="Validation")
     ax.set_title("Loss Convergence")
     ax.set_xlabel("Epoch")
     ax = axes[0, 1]
     if len(train_loss) > 1:
+        improvements = [
+            -(train_loss[i] - train_loss[i - 1]) / train_loss[i - 1] * 100
+            for i in range(1, len(train_loss))
+        ]
         colors_bar = [COLORS["accent"] if imp > 0 else COLORS["secondary"] for imp in improvements]
         ax.bar(train_steps[1:], improvements, color=colors_bar, alpha=0.7)
         ax.axhline(y=0, color="gray", linestyle="--", alpha=0.5)
 # Dashboard Generator
 def generate_dashboard(run) -> None:
     """
     Generate an interactive HTML dashboard with all visualizations.
     # Create subplots
     fig = make_subplots(
+        rows=2,
+        cols=2,
         subplot_titles=("Total Loss", "Task Losses", "Learning Rate", "Metrics"),
+        specs=[[{}, {}], [{}, {}]],
     )
     # Total loss
     if train_loss:
         fig.add_trace(
+            go.Scatter(
+                x=train_steps, y=train_loss, name="Train Loss", line=dict(color=COLORS["primary"])
+            ),
+            row=1,
+            col=1,
         )
     if val_loss:
         fig.add_trace(
+            go.Scatter(
+                x=val_steps, y=val_loss, name="Val Loss", line=dict(color=COLORS["secondary"])
+            ),
+            row=1,
+            col=1,
         )
     # Per-task losses
+    for task, color in [
+        ("summarization", COLORS["summary"]),
+        ("emotion", COLORS["emotion"]),
+        ("topic", COLORS["topic"]),
+    ]:
         steps, values = get_metric_history(run, f"val_{task}_loss")
         if values:
             fig.add_trace(
+                go.Scatter(x=steps, y=values, name=f"{task.title()} Loss", line=dict(color=color)),
+                row=1,
+                col=2,
             )
     # Learning rate
     lr_metrics = client.get_metric_history(run.info.run_id, "learning_rate")
     if lr_metrics:
         fig.add_trace(
+            go.Scatter(
+                x=[m.step for m in lr_metrics],
+                y=[m.value for m in lr_metrics],
+                name="Learning Rate",
+                fill="tozeroy",
+                line=dict(color=COLORS["primary"]),
+            ),
+            row=2,
+            col=1,
         )
     # Accuracy metrics
+    for metric, color in [("topic_accuracy", COLORS["topic"]), ("emotion_f1", COLORS["emotion"])]:
         steps, values = get_metric_history(run, f"val_{metric}")
         if values:
             fig.add_trace(
+                go.Scatter(
+                    x=steps, y=values, name=metric.replace("_", " ").title(), line=dict(color=color)
+                ),
+                row=2,
+                col=2,
             )
     fig.update_layout(
+        title="LexiMind Training Dashboard", height=800, template="plotly_white", showlegend=True
     )
     output_path = OUTPUTS_DIR / "training_dashboard.html"
 # Main Entry Point
 def main():
     """Generate all training visualizations."""
     parser = argparse.ArgumentParser(description="LexiMind Visualization Suite")
+    parser.add_argument(
+        "--interactive",
+        action="store_true",
+        help="Generate interactive HTML plots (requires plotly)",
+    )
+    parser.add_argument(
+        "--landscape", action="store_true", help="Include 3D loss landscape visualization"
+    )
+    parser.add_argument("--dashboard", action="store_true", help="Generate interactive dashboard")
+    parser.add_argument("--all", action="store_true", help="Generate all visualizations")
     args = parser.parse_args()
     logger.info("=" * 60)
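
Note on the loss-convergence smoothing in `scripts/visualize_training.py`: the code convolves `train_loss` with a uniform window using `mode="valid"` and plots the result against `train_steps[window - 1 :]`. The sketch below reproduces that alignment in plain Python, without NumPy, to show why the first `window - 1` steps are dropped. `moving_average` is a hypothetical helper name and the sample values are made up for illustration; neither comes from the repository.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Uniform moving average over full windows only (like mode="valid")."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Illustrative data, not taken from a real training run.
train_steps = [1, 2, 3, 4, 5, 6]
train_loss = [4.0, 3.0, 2.0, 2.0, 1.0, 1.0]
window = min(5, len(train_loss) // 2)  # same window rule as the plotting code

smoothed = moving_average(train_loss, window)
smoothed_steps = train_steps[window - 1 :]

# Each smoothed value belongs to the last step of its window, so the two
# sequences have the same length and can be plotted against each other.
assert len(smoothed) == len(smoothed_steps)
```

Because the window is uniform, this is a simple moving average rather than an exponential one: every point inside the window carries the same weight.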

src/data/dataset.py CHANGED Viewed

@@ -24,6 +24,7 @@ from torch.utils.data import Dataset
 @dataclass
 class SummarizationExample:
     """Container for abstractive summarization samples."""
     source: str
     summary: str
@@ -31,6 +32,7 @@ class SummarizationExample:
 @dataclass
 class EmotionExample:
     """Container for multi-label emotion classification samples."""
     text: str
     emotions: Sequence[str]
@@ -38,12 +40,14 @@ class EmotionExample:
 @dataclass
 class TopicExample:
     """Container for topic clustering / classification samples."""
     text: str
     topic: str
 class SummarizationDataset(Dataset[SummarizationExample]):
     """Dataset yielding encoder-decoder training pairs."""
     def __init__(self, examples: Iterable[SummarizationExample]) -> None:
         self._examples = list(examples)
@@ -56,6 +60,7 @@ class SummarizationDataset(Dataset[SummarizationExample]):
 class EmotionDataset(Dataset[EmotionExample]):
     """Dataset that owns a scikit-learn MultiLabelBinarizer for emotions."""
     def __init__(
         self,
         examples: Iterable[EmotionExample],
@@ -91,6 +96,7 @@ class EmotionDataset(Dataset[EmotionExample]):
 class TopicDataset(Dataset[TopicExample]):
     """Dataset that owns a LabelEncoder for topic ids."""
     def __init__(
         self,
         examples: Iterable[TopicExample],
@@ -241,7 +247,7 @@ def load_topic_jsonl(path: str) -> List[TopicExample]:
 def _text_fingerprint(text: str, n_chars: int = 200) -> str:
     """Create a stable fingerprint from the first N characters of text.
     Uses a hash of the normalized (lowered, whitespace-collapsed) prefix
     to detect document-level overlap across tasks.
     """
@@ -255,28 +261,28 @@ def deduplicate_across_tasks(
     emotion_examples: List[EmotionExample] | None = None,
 ) -> Dict[str, int]:
     """Detect and report cross-task document overlap.
     Checks whether texts appearing in the summarization dataset also appear
     in the topic or emotion datasets, which could create data leakage in MTL.
     Returns:
         Dict with overlap counts between task pairs.
     """
     summ_fps: Set[str] = {_text_fingerprint(ex.source) for ex in summ_examples}
     topic_fps: Set[str] = {_text_fingerprint(ex.text) for ex in topic_examples}
     overlap: Dict[str, int] = {
         "summ_topic_overlap": len(summ_fps & topic_fps),
         "summ_total": len(summ_fps),
         "topic_total": len(topic_fps),
     }
     if emotion_examples:
         emot_fps: Set[str] = {_text_fingerprint(ex.text) for ex in emotion_examples}
         overlap["summ_emotion_overlap"] = len(summ_fps & emot_fps)
         overlap["topic_emotion_overlap"] = len(topic_fps & emot_fps)
         overlap["emotion_total"] = len(emot_fps)
     return overlap
@@ -286,20 +292,20 @@ def remove_overlapping_examples(
     split: str = "val",
 ) -> tuple[List[TopicExample], int]:
     """Remove topic examples whose texts overlap with summarization data.
-    This prevents cross-task data leakage where a document seen during
     summarization training could boost topic classification on validation/test.
     Args:
         primary_examples: Topic examples to filter
         reference_examples: Summarization examples to check against
         split: Name of split being processed (for logging)
     Returns:
         Tuple of (filtered_examples, num_removed)
     """
     ref_fps = {_text_fingerprint(ex.source) for ex in reference_examples}
     filtered = []
     removed = 0
     for ex in primary_examples:
@@ -308,8 +314,8 @@ def remove_overlapping_examples(
             removed += 1
         else:
             filtered.append(ex)
     if removed > 0:
         print(f"  Dedup: removed {removed} overlapping examples from topic {split}")
     return filtered, removed

 @dataclass
 class SummarizationExample:
     """Container for abstractive summarization samples."""
     source: str
     summary: str
 @dataclass
 class EmotionExample:
     """Container for multi-label emotion classification samples."""
     text: str
     emotions: Sequence[str]
 @dataclass
 class TopicExample:
     """Container for topic clustering / classification samples."""
     text: str
     topic: str
 class SummarizationDataset(Dataset[SummarizationExample]):
     """Dataset yielding encoder-decoder training pairs."""
     def __init__(self, examples: Iterable[SummarizationExample]) -> None:
         self._examples = list(examples)
 class EmotionDataset(Dataset[EmotionExample]):
     """Dataset that owns a scikit-learn MultiLabelBinarizer for emotions."""
     def __init__(
         self,
         examples: Iterable[EmotionExample],
 class TopicDataset(Dataset[TopicExample]):
     """Dataset that owns a LabelEncoder for topic ids."""
     def __init__(
         self,
         examples: Iterable[TopicExample],
 def _text_fingerprint(text: str, n_chars: int = 200) -> str:
     """Create a stable fingerprint from the first N characters of text.
     Uses a hash of the normalized (lowered, whitespace-collapsed) prefix
     to detect document-level overlap across tasks.
     """
     emotion_examples: List[EmotionExample] | None = None,
 ) -> Dict[str, int]:
     """Detect and report cross-task document overlap.
     Checks whether texts appearing in the summarization dataset also appear
     in the topic or emotion datasets, which could create data leakage in MTL.
     Returns:
         Dict with overlap counts between task pairs.
     """
     summ_fps: Set[str] = {_text_fingerprint(ex.source) for ex in summ_examples}
     topic_fps: Set[str] = {_text_fingerprint(ex.text) for ex in topic_examples}
     overlap: Dict[str, int] = {
         "summ_topic_overlap": len(summ_fps & topic_fps),
         "summ_total": len(summ_fps),
         "topic_total": len(topic_fps),
     }
     if emotion_examples:
         emot_fps: Set[str] = {_text_fingerprint(ex.text) for ex in emotion_examples}
         overlap["summ_emotion_overlap"] = len(summ_fps & emot_fps)
         overlap["topic_emotion_overlap"] = len(topic_fps & emot_fps)
         overlap["emotion_total"] = len(emot_fps)
     return overlap
     split: str = "val",
 ) -> tuple[List[TopicExample], int]:
     """Remove topic examples whose texts overlap with summarization data.
+    This prevents cross-task data leakage where a document seen during
     summarization training could boost topic classification on validation/test.
     Args:
         primary_examples: Topic examples to filter
         reference_examples: Summarization examples to check against
         split: Name of split being processed (for logging)
     Returns:
         Tuple of (filtered_examples, num_removed)
     """
     ref_fps = {_text_fingerprint(ex.source) for ex in reference_examples}
     filtered = []
     removed = 0
     for ex in primary_examples:
             removed += 1
         else:
             filtered.append(ex)
     if removed > 0:
         print(f"  Dedup: removed {removed} overlapping examples from topic {split}")
     return filtered, removed
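
Note on the cross-task dedup in `src/data/dataset.py`: the body of `_text_fingerprint` is not shown in this diff, so the following is only a minimal sketch of what its docstring describes (a hash of the lowercased, whitespace-collapsed prefix). The choice of SHA-256, the order of normalizing before truncating, and the un-prefixed name `text_fingerprint` are all assumptions made for illustration.

```python
import hashlib
import re


def text_fingerprint(text: str, n_chars: int = 200) -> str:
    """Hash the lowercased, whitespace-collapsed first n_chars of text."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    prefix = normalized[:n_chars]
    # SHA-256 is an assumption; any stable hash would serve the same purpose.
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


# Illustrative texts: the first pair differs only in case and whitespace.
summ_sources = ["The  Quick brown fox\njumps.", "An unrelated abstract."]
topic_texts = ["the quick brown fox jumps.", "Something else entirely."]

# Same filtering idea as remove_overlapping_examples: drop topic texts whose
# fingerprint also appears among the summarization sources.
summ_fps = {text_fingerprint(t) for t in summ_sources}
kept = [t for t in topic_texts if text_fingerprint(t) not in summ_fps]
removed = len(topic_texts) - len(kept)
```

Fingerprinting only a prefix keeps the check cheap, at the cost of treating two different documents with the same opening 200 characters as duplicates.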

src/models/decoder.py CHANGED Viewed

@@ -327,7 +327,6 @@ class TransformerDecoder(nn.Module):
             elif tgt_mask.dim() == 3:
                 tgt_mask = tgt_mask.unsqueeze(1)
         # Normalize memory_mask dtype/device and expand simple shapes
         if memory_mask is not None:
             memory_mask = memory_mask.to(dtype=torch.bool, device=x.device)
@@ -355,7 +354,15 @@ class TransformerDecoder(nn.Module):
                 # Gradient checkpointing requires the inputs to require grad
                 def create_custom_forward(module):
                     def custom_forward(*inputs):
-                        return module(*inputs, tgt_mask=tgt_mask, memory_mask=memory_mask, collect_attn=collect_attn, self_attn_position_bias=self_position_bias, cross_attn_position_bias=cross_position_bias)
                     return custom_forward
                 x, attn = cast(
@@ -450,7 +457,7 @@ class TransformerDecoder(nn.Module):
     ) -> torch.Tensor:
         """
         Greedy decoding with KV caching for O(N) complexity.
         Args:
             length_penalty: Values > 1.0 encourage shorter sequences by boosting EOS probability
                            as sequence length increases. Default 1.0 (no penalty).

             elif tgt_mask.dim() == 3:
                 tgt_mask = tgt_mask.unsqueeze(1)
         # Normalize memory_mask dtype/device and expand simple shapes
         if memory_mask is not None:
             memory_mask = memory_mask.to(dtype=torch.bool, device=x.device)
                 # Gradient checkpointing requires the inputs to require grad
                 def create_custom_forward(module):
                     def custom_forward(*inputs):
+                        return module(
+                            *inputs,
+                            tgt_mask=tgt_mask,
+                            memory_mask=memory_mask,
+                            collect_attn=collect_attn,
+                            self_attn_position_bias=self_position_bias,
+                            cross_attn_position_bias=cross_position_bias,
+                        )
                     return custom_forward
                 x, attn = cast(
     ) -> torch.Tensor:
         """
         Greedy decoding with KV caching for O(N) complexity.
         Args:
             length_penalty: Values > 1.0 encourage shorter sequences by boosting EOS probability
                            as sequence length increases. Default 1.0 (no penalty).

src/models/encoder.py CHANGED Viewed

@@ -291,7 +291,13 @@ class TransformerEncoder(nn.Module):
                 # We use a closure to pass keyword arguments
                 def create_custom_forward(module):
                     def custom_forward(*inputs):
-                        return module(*inputs, mask=mask, collect_attn=collect_attn, position_bias=position_bias)
                     return custom_forward
                 x, attn = cast(
@@ -303,8 +309,10 @@ class TransformerEncoder(nn.Module):
                     ),
                 )
             else:
-                x, attn = layer(x, mask=mask, collect_attn=collect_attn, position_bias=position_bias)
             if collect_attn:
                 attn_weights_per_layer.append(attn)

                 # We use a closure to pass keyword arguments
                 def create_custom_forward(module):
                     def custom_forward(*inputs):
+                        return module(
+                            *inputs,
+                            mask=mask,
+                            collect_attn=collect_attn,
+                            position_bias=position_bias,
+                        )
                     return custom_forward
                 x, attn = cast(
                     ),
                 )
             else:
+                x, attn = layer(
+                    x, mask=mask, collect_attn=collect_attn, position_bias=position_bias
+                )
             if collect_attn:
                 attn_weights_per_layer.append(attn)

src/models/factory.py CHANGED Viewed

@@ -208,7 +208,9 @@ def _load_pretrained_weights(
     if hasattr(encoder, "relative_position_bias") and encoder.relative_position_bias is not None:
         print("Transferring encoder relative position bias...")
         t5_enc_rel_bias = (
-            cast(Any, t5_encoder.block[0]).layer[0].SelfAttention.relative_attention_bias.weight.data
         )
         encoder.relative_position_bias.relative_attention_bias.weight.data.copy_(t5_enc_rel_bias)
@@ -285,7 +287,9 @@ def _load_pretrained_weights(
     ):
         print("Transferring decoder self-attention relative position bias...")
         t5_dec_self_rel_bias = (
-            cast(Any, t5_decoder.block[0]).layer[0].SelfAttention.relative_attention_bias.weight.data
         )
         decoder.self_relative_position_bias.relative_attention_bias.weight.data.copy_(
             t5_dec_self_rel_bias
@@ -298,7 +302,9 @@ def _load_pretrained_weights(
         print("Transferring decoder cross-attention relative position bias...")
         # Cross-attention relative position bias is in EncDecAttention of first block
         t5_dec_cross_rel_bias = (
-            cast(Any, t5_decoder.block[0]).layer[1].EncDecAttention.relative_attention_bias.weight.data
         )
         decoder.cross_relative_position_bias.relative_attention_bias.weight.data.copy_(
             t5_dec_cross_rel_bias
@@ -554,9 +560,9 @@ def build_multitask_model(
     model.add_head(
         "emotion",
         ClassificationHead(
-            d_model=cfg.d_model,
-            num_labels=num_emotions,
-            pooler="attention",
             dropout=cfg.dropout,
             hidden_dim=cfg.d_model // 2,  # 384-dim hidden layer
         ),

     if hasattr(encoder, "relative_position_bias") and encoder.relative_position_bias is not None:
         print("Transferring encoder relative position bias...")
         t5_enc_rel_bias = (
+            cast(Any, t5_encoder.block[0])
+            .layer[0]
+            .SelfAttention.relative_attention_bias.weight.data
         )
         encoder.relative_position_bias.relative_attention_bias.weight.data.copy_(t5_enc_rel_bias)
     ):
         print("Transferring decoder self-attention relative position bias...")
         t5_dec_self_rel_bias = (
+            cast(Any, t5_decoder.block[0])
+            .layer[0]
+            .SelfAttention.relative_attention_bias.weight.data
         )
         decoder.self_relative_position_bias.relative_attention_bias.weight.data.copy_(
             t5_dec_self_rel_bias
         print("Transferring decoder cross-attention relative position bias...")
         # Cross-attention relative position bias is in EncDecAttention of first block
         t5_dec_cross_rel_bias = (
+            cast(Any, t5_decoder.block[0])
+            .layer[1]
+            .EncDecAttention.relative_attention_bias.weight.data
         )
         decoder.cross_relative_position_bias.relative_attention_bias.weight.data.copy_(
             t5_dec_cross_rel_bias
     model.add_head(
         "emotion",
         ClassificationHead(
+            d_model=cfg.d_model,
+            num_labels=num_emotions,
+            pooler="attention",
             dropout=cfg.dropout,
             hidden_dim=cfg.d_model // 2,  # 384-dim hidden layer
         ),

src/models/heads.py CHANGED Viewed

@@ -66,13 +66,15 @@ class ClassificationHead(nn.Module):
         hidden_dim: Optional[int] = None,
     ):
         super().__init__()
-        assert pooler in ("mean", "cls", "max", "attention"), "pooler must be 'mean'|'cls'|'max'|'attention'"
         self.pooler = pooler
         self.dropout = nn.Dropout(dropout)
         if pooler == "attention":
             self.attn_pool = AttentionPooling(d_model)
         # Optional 2-layer MLP for more capacity (useful for multi-label)
         if hidden_dim is not None:
             self.out_proj = nn.Sequential(

         hidden_dim: Optional[int] = None,
     ):
         super().__init__()
+        assert pooler in ("mean", "cls", "max", "attention"), (
+            "pooler must be 'mean'|'cls'|'max'|'attention'"
+        )
         self.pooler = pooler
         self.dropout = nn.Dropout(dropout)
         if pooler == "attention":
             self.attn_pool = AttentionPooling(d_model)
         # Optional 2-layer MLP for more capacity (useful for multi-label)
         if hidden_dim is not None:
             self.out_proj = nn.Sequential(

src/training/metrics.py CHANGED Viewed

@@ -72,33 +72,33 @@ def calculate_bertscore(
 ) -> Dict[str, float]:
     """
     Calculate BERTScore for semantic similarity between predictions and references.
     BERTScore measures semantic similarity using contextual embeddings, making it
     more robust than n-gram based metrics like ROUGE for paraphrased content.
     Args:
         predictions: Generated summaries/descriptions
         references: Reference summaries/descriptions
         model_type: BERT model to use (default: deberta-xlarge-mnli for best quality)
         batch_size: Batch size for encoding
         device: Device to use (auto-detected if None)
     Returns:
         Dict with 'precision', 'recall', 'f1' BERTScore averages
     """
     if not predictions or not references:
         return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
     try:
         from bert_score import score as bert_score  # type: ignore[import-not-found]
     except ImportError:
         print("Warning: bert-score not installed. Run: pip install bert-score")
         return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
     # Auto-detect device
     if device is None:
         device = "cuda" if torch.cuda.is_available() else "cpu"
     # Calculate BERTScore
     P, R, F1 = bert_score(
         list(predictions),
@@ -108,7 +108,7 @@ def calculate_bertscore(
         device=device,
         verbose=False,
     )
     return {
         "precision": float(P.mean().item()),  # type: ignore[union-attr]
         "recall": float(R.mean().item()),  # type: ignore[union-attr]
@@ -122,35 +122,35 @@ def calculate_rouge(
 ) -> Dict[str, float]:
     """
     Calculate proper ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L).
     Args:
         predictions: Generated summaries
         references: Reference summaries
     Returns:
         Dict with rouge1, rouge2, rougeL F1 scores
     """
     if not predictions or not references:
         return {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
     try:
         from rouge_score import rouge_scorer
     except ImportError:
         print("Warning: rouge-score not installed. Run: pip install rouge-score")
         return {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
-    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
     rouge1_scores = []
     rouge2_scores = []
     rougeL_scores = []
     for pred, ref in zip(predictions, references, strict=False):
         scores = scorer.score(ref, pred)
-        rouge1_scores.append(scores['rouge1'].fmeasure)
-        rouge2_scores.append(scores['rouge2'].fmeasure)
-        rougeL_scores.append(scores['rougeL'].fmeasure)
     return {
         "rouge1": sum(rouge1_scores) / len(rouge1_scores),
         "rouge2": sum(rouge2_scores) / len(rouge2_scores),
@@ -166,37 +166,35 @@ def calculate_all_summarization_metrics(
 ) -> Dict[str, float]:
     """
     Calculate comprehensive summarization metrics for research paper reporting.
     Includes:
     - ROUGE-1, ROUGE-2, ROUGE-L (lexical overlap)
     - BLEU-4 (n-gram precision)
     - BERTScore (semantic similarity)
     Args:
         predictions: Generated summaries/descriptions
         references: Reference summaries/descriptions
         include_bertscore: Whether to compute BERTScore (slower but valuable)
         bertscore_model: Model for BERTScore computation
     Returns:
         Dict with all metric scores
     """
     metrics: Dict[str, float] = {}
     # ROUGE scores
     rouge_scores = calculate_rouge(predictions, references)
     metrics.update({f"rouge_{k}": v for k, v in rouge_scores.items()})
     # BLEU score
     metrics["bleu4"] = calculate_bleu(predictions, references)
     # BERTScore (semantic similarity - important for back-cover style descriptions)
     if include_bertscore:
-        bert_scores = calculate_bertscore(
-            predictions, references, model_type=bertscore_model
-        )
         metrics.update({f"bertscore_{k}": v for k, v in bert_scores.items()})
     return metrics
@@ -246,22 +244,22 @@ def get_confusion_matrix(
 def multilabel_macro_f1(predictions: torch.Tensor, targets: torch.Tensor) -> float:
     """Compute macro F1: average F1 per class (as in GoEmotions paper).
-    This averages F1 across labels, giving equal weight to each emotion class
     regardless of prevalence. Directly comparable to GoEmotions baselines.
     """
     preds = predictions.float()
     gold = targets.float()
     # Per-class TP, FP, FN
     tp = (preds * gold).sum(dim=0)
     fp = (preds * (1 - gold)).sum(dim=0)
     fn = ((1 - preds) * gold).sum(dim=0)
     precision = tp / (tp + fp).clamp(min=1e-8)
     recall = tp / (tp + fn).clamp(min=1e-8)
     f1 = (2 * precision * recall) / (precision + recall).clamp(min=1e-8)
     # Zero out F1 for classes with no support in either predictions or targets
     mask = (tp + fp + fn) > 0
     if mask.sum() == 0:
@@ -271,16 +269,16 @@ def multilabel_macro_f1(predictions: torch.Tensor, targets: torch.Tensor) -> flo
 def multilabel_micro_f1(predictions: torch.Tensor, targets: torch.Tensor) -> float:
     """Compute micro F1: aggregate TP/FP/FN across all classes.
     This gives more weight to frequent classes. Useful when class distribution matters.
     """
     preds = predictions.float()
     gold = targets.float()
     tp = (preds * gold).sum()
     fp = (preds * (1 - gold)).sum()
     fn = ((1 - preds) * gold).sum()
     precision = tp / (tp + fp).clamp(min=1e-8)
     recall = tp / (tp + fn).clamp(min=1e-8)
     f1 = (2 * precision * recall) / (precision + recall).clamp(min=1e-8)
@@ -293,17 +291,17 @@ def multilabel_per_class_metrics(
     class_names: Sequence[str] | None = None,
 ) -> Dict[str, Dict[str, float]]:
     """Compute per-class precision, recall, F1 for multi-label classification.
     Returns a dict mapping class name/index to its metrics.
     """
     preds = predictions.float()
     gold = targets.float()
     num_classes = preds.shape[1]
     tp = (preds * gold).sum(dim=0)
     fp = (preds * (1 - gold)).sum(dim=0)
     fn = ((1 - preds) * gold).sum(dim=0)
     report: Dict[str, Dict[str, float]] = {}
     for i in range(num_classes):
         name = class_names[i] if class_names else str(i)
@@ -325,26 +323,26 @@ def tune_per_class_thresholds(
     thresholds: Sequence[float] | None = None,
 ) -> tuple[List[float], float]:
     """Tune per-class thresholds on validation set to maximize macro F1.
-    For each class, tries multiple thresholds and selects the one that
-    maximizes that class's F1 score. This is standard practice for multi-label
     classification (used in the original GoEmotions paper).
     Args:
         logits: Raw model logits (batch, num_classes)
         targets: Binary target labels (batch, num_classes)
         thresholds: Candidate thresholds to try (default: 0.1 to 0.9 by 0.05)
     Returns:
         Tuple of (best_thresholds_per_class, resulting_macro_f1)
     """
     if thresholds is None:
         thresholds = [round(t, 2) for t in np.arange(0.1, 0.9, 0.05).tolist()]
     probs = torch.sigmoid(logits)
     num_classes = probs.shape[1]
     gold = targets.float()
     best_thresholds: List[float] = []
     for c in range(num_classes):
         best_f1 = -1.0
@@ -364,13 +362,13 @@ def tune_per_class_thresholds(
                 best_f1 = f1
                 best_t = t
         best_thresholds.append(best_t)
     # Compute resulting macro F1 with tuned thresholds
     tuned_preds = torch.zeros_like(probs)
     for c in range(num_classes):
         tuned_preds[:, c] = (probs[:, c] >= best_thresholds[c]).float()
     macro_f1 = multilabel_macro_f1(tuned_preds, targets)
     return best_thresholds, macro_f1
@@ -384,30 +382,30 @@ def bootstrap_confidence_interval(
     seed: int = 42,
 ) -> tuple[float, float, float]:
     """Compute bootstrap confidence interval for a metric.
     Args:
         scores: Per-sample metric values
         n_bootstrap: Number of bootstrap resamples
         confidence: Confidence level (default 95%)
         seed: Random seed for reproducibility
     Returns:
         Tuple of (mean, lower_bound, upper_bound)
     """
     rng = np.random.default_rng(seed)
     scores_arr = np.array(scores)
     n = len(scores_arr)
     bootstrap_means = []
     for _ in range(n_bootstrap):
         sample = rng.choice(scores_arr, size=n, replace=True)
         bootstrap_means.append(float(np.mean(sample)))
     bootstrap_means.sort()
     alpha = 1 - confidence
     lower_idx = int(alpha / 2 * n_bootstrap)
     upper_idx = int((1 - alpha / 2) * n_bootstrap)
     return (
         float(np.mean(scores_arr)),
         bootstrap_means[lower_idx],
@@ -422,15 +420,15 @@ def paired_bootstrap_test(
     seed: int = 42,
 ) -> float:
     """Paired bootstrap significance test between two systems.
     Tests if system B is significantly better than system A.
     Args:
         scores_a: Per-sample scores from system A
         scores_b: Per-sample scores from system B
         n_bootstrap: Number of bootstrap iterations
         seed: Random seed
     Returns:
         p-value (probability that B is not better than A)
     """
@@ -438,14 +436,14 @@ def paired_bootstrap_test(
     a = np.array(scores_a)
     b = np.array(scores_b)
     assert len(a) == len(b), "Both score lists must have the same length"
     n = len(a)
     count = 0
     for _ in range(n_bootstrap):
         idx = rng.choice(n, size=n, replace=True)
         diff = float(np.mean(b[idx]) - np.mean(a[idx]))
         if diff <= 0:
             count += 1
     return count / n_bootstrap
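
Note on the bootstrap helpers in `src/training/metrics.py`: the sketch below mirrors the percentile interval and the paired test shown above, but uses the standard-library `random` module instead of NumPy so it runs without extra dependencies. Resampled values will therefore differ from the repository's `np.random.default_rng` results. The names `bootstrap_ci` and `paired_bootstrap_p`, the index clamp, and the sample scores are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    n_bootstrap: int = 1000,
    confidence: float = 0.95,
    seed: int = 42,
) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for the mean of per-sample scores."""
    rng = random.Random(seed)
    n = len(scores)
    means: List[float] = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_bootstrap)
    )
    alpha = 1 - confidence
    lower_idx = int(alpha / 2 * n_bootstrap)
    # Clamp (an addition in this sketch) so confidence=1.0 stays in range.
    upper_idx = min(int((1 - alpha / 2) * n_bootstrap), n_bootstrap - 1)
    return sum(scores) / n, means[lower_idx], means[upper_idx]


def paired_bootstrap_p(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_bootstrap: int = 1000,
    seed: int = 42,
) -> float:
    """Fraction of resamples in which B's mean does not exceed A's."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both score lists must have the same length")
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_bootstrap):
        # Draw one set of indices and apply it to both systems (pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_b[i] - scores_a[i] for i in idx) / n
        if diff <= 0:
            not_better += 1
    return not_better / n_bootstrap


# Made-up per-sample scores: system B beats system A on every sample.
a = [0.50, 0.55, 0.60, 0.52, 0.58, 0.61, 0.49, 0.57]
b = [s + 0.05 for s in a]
mean_b, low_b, high_b = bootstrap_ci(b)
p_value = paired_bootstrap_p(a, b)
```

The pairing matters: reusing the same resampled indices for both systems removes per-sample difficulty from the comparison, so the test responds to the difference between systems rather than to variation across samples.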

 ) -> Dict[str, float]:
     """
     Calculate BERTScore for semantic similarity between predictions and references.
     BERTScore measures semantic similarity using contextual embeddings, making it
     more robust than n-gram based metrics like ROUGE for paraphrased content.
     Args:
         predictions: Generated summaries/descriptions
         references: Reference summaries/descriptions
         model_type: BERT model to use (default: deberta-xlarge-mnli for best quality)
         batch_size: Batch size for encoding
         device: Device to use (auto-detected if None)
     Returns:
         Dict with 'precision', 'recall', 'f1' BERTScore averages
     """
     if not predictions or not references:
         return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
     try:
         from bert_score import score as bert_score  # type: ignore[import-not-found]
     except ImportError:
         print("Warning: bert-score not installed. Run: pip install bert-score")
         return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
     # Auto-detect device
     if device is None:
         device = "cuda" if torch.cuda.is_available() else "cpu"
     # Calculate BERTScore
     P, R, F1 = bert_score(
         list(predictions),
         device=device,
         verbose=False,
     )
     return {
         "precision": float(P.mean().item()),  # type: ignore[union-attr]
         "recall": float(R.mean().item()),  # type: ignore[union-attr]
 ) -> Dict[str, float]:
     """
     Calculate proper ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L).
     Args:
         predictions: Generated summaries
         references: Reference summaries
     Returns:
         Dict with rouge1, rouge2, rougeL F1 scores
     """
     if not predictions or not references:
         return {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
     try:
         from rouge_score import rouge_scorer
     except ImportError:
         print("Warning: rouge-score not installed. Run: pip install rouge-score")
         return {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
+    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
     rouge1_scores = []
     rouge2_scores = []
     rougeL_scores = []
     for pred, ref in zip(predictions, references, strict=False):
         scores = scorer.score(ref, pred)
+        rouge1_scores.append(scores["rouge1"].fmeasure)
+        rouge2_scores.append(scores["rouge2"].fmeasure)
+        rougeL_scores.append(scores["rougeL"].fmeasure)
     return {
         "rouge1": sum(rouge1_scores) / len(rouge1_scores),
         "rouge2": sum(rouge2_scores) / len(rouge2_scores),
 ) -> Dict[str, float]:
     """
     Calculate comprehensive summarization metrics for research paper reporting.
     Includes:
     - ROUGE-1, ROUGE-2, ROUGE-L (lexical overlap)
     - BLEU-4 (n-gram precision)
     - BERTScore (semantic similarity)
     Args:
         predictions: Generated summaries/descriptions
         references: Reference summaries/descriptions
         include_bertscore: Whether to compute BERTScore (slower but valuable)
         bertscore_model: Model for BERTScore computation
     Returns:
         Dict with all metric scores
     """
     metrics: Dict[str, float] = {}
     # ROUGE scores
     rouge_scores = calculate_rouge(predictions, references)
     metrics.update({f"rouge_{k}": v for k, v in rouge_scores.items()})
     # BLEU score
     metrics["bleu4"] = calculate_bleu(predictions, references)
     # BERTScore (semantic similarity - important for back-cover style descriptions)
     if include_bertscore:
+        bert_scores = calculate_bertscore(predictions, references, model_type=bertscore_model)
         metrics.update({f"bertscore_{k}": v for k, v in bert_scores.items()})
     return metrics
 def multilabel_macro_f1(predictions: torch.Tensor, targets: torch.Tensor) -> float:
     """Compute macro F1: average F1 per class (as in GoEmotions paper).
+    This averages F1 across labels, giving equal weight to each emotion class
     regardless of prevalence. Directly comparable to GoEmotions baselines.
     """
     preds = predictions.float()
     gold = targets.float()
     # Per-class TP, FP, FN
     tp = (preds * gold).sum(dim=0)
     fp = (preds * (1 - gold)).sum(dim=0)
     fn = ((1 - preds) * gold).sum(dim=0)
     precision = tp / (tp + fp).clamp(min=1e-8)
     recall = tp / (tp + fn).clamp(min=1e-8)
     f1 = (2 * precision * recall) / (precision + recall).clamp(min=1e-8)
     # Zero out F1 for classes with no support in either predictions or targets
     mask = (tp + fp + fn) > 0
     if mask.sum() == 0:
 def multilabel_micro_f1(predictions: torch.Tensor, targets: torch.Tensor) -> float:
     """Compute micro F1: aggregate TP/FP/FN across all classes.
     This gives more weight to frequent classes. Useful when class distribution matters.
     """
     preds = predictions.float()
     gold = targets.float()
     tp = (preds * gold).sum()
     fp = (preds * (1 - gold)).sum()
     fn = ((1 - preds) * gold).sum()
     precision = tp / (tp + fp).clamp(min=1e-8)
     recall = tp / (tp + fn).clamp(min=1e-8)
     f1 = (2 * precision * recall) / (precision + recall).clamp(min=1e-8)
     class_names: Sequence[str] | None = None,
 ) -> Dict[str, Dict[str, float]]:
     """Compute per-class precision, recall, F1 for multi-label classification.
     Returns a dict mapping class name/index to its metrics.
     """
     preds = predictions.float()
     gold = targets.float()
     num_classes = preds.shape[1]
     tp = (preds * gold).sum(dim=0)
     fp = (preds * (1 - gold)).sum(dim=0)
     fn = ((1 - preds) * gold).sum(dim=0)
     report: Dict[str, Dict[str, float]] = {}
     for i in range(num_classes):
         name = class_names[i] if class_names else str(i)
     thresholds: Sequence[float] | None = None,
 ) -> tuple[List[float], float]:
     """Tune per-class thresholds on validation set to maximize macro F1.
+    For each class, tries multiple thresholds and selects the one that
+    maximizes that class's F1 score. This is standard practice for multi-label
     classification (used in the original GoEmotions paper).
     Args:
         logits: Raw model logits (batch, num_classes)
         targets: Binary target labels (batch, num_classes)
         thresholds: Candidate thresholds to try (default: 0.10 to 0.85 by 0.05; np.arange excludes the 0.9 stop)
     Returns:
         Tuple of (best_thresholds_per_class, resulting_macro_f1)
     """
     if thresholds is None:
         thresholds = [round(t, 2) for t in np.arange(0.1, 0.9, 0.05).tolist()]
     probs = torch.sigmoid(logits)
     num_classes = probs.shape[1]
     gold = targets.float()
     best_thresholds: List[float] = []
     for c in range(num_classes):
         best_f1 = -1.0
                 best_f1 = f1
                 best_t = t
         best_thresholds.append(best_t)
     # Compute resulting macro F1 with tuned thresholds
     tuned_preds = torch.zeros_like(probs)
     for c in range(num_classes):
         tuned_preds[:, c] = (probs[:, c] >= best_thresholds[c]).float()
     macro_f1 = multilabel_macro_f1(tuned_preds, targets)
     return best_thresholds, macro_f1
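The inner threshold loop is elided in this hunk, so here is a minimal, dependency-free sketch of the per-class search the docstring describes. It is an illustration under stated assumptions, not the repository's code: `f1_binary` and `tune_thresholds` are hypothetical names, plain lists stand in for tensors, and the macro average is taken over all classes (the project's `multilabel_macro_f1` additionally masks classes with no support).

```python
from typing import List, Sequence, Tuple


def f1_binary(preds: Sequence[int], gold: Sequence[int]) -> float:
    """F1 for a single class from 0/1 predictions and 0/1 gold labels."""
    tp = sum(1 for p, g in zip(preds, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(preds, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(preds, gold) if p == 0 and g == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def tune_thresholds(
    probs: Sequence[Sequence[float]],
    gold: Sequence[Sequence[int]],
    candidates: Sequence[float],
) -> Tuple[List[float], float]:
    """Pick, for each class, the candidate threshold with the highest F1."""
    num_classes = len(probs[0])
    best_thresholds: List[float] = []
    best_f1s: List[float] = []
    for c in range(num_classes):
        col_probs = [row[c] for row in probs]
        col_gold = [row[c] for row in gold]
        best_t, best_f1 = candidates[0], -1.0
        for t in candidates:
            preds = [1 if p >= t else 0 for p in col_probs]
            f1 = f1_binary(preds, col_gold)
            if f1 > best_f1:  # strict ">" keeps the first (lowest) best threshold
                best_t, best_f1 = t, f1
        best_thresholds.append(best_t)
        best_f1s.append(best_f1)
    # Assumption: unmasked macro average over every class.
    return best_thresholds, sum(best_f1s) / num_classes


# Toy validation set: 4 samples, 2 classes (values are made up).
toy_probs = [[0.9, 0.2], [0.6, 0.4], [0.3, 0.35], [0.1, 0.8]]
toy_gold = [[1, 0], [1, 1], [0, 1], [0, 1]]
thresholds, macro_f1 = tune_thresholds(toy_probs, toy_gold, [0.25, 0.5, 0.75])
print(thresholds, macro_f1)
```

Because thresholds are chosen on the same data they are scored on, the resulting macro F1 is optimistic; tuning on validation data and reporting on a held-out test split avoids that.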
     seed: int = 42,
 ) -> tuple[float, float, float]:
     """Compute bootstrap confidence interval for a metric.
     Args:
         scores: Per-sample metric values
         n_bootstrap: Number of bootstrap resamples
         confidence: Confidence level (default 95%)
         seed: Random seed for reproducibility
     Returns:
         Tuple of (mean, lower_bound, upper_bound)
     """
     rng = np.random.default_rng(seed)
     scores_arr = np.array(scores)
     n = len(scores_arr)
     bootstrap_means = []
     for _ in range(n_bootstrap):
         sample = rng.choice(scores_arr, size=n, replace=True)
         bootstrap_means.append(float(np.mean(sample)))
     bootstrap_means.sort()
     alpha = 1 - confidence
     lower_idx = int(alpha / 2 * n_bootstrap)
     upper_idx = int((1 - alpha / 2) * n_bootstrap)
     return (
         float(np.mean(scores_arr)),
         bootstrap_means[lower_idx],
     seed: int = 42,
 ) -> float:
     """Paired bootstrap significance test between two systems.
     Tests if system B is significantly better than system A.
     Args:
         scores_a: Per-sample scores from system A
         scores_b: Per-sample scores from system B
         n_bootstrap: Number of bootstrap iterations
         seed: Random seed
     Returns:
         p-value (probability that B is not better than A)
     """
     a = np.array(scores_a)
     b = np.array(scores_b)
     assert len(a) == len(b), "Both score lists must have the same length"
     n = len(a)
     count = 0
     for _ in range(n_bootstrap):
         idx = rng.choice(n, size=n, replace=True)
         diff = float(np.mean(b[idx]) - np.mean(a[idx]))
         if diff <= 0:
             count += 1
     return count / n_bootstrap
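The paired bootstrap test above can be sketched without NumPy as follows. This is a hedged illustration rather than the project's implementation: the hunk does not show where `rng` is created, so the sketch assumes a seeded generator and substitutes the standard library's `random.Random`; the name `paired_bootstrap_p_value` is hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap_p_value(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_bootstrap: int = 1000,
    seed: int = 42,
) -> float:
    """Fraction of resamples in which system B fails to beat system A."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both score lists must have the same length")
    if not scores_a:
        raise ValueError("Score lists must not be empty")
    rng = random.Random(seed)  # assumption: stdlib RNG instead of NumPy
    n = len(scores_a)
    count = 0
    for _ in range(n_bootstrap):
        # Resample sample indices with replacement; the SAME indices are used
        # for both systems, which is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_b[i] - scores_a[i] for i in idx) / n
        if diff <= 0:
            count += 1
    return count / n_bootstrap


# Made-up per-sample scores: B beats A on every sample.
a = [0.50, 0.40, 0.60, 0.55]
b = [0.60, 0.50, 0.70, 0.65]
print(paired_bootstrap_p_value(a, b))
```

A small return value means B's advantage survives resampling. Since `diff <= 0` counts ties against B, two identical systems yield a p-value of 1.0.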

src/training/trainer.py CHANGED

@@ -48,24 +48,24 @@ class TrainerConfig:
     validation_max_length: int = 128
     label_smoothing: float = 0.1
     gradient_accumulation_steps: int = 1
     # LR scheduler
     scheduler_type: str = "cosine"
     warmup_steps: int = 500
     # Early stopping
     early_stopping_patience: int | None = 5
     # Task sampling strategy: "round_robin" or "temperature"
     # Temperature sampling: p_i ∝ n_i^alpha where n_i = dataset size
     # alpha < 1 reduces dominance of large tasks (recommended: 0.5-0.7)
     task_sampling: str = "temperature"
     task_sampling_alpha: float = 0.5
     # Gradient conflict diagnostics
     # Compute inter-task gradient cosine similarity every N steps (0 = disabled)
     gradient_conflict_frequency: int = 0
     # MLflow
     experiment_name: str = "LexiMind"
     run_name: str | None = None
@@ -76,13 +76,13 @@ class TrainerConfig:
 class EarlyStopping:
     """Stop training when validation loss stops improving."""
     def __init__(self, patience: int = 5, min_delta: float = 0.001):
         self.patience = patience
         self.min_delta = min_delta
         self.counter = 0
-        self.best_value = float('inf')
     def __call__(self, val_loss: float) -> bool:
         """Returns True if training should stop."""
         if val_loss < self.best_value - self.min_delta:
@@ -155,7 +155,9 @@ class Trainer:
             pbar = tqdm(
                 range(start_epoch, self.config.max_epochs + 1),
-                desc="Training", unit="epoch", file=sys.stderr
             )
             for epoch in pbar:
@@ -178,10 +180,12 @@ class Trainer:
                     # Early stopping
                     if self.early_stopping:
-                        val_loss = val_metrics.get("total_loss", float('inf'))
                         if self.early_stopping(val_loss):
-                            tqdm.write(f"\nEarly stopping at epoch {epoch} (best loss: {self.early_stopping.best_value:.4f})")
                             break
                 # Checkpoint
@@ -190,11 +194,11 @@ class Trainer:
                 # Update progress
                 epoch_time = time.perf_counter() - epoch_start
-                loss = train_metrics.get('total_loss', 0)
                 pbar.set_postfix({"loss": f"{loss:.3f}", "time": f"{epoch_time:.0f}s"})
         total_time = time.perf_counter() - total_start
-        print(f"\nTraining complete in {total_time/60:.1f} minutes")
         return history
     def _setup_scheduler(self, loaders: Dict[str, DataLoader], start_epoch: int) -> None:
@@ -203,7 +207,9 @@ class Trainer:
             self.scheduler = None
             return
-        steps_per_epoch = max(len(loader) for loader in loaders.values()) // max(1, self.config.gradient_accumulation_steps)
         total_steps = steps_per_epoch * (self.config.max_epochs - start_epoch + 1)
         warmup = self.config.warmup_steps
@@ -238,10 +244,12 @@ class Trainer:
         if self.config.task_sampling == "temperature" and len(task_names) > 1:
             sizes = np.array([len(loaders[t].dataset) for t in task_names], dtype=np.float64)  # type: ignore[arg-type]
             alpha = self.config.task_sampling_alpha
-            probs = sizes ** alpha
             probs = probs / probs.sum()
-            tqdm.write(f"  Temperature sampling (α={alpha}): " +
-                       ", ".join(f"{t}={p:.2%}" for t, p in zip(task_names, probs, strict=True)))
         else:
             probs = None
@@ -253,7 +261,9 @@ class Trainer:
                 # Select tasks for this step
                 if probs is not None and train:
                     # Temperature sampling: sample tasks based on dataset size
-                    selected_tasks = list(np.random.choice(task_names, size=len(task_names), replace=True, p=probs))
                 else:
                     # Round-robin: all tasks every step
                     selected_tasks = task_names
@@ -288,8 +298,11 @@ class Trainer:
                         scaled.backward()
                 # Gradient conflict diagnostics
-                if (train and self.config.gradient_conflict_frequency > 0
-                        and (step + 1) % self.config.gradient_conflict_frequency == 0):
                     conflict_stats = self._compute_gradient_conflicts(loaders, iterators)
                     for k, v in conflict_stats.items():
                         metrics[f"grad_{k}"].append(v)
@@ -316,8 +329,10 @@ class Trainer:
         # Average metrics
         averaged = {k: sum(v) / len(v) for k, v in metrics.items() if v}
-        tqdm.write(f"[{phase.lower()}] epoch {epoch}: " +
-                   ", ".join(f"{k}={v:.4f}" for k, v in averaged.items() if k != "epoch"))
         return averaged
     def _get_batch(self, iterators: Dict, loader: DataLoader, task: str) -> Dict | None:
@@ -330,8 +345,10 @@ class Trainer:
                 batch = next(iterators[task])
             except StopIteration:
                 return None
-        return {k: v.to(self.device, non_blocking=True) if isinstance(v, torch.Tensor) else v
-                for k, v in batch.items()}
     def _forward_task(self, task: str, batch: Dict) -> tuple[torch.Tensor, Dict[str, float]]:
         """Route to task-specific forward pass."""
@@ -360,10 +377,10 @@ class Trainer:
         # Decode predictions and references
         preds = self.tokenizer.decode_batch(logits.argmax(dim=-1).tolist())
         refs = self._decode_labels(batch["labels"])
         # Calculate comprehensive metrics
         metrics = {"rouge_like": rouge_like(preds, refs)}
         # Proper ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L)
         try:
             rouge_scores = calculate_rouge(preds, refs)
@@ -372,13 +389,13 @@ class Trainer:
             metrics["rougeL"] = rouge_scores["rougeL"]
         except Exception:
             pass  # Fall back to rouge_like only if rouge-score not installed
         # BLEU-4 score
         try:
             metrics["bleu4"] = calculate_bleu(preds, refs)
         except Exception:
             pass
         return loss, metrics
     def _forward_emotion(self, batch: Dict) -> tuple[torch.Tensor, Dict[str, float]]:
@@ -423,8 +440,10 @@ class Trainer:
                 if i >= n:
                     break
-                batch = {k: v.to(self.device) if isinstance(v, torch.Tensor) else v
-                         for k, v in batch.items()}
                 src_ids = batch["src_ids"][:1]
                 src_mask = batch.get("src_mask", None)
                 if src_mask is not None:
@@ -432,7 +451,9 @@ class Trainer:
                 # Generate with anti-repetition
                 model: Any = self.model
-                enc_mask = src_mask.unsqueeze(1) & src_mask.unsqueeze(2) if src_mask is not None else None
                 memory = model.encoder(src_ids, mask=enc_mask)
                 generated = model.decoder.greedy_decode(
                     memory=memory,
@@ -463,27 +484,27 @@ class Trainer:
         iterators: Dict,
     ) -> Dict[str, float]:
         """Compute inter-task gradient cosine similarity to diagnose conflicts.
         Returns cosine similarity between gradient vectors for each task pair.
         Negative values indicate conflicting gradients (negative transfer risk).
         """
         task_grads: Dict[str, torch.Tensor] = {}
         for task, loader in loaders.items():
             self.optimizer.zero_grad()
             batch = self._get_batch(iterators, loader, task)
             if batch is None:
                 continue
             dtype = torch.bfloat16 if self.use_bfloat16 else torch.float16
             with torch.autocast("cuda", dtype=dtype, enabled=self.use_amp):
                 loss, _ = self._forward_task(task, batch)
             if torch.isnan(loss):
                 continue
             loss.backward()
             # Flatten all gradients into a single vector
             grad_vec = []
             for p in self.model.parameters():
@@ -491,9 +512,9 @@ class Trainer:
                     grad_vec.append(p.grad.detach().clone().flatten())
             if grad_vec:
                 task_grads[task] = torch.cat(grad_vec)
         self.optimizer.zero_grad()
         # Compute pairwise cosine similarity
         stats: Dict[str, float] = {}
         tasks = list(task_grads.keys())
@@ -504,20 +525,22 @@ class Trainer:
                 cos_sim = F.cosine_similarity(g1.unsqueeze(0), g2.unsqueeze(0)).item()
                 stats[f"cos_sim_{t1}_{t2}"] = cos_sim
                 stats[f"conflict_{t1}_{t2}"] = 1.0 if cos_sim < 0 else 0.0
         return stats
     def _log_config(self) -> None:
         """Log config to MLflow."""
-        mlflow.log_params({
-            "max_epochs": self.config.max_epochs,
-            "gradient_clip_norm": self.config.gradient_clip_norm,
-            "label_smoothing": self.config.label_smoothing,
-            "task_weights": str(self.config.task_weights),
-            "warmup_steps": self.config.warmup_steps,
-            "scheduler_type": self.config.scheduler_type,
-            "learning_rate": self.optimizer.param_groups[0]["lr"],
-        })
     def _log_metrics(self, metrics: Dict[str, float], prefix: str, epoch: int) -> None:
         """Log metrics to MLflow."""

     validation_max_length: int = 128
     label_smoothing: float = 0.1
     gradient_accumulation_steps: int = 1
     # LR scheduler
     scheduler_type: str = "cosine"
     warmup_steps: int = 500
     # Early stopping
     early_stopping_patience: int | None = 5
     # Task sampling strategy: "round_robin" or "temperature"
     # Temperature sampling: p_i ∝ n_i^alpha where n_i = dataset size
     # alpha < 1 reduces dominance of large tasks (recommended: 0.5-0.7)
     task_sampling: str = "temperature"
     task_sampling_alpha: float = 0.5
     # Gradient conflict diagnostics
     # Compute inter-task gradient cosine similarity every N steps (0 = disabled)
     gradient_conflict_frequency: int = 0
     # MLflow
     experiment_name: str = "LexiMind"
     run_name: str | None = None
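The temperature sampling rule in the config comments (p_i ∝ n_i^alpha) can be worked through with a short sketch. The dataset sizes below are invented for illustration and are not taken from the project; `temperature_probs` and `sample_tasks` are hypothetical helper names, and the standard library's `random` stands in for `np.random.choice`.

```python
import random
from typing import Dict, List


def temperature_probs(sizes: Dict[str, int], alpha: float) -> Dict[str, float]:
    """Return p_i = n_i**alpha / sum_j n_j**alpha for each task."""
    weights = {task: n**alpha for task, n in sizes.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}


def sample_tasks(probs: Dict[str, float], k: int, rng: random.Random) -> List[str]:
    """Draw k task names with replacement according to probs."""
    names = list(probs)
    return rng.choices(names, weights=[probs[t] for t in names], k=k)


# Hypothetical dataset sizes, chosen so the square roots are round numbers.
sizes = {"summarization": 10_000, "emotion": 2_500, "topic": 100}

for alpha in (1.0, 0.5, 0.0):
    probs = temperature_probs(sizes, alpha)
    print(alpha, {t: round(p, 4) for t, p in probs.items()})

print(sample_tasks(temperature_probs(sizes, 0.5), k=3, rng=random.Random(0)))
```

With `alpha = 1.0` the probabilities are proportional to dataset size, with `alpha = 0.0` they are uniform, and `alpha = 0.5` sits in between: the smallest task's share rises from under 1% to 6.25% in this toy setup.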
 class EarlyStopping:
     """Stop training when validation loss stops improving."""
     def __init__(self, patience: int = 5, min_delta: float = 0.001):
         self.patience = patience
         self.min_delta = min_delta
         self.counter = 0
+        self.best_value = float("inf")
     def __call__(self, val_loss: float) -> bool:
         """Returns True if training should stop."""
         if val_loss < self.best_value - self.min_delta:
             pbar = tqdm(
                 range(start_epoch, self.config.max_epochs + 1),
+                desc="Training",
+                unit="epoch",
+                file=sys.stderr,
             )
             for epoch in pbar:
                     # Early stopping
                     if self.early_stopping:
+                        val_loss = val_metrics.get("total_loss", float("inf"))
                         if self.early_stopping(val_loss):
+                            tqdm.write(
+                                f"\nEarly stopping at epoch {epoch} (best loss: {self.early_stopping.best_value:.4f})"
+                            )
                             break
                 # Checkpoint
                 # Update progress
                 epoch_time = time.perf_counter() - epoch_start
+                loss = train_metrics.get("total_loss", 0)
                 pbar.set_postfix({"loss": f"{loss:.3f}", "time": f"{epoch_time:.0f}s"})
         total_time = time.perf_counter() - total_start
+        print(f"\nTraining complete in {total_time / 60:.1f} minutes")
         return history
     def _setup_scheduler(self, loaders: Dict[str, DataLoader], start_epoch: int) -> None:
             self.scheduler = None
             return
+        steps_per_epoch = max(len(loader) for loader in loaders.values()) // max(
+            1, self.config.gradient_accumulation_steps
+        )
         total_steps = steps_per_epoch * (self.config.max_epochs - start_epoch + 1)
         warmup = self.config.warmup_steps
         if self.config.task_sampling == "temperature" and len(task_names) > 1:
             sizes = np.array([len(loaders[t].dataset) for t in task_names], dtype=np.float64)  # type: ignore[arg-type]
             alpha = self.config.task_sampling_alpha
+            probs = sizes**alpha
             probs = probs / probs.sum()
+            tqdm.write(
+                f"  Temperature sampling (α={alpha}): "
+                + ", ".join(f"{t}={p:.2%}" for t, p in zip(task_names, probs, strict=True))
+            )
         else:
             probs = None
                 # Select tasks for this step
                 if probs is not None and train:
                     # Temperature sampling: sample tasks based on dataset size
+                    selected_tasks = list(
+                        np.random.choice(task_names, size=len(task_names), replace=True, p=probs)
+                    )
                 else:
                     # Round-robin: all tasks every step
                     selected_tasks = task_names
                         scaled.backward()
                 # Gradient conflict diagnostics
+                if (
+                    train
+                    and self.config.gradient_conflict_frequency > 0
+                    and (step + 1) % self.config.gradient_conflict_frequency == 0
+                ):
                     conflict_stats = self._compute_gradient_conflicts(loaders, iterators)
                     for k, v in conflict_stats.items():
                         metrics[f"grad_{k}"].append(v)
         # Average metrics
         averaged = {k: sum(v) / len(v) for k, v in metrics.items() if v}
+        tqdm.write(
+            f"[{phase.lower()}] epoch {epoch}: "
+            + ", ".join(f"{k}={v:.4f}" for k, v in averaged.items() if k != "epoch")
+        )
         return averaged
     def _get_batch(self, iterators: Dict, loader: DataLoader, task: str) -> Dict | None:
                 batch = next(iterators[task])
             except StopIteration:
                 return None
+        return {
+            k: v.to(self.device, non_blocking=True) if isinstance(v, torch.Tensor) else v
+            for k, v in batch.items()
+        }
     def _forward_task(self, task: str, batch: Dict) -> tuple[torch.Tensor, Dict[str, float]]:
         """Route to task-specific forward pass."""
         # Decode predictions and references
         preds = self.tokenizer.decode_batch(logits.argmax(dim=-1).tolist())
         refs = self._decode_labels(batch["labels"])
         # Calculate comprehensive metrics
         metrics = {"rouge_like": rouge_like(preds, refs)}
         # Proper ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L)
         try:
             rouge_scores = calculate_rouge(preds, refs)
             metrics["rougeL"] = rouge_scores["rougeL"]
         except Exception:
             pass  # Fall back to rouge_like only if rouge-score not installed
         # BLEU-4 score
         try:
             metrics["bleu4"] = calculate_bleu(preds, refs)
         except Exception:
             pass
         return loss, metrics
     def _forward_emotion(self, batch: Dict) -> tuple[torch.Tensor, Dict[str, float]]:
                 if i >= n:
                     break
+                batch = {
+                    k: v.to(self.device) if isinstance(v, torch.Tensor) else v
+                    for k, v in batch.items()
+                }
                 src_ids = batch["src_ids"][:1]
                 src_mask = batch.get("src_mask", None)
                 if src_mask is not None:
                 # Generate with anti-repetition
                 model: Any = self.model
+                enc_mask = (
+                    src_mask.unsqueeze(1) & src_mask.unsqueeze(2) if src_mask is not None else None
+                )
                 memory = model.encoder(src_ids, mask=enc_mask)
                 generated = model.decoder.greedy_decode(
                     memory=memory,
         iterators: Dict,
     ) -> Dict[str, float]:
         """Compute inter-task gradient cosine similarity to diagnose conflicts.
         Returns cosine similarity between gradient vectors for each task pair.
         Negative values indicate conflicting gradients (negative transfer risk).
         """
         task_grads: Dict[str, torch.Tensor] = {}
         for task, loader in loaders.items():
             self.optimizer.zero_grad()
             batch = self._get_batch(iterators, loader, task)
             if batch is None:
                 continue
             dtype = torch.bfloat16 if self.use_bfloat16 else torch.float16
             with torch.autocast("cuda", dtype=dtype, enabled=self.use_amp):
                 loss, _ = self._forward_task(task, batch)
             if torch.isnan(loss):
                 continue
             loss.backward()
             # Flatten all gradients into a single vector
             grad_vec = []
             for p in self.model.parameters():
                     grad_vec.append(p.grad.detach().clone().flatten())
             if grad_vec:
                 task_grads[task] = torch.cat(grad_vec)
         self.optimizer.zero_grad()
         # Compute pairwise cosine similarity
         stats: Dict[str, float] = {}
         tasks = list(task_grads.keys())
                 cos_sim = F.cosine_similarity(g1.unsqueeze(0), g2.unsqueeze(0)).item()
                 stats[f"cos_sim_{t1}_{t2}"] = cos_sim
                 stats[f"conflict_{t1}_{t2}"] = 1.0 if cos_sim < 0 else 0.0
         return stats
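To make the conflict statistic concrete, here is a small sketch of the pairwise cosine-similarity step on plain Python lists. The pair-enumeration loop is not visible in this hunk, so the use of `itertools.combinations` is an assumption, as are the helper names and the toy gradient vectors; the real method works on flattened `torch` gradient tensors via `F.cosine_similarity`.

```python
import math
from itertools import combinations
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gradient_conflicts(task_grads: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Cosine similarity and a 0/1 conflict flag for every pair of tasks."""
    stats: Dict[str, float] = {}
    for t1, t2 in combinations(task_grads, 2):  # assumption: unordered pairs
        cos_sim = cosine_similarity(task_grads[t1], task_grads[t2])
        stats[f"cos_sim_{t1}_{t2}"] = cos_sim
        stats[f"conflict_{t1}_{t2}"] = 1.0 if cos_sim < 0 else 0.0
    return stats


# Toy 2-D "gradients": two tasks pull in opposite directions, one is orthogonal.
toy_grads = {
    "summarization": [1.0, 0.0],
    "emotion": [-1.0, 0.0],
    "topic": [0.0, 1.0],
}
for key, value in gradient_conflicts(toy_grads).items():
    print(key, value)
```

In this toy case only the summarization/emotion pair is flagged, since their gradients point in exactly opposite directions; orthogonal gradients have similarity 0 and are not counted as conflicting.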
     def _log_config(self) -> None:
         """Log config to MLflow."""
+        mlflow.log_params(
+            {
+                "max_epochs": self.config.max_epochs,
+                "gradient_clip_norm": self.config.gradient_clip_norm,
+                "label_smoothing": self.config.label_smoothing,
+                "task_weights": str(self.config.task_weights),
+                "warmup_steps": self.config.warmup_steps,
+                "scheduler_type": self.config.scheduler_type,
+                "learning_rate": self.optimizer.param_groups[0]["lr"],
+            }
+        )
     def _log_metrics(self, metrics: Dict[str, float], prefix: str, epoch: int) -> None:
         """Log metrics to MLflow."""

src/utils/__init__.py CHANGED

@@ -14,9 +14,16 @@ from .io import load_state, save_state
 from .labels import load_label_metadata, save_label_metadata
 __all__ = [
-    "save_checkpoint", "load_checkpoint",
-    "save_state", "load_state",
-    "LabelMetadata", "load_labels", "save_labels",
-    "load_label_metadata", "save_label_metadata",
-    "set_seed", "Config", "load_yaml",
 ]

 from .labels import load_label_metadata, save_label_metadata
 __all__ = [
+    "save_checkpoint",
+    "load_checkpoint",
+    "save_state",
+    "load_state",
+    "LabelMetadata",
+    "load_labels",
+    "save_labels",
+    "load_label_metadata",
+    "save_label_metadata",
+    "set_seed",
+    "Config",
+    "load_yaml",
 ]

src/utils/core.py CHANGED

@@ -28,7 +28,7 @@ def save_checkpoint(model: torch.nn.Module, path: str | Path) -> None:
     """Save model state dict, handling torch.compile artifacts."""
     path = Path(path)
     path.parent.mkdir(parents=True, exist_ok=True)
     # Strip '_orig_mod.' prefix from compiled models
     state_dict = {k.replace("_orig_mod.", ""): v for k, v in model.state_dict().items()}
     torch.save(state_dict, path)
@@ -47,7 +47,7 @@ def load_checkpoint(model: torch.nn.Module, path: str | Path) -> None:
 @dataclass
 class LabelMetadata:
     """Container for emotion and topic label vocabularies."""
     emotion: List[str]
     topic: List[str]
@@ -65,16 +65,16 @@ def load_labels(path: str | Path) -> LabelMetadata:
     path = Path(path)
     if not path.exists():
         raise FileNotFoundError(f"Labels not found: {path}")
     with path.open("r", encoding="utf-8") as f:
         data = json.load(f)
     emotion = data.get("emotion") or data.get("emotions", [])
     topic = data.get("topic") or data.get("topics", [])
     if not emotion or not topic:
         raise ValueError("Labels file must contain 'emotion' and 'topic' lists")
     return LabelMetadata(emotion=emotion, topic=topic)
@@ -82,7 +82,7 @@ def save_labels(labels: LabelMetadata, path: str | Path) -> None:
     """Save label metadata to JSON file."""
     path = Path(path)
     path.parent.mkdir(parents=True, exist_ok=True)
     with path.open("w", encoding="utf-8") as f:
         json.dump({"emotion": labels.emotion, "topic": labels.topic}, f, indent=2)
@@ -105,12 +105,14 @@ def set_seed(seed: int) -> None:
 @dataclass
 class Config:
     """Simple config wrapper."""
     data: dict
 def load_yaml(path: str | Path) -> Config:
     """Load YAML configuration file."""
     import yaml
     with Path(path).open("r", encoding="utf-8") as f:
         content = yaml.safe_load(f)
     if not isinstance(content, dict):

     """Save model state dict, handling torch.compile artifacts."""
     path = Path(path)
     path.parent.mkdir(parents=True, exist_ok=True)
     # Strip '_orig_mod.' prefix from compiled models
     state_dict = {k.replace("_orig_mod.", ""): v for k, v in model.state_dict().items()}
     torch.save(state_dict, path)
 @dataclass
 class LabelMetadata:
     """Container for emotion and topic label vocabularies."""
     emotion: List[str]
     topic: List[str]
     path = Path(path)
     if not path.exists():
         raise FileNotFoundError(f"Labels not found: {path}")
     with path.open("r", encoding="utf-8") as f:
         data = json.load(f)
     emotion = data.get("emotion") or data.get("emotions", [])
     topic = data.get("topic") or data.get("topics", [])
     if not emotion or not topic:
         raise ValueError("Labels file must contain 'emotion' and 'topic' lists")
     return LabelMetadata(emotion=emotion, topic=topic)
     """Save label metadata to JSON file."""
     path = Path(path)
     path.parent.mkdir(parents=True, exist_ok=True)
     with path.open("w", encoding="utf-8") as f:
         json.dump({"emotion": labels.emotion, "topic": labels.topic}, f, indent=2)
 @dataclass
 class Config:
     """Simple config wrapper."""
     data: dict
 def load_yaml(path: str | Path) -> Config:
     """Load YAML configuration file."""
     import yaml
     with Path(path).open("r", encoding="utf-8") as f:
         content = yaml.safe_load(f)
     if not isinstance(content, dict):

tests/test_training/test_trainer.py CHANGED

@@ -111,8 +111,9 @@ class TestGradientFlow(unittest.TestCase):
         loss = nn.CrossEntropyLoss()(logits, batch["labels"])
         loss.backward()
-        has_grads = any(p.grad is not None and p.grad.abs().sum() > 0
-                        for p in self.model.parameters())
         self.assertTrue(has_grads, "No gradients found")
     def test_emotion_gradients(self):
@@ -130,8 +131,9 @@ class TestGradientFlow(unittest.TestCase):
         loss = nn.BCEWithLogitsLoss()(logits, batch["labels"])
         loss.backward()
-        has_grads = any(p.grad is not None and p.grad.abs().sum() > 0
-                        for p in self.model.parameters())
         self.assertTrue(has_grads, "No gradients found")
     def test_summarization_gradients(self):
@@ -145,14 +147,12 @@ class TestGradientFlow(unittest.TestCase):
         self.model.zero_grad()
         logits = self.model.forward("summarization", batch)
         # Flatten for cross entropy: (B*T, vocab) vs (B*T,)
-        loss = nn.CrossEntropyLoss()(
-            logits.view(-1, 100),
-            batch["labels"].view(-1)
-        )
         loss.backward()
-        has_grads = any(p.grad is not None and p.grad.abs().sum() > 0
-                        for p in self.model.parameters())
         self.assertTrue(has_grads, "No gradients found")

         loss = nn.CrossEntropyLoss()(logits, batch["labels"])
         loss.backward()
+        has_grads = any(
+            p.grad is not None and p.grad.abs().sum() > 0 for p in self.model.parameters()
+        )
         self.assertTrue(has_grads, "No gradients found")
     def test_emotion_gradients(self):
         loss = nn.BCEWithLogitsLoss()(logits, batch["labels"])
         loss.backward()
+        has_grads = any(
+            p.grad is not None and p.grad.abs().sum() > 0 for p in self.model.parameters()
+        )
         self.assertTrue(has_grads, "No gradients found")
     def test_summarization_gradients(self):
         self.model.zero_grad()
         logits = self.model.forward("summarization", batch)
         # Flatten for cross entropy: (B*T, vocab) vs (B*T,)
+        loss = nn.CrossEntropyLoss()(logits.view(-1, 100), batch["labels"].view(-1))
         loss.backward()
+        has_grads = any(
+            p.grad is not None and p.grad.abs().sum() > 0 for p in self.model.parameters()
+        )
         self.assertTrue(has_grads, "No gradients found")