# NTF Tutorial Quality Assurance Report

## Executive Summary

This report provides a comprehensive end-to-end quality assurance review of all 14 tutorial markdown files (Tutorials 00 through 13) in the NTF (Nexuss Transformer Framework) documentation. The review focuses on:

1. **Architecture Alignment**: Ensuring all tutorials correctly use NTF-native components rather than generic HuggingFace patterns
2. **Completeness**: Identifying missing NTF components that should be documented
3. **Practical Examples**: Verifying code examples correctly implement NTF architecture
4. **Learning Progression**: Ensuring continuous flow from beginner to advanced without explicit labeling
5. **Professional Tone**: Removing speculative hardware estimates and AI jargon

**Overall Assessment**: The tutorials require significant refactoring to align with NTF architecture. Many examples use generic HuggingFace/DeepSpeed patterns instead of NTF's native components like `FullFinetuneTrainer`, `ModelRegistry`, `LayerFreezer`, and `PEFTTrainer`.

---

## Architecture Overview (Reference for Review)

### Core NTF Components Identified:

**Training Components (`finetuning/`):**
- `FullFinetuneTrainer` - Main training orchestrator with accelerator support
- `LoRATrainer` / `PEFTTrainer` - Parameter-efficient fine-tuning implementations
- `LayerFreezer` - Strategic layer freezing utilities
- Training configurations via `configs.py`

**Model Management (`models/`):**
- `ModelRegistry` - Model loading, registration, and versioning
- Adapter loading utilities for LoRA/PEFT
- Custom model head implementations

**Data Pipeline (`training/data.py`):**
- `TextDataset` - Standardized dataset class
- Data collators and preprocessing utilities
- Chat template integration

**Reward & RLHF (`reward/`):**
- `RewardModel` - Reward model implementation
- Preference dataset handling
- RLHF pipeline utilities

**Utilities (`utils/`):**
- `metrics.py` - Evaluation metrics (perplexity, accuracy, etc.)
- `versioning.py` - Model versioning utilities
- `continual_learning.py` - Continual learning wrappers
- Logging and checkpointing utilities

**Configuration (`config/`):**
- YAML-based configuration system
- Nested configuration classes for models, training, data, PEFT

---

## Tutorial-by-Tutorial Analysis

### Tutorial 00: Introduction to Fine-Tuning

**File**: `Tutorials/Tutorial_00_Introduction_to_Fine_Tuning.md`

#### Issues Identified:

1. **❌ File Reference Mismatch**
   - Table of Contents references `Tutorial_01_Setting_Up_Your_Environment.md` but actual file is `Tutorial_01_Environment_Setup.md`
   - Similar mismatches throughout (e.g., `Tutorial_03_Full_Parameter_Fine_Tuning.md` vs `Tutorial_03_Full_Fine_Tuning.md`)

2. **❌ Speculative Hardware Estimates**
   ```markdown
   - Small Models (7B): 40-80GB VRAM
   - Medium Models (13B-70B): 80GB+ VRAM
   ```
   These are ungrounded estimates that vary based on sequence length, batch size, precision, and optimization techniques.

3. **❌ Missing NTF Component Overview**
   - No mention of `ModelRegistry`, `FullFinetuneTrainer`, `LayerFreezer`
   - Introduces fine-tuning concepts without connecting to NTF's implementation

4. **⚠️ AI Jargon**
   - "Catastrophic forgetting" mentioned without practical mitigation strategies using NTF utilities

#### Recommended Fixes:

```markdown
## NTF Architecture Overview

Before diving into fine-tuning, understand the core components you'll use:

- **ModelRegistry**: Central hub for loading, configuring, and versioning models
- **FullFinetuneTrainer**: Production-ready training orchestrator with distributed support
- **LayerFreezer**: Selectively freeze backbone layers to reduce memory and prevent catastrophic forgetting
- **PEFTTrainer**: Parameter-efficient fine-tuning with LoRA, AdaLoRA, and LoHa adapters
- **TextDataset**: Unified data loading with chat template support

These components work together to provide a streamlined fine-tuning experience...
```

**Priority**: 🔴 HIGH - Foundation tutorial sets expectations for all subsequent tutorials

---

### Tutorial 01: Environment Setup

**File**: `Tutorials/Tutorial_01_Environment_Setup.md`

#### Issues Identified:

1. **✅ Good Alignment**: Correctly uses `ntf` package installation
2. **⚠️ Missing Integration**: Doesn't show how to verify NTF components are working
3. **⚠️ Hardware Requirements Section**: Contains speculative VRAM estimates

#### Recommended Fixes:

Add verification step:

```python
import ntf  # needed for ntf.__version__ below
from ntf.models import ModelRegistry
from ntf.finetuning import FullFinetuneTrainer
from ntf.config import NTFConfig

# Verify installation
print(f"NTF Version: {ntf.__version__}")
print("Core components imported successfully!")
```

Remove or qualify hardware estimates with: "Actual requirements vary based on sequence length, batch size, and precision settings."

**Priority**: 🟡 MEDIUM - Generally sound but needs NTF component verification

---

### Tutorial 02: Working with Datasets

**File**: `Tutorials/Tutorial_02_Working_with_Datasets.md`

#### Issues Identified:

1. **❌ Custom Dataset Implementation Conflicts with NTF Utilities**
   - Tutorial creates custom `CustomDataset` class from scratch
   - NTF already provides `TextDataset` in `training/data.py` with built-in chat template support

2. **⚠️ Missing Chat Template Integration**
   - NTF's `TextDataset` supports chat templates but tutorial doesn't demonstrate this

3. 
**✅ Good Points**: Covers data cleaning, formatting, and train/test split

#### Recommended Fixes:

Replace custom dataset with NTF's implementation:

```python
from ntf.training.data import TextDataset
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Use NTF's built-in dataset
dataset = TextDataset(
    data_path="formatted_data.json",
    tokenizer=tokenizer,
    max_length=512,
    use_chat_template=True,  # Built-in support
    column_mapping={
        "instruction": "instruction",
        "input": "context",
        "output": "response"
    }
)

# Access preprocessed data
train_data = dataset.get_train_dataset()
eval_data = dataset.get_eval_dataset()
```

Add section on custom data collators if needed:

```python
from ntf.training.data import create_data_collator

collator = create_data_collator(
    tokenizer=tokenizer,
    padding=True,
    max_length=512
)
```

**Priority**: 🔴 HIGH - Reduces code duplication and teaches users NTF-native patterns

---

### Tutorial 03: Full Parameter Fine-Tuning

**File**: `Tutorials/Tutorial_03_Full_Fine_Tuning.md`

#### Issues Identified:

1. **❌ Complete Architecture Misalignment**
   - Uses raw HuggingFace `Trainer` instead of NTF's `FullFinetuneTrainer`
   - Manual training loop doesn't leverage NTF's accelerator support
   - Missing gradient checkpointing, mixed precision, and distributed training hooks

2. **❌ DeepSpeed Configuration Not Integrated**
   - Shows DeepSpeed config but doesn't connect to NTF's configuration system
   - NTF has `configs.py` with nested configuration classes

3. **❌ Missing ModelRegistry Usage**
   - Loads model directly with `AutoModelForCausalLM`
   - Should use `ModelRegistry` for consistent model loading and adapter support

4. 
**❌ No Layer Freezing Demonstration**
   - Full fine-tuning can benefit from selective layer freezing
   - `LayerFreezer` component completely absent

#### Recommended Complete Rewrite:

```python
from ntf.config import NTFConfig, ModelConfig, TrainingConfig
from ntf.models import ModelRegistry
from ntf.finetuning import FullFinetuneTrainer, LayerFreezer
from ntf.training.data import TextDataset

# 1. Configuration-driven setup
config = NTFConfig(
    model=ModelConfig(
        name="meta-llama/Llama-2-7b-hf",
        trust_remote_code=True,
        torch_dtype="bfloat16"
    ),
    training=TrainingConfig(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        weight_decay=0.01,
        logging_steps=10,
        save_strategy="epoch",
        evaluation_strategy="epoch",
        fp16=False,
        bf16=True,
        gradient_checkpointing=True,
        dataloader_num_workers=4
    )
)

# 2. Use ModelRegistry for model loading
registry = ModelRegistry(config.model)
model, tokenizer = registry.load_model_and_tokenizer()

# Optional: Freeze backbone layers to reduce memory
freezer = LayerFreezer(model)
freezer.freeze_backbone(num_layers_to_keep=-1)  # Keep all trainable, or specify number

# 3. Prepare dataset with NTF utilities
dataset = TextDataset(
    data_path="formatted_data.json",
    tokenizer=tokenizer,
    max_length=512,
    use_chat_template=True
)

# 4. Initialize NTF's FullFinetuneTrainer
trainer = FullFinetuneTrainer(
    model=model,
    config=config.training,
    train_dataset=dataset.get_train_dataset(),
    eval_dataset=dataset.get_eval_dataset(),
    tokenizer=tokenizer
)

# 5. Train with built-in accelerator support
trainer.train()

# 6. Save with versioning
registry.save_model(trainer.model, output_dir="./final_model", version="1.0.0")
```

**Priority**: 🔴 CRITICAL - Core tutorial completely misaligned with NTF architecture

---

### Tutorial 04: Multi-Task Fine-Tuning

**File**: `Tutorials/Tutorial_04_Multi_Task_Fine_Tuning.md`

#### Issues Identified:

1. 
**❌ Feature Not Implemented in NTF**
   - Multi-task learning with task-specific heads not present in current NTF codebase
   - Tutorial describes capabilities that don't exist

2. **⚠️ Alternative Approach Needed**
   - Could demonstrate sequential fine-tuning with `ContinualLearning` utilities
   - Or focus on multi-domain datasets with single head

#### Recommended Refocus:

Either:

1. **Implement the feature** in NTF first, then document
2. **Refocus tutorial** on sequential domain adaptation using existing utilities:

```python
from ntf.utils.continual_learning import ContinualLearningWrapper
from ntf.finetuning import FullFinetuneTrainer

# Sequential fine-tuning on multiple domains
wrapper = ContinualLearningWrapper(model)

# Domain 1: Code generation
trainer1 = FullFinetuneTrainer(...)
trainer1.train()
wrapper.save_state("domain1_checkpoint")

# Domain 2: Math reasoning (with regularization to prevent forgetting)
wrapper.apply_ewc_regularization(lambda_ewc=0.5)
trainer2 = FullFinetuneTrainer(...)
trainer2.train()
```

**Priority**: 🔴 HIGH - Documents non-existent features; needs immediate attention

---

### Tutorial 05: Parameter-Efficient Fine-Tuning (PEFT)

**File**: `Tutorials/Tutorial_05_Parameter_Efficient_Fine_Tuning.md`

#### Issues Identified:

1. **⚠️ Partial Alignment**
   - Correctly introduces LoRA concept
   - But uses manual `LoraConfig` setup instead of NTF's `PEFTTrainer`

2. **❌ Missing NTF PEFTTrainer**
   - `finetuning/lora.py` contains `LoRATrainer` / `PEFTTrainer` class
   - Tutorial should demonstrate this unified interface

3. 
**⚠️ Adapter Loading Not Covered**
   - NTF's `models/adapters.py` has utilities for loading/saving adapters
   - Critical for production workflows

#### Recommended Fixes:

```python
from ntf.config import NTFConfig, ModelConfig, PEFTConfig, TrainingConfig
from ntf.models import ModelRegistry
from ntf.finetuning import PEFTTrainer

# Configuration-driven PEFT
config = NTFConfig(
    model=ModelConfig(name="meta-llama/Llama-2-7b-hf"),
    peft=PEFTConfig(
        method="lora",  # or "adalora", "loha"
        r=16,
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj"],
        bias="none",
        task_type="CAUSAL_LM"
    ),
    training=TrainingConfig(...)
)

# Load model with registry
registry = ModelRegistry(config.model)
model, tokenizer = registry.load_model_and_tokenizer()

# Apply PEFT adapters
adapter_config = registry.apply_peft_adapters(config.peft)

# Use PEFTTrainer with built-in adapter handling
# (train_dataset is assumed to be prepared earlier, e.g. via TextDataset)
trainer = PEFTTrainer(
    model=model,
    adapter_config=adapter_config,
    training_config=config.training,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)
trainer.train()

# Save only adapter weights (small footprint)
registry.save_adapter(adapter_config, output_dir="./lora_adapter", version="1.0.0")

# Later: Load adapter for inference
registry.load_adapter(model, adapter_path="./lora_adapter")
```

Add comparison table of PEFT methods supported by NTF:

| Method | NTF Support | Best For |
|--------|-------------|----------|
| LoRA | ✅ Full | General purpose |
| AdaLoRA | ✅ Full | Dynamic rank allocation |
| LoHa | ✅ Full | Complex tasks |
| Prefix Tuning | ⚠️ Partial | Task-specific prompts |
| P-Tuning | ❌ Not implemented | - |

**Priority**: 🟡 MEDIUM-HIGH - Good conceptual coverage but misses NTF-native implementation

---

### Tutorial 06: Reinforcement Learning from Human Feedback (RLHF)

**File**: `Tutorials/Tutorial_06_RLHF_Fine_Tuning.md`

#### Issues Identified:

1. 
**❌ Reward Model Implementation Mismatch**
   - Tutorial uses generic `AutoModelForSequenceClassification`
   - NTF has dedicated `reward/reward_model.py` with `RewardModel` class

2. **❌ Missing Preference Dataset Handling**
   - `reward/data.py` contains preference dataset utilities
   - Tutorial creates custom dataset instead

3. **⚠️ RLHF Pipeline Not Aligned**
   - NTF's `reward/` module has pipeline utilities
   - Tutorial shows manual PPO implementation

4. **❌ No Integration with Training Pipeline**
   - Should connect to `FullFinetuneTrainer` or dedicated RLHF trainer

#### Recommended Fixes:

```python
from ntf.reward import RewardModel, PreferenceDataset
from ntf.reward.trainer import RewardTrainer
from ntf.reward.rlhf_pipeline import RLHFPipeline
from ntf.models import ModelRegistry
from ntf.config import RewardConfig

# 1. Load base model
registry = ModelRegistry(model_config)
base_model, tokenizer = registry.load_model_and_tokenizer()

# 2. Initialize NTF's RewardModel
reward_config = RewardConfig(
    base_model_name="meta-llama/Llama-2-7b-hf",
    num_labels=1,
    pad_token_id=tokenizer.pad_token_id
)
reward_model = RewardModel(reward_config)
reward_model.load_base_model(base_model)

# 3. Load preference data with NTF utilities
pref_dataset = PreferenceDataset(
    data_path="preferences.jsonl",
    tokenizer=tokenizer,
    max_length=512
)

# 4. Train reward model
reward_trainer = RewardTrainer(
    model=reward_model,
    dataset=pref_dataset,
    config=reward_config
)
reward_trainer.train()

# 5. Use in RLHF pipeline
pipeline = RLHFPipeline(
    policy_model=policy_model,
    reward_model=reward_model,
    reference_model=ref_model,
    tokenizer=tokenizer
)
pipeline.run_ppo(
    prompts=prompts,
    num_iterations=100,
    kl_coeff=0.2
)
```

**Priority**: 🔴 CRITICAL - RLHF is complex; using wrong components leads to broken implementations

---

### Tutorial 07: Evaluation and Metrics

**File**: `Tutorials/Tutorial_07_Evaluation_and_Metrics.md`

#### Issues Identified:

1. 
**❌ Custom Metrics Instead of NTF Utilities**
   - Tutorial implements perplexity, accuracy manually
   - `utils/metrics.py` has these functions ready to use

2. **⚠️ Missing Comprehensive Metric Coverage**
   - NTF metrics include: perplexity, accuracy, BLEU, ROUGE, BERTScore
   - Tutorial only covers basic metrics

3. **✅ Good Points**: Explains evaluation importance and overfitting detection

#### Recommended Fixes:

```python
from ntf.utils.metrics import (
    compute_perplexity,
    compute_accuracy,
    compute_bleu,
    compute_rouge,
    compute_bertscore,
    evaluate_generation,
    compare_checkpoints
)

# Use NTF's unified evaluation
results = evaluate_generation(
    model=model,
    tokenizer=tokenizer,
    test_dataset=test_dataset,
    metrics=["perplexity", "bleu", "rouge", "bertscore"],
    device="cuda"
)

print(f"Perplexity: {results['perplexity']:.2f}")
print(f"BLEU-4: {results['bleu']:.4f}")
print(f"ROUGE-L: {results['rouge']['rougeL']:.4f}")
print(f"BERTScore F1: {results['bertscore']['f1']:.4f}")

# Compare multiple checkpoints
comparison = compare_checkpoints(
    model_paths=["checkpoint1", "checkpoint2", "checkpoint3"],
    eval_dataset=val_dataset,
    metrics=["perplexity", "accuracy"]
)
```

Add guidance on metric selection:

| Task Type | Recommended Metrics |
|-----------|---------------------|
| Text Generation | Perplexity, BLEU, ROUGE, BERTScore |
| Classification | Accuracy, F1, Precision, Recall |
| Summarization | ROUGE, BERTScore |
| Translation | BLEU, chrF, COMET |
| Question Answering | Exact Match, F1 |

**Priority**: 🟡 MEDIUM - Reduces code duplication and ensures consistent evaluation

---

### Tutorial 08: Hyperparameter Tuning

**File**: `Tutorials/Tutorial_08_Hyperparameter_Tuning.md`

#### Issues Identified:

1. **✅ Good Conceptual Alignment**: Covers grid search, random search, Bayesian optimization

2. **⚠️ Missing NTF Configuration Integration**
   - Should demonstrate tuning with NTF's `NTFConfig` system
   - Could integrate with config validation utilities

3. 
**⚠️ No Early Stopping Demonstration**
   - NTF's training configs support early stopping
   - Tutorial mentions it but doesn't show NTF implementation

#### Recommended Enhancements:

```python
from ntf.config import NTFConfig, ModelConfig, TrainingConfig
from ntf.finetuning import FullFinetuneTrainer
from ray import tune
from ray.tune.schedulers import ASHAScheduler

# Define search space aligned with NTF config
search_space = {
    "learning_rate": tune.loguniform(1e-5, 1e-4),
    "batch_size": tune.choice([4, 8, 16]),
    "warmup_ratio": tune.uniform(0.05, 0.2),
    "weight_decay": tune.loguniform(1e-4, 1e-2)
}

def train_ntf(config):
    # Build NTF config from trial config
    ntf_config = NTFConfig(
        model=ModelConfig(...),
        training=TrainingConfig(
            learning_rate=config["learning_rate"],
            per_device_train_batch_size=config["batch_size"],
            warmup_ratio=config["warmup_ratio"],
            weight_decay=config["weight_decay"],
            evaluation_strategy="epoch",
            load_best_model_at_end=True
        )
    )

    # Run training
    trainer = FullFinetuneTrainer(config=ntf_config, ...)
    result = trainer.train()

    return {"eval_loss": result.metrics["eval_loss"]}

# Run hyperparameter search
scheduler = ASHAScheduler(metric="eval_loss", mode="min")
analysis = tune.run(
    train_ntf,
    config=search_space,
    num_samples=20,
    scheduler=scheduler,
    resources_per_trial={"gpu": 1}
)

# Get best config
best_config = analysis.get_best_config("eval_loss", "min")
print(f"Best config: {best_config}")
```

**Priority**: 🟡 MEDIUM - Good content but could better integrate with NTF config system

---

### Tutorial 09: Model Versioning and Checkpointing

**File**: `Tutorials/Tutorial_09_Model_Versioning_and_Checkpointing.md`

#### Issues Identified:

1. **❌ Manual Versioning Instead of ModelRegistry**
   - Tutorial shows manual directory management with timestamps
   - NTF has `ModelRegistry` class with built-in versioning in `utils/versioning.py`

2. **❌ Missing Semantic Versioning**
   - NTF supports semantic versioning (major.minor.patch)
   - Tutorial uses ad-hoc naming

3. 
**⚠️ No Metadata Tracking**
   - `ModelRegistry` tracks training config, metrics, timestamp
   - Tutorial doesn't cover metadata

#### Recommended Fixes:

```python
from ntf.models import ModelRegistry
from ntf.config import ModelConfig

# Initialize registry with versioning enabled
registry = ModelRegistry(
    model_config=ModelConfig(name="meta-llama/Llama-2-7b-hf"),
    registry_path="./model_registry",
    enable_versioning=True
)

# After training, save with automatic versioning
registry.save_model(
    model=trained_model,
    tokenizer=tokenizer,
    version="1.0.0",  # Semantic versioning
    metadata={
        "training_config": config.to_dict(),
        "metrics": {"eval_loss": 0.234, "perplexity": 12.5},
        "dataset": "custom_instructions_v1",
        "peft_method": "lora",
        "notes": "Initial fine-tuning run"
    }
)

# List all versions
versions = registry.list_versions()
print(f"Available versions: {versions}")

# Load specific version
model_v1, tokenizer = registry.load_model_and_tokenizer(version="1.0.0")

# Compare versions
comparison = registry.compare_versions(["1.0.0", "1.1.0"], metrics=["eval_loss"])

# Rollback to previous version if needed
registry.rollback("1.0.0")
```

Add versioning best practices:

- Use semantic versioning: MAJOR.MINOR.PATCH
- Include training config in metadata
- Tag production-ready models
- Maintain changelog in metadata

**Priority**: 🔴 HIGH - Core functionality exists in NTF but tutorial teaches inferior manual approach

---

### Tutorial 10: Distributed Training

**File**: `Tutorials/Tutorial_10_Distributed_Training.md`

#### Issues Identified:

1. **⚠️ Feature Partially Implemented**
   - NTF's `FullFinetuneTrainer` uses Accelerate for distributed training
   - But no dedicated multi-GPU/multi-node orchestration layer visible

2. **❌ DeepSpeed Integration Unclear**
   - Tutorial shows DeepSpeed but connection to NTF config system not demonstrated
   - `configs.py` may have DeepSpeed config but not shown in tutorials

3. 
**⚠️ Missing Practical Examples**
   - No launch scripts for multi-node training
   - No troubleshooting guide for common distributed issues

#### Recommended Clarifications:

If distributed training is supported via Accelerate:

```python
from ntf.config import NTFConfig, ModelConfig, TrainingConfig
from ntf.finetuning import FullFinetuneTrainer

# NTF automatically handles distributed training via Accelerate
config = NTFConfig(
    model=ModelConfig(...),
    training=TrainingConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        # Accelerate auto-detects distributed setup
        fp16=False,
        bf16=True,
        gradient_checkpointing=True
    )
)

# Trainer automatically uses all available GPUs
trainer = FullFinetuneTrainer(config=config, ...)
trainer.train()  # Distributed training handled internally
```

Add a disclaimer if full distributed training (multi-node) is not yet implemented:

> **Note**: NTF currently supports multi-GPU training on a single node via Accelerate. Multi-node distributed training is planned for future releases. For large-scale training, consider using external orchestration tools.

**Priority**: 🟡 MEDIUM - Needs clarification on current capabilities vs. roadmap

---

### Tutorial 11: Quantization and Optimization

**File**: `Tutorials/Tutorial_11_Quantization_and_Optimization.md`

#### Issues Identified:

1. **✅ External Tools Appropriately Used**: bitsandbytes, GPTQ, AWQ are external libraries

2. **⚠️ Missing NTF Integration Points**
   - How does quantization connect to `ModelRegistry`?
   - Should NTF config support quantization parameters?

3. **⚠️ Serving Optimization Not Connected**
   - vLLM, TGI mentioned but no NTF serving utilities shown
   - Does NTF have a serving module?
#### Recommended Enhancements:

```python
from ntf.config import ModelConfig, QuantizationConfig
from ntf.models import ModelRegistry

# Quantization config integrated with NTF
quant_config = QuantizationConfig(
    method="bitsandbytes",  # or "gptq", "awq"
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True
)

model_config = ModelConfig(
    name="meta-llama/Llama-2-7b-hf",
    quantization=quant_config
)

# Registry handles quantized model loading
registry = ModelRegistry(model_config)
model, tokenizer = registry.load_model_and_tokenizer()
# Model automatically loaded in quantized format
```

Clarify serving story:

- If NTF has a serving module: demonstrate it
- If not: clearly state these are external tools and provide integration examples

**Priority**: 🟡 MEDIUM - External tools are appropriate but integration points unclear

---

### Tutorial 12: Production Deployment

**File**: `Tutorials/Tutorial_12_Production_Deployment.md`

#### Issues Identified:

1. **❌ MLflow Registry Conflicts with NTF ModelRegistry**
   - Tutorial uses MLflow for model registry
   - NTF has its own `ModelRegistry` class
   - Creates confusion about which to use

2. **⚠️ Missing NTF Deployment Utilities**
   - Does NTF have deployment helpers?
   - Should demonstrate integration with serving tools

3. **⚠️ Monitoring Not Connected to NTF**
   - NTF's metrics utilities could feed monitoring systems
   - No demonstration of this integration

#### Recommended Fixes:

Option A - Integrate MLflow with NTF ModelRegistry:

```python
from ntf.models import ModelRegistry
import mlflow

# Use NTF for local versioning, MLflow for enterprise registry
registry = ModelRegistry(...)

# Save to NTF registry first
registry.save_model(model, version="1.0.0", metadata={...})

# Then log to MLflow for enterprise tracking
with mlflow.start_run():
    # Assumes get_model_path returns a local directory of saved model files.
    # mlflow.pytorch.log_model expects a model object, not a path, so the
    # saved directory is uploaded as run artifacts instead.
    model_path = registry.get_model_path("1.0.0")
    mlflow.log_artifacts(model_path, artifact_path="model")

    # Log NTF metadata to MLflow
    metadata = registry.get_metadata("1.0.0")
    for key, value in metadata.items():
        mlflow.log_param(key, value)
```

Option B - Replace MLflow with NTF ModelRegistry:

```python
from ntf.models import ModelRegistry

# NTF ModelRegistry as primary registry
registry = ModelRegistry(registry_path="./production_registry")

# Deploy directly from NTF registry
model, tokenizer = registry.load_model_and_tokenizer(version="1.0.0")

# Export for serving
registry.export_for_serving(
    version="1.0.0",
    format="onnx",  # or "torchscript"
    output_path="./serving_model"
)
```

**Priority**: 🔴 HIGH - Conflicting registry systems create confusion

---

### Tutorial 13: Debugging and Troubleshooting

**File**: `Tutorials/Tutorial_13_Debugging_and_Troubleshooting.md`

#### Issues Identified:

1. **✅ Good Universal Content**: OOM, NaN losses, slow training covered well

2. **⚠️ Missing NTF-Specific Debugging**
   - How to debug `FullFinetuneTrainer` issues?
   - NTF logging utilities not demonstrated
   - Config validation tools not shown

3. 
**⚠️ No Common NTF Error Patterns**
   - ModelRegistry loading failures
   - PEFT adapter mismatch errors
   - Dataset preprocessing issues with NTF utilities

#### Recommended Enhancements:

Add NTF-specific debugging section:

```python
# Enable verbose logging in NTF
from ntf.config import NTFConfig, ModelConfig, TrainingConfig, validate_config
from ntf.training.data import TextDataset
from ntf.utils.logging import setup_logging

setup_logging(level="DEBUG")

config = NTFConfig(
    model=ModelConfig(...),
    training=TrainingConfig(
        logging_level="DEBUG",
        log_on_each_node=True
    )
)

# Validate config before training
errors = validate_config(config)
if errors:
    print("Configuration errors:")
    for error in errors:
        print(f"  - {error}")

# Debug dataset preprocessing
dataset = TextDataset(...)

# Inspect processed samples
for i in range(5):
    sample = dataset[i]
    print(f"Sample {i}:")
    print(f"  Input shape: {sample['input_ids'].shape}")
    print(f"  Attention mask sum: {sample['attention_mask'].sum()}")
```

Add common NTF error patterns:

| Error | Cause | Solution |
|-------|-------|----------|
| `ModelRegistryError: Version not found` | Version doesn't exist in registry | Use `list_versions()` to check available versions |
| `PEFT adapter dimension mismatch` | Adapter trained on different model | Ensure same base model and adapter config |
| `TextDataset column mapping error` | Column names don't match | Verify `column_mapping` parameter |

**Priority**: 🟡 MEDIUM - Good general content but needs NTF-specific additions

---

## Missing NTF Components That Should Be Documented

### High Priority (Core Functionality)

1. **LayerFreezer (`finetuning/freeze.py`)**
   - **Purpose**: Selectively freeze model layers to reduce memory and prevent catastrophic forgetting
   - **Use Cases**:
     - Fine-tuning large models with limited VRAM
     - Domain adaptation while preserving general knowledge
     - Progressive unfreezing strategies
   - **Tutorial Placement**: Tutorial 03 (Full Fine-Tuning) or dedicated advanced tutorial

2. 
**ModelRegistry (`models/registry.py` / `utils/versioning.py`)**
   - **Purpose**: Centralized model loading, versioning, and metadata tracking
   - **Use Cases**:
     - Reproducible experiments with versioned models
     - A/B testing different model versions
     - Production deployment with rollback capability
   - **Tutorial Placement**: Tutorial 09 (currently teaches manual approach)

3. **PEFTTrainer (`finetuning/lora.py`)**
   - **Purpose**: Unified interface for all PEFT methods (LoRA, AdaLoRA, LoHa)
   - **Use Cases**:
     - Resource-constrained fine-tuning
     - Multiple adapter management
     - Adapter composition and merging
   - **Tutorial Placement**: Tutorial 05 (currently uses manual LoRA setup)

4. **RLHF Pipeline (`reward/`)**
   - **Purpose**: End-to-end RLHF workflow with reward modeling and PPO
   - **Use Cases**:
     - Aligning models with human preferences
     - Building conversational AI with feedback
     - Safety and helpfulness tuning
   - **Tutorial Placement**: Tutorial 06 (currently uses generic implementation)

### Medium Priority (Enhanced Functionality)

5. **Metrics Utilities (`utils/metrics.py`)**
   - **Purpose**: Comprehensive evaluation metrics suite
   - **Use Cases**: Model comparison, ablation studies, production monitoring
   - **Tutorial Placement**: Tutorial 07 (currently implements metrics manually)

6. **Continual Learning Wrapper (`utils/continual_learning.py`)**
   - **Purpose**: Prevent catastrophic forgetting in sequential fine-tuning
   - **Use Cases**: Multi-domain adaptation, lifelong learning scenarios
   - **Tutorial Placement**: New tutorial or enhancement to Tutorial 04

7. **Data Utilities (`training/data.py`)**
   - **Purpose**: Standardized dataset loading with chat template support
   - **Use Cases**: All fine-tuning scenarios
   - **Tutorial Placement**: Tutorial 02 (currently teaches custom dataset)

### Low Priority (Nice to Have)

8. **Config Validation Tools**
   - **Purpose**: Catch configuration errors before training
   - **Tutorial Placement**: Tutorial 08 or integrated throughout

9. 
**Logging Utilities (`utils/logging.py`)**
   - **Purpose**: Structured logging for training runs
   - **Tutorial Placement**: Tutorial 13 (Debugging)

---

## Learning Progression Analysis

### Current State:

- ❌ **Disjointed Flow**: Tutorials jump between concepts without building on previous knowledge
- ❌ **Missing Foundations**: No explanation of fine-tuning types before practical examples
- ❌ **Inconsistent Complexity**: Some advanced topics in early tutorials, basic concepts in later ones

### Recommended Restructuring:

**Beginner Track (Tutorials 00-04):**
1. **00**: Introduction + NTF Architecture Overview ← Add component map
2. **01**: Environment Setup + Verification ← Add component imports
3. **02**: Data Preparation with NTF Utilities ← Replace custom dataset
4. **03**: Your First Fine-Tuning Run (FullFinetuneTrainer) ← Simplify, use NTF
5. **04**: Understanding PEFT Basics ← Move from Tutorial 05

**Intermediate Track (Tutorials 05-09):**
6. **05**: Advanced PEFT Strategies (Multi-Adapter, Composition)
7. **06**: Evaluation and Metrics with NTF Utilities
8. **07**: Hyperparameter Tuning and Optimization
9. **08**: Model Versioning and Experiment Tracking
10. **09**: RLHF Fundamentals

**Advanced Track (Tutorials 10-13):**
11. **10**: Distributed Training at Scale
12. **11**: Production Deployment and Serving
13. **12**: Continual Learning and Domain Adaptation ← New/refocused
14. **13**: Debugging and Performance Profiling

### Missing Foundational Content:

Before Tutorial 03, add:

```markdown
## Understanding Fine-Tuning Types

Fine-tuning adapts pre-trained models to specific tasks. NTF supports three main approaches:

### 1. Full Fine-Tuning
- **What**: Update all model parameters
- **When**: Sufficient VRAM, domain shift is large
- **NTF Component**: `FullFinetuneTrainer` + `LayerFreezer`
- **Trade-offs**: Best performance, highest resource usage

### 2. 
Parameter-Efficient Fine-Tuning (PEFT)
- **What**: Update small adapter parameters, freeze backbone
- **When**: Limited VRAM, multiple tasks, quick iteration
- **NTF Component**: `PEFTTrainer` (LoRA, AdaLoRA, LoHa)
- **Trade-offs**: Lower resource usage, slightly reduced performance

### 3. Continual Fine-Tuning
- **What**: Sequential fine-tuning on multiple domains
- **When**: Lifelong learning, multi-domain deployment
- **NTF Component**: `ContinualLearningWrapper` + regularization
- **Trade-offs**: Maintains knowledge across domains, requires careful tuning

Choose your approach based on resources and requirements...
```

---

## Technical Accuracy Issues

### Speculative Hardware Estimates (Remove or Qualify)

**Found in**: Tutorials 00, 01, 03, 10

Examples to remove/qualify:

- ❌ "80GB+ VRAM required for 70B models"
- ❌ "Training takes 2-3 days on 8x A100"
- ❌ "Batch size of 32 recommended"

**Replacement language**:

- ✅ "VRAM requirements vary based on sequence length, batch size, precision, and optimization techniques. Use NTF's `LayerFreezer` and gradient checkpointing to reduce memory footprint."
- ✅ "Training time depends on dataset size, model architecture, and hardware configuration. Monitor progress with NTF's built-in logging."
- ✅ "Start with small batch sizes and scale up based on available memory. NTF's `FullFinetuneTrainer` automatically handles gradient accumulation."

### AI Jargon to Professionalize

| Original | Professional Alternative |
|----------|-------------------------|
| "Catastrophic forgetting" | "Knowledge degradation during domain adaptation" |
| "Magic numbers" | "Empirically-derived hyperparameters" |
| "Black box" | "Complex neural network behavior" |
| "State-of-the-art" | "Current leading performance" |
| "Ground truth" | "Reference labels" or "Validated data" |

---

## Prioritized Action Items

### Immediate (Week 1-2)

1. ✅ Fix tutorial numbering and file references in Table of Contents
2. ✅ Remove all speculative hardware estimates
3. 
✅ Replace Tutorial 03 with NTF-native `FullFinetuneTrainer` example
4. ✅ Update Tutorial 02 to use `TextDataset` instead of custom dataset
5. ✅ Update Tutorial 09 to use `ModelRegistry` for versioning
6. ✅ Update Tutorial 07 to use `utils/metrics.py` utilities

### Short-Term (Month 1)

7. ✅ Add the missing `LayerFreezer` documentation to Tutorial 03
8. ✅ Rewrite Tutorial 06 to use NTF's `RewardModel` and RLHF pipeline
9. ✅ Update Tutorial 05 to demonstrate `PEFTTrainer`
10. ✅ Clarify distributed training capabilities in Tutorial 10
11. ✅ Resolve MLflow vs. ModelRegistry conflict in Tutorial 12
12. ✅ Add foundational fine-tuning types section before Tutorial 03

### Long-Term (Quarter 1)

13. 🔄 Implement missing features (multi-task learning, advanced continual learning)
14. 🔄 Create interactive Colab notebooks for each tutorial
15. 🔄 Add video walkthroughs for complex topics
16. 🔄 Build automated testing for code examples
17. 🔄 Create production deployment templates
18. 🔄 Develop troubleshooting decision tree

---

## Conclusion

The NTF tutorial series has strong foundational content but requires significant alignment with the actual NTF architecture. Key priorities:

1. **Replace generic HuggingFace patterns** with NTF-native components throughout
2. **Document existing but unused components** (LayerFreezer, ModelRegistry, PEFTTrainer, RLHF pipeline)
3. **Remove speculative claims** about hardware requirements and training times
4. **Restructure learning progression** to build knowledge incrementally
5. **Clarify feature availability** to manage user expectations

By addressing these issues, the tutorials will become a reliable, professional resource that accurately represents NTF's capabilities and guides users from beginner to production-ready implementations.
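
As a starting point for the automated testing of code examples listed under the long-term action items, the following is a minimal sketch of a syntax check over a tutorial's fenced Python blocks. The helper names `extract_python_blocks` and `check_python_blocks` are hypothetical and not part of NTF. The check uses only the standard library and parses each block without importing or running it, so it can catch malformed examples but cannot confirm that the NTF calls shown in a tutorial actually exist.

```python
import ast
import re
import textwrap

# The fence marker is built programmatically so this example can itself
# be embedded in a fenced block without closing it early.
FENCE = "`" * 3

# An opening fence tagged "python", the block body, then a bare closing fence.
# Leading whitespace is allowed so blocks nested inside list items are found.
BLOCK_RE = re.compile(
    r"^[ \t]*" + FENCE + r"python[ \t]*\n(.*?)^[ \t]*" + FENCE + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_python_blocks(markdown_text):
    """Return the dedented source of every python-tagged fenced block."""
    return [textwrap.dedent(m.group(1)) for m in BLOCK_RE.finditer(markdown_text)]


def check_python_blocks(markdown_text):
    """Return (block_index, message) for each block that fails to parse."""
    failures = []
    for index, source in enumerate(extract_python_blocks(markdown_text)):
        try:
            ast.parse(source)  # syntax only: nothing is imported or executed
        except SyntaxError as exc:
            failures.append((index, f"line {exc.lineno}: {exc.msg}"))
    return failures


# Hypothetical tutorial excerpt: one well-formed block and one broken block
# nested inside a list item (indented by three spaces).
sample = "\n".join([
    "Intro text.",
    FENCE + "python",
    "from ntf.models import ModelRegistry",
    "registry = ModelRegistry(...)",
    FENCE,
    "   " + FENCE + "python",
    "   def broken(:",
    "       pass",
    "   " + FENCE,
])

for index, message in check_python_blocks(sample):
    print(f"block {index} failed: {message}")
```

A check like this could run over every file in `Tutorials/` in CI. Verifying that imports resolve against the installed `ntf` package would be a separate, heavier step.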

---

## Appendix: Quick Reference - NTF Components by Tutorial

| Tutorial | Current Approach | Recommended NTF Approach |
|----------|-----------------|-------------------------|
| 02 | Custom Dataset | `TextDataset` + `create_data_collator` |
| 03 | HF Trainer | `FullFinetuneTrainer` + `LayerFreezer` |
| 05 | Manual LoRA | `PEFTTrainer` + adapter management |
| 06 | Generic Reward Model | `RewardModel` + `PreferenceDataset` + RLHF pipeline |
| 07 | Manual Metrics | `compute_perplexity`, `evaluate_generation`, etc. |
| 09 | Manual Versioning | `ModelRegistry` with semantic versioning |
| 12 | MLflow Registry | NTF `ModelRegistry` ± MLflow integration |

---

*Report Generated: NTF Documentation QA Review*
*Reviewer: Documentation Quality Assurance Team*
*Scope: Architecture Alignment, Completeness, Technical Accuracy, Learning Progression*