| # NTF Tutorial Quality Assurance Report |
|
|
| ## Executive Summary |
|
|
| This report provides a comprehensive end-to-end quality assurance review of all 14 tutorial Markdown files (Tutorials 00-13) in the NTF (Nexuss Transformer Framework) documentation. The review focuses on: |
|
|
| 1. **Architecture Alignment**: Ensuring all tutorials correctly use NTF-native components rather than generic HuggingFace patterns |
| 2. **Completeness**: Identifying missing NTF components that should be documented |
| 3. **Practical Examples**: Verifying code examples correctly implement NTF architecture |
| 4. **Learning Progression**: Ensuring continuous flow from beginner to advanced without explicit labeling |
| 5. **Professional Tone**: Removing speculative hardware estimates and AI jargon |
|
|
| **Overall Assessment**: The tutorials require significant refactoring to align with NTF architecture. Many examples use generic HuggingFace/DeepSpeed patterns instead of NTF's native components like `FullFinetuneTrainer`, `ModelRegistry`, `LayerFreezer`, and `PEFTTrainer`. |
|
|
| --- |
|
|
| ## Architecture Overview (Reference for Review) |
|
|
| ### Core NTF Components Identified: |
|
|
| **Training Components (`finetuning/`):** |
| - `FullFinetuneTrainer` - Main training orchestrator with accelerator support |
| - `LoRATrainer` / `PEFTTrainer` - Parameter-efficient fine-tuning implementations |
| - `LayerFreezer` - Strategic layer freezing utilities |
| - Training configurations via `configs.py` |
|
|
| **Model Management (`models/`):** |
| - `ModelRegistry` - Model loading, registration, and versioning |
| - Adapter loading utilities for LoRA/PEFT |
| - Custom model head implementations |
|
|
| **Data Pipeline (`training/data.py`):** |
| - `TextDataset` - Standardized dataset class |
| - Data collators and preprocessing utilities |
| - Chat template integration |
|
|
| **Reward & RLHF (`reward/`):** |
| - `RewardModel` - Reward model implementation |
| - Preference dataset handling |
| - RLHF pipeline utilities |
|
|
| **Utilities (`utils/`):** |
| - `metrics.py` - Evaluation metrics (perplexity, accuracy, etc.) |
| - `versioning.py` - Model versioning utilities |
| - `continual_learning.py` - Continual learning wrappers |
| - Logging and checkpointing utilities |
|
|
| **Configuration (`config/`):** |
| - YAML-based configuration system |
| - Nested configuration classes for models, training, data, PEFT |
|
|
| --- |
|
|
| ## Tutorial-by-Tutorial Analysis |
|
|
| ### Tutorial 00: Introduction to Fine-Tuning |
|
|
| **File**: `Tutorials/Tutorial_00_Introduction_to_Fine_Tuning.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **❌ File Reference Mismatch** |
| - Table of Contents references `Tutorial_01_Setting_Up_Your_Environment.md` but actual file is `Tutorial_01_Environment_Setup.md` |
| - Similar mismatches throughout (e.g., `Tutorial_03_Full_Parameter_Fine_Tuning.md` vs `Tutorial_03_Full_Fine_Tuning.md`) |
|
|
| 2. **❌ Speculative Hardware Estimates** |
| ```markdown |
| - Small Models (7B): 40-80GB VRAM |
| - Medium Models (13B-70B): 80GB+ VRAM |
| ``` |
| These estimates are ungrounded: actual requirements vary with sequence length, batch size, precision, and optimization techniques. |
|
|
| 3. **❌ Missing NTF Component Overview** |
| - No mention of `ModelRegistry`, `FullFinetuneTrainer`, `LayerFreezer` |
| - Introduces fine-tuning concepts without connecting to NTF's implementation |
|
|
| 4. **⚠️ AI Jargon** |
| - "Catastrophic forgetting" mentioned without practical mitigation strategies using NTF utilities |
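File reference mismatches like the one in issue 1 can be caught mechanically. The sketch below is not part of NTF: it scans Markdown text for `Tutorial_NN_*.md` references and reports those missing from a known set of file names. The function name `find_broken_tutorial_refs` and the file-name pattern are assumptions made for illustration.

```python
import re

# Matches names such as Tutorial_01_Environment_Setup.md (assumed naming scheme).
TUTORIAL_REF = re.compile(r"Tutorial_\d{2}_[A-Za-z0-9_]+\.md")


def find_broken_tutorial_refs(markdown_text, existing_files):
    """Return referenced tutorial file names that are not in existing_files, sorted."""
    referenced = set(TUTORIAL_REF.findall(markdown_text))
    return sorted(referenced - set(existing_files))


# Hypothetical table of contents and directory listing.
toc = """
- [Setup](Tutorial_01_Setting_Up_Your_Environment.md)
- [Datasets](Tutorial_02_Working_with_Datasets.md)
"""
on_disk = {
    "Tutorial_01_Environment_Setup.md",
    "Tutorial_02_Working_with_Datasets.md",
}
broken = find_broken_tutorial_refs(toc, on_disk)
print(broken)
```

Running a check of this kind over every tutorial in CI would flag renamed files before readers hit dead links.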
|
|
| #### Recommended Fixes: |
|
|
| ```markdown |
| ## NTF Architecture Overview |
| |
| Before diving into fine-tuning, understand the core components you'll use: |
| |
| - **ModelRegistry**: Central hub for loading, configuring, and versioning models |
| - **FullFinetuneTrainer**: Production-ready training orchestrator with distributed support |
| - **LayerFreezer**: Selectively freeze backbone layers to reduce memory and prevent catastrophic forgetting |
| - **PEFTTrainer**: Parameter-efficient fine-tuning with LoRA, AdaLoRA, and LoHa adapters |
| - **TextDataset**: Unified data loading with chat template support |
| |
| These components work together to provide a streamlined fine-tuning experience... |
| ``` |
|
|
| **Priority**: 🔴 HIGH - Foundation tutorial sets expectations for all subsequent tutorials |
|
|
| --- |
|
|
| ### Tutorial 01: Environment Setup |
|
|
| **File**: `Tutorials/Tutorial_01_Environment_Setup.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **✅ Good Alignment**: Correctly installs the `ntf` package |
| 2. **⚠️ Missing Integration**: Doesn't show how to verify NTF components are working |
| 3. **⚠️ Hardware Requirements Section**: Contains speculative VRAM estimates |
|
|
| #### Recommended Fixes: |
|
|
| Add verification step: |
| ```python |
| import ntf |
| from ntf.models import ModelRegistry |
| from ntf.finetuning import FullFinetuneTrainer |
| from ntf.config import NTFConfig |
| |
| # Verify installation |
| print(f"NTF Version: {ntf.__version__}") |
| print("Core components imported successfully!") |
| ``` |
|
|
| Remove or qualify hardware estimates with: "Actual requirements vary based on sequence length, batch size, and precision settings." |
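To illustrate why a single VRAM figure is unreliable, the sketch below computes only the memory needed to hold model weights at different precisions. It is a back-of-the-envelope calculation, not an NTF utility: the function name and the precision table are assumptions, and gradients, optimizer state, and activations (which depend on sequence length and batch size) are deliberately left out.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, precision):
    """Memory for the weights alone, in GiB (no gradients, optimizer state, or activations)."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


# The same 7B-parameter model spans a wide range depending on precision alone.
for precision in ("fp32", "bf16", "int4"):
    print(f"7B weights in {precision}: {weight_memory_gib(7e9, precision):.1f} GiB")
```

Even this lower bound varies by a factor of eight across precisions, which is why any hardware table in the tutorials should state its assumptions.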
|
|
| **Priority**: 🟡 MEDIUM - Generally sound but needs NTF component verification |
|
|
| --- |
|
|
| ### Tutorial 02: Working with Datasets |
|
|
| **File**: `Tutorials/Tutorial_02_Working_with_Datasets.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **❌ Custom Dataset Implementation Conflicts with NTF Utilities** |
| - Tutorial creates custom `CustomDataset` class from scratch |
| - NTF already provides `TextDataset` in `training/data.py` with built-in chat template support |
|
|
| 2. **⚠️ Missing Chat Template Integration** |
| - NTF's `TextDataset` supports chat templates but tutorial doesn't demonstrate this |
|
|
| 3. **✅ Good Points**: Covers data cleaning, formatting, and train/test split |
|
|
| #### Recommended Fixes: |
|
|
| Replace custom dataset with NTF's implementation: |
|
|
| ```python |
| from ntf.training.data import TextDataset |
| from transformers import AutoTokenizer |
| |
| # Load tokenizer |
| tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf") |
| |
| # Use NTF's built-in dataset |
| dataset = TextDataset( |
|     data_path="formatted_data.json", |
|     tokenizer=tokenizer, |
|     max_length=512, |
|     use_chat_template=True,  # Built-in support |
|     column_mapping={ |
|         "instruction": "instruction", |
|         "input": "context", |
|         "output": "response" |
|     } |
| ) |
| |
| # Access preprocessed data |
| train_data = dataset.get_train_dataset() |
| eval_data = dataset.get_eval_dataset() |
| ``` |
|
|
| Add section on custom data collators if needed: |
| ```python |
| from ntf.training.data import create_data_collator |
| |
| collator = create_data_collator( |
|     tokenizer=tokenizer, |
|     padding=True, |
|     max_length=512 |
| ) |
| ``` |
|
|
| **Priority**: 🔴 HIGH - Reduces code duplication and teaches users NTF-native patterns |
|
|
| --- |
|
|
| ### Tutorial 03: Full Parameter Fine-Tuning |
|
|
| **File**: `Tutorials/Tutorial_03_Full_Fine_Tuning.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **❌ Complete Architecture Misalignment** |
| - Uses raw HuggingFace `Trainer` instead of NTF's `FullFinetuneTrainer` |
| - Manual training loop doesn't leverage NTF's accelerator support |
| - Missing gradient checkpointing, mixed precision, and distributed training hooks |
|
|
| 2. **❌ DeepSpeed Configuration Not Integrated** |
| - Shows DeepSpeed config but doesn't connect to NTF's configuration system |
| - NTF has `configs.py` with nested configuration classes |
|
|
| 3. **❌ Missing ModelRegistry Usage** |
| - Loads model directly with `AutoModelForCausalLM` |
| - Should use `ModelRegistry` for consistent model loading and adapter support |
|
|
| 4. **❌ No Layer Freezing Demonstration** |
| - Full fine-tuning can benefit from selective layer freezing |
| - `LayerFreezer` component completely absent |
|
|
| #### Recommended Complete Rewrite: |
|
|
| ```python |
| from ntf.config import NTFConfig, ModelConfig, TrainingConfig |
| from ntf.models import ModelRegistry |
| from ntf.finetuning import FullFinetuneTrainer, LayerFreezer |
| from ntf.training.data import TextDataset |
| 
| # 1. Configuration-driven setup |
| config = NTFConfig( |
|     model=ModelConfig( |
|         name="meta-llama/Llama-2-7b-hf", |
|         trust_remote_code=True, |
|         torch_dtype="bfloat16" |
|     ), |
|     training=TrainingConfig( |
|         output_dir="./results", |
|         num_train_epochs=3, |
|         per_device_train_batch_size=4, |
|         gradient_accumulation_steps=4, |
|         learning_rate=2e-5, |
|         warmup_ratio=0.1, |
|         weight_decay=0.01, |
|         logging_steps=10, |
|         save_strategy="epoch", |
|         evaluation_strategy="epoch", |
|         fp16=False, |
|         bf16=True, |
|         gradient_checkpointing=True, |
|         dataloader_num_workers=4 |
|     ) |
| ) |
| 
| # 2. Use ModelRegistry for model loading |
| registry = ModelRegistry(config.model) |
| model, tokenizer = registry.load_model_and_tokenizer() |
| 
| # Optional: Freeze backbone layers to reduce memory |
| freezer = LayerFreezer(model) |
| freezer.freeze_backbone(num_layers_to_keep=-1)  # Keep all trainable, or specify number |
| 
| # 3. Prepare dataset with NTF utilities |
| dataset = TextDataset( |
|     data_path="formatted_data.json", |
|     tokenizer=tokenizer, |
|     max_length=512, |
|     use_chat_template=True |
| ) |
| 
| # 4. Initialize NTF's FullFinetuneTrainer |
| trainer = FullFinetuneTrainer( |
|     model=model, |
|     config=config.training, |
|     train_dataset=dataset.get_train_dataset(), |
|     eval_dataset=dataset.get_eval_dataset(), |
|     tokenizer=tokenizer |
| ) |
| 
| # 5. Train with built-in accelerator support |
| trainer.train() |
| 
| # 6. Save with versioning |
| registry.save_model(trainer.model, output_dir="./final_model", version="1.0.0") |
| ``` |
|
|
| **Priority**: 🔴 CRITICAL - Core tutorial completely misaligned with NTF architecture |
|
|
| --- |
|
|
| ### Tutorial 04: Multi-Task Fine-Tuning |
|
|
| **File**: `Tutorials/Tutorial_04_Multi_Task_Fine_Tuning.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **❌ Feature Not Implemented in NTF** |
| - Multi-task learning with task-specific heads not present in current NTF codebase |
| - Tutorial describes capabilities that don't exist |
|
|
| 2. **⚠️ Alternative Approach Needed** |
| - Could demonstrate sequential fine-tuning with the `ContinualLearningWrapper` utilities in `utils/continual_learning.py` |
| - Or focus on multi-domain datasets with single head |
|
|
| #### Recommended Refocus: |
|
|
| Either: |
| 1. **Implement the feature** in NTF first, then document |
| 2. **Refocus tutorial** on sequential domain adaptation using existing utilities: |
|
|
| ```python |
| from ntf.utils.continual_learning import ContinualLearningWrapper |
| from ntf.finetuning import FullFinetuneTrainer |
| |
| # Sequential fine-tuning on multiple domains |
| wrapper = ContinualLearningWrapper(model) |
| |
| # Domain 1: Code generation |
| trainer1 = FullFinetuneTrainer(...) |
| trainer1.train() |
| wrapper.save_state("domain1_checkpoint") |
| |
| # Domain 2: Math reasoning (with regularization to prevent forgetting) |
| wrapper.apply_ewc_regularization(lambda_ewc=0.5) |
| trainer2 = FullFinetuneTrainer(...) |
| trainer2.train() |
| ``` |
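The `lambda_ewc` argument above suggests elastic weight consolidation (EWC), which adds a quadratic penalty that discourages weights important to the first domain from drifting. The sketch below shows only that penalty term in plain Python. It is an illustration of the general technique, not NTF's implementation, and the function name and toy values are assumptions.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lambda_ewc=0.5):
    """Quadratic EWC penalty: (lambda / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2."""
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("params, anchor_params, and fisher_diag must have equal length")
    total = sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )
    return 0.5 * lambda_ewc * total


# Toy values: the second weight has a larger Fisher value, meaning it mattered
# more for the first domain, so moving it is penalized more heavily.
anchor = [1.0, -2.0]
fisher = [0.1, 4.0]
low_importance_moved = ewc_penalty([1.5, -2.0], anchor, fisher)
high_importance_moved = ewc_penalty([1.0, -1.5], anchor, fisher)
print(low_importance_moved, high_importance_moved)
```

In training, this penalty is added to the task loss for the second domain, with `anchor_params` and `fisher_diag` estimated after the first domain finishes.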
|
|
| **Priority**: 🔴 HIGH - Documents non-existent features; needs immediate attention |
|
|
| --- |
|
|
| ### Tutorial 05: Parameter-Efficient Fine-Tuning (PEFT) |
|
|
| **File**: `Tutorials/Tutorial_05_Parameter_Efficient_Fine_Tuning.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **⚠️ Partial Alignment** |
| - Correctly introduces LoRA concept |
| - But uses manual `LoraConfig` setup instead of NTF's `PEFTTrainer` |
|
|
| 2. **❌ Missing NTF PEFTTrainer** |
| - `finetuning/lora.py` contains `LoRATrainer` / `PEFTTrainer` class |
| - Tutorial should demonstrate this unified interface |
|
|
| 3. **⚠️ Adapter Loading Not Covered** |
| - NTF's `models/adapters.py` has utilities for loading/saving adapters |
| - Critical for production workflows |
|
|
| #### Recommended Fixes: |
|
|
| ```python |
| from ntf.config import NTFConfig, ModelConfig, PEFTConfig, TrainingConfig |
| from ntf.models import ModelRegistry |
| from ntf.finetuning import PEFTTrainer |
| 
| # Configuration-driven PEFT |
| config = NTFConfig( |
|     model=ModelConfig(name="meta-llama/Llama-2-7b-hf"), |
|     peft=PEFTConfig( |
|         method="lora",  # or "adalora", "loha" |
|         r=16, |
|         lora_alpha=32, |
|         lora_dropout=0.1, |
|         target_modules=["q_proj", "v_proj"], |
|         bias="none", |
|         task_type="CAUSAL_LM" |
|     ), |
|     training=TrainingConfig(...) |
| ) |
| 
| # Load model with registry |
| registry = ModelRegistry(config.model) |
| model, tokenizer = registry.load_model_and_tokenizer() |
| 
| # Apply PEFT adapters |
| adapter_config = registry.apply_peft_adapters(config.peft) |
| 
| # Use PEFTTrainer with built-in adapter handling |
| trainer = PEFTTrainer( |
|     model=model, |
|     adapter_config=adapter_config, |
|     training_config=config.training, |
|     train_dataset=train_dataset,  # prepared with TextDataset as in Tutorial 02 |
|     tokenizer=tokenizer |
| ) |
| 
| trainer.train() |
| 
| # Save only adapter weights (small footprint) |
| registry.save_adapter(adapter_config, output_dir="./lora_adapter", version="1.0.0") |
| 
| # Later: Load adapter for inference |
| registry.load_adapter(model, adapter_path="./lora_adapter") |
| ``` |
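The "small footprint" of adapter weights follows from simple arithmetic: LoRA adds two low-rank matrices per targeted weight, so the trainable count grows with `r` rather than with the full matrix size. The sketch below works this out for the `r=16`, `q_proj`/`v_proj` settings above. The hidden size and layer count are assumptions about a Llama-2-7B-like architecture, and the helper name is illustrative.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


hidden_size, num_layers, rank = 4096, 32, 16  # assumed Llama-2-7B-like shapes
targets_per_layer = 2                         # q_proj and v_proj

per_matrix = lora_param_count(hidden_size, hidden_size, rank)
total = per_matrix * targets_per_layer * num_layers
full = hidden_size * hidden_size * targets_per_layer * num_layers
print(f"LoRA params: {total:,} ({100 * total / full:.2f}% of the targeted weights)")
```

A calculation like this gives the tutorial a grounded way to compare PEFT configurations without quoting hardware numbers.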
|
|
| Add comparison table of PEFT methods supported by NTF: |
| | Method | NTF Support | Best For | |
| |--------|-------------|----------| |
| | LoRA | ✅ Full | General purpose | |
| | AdaLoRA | ✅ Full | Dynamic rank allocation | |
| | LoHa | ✅ Full | Complex tasks | |
| | Prefix Tuning | ⚠️ Partial | Task-specific prompts | |
| | P-Tuning | ❌ Not implemented | - | |
|
|
| **Priority**: 🟡 MEDIUM-HIGH - Good conceptual coverage but misses NTF-native implementation |
|
|
| --- |
|
|
| ### Tutorial 06: Reinforcement Learning from Human Feedback (RLHF) |
|
|
| **File**: `Tutorials/Tutorial_06_RLHF_Fine_Tuning.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **❌ Reward Model Implementation Mismatch** |
| - Tutorial uses generic `AutoModelForSequenceClassification` |
| - NTF has dedicated `reward/reward_model.py` with `RewardModel` class |
|
|
| 2. **❌ Missing Preference Dataset Handling** |
| - `reward/data.py` contains preference dataset utilities |
| - Tutorial creates custom dataset instead |
|
|
| 3. **⚠️ RLHF Pipeline Not Aligned** |
| - NTF's `reward/` module has pipeline utilities |
| - Tutorial shows manual PPO implementation |
|
|
| 4. **❌ No Integration with Training Pipeline** |
| - Should connect to `FullFinetuneTrainer` or dedicated RLHF trainer |
|
|
| #### Recommended Fixes: |
|
|
| ```python |
| from ntf.reward import RewardModel, PreferenceDataset |
| from ntf.models import ModelRegistry |
| from ntf.config import RewardConfig |
| |
| # 1. Load base model |
| registry = ModelRegistry(model_config) |
| base_model, tokenizer = registry.load_model_and_tokenizer() |
| |
| # 2. Initialize NTF's RewardModel |
| reward_config = RewardConfig( |
|     base_model_name="meta-llama/Llama-2-7b-hf", |
|     num_labels=1, |
|     pad_token_id=tokenizer.pad_token_id |
| ) |
| reward_model = RewardModel(reward_config) |
| reward_model.load_base_model(base_model) |
| 
| # 3. Load preference data with NTF utilities |
| pref_dataset = PreferenceDataset( |
|     data_path="preferences.jsonl", |
|     tokenizer=tokenizer, |
|     max_length=512 |
| ) |
| 
| # 4. Train reward model |
| from ntf.reward.trainer import RewardTrainer |
| reward_trainer = RewardTrainer( |
|     model=reward_model, |
|     dataset=pref_dataset, |
|     config=reward_config |
| ) |
| reward_trainer.train() |
| 
| # 5. Use in RLHF pipeline |
| from ntf.reward.rlhf_pipeline import RLHFPipeline |
| pipeline = RLHFPipeline( |
|     policy_model=policy_model, |
|     reward_model=reward_model, |
|     reference_model=ref_model, |
|     tokenizer=tokenizer |
| ) |
| 
| pipeline.run_ppo( |
|     prompts=prompts, |
|     num_iterations=100, |
|     kl_coeff=0.2 |
| ) |
| ``` |
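Reward models trained on preference pairs commonly minimize a Bradley-Terry style loss, which pushes the score of the chosen response above the rejected one. The sketch below shows that objective for a single pair in plain Python. Whether NTF's `RewardTrainer` uses exactly this loss is not stated in the material reviewed, so treat it as an illustration of the general technique. The function name is an assumption.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss for one pair: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign so exp never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the chosen response is scored further above the rejected one.
for chosen, rejected in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]:
    print(chosen, rejected, round(pairwise_preference_loss(chosen, rejected), 4))
```

A tutorial that explains this objective alongside `PreferenceDataset` would make it clearer why each record needs both a chosen and a rejected response.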
|
|
| **Priority**: 🔴 CRITICAL - RLHF is complex; using wrong components leads to broken implementations |
|
|
| --- |
|
|
| ### Tutorial 07: Evaluation and Metrics |
|
|
| **File**: `Tutorials/Tutorial_07_Evaluation_and_Metrics.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **❌ Custom Metrics Instead of NTF Utilities** |
| - Tutorial implements perplexity, accuracy manually |
| - `utils/metrics.py` has these functions ready to use |
|
|
| 2. **⚠️ Missing Comprehensive Metric Coverage** |
| - NTF metrics include: perplexity, accuracy, BLEU, ROUGE, BERTScore |
| - Tutorial only covers basic metrics |
|
|
| 3. **✅ Good Points**: Explains evaluation importance and overfitting detection |
|
|
| #### Recommended Fixes: |
|
|
| ```python |
| from ntf.utils.metrics import ( |
|     compute_perplexity, |
|     compute_accuracy, |
|     compute_bleu, |
|     compute_rouge, |
|     compute_bertscore, |
|     evaluate_generation |
| ) |
| 
| # Use NTF's unified evaluation |
| results = evaluate_generation( |
|     model=model, |
|     tokenizer=tokenizer, |
|     test_dataset=test_dataset, |
|     metrics=["perplexity", "bleu", "rouge", "bertscore"], |
|     device="cuda" |
| ) |
| 
| print(f"Perplexity: {results['perplexity']:.2f}") |
| print(f"BLEU-4: {results['bleu']:.4f}") |
| print(f"ROUGE-L: {results['rouge']['rougeL']:.4f}") |
| print(f"BERTScore F1: {results['bertscore']['f1']:.4f}") |
| 
| # Compare multiple checkpoints |
| from ntf.utils.metrics import compare_checkpoints |
| comparison = compare_checkpoints( |
|     model_paths=["checkpoint1", "checkpoint2", "checkpoint3"], |
|     eval_dataset=val_dataset, |
|     metrics=["perplexity", "accuracy"] |
| ) |
| ``` |
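For readers who want to know what a perplexity number means, it is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities in plain Python. It illustrates the definition only and makes no claim about how `compute_perplexity` is implemented. The function name here is an assumption.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability) over the evaluated tokens (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options.
uniform_four = perplexity([math.log(0.25)] * 10)
print(uniform_four)
```

Stating this definition in the tutorial helps readers compare perplexities only across runs that share a tokenizer and evaluation set.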
|
|
| Add guidance on metric selection: |
| | Task Type | Recommended Metrics | |
| |-----------|---------------------| |
| | Text Generation | Perplexity, BLEU, ROUGE, BERTScore | |
| | Classification | Accuracy, F1, Precision, Recall | |
| | Summarization | ROUGE, BERTScore | |
| | Translation | BLEU, chrF, COMET | |
| | Question Answering | Exact Match, F1 | |
|
|
| **Priority**: 🟡 MEDIUM - Reduces code duplication and ensures consistent evaluation |
|
|
| --- |
|
|
| ### Tutorial 08: Hyperparameter Tuning |
|
|
| **File**: `Tutorials/Tutorial_08_Hyperparameter_Tuning.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **✅ Good Conceptual Alignment**: Covers grid search, random search, Bayesian optimization |
| 2. **⚠️ Missing NTF Configuration Integration** |
| - Should demonstrate tuning with NTF's `NTFConfig` system |
| - Could integrate with config validation utilities |
|
|
| 3. **⚠️ No Early Stopping Demonstration** |
| - NTF's training configs support early stopping |
| - Tutorial mentions it but doesn't show NTF implementation |
|
|
| #### Recommended Enhancements: |
|
|
| ```python |
| from ntf.config import NTFConfig, ModelConfig, TrainingConfig |
| from ntf.finetuning import FullFinetuneTrainer |
| from ray import tune |
| from ray.tune.schedulers import ASHAScheduler |
| 
| # Define search space aligned with NTF config |
| search_space = { |
|     "learning_rate": tune.loguniform(1e-5, 1e-4), |
|     "batch_size": tune.choice([4, 8, 16]), |
|     "warmup_ratio": tune.uniform(0.05, 0.2), |
|     "weight_decay": tune.loguniform(1e-4, 1e-2) |
| } |
| 
| def train_ntf(config): |
|     # Build NTF config from trial config |
|     ntf_config = NTFConfig( |
|         model=ModelConfig(...), |
|         training=TrainingConfig( |
|             learning_rate=config["learning_rate"], |
|             per_device_train_batch_size=config["batch_size"], |
|             warmup_ratio=config["warmup_ratio"], |
|             weight_decay=config["weight_decay"], |
|             evaluation_strategy="epoch", |
|             load_best_model_at_end=True |
|         ) |
|     ) |
| 
|     # Run training |
|     trainer = FullFinetuneTrainer( |
|         config=ntf_config, |
|         # ... remaining arguments |
|     ) |
|     result = trainer.train() |
| 
|     return {"eval_loss": result.metrics["eval_loss"]} |
| 
| # Run hyperparameter search |
| scheduler = ASHAScheduler(metric="eval_loss", mode="min") |
| analysis = tune.run( |
|     train_ntf, |
|     config=search_space, |
|     num_samples=20, |
|     scheduler=scheduler, |
|     resources_per_trial={"gpu": 1} |
| ) |
| 
| # Get best config |
| best_config = analysis.get_best_config("eval_loss", "min") |
| print(f"Best config: {best_config}") |
| ``` |
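Issue 3 notes that early stopping is mentioned but never shown. Until the NTF-native option is documented, the underlying logic can be illustrated with a small patience counter. The class below is a generic sketch, not NTF's API: the class name, `patience`, and `min_delta` are assumptions chosen to mirror common conventions.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved by min_delta for `patience` evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, eval_loss):
        """Record one evaluation; return True when training should stop."""
        if eval_loss < self.best - self.min_delta:
            self.best = eval_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Hypothetical per-epoch evaluation losses.
stopper = EarlyStopping(patience=2)
losses = [0.90, 0.70, 0.71, 0.72, 0.60]
stopped_at = next((i for i, loss in enumerate(losses) if stopper.step(loss)), None)
print(stopped_at)
```

Note the trade-off this exposes: with a patience of 2 the run stops before the later improvement to 0.60, so the tutorial should discuss how to choose patience alongside the search budget.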
|
|
| **Priority**: 🟡 MEDIUM - Good content but could better integrate with NTF config system |
|
|
| --- |
|
|
| ### Tutorial 09: Model Versioning and Checkpointing |
|
|
| **File**: `Tutorials/Tutorial_09_Model_Versioning_and_Checkpointing.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **❌ Manual Versioning Instead of ModelRegistry** |
| - Tutorial shows manual directory management with timestamps |
| - NTF has `ModelRegistry` class with built-in versioning in `utils/versioning.py` |
|
|
| 2. **❌ Missing Semantic Versioning** |
| - NTF supports semantic versioning (major.minor.patch) |
| - Tutorial uses ad-hoc naming |
|
|
| 3. **⚠️ No Metadata Tracking** |
| - `ModelRegistry` tracks training config, metrics, timestamp |
| - Tutorial doesn't cover metadata |
|
|
| #### Recommended Fixes: |
|
|
| ```python |
| from ntf.models import ModelRegistry |
| from ntf.config import ModelConfig |
| |
| # Initialize registry with versioning enabled |
| registry = ModelRegistry( |
|     model_config=ModelConfig(name="meta-llama/Llama-2-7b-hf"), |
|     registry_path="./model_registry", |
|     enable_versioning=True |
| ) |
| 
| # After training, save with automatic versioning |
| registry.save_model( |
|     model=trained_model, |
|     tokenizer=tokenizer, |
|     version="1.0.0",  # Semantic versioning |
|     metadata={ |
|         "training_config": config.to_dict(), |
|         "metrics": {"eval_loss": 0.234, "perplexity": 12.5}, |
|         "dataset": "custom_instructions_v1", |
|         "peft_method": "lora", |
|         "notes": "Initial fine-tuning run" |
|     } |
| ) |
| |
| # List all versions |
| versions = registry.list_versions() |
| print(f"Available versions: {versions}") |
| |
| # Load specific version |
| model_v1, tokenizer = registry.load_model_and_tokenizer(version="1.0.0") |
| |
| # Compare versions |
| comparison = registry.compare_versions(["1.0.0", "1.1.0"], metrics=["eval_loss"]) |
| |
| # Rollback to previous version if needed |
| registry.rollback("1.0.0") |
| ``` |
|
|
| Add versioning best practices: |
| - Use semantic versioning: MAJOR.MINOR.PATCH |
| - Include training config in metadata |
| - Tag production-ready models |
| - Maintain changelog in metadata |
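One practical reason to parse MAJOR.MINOR.PATCH strings rather than compare them as text is that string ordering ranks `1.9.3` above `1.10.0`. The sketch below shows parsing, bumping, and picking the latest version in plain Python. These helpers are illustrative assumptions and are not part of `ModelRegistry`.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints so versions compare numerically."""
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(p) for p in parts)


def bump(version, part):
    """Return the next version after incrementing 'major', 'minor', or 'patch'."""
    major, minor, patch = parse_version(version)
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


versions = ["1.10.0", "1.2.0", "1.9.3"]
latest = max(versions, key=parse_version)
print(latest, bump(latest, "minor"))
```

A convention worth stating in the tutorial: bump the patch for a retrain on the same data, the minor for new data or hyperparameters, and the major for a change of base model.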
|
|
| **Priority**: 🔴 HIGH - Core functionality exists in NTF but tutorial teaches inferior manual approach |
|
|
| --- |
|
|
| ### Tutorial 10: Distributed Training |
|
|
| **File**: `Tutorials/Tutorial_10_Distributed_Training.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **⚠️ Feature Partially Implemented** |
| - NTF's `FullFinetuneTrainer` uses Accelerate for distributed training |
| - But no dedicated multi-GPU/multi-node orchestration layer visible |
|
|
| 2. **❌ DeepSpeed Integration Unclear** |
| - Tutorial shows DeepSpeed but connection to NTF config system not demonstrated |
| - `configs.py` may have DeepSpeed config but not shown in tutorials |
|
|
| 3. **⚠️ Missing Practical Examples** |
| - No launch scripts for multi-node training |
| - No troubleshooting guide for common distributed issues |
|
|
| #### Recommended Clarifications: |
|
|
| If distributed training is supported via Accelerate: |
|
|
| ```python |
| from ntf.config import NTFConfig, ModelConfig, TrainingConfig |
| from ntf.finetuning import FullFinetuneTrainer |
| 
| # NTF automatically handles distributed training via Accelerate |
| config = NTFConfig( |
|     model=ModelConfig(...), |
|     training=TrainingConfig( |
|         per_device_train_batch_size=4, |
|         gradient_accumulation_steps=4, |
|         # Accelerate auto-detects distributed setup |
|         fp16=False, |
|         bf16=True, |
|         gradient_checkpointing=True |
|     ) |
| ) |
| 
| # Trainer automatically uses all available GPUs |
| trainer = FullFinetuneTrainer( |
|     config=config, |
|     # ... remaining arguments |
| ) |
| trainer.train()  # Distributed training handled internally |
| ``` |
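A point the tutorial should make explicit is that the same per-device settings yield a different effective batch size as GPUs are added, which can change training dynamics. The sketch below shows the arithmetic for the `per_device_train_batch_size=4`, `gradient_accumulation_steps=4` values above. The helper names are assumptions for illustration, not NTF functions.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices):
    """Samples contributing to each optimizer step across all devices."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def accumulation_steps_for(target_batch_size, per_device_batch_size, num_devices):
    """Gradient accumulation steps needed to reach target_batch_size exactly."""
    per_step = per_device_batch_size * num_devices
    if target_batch_size % per_step != 0:
        raise ValueError(
            "target_batch_size must be a multiple of per_device_batch_size * num_devices"
        )
    return target_batch_size // per_step


# Same config, different hardware: the effective batch grows with the GPU count.
for gpus in (1, 2, 8):
    print(gpus, effective_batch_size(4, 4, gpus))
```

To keep results comparable across machines, the accumulation steps can be reduced as devices are added so the effective batch size stays fixed.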
|
|
| Add disclaimer if full distributed training (multi-node) not yet implemented: |
| > **Note**: NTF currently supports multi-GPU training on a single node via Accelerate. Multi-node distributed training is planned for future releases. For large-scale training, consider using external orchestration tools. |
|
|
| **Priority**: 🟡 MEDIUM - Needs clarification on current capabilities vs. roadmap |
|
|
| --- |
|
|
| ### Tutorial 11: Quantization and Optimization |
|
|
| **File**: `Tutorials/Tutorial_11_Quantization_and_Optimization.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **✅ External Tools Appropriately Used**: bitsandbytes, GPTQ, AWQ are external libraries |
| 2. **⚠️ Missing NTF Integration Points** |
| - How does quantization connect to `ModelRegistry`? |
| - Should NTF config support quantization parameters? |
|
|
| 3. **⚠️ Serving Optimization Not Connected** |
| - vLLM, TGI mentioned but no NTF serving utilities shown |
| - Does NTF have serving module? |
|
|
| #### Recommended Enhancements: |
|
|
| ```python |
| from ntf.config import ModelConfig, QuantizationConfig |
| from ntf.models import ModelRegistry |
| |
| # Quantization config integrated with NTF |
| quant_config = QuantizationConfig( |
|     method="bitsandbytes",  # or "gptq", "awq" |
|     load_in_4bit=True, |
|     bnb_4bit_quant_type="nf4", |
|     bnb_4bit_compute_dtype="bfloat16", |
|     bnb_4bit_use_double_quant=True |
| ) |
| 
| model_config = ModelConfig( |
|     name="meta-llama/Llama-2-7b-hf", |
|     quantization=quant_config |
| ) |
| |
| # Registry handles quantized model loading |
| registry = ModelRegistry(model_config) |
| model, tokenizer = registry.load_model_and_tokenizer() |
| # Model automatically loaded in quantized format |
| ``` |
|
|
| Clarify serving story: |
| - If NTF has serving module: demonstrate it |
| - If not: clearly state these are external tools and provide integration examples |
|
|
| **Priority**: 🟡 MEDIUM - External tools are appropriate but integration points unclear |
|
|
| --- |
|
|
| ### Tutorial 12: Production Deployment |
|
|
| **File**: `Tutorials/Tutorial_12_Production_Deployment.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **❌ MLflow Registry Conflicts with NTF ModelRegistry** |
| - Tutorial uses MLflow for model registry |
| - NTF has its own `ModelRegistry` class |
| - Creates confusion about which to use |
|
|
| 2. **⚠️ Missing NTF Deployment Utilities** |
| - Does NTF have deployment helpers? |
| - Should demonstrate integration with serving tools |
|
|
| 3. **⚠️ Monitoring Not Connected to NTF** |
| - NTF's metrics utilities could feed monitoring systems |
| - No demonstration of this integration |
|
|
| #### Recommended Fixes: |
|
|
| Option A - Integrate MLflow with NTF ModelRegistry: |
| ```python |
| from ntf.models import ModelRegistry |
| import mlflow |
| |
| # Use NTF for local versioning, MLflow for enterprise registry |
| registry = ModelRegistry(...) |
| |
| # Save to NTF registry first |
| registry.save_model(model, version="1.0.0", metadata={...}) |
| |
| # Then log to MLflow for enterprise tracking |
| with mlflow.start_run(): |
|     # Assumes get_model_path() returns a local directory of saved model files |
|     model_path = registry.get_model_path("1.0.0") |
|     mlflow.log_artifacts(model_path, artifact_path="model") |
| 
|     # Log NTF metadata to MLflow |
|     metadata = registry.get_metadata("1.0.0") |
|     for key, value in metadata.items(): |
|         mlflow.log_param(key, value) |
| ``` |
|
|
| Option B - Replace MLflow with NTF ModelRegistry: |
| ```python |
| from ntf.models import ModelRegistry |
| |
| # NTF ModelRegistry as primary registry |
| registry = ModelRegistry(registry_path="./production_registry") |
| |
| # Deploy directly from NTF registry |
| model, tokenizer = registry.load_model_and_tokenizer(version="1.0.0") |
| |
| # Export for serving |
| registry.export_for_serving( |
|     version="1.0.0", |
|     format="onnx",  # or "torchscript" |
|     output_path="./serving_model" |
| ) |
| ``` |
|
|
| **Priority**: 🔴 HIGH - Conflicting registry systems create confusion |
|
|
| --- |
|
|
| ### Tutorial 13: Debugging and Troubleshooting |
|
|
| **File**: `Tutorials/Tutorial_13_Debugging_and_Troubleshooting.md` |
|
|
| #### Issues Identified: |
|
|
| 1. **✅ Good Universal Content**: OOM, NaN losses, slow training covered well |
| 2. **⚠️ Missing NTF-Specific Debugging** |
| - How to debug `FullFinetuneTrainer` issues? |
| - NTF logging utilities not demonstrated |
| - Config validation tools not shown |
|
|
| 3. **⚠️ No Common NTF Error Patterns** |
| - ModelRegistry loading failures |
| - PEFT adapter mismatch errors |
| - Dataset preprocessing issues with NTF utilities |
|
|
| #### Recommended Enhancements: |
|
|
| Add NTF-specific debugging section: |
|
|
| ```python |
| # Enable verbose logging in NTF |
| from ntf.config import NTFConfig, ModelConfig, TrainingConfig |
| from ntf.utils.logging import setup_logging |
| 
| setup_logging(level="DEBUG") |
| 
| config = NTFConfig( |
|     model=ModelConfig(...), |
|     training=TrainingConfig( |
|         logging_level="DEBUG", |
|         log_on_each_node=True |
|     ) |
| ) |
| 
| # Validate config before training |
| from ntf.config import validate_config |
| errors = validate_config(config) |
| if errors: |
|     print("Configuration errors:") |
|     for error in errors: |
|         print(f" - {error}") |
| 
| # Debug dataset preprocessing |
| from ntf.training.data import TextDataset |
| dataset = TextDataset(...) |
| 
| # Inspect processed samples |
| for i in range(5): |
|     sample = dataset[i] |
|     print(f"Sample {i}:") |
|     print(f" Input shape: {sample['input_ids'].shape}") |
|     print(f" Attention mask sum: {sample['attention_mask'].sum()}") |
| ``` |
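To make the idea of pre-training validation concrete, the sketch below checks a plain dictionary of training settings and returns a list of problems, mirroring the "list of errors" pattern used above. It is a hypothetical stand-in and says nothing about what NTF's `validate_config` actually checks. The function name and the specific rules are assumptions.

```python
def validate_training_settings(settings):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    errors = []
    if settings.get("fp16") and settings.get("bf16"):
        errors.append("fp16 and bf16 cannot both be enabled")
    if settings.get("learning_rate", 0) <= 0:
        errors.append("learning_rate must be positive")
    if not 0.0 <= settings.get("warmup_ratio", 0.0) <= 1.0:
        errors.append("warmup_ratio must be between 0 and 1")
    if settings.get("per_device_train_batch_size", 1) < 1:
        errors.append("per_device_train_batch_size must be at least 1")
    return errors


# Hypothetical settings with two mistakes.
bad = {"fp16": True, "bf16": True, "learning_rate": 2e-5, "warmup_ratio": 1.5}
problems = validate_training_settings(bad)
for problem in problems:
    print(f" - {problem}")
```

Failing fast on checks like these costs milliseconds, whereas discovering the same mistake after model loading can cost minutes per attempt.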
|
|
| Add common NTF error patterns: |
| | Error | Cause | Solution | |
| |-------|-------|----------| |
| | `ModelRegistryError: Version not found` | Version doesn't exist in registry | Use `list_versions()` to check available versions | |
| | `PEFT adapter dimension mismatch` | Adapter trained on different model | Ensure same base model and adapter config | |
| | `TextDataset column mapping error` | Column names don't match | Verify `column_mapping` parameter | |
|
|
| **Priority**: 🟡 MEDIUM - Good general content but needs NTF-specific additions |
|
|
| --- |
|
|
| ## Missing NTF Components That Should Be Documented |
|
|
| ### High Priority (Core Functionality) |
|
|
| 1. **LayerFreezer (`finetuning/freeze.py`)** |
| - **Purpose**: Selectively freeze model layers to reduce memory and prevent catastrophic forgetting |
| - **Use Cases**: |
| - Fine-tuning large models with limited VRAM |
| - Domain adaptation while preserving general knowledge |
| - Progressive unfreezing strategies |
| - **Tutorial Placement**: Tutorial 03 (Full Fine-Tuning) or dedicated advanced tutorial |
|
|
| 2. **ModelRegistry (`models/registry.py` / `utils/versioning.py`)** |
| - **Purpose**: Centralized model loading, versioning, and metadata tracking |
| - **Use Cases**: |
| - Reproducible experiments with versioned models |
| - A/B testing different model versions |
| - Production deployment with rollback capability |
| - **Tutorial Placement**: Tutorial 09 (currently teaches manual approach) |
|
|
| 3. **PEFTTrainer (`finetuning/lora.py`)** |
| - **Purpose**: Unified interface for all PEFT methods (LoRA, AdaLoRA, LoHa) |
| - **Use Cases**: |
| - Resource-constrained fine-tuning |
| - Multiple adapter management |
| - Adapter composition and merging |
| - **Tutorial Placement**: Tutorial 05 (currently uses manual LoRA setup) |
|
|
| 4. **RLHF Pipeline (`reward/`)** |
| - **Purpose**: End-to-end RLHF workflow with reward modeling and PPO |
| - **Use Cases**: |
| - Aligning models with human preferences |
| - Building conversational AI with feedback |
| - Safety and helpfulness tuning |
| - **Tutorial Placement**: Tutorial 06 (currently uses generic implementation) |
|
|
| ### Medium Priority (Enhanced Functionality) |
|
|
| 5. **Metrics Utilities (`utils/metrics.py`)** |
| - **Purpose**: Comprehensive evaluation metrics suite |
| - **Use Cases**: Model comparison, ablation studies, production monitoring |
| - **Tutorial Placement**: Tutorial 07 (currently implements metrics manually) |
|
|
| 6. **Continual Learning Wrapper (`utils/continual_learning.py`)** |
| - **Purpose**: Prevent catastrophic forgetting in sequential fine-tuning |
| - **Use Cases**: Multi-domain adaptation, lifelong learning scenarios |
| - **Tutorial Placement**: New tutorial or enhancement to Tutorial 04 |
| |
| 7. **Data Utilities (`training/data.py`)** |
| - **Purpose**: Standardized dataset loading with chat template support |
| - **Use Cases**: All fine-tuning scenarios |
| - **Tutorial Placement**: Tutorial 02 (currently teaches custom dataset) |
| |
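As a reference point for the metrics item above, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that definition in plain Python; the function name `perplexity` is an assumption for illustration and is not claimed to match the signature of NTF's own metrics utilities.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 0.25 to every token has perplexity 4,
# i.e. it is as uncertain as a uniform choice among four tokens.
nlls = [-math.log(0.25)] * 8
print(round(perplexity(nlls), 6))
```
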
| ### Low Priority (Nice to Have) |
| |
| 8. **Config Validation Tools** |
| - **Purpose**: Catch configuration errors before training |
| - **Tutorial Placement**: Tutorial 08 or integrated throughout |
| |
| 9. **Logging Utilities (`utils/logging.py`)** |
| - **Purpose**: Structured logging for training runs |
| - **Tutorial Placement**: Tutorial 13 (Debugging) |
| |
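To make the config validation item concrete, the following is a minimal sketch of a pre-training check. Every name here (`TrainConfig`, its fields, `validate`, and the allowed precision strings) is hypothetical and chosen for illustration; the report does not specify what NTF's validation tools look like.

```python
from dataclasses import dataclass
from typing import List

# Assumed set of precision labels, for illustration only.
ALLOWED_PRECISIONS = {"fp32", "fp16", "bf16"}


@dataclass
class TrainConfig:
    learning_rate: float
    batch_size: int
    num_epochs: int
    precision: str = "bf16"


def validate(cfg: TrainConfig) -> List[str]:
    """Collect every problem instead of stopping at the first one."""
    errors = []
    if not 0.0 < cfg.learning_rate < 1.0:
        errors.append(f"learning_rate must be in (0, 1), got {cfg.learning_rate}")
    if cfg.batch_size < 1:
        errors.append(f"batch_size must be >= 1, got {cfg.batch_size}")
    if cfg.num_epochs < 1:
        errors.append(f"num_epochs must be >= 1, got {cfg.num_epochs}")
    if cfg.precision not in ALLOWED_PRECISIONS:
        errors.append(f"precision must be one of {sorted(ALLOWED_PRECISIONS)}, got {cfg.precision!r}")
    return errors


bad = TrainConfig(learning_rate=-1.0, batch_size=0, num_epochs=3, precision="fp8x")
for problem in validate(bad):
    print("config error:", problem)
```

Returning a list of messages rather than raising on the first failure lets a user fix all configuration mistakes in one pass.
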
| --- |
| |
| ## Learning Progression Analysis |
| |
| ### Current State: |
| - ❌ **Disjointed Flow**: Tutorials jump between concepts without building on previous knowledge |
| - ❌ **Missing Foundations**: No explanation of fine-tuning types before practical examples |
| - ❌ **Inconsistent Complexity**: Some advanced topics in early tutorials, basic concepts in later ones |
| |
| ### Recommended Restructuring: |
| |
| **Beginner Track (Tutorials 00-04):** |
| 1. **00**: Introduction + NTF Architecture Overview → Add component map |
| 2. **01**: Environment Setup + Verification → Add component imports |
| 3. **02**: Data Preparation with NTF Utilities → Replace custom dataset |
| 4. **03**: Your First Fine-Tuning Run (FullFinetuneTrainer) → Simplify, use NTF |
| 5. **04**: Understanding PEFT Basics → Move from Tutorial 05 |
| |
| **Intermediate Track (Tutorials 05-09):** |
| 6. **05**: Advanced PEFT Strategies (Multi-Adapter, Composition) |
| 7. **06**: Evaluation and Metrics with NTF Utilities |
| 8. **07**: Hyperparameter Tuning and Optimization |
| 9. **08**: Model Versioning and Experiment Tracking |
| 10. **09**: RLHF Fundamentals |
| |
| **Advanced Track (Tutorials 10-13):** |
| 11. **10**: Distributed Training at Scale |
| 12. **11**: Production Deployment and Serving |
| 13. **12**: Continual Learning and Domain Adaptation → New/refocused |
| 14. **13**: Debugging and Performance Profiling |
| |
| ### Missing Foundational Content: |
| |
| Before Tutorial 03, add: |
| ```markdown |
| ## Understanding Fine-Tuning Types |
| |
| Fine-tuning adapts pre-trained models to specific tasks. NTF supports three main approaches: |
| |
| ### 1. Full Fine-Tuning |
| - **What**: Update all model parameters |
| - **When**: Sufficient VRAM, domain shift is large |
| - **NTF Component**: `FullFinetuneTrainer` + `LayerFreezer` |
| - **Trade-offs**: Best performance, highest resource usage |
| |
| ### 2. Parameter-Efficient Fine-Tuning (PEFT) |
| - **What**: Update small adapter parameters, freeze backbone |
| - **When**: Limited VRAM, multiple tasks, quick iteration |
| - **NTF Component**: `PEFTTrainer` (LoRA, AdaLoRA, LoHa) |
| - **Trade-offs**: Lower resource usage, slightly reduced performance |
| |
| ### 3. Continual Fine-Tuning |
| - **What**: Sequential fine-tuning on multiple domains |
| - **When**: Lifelong learning, multi-domain deployment |
| - **NTF Component**: `ContinualLearningWrapper` + regularization |
| - **Trade-offs**: Maintains knowledge across domains, requires careful tuning |
| |
| Choose your approach based on resources and requirements... |
| ``` |
| |
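The resource trade-off that the proposed section attributes to PEFT can be backed with LoRA's parameter arithmetic rather than a hardware estimate: a rank-`r` adapter on a `d_out x d_in` weight matrix adds two low-rank factors with `r * (d_in + d_out)` trainable parameters in total. The sketch below computes that ratio; the function names and the 4096-wide layer are illustrative assumptions, not values taken from NTF.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical square projection layer with a rank-8 adapter.
d_in = d_out = 4096
rank = 8
dense = full_params(d_in, d_out)
adapter = lora_params(d_in, d_out, rank)
print(f"dense: {dense:,}  adapter: {adapter:,}  ratio: {adapter / dense:.4%}")
```
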
| --- |
| |
| ## Technical Accuracy Issues |
| |
| ### Speculative Hardware Estimates (Remove or Qualify) |
| |
| **Found in**: Tutorials 00, 01, 03, 10 |
| |
| Examples to remove/qualify: |
| - ❌ "80GB+ VRAM required for 70B models" |
| - ❌ "Training takes 2-3 days on 8x A100" |
| - ❌ "Batch size of 32 recommended" |
| |
| **Replacement language**: |
| - ✅ "VRAM requirements vary based on sequence length, batch size, precision, and optimization techniques. Use NTF's `LayerFreezer` and gradient checkpointing to reduce memory footprint." |
| - ✅ "Training time depends on dataset size, model architecture, and hardware configuration. Monitor progress with NTF's built-in logging." |
| - ✅ "Start with small batch sizes and scale up based on available memory. NTF's `FullFinetuneTrainer` automatically handles gradient accumulation." |
| |
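The batch-size guidance above rests on a simple relationship: the effective batch size is the per-device micro-batch times the number of accumulation steps times the number of devices. The sketch below shows that arithmetic in isolation; the function names are hypothetical, and nothing here describes how `FullFinetuneTrainer` configures accumulation internally.

```python
import math


def accumulation_steps(target_batch: int, micro_batch: int, num_devices: int = 1) -> int:
    """Smallest number of accumulation steps that reaches the target batch size."""
    per_step = micro_batch * num_devices
    if target_batch <= 0 or per_step <= 0:
        raise ValueError("batch sizes and device count must be positive")
    return math.ceil(target_batch / per_step)


def effective_batch(micro_batch: int, steps: int, num_devices: int = 1) -> int:
    """Samples contributing to each optimizer update."""
    return micro_batch * steps * num_devices


# Example values chosen only for illustration.
steps = accumulation_steps(target_batch=64, micro_batch=4, num_devices=2)
print(steps, effective_batch(micro_batch=4, steps=steps, num_devices=2))
```
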
| ### AI Jargon to Professionalize |
| |
| | Original | Professional Alternative | |
| |----------|-------------------------| |
| | "Catastrophic forgetting" | "Knowledge degradation during domain adaptation" | |
| | "Magic numbers" | "Empirically-derived hyperparameters" | |
| | "Black box" | "Complex neural network behavior" | |
| | "State-of-the-art" | "Current leading performance" | |
| | "Ground truth" | "Reference labels" or "Validated data" | |
| |
| --- |
| |
| ## Prioritized Action Items |
| |
| ### Immediate (Week 1-2) |
| 1. ✅ Fix tutorial numbering and file references in Table of Contents |
| 2. ✅ Remove all speculative hardware estimates |
| 3. ✅ Replace Tutorial 03 with NTF-native `FullFinetuneTrainer` example |
| 4. ✅ Update Tutorial 02 to use `TextDataset` instead of custom dataset |
| 5. ✅ Update Tutorial 09 to use `ModelRegistry` for versioning |
| 6. ✅ Update Tutorial 07 to use `utils/metrics.py` utilities |
| |
| ### Short-Term (Month 1) |
| 7. ✅ Add the missing `LayerFreezer` documentation to Tutorial 03 |
| 8. ✅ Rewrite Tutorial 06 to use NTF's `RewardModel` and RLHF pipeline |
| 9. ✅ Update Tutorial 05 to demonstrate `PEFTTrainer` |
| 10. ✅ Clarify distributed training capabilities in Tutorial 10 |
| 11. ✅ Resolve MLflow vs. ModelRegistry conflict in Tutorial 12 |
| 12. ✅ Add foundational fine-tuning types section before Tutorial 03 |
| |
| ### Long-Term (Quarter 1) |
| 13. Implement missing features (multi-task learning, advanced continual learning) |
| 14. Create interactive Colab notebooks for each tutorial |
| 15. Add video walkthroughs for complex topics |
| 16. Build automated testing for code examples |
| 17. Create production deployment templates |
| 18. Develop troubleshooting decision tree |
| |
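For the automated-testing item, a first step could be a syntax check over every Python block in a tutorial file. The sketch below is one possible starting point under stated assumptions: the helper names are hypothetical, it only recognizes fences tagged `python`, and it compiles blocks without executing them, so it catches syntax errors but not wrong imports or runtime failures.

```python
import re
from typing import List, Tuple

# Build the fence marker programmatically so this example can itself live
# inside a fenced block.
FENCE = "`" * 3
BLOCK = re.compile(
    rf"^{FENCE}python[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def python_blocks(markdown: str) -> List[str]:
    """Return the source of each fenced Python block, in document order."""
    return [match.group(1) for match in BLOCK.finditer(markdown)]


def syntax_errors(markdown: str) -> List[Tuple[int, str]]:
    """Compile each block and report (block index, message) for failures."""
    problems = []
    for index, source in enumerate(python_blocks(markdown)):
        try:
            compile(source, f"<block {index}>", "exec")
        except SyntaxError as exc:
            problems.append((index, exc.msg))
    return problems


sample = (
    "Intro text\n"
    f"{FENCE}python\nprint('ok')\n{FENCE}\n\n"
    f"{FENCE}python\ndef broken(:\n    pass\n{FENCE}\n"
)
print(syntax_errors(sample))
```
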
| --- |
| |
| ## Conclusion |
| |
| The NTF tutorial series has strong foundational content but requires significant alignment with the actual NTF architecture. Key priorities: |
| |
| 1. **Replace generic HuggingFace patterns** with NTF-native components throughout |
| 2. **Document existing but unused components** (LayerFreezer, ModelRegistry, PEFTTrainer, RLHF pipeline) |
| 3. **Remove speculative claims** about hardware requirements and training times |
| 4. **Restructure learning progression** to build knowledge incrementally |
| 5. **Clarify feature availability** to manage user expectations |
| |
| Addressing these issues will make the tutorials a reliable, professional resource that accurately represents NTF's capabilities and guides users from beginner to production-ready implementations. |
| |
| --- |
| |
| ## Appendix: Quick Reference - NTF Components by Tutorial |
| |
| | Tutorial | Current Approach | Recommended NTF Approach | |
| |----------|-----------------|-------------------------| |
| | 02 | Custom Dataset | `TextDataset` + `create_data_collator` | |
| | 03 | HF Trainer | `FullFinetuneTrainer` + `LayerFreezer` | |
| | 05 | Manual LoRA | `PEFTTrainer` + adapter management | |
| | 06 | Generic Reward Model | `RewardModel` + `PreferenceDataset` + RLHF pipeline | |
| | 07 | Manual Metrics | `compute_perplexity`, `evaluate_generation`, etc. | |
| | 09 | Manual Versioning | `ModelRegistry` with semantic versioning | |
| | 12 | MLflow Registry | NTF `ModelRegistry` ± MLflow integration | |
| |
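The "semantic versioning" recommendation for Tutorial 09 depends on comparing versions numerically rather than as strings, and on being able to roll back. The toy registry below sketches both ideas in memory. It is not NTF's `ModelRegistry`: the class, method names, and paths are hypothetical, and it handles only plain `MAJOR.MINOR.PATCH` strings without pre-release or build metadata.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple that sorts numerically."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


class ToyRegistry:
    def __init__(self) -> None:
        # model name -> {version tuple: artifact path}
        self._models: Dict[str, Dict[Version, str]] = {}

    def register(self, name: str, version: str, path: str) -> None:
        versions = self._models.setdefault(name, {})
        key = parse_version(version)
        if key in versions:
            raise ValueError(f"{name} {version} is already registered")
        versions[key] = path

    def latest(self, name: str) -> str:
        versions = self._models[name]
        return versions[max(versions)]

    def rollback(self, name: str) -> str:
        """Drop the newest version and return the path of the one before it."""
        versions = self._models[name]
        if len(versions) < 2:
            raise ValueError("no earlier version to roll back to")
        del versions[max(versions)]
        return versions[max(versions)]


registry = ToyRegistry()
registry.register("demo-model", "1.2.0", "/models/demo/1.2.0")
registry.register("demo-model", "1.10.0", "/models/demo/1.10.0")
registry.register("demo-model", "1.9.3", "/models/demo/1.9.3")
print(registry.latest("demo-model"))
print(registry.rollback("demo-model"))
```

Comparing tuples rather than strings is what makes `1.10.0` rank above `1.9.3`.
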
| --- |
| |
| *Report Generated: NTF Documentation QA Review* |
| *Reviewer: Documentation Quality Assurance Team* |
| *Scope: Architecture Alignment, Completeness, Technical Accuracy, Learning Progression* |
| |