# Toxic Comment Classification using Deep Learning
A multilingual toxic comment classification system using language-aware transformers and advanced deep learning techniques.
## Architecture Overview
### Core Components
1. **LanguageAwareTransformer**
   - Base: XLM-RoBERTa Large
   - Custom language-aware attention mechanism
   - Gating mechanism for feature fusion
   - Language-specific dropout rates
   - Support for 7 languages with English fallback
2. **ToxicDataset**
   - Efficient caching system
   - Language ID mapping
   - Memory pinning for CUDA optimization
   - Automatic handling of missing values
3. **Training System**
   - Mixed precision training (BF16/FP16)
   - Gradient accumulation
   - Language-aware loss weighting
   - Distributed training support
   - Automatic threshold optimization
### Key Features
- **Language Awareness**
  - Language-specific embeddings
  - Dynamic dropout rates per language
  - Language-aware attention mechanism
  - Automatic fallback to English for unsupported languages
- **Performance Optimization**
  - Gradient checkpointing
  - Memory-efficient attention
  - Automatic mixed precision
  - Caching system for processed data
  - CUDA optimization with memory pinning
- **Training Features**
  - Weighted focal loss with language awareness (sketched below)
  - Dynamic threshold optimization
  - Early stopping with patience
  - Gradient flow monitoring
  - Comprehensive metric tracking
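The weighted focal loss with language awareness is the least self-explanatory item above, so a minimal sketch follows. It assumes a multi-label setup with a per-language weight table; the function name, `gamma` value, and weighting scheme are illustrative, not the repository's exact implementation.
```python
import torch
import torch.nn.functional as F

def language_weighted_focal_loss(logits, targets, lang_ids, lang_weights, gamma=2.0):
    """Multi-label focal loss scaled by a per-language weight (illustrative).

    logits, targets: [batch_size, num_classes]
    lang_ids:        [batch_size] integer language IDs
    lang_weights:    [num_languages] tensor of loss weights
    """
    # Per-element BCE so it can be reweighted before reduction.
    bce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    # Focal modulation: down-weight examples the model already classifies well.
    p_t = torch.exp(-bce)                      # probability assigned to the true label
    focal = (1.0 - p_t) ** gamma * bce         # [batch_size, num_classes]
    # Per-language weighting, broadcast over the class dimension.
    w = lang_weights[lang_ids].unsqueeze(1)    # [batch_size, 1]
    return (w * focal).mean()
```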
## Data Processing
### Input Format
```python
{
    'comment_text': str,      # The text to classify
    'lang': str,              # Language code (en, ru, tr, es, fr, it, pt)
    'toxic': int,             # Binary label (0/1); one field per toxicity category
    'severe_toxic': int,
    'obscene': int,
    'threat': int,
    'insult': int,
    'identity_hate': int
}
```
### Language Support
- Primary: en, ru, tr, es, fr, it, pt
- Default fallback: en (English)
- Language ID mapping: `{en: 0, ru: 1, tr: 2, es: 3, fr: 4, it: 5, pt: 6}`
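A minimal sketch of how the language ID mapping and English fallback might be wired into a dataset class; the class name, field handling, and tokenizer arguments are assumptions for illustration rather than the actual `ToxicDataset` code. The memory pinning mentioned earlier would typically be enabled on the `DataLoader` (`pin_memory=True`) rather than inside the dataset itself.
```python
import torch
from torch.utils.data import Dataset

LANG2ID = {"en": 0, "ru": 1, "tr": 2, "es": 3, "fr": 4, "it": 5, "pt": 6}
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

class ToxicCommentsDataset(Dataset):
    """Illustrative dataset: tokenizes text and maps language codes to IDs."""

    def __init__(self, df, tokenizer, max_length=128):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        enc = self.tokenizer(
            str(row["comment_text"]),
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )
        # Unsupported languages fall back to English (ID 0).
        lang_id = LANG2ID.get(row.get("lang", "en"), LANG2ID["en"])
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "lang_ids": torch.tensor(lang_id, dtype=torch.long),
            "labels": torch.tensor([row[l] for l in LABELS], dtype=torch.float),
        }
```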
## Model Architecture
### Base Model
- XLM-RoBERTa Large
- Hidden size: 1024
- Attention heads: 16
- Max sequence length: 128
### Custom Components
1. **Language-Aware Classifier**
   ```text
   - Input: Hidden states [batch_size, hidden_size]
   - Language embeddings: [batch_size, 64]
   - Projection: hidden_size + 64 -> 512
   - Output: 6 toxicity predictions
   ```
2. **Language-Aware Attention**
   ```text
   - Input: Hidden states + Language embeddings
   - Scaled dot product attention
   - Gating mechanism for feature fusion
   - Memory-efficient implementation
   ```
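The two blocks above only list shapes, so here is one plausible way the pieces fit together: the language embedding drives an attention query over the hidden states, a sigmoid gate fuses the attention-pooled vector with the [CLS] representation, and the classifier consumes the fused features concatenated with the language embedding. Module names and the exact fusion rule are assumptions, not the `LanguageAwareTransformer` source.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAwareHead(nn.Module):
    """Illustrative head: gated language-aware attention plus classifier."""

    def __init__(self, hidden_size=1024, lang_embed_dim=64, num_languages=7, num_labels=6):
        super().__init__()
        self.lang_embed = nn.Embedding(num_languages, lang_embed_dim)
        self.lang_to_hidden = nn.Linear(lang_embed_dim, hidden_size)
        self.gate = nn.Linear(2 * hidden_size, hidden_size)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size + lang_embed_dim, 512),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_labels),
        )

    def forward(self, hidden_states, attention_mask, lang_ids):
        # hidden_states: [batch, seq_len, hidden], lang_ids: [batch]
        lang_emb = self.lang_embed(lang_ids)                    # [batch, 64]
        query = self.lang_to_hidden(lang_emb).unsqueeze(1)      # [batch, 1, hidden]
        # Memory-efficient scaled dot-product attention over the sequence.
        attn_mask = attention_mask[:, None, :].bool()           # [batch, 1, seq_len]
        pooled = F.scaled_dot_product_attention(
            query, hidden_states, hidden_states, attn_mask=attn_mask
        ).squeeze(1)                                            # [batch, hidden]
        cls = hidden_states[:, 0]                               # [CLS]-style token
        # Gated fusion of attention-pooled features and the [CLS] vector.
        g = torch.sigmoid(self.gate(torch.cat([pooled, cls], dim=-1)))
        fused = g * pooled + (1 - g) * cls
        return self.classifier(torch.cat([fused, lang_emb], dim=-1))  # [batch, 6]
```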
## Training Configuration
### Hyperparameters
```python
{
    "batch_size": 32,
    "grad_accum_steps": 2,
    "epochs": 4,
    "lr": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "label_smoothing": 0.01,
    "model_dropout": 0.1,
    "freeze_layers": 2
}
```
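The `freeze_layers` entry is the least obvious setting. One common interpretation is freezing the embedding layer plus the first N encoder layers; the sketch below shows that interpretation for XLM-RoBERTa via Hugging Face Transformers and may not match the repository's exact logic.
```python
from transformers import XLMRobertaModel

model = XLMRobertaModel.from_pretrained("xlm-roberta-large")

def freeze_first_layers(model, freeze_layers=2):
    """Freeze the embeddings and the first `freeze_layers` encoder layers."""
    for p in model.embeddings.parameters():
        p.requires_grad = False
    for layer in model.encoder.layer[:freeze_layers]:
        for p in layer.parameters():
            p.requires_grad = False

freeze_first_layers(model, freeze_layers=2)
```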
### Optimization
- Optimizer: AdamW
- Learning rate scheduler: Cosine with warmup
- Mixed precision: BF16/FP16
- Gradient clipping: 1.0
- Gradient accumulation steps: 2
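Taken together, these settings correspond roughly to the training step sketched below: AdamW with a cosine warmup schedule, BF16 autocast, gradient accumulation over 2 steps, and clipping at 1.0. `model`, `train_loader`, and `compute_loss` are placeholders, and FP16 training would additionally need a `torch.cuda.amp.GradScaler`.
```python
import torch
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

grad_accum_steps, epochs = 2, 4
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
num_updates = (len(train_loader) // grad_accum_steps) * epochs
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_updates),   # warmup_ratio = 0.1
    num_training_steps=num_updates,
)

for epoch in range(epochs):
    for step, batch in enumerate(train_loader):
        # BF16 autocast; FP16 would also require a GradScaler.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = compute_loss(model, batch) / grad_accum_steps
        loss.backward()
        if (step + 1) % grad_accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad(set_to_none=True)
```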
## Metrics and Monitoring
### Training Metrics
- Loss (per language)
- AUC-ROC (macro)
- Precision, Recall, F1
- Language-specific metrics
- Gradient norms
- Memory usage
### Validation Metrics
- AUC-ROC (per class and language)
- Optimal thresholds per language (sketched below)
- Critical class performance (threat, identity_hate)
- Distribution shift monitoring
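A hedged sketch of the per-language threshold search: for each language and class, scan candidate thresholds on validation predictions and keep the one that maximizes F1. Function and argument names are illustrative, not the `threshold_optimizer.py` API.
```python
import numpy as np
from sklearn.metrics import f1_score

def optimize_thresholds(probs, labels, lang_ids, num_languages=7, num_classes=6):
    """Return a [num_languages, num_classes] array of F1-maximizing thresholds."""
    thresholds = np.full((num_languages, num_classes), 0.5)
    candidates = np.linspace(0.05, 0.95, 19)
    for lang in range(num_languages):
        mask = lang_ids == lang
        if mask.sum() == 0:
            continue  # keep the 0.5 default for languages absent from validation
        for cls in range(num_classes):
            y_true = labels[mask, cls]
            y_prob = probs[mask, cls]
            scores = [f1_score(y_true, y_prob >= t, zero_division=0) for t in candidates]
            thresholds[lang, cls] = candidates[int(np.argmax(scores))]
    return thresholds
```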
## Usage
### Training
```bash
python model/train.py
```
### Inference
```python
from model.predict import predict_toxicity
results = predict_toxicity(
    text="Your text here",
    model=model,
    tokenizer=tokenizer,
    config=config
)
```
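If you need to adapt the inference path, the sketch below shows the steps such a helper typically performs: tokenize, run a forward pass with the language ID, apply a sigmoid, then threshold per class. It is an assumption about the general shape of `predict_toxicity`, not its actual signature or implementation.
```python
import torch

LANG2ID = {"en": 0, "ru": 1, "tr": 2, "es": 3, "fr": 4, "it": 5, "pt": 6}
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

@torch.no_grad()
def predict_toxicity_sketch(text, model, tokenizer, lang="en", thresholds=None, device="cuda"):
    enc = tokenizer(text, truncation=True, max_length=128, return_tensors="pt").to(device)
    lang_id = torch.tensor([LANG2ID.get(lang, 0)], device=device)   # English fallback
    logits = model(input_ids=enc["input_ids"],
                   attention_mask=enc["attention_mask"],
                   lang_ids=lang_id)                                # assumed raw logits [1, 6]
    probs = torch.sigmoid(logits).squeeze(0).cpu()
    cut = thresholds if thresholds is not None else torch.full((len(LABELS),), 0.5)
    return {label: {"probability": float(p), "flagged": bool(p >= t)}
            for label, p, t in zip(LABELS, probs, cut)}
```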
## Code Structure
```
model/
├── language_aware_transformer.py   # Core model architecture
├── train.py                        # Training loop and utilities
├── predict.py                      # Inference utilities
├── evaluation/
│   ├── evaluate.py                 # Evaluation functions
│   └── threshold_optimizer.py      # Dynamic threshold optimization
├── data/
│   └── sampler.py                  # Custom sampling strategies
└── training_config.py              # Configuration management
```
## AI/ML Specific Notes
1. **Tensor Shapes** (a shape smoke test is sketched after this list)
   - Input IDs: [batch_size, seq_len]
   - Attention Mask: [batch_size, seq_len]
   - Language IDs: [batch_size]
   - Hidden States: [batch_size, seq_len, hidden_size]
   - Language Embeddings: [batch_size, embed_dim]
2. **Critical Components**
   - Language ID handling in forward pass
   - Attention mask shape management
   - Memory-efficient attention implementation
   - Gradient flow in language-aware components
3. **Performance Considerations**
   - Cache management for processed data
   - Memory pinning for GPU transfers
   - Gradient accumulation for large batches
   - Language-specific dropout rates
4. **Error Handling**
   - Language ID validation
   - Shape compatibility checks
   - Gradient norm monitoring
   - Device placement verification
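The shape contracts in point 1 are where modifications most often go wrong; a small smoke test with dummy tensors, as sketched below, catches mismatches early. `model` stands in for an already constructed LanguageAwareTransformer, and the keyword argument names are assumptions.
```python
import torch

batch_size, seq_len = 4, 128

input_ids = torch.randint(0, 250_002, (batch_size, seq_len))        # [batch_size, seq_len], XLM-R vocab
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)  # [batch_size, seq_len]
lang_ids = torch.randint(0, 7, (batch_size,))                       # [batch_size]

outputs = model(input_ids=input_ids,
                attention_mask=attention_mask,
                lang_ids=lang_ids)
assert outputs.shape == (batch_size, 6), f"unexpected output shape {outputs.shape}"
```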
## Notes for AI Systems
1. When modifying the code:
   - Maintain language ID handling in the forward pass
   - Preserve attention mask shape management
   - Keep device consistency checks
   - Handle `BatchEncoding` serialization under the stricter `torch.load` defaults (`weights_only=True`) in PyTorch 2.6+
2. Key attention points:
   - Language ID tensor shape and dtype
   - Attention mask broadcasting
   - Memory-efficient attention implementation
   - Gradient flow through language-aware components
3. Common pitfalls:
   - Incorrect attention mask shapes
   - Language ID type mismatches
   - Memory leaks in caching
   - Device inconsistencies