OZVERI-636M: Turkish-Centric Decoder-Only Language Model
Model Description
OZVERI-636M is a 636-million-parameter decoder-only Transformer language model trained from scratch using a phase-based curriculum learning approach. The model is optimized for Turkish language understanding with secondary capabilities in English translation, basic Python code comprehension, and limited domain knowledge in religious content.
Key Features:
- 🎯 Turkish-centric design for morphologically rich language
- 📚 Multi-phase curriculum learning strategy
- ⚡ FlashAttention v2 with variable-length sequences
- 🔄 Zero-padding architecture for memory efficiency
- 💻 Code-aware training with Python corpus
- 🌐 Cross-lingual Turkish-English alignment
Previous Work / Series Context
You can interactively test and compare with our previous smaller research model OZVERI-20M here: www.ozveri.com.
Model Details
Model Description
- Developed by: Irfan Yildiz (Independent Researcher)
- Model type: Decoder-only Transformer (Causal Language Model)
- Language(s): Turkish (primary), English (secondary), Arabic script (limited)
- License: Apache 2.0 (Research-Only)
- Finetuned from model: N/A (trained from scratch)
Model Sources
- Repository: [Training code and methodology]
- Paper: [arXiv preprint - Coming soon]
- Weights: Not publicly available (Research documentation only)
Research Purpose
⚠️ Model weights are NOT publicly distributed.
This model card serves as research documentation for:
- Curriculum learning methodology in low-resource language modeling
- Phase-based training strategies and their impact on convergence
- FlashAttention v2 varlen efficiency analysis
- Training infrastructure design for large-scale language models
- Ablation studies on multi-phase data presentation
- Controlled comparison: Variable-length vs. Padded attention mechanisms
Intended Research Applications
This documentation can inform:
- Researchers developing Turkish or low-resource language models
- Studies on curriculum learning effectiveness
- Memory-efficient attention mechanism implementations
- Training stability analysis in multi-phase learning
- Comparative studies on attention mechanisms (varlen vs. padded)
Methodology Replication
Researchers can replicate the training approach using:
- The documented curriculum learning phases
- Phase-specific hyperparameter schedules
- Data preprocessing and tokenization strategies
- FlashAttention v2 integration patterns
- Checkpoint and recovery mechanisms
Bias, Risks, and Limitations
⚠️ Note: This is research documentation. Model weights are not distributed.
Documented Limitations
- Model Scale: 636M parameters - suitable for research on mid-scale models
- Context Length: Maximum 1024 tokens in current implementation
- Training Data: Web-sourced with inherent biases and limitations
- Language Balance: Designed as Turkish-primary with English secondary
- Domain Coverage: Limited in religious/Arabic script content
- Code Understanding: Basic Python comprehension only
Research Considerations
Researchers building upon this methodology should consider:
- Data source diversity and representation
- Phase transition strategies and their impact on model behavior
- Hyperparameter sensitivity across different domains
- Hardware requirements for FlashAttention v2
- Checkpoint frequency trade-offs (storage vs. recovery granularity)
Methodological Transparency
This documentation provides:
- Complete training hyperparameters for reproducibility
- Phase-by-phase curriculum design rationale
- Ablation study results and observations
- Training stability metrics and failure mode analysis
- Memory and compute efficiency measurements
Training Methodology
This section documents the complete training approach for research and reproducibility purposes.
Training Data
Total Corpus: ~4.0B tokens across 418M+ sequences
| Phase | Source Type | Sequences | Tokens | Description |
|---|---|---|---|---|
| Phase 0 | General Turkish | 4.4M | ~696M | Web crawl, general texts |
| Phase 1 | Conversational | 245M | ~1.54B | Subtitle data, dialogue |
| Phase 2 | Code | 21.7M | ~235M | Python programming corpus |
| Phase 3 | Parallel | 147M | ~1.50B | Turkish-English aligned texts |
| Phase 4 | Re-anchor | Mixed | ~696M | General Turkish (stabilization) |
Data Processing:
- Language filtering and deduplication
- Toxic content normalization (not removal)
- Minimum quality thresholds applied
- Character-level encoding validation
Training Procedure
Curriculum Learning Phases
The model follows a structured 5-phase curriculum:
Phase 0: Foundation → General Turkish linguistic base
Phase 1: Conversational → Natural dialogue patterns
Phase 2: Code → Structured, deterministic patterns
Phase 3: Parallel → Cross-lingual alignment
Phase 4: Re-anchor → Catastrophic forgetting prevention
Preprocessing
Tokenization:
- Method: SentencePiece BPE
- Vocabulary size: 64,000
- Byte fallback: Enabled
- Character coverage: 0.9995
Tokenizer Training Mix (Character-Weighted):
- General Turkish: 62% (~620M chars)
- Conversational: 20% (~200M chars)
- Code: 10% (~100M chars)
- Parallel: 8% (~80M chars)
Chunking Strategy:
- Target chunk size: ~1024 tokens
- Overlap between chunks: Limited
- Minimum sequence length filtering applied
- BOS/EOS tokens added to each sequence
Training Hyperparameters
Model Architecture:
- Parameters: 636M
- Layers: 24
- Hidden size: 1280
- Attention heads: 16
- Activation: SiLU
- Position embeddings: Learned
- Weight tying: Enabled (embedding ↔ output)
Optimization:
- Optimizer: AdamW
- Precision: bfloat16 (BF16)
- Gradient clipping: max_norm=1.0
Variant-Specific Settings:
| Setting | Variant A (Varlen) | Variant B (Padded) |
|---|---|---|
| Batch size | 7 | 4 |
| Gradient accumulation | 8 steps | 14 steps |
| Effective batch size | 56 sequences | 56 sequences |
| Attention mechanism | FlashAttention v2 varlen | Standard padded |
Phase-Specific Hyperparameters:
| Phase | Learning Rate | Weight Decay | Dropout | Warmup |
|---|---|---|---|---|
| Phase 0 | 3e-4 | 0.10 | 0.1 | 5% |
| Phase 1 | 2e-4 | 0.08 | 0.1 | 3% |
| Phase 2 | 1.5e-4 | 0.05 | 0.0 | 2% |
| Phase 3 | 1e-4 | 0.05 | 0.1 | 2% |
| Phase 4 | 6e-5 | 0.12 | 0.1 | 1% |
Learning Rate Schedule:
- Type: Cosine annealing with warmup
- Phase-specific warmup ratios
- Monotonically decreasing across phases
Speeds, Sizes, Times
Hardware:
- GPU: NVIDIA A100 (40GB HBM2)
- Platform: Google Colab Pro+
- Parallelization: DDP with NCCL
Training Efficiency (Variant A - Varlen):
- Attention: FlashAttention v2 (varlen mode)
- GPU Memory: 7.17 GB
- Training Speed: 1,740 steps/hour
- Total Training Time: 49.42 hours
- Throughput: ~100 tokens/sec
Training Efficiency (Variant B - Padded):
- Attention: Standard padded causal attention
- GPU Memory: 6.71 GB
- Training Speed: 874 steps/hour
- Total Training Time: 98.43 hours
- Throughput: ~100 tokens/sec
Checkpointing:
- Frequency: Every 2,000 steps
- State saved: Model, optimizer, scheduler, phase info
- Recovery: Full resume capability from any checkpoint
Evaluation
Quick Comparison Summary
Model 1: Padding + Transformers
Model 2: FlashAttention v2 + Varlen
Comprehensive comparison of Padding+Transformers vs FlashAttention v2+Varlen across key metrics
Training Strategy Overview
Performance Summary:
| Strategy | Mean Loss | Perplexity (PPL) | Final Loss | Training Time |
|---|---|---|---|---|
| Padding + Classic Transformer | 3.9959 | 54.37 | 2.9792 | 98.43h |
| FlashAttention v2 + VarLen | 3.8280 | 45.97 | 3.1088 | 49.42h |
| Improvement | -4.2% | -15.4% | +4.3% | -49.8% |
Key Insights:
- ✅ Varlen has 15.4% lower perplexity across training (45.97 vs 54.37)
- ✅ Varlen achieves 4.2% better mean loss (3.8280 vs 3.9959)
- ⚠️ Padded has 4.3% better final loss (2.9792 vs 3.1088)
- ✅ Varlen completes training 49.8% faster (49.42h vs 98.43h)
Variant A: FlashAttention v2 + Varlen (COMPLETED)
Training Stability:
- ✅ Smooth loss convergence across all phases
- ✅ Stable gradient norms throughout training
- ✅ Zero NaN/Inf events after warmup period
- ✅ Successful phase transitions without divergence
- ✅ 430 training steps completed
Final Performance:
- Final Loss: 3.1088
- Mean Loss: 3.8280
- Perplexity (PPL): 45.97
- Loss Reduction: 95.2% from initial loss
- Training Steps: 430
- Total Training Time: 49.42 hours
Efficiency Metrics:
- Training Speed: 1,740 steps/hour (99.2% faster than padded)
- GPU Memory Usage: 7.17 GB
- Throughput: ~100 tokens/sec
- Time Efficiency: 49.8% time savings vs. padded
Training Curves:
Figure 1a: Training loss across all curriculum phases (Varlen)
Figure 2a: Learning rate schedule (Varlen)
Figure 3a: Gradient norm stability throughout training (Varlen)
Figure 4a: GPU memory and throughput analysis (Varlen)
Figure 5a: Step time and efficiency metrics (Varlen)
Variant B: Standard Padded Attention (COMPLETED)
Training Stability:
- ✅ Successful convergence across all phases
- ✅ Stable optimization dynamics
- ✅ 435 training steps completed
- ⚠️ Significantly lower training throughput
- ⚠️ Slightly lower GPU memory usage
Final Performance:
- Final Loss: 2.9792 (4.3% better than varlen)
- Mean Loss: 3.9959
- Perplexity (PPL): 54.37
- Loss Reduction: 95.7% from initial loss
- Training Steps: 435
- Total Training Time: 98.43 hours
Efficiency Metrics:
- Training Speed: 874 steps/hour (baseline)
- GPU Memory Usage: 6.71 GB
- Throughput: ~100 tokens/sec
- Time Efficiency: 2x slower than varlen
Training Curves:
Figure 1b: Training loss across all curriculum phases (Padded)
Figure 2b: Learning rate schedule (Padded)
Figure 3b: Gradient norm stability throughout training (Padded)
Figure 4b: GPU memory and throughput analysis (Padded)
Figure 5b: Step time and efficiency metrics (Padded)
Comprehensive Comparative Analysis
1. Loss Comparison
Figure 6: Direct loss comparison between Varlen and Padded variants
Loss Metrics:
| Metric | Varlen | Padded | Winner |
|---|---|---|---|
| Final Loss | 3.1088 | 2.9792 | ✅ Padded (-4.3%) |
| Mean Loss | 3.8280 | 3.9959 | ✅ Varlen (-4.2%) |
| Loss Reduction | 95.2% | 95.7% | ✅ Padded (+0.5pp) |
Key Observations:
- Varlen maintains better average performance throughout training
- Padded achieves superior final convergence
- Both variants successfully reduce loss by >95%
2. Training Efficiency Comparison
Figure 7: Training speed, memory usage, and throughput comparison
Efficiency Metrics:
| Metric | Varlen | Padded | Improvement |
|---|---|---|---|
| Training Speed | 1,740 steps/h | 874 steps/h | +99.2% |
| Total Time | 49.42 h | 98.43 h | -49.8% |
| GPU Memory | 7.17 GB | 6.71 GB | +6.8% |
| Throughput | ~100 tok/s | ~100 tok/s | Similar |
Key Findings:
- Varlen provides near 2x speedup in training
- Padded uses slightly less GPU memory
- Throughput remains comparable between variants
3. Loss Convergence Rate Comparison
Figure 8: Rate of loss reduction across training phases
Convergence Analysis:
- Early Phase (0-100 steps): Varlen shows faster initial convergence
- Mid Phase (100-300 steps): Similar convergence rates
- Late Phase (300+ steps): Padded achieves slightly better final convergence
Phase-Specific Convergence:
| Phase | Varlen Rate | Padded Rate | Winner |
|---|---|---|---|
| Phase 0 | Faster | Baseline | ✅ Varlen |
| Phase 1 | Similar | Similar | 🤝 Tie |
| Phase 2 | Faster | Baseline | ✅ Varlen |
| Phase 3 | Similar | Similar | 🤝 Tie |
| Phase 4 | Baseline | Better | ✅ Padded |
4. Resource Utilization Comparison
Figure 9: GPU memory, compute efficiency, and resource allocation
Resource Metrics:
| Resource | Varlen | Padded | Efficiency |
|---|---|---|---|
| Peak Memory | 7.17 GB | 6.71 GB | Padded -6.8% |
| Avg Memory | ~7.0 GB | ~6.5 GB | Padded better |
| Compute Time | 49.42h | 98.43h | Varlen -50% |
| Steps/GPU-hour | 1,740 | 874 | Varlen +99% |
Resource Efficiency Score:
- Varlen: High speed, moderate memory → Ideal for rapid iteration
- Padded: Lower memory, slower speed → Ideal for memory-constrained setups
5. Gradient Norm Comparison
Figure 10: Gradient stability and optimization dynamics
Gradient Stability:
| Metric | Varlen | Padded | Observation |
|---|---|---|---|
| Mean Gradient Norm | Stable | Stable | Both stable |
| Gradient Variance | Low | Low | Consistent |
| Spikes/Anomalies | None | None | Clean training |
| Clipping Events | Minimal | Minimal | Well-tuned |
Key Insights:
- Both variants maintain stable gradient norms
- No evidence of gradient explosion or vanishing
- FlashAttention v2 does not introduce gradient artifacts
- Phase transitions handled smoothly in both cases
6. Throughput Over Time Comparison
Figure 11: Token processing throughput across training duration
Throughput Analysis:
| Time Period | Varlen | Padded | Speedup |
|---|---|---|---|
| 0-10h | ~100 tok/s | ~100 tok/s | Similar |
| 10-30h | ~100 tok/s | ~100 tok/s | Similar |
| 30-50h | ~100 tok/s | N/A | Varlen finishes |
| 50-98h | N/A | ~100 tok/s | Padded continues |
Key Observations:
- Consistent throughput maintained throughout training
- Varlen completes entire curriculum in 49.42h
- Padded requires additional 49h to complete same curriculum
- No throughput degradation over time in either variant
Perplexity Analysis
Perplexity Comparison:
| Variant | Perplexity (PPL) | Interpretation |
|---|---|---|
| Varlen | 45.97 | Lower = Better average prediction confidence |
| Padded | 54.37 | Higher perplexity despite better final loss |
| Difference | -15.4% | Varlen has significantly better mean performance |
Why Varlen Has Lower Perplexity Despite Higher Final Loss:
- Mean vs. Final: Perplexity reflects average performance across all training steps
- Varlen Advantage: Better optimization dynamics throughout training
- Padded Trade-off: Slower convergence but better final point
- Practical Implication: Varlen provides more consistent predictions during training
Consolidated Performance Summary
Overall Performance Matrix:
| Dimension | Varlen | Padded | Winner | Magnitude |
|---|---|---|---|---|
| Final Loss | 3.1088 | 2.9792 | Padded | -4.3% |
| Mean Loss | 3.8280 | 3.9959 | Varlen | -4.2% |
| Perplexity | 45.97 | 54.37 | Varlen | -15.4% |
| Training Speed | 1,740 steps/h | 874 steps/h | Varlen | +99.2% |
| Total Time | 49.42h | 98.43h | Varlen | -49.8% |
| GPU Memory | 7.17 GB | 6.71 GB | Padded | -6.8% |
| Loss Reduction | 95.2% | 95.7% | Padded | +0.5pp |
Ablation Studies
Phase Contribution Analysis:
| Experiment | Impact | Observation |
|---|---|---|
| No curriculum | Divergence | Training instability, failed convergence |
| No re-anchor phase | +0.17 loss | Catastrophic forgetting detected |
| No parallel data | Weak alignment | Degraded cross-lingual performance |
| Randomized phase order | High variance | Unstable optimization dynamics |
| Skip conversational phase | Poor dialogue | Reduced natural language fluency |
| Varlen (full pipeline) | Fast + Good | 49.42h, PPL 45.97, loss 3.1088 |
| Padded (full pipeline) | Slow + Optimal | 98.43h, PPL 54.37, loss 2.9792 |
Attention Mechanism Trade-offs:
| Aspect | Varlen | Padded | Winner |
|---|---|---|---|
| Training speed | 1,740 steps/h | 874 steps/h | ✅ Varlen (2x faster) |
| Total time | 49.42h | 98.43h | ✅ Varlen (50% saved) |
| Mean loss | 3.8280 | 3.9959 | ✅ Varlen (4.2% better) |
| Final loss | 3.1088 | 2.9792 | ✅ Padded (4.3% better) |
| Perplexity | 45.97 | 54.37 | ✅ Varlen (15.4% better) |
| GPU memory | 7.17 GB | 6.71 GB | ✅ Padded (6.8% lower) |
| Loss reduction | 95.2% | 95.7% | ✅ Padded (0.5pp better) |
| Implementation | Moderate | Simple | ⚠️ Padded |
Decision Framework
Recommendations by Use Case:
| Use Case | Recommended | Rationale |
|---|---|---|
| Research & Prototyping | Varlen | 2x faster iteration, 15.4% better PPL |
| Production Deployment | Padded | 4.3% better final loss (2.9792) |
| Resource-Limited | Varlen | Complete training in half the time |
| Quality-Critical | Padded | Optimal final convergence |
| Memory-Constrained | Padded | 6.8% lower GPU memory usage |
| Time-Critical | Varlen | 49.8% faster training completion |
| Average Performance | Varlen | 4.2% better mean loss, 15.4% better PPL |
| Hybrid Approach | Varlen → Padded | Fast pre-training + final polishing |
Selection Criteria:
IF time_critical OR rapid_prototyping:
→ Use Varlen (2x faster, acceptable quality)
ELIF quality_critical AND time_available:
→ Use Padded (4.3% better final loss)
ELIF memory_limited:
→ Use Padded (6.8% lower memory)
ELSE:
→ Use Varlen (better mean performance, faster iteration)
Hybrid Strategy:
- Phase 1: Train with Varlen for 80% of budget (fast convergence)
- Phase 2: Switch to Padded for final 20% (polish to optimal loss)
- Benefit: Combine speed of Varlen with final quality of Padded
Statistical Significance
Training Stability Metrics:
| Metric | Varlen | Padded | Significance |
|---|---|---|---|
| Loss std deviation | Low | Low | Both stable |
| Gradient variance | Minimal | Minimal | Both stable |
| Phase transition smoothness | Smooth | Smooth | Equivalent |
| Convergence consistency | High | High | Both reliable |
Quality Gap Analysis:
- Final loss difference: 0.13 points (3.1088 vs 2.9792)
- Mean loss difference: 0.17 points (3.8280 vs 3.9959)
- Perplexity difference: 8.40 points (45.97 vs 54.37)
- Conclusion: Varlen's speed advantage outweighs small final loss gap
Environmental Impact
Hardware:
- Hardware Type: NVIDIA A100 GPU (40GB)
- Cloud Provider: Google Cloud (via Colab Pro+)
- Compute Region: [if known]
Training Duration:
- Variant A (Varlen): 49.42 hours
- Variant B (Padded): 98.43 hours
- Time Savings: 49.01 hours (49.8% reduction)
Energy Efficiency:
- FlashAttention v2 completes training in half the time
- Reduced total compute hours lower carbon footprint
- Estimated energy savings: ~49 GPU-hours
- Carbon reduction: ~50% fewer emissions for Varlen
Efficiency Measures:
- FlashAttention v2 reduces training time by 50%
- Gradient accumulation reduces communication overhead
- Early stopping and phase-based training prevent unnecessary computation
- Memory-efficient architectures enable larger effective batch sizes
Sustainability Implications:
- Varlen enables more sustainable AI research practices
- 2x training speed allows more experiments with same carbon budget
- Lower total compute hours reduce environmental impact
- Faster iteration cycles improve research productivity per watt
Technical Specifications
Model Architecture and Objective
Architecture: Decoder-only Transformer (GPT-style)
Key Components:
- Multi-head causal self-attention (16 heads)
- SiLU-activated feed-forward networks
- Pre-layer normalization
- Learned positional embeddings
- Weight tying between input/output embeddings
Training Objective: Next-token prediction (causal language modeling)
Loss Function: Cross-entropy over vocabulary
Compute Infrastructure
Training Infrastructure:
- Cloud platform: Google Colab Pro+
- GPU: NVIDIA A100 (40GB)
- Framework: PyTorch 2.x
- Distributed: DDP with NCCL backend
Optimization Features:
- Mixed precision training (BF16)
- Gradient accumulation (8 or 14 steps, variant-dependent)
- Automatic gradient clipping
- Atomic checkpointing system
- Emergency recovery mechanisms
Software
Core Dependencies:
- PyTorch >= 2.0
- Transformers >= 4.30
- Flash-Attention >= 2.0 (Variant A only)
- SentencePiece >= 0.1.99
Training Framework:
- Custom curriculum learning pipeline
- Streaming JSONL dataset loader
- Phase-aware hyperparameter controller
- Fault-tolerant checkpoint manager
Citation
BibTeX:
@article{yildiz2026ozveri636m,
title = {OZVERI-636M: Curriculum-Phased Training of Large Language Models ---
A {\textasciitilde}4B-Token Controlled Study of Efficiency--Quality
Trade-offs between FlashAttention v2 and Standard Attention},
author = {Yildiz, Irfan},
journal = {arXiv preprint},
volume = {arXiv:XXXX.XXXXX},
year = {2026},
note = {Independent Research}
}
APA:
Yildiz, I. (2026). OZVERI-636M: Curriculum-phased training of large language models—A ~4B-token controlled study of efficiency–quality trade-offs between FlashAttention v2 and standard attention. arXiv. https://arxiv.org/abs/XXXX.XXXXX
Model Card Authors
Irfan Yildiz - Independent Researcher
Model Card Contact
For research inquiries and methodology questions:
- GitHub: [Repository - Training methodology and documentation]
- Email: irfan34yildiz@gmail.com
- Research Documentation: https://huggingface.co/irfantr/OZVERI-636M
Note: Model weights are not distributed. This page serves as research documentation for the training methodology, curriculum learning approach, and experimental findings.
Glossary
- Curriculum Learning: Structured training where data is presented in phases of increasing complexity
- FlashAttention: Memory-efficient attention mechanism optimized for GPU architecture
- Varlen: Variable-length sequences without padding
- BPE: Byte Pair Encoding, a subword tokenization method
- Re-anchoring: Final training phase to prevent catastrophic forgetting
- Catastrophic Forgetting: Loss of previously learned information when learning new tasks
- Padding: Adding tokens to sequences to make them equal length (memory inefficient)
- Perplexity (PPL): Measure of prediction confidence (lower = better)
- Mean Loss: Average loss across all training steps
Additional Information
Research Documentation Purpose
This model card serves as comprehensive documentation of:
- A 636M parameter Turkish-centric language model training methodology
- Phase-based curriculum learning implementation
- FlashAttention v2 varlen efficiency analysis
- Multi-phase training stability and convergence patterns
- Ablation studies on curriculum design choices
- Controlled comparison of varlen vs. padded attention mechanisms
- Detailed performance metrics including loss, perplexity, and efficiency
Model weights are intentionally not distributed. This documentation aims to contribute to:
- Research on low-resource language modeling
- Curriculum learning methodology development
- Training infrastructure optimization
- Memory-efficient attention mechanisms
Comparative Studies
Completed Experiments:
- ✅ OZVERI-636M-Variant-A (Varlen): 49.42h, PPL 45.97, final loss 3.1088
- ✅ OZVERI-636M-Variant-B (Padded): 98.43h, PPL 54.37, final loss 2.9792
Controlled Variables:
- Identical training data and curriculum phases
- Same hyperparameter schedules (LR, weight decay, dropout, warmup)
- Equal token budget (~4.0B tokens)
- Identical model architecture (636M parameters)
- Same hardware (NVIDIA A100 40GB)
Measured Differences:
| Dimension | Varlen | Padded | Result |
|---|---|---|---|
| Training time | 49.42h | 98.43h | -49.8% |
| Steps/hour | 1,740 | 874 | +99.2% |
| Mean loss | 3.8280 | 3.9959 | -4.2% |
| Final loss | 3.1088 | 2.9792 | +4.3% |
| Perplexity | 45.97 | 54.37 | -15.4% |
| GPU memory | 7.17 GB | 6.71 GB | +6.8% |
| Loss reduction | 95.2% | 95.7% | -0.5pp |
Primary Findings:
- Speed: Varlen achieves 2x training throughput
- Time: 50% reduction in wall-clock training time
- Mean Performance: Varlen has 4.2% better mean loss and 15.4% better perplexity
- Final Quality: Padded has 4.3% better final loss
- Memory: Padded uses 6.8% less GPU memory
- Trade-off: Speed and mean performance (Varlen) vs. final quality (Padded)
Implementation Details Available
Researchers can reference:
- Complete hyperparameter schedules
- Phase transition strategies
- Data preprocessing pipelines
- Tokenizer training methodology
- Checkpoint and recovery systems
- FlashAttention v2 integration patterns
- Varlen batching implementation
- Comparative performance metrics
Framework Versions
- Transformers: 4.36.0
- PyTorch: 2.1.0
- Flash-Attention: 2.3.0
- SentencePiece: 0.1.99
Acknowledgments
- FlashAttention v2 by Tri Dao et al.
- Google Colab Pro+ for computational resources
- PyTorch and Hugging Face communities
License: Apache 2.0 (Research-Only Restriction)
Copyright © 2026 Irfan Yildiz
This model is provided for research and educational purposes only. Commercial use, redistribution of weights, or derivative commercial products require explicit written permission.
Last Updated: January 2026
Evaluation results
- final_loss_varlen on Turkish Multi-Phase Corpusself-reported3.109
- final_loss_padded on Turkish Multi-Phase Corpusself-reported2.979
- mean_loss_varlen on Turkish Multi-Phase Corpusself-reported3.828
- mean_loss_padded on Turkish Multi-Phase Corpusself-reported3.996
- perplexity_varlen on Turkish Multi-Phase Corpusself-reported45.970
- perplexity_padded on Turkish Multi-Phase Corpusself-reported54.370