Update README.md
Our training methodology combined multiple data sources and validation strategies.

### Data Pipeline (5-day development cycle)

**Phase 1: Initial Generation**
- Few-shot generation using base Llama 3.1
- Context-aware synthetic examples
- Balanced across all six sentiment categories
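The few-shot generation in Phase 1 can be sketched roughly as follows. The seed sentences, labels, and the `build_prompt` helper are illustrative assumptions, not the actual generation script:

```python
# Sketch of few-shot prompt assembly for the base model; the seed
# examples and label names below are hypothetical, not from the dataset.
FEW_SHOT_EXAMPLES = [
    ("Gaudium magnum nuntio vobis.", "joy"),
    ("Timor et tremor venerunt super me.", "fear"),
]

def build_prompt(sentence: str) -> str:
    """Assemble a few-shot sentiment-classification prompt."""
    lines = ["Classify the sentiment of each Latin sentence."]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Sentence: {text}\nSentiment: {label}")
    # Leave the final label blank for the model to complete.
    lines.append(f"Sentence: {sentence}\nSentiment:")
    return "\n\n".join(lines)
```

Sampling the base model's completion of the trailing `Sentiment:` slot then yields a synthetic labeled example.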

**Phase 2: Consensus Filtering**
- Trained multiple LoRA variants on hand-annotated data
- Consensus filtering: kept examples where ≥2 models agreed
- Reduced noise and improved training data quality
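The consensus step amounts to a majority check across the LoRA variants' predictions; a minimal sketch (the function name and input shape are assumptions):

```python
from collections import Counter

def consensus_filter(predictions: dict, min_agree: int = 2) -> dict:
    """Keep examples where at least `min_agree` model labels coincide.

    `predictions` maps each text to the list of labels the LoRA
    variants predicted for it; the winning label is kept.
    """
    kept = {}
    for text, labels in predictions.items():
        label, count = Counter(labels).most_common(1)[0]
        if count >= min_agree:
            kept[text] = label
    return kept
```

Examples on which the variants disagree entirely are simply dropped, which is where the noise reduction comes from.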

**Phase 3: Corpus Mining**
- Mined authentic Ancient Latin texts from Perseus Digital Library
- Extracted high-confidence positive examples (previously underrepresented)
- Combined ~40,000 corpus examples with synthetic data
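Extracting the high-confidence positives from the mined corpus could look like the sketch below; the 0.9 threshold and the names are illustrative, not taken from the actual pipeline:

```python
def mine_high_confidence(scored, target="positive", threshold=0.9):
    """Keep corpus sentences the classifier assigns `target` with
    probability at or above `threshold` (values are assumptions)."""
    return [(text, label) for text, label, prob in scored
            if label == target and prob >= threshold]
```

Low-confidence or off-target sentences stay out, so only reliably positive material tops up the underrepresented category.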

**Phase 4: Final Training & Iteration**
- Balanced dataset: 9,000 examples (1,500 per category)
- Distributed training with data-parallel strategy
- Multiple training runs to optimize hyperparameters
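Balancing to 1,500 examples per category is a straightforward per-label downsample; a sketch under assumed names:

```python
import random
from collections import defaultdict

def balance_dataset(examples, per_category=1500, seed=0):
    """Downsample (text, label) pairs to `per_category` per label.

    A fixed seed keeps the sample reproducible across runs.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        balanced.extend(items[:per_category])
    return balanced
```

With six categories at 1,500 each, this yields the 9,000-example training set described above.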

### Final Training Configuration

- **Training Examples:** 9,000 (balanced across all six sentiment categories)
- **Training Epochs:** 15
- **Architecture:** LoRA adapter (rank: 128, alpha: 256)
- **Optimization:** 8-bit quantization for efficiency
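For reference, the configuration arithmetic can be summarized as below; only the numeric values come from the list above, the variable names are assumptions:

```python
# Illustrative summary of the final run's hyperparameters.
LORA_RANK = 128
LORA_ALPHA = 256           # alpha = 2 * rank, a common scaling choice
NUM_EPOCHS = 15
NUM_EXAMPLES = 9_000
PER_CATEGORY = 1_500
NUM_CATEGORIES = NUM_EXAMPLES // PER_CATEGORY  # six sentiment categories
```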