TronCodes committed
Commit cf47f8d · verified · 1 parent: 0b2d80b

Update README.md

Files changed (1): README.md (+5 −5)
```diff
@@ -100,28 +100,28 @@ Our training methodology combined multiple data sources and validation strategies
 
 ### Data Pipeline (5-day development cycle)
 
-**Phase 1: Initial Generation (Days 1-2)**
+**Phase 1: Initial Generation**
 - Few-shot generation using base Llama 3.1
 - Context-aware synthetic examples
 - Balanced across all six sentiment categories
 
-**Phase 2: Consensus Filtering (Day 2-3)**
+**Phase 2: Consensus Filtering**
 - Trained multiple LoRA variants on hand-annotated data
 - Consensus filtering: kept examples where ≥2 models agreed
 - Reduced noise and improved training data quality
 
-**Phase 3: Corpus Mining (Day 3-4)**
+**Phase 3: Corpus Mining**
 - Mined authentic Ancient Latin texts from Perseus Digital Library
 - Extracted high-confidence positive examples (previously underrepresented)
 - Combined ~40,000 corpus examples with synthetic data
 
-**Phase 4: Final Training & Iteration (Days 4-6)**
+**Phase 4: Final Training & Iteration**
 - Balanced dataset: 9,000 examples (1,500 per category)
 - Distributed training with data-parallel strategy
 - Multiple training runs to optimize hyperparameters
 
 ### Final Training Configuration
-- **Training Examples:** 9,000 (balanced across 6 categories)
+- **Training Examples:** 9,000 (balanced across 7 categories)
 - **Training Epochs:** 15
 - **Architecture:** LoRA adapter (rank: 128, alpha: 256)
 - **Optimization:** 8-bit quantization for efficiency
```
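Phase 1's few-shot generation can be sketched as a prompt-assembly helper. The README does not show the actual prompt sent to Llama 3.1, so the template, function name, and labels below are illustrative assumptions only:

```python
def build_fewshot_prompt(shots, query):
    """Assemble a few-shot sentiment-classification prompt for a base LLM.

    `shots` is a list of (latin_text, label) pairs used as in-context
    examples; `query` is the text to classify. Hypothetical format --
    not the prompt actually used in the pipeline.
    """
    lines = ["Classify the sentiment of the Latin sentence."]
    for text, label in shots:
        # Each in-context example pairs a text with its gold label.
        lines.append(f"Text: {text}\nSentiment: {label}")
    # The final entry leaves the label blank for the model to complete.
    lines.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(lines)
```

In practice such a prompt would be sent once per synthetic example, with the shot set rotated to keep the six categories balanced.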
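Phase 2's consensus rule (keep an example only when at least two of the LoRA variants assign it the same label) can be sketched in a few lines. The function and data layout below are assumptions; the README describes only the ≥2-agreement criterion:

```python
from collections import Counter

def consensus_filter(examples, model_labels, min_agree=2):
    """Keep examples where >= `min_agree` models assign the same label.

    `examples` is a list of texts; `model_labels` holds one label list
    per model, each aligned with `examples`. Returns (text, label)
    pairs labeled with the majority vote. Hypothetical helper, not the
    authors' actual pipeline code.
    """
    kept = []
    for i, text in enumerate(examples):
        votes = Counter(labels[i] for labels in model_labels)
        label, count = votes.most_common(1)[0]
        if count >= min_agree:
            kept.append((text, label))
    return kept
```

Examples on which every model disagrees are dropped, which is the noise-reduction step the commit describes.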
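Phase 4's balancing step (1,500 examples per category out of the ~40,000 mined plus synthetic pool) amounts to per-label downsampling. A minimal sketch, with hypothetical names since the actual code is not shown:

```python
import random
from collections import defaultdict

def balance_dataset(examples, per_category=1500, seed=0):
    """Downsample to at most `per_category` examples per label.

    `examples` is a list of (text, label) pairs; labels with fewer
    items than the cap are kept whole. Illustrative only -- not the
    authors' actual sampling code.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    balanced = []
    for label, items in sorted(by_label.items()):
        rng.shuffle(items)  # random subset rather than the first N
        balanced.extend(items[:per_category])
    return balanced
```

With six categories and `per_category=1500` this yields the 9,000-example training set the configuration lists.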
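The final training configuration maps onto a standard PEFT setup. Only the rank (128), alpha (256), and 8-bit quantization come from the README; the target modules, dropout, and model checkpoint name below are assumptions:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 8-bit quantization for memory efficiency (stated in the README).
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

# LoRA rank and alpha from the README; the rest are assumed defaults.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,                    # assumption
    target_modules=["q_proj", "v_proj"],  # assumption
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # assumed base checkpoint
    quantization_config=bnb_config,
)
model = get_peft_model(model, lora_config)
```

An alpha of 2× the rank keeps the effective adapter scaling (alpha / r) at 2, a common choice when training LoRA adapters at higher ranks.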