A multi-task sentence embedding model that uses **Reinforcement Learning** to dynamically adjust task weights during training.

FireDevourerEmbedder introduces an **RL-based adaptive task weighting system** that automatically adjusts the importance of each training task based on validation performance. Instead of using fixed task weights, a policy network learns optimal weight distributions during training, leading to better overall performance across diverse NLU benchmarks.
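The card does not include the policy-network code, but the idea can be illustrated. Below is a toy sketch (hypothetical `AdaptiveTaskWeighter` class, a simple logit-nudging update standing in for the actual RL policy) of weights shifting toward tasks whose validation scores are still improving:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

class AdaptiveTaskWeighter:
    """Toy stand-in for the RL policy: keeps one logit per task and
    nudges logits toward tasks whose validation score improved."""

    def __init__(self, task_names, lr=0.5):
        self.task_names = list(task_names)
        self.logits = np.zeros(len(self.task_names))
        self.prev_scores = None
        self.lr = lr

    def weights(self):
        # Current task-weight distribution (always sums to 1).
        return dict(zip(self.task_names, softmax(self.logits)))

    def update(self, val_scores):
        # Reward each task by its validation improvement since the last step.
        scores = np.array([val_scores[t] for t in self.task_names])
        if self.prev_scores is not None:
            reward = scores - self.prev_scores
            # Upweight tasks that are still improving relative to the rest.
            self.logits += self.lr * (reward - reward.mean())
        self.prev_scores = scores

weighter = AdaptiveTaskWeighter(["stsb", "mnli", "qqp", "paws", "mrpc"])
weighter.update({"stsb": 0.80, "mnli": 0.70, "qqp": 0.85, "paws": 0.60, "mrpc": 0.75})
weighter.update({"stsb": 0.81, "mnli": 0.75, "qqp": 0.85, "paws": 0.62, "mrpc": 0.75})
print(weighter.weights())  # mnli gets the largest weight: it improved most
```

The actual policy network may of course use a different reward (e.g. upweighting struggling tasks); this only shows the general shape of validation-driven weight adaptation.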
## Why Multi-Task? Information-Dense Embeddings

The core philosophy behind FireDevourerEmbedder is that **multi-task learning creates richer, more information-dense embeddings** than single-task approaches.

By training with multiple task heads simultaneously, the shared encoder is forced to learn representations that capture:

| Dimension | Learned From | What It Captures |
|-----------|--------------|------------------|
| **Semantic Similarity** | STS-B | Fine-grained meaning overlap |
| **Logical Relationships** | MultiNLI | Entailment, contradiction, neutrality |
| **Question Semantics** | QQP | Intent and duplicate detection |
| **Adversarial Patterns** | PAWS | Word-order sensitivity, paraphrase robustness |
| **Domain Awareness** | All datasets | Context-appropriate representations |

This results in embeddings that are:

- **More robust** - trained to handle diverse linguistic phenomena
- **More transferable** - generalize better to unseen tasks
- **More informative** - each dimension of the embedding vector carries meaningful semantic signal

Unlike single-task embedders that optimize for one objective, FireDevourerEmbedder's embeddings simultaneously encode multiple facets of meaning, making them suitable for a wide range of downstream applications without fine-tuning.
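The shared-encoder-with-task-heads design behind this claim reduces to a few lines. The following is a toy sketch with made-up shapes and random weights, not the real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared encoder: every task reads the same embedding, so gradients from
# all task heads shape a single representation space.
W_enc = rng.normal(size=(32, 8))     # toy "encoder": 32-dim input -> 8-dim embedding

def encode(x):
    return np.tanh(x @ W_enc)        # shared sentence embedding

# One lightweight head per task, all consuming the shared embedding.
heads = {
    "stsb": rng.normal(size=(8, 1)),  # regression: similarity score
    "mnli": rng.normal(size=(8, 3)),  # classification: entail/neutral/contradict
    "qqp":  rng.normal(size=(8, 2)),  # classification: duplicate or not
}

x = rng.normal(size=(4, 32))          # a batch of 4 toy "sentence" features
emb = encode(x)
outputs = {task: emb @ W for task, W in heads.items()}

# A (possibly RL-weighted) scalar loss combines all tasks into one objective,
# so the shared encoder is optimized for every task at once.
task_weights = {"stsb": 0.4, "mnli": 0.35, "qqp": 0.25}
total_loss = sum(task_weights[t] * np.mean(outputs[t] ** 2) for t in heads)
print(emb.shape, total_loss)
```

Because the encoder weights are shared, each task head's gradient pushes the same embedding space, which is what makes the resulting vectors multi-faceted.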
## Model Details

| Property | Value |
|----------|-------|

The model was trained on 5 balanced datasets with 100,000 samples each (500,000 total):

| Dataset | Task | Domain | Samples |
|---------|------|--------|---------|
| [PAWS](https://huggingface.co/datasets/google-research-datasets/paws) | Paraphrase Detection | Adversarial | 100,000 |
| [MRPC](https://huggingface.co/datasets/nyu-mll/glue) | Paraphrase Detection | News | 100,000 |

### Data Augmentation Strategy

To prevent training bias, all datasets were balanced to exactly **100,000 samples** each:

| Dataset | Original Size | Augmentation Method |
|---------|---------------|---------------------|
| STS-B | ~8,600 | Repetition (~12x) + pair swapping |
| MultiNLI | ~433,000 | Subsampling |
| QQP | ~400,000 | Subsampling |
| PAWS | ~49,000 | Repetition (~2x) + pair swapping |
| MRPC | ~3,600 | Repetition (~10x, capped) + pair swapping |

**Why this matters:**

- Without balancing, larger datasets (QQP, MultiNLI) would dominate training
- Smaller but valuable datasets (MRPC, STS-B) would be underrepresented
- Equal representation ensures the model learns equally from all task types

**Augmentation techniques:**

- **Repetition**: smaller datasets are repeated, with a 10x cap to limit memorization of duplicated pairs
- **Sentence pair swapping**: for symmetric tasks, each (A, B) pair is also trained as (B, A)

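The balancing recipe above can be sketched in a short helper (hypothetical `balance_pairs` function, not the actual training pipeline):

```python
import random

def balance_pairs(pairs, target, symmetric=True, max_repeats=10, seed=0):
    """Balance a list of (sent_a, sent_b, label) examples toward `target`:
    large datasets are subsampled; small ones get swapped (B, A) copies for
    symmetric tasks, then are repeated up to `max_repeats` times."""
    rng = random.Random(seed)
    if len(pairs) >= target:
        return rng.sample(pairs, target)           # subsampling (QQP, MultiNLI)
    pool = list(pairs)
    if symmetric:                                   # pair swapping (STS-B, PAWS, MRPC)
        pool += [(b, a, label) for a, b, label in pairs]
    repeats = min(max_repeats, -(-target // len(pool)))  # ceil division, capped
    pool = pool * repeats
    rng.shuffle(pool)
    return pool[:target]

small = [(f"a{i}", f"b{i}", 1.0) for i in range(3600)]   # MRPC-sized toy set
balanced = balance_pairs(small, target=100_000)
print(len(balanced))  # 72000: with the 10x cap, a set this small stays below target
```

Note how the repetition cap interacts with the target: an MRPC-sized set cannot quite reach 100,000 under a strict 10x cap, which is presumably why the card lists its repetition as "~10x, capped".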
### Training Configuration

| Parameter | Value |
|-----------|-------|