---
license: apache-2.0
library_name: transformers
tags:
- dllm
- diffusion
- llm
- text_generation
---

# LLaDA2.0-mini-CAP

**LLaDA2.0-mini-CAP** is an enhanced version of LLaDA2.0-mini that incorporates **Confidence-Aware Parallel (CAP) Training** for significantly improved inference efficiency. Built on the 16B-A1B Mixture-of-Experts (MoE) diffusion architecture, the model achieves faster parallel decoding while maintaining strong performance across diverse benchmarks.

---

## 📊 Performance Comparison

### Efficiency vs. Quality Trade-off

| Model | Average Score | Tokens/Forward (TPF) | Speedup |
| :---: | :---: | :---: | :---: |
| LLaDA2.0-mini | 70.15 | 2.55 | 1.0× |
| **LLaDA2.0-mini-CAP** | **67.32** | **3.72** | **1.46×** |

_Evaluated on 12 diverse benchmarks covering knowledge, reasoning, coding, and mathematics._
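The speedup column is simply the ratio of tokens committed per forward pass, so the table can be sanity-checked with the TPF numbers above (illustrative arithmetic only):

```python
# TPF = average tokens committed per model forward pass.
# Speedup over the base model is the ratio of the two TPF values.
base_tpf = 2.55  # LLaDA2.0-mini
cap_tpf = 3.72   # LLaDA2.0-mini-CAP
speedup = cap_tpf / base_tpf
print(f"{speedup:.2f}x")  # -> 1.46x
```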

### Key Insights

+ **1.46× faster generation** with only a 2.83-point drop in average benchmark score
+ Ideal for latency-sensitive applications requiring real-time responses
+ Maintains competitive accuracy across all task categories

---

## 🔬 What is CAP Training?

**Confidence-Aware Parallel (CAP) Training** is a training technique designed to improve parallel-decoding efficiency in diffusion language models.

### Technical Overview

The training objective combines two complementary losses:

$$ \mathcal{L}(\theta) = \mathcal{L}_{\text{SFT}}(\theta) + \lambda \, \mathcal{L}_{\text{conf}}(\theta) $$

Where:

+ $\mathcal{L}_{\text{SFT}}$: supervised fine-tuning loss that ensures prediction correctness
+ $\mathcal{L}_{\text{conf}}$: confidence loss that minimizes entropy only for correctly predicted tokens
+ $\lambda$: hyperparameter balancing the two objectives

### Why CAP Works

1. **Sharpens correct predictions**: Standard training ensures correctness but offers diminishing incentive to increase confidence on already-correct tokens; CAP explicitly optimizes for high-confidence predictions.
2. **Enables aggressive parallelism**: Higher confidence allows the model to decode multiple tokens simultaneously with greater reliability, reducing the total number of forward passes needed.
3. **Selective optimization**: By applying the confidence loss only to correct predictions, CAP avoids penalizing the model's exploration of uncertain outputs.

---
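As a concrete illustration of the objective above, here is a minimal, dependency-free sketch of the combined loss on a toy vocabulary. The function name `cap_loss` and the per-token list layout are our own for illustration; the actual training code operates on batched logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cap_loss(logits_per_token, targets, lam=0.5):
    """L = L_SFT + lambda * L_conf: cross-entropy on every token, plus an
    entropy penalty applied only where the argmax already matches the
    target (i.e. the prediction is correct)."""
    sft = conf = 0.0
    for logits, t in zip(logits_per_token, targets):
        p = softmax(logits)
        sft += -math.log(p[t])  # L_SFT term (cross-entropy)
        if max(range(len(p)), key=p.__getitem__) == t:
            # Token predicted correctly: push its entropy down.
            conf += -sum(q * math.log(q) for q in p if q > 0)
    n = len(targets)
    return sft / n + lam * conf / n
```

Sharper distributions on correct tokens lower both terms, which is what later lets the sampler commit several high-confidence tokens per forward pass; tokens that are still wrong contribute no entropy term, so their uncertainty is not penalized twice.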

## 📦 Model Variants

| Model ID | Description | Hugging Face Link |
| --- | --- | --- |
| `inclusionAI/LLaDA2.0-mini-CAP` | CAP-enhanced model optimized for fast inference | [🤗 Model Card](https://huggingface.co/inclusionAI/LLaDA2.0-mini-CAP) |
| `inclusionAI/LLaDA2.0-mini` | Base instruction-tuned model | [🤗 Model Card](https://huggingface.co/inclusionAI/LLaDA2.0-mini) |

---

## 📖 Model Overview

**LLaDA2.0-mini-CAP** inherits the architecture of LLaDA2.0-mini:

+ **Type**: Mixture-of-Experts (MoE) diffusion language model with CAP training
+ **Total Parameters (Non-Embedding)**: 16B
+ **Number of Layers**: 20
+ **Attention Heads**: 16
+ **Context Length**: 32,768 tokens
+ **Position Embedding**: Rotary (RoPE)
+ **Vocabulary Size**: 157,184
+ **Training Enhancement**: Confidence-Aware Parallel (CAP) Training

---

## 💻 Usage

### 🤗 Hugging Face Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/LLaDA2.0-mini-CAP"
device = "cuda:0"
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map=device
)
model = model.to(torch.bfloat16)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "Why does Camus think that Sisyphus is happy?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
).to(device)  # keep inputs on the same device as the model
generated_tokens = model.generate(
    inputs=input_ids,
    eos_early_stop=True,
    gen_length=512,
    block_length=32,
    steps=32,
    temperature=0.0,
)
generated_answer = tokenizer.decode(
    generated_tokens[0],
    skip_special_tokens=True,
)
print(generated_answer)
```
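The `block_length` and `steps` arguments above control block-wise parallel decoding. The selection rule that CAP training accelerates can be sketched as follows; this is illustrative only, and `select_unmask`, the confidence list, and the 0.9 threshold are our assumptions, not the model's actual decoding code:

```python
def select_unmask(confidences, threshold=0.9):
    """One decoding step over still-masked positions: commit every
    position whose predicted-token confidence clears the threshold,
    falling back to the single most confident position so each
    forward pass always commits at least one token."""
    chosen = [i for i, c in enumerate(confidences) if c >= threshold]
    if not chosen:
        chosen = [max(range(len(confidences)), key=confidences.__getitem__)]
    return chosen

# A CAP-trained model yields more high-confidence positions per step,
# so more tokens are committed per forward pass (higher TPF).
print(select_unmask([0.95, 0.31, 0.92, 0.88]))  # -> [0, 2]
print(select_unmask([0.42, 0.61, 0.35]))        # -> [1]
```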

### Best Practices

To achieve optimal performance, we recommend the following settings:

1. **Sampling Parameters**:
   We suggest `temperature=0.0`, `block_length=32`, and `steps=32`. Higher temperature values may occasionally cause language mixing and a slight drop in model performance.
2. **Adequate Output Length**:
   We recommend an output length of 32,768 tokens for most queries.

---

## 📄 License

This project is licensed under the terms of the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

---

## 🤝 Contact & Collaboration

For questions, collaborations, or feedback, please reach out via [Hugging Face](https://huggingface.co/inclusionAI/LLaDA2.0-mini-CAP) or open an issue in the [repository](https://github.com/inclusionAI).

🌟 Join us in advancing open, efficient, and intelligent language models!

---