---
library_name: transformers
license: mit
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---
# CoLaR Model

<div align="center">

[![HuggingFace](https://img.shields.io/badge/🤗%20HuggingFace-Model-fcc21b?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/ModalityDance/latent-tts-colar)

</div>

## Overview

**CoLaR** (Continuous Latent Reasoning) is a latent reasoning model based on LLaMA that uses a specialized LatentHead module for generating continuous latent representations. This model is part of the [Parallel Test-Time Scaling for Latent Reasoning Models](https://arxiv.org/abs/2510.07745) framework.

## Model Details

- **Base Architecture**: LLaMA language model
- **Model Class**: `ColarLlama` (extends `LlamaForCausalLM`)
- **Special Features**: LatentHead module for latent-space generation
- **Latent Tokens**: Uses the special token `<|latent|>` for latent reasoning
- **End Token**: Uses `###` as the end-of-latent marker
- **Input Format**: Direct input format with latent tokens (illustrated in the snippet below)

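The snippet below illustrates the input format; it is a sketch that assumes the model has already been downloaded to `checkpoints/colar` as described under Installation:

```python
from transformers import AutoTokenizer

# The question is followed directly by the <|latent|> trigger token;
# "###" later marks the end of the latent segment in the output.
tokenizer = AutoTokenizer.from_pretrained("checkpoints/colar")
prompt = "What is 2 + 2?<|latent|>"
tokens = tokenizer.convert_ids_to_tokens(tokenizer(prompt).input_ids)
print(tokens[-1])  # expected: '<|latent|>' as a single special token
```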

## Related Models

The [ModalityDance/latent-tts](https://huggingface.co/collections/ModalityDance/latent-tts) collection includes other latent reasoning models that you might find useful.

## Installation

Download the model from HuggingFace:

```bash
huggingface-cli download ModalityDance/latent-tts-colar --local-dir checkpoints/colar
```

## Quick Start

### Basic Usage

```python
import torch
from transformers import AutoTokenizer
from src.generation_mixin import LatentGenerationMixin, LatentGenerationConfig
from src.paths import MODELS

# Load tokenizer
model_id = "checkpoints/colar"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Get latent token IDs (useful for inspecting or post-processing outputs)
latent_id = tokenizer.convert_tokens_to_ids("<|latent|>")
end_id = tokenizer.convert_tokens_to_ids("###")

# Create model class with generation mixin
class LatentCoLaR(MODELS["colar"]["class"], LatentGenerationMixin):
    pass

# Load model
model = LatentCoLaR.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # Recommended for LLaMA models
)

# Prepare input
question = "What is 2 + 2?<|latent|>"
inputs = tokenizer(question, return_tensors="pt").to(model.device)

# Configure generation
generation_config = LatentGenerationConfig(
    max_new_tokens=128,
    max_latent_length=64,  # CoLaR uses max_latent_length instead of latent_length
    latent_do_sample=True,
    latent_do_sample_by="dropout",  # or "noise"
    dropout_p=0.1,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Generate
output = model.generate(
    **inputs,
    generation_config=generation_config,
    num_return_sequences=1,
)

# Decode result
result = tokenizer.decode(output[0], skip_special_tokens=True)
print(result)
```
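
The decoded text should contain the model's answer after an `Answer:` prefix (see Answer Extraction below).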

### Batch Processing

The model fully supports batch processing with Transformers:

```python
import torch

# Prepare batch inputs (for decoder-only models, left padding is typically
# preferred when batching for generation: tokenizer.padding_side = "left")
questions = [
    "What is 2 + 2?<|latent|>",
    "What is 5 * 3?<|latent|>",
    "What is 10 - 4?<|latent|>",
]
inputs = tokenizer(questions, return_tensors="pt", padding=True).to(model.device)

# Generate for batch
outputs = model.generate(
    **inputs,
    generation_config=generation_config,
    num_return_sequences=1,
)

# Decode batch results
results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for result in results:
    print(result)
```
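
Because `latent_do_sample=True` makes each decoding pass stochastic, you can also draw several completions per question for parallel test-time scaling. The sketch below shows assumed usage (grouping follows the standard `generate` ordering), not a verified recipe from the repository:

```python
# Draw 8 stochastic samples per question; generate returns
# (batch_size * num_return_sequences) sequences, grouped by input.
outputs = model.generate(
    **inputs,
    generation_config=generation_config,
    num_return_sequences=8,
)
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
samples_per_question = [texts[i:i + 8] for i in range(0, len(texts), 8)]
```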

## Model Architecture

### LatentHead Module

CoLaR uses a specialized LatentHead for generating latent representations:

```python
import torch.nn as nn

class LatentHead(nn.Module):
    def __init__(self, feature_size, intermediate_size=512):
        super().__init__()
        # Two-layer MLP that maps hidden states into the latent space
        self.fc = nn.Sequential(
            nn.Linear(feature_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, intermediate_size),
            nn.LayerNorm(intermediate_size),
        )
        # Projects back to the model's feature size to predict the latent mean
        self.mean = nn.Linear(intermediate_size, feature_size)
```

The latent embeddings are scaled by `latent_embedding_std` (default: 0.018 for LLaMA-3.2 models).

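For intuition, here is a minimal sketch of applying the head, assuming the `LatentHead` defined above, that "scaled by" means multiplication, and a hidden size of 2048 (Llama-3.2-1B); none of these details are confirmed by the repository:

```python
import torch

feature_size = 2048
latent_embedding_std = 0.018

head = LatentHead(feature_size)
hidden = torch.randn(1, feature_size)  # stand-in for a final hidden state
latent = head.mean(head.fc(hidden)) * latent_embedding_std
print(latent.shape)  # torch.Size([1, 2048])
```
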
## Generation Parameters

### LatentGenerationConfig

- `max_new_tokens` (int): Maximum number of tokens to generate
- `max_latent_length` (int): Maximum number of latent tokens (default: 64)
- `latent_do_sample` (bool): Whether to use stochastic sampling
- `latent_do_sample_by` (str): Sampling method, either `"dropout"` or `"noise"`
- `dropout_p` (float): Dropout probability for Monte Carlo Dropout (e.g., 0.1)
- `noise_std` (float): Standard deviation for Additive Gaussian Noise

### Sampling Methods

1. **Monte Carlo Dropout**: Randomly drops activations during forward passes

   ```python
   generation_config = LatentGenerationConfig(
       latent_do_sample_by="dropout",
       dropout_p=0.1,
       # ...
   )
   ```
2. **Additive Gaussian Noise**: Injects noise into latent embeddings

   ```python
   generation_config = LatentGenerationConfig(
       latent_do_sample_by="noise",
       noise_std=0.1,
       # ...
   )
   ```
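
Conceptually, both schemes perturb an otherwise deterministic latent vector. The sketch below illustrates the two mechanisms; it is not the repository's implementation:

```python
import torch
import torch.nn.functional as F

latent = torch.randn(1, 2048)  # stand-in for a deterministic latent vector

# Monte Carlo Dropout: randomly zero activations at inference time
dropout_sample = F.dropout(latent, p=0.1, training=True)

# Additive Gaussian Noise: add zero-mean noise with std 0.1
noise_sample = latent + 0.1 * torch.randn_like(latent)
```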

## Answer Extraction

CoLaR uses a special answer format with an `Answer:` prefix:

```python
from src.paths import colar_extract_answer_number

# Extract the numeric answer from the generated text
answer = colar_extract_answer_number(result)
print(f"Answer: {answer}")
```
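
For reference, a plausible stand-in for such an extractor is sketched below; it is hypothetical, and the actual `colar_extract_answer_number` in `src.paths` may behave differently:

```python
import re

def extract_answer_number(text: str):
    """Hypothetical stand-in: grab the first number after 'Answer:'."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

print(extract_answer_number("### Answer: 4"))  # 4.0
```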

## Evaluation

Run evaluation using the provided scripts:

```bash
# For CoLaR (LLaMA-based models)
./run_tests_llama.sh
```

## Model Card

- **Paper**: [Parallel Test-Time Scaling for Latent Reasoning Models](https://arxiv.org/abs/2510.07745)
- **HuggingFace**: [ModalityDance/latent-tts-colar](https://huggingface.co/ModalityDance/latent-tts-colar)
- **Benchmarks**: GSM8K Test, GSM8K Hard, MultiArith

## Notes

- **Data Type**: `torch.bfloat16` or `torch.float16` is recommended for LLaMA models
- **Memory**: LLaMA models typically require more GPU memory than GPT-2 models
- **Latent Length**: CoLaR uses `max_latent_length` instead of a fixed `latent_length`

## Citation

If you use this model, please cite:

```bibtex
@misc{you2025paralleltesttimescalinglatent,
    title={Parallel Test-Time Scaling for Latent Reasoning Models},
    author={Runyang You and Yongqi Li and Meng Liu and Wenjie Wang and Liqiang Nie and Wenjie Li},
    year={2025},
    eprint={2510.07745},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2510.07745},
}

@misc{tan2025thinksilentlythinkfast,
    title={Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains},
    author={Wenhui Tan and Jiaze Li and Jianzhong Ju and Zhenbo Luo and Jian Luan and Ruihua Song},
    year={2025},
    eprint={2505.16552},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.16552},
}
```