levossadtchi committed (verified)
Commit ed260ca · Parent(s): 6803247

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +67 -250
  2. modeling_qed.py +81 -0
  3. vocab.json +0 -0
README.md CHANGED
@@ -1,287 +1,104 @@
1
  ---
2
  license: mit
 
3
  pipeline_tag: text-generation
4
  tags:
5
- - causal-lm
6
- - decoder-only
7
- - pytorch
8
- - rope
9
- - rmsnorm
10
- - swiglu
11
- - custom-architecture
12
  language:
13
- - en
14
- model_type: qed
15
- library_name: transformers
16
  ---
17
 
18
-
19
- ![Frame 33](https://cdn-uploads.huggingface.co/production/uploads/695b8d7a2114f706bdcee465/cAL5N0oH6uViOVxXNUWs5.png)
20
-
21
-
22
  # QED-75M
23
 
24
- QED-75M is a compact **decoder-only causal language model** implemented for Hugging Face using a custom `transformers` module. The model architecture combines **RoPE** (rotary position embeddings), **RMSNorm**, **SwiGLU** feed-forward blocks, and causal self-attention implemented via `torch.nn.functional.scaled_dot_product_attention`. The token embedding weights can be tied with the output projection (`tie_word_embeddings`).
25
-
26
- This model card focuses on the **model itself** (architecture, tensor interface, runtime constraints). Training data, training procedure, and export scripts are described in the repository `README.md`.
27
-
28
- ## Table of Contents
29
-
30
- - [Model Details](#model-details)
31
- - [Uses](#uses)
32
- - [Bias, Risks, and Limitations](#bias-risks-and-limitations)
33
- - [Training Details](#training-details)
34
- - [Evaluation](#evaluation)
35
- - [Technical Specifications](#technical-specifications)
36
- - [Model Architecture](#model-architecture)
37
- - [Attention and RoPE](#attention-and-rope)
38
- - [MLP (SwiGLU)](#mlp-swiglu)
39
- - [Embeddings and Output Head](#embeddings-and-output-head)
40
- - [Input/Output Interface](#inputoutput-interface)
41
- - [KV Cache and Generation Semantics](#kv-cache-and-generation-semantics)
42
- - [Attention Masking](#attention-masking)
43
- - [Length Constraints](#length-constraints)
44
- - [Default Hyperparameters](#default-hyperparameters)
45
- - [How to Get Started with the Model](#how-to-get-started-with-the-model)
46
- - [Citation](#citation)
47
- - [Model Card Contact](#model-card-contact)
48
-
49
- ---
50
-
51
- # Model Details
52
-
53
- ## Model Description
54
-
55
- QED is a **next-token prediction** model (causal LM). Given a sequence of token ids, the model produces logits over the vocabulary for each position. When `labels` are provided, the model computes the training loss as cross-entropy over the next-token targets (with `ignore_index=-100`).
56
-
57
- The Hugging Face integration provides:
58
-
59
- - `QEDConfig` (`model_type: qed`)
60
- - `QEDForCausalLM`
61
-
62
- Both classes are defined in the repo module `modeling_qed.py` and are loaded with `trust_remote_code=True`.
63
-
64
- ## Model Sources
65
-
66
- - Code: the repository containing `modeling_qed.py` and the exported model artifacts.
67
- - Transformers implementation: `modeling_qed.py` (remote code in the model repo).
68
-
69
- ---
70
-
71
- # Uses
72
-
73
- ## Direct Use
74
-
75
- - Text generation using `model.generate(...)`.
76
- - Scoring / evaluating conditional likelihoods via `model(input_ids=..., labels=...)`.
77
-
78
- ## Downstream Use
79
-
80
- - Fine-tuning or adapting the model (for example, SFT or LoRA) is technically possible, but quality and safety must be validated for the target domain.
81
-
82
- ## Out-of-Scope Use
83
-
84
- - Using the model for high-stakes decisions (medical, legal, finance) without human verification.
85
- - Assuming the model is always factually correct or always safe.
86
- - Using the model to bypass safety systems or to generate disallowed content.
87
-
88
- ---
89
-
90
- # Bias, Risks, and Limitations
91
-
92
- Like other language models, QED may produce:
93
-
94
- - **Hallucinations** (confident but incorrect statements).
95
- - **Pattern repetition** from training data.
96
- - **Uneven quality** across topics and languages, depending on what the specific checkpoint was trained on.
97
-
98
- Mitigations:
99
-
100
- - Use output filtering and constrain the generation strategy when deploying in real applications.
101
- - Perform domain-specific evaluations before relying on the model.
102
- - Treat the model as a suggestion engine, not a ground-truth source.
103
-
104
- ---
105
-
106
- # Training Details
107
-
108
- The full training pipeline (tokenizer training, pretraining, context-length annealing, and SFT preparation) is described in the repository `README.md`. This model card deliberately avoids duplicating training steps; it documents the **resulting model interface and architecture**.
109
-
110
- ---
111
-
112
- # Evaluation
113
-
114
- We evaluated the following models with a custom evaluation pipeline based on the Hugging Face **LightEval** harness, as used in the SmolLM2 model evaluations. The pipeline reports a **"general"** average over a fixed suite of tasks:
115
 
116
- - `MMLU` (aggregated over its MMLU subtasks in the LightEval leaderboard)
117
- - `HellaSwag`
118
- - `ARC-Challenge`
119
- - `Winogrande`
120
- - `CommonsenseQA`
121
 
122
- | Model | Average (general) | arc:challenge | commonsense_qa | hellaswag | winogrande | mmlu |
123
- |---|---:|---:|---:|---:|---:|---:|
124
- | `HuggingFaceTB/SmolLM2-135M` | 0.299140 | 0.283276 | 0.190827 | 0.252440 | 0.519337 | 0.249822 |
125
- | `levossadtchi/QED-75M` | 0.287318 | 0.231229 | 0.204750 | 0.253336 | 0.506709 | 0.240564 |
126
- | `EleutherAI/gpt-neo-125m` | 0.279464 | 0.191126 | 0.205569 | 0.249751 | 0.521705 | 0.229170 |
127
- | `EleutherAI/pythia-160m-deduped` | 0.275796 | 0.202218 | 0.194922 | 0.250846 | 0.501184 | 0.229811 |
128
- | `openai-community/gpt2` | 0.273993 | 0.188567 | 0.196560 | 0.250249 | 0.505919 | 0.228671 |
129
 
130
- ---
131
-
132
- # Technical Specifications
133
-
134
- ## Model Architecture
135
-
136
- `QEDForCausalLM` is a decoder-only transformer with the following high-level structure:
137
-
138
- - Token embeddings: `embed_tokens = Embedding(vocab_size, d_model)`
139
- - `n_layers` identical blocks (`TransformerBlock`), each applying:
140
- - Residual attention: `x = x + Attention(RMSNorm(x))`
141
- - Residual MLP: `x = x + SwiGLU(RMSNorm(x))`
142
- - Final normalization: `norm = RMSNorm(d_model)`
143
- - Output head: `lm_head = Linear(d_model, vocab_size, bias=True)`
144
-
145
- The attention applies RoPE to Q and K and enforces causal masking semantics.
146
-
147
- ## Attention and RoPE
148
-
149
- - Projection layers (per attention block):
150
- - `q_proj`, `k_proj`, `v_proj`, `o_proj` are `Linear(d_model, d_model, bias=config.bias)`
151
- - Number of heads: `n_heads`
152
- - Head dimension: `head_dim = d_model / n_heads`
153
- - RoPE:
154
- - Rotary embedding precomputes `cos_cached` and `sin_cached` up to `max_seq_len`
155
- - RoPE is applied to Q and K using `position_ids`
156
- - Attention kernel:
157
- - Implemented with `torch.nn.functional.scaled_dot_product_attention`
158
- - Uses explicit scaling `scale = head_dim ** -0.5`
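As a concrete illustration of the precomputed `cos`/`sin` tables and their application via `position_ids`, here is a minimal standalone RoPE sketch (values match the config above: `head_dim=64`, `rope_theta=10000`; this is not the repository's exact implementation):

```python
import torch

head_dim, theta, max_seq_len = 64, 10000.0, 32
# Precompute inverse frequencies and cached cos/sin tables up to max_seq_len.
inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
t = torch.arange(max_seq_len).float()
freqs = torch.outer(t, inv_freq)                      # [seq, head_dim/2]
cos_cached = torch.cat([freqs, freqs], dim=-1).cos()  # [seq, head_dim]
sin_cached = torch.cat([freqs, freqs], dim=-1).sin()

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q: torch.Tensor, position_ids: torch.Tensor) -> torch.Tensor:
    cos = cos_cached[position_ids]                    # gather per-position tables
    sin = sin_cached[position_ids]
    return q * cos + rotate_half(q) * sin

q = torch.randn(1, 6, 8, 64)                          # [batch, heads, seq, head_dim]
q_rot = apply_rope(q, torch.arange(8))
print(q_rot.shape)  # torch.Size([1, 6, 8, 64])
```

Since RoPE is a pure rotation of each (x1, x2) pair, it preserves vector norms, which is a quick sanity check for any implementation.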
159
-
160
- ## MLP (SwiGLU)
161
-
162
- The feed-forward sublayer is a SwiGLU variant:
163
-
164
- - `gate_proj: Linear(d_model, ffn_hidden_dim)`
165
- - `up_proj: Linear(d_model, ffn_hidden_dim)`
166
- - `down_proj: Linear(ffn_hidden_dim, d_model)`
167
- - Compute:
168
- - `SwiGLU(x) = down_proj( silu(gate_proj(x)) * up_proj(x) )`
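The sublayer above can be sketched as a standalone module (dimensions follow the QED-75M defaults, with bias disabled as in the exported config; this is an illustration, not the repository code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int = 384, ffn_hidden_dim: int = 1024):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, ffn_hidden_dim, bias=False)
        self.up_proj = nn.Linear(d_model, ffn_hidden_dim, bias=False)
        self.down_proj = nn.Linear(ffn_hidden_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # down_proj( silu(gate_proj(x)) * up_proj(x) )
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

mlp = SwiGLU()
out = mlp(torch.randn(2, 7, 384))
print(out.shape)  # torch.Size([2, 7, 384])
```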
169
-
170
- ## Embeddings and Output Head
171
-
172
- - `embed_tokens`: size `[vocab_size, d_model]`
173
- - `lm_head`: `Linear(d_model, vocab_size)` with **bias enabled** (weight shape `[vocab_size, d_model]`, matching `embed_tokens` for tying)
174
- - Weight tying:
175
- - When `tie_word_embeddings=True`, `lm_head.weight` is tied to `embed_tokens.weight`
176
- - The `lm_head` bias remains a separate parameter.
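A minimal sketch of this tying scheme with plain `torch.nn` modules (sizes taken from the config above; not the repository code):

```python
import torch.nn as nn

vocab_size, d_model = 49152, 384
embed_tokens = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=True)

# tie_word_embeddings=True: lm_head.weight shares storage with embed_tokens.weight,
# while the lm_head bias remains its own parameter.
lm_head.weight = embed_tokens.weight

print(lm_head.weight.data_ptr() == embed_tokens.weight.data_ptr())  # True
print(lm_head.bias.shape)  # torch.Size([49152])
```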
177
-
178
- ## Input/Output Interface
179
-
180
- Typical usage via Transformers:
181
-
182
- - `input_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
183
- - Optional:
184
- - `position_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
185
- - `attention_mask`: `torch.Tensor` of shape `[batch_size, seq_len]`
186
- - `labels`: `torch.LongTensor` of shape `[batch_size, seq_len]` (positions with `-100` are ignored)
187
- - `past_key_values`: list of length `n_layers` with cached keys/values
188
- - Outputs:
189
- - `logits`: `[batch_size, seq_len, vocab_size]`
190
- - `loss`: scalar when `labels` are provided
191
- - `past_key_values`: cached KV tensors when `use_cache=True`
192
-
193
- ## KV Cache and Generation Semantics
194
-
195
- - The model uses a **legacy tuple KV cache** format (not the newer `DynamicCache` object). The integration explicitly disables default dynamic cache support (`_supports_default_dynamic_cache()` returns `False`).
196
- - In `prepare_inputs_for_generation(...)`:
197
- - If `past_key_values` is provided, generation continues by feeding only the **last token** (`input_ids[:, -1:]`).
198
- - The attention layer concatenates past and current KV along the sequence dimension.
199
-
200
- Expected KV shapes (conceptually):
201
-
202
- - For each layer, `(key, value)` have shape `[batch_size, n_heads, kv_len, head_dim]`.
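The concatenation semantics can be illustrated with dummy tensors (shapes as stated above; the layer count is reduced to 2 for brevity, and the "layer" here is a stand-in, not the real attention module):

```python
import torch

batch, n_heads, head_dim = 1, 6, 64

def fake_layer_kv(kv_len: int):
    # Stand-in for one layer's cached (key, value) pair.
    return (torch.randn(batch, n_heads, kv_len, head_dim),
            torch.randn(batch, n_heads, kv_len, head_dim))

past_key_values = [fake_layer_kv(kv_len=10) for _ in range(2)]

# Decoding one new token: each layer concatenates past and current KV
# along the sequence dimension (dim=2).
new_k, new_v = fake_layer_kv(kv_len=1)
k, v = past_key_values[0]
k = torch.cat([k, new_k], dim=2)
v = torch.cat([v, new_v], dim=2)
print(k.shape)  # torch.Size([1, 6, 11, 64])
```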
203
-
204
- ## Attention Masking
205
-
206
- When `attention_mask` is provided, the model converts it to a key-padding boolean mask:
207
 
208
- - `key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)`
209
 
210
- Then it builds:
211
 
212
- - causal constraint (positions cannot attend to future keys)
213
- - AND with `key_padding_mask` (mask out padded keys)
214
 
215
- Practical recommendation:
 
 
216
 
217
- - Use the standard HF convention: `attention_mask` values should be `1` for real tokens and `0` for padding tokens.
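The mask combination described above can be reproduced with plain tensors (an illustration of the semantics, not the module's exact code):

```python
import torch

batch, seq_len = 2, 5
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])  # 1 = real token, 0 = padding

# Key-padding mask broadcastable over [batch, heads, query, key].
key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)  # [B,1,1,S]
# Causal constraint: query t may attend only to keys <= t.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # [S,S]
combined = causal[None, None, :, :] & key_padding_mask               # [B,1,S,S]

# Query 2 of sequence 0 attends to keys 0..2 but not the padded keys 3 and 4.
print(combined[0, 0, 2])  # tensor([ True,  True,  True, False, False])
```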
218
 
219
- ## Length Constraints
220
 
221
- The model enforces:
 
 
 
222
 
223
- - `total_seq_len = past_length + seq_len <= config.max_seq_len`
224
 
225
- If `total_seq_len` exceeds `max_seq_len`, the model raises a `ValueError`.
226
 
227
- Default `max_seq_len` in the exported config for this checkpoint is `8192`.
228
 
229
- ## Default Hyperparameters
 
230
 
231
- The exported `config.json` for the QED-75M checkpoint sets:
232
 
233
- | Hyperparameter | Value |
234
- |---|---:|
235
- | Approx. parameter count | ~75M |
236
- | `n_layers` | 32 |
237
- | `d_model` | 384 |
238
- | `n_heads` | 6 |
239
- | `head_dim` | 64 |
240
- | `ffn_hidden_dim` | 1024 |
241
- | `vocab_size` | 49152 |
242
- | `max_seq_len` | 8192 |
243
- | `rope_theta` | 10000.0 |
244
- | `rms_norm_eps` | 1e-5 |
245
- | `dropout` | 0.0 |
246
- | `tie_word_embeddings` | true |
247
- | internal linear `bias` (QKV/MLP) | false |
248
 
249
- Tokenizer / special tokens (from exported `tokenizer_config.json`):
 
 
 
250
 
251
- - `<pad>` id `0`
252
- - `<bos>` id `1`
253
- - `<eos>` id `2`
254
- - `<unk>` id `3`
255
 
256
- ---
 
 
257
 
258
- # How to Get Started with the Model
259
 
260
- ```python
261
- import torch
262
- from transformers import AutoModelForCausalLM, AutoTokenizer
263
 
264
- repo_id = "levossadtchi/QED-75M"
265
 
266
- tokenizer = AutoTokenizer.from_pretrained(repo_id)
267
- model = AutoModelForCausalLM.from_pretrained(
268
- repo_id,
269
- trust_remote_code=True,
270
- torch_dtype=torch.bfloat16, # optional
271
- )
272
 
273
- inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
274
- out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50, temperature=0.8)
275
- print(tokenizer.decode(out[0], skip_special_tokens=True))
 
 
 
276
  ```
277
 
278
- For loss computation:
279
-
280
- - pass `labels` with the same shape as `input_ids`
281
- - use `-100` in positions you want to ignore.
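A standalone sketch of this loss convention with dummy tensors (the explicit shift shown here is the standard causal-LM pattern; the actual shifting is handled inside the model):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 5, 10)               # [batch, seq_len, vocab]
labels = torch.tensor([[4, 2, -100, 7, 1]])  # -100 positions are ignored

# Shift so that position t predicts token t+1.
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
    ignore_index=-100,
)
print(loss.item())  # a positive scalar
```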
282
-
283
  ---
284
 
285
- # Model Card Contact
286
-
287
- For questions or updates about this model card, use the Issues/Discussions in the code repository or contact the model owner on Hugging Face.
 
1
  ---
2
  license: mit
3
+ library_name: transformers
4
  pipeline_tag: text-generation
5
  tags:
6
+ - causal-lm
7
+ - pytorch
8
+ - decoder-only
9
+ - rope
10
+ - custom-architecture
 
 
11
  language:
12
+ - en
13
+ - ru
 
14
  ---
15
 
 
 
 
 
16
  # QED-75M
17
 
18
+ **QED-75M** is a compact decoder-only language model (~75M parameters) in the spirit of modern LLMs: **RoPE**, **RMSNorm**, **SwiGLU**, **causal self-attention** (via `scaled_dot_product_attention`), and **weight tying** between the input embeddings and the output projection. The architecture is weight-name-compatible with the internal **SLLM** training stack from the training repository.
19
 
20
+ The model is intended for **text generation** (causal LM) after pretraining and SFT; it is research/educational in scale, not production-grade like large commercial LLMs.
21
 
22
+ ## Key specifications
23
 
24
+ | Parameter | Value |
25
+ |----------|----------|
26
+ | Parameters | ~75M |
27
+ | Layers | 32 |
28
+ | `d_model` | 384 |
29
+ | Attention heads | 6 (`head_dim` = 64) |
30
+ | FFN (`ffn_hidden_dim`) | 1024 |
31
+ | Vocabulary | 49,152 (BPE) |
32
+ | Context | up to **8192** tokens |
33
+ | RoPE θ | 10,000 |
34
+ | Bias in block linear layers | none (`bias: false`) |
35
+ | LM head | with bias; weights tied to `embed_tokens` when `tie_word_embeddings: true` |
36
 
37
+ Special tokens: `<pad>` (0), `<bos>` (1), `<eos>` (2), `<unk>` (3).
38
 
39
+ ## Usage
40
 
41
+ **`trust_remote_code=True`** is required: the `QEDConfig` and `QEDForCausalLM` classes are loaded from `modeling_qed.py` in the model repository.
 
42
 
43
+ ```python
44
+ import torch
45
+ from transformers import AutoModelForCausalLM, AutoTokenizer
46
 
47
+ model_id = "YOUR_USERNAME/QED-75M" # replace with the repo id on the Hub
48
 
49
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
50
+ model = AutoModelForCausalLM.from_pretrained(
51
+ model_id,
52
+ trust_remote_code=True,
53
+ torch_dtype=torch.bfloat16, # optional, if supported
54
+ device_map="auto", # optional
55
+ )
56
 
57
+ inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
58
+ outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50, temperature=0.8)
59
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
60
+ ```
61
 
62
+ Generation uses the **legacy tuple KV cache** (for compatibility with `transformers.generate`); `supports_gradient_checkpointing` is `False` in the current implementation.
63
 
64
+ ## Training (overview)
65
 
66
+ The pipeline in the source repository:
67
 
68
+ 1. **Pretraining**: a mix of open corpora (configurable data mix); stage 1 with sequence lengths around **2048** tokens, then **annealing** at **8192**.
69
+ 2. **SFT**: instruct/dialogue data (including subsets of `HuggingFaceTB/smoltalk`, `HuggingFaceH4/ultrachat_200k`, and others, weighted as in the config); next-token labels only on the **assistant** parts of the dialogue.
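The assistant-only labeling rule can be sketched in plain Python (the token ids and role mask here are made up for illustration; the real pipeline derives the mask from the chat template):

```python
# <bos>, user tokens, assistant tokens, <eos> -- hypothetical ids.
input_ids = [1, 101, 102, 103, 201, 202, 2]
assistant_mask = [0, 0, 0, 0, 1, 1, 1]  # 1 on assistant tokens (incl. <eos>)

# -100 marks positions ignored by the cross-entropy loss.
labels = [tok if m == 1 else -100
          for tok, m in zip(input_ids, assistant_mask)]
print(labels)  # [-100, -100, -100, -100, 201, 202, 2]
```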
70
 
71
+ When publishing a specific checkpoint, state in the card the exact mix composition, the number of steps, and the checkpoint this release was built from (it is recommended to add a field to this README or to the release description on the Hub).
72
 
73
+ ## Limitations and risks
74
 
75
+ - A small model: limited reasoning, factual knowledge, and multilinguality compared to large LLMs.
76
+ - **Hallucinations** and outdated or incorrect statements are possible; do not use it as a sole source of truth.
77
+ - Behavior depends on the **prompt**, temperature, and post-processing; production use requires safety policies and output filtering.
78
+ - Loading **remote code** is a deliberate trade-off: trust only repositories from verified authors and pin a revision (`revision=...`) for reproducibility.
79
 
80
+ ## Repository files
81
 
82
+ - `config.json`: `QEDConfig` / `auto_map` for the Auto classes.
83
+ - `modeling_qed.py`: the model implementation for `transformers`.
84
+ - weights in **SafeTensors** (and/or PyTorch) format; the tokenizer (`tokenizer.json` and metadata).
85
 
86
+ ## License
87
 
88
+ Code and weights in this repository: **MIT** (see the `license` field above). The training data is covered by the licenses of its sources; when publishing, specify them in the "Datasets" section on the model page.
 
 
89
 
90
+ ## Citation
91
 
92
+ If you use this model in your work, cite the model repository on Hugging Face and, if available, the training code repository.
 
 
 
 
 
93
 
94
+ ```bibtex
95
+ @misc{qed-75m,
96
+ title = {QED-75M: A Small Decoder-Only Language Model},
97
+ howpublished = {\url{https://huggingface.co/YOUR_USERNAME/QED-75M}},
98
+ note = {Accessed: YYYY-MM-DD}
99
+ }
100
  ```
101
 
 
 
 
 
 
102
  ---
103
 
104
+ *This card is kept consistent with the architecture in `hf_hub/modeling_qed.py` and the export config `config.json`.*
 
 
modeling_qed.py CHANGED
@@ -216,6 +216,87 @@ class QEDForCausalLM(PreTrainedModel, GenerationMixin):
216
  """Use legacy tuple KV cache; DynamicCache expects standard HF config fields."""
217
  return False
218
 
219
+ @torch.no_grad()
220
+ def _sample_next_token(
221
+ self,
222
+ next_token_logits: torch.Tensor,
223
+ temperature: float,
224
+ top_k: int | None,
225
+ ) -> torch.Tensor:
226
+ """
227
+ Sample next token from logits.
228
+ Matches behavior of the training-time SLLM generator.
229
+ """
230
+ if temperature <= 0:
231
+ return torch.argmax(next_token_logits, dim=-1, keepdim=True)
232
+
233
+ next_token_logits = next_token_logits / temperature
234
+ if top_k is not None and top_k > 0:
235
+ top_k = min(top_k, next_token_logits.size(-1))
236
+ values, _ = torch.topk(next_token_logits, top_k)
237
+ cutoff = values[:, [-1]]
238
+ next_token_logits = next_token_logits.masked_fill(
239
+ next_token_logits < cutoff, float("-inf")
240
+ )
241
+ probs = F.softmax(next_token_logits, dim=-1)
242
+ return torch.multinomial(probs, num_samples=1)
243
+
244
+ @torch.no_grad()
245
+ def generate(
246
+ self,
247
+ input_ids: torch.LongTensor,
248
+ max_new_tokens: int = 128,
249
+ temperature: float = 0.8,
250
+ top_k: int | None = 50,
251
+ eos_token_id: Optional[int] = None,
252
+ do_sample: bool = False,
253
+ **kwargs,
254
+ ) -> torch.LongTensor:
255
+ """
256
+ Generate tokens using the same logic as `src/sllm/model.py::SLLMForCausalLM.generate`.
257
+
258
+ We override HF's `GenerationMixin.generate()` because its cache/position semantics can differ from
259
+ this model's legacy KV cache path. This makes HF inference match your local script output.
260
+ """
261
+ _ = kwargs
262
+ if eos_token_id is None:
263
+ eos_token_id = getattr(self.config, "eos_token_id", None)
264
+
265
+ # For compatibility: if caller doesn't want sampling, force greedy decoding.
266
+ if not do_sample:
267
+ temperature = 0.0
268
+
269
+ generated = input_ids[:, -self.config.max_seq_len :]
270
+ outputs = self(generated, use_cache=True)
271
+ past_key_values = outputs.past_key_values
272
+ next_token_logits = outputs.logits[:, -1, :]
273
+
274
+ for _ in range(max_new_tokens):
275
+ next_token = self._sample_next_token(
276
+ next_token_logits, temperature=temperature, top_k=top_k
277
+ )
278
+ generated = torch.cat([generated, next_token], dim=1)
279
+
280
+ if eos_token_id is not None and torch.all(next_token.squeeze(-1) == eos_token_id):
281
+ break
282
+
283
+ if generated.size(1) >= self.config.max_seq_len:
284
+ # Sliding window when the context is full.
285
+ context = generated[:, -self.config.max_seq_len :]
286
+ outputs = self(context, use_cache=True)
287
+ else:
288
+ # One-step decode with cached KV.
289
+ outputs = self(
290
+ next_token,
291
+ past_key_values=past_key_values,
292
+ use_cache=True,
293
+ )
294
+
295
+ past_key_values = outputs.past_key_values
296
+ next_token_logits = outputs.logits[:, -1, :]
297
+
298
+ return generated
299
+
300
  def __init__(self, config: QEDConfig) -> None:
301
  super().__init__(config)
302
  self.embed_tokens = nn.Embedding(config.vocab_size, config.d_model)
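The top-k plus temperature filtering in `_sample_next_token` above can be checked standalone on a dummy logits row (plain tensors, no model needed):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])
temperature, top_k = 0.8, 2

scaled = logits / temperature
values, _ = torch.topk(scaled, top_k)
cutoff = values[:, [-1]]  # smallest logit that survives the top-k filter
filtered = scaled.masked_fill(scaled < cutoff, float("-inf"))
probs = F.softmax(filtered, dim=-1)

print(probs)  # only the top-2 tokens keep nonzero probability
```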
vocab.json CHANGED
The diff for this file is too large to render. See raw diff