levossadtchi committed (verified)
Commit ed260ca · Parent(s): 6803247

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +67 -250
  2. modeling_qed.py +81 -0
  3. vocab.json +0 -0
README.md CHANGED
@@ -1,287 +1,104 @@
1
  ---
2
  license: mit
 
3
  pipeline_tag: text-generation
4
  tags:
5
- - causal-lm
6
- - decoder-only
7
- - pytorch
8
- - rope
9
- - rmsnorm
10
- - swiglu
11
- - custom-architecture
12
  language:
13
- - en
14
- model_type: qed
15
- library_name: transformers
16
  ---
17
 
18
-
19
- ![Frame 33](https://cdn-uploads.huggingface.co/production/uploads/695b8d7a2114f706bdcee465/cAL5N0oH6uViOVxXNUWs5.png)
20
-
21
-
22
  # QED-75M
23
 
24
- QED-75M is a compact **decoder-only causal language model** implemented for Hugging Face using a custom `transformers` module. The model architecture combines **RoPE** (rotary position embeddings), **RMSNorm**, **SwiGLU** feed-forward blocks, and causal self-attention implemented via `torch.nn.functional.scaled_dot_product_attention`. The token embedding weights can be tied with the output projection (`tie_word_embeddings`).
25
-
26
- This model card focuses on the **model itself** (architecture, tensor interface, runtime constraints). Training data, training procedure, and export scripts are described in the repository `README.md`.
27
-
28
- ## Table of Contents
29
-
30
- - [Model Details](#model-details)
31
- - [Uses](#uses)
32
- - [Bias, Risks, and Limitations](#bias-risks-and-limitations)
33
- - [Training Details](#training-details)
34
- - [Evaluation](#evaluation)
35
- - [Technical Specifications](#technical-specifications)
36
- - [Model Architecture](#model-architecture)
37
- - [Attention and RoPE](#attention-and-rope)
38
- - [MLP (SwiGLU)](#mlp-swiglu)
39
- - [Embeddings and Output Head](#embeddings-and-output-head)
40
- - [Input/Output Interface](#inputoutput-interface)
41
- - [KV Cache and Generation Semantics](#kv-cache-and-generation-semantics)
42
- - [Attention Masking](#attention-masking)
43
- - [Length Constraints](#length-constraints)
44
- - [Default Hyperparameters](#default-hyperparameters)
45
- - [How to Get Started with the Model](#how-to-get-started-with-the-model)
46
- - [Citation](#citation)
47
- - [Model Card Contact](#model-card-contact)
48
-
49
- ---
50
-
51
- # Model Details
52
-
53
- ## Model Description
54
-
55
- QED is a **next-token prediction** model (causal LM). Given a sequence of token ids, the model produces logits over the vocabulary for each position. When `labels` are provided, the model computes the training loss as cross-entropy over the next-token targets (with `ignore_index=-100`).
56
-
57
- The Hugging Face integration provides:
58
-
59
- - `QEDConfig` (`model_type: qed`)
60
- - `QEDForCausalLM`
61
-
62
- Both classes are defined in the repo module `modeling_qed.py` and are loaded with `trust_remote_code=True`.
63
-
64
- ## Model Sources
65
-
66
- - Code: the repository containing `modeling_qed.py` and the exported model artifacts.
67
- - Transformers implementation: `modeling_qed.py` (remote code in the model repo).
68
-
69
- ---
70
-
71
- # Uses
72
-
73
- ## Direct Use
74
-
75
- - Text generation using `model.generate(...)`.
76
- - Scoring / evaluating conditional likelihoods via `model(input_ids=..., labels=...)`.
77
-
78
- ## Downstream Use
79
-
80
- - Fine-tuning or adapting the model (for example, SFT or LoRA) is technically possible, but quality and safety must be validated for the target domain.
81
-
82
- ## Out-of-Scope Use
83
-
84
- - Using the model for high-stakes decisions (medical, legal, finance) without human verification.
85
- - Assuming the model is always factually correct or always safe.
86
- - Using the model to bypass safety systems or to generate disallowed content.
87
-
88
- ---
89
-
90
- # Bias, Risks, and Limitations
91
-
92
- Like other language models, QED may produce:
93
-
94
- - **Hallucinations** (confident but incorrect statements).
95
- - **Pattern repetition** from training data.
96
- - **Uneven quality** across topics and languages, depending on what the specific checkpoint was trained on.
97
-
98
- Mitigations:
99
-
100
- - Use output filtering and constrain the generation strategy when deploying in real applications.
101
- - Perform domain-specific evaluations before relying on the model.
102
- - Treat the model as a suggestion engine, not a ground-truth source.
103
-
104
- ---
105
-
106
- # Training Details
107
-
108
- The full training pipeline (tokenizer training, pretraining, context-length annealing, and SFT preparation) is described in the repository `README.md`. This model card deliberately avoids duplicating training steps; it documents the **resulting model interface and architecture**.
109
-
110
- ---
111
-
112
- # Evaluation
113
-
114
- We evaluated the following models with a custom evaluation pipeline based on the Hugging Face **LightEval** harness, as used in the SmolLM2 model evaluations. The pipeline reports a **"general"** average over a fixed suite of tasks:
115
 
116
- - `MMLU` (aggregated over its MMLU subtasks in the LightEval leaderboard)
117
- - `HellaSwag`
118
- - `ARC-Challenge`
119
- - `Winogrande`
120
- - `CommonsenseQA`
121
 
122
- | Model | Average (general) | arc:challenge | commonsense_qa | hellaswag | winogrande | mmlu |
123
- |---|---:|---:|---:|---:|---:|---:|
124
- | `HuggingFaceTB/SmolLM2-135M` | 0.299140 | 0.283276 | 0.190827 | 0.252440 | 0.519337 | 0.249822 |
125
- | `levossadtchi/QED-75M` | 0.287318 | 0.231229 | 0.204750 | 0.253336 | 0.506709 | 0.240564 |
126
- | `EleutherAI/gpt-neo-125m` | 0.279464 | 0.191126 | 0.205569 | 0.249751 | 0.521705 | 0.229170 |
127
- | `EleutherAI/pythia-160m-deduped` | 0.275796 | 0.202218 | 0.194922 | 0.250846 | 0.501184 | 0.229811 |
128
- | `openai-community/gpt2` | 0.273993 | 0.188567 | 0.196560 | 0.250249 | 0.505919 | 0.228671 |
129
 
130
- ---
131
-
132
- # Technical Specifications
133
-
134
- ## Model Architecture
135
-
136
- `QEDForCausalLM` is a decoder-only transformer with the following high-level structure:
137
-
138
- - Token embeddings: `embed_tokens = Embedding(vocab_size, d_model)`
139
- - `n_layers` identical blocks (`TransformerBlock`), each applying:
140
- - Residual attention: `x = x + Attention(RMSNorm(x))`
141
- - Residual MLP: `x = x + SwiGLU(RMSNorm(x))`
142
- - Final normalization: `norm = RMSNorm(d_model)`
143
- - Output head: `lm_head = Linear(d_model, vocab_size, bias=True)`
144
-
145
- The attention applies RoPE to Q and K and enforces causal masking semantics.
146
-
147
- ## Attention and RoPE
148
-
149
- - Projection layers (per attention block):
150
- - `q_proj`, `k_proj`, `v_proj`, `o_proj` are `Linear(d_model, d_model, bias=config.bias)`
151
- - Number of heads: `n_heads`
152
- - Head dimension: `head_dim = d_model / n_heads`
153
- - RoPE:
154
- - Rotary embedding precomputes `cos_cached` and `sin_cached` up to `max_seq_len`
155
- - RoPE is applied to Q and K using `position_ids`
156
- - Attention kernel:
157
- - Implemented with `torch.nn.functional.scaled_dot_product_attention`
158
- - Uses explicit scaling `scale = head_dim ** -0.5`
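As a concrete illustration of the precomputed `cos`/`sin` tables and their application via `position_ids`, here is a minimal standalone RoPE sketch (values match the config above: `head_dim=64`, `rope_theta=10000`; this is not the repository's exact implementation):

```python
import torch

head_dim, theta, max_seq_len = 64, 10000.0, 32
# Precompute inverse frequencies and cached cos/sin tables up to max_seq_len.
inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
t = torch.arange(max_seq_len).float()
freqs = torch.outer(t, inv_freq)                      # [seq, head_dim/2]
cos_cached = torch.cat([freqs, freqs], dim=-1).cos()  # [seq, head_dim]
sin_cached = torch.cat([freqs, freqs], dim=-1).sin()

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q: torch.Tensor, position_ids: torch.Tensor) -> torch.Tensor:
    cos = cos_cached[position_ids]                    # gather per-position tables
    sin = sin_cached[position_ids]
    return q * cos + rotate_half(q) * sin

q = torch.randn(1, 6, 8, 64)                          # [batch, heads, seq, head_dim]
q_rot = apply_rope(q, torch.arange(8))
print(q_rot.shape)  # torch.Size([1, 6, 8, 64])
```

Since RoPE is a pure rotation of each (x1, x2) pair, it preserves vector norms, which is a quick sanity check for any implementation.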
159
-
160
- ## MLP (SwiGLU)
161
-
162
- The feed-forward sublayer is a SwiGLU variant:
163
-
164
- - `gate_proj: Linear(d_model, ffn_hidden_dim)`
165
- - `up_proj: Linear(d_model, ffn_hidden_dim)`
166
- - `down_proj: Linear(ffn_hidden_dim, d_model)`
167
- - Compute:
168
- - `SwiGLU(x) = down_proj( silu(gate_proj(x)) * up_proj(x) )`
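The sublayer above can be sketched as a standalone module (dimensions follow the QED-75M defaults, with bias disabled as in the exported config; this is an illustration, not the repository code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int = 384, ffn_hidden_dim: int = 1024):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, ffn_hidden_dim, bias=False)
        self.up_proj = nn.Linear(d_model, ffn_hidden_dim, bias=False)
        self.down_proj = nn.Linear(ffn_hidden_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # down_proj( silu(gate_proj(x)) * up_proj(x) )
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

mlp = SwiGLU()
out = mlp(torch.randn(2, 7, 384))
print(out.shape)  # torch.Size([2, 7, 384])
```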
169
-
170
- ## Embeddings and Output Head
171
-
172
- - `embed_tokens`: size `[vocab_size, d_model]`
173
- - `lm_head`: `Linear(d_model, vocab_size)` with **bias enabled** (weight shape `[vocab_size, d_model]`, matching `embed_tokens` for tying)
174
- - Weight tying:
175
- - When `tie_word_embeddings=True`, `lm_head.weight` is tied to `embed_tokens.weight`
176
- - The `lm_head` bias remains a separate parameter.
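A minimal sketch of this tying scheme with plain `torch.nn` modules (sizes taken from the config above; not the repository code):

```python
import torch.nn as nn

vocab_size, d_model = 49152, 384
embed_tokens = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=True)

# tie_word_embeddings=True: lm_head.weight shares storage with embed_tokens.weight,
# while the lm_head bias remains its own parameter.
lm_head.weight = embed_tokens.weight

print(lm_head.weight.data_ptr() == embed_tokens.weight.data_ptr())  # True
print(lm_head.bias.shape)  # torch.Size([49152])
```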
177
-
178
- ## Input/Output Interface
179
-
180
- Typical usage via Transformers:
181
-
182
- - `input_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
183
- - Optional:
184
- - `position_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
185
- - `attention_mask`: `torch.Tensor` of shape `[batch_size, seq_len]`
186
- - `labels`: `torch.LongTensor` of shape `[batch_size, seq_len]` (positions with `-100` are ignored)
187
- - `past_key_values`: list of length `n_layers` with cached keys/values
188
- - Outputs:
189
- - `logits`: `[batch_size, seq_len, vocab_size]`
190
- - `loss`: scalar when `labels` are provided
191
- - `past_key_values`: cached KV tensors when `use_cache=True`
192
-
193
- ## KV Cache and Generation Semantics
194
-
195
- - The model uses a **legacy tuple KV cache** format (not the newer `DynamicCache` object). The integration explicitly disables default dynamic cache support (`_supports_default_dynamic_cache()` returns `False`).
196
- - In `prepare_inputs_for_generation(...)`:
197
- - If `past_key_values` is provided, generation continues by feeding only the **last token** (`input_ids[:, -1:]`).
198
- - The attention layer concatenates past and current KV along the sequence dimension.
199
-
200
- Expected KV shapes (conceptually):
201
-
202
- - For each layer, `(key, value)` have shape `[batch_size, n_heads, kv_len, head_dim]`.
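The concatenation semantics can be illustrated with dummy tensors (shapes as stated above; the layer count is reduced to 2 for brevity, and the "layer" here is a stand-in, not the real attention module):

```python
import torch

batch, n_heads, head_dim = 1, 6, 64

def fake_layer_kv(kv_len: int):
    # Stand-in for one layer's cached (key, value) pair.
    return (torch.randn(batch, n_heads, kv_len, head_dim),
            torch.randn(batch, n_heads, kv_len, head_dim))

past_key_values = [fake_layer_kv(kv_len=10) for _ in range(2)]

# Decoding one new token: each layer concatenates past and current KV
# along the sequence dimension (dim=2).
new_k, new_v = fake_layer_kv(kv_len=1)
k, v = past_key_values[0]
k = torch.cat([k, new_k], dim=2)
v = torch.cat([v, new_v], dim=2)
print(k.shape)  # torch.Size([1, 6, 11, 64])
```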
203
-
204
- ## Attention Masking
205
-
206
- When `attention_mask` is provided, the model converts it to a key-padding boolean mask:
207
 
208
- - `key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)`
209
 
210
- Then it builds:
211
 
212
- - causal constraint (positions cannot attend to future keys)
213
- - AND with `key_padding_mask` (mask out padded keys)
214
 
215
- Practical recommendation:
 
 
216
 
217
- - Use the standard HF convention: `attention_mask` values should be `1` for real tokens and `0` for padding tokens.
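The mask combination described above can be reproduced with plain tensors (an illustration of the semantics, not the module's exact code):

```python
import torch

batch, seq_len = 2, 5
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])  # 1 = real token, 0 = padding

# Key-padding mask broadcastable over [batch, heads, query, key].
key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)  # [B,1,1,S]
# Causal constraint: query t may attend only to keys <= t.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # [S,S]
combined = causal[None, None, :, :] & key_padding_mask               # [B,1,S,S]

# Query 2 of sequence 0 attends to keys 0..2 but not the padded keys 3 and 4.
print(combined[0, 0, 2])  # tensor([ True,  True,  True, False, False])
```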
218
 
219
- ## Length Constraints
220
 
221
- The model enforces:
 
 
 
222
 
223
- - `total_seq_len = past_length + seq_len <= config.max_seq_len`
224
 
225
- If `total_seq_len` exceeds `max_seq_len`, the model raises a `ValueError`.
226
 
227
- Default `max_seq_len` in the exported config for this checkpoint is `8192`.
228
 
229
- ## Default Hyperparameters
 
230
 
231
- The exported `config.json` for the QED-75M checkpoint sets:
232
 
233
- | Hyperparameter | Value |
234
- |---|---:|
235
- | Approx. parameter count | ~75M |
236
- | `n_layers` | 32 |
237
- | `d_model` | 384 |
238
- | `n_heads` | 6 |
239
- | `head_dim` | 64 |
240
- | `ffn_hidden_dim` | 1024 |
241
- | `vocab_size` | 49152 |
242
- | `max_seq_len` | 8192 |
243
- | `rope_theta` | 10000.0 |
244
- | `rms_norm_eps` | 1e-5 |
245
- | `dropout` | 0.0 |
246
- | `tie_word_embeddings` | true |
247
- | internal linear `bias` (QKV/MLP) | false |
248
 
249
- Tokenizer / special tokens (from exported `tokenizer_config.json`):
 
 
 
250
 
251
- - `<pad>` id `0`
252
- - `<bos>` id `1`
253
- - `<eos>` id `2`
254
- - `<unk>` id `3`
255
 
256
- ---
 
 
257
 
258
- # How to Get Started with the Model
259
 
260
- ```python
261
- import torch
262
- from transformers import AutoModelForCausalLM, AutoTokenizer
263
 
264
- repo_id = "levossadtchi/QED-75M"
265
 
266
- tokenizer = AutoTokenizer.from_pretrained(repo_id)
267
- model = AutoModelForCausalLM.from_pretrained(
268
- repo_id,
269
- trust_remote_code=True,
270
- torch_dtype=torch.bfloat16, # optional
271
- )
272
 
273
- inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
274
- out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50, temperature=0.8)
275
- print(tokenizer.decode(out[0], skip_special_tokens=True))
 
 
 
276
  ```
277
 
278
- For loss computation:
279
-
280
- - pass `labels` with the same shape as `input_ids`
281
- - use `-100` in positions you want to ignore.
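A standalone sketch of this loss convention with dummy tensors (the explicit shift shown here is the standard causal-LM pattern; the actual shifting is handled inside the model):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 5, 10)               # [batch, seq_len, vocab]
labels = torch.tensor([[4, 2, -100, 7, 1]])  # -100 positions are ignored

# Shift so that position t predicts token t+1.
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
    ignore_index=-100,
)
print(loss.item())  # a positive scalar
```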
282
-
283
  ---
284
 
285
- # Model Card Contact
286
-
287
- For questions or updates about this model card, use the Issues/Discussions in the code repository or contact the model owner on Hugging Face.
 
1
  ---
2
  license: mit
3
+ library_name: transformers
4
  pipeline_tag: text-generation
5
  tags:
6
+ - causal-lm
7
+ - pytorch
8
+ - decoder-only
9
+ - rope
10
+ - custom-architecture
 
 
11
  language:
12
+ - en
13
+ - ru
 
14
  ---
15
 
 
 
 
 
16
  # QED-75M
17
 
18
+ **QED-75M** is a compact decoder-only language model (~75M parameters) in the spirit of modern LLMs: **RoPE**, **RMSNorm**, **SwiGLU**, **causal self-attention** (via `scaled_dot_product_attention`), and **weight tying** between the input embeddings and the output projection. The architecture is weight-name-compatible with the internal **SLLM** training stack from the training repository.
19
 
20
+ The model is intended for **text generation** (causal LM) after pretraining and SFT; it is research/educational in scale, not production-grade like large commercial LLMs.
21
 
22
+ ## Key specifications
23
 
24
+ | Parameter | Value |
25
+ |----------|----------|
26
+ | Parameters | ~75M |
27
+ | Layers | 32 |
28
+ | `d_model` | 384 |
29
+ | Attention heads | 6 (`head_dim` = 64) |
30
+ | FFN (`ffn_hidden_dim`) | 1024 |
31
+ | Vocabulary | 49,152 (BPE) |
32
+ | Context | up to **8192** tokens |
33
+ | RoPE θ | 10,000 |
34
+ | Bias in block linear layers | none (`bias: false`) |
35
+ | LM head | with bias; weights tied to `embed_tokens` when `tie_word_embeddings: true` |
36
 
37
+ Special tokens: `<pad>` (0), `<bos>` (1), `<eos>` (2), `<unk>` (3).
38
 
39
+ ## Usage
40
 
41
+ **`trust_remote_code=True`** is required: the `QEDConfig` and `QEDForCausalLM` classes are loaded from `modeling_qed.py` in the model repository.
 
42
 
43
+ ```python
44
+ import torch
45
+ from transformers import AutoModelForCausalLM, AutoTokenizer
46
 
47
+ model_id = "YOUR_USERNAME/QED-75M" # replace with the repo id on the Hub
48
 
49
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
50
+ model = AutoModelForCausalLM.from_pretrained(
51
+ model_id,
52
+ trust_remote_code=True,
53
+ torch_dtype=torch.bfloat16, # optional, if supported
54
+ device_map="auto", # optional
55
+ )
56
 
57
+ inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
58
+ outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50, temperature=0.8)
59
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
60
+ ```
61
 
62
+ Generation uses the **legacy tuple KV cache** (for compatibility with `transformers.generate`); `supports_gradient_checkpointing` is `False` in the current implementation.
63
 
64
+ ## Training (overview)
65
 
66
+ The pipeline in the source repository:
67
 
68
+ 1. **Pretraining**: a mix of open corpora (configurable data mix); stage 1 with sequence lengths around **2048** tokens, then **annealing** at **8192**.
69
+ 2. **SFT**: instruct/dialogue data (including subsets of `HuggingFaceTB/smoltalk`, `HuggingFaceH4/ultrachat_200k`, and others, weighted as in the config); next-token labels only on the **assistant** parts of the dialogue.
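The assistant-only labeling rule can be sketched in plain Python (the token ids and role mask here are made up for illustration; the real pipeline derives the mask from the chat template):

```python
# <bos>, user tokens, assistant tokens, <eos> -- hypothetical ids.
input_ids = [1, 101, 102, 103, 201, 202, 2]
assistant_mask = [0, 0, 0, 0, 1, 1, 1]  # 1 on assistant tokens (incl. <eos>)

# -100 marks positions ignored by the cross-entropy loss.
labels = [tok if m == 1 else -100
          for tok, m in zip(input_ids, assistant_mask)]
print(labels)  # [-100, -100, -100, -100, 201, 202, 2]
```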
70
 
71
+ When publishing a specific checkpoint, state in the card the exact mix composition, the number of steps, and the checkpoint this release was built from (it is recommended to add a field to this README or to the release description on the Hub).
72
 
73
+ ## Limitations and risks
74
 
75
+ - A small model: limited reasoning, factual knowledge, and multilinguality compared to large LLMs.
76
+ - **Hallucinations** and outdated or incorrect statements are possible; do not use it as a sole source of truth.
77
+ - Behavior depends on the **prompt**, temperature, and post-processing; production use requires safety policies and output filtering.
78
+ - Loading **remote code** is a deliberate trade-off: trust only repositories from verified authors and pin a revision (`revision=...`) for reproducibility.
79
 
80
+ ## Repository files
81
 
82
+ - `config.json`: `QEDConfig` / `auto_map` for the Auto classes.
83
+ - `modeling_qed.py`: the model implementation for `transformers`.
84
+ - weights in **SafeTensors** (and/or PyTorch) format; the tokenizer (`tokenizer.json` and metadata).
85
 
86
+ ## License
87
 
88
+ Code and weights in this repository: **MIT** (see the `license` field above). The training data is covered by the licenses of its sources; when publishing, specify them in the "Datasets" section on the model page.
 
 
89
 
90
+ ## Citation
91
 
92
+ If you use this model in your work, cite the model repository on Hugging Face and, if available, the training code repository.
 
 
 
 
 
93
 
94
+ ```bibtex
95
+ @misc{qed-75m,
96
+ title = {QED-75M: A Small Decoder-Only Language Model},
97
+ howpublished = {\url{https://huggingface.co/YOUR_USERNAME/QED-75M}},
98
+ note = {Accessed: YYYY-MM-DD}
99
+ }
100
  ```
101
 
 
 
 
 
 
102
  ---
103
 
104
+ *This card is kept consistent with the architecture in `hf_hub/modeling_qed.py` and the export config `config.json`.*
 
 
modeling_qed.py CHANGED
@@ -216,6 +216,87 @@ class QEDForCausalLM(PreTrainedModel, GenerationMixin):
216
  """Use legacy tuple KV cache; DynamicCache expects standard HF config fields."""
217
  return False
218
 
219
+ @torch.no_grad()
220
+ def _sample_next_token(
221
+ self,
222
+ next_token_logits: torch.Tensor,
223
+ temperature: float,
224
+ top_k: int | None,
225
+ ) -> torch.Tensor:
226
+ """
227
+ Sample next token from logits.
228
+ Matches behavior of the training-time SLLM generator.
229
+ """
230
+ if temperature <= 0:
231
+ return torch.argmax(next_token_logits, dim=-1, keepdim=True)
232
+
233
+ next_token_logits = next_token_logits / temperature
234
+ if top_k is not None and top_k > 0:
235
+ top_k = min(top_k, next_token_logits.size(-1))
236
+ values, _ = torch.topk(next_token_logits, top_k)
237
+ cutoff = values[:, [-1]]
238
+ next_token_logits = next_token_logits.masked_fill(
239
+ next_token_logits < cutoff, float("-inf")
240
+ )
241
+ probs = F.softmax(next_token_logits, dim=-1)
242
+ return torch.multinomial(probs, num_samples=1)
243
+
244
+ @torch.no_grad()
245
+ def generate(
246
+ self,
247
+ input_ids: torch.LongTensor,
248
+ max_new_tokens: int = 128,
249
+ temperature: float = 0.8,
250
+ top_k: int | None = 50,
251
+ eos_token_id: Optional[int] = None,
252
+ do_sample: bool = False,
253
+ **kwargs,
254
+ ) -> torch.LongTensor:
255
+ """
256
+ Generate tokens using the same logic as `src/sllm/model.py::SLLMForCausalLM.generate`.
257
+
258
+ We override HF's `GenerationMixin.generate()` because its cache/position semantics can differ from
259
+ this model's legacy KV cache path. This makes HF inference match your local script output.
260
+ """
261
+ _ = kwargs
262
+ if eos_token_id is None:
263
+ eos_token_id = getattr(self.config, "eos_token_id", None)
264
+
265
+ # For compatibility: if caller doesn't want sampling, force greedy decoding.
266
+ if not do_sample:
267
+ temperature = 0.0
268
+
269
+ generated = input_ids[:, -self.config.max_seq_len :]
270
+ outputs = self(generated, use_cache=True)
271
+ past_key_values = outputs.past_key_values
272
+ next_token_logits = outputs.logits[:, -1, :]
273
+
274
+ for _ in range(max_new_tokens):
275
+ next_token = self._sample_next_token(
276
+ next_token_logits, temperature=temperature, top_k=top_k
277
+ )
278
+ generated = torch.cat([generated, next_token], dim=1)
279
+
280
+ if eos_token_id is not None and torch.all(next_token.squeeze(-1) == eos_token_id):
281
+ break
282
+
283
+ if generated.size(1) >= self.config.max_seq_len:
284
+ # Sliding window when the context is full.
285
+ context = generated[:, -self.config.max_seq_len :]
286
+ outputs = self(context, use_cache=True)
287
+ else:
288
+ # One-step decode with cached KV.
289
+ outputs = self(
290
+ next_token,
291
+ past_key_values=past_key_values,
292
+ use_cache=True,
293
+ )
294
+
295
+ past_key_values = outputs.past_key_values
296
+ next_token_logits = outputs.logits[:, -1, :]
297
+
298
+ return generated
299
+
300
  def __init__(self, config: QEDConfig) -> None:
301
  super().__init__(config)
302
  self.embed_tokens = nn.Embedding(config.vocab_size, config.d_model)
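The top-k plus temperature filtering in `_sample_next_token` above can be checked standalone on a dummy logits row (plain tensors, no model needed):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])
temperature, top_k = 0.8, 2

scaled = logits / temperature
values, _ = torch.topk(scaled, top_k)
cutoff = values[:, [-1]]  # smallest logit that survives the top-k filter
filtered = scaled.masked_fill(scaled < cutoff, float("-inf"))
probs = F.softmax(filtered, dim=-1)

print(probs)  # only the top-2 tokens keep nonzero probability
```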
vocab.json CHANGED
The diff for this file is too large to render. See raw diff