|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- trl |
|
|
- sft |
|
|
- gemma |
|
|
- qwen |
|
|
- merge |
|
|
- disc |
|
|
license: osl-3.0 |
|
|
datasets: |
|
|
- HuggingFaceH4/ultrachat_200k |
|
|
- TIGER-Lab/MathInstruct |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen3-1.7B |
|
|
- google/gemma-3-1b-it |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
# Model Card for Qemma-Q-1.7B |
|
|
## Gap Envelope Integral |
|
|
* My mathematical formulation that uses space projections to "measure" the jump between points of discontinuity in non-differentiable functions.
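
The formulation itself isn't reproduced on this card. As a hedged point of reference only (not the author's exact construction), the classical "jump" that such a measure quantifies at a discontinuity point is:

```latex
% Standard jump of f at a discontinuity x_0, shown only as context for
% the gap-envelope idea above; the projection machinery is not spelled
% out on this card.
J_f(x_0) = \lim_{\epsilon \to 0^{+}} \left[ f(x_0 + \epsilon) - f(x_0 - \epsilon) \right]
```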
|
|
## Redux |
|
|
* This model underwent an additional merge between Qemma-redux and Qwen3-1.7B, and RoPE scaling was added.
|
|
### Additionally |
|
|
* Fusion logic was updated to aid per-layer fusion and post-fusion embedding alignment.
|
|
* **Qemma** is a Hugging Face-native hybrid model that merges **Gemma-3 (1B)** and **Qwen-3 (1.7B)** at the weight level (no adapters).

* Design: Gemma MLP/body + Qwen attention/head, projected and aligned to Gemma's hidden size (a hedged sketch of this projection step follows this list). The model is then SFT-tuned for stepwise reasoning.

* This variant uses YaRN-based RoPE scaling with a 1:* ratio from `max_position_embeddings = 242144`.
|
|
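For intuition, here is a minimal sketch of the projection step the design bullet refers to. Everything below is illustrative: `project_linear`, the random-projection choice, and the tensor shapes are hypothetical stand-ins, not the actual fusion code.

```python
import torch

def project_linear(weight: torch.Tensor, d_out: int, d_in: int) -> torch.Tensor:
    """Map a source weight matrix onto a target (d_out, d_in) shape.

    Hypothetical helper: fixed Gaussian projections, scaled so that
    magnitudes stay roughly stable after the mapping.
    """
    torch.manual_seed(0)  # deterministic projections for reproducibility
    p_out = torch.randn(d_out, weight.shape[0]) / weight.shape[0] ** 0.5
    p_in = torch.randn(weight.shape[1], d_in) / weight.shape[1] ** 0.5
    return p_out @ weight @ p_in

# Example: carry a Qwen-sized q_proj (2048 -> 2048) into Gemma's
# attention geometry (4 heads x 256 = 1024 out, hidden size 1152 in).
qwen_q = torch.randn(2048, 2048)
gemma_q = project_linear(qwen_q, d_out=4 * 256, d_in=1152)
print(gemma_q.shape)  # torch.Size([1024, 1152])
```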
|
|
## Quick start |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qemma-Q1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

# The model was tuned to open its answers with reasoning tags.
text = (
    "<|user|>"
    "What makes the sky blue?"
    "<|assistant|>"
    "<think><reasoning_step>"
)

inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, min_length=32)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
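
The raw `<|user|>` string above mirrors how the model was tuned. Since the repo also ships the Gemma-3 chat template (see "What's inside" below), prompts can equally be built with the standard `apply_chat_template` call; this continues from the snippet above:

```python
# Continuing from the quick-start snippet: tokenizer and model are loaded.
messages = [{"role": "user", "content": "What makes the sky blue?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # open the assistant turn for generation
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```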
|
|
|
|
|
## What’s inside |
|
|
|
|
|
* **Architecture:**
  * **Gemma-3 backbone** (26 layers, hidden 1152, MLP 6912)
  * **Qwen-style attention** regrouped to Gemma's 4×256 heads (Qwen-3 source geometry: head_dim=128, hidden_size=2048, intermediate_size=6144, num_attention_heads=16, num_key_value_heads=8, num_hidden_layers=28)
|
|
* **Tokenizer:** Gemma-3 tokenizer and chat template (see `chat_template.jinja`). |
|
|
* **Training:** SFT for instruction following and stepwise reasoning. |
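
To check the merged geometry and the RoPE setup locally, the standard config fields can be inspected (values should match the numbers above; `rope_scaling` is read defensively since its exact shape depends on the export):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("reaperdoesntknow/Qemma-Q1.7B")

print(config.num_hidden_layers)        # transformer depth after the merge
print(config.hidden_size)              # Gemma-side hidden width
print(config.num_attention_heads)      # attention heads after regrouping
print(config.max_position_embeddings)  # extended context window
print(getattr(config, "rope_scaling", None))  # YaRN scaling dict, if set
```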
|
|
|
|
|
## Intended use & limitations |
|
|
|
|
|
**Use:** research, instruction following, code/help, analysis, further SFT/RLHF. |
|
|
**Limits:** may hallucinate; not for safety-critical, medical, legal, or financial decisions. Follow dataset/model licenses. |
|
|
|
|
|
## Training procedure |
|
|
|
|
|
* ~512 warm-start steps on HuggingFaceH4/ultrachat_200k. A small post-fusion training round (8 steps) was run to encourage embedding realignment.
|
|
* ~256 SFT steps on TIGER-Lab/MathInstruct + HuggingFaceH4/ultrachat_200k.
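
For readers who want to reproduce a comparable run, a minimal TRL setup looks like the sketch below. Hyperparameters other than the step count are illustrative assumptions, and the dataset mixing used for the second round is not shown:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Warm-start data; the MathInstruct mix for the SFT round is analogous.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

args = SFTConfig(
    output_dir="qemma-sft",
    max_steps=256,                  # matches the ~256 SFT steps above
    per_device_train_batch_size=2,  # assumption, not from the card
    learning_rate=2e-5,             # assumption, not from the card
)

trainer = SFTTrainer(
    model="reaperdoesntknow/Qemma-Q1.7B",
    args=args,
    train_dataset=dataset,
)
trainer.train()
```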
|
|
|
|
|
|
|
|
### Framework versions |
|
|
|
|
|
* TRL: 0.25.0 |
|
|
* Transformers: 4.57.1 |
|
|
* PyTorch: 2.8.0+cpu
|
|
* Datasets: 4.4.1 |
|
|
* Tokenizers: 0.22.1 |
|
|
|
|
|
## Citations |
|
|
|
|
|
|
|
|
|
|
|
Cite TRL as: |
|
|
|
|
|
```bibtex |
|
|
@misc{vonwerra2022trl, |
|
|
title = {{TRL: Transformer Reinforcement Learning}}, |
|
|
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec}, |
|
|
year = 2020, |
|
|
journal = {GitHub repository}, |
|
|
publisher = {GitHub}, |
|
|
howpublished = {\url{https://github.com/huggingface/trl}} |
|
|
} |
|
|
``` |