---
library_name: transformers
tags:
- trl
- sft
- gemma
- qwen
- merge
- disc
license: osl-3.0
datasets:
- HuggingFaceH4/ultrachat_200k
- TIGER-Lab/MathInstruct
language:
- en
base_model:
- google/gemma-3-1b-it
- Qwen/Qwen3-14B
pipeline_tag: text-generation
---

# Model Card for Qemma-Q14B

## Gap Envelope Integral

* My mathematical formulation that uses space projections to "measure" the jump between points of discontinuity in non-differentiable functions (a short background note on the jump it refers to appears in the appendix at the end of this card).

## Redux

* This model underwent an additional merge between Qemma-redux and Qwen3-14B, and rope scaling was added.

### Additionally

* Fusion logic was updated to support per-layer fusion and post-fusion embedding alignment.
* **Qemma** is a HuggingFace-native hybrid model that merges **Gemma-3 (1B)** and **Qwen-3 (14B)** at the weight level (no adapters).
* Design: Gemma MLP/body + Qwen attention/head, projected and aligned to Gemma’s hidden size (an illustrative projection sketch appears in the appendix). The model is then SFT-tuned for stepwise reasoning.
* This variant uses YaRN-based rope scaling with a 1:* ratio from `max_position_embeddings = 524288` (a config-inspection sketch appears in the appendix).

## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qemma-Q14B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

text = (
    "<|user|>"
    "What makes the sky blue?"
    "<|assistant|>"
)

inputs = tokenizer(text, return_tensors="pt", max_length=64, padding="max_length", truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, min_length=32)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

A chat-template based variant of this example appears in the appendix.

## What’s inside

* **Architecture:**
  * **Gemma-3 backbone** (26 layers, hidden size 1152, MLP 6912)
  * **Qwen-style attention** regrouped to Gemma’s 4×256 heads (from the Qwen3-14B source: head_dim=128, hidden_size=5120, intermediate_size=17408, num_attention_heads=40, num_key_value_heads=8, num_hidden_layers=40)
* **Tokenizer:** Gemma-3 tokenizer and chat template (see `chat_template.jinja`).
* **Training:** SFT for instruction following and stepwise reasoning.

## Intended use & limitations

**Use:** research, instruction following, code/help, analysis, further SFT/RLHF.

**Limits:** may hallucinate; not for safety-critical, medical, legal, or financial decisions. Follow dataset/model licenses.

## Training procedure

* ~512 warm-start steps on HuggingFaceH4/ultrachat_200k. A small post-fusion training round (8 steps) was done to encourage embedding realignment.
* ~256 SFT steps on TIGER-Lab/MathInstruct + HuggingFaceH4/ultrachat_200k (a TRL sketch appears in the appendix).

### Framework versions

* TRL: 0.25.0
* Transformers: 4.57.1
* PyTorch: 2.8.0+cpu
* Datasets: 4.4.1
* Tokenizers: 0.22.1

## Citations

Cite TRL as:

```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year         = 2020,
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```
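## Appendix

### Background: jump at a discontinuity

The Gap Envelope Integral itself is not spelled out on this card. As background only, the quantity it is described as measuring, the jump of a function at a point of discontinuity, is classically defined as:

```latex
% Standard jump of f at a discontinuity x_0 (background only,
% not the Gap Envelope Integral itself):
[f](x_0) = \lim_{x \to x_0^{+}} f(x) - \lim_{x \to x_0^{-}} f(x)
```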
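### Checking the rope-scaling configuration

To see what rope-scaling settings actually shipped with the checkpoint, the standard `transformers` config fields can be inspected. The sketch below assumes the repo id used in the quick start and that the settings live in `config.json` under the usual keys.

```python
from transformers import AutoConfig

# Inspect context length and rope scaling as shipped in config.json.
# "max_position_embeddings" and "rope_scaling" are the standard
# transformers config keys; the exact values depend on the uploaded config.
cfg = AutoConfig.from_pretrained("reaperdoesntknow/Qemma-Q14B")
print("max_position_embeddings:", cfg.max_position_embeddings)
print("rope_scaling:", getattr(cfg, "rope_scaling", None))
```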
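### Prompting via the bundled chat template

The quick start hard-codes `<|user|>` / `<|assistant|>` markers. Since the card ships a Gemma-3 chat template (`chat_template.jinja`), a sketch that lets the tokenizer build the prompt is shown below; it assumes the bundled template accepts the standard `user` role.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qemma-Q14B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

# Let the bundled chat template build the prompt instead of hand-written tags.
messages = [{"role": "user", "content": "What makes the sky blue?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```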
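### Illustrative width-alignment sketch

The actual fusion code is not published on this card. Purely to illustrate the kind of width alignment described under "What’s inside" (Qwen hidden size 5120 projected to Gemma’s 1152), here is a toy orthogonal-projection sketch; it is not the model's per-layer fusion or embedding-alignment procedure.

```python
import torch

def project_columns(w_src: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Map a source weight (out_features, 5120) onto a narrower hidden size
    via a fixed projection (5120, 1152). Toy illustration only."""
    return w_src @ proj

# Toy dimensions matching the card: Qwen hidden 5120 -> Gemma hidden 1152.
proj = torch.linalg.qr(torch.randn(5120, 1152)).Q  # orthonormal columns
qwen_like = torch.randn(4 * 256, 5120)             # e.g. a q_proj-shaped weight
gemma_sized = project_columns(qwen_like, proj)
print(gemma_sized.shape)                            # torch.Size([1024, 1152])
```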
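### TRL SFT sketch

A minimal TRL sketch in the spirit of the SFT rounds described above. The exact recipe (packing, chat formatting, learning rate, hardware) is not published; the split, step count, and `output_dir` below are placeholders.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Illustrative SFT round; step count and split are placeholders,
# not the card's exact training recipe.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

trainer = SFTTrainer(
    model="reaperdoesntknow/Qemma-Q14B",
    train_dataset=dataset,
    args=SFTConfig(output_dir="qemma-sft", max_steps=256, per_device_train_batch_size=1),
)
trainer.train()
```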