---
license: cc-by-4.0
base_model: TildeAI/TildeOpen-30b-64k
base_model_relation: finetune
language:
- de
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- rag
- retrieval-augmented-generation
- summarization
- information-extraction
- instruction-following
- german
- english
- chatml
datasets:
- nvidia/Nemotron-Instruction-Following-Chat-v1
- DiscoResearch/germanrag
- abisee/cnn_dailymail
- wikimedia/wikipedia
---
# pinktilde32
A chat/instruct model specialized for **retrieval-augmented generation (RAG), summarization,
information extraction, and structured Markdown output**, fine-tuned from
[**TildeAI/TildeOpen-30b-64k**](https://huggingface.co/TildeAI/TildeOpen-30b-64k), a 30B European
multilingual base model whose 64k context window was extended via YaRN. Focus languages: **German and English**.
## Intended use
- Answering questions **strictly from a provided context** (RAG), with source citations `[n]`.
- **Honest refusal** when the answer is not in the context, instead of guessing.
- **Summarization** and **information extraction** from long inputs.
- **Structured output** in Markdown (headings, bullet lists, tables).
Not intended for: code generation, free-standing factual answers without context, clinical/legal advice.
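For the RAG use case, passages need to be numbered so the model can cite them as `[n]`, and cited numbers can be checked against the passages actually supplied. The following is a minimal sketch of both steps. The helper names (`build_context`, `cited_sources`, `invalid_citations`) and the sample passages and answer are hypothetical illustrations, not part of the model or its tooling.

```python
import re


def build_context(passages):
    """Number passages as [1], [2], ... so answers can cite them."""
    return "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))


def cited_sources(answer):
    """Return the sorted, de-duplicated citation numbers found in an answer."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})


def invalid_citations(answer, num_passages):
    """Return cited numbers that do not correspond to any supplied passage."""
    return [n for n in cited_sources(answer) if not 1 <= n <= num_passages]


# Illustrative inputs only; the second passage is a placeholder.
passages = [
    "Muster AG reported revenue of EUR 142M in 2025.",
    "A second, unrelated passage.",
]
context = build_context(passages)
answer = "Revenue in 2025 was **EUR 142M** [1]."  # hypothetical model output

print(context)
print(cited_sources(answer))
print(invalid_citations(answer, len(passages)))
```

A non-empty result from `invalid_citations` suggests the answer cites a source that was never provided and should be reviewed. The regular expression also matches Markdown link labels such as `[1](...)`, so stricter parsing may be needed for link-heavy answers.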
## Prompt format
The model uses the **ChatML** format (`<|im_start|>` / `<|im_end|>`). Recommended system prompt (the RAG contract):
```
Answer the question or extract the information STRICTLY from the provided context.
Cite the sources you use as [n]. Present the answer in clear Markdown structure.
If the information is not in the context, say so honestly and do not guess.
```
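To make the format concrete, the sketch below renders a message list into a ChatML string by hand. It assumes the conventional ChatML layout (role on the header line, content on the following lines, each turn closed by `<|im_end|>`). The authoritative template is the one shipped with the tokenizer, and `render_chatml` is a hypothetical helper for inspection only.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in the conventional ChatML layout (assumed, not verified
    against this model's chat template)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Answer strictly from the context. Cite sources as [n]."},
    {"role": "user", "content": "Context:\n[1] ...\n\nQuestion: ..."},
]
prompt = render_chatml(messages)
print(prompt)
```

In practice, `tok.apply_chat_template(...)` is preferable to hand-rendering because it always applies the template stored with the tokenizer.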
### Example
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Bogula/pinktilde32"
tok = AutoTokenizer.from_pretrained(model_id)
# Older transformers releases name this argument torch_dtype instead of dtype.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

# Condensed form of the RAG contract.
system = ("Answer strictly from the context. Cite sources as [n]. Use Markdown. "
          "If the info is missing, say so honestly.")
# Number each passage so the model can cite it as [n].
context = "[1] Muster AG reported revenue of EUR 142M in 2025.\n[2] ..."
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: What was the 2025 revenue?"},
]

# Apply the chat template and open the assistant turn.
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt", return_dict=True).to(model.device)
# temperature only takes effect when sampling is enabled.
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.3,
                     eos_token_id=tok.convert_tokens_to_ids("<|im_end|>"))
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Training
- **Method:** LoRA SFT (all linear layers + `embed_tokens`/`lm_head`), then merged into the base model.
- **Training context length:** 32k (`sequence_len=32768`, sample packing).
- **Format:** chatml; loss computed on assistant turns only.
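Computing the loss on assistant turns only is commonly implemented by masking the labels of all other tokens. The sketch below illustrates the idea on a toy sequence. It is not the training code used for this model, and the token IDs, the per-token role list, and `mask_non_assistant` are hypothetical. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_non_assistant(token_ids, roles):
    """Keep labels for assistant tokens and ignore system/user tokens."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


# Toy sequence: 3 system tokens, 2 user tokens, 3 assistant tokens.
token_ids = [11, 12, 13, 21, 22, 31, 32, 33]
roles = ["system"] * 3 + ["user"] * 2 + ["assistant"] * 3

labels = mask_non_assistant(token_ids, roles)
print(labels)
```

With labels built this way, the prompt and context still condition the model through attention, but only the assistant's tokens contribute to the gradient.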
### Data mix
| Source | Language | Purpose |
| --- | --- | --- |
| nvidia/Nemotron-Instruction-Following-Chat-v1 | EN | Instruction / format adherence, structured outputs |
| DiscoResearch/germanrag | DE | RAG grounding with citations + "unanswerable" cases |
| abisee/cnn_dailymail | EN | Summarization (Markdown) |
| wikimedia/wikipedia (de, business/psychology) | DE | Summarization (Markdown) |
| Internal company dialogues | DE | Domain / style anchor |
## Limitations
- **Long context:** The target behaviors (grounding, formatting) were trained on sequences of up to
~32k tokens. For inputs between 32k and 64k tokens, only the base long-context capability of
TildeOpen applies, and reliability may degrade.
- **Language balance:** The instruction-following data is English; German format adherence benefits
from transfer but may lag behind English.
- May still occasionally hallucinate or imperfectly follow formatting instructions. Verify outputs.
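One way to act on the long-context limitation is to classify a prompt by its token count before generating. The sketch below is an assumption-laden illustration: `TRAINED_LEN` comes from the `sequence_len=32768` setting in the Training section, `MAX_LEN` assumes that "64k" means 64 × 1024 tokens, and `context_regime` and its return labels are hypothetical.

```python
TRAINED_LEN = 32768  # sequence_len used during fine-tuning
MAX_LEN = 65536      # assumption: "64k" interpreted as 64 * 1024 tokens


def context_regime(num_tokens):
    """Classify a token count against the fine-tuned and base context ranges."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if num_tokens <= TRAINED_LEN:
        return "trained"    # grounding/formatting behavior was trained here
    if num_tokens <= MAX_LEN:
        return "base-only"  # relies on the base model's long-context ability
    return "too-long"       # beyond the assumed context window


for n in (8_000, 40_000, 70_000):
    print(n, context_regime(n))
```

With the tokenized inputs from the example above, the count would be `inputs["input_ids"].shape[1]`, ideally plus the planned `max_new_tokens`. For "base-only" inputs, splitting the context or retrieving fewer passages may be safer than relying on the extended window.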
## License & attribution
The base model **TildeOpen-30b-64k** is licensed under **CC-BY-4.0**; this derivative is released under
the same license. Training data includes, among others: Nemotron-Instruction-Following-Chat-v1
(ODC-BY / CC-BY-4.0), DiscoResearch/germanrag (**CC-BY-SA-4.0**, derived from GermanDPR),
CNN/DailyMail, and German Wikipedia (**CC-BY-SA**).
> Note: Some training sources are under share-alike licenses (CC-BY-SA). Whether and to what extent
> these propagate to model weights is not legally settled. This is **not legal advice** — please verify
> license compliance for your specific use case and attribute the sources accordingly.