vdmbrsv committed on
Commit
289b9e7
·
verified ·
1 Parent(s): 4463029

Update weights: W4 attention surgery (stepped 0.1/0.3/0.6/0.8) - 27/30 instruction following

Browse files
README.md CHANGED
@@ -1,375 +1,112 @@
1
  ---
2
- library_name: transformers
3
- license_link: https://huggingface.co/Qwen/Qwen3-1.7B/blob/main/LICENSE
4
- pipeline_tag: text-generation
5
- extra_gated_prompt: >
6
- ### FAUST-1 NON-COMMERCIAL LICENSE AGREEMENT
7
-
8
-
9
- Version 1.0 — January 2025
10
-
11
-
12
- "Faust-1" refers to the language model weights, code, and documentation made
13
- available by Tabularis AI GmbH ("Tabularis") under this agreement.
14
-
15
-
16
- 1. License Grant
17
-
18
- You are granted a non-exclusive, non-transferable, royalty-free license to
19
- use, copy, and modify Faust-1 for non-commercial research and personal
20
- purposes only.
21
-
22
-
23
- 2. Non-Commercial Use
24
-
25
- "Non-commercial" means academic research, personal projects, and educational
26
- use. Any use intended to generate revenue, provide commercial services, or
27
- benefit a for-profit entity requires a separate commercial license.
28
-
29
-
30
- 3. Commercial Licensing
31
-
32
- For commercial use, please contact: info@tabularis.ai
33
-
34
-
35
- 4. Attribution
36
-
37
- You must include "Built with Faust-1 by Tabularis AI" in any derivative work
38
- or publication.
39
-
40
-
41
- 5. No Warranty
42
-
43
- Faust-1 is provided "as is" without warranties of any kind.
44
-
45
-
46
- 6. Termination
47
-
48
- This license terminates automatically if you violate any terms.
49
-
50
-
51
- ---
52
-
53
- ### Additional Access Requirement
54
-
55
- Access to this repository is approval-based.
56
-
57
- You must join our Discord server: https://discord.gg/7WqEKw652R
58
- extra_gated_fields:
59
- Name: text
60
- Email: text
61
- Affiliation: text
62
- I have joined the Tabularis AI Discord server: checkbox
63
- I accept the Faust-1 Non-Commercial License Agreement: checkbox
64
- extra_gated_description: |
65
- Faust-1 is for non-commercial use only.
66
- For commercial licensing contact info@tabularis.ai
67
-
68
- Approval requires Discord membership.
69
- Join: https://discord.gg/7WqEKw652R
70
- extra_gated_button_content: Submit
71
  language:
72
- - de
73
- - en
74
  tags:
75
- - llama.cpp
76
- - synthetic data
 
 
 
 
77
  ---
78
 
 
79
 
80
- <!-- <a href="https://faust.tabularis.ai/" target="_blank" style="margin: 2px;">
81
- <img
82
- alt="Faust-1 Demo"
83
- src="https://img.shields.io/badge/%E2%9C%A8%20Faust--1%20Demo-2b2b2b?style=flat&logo=ai&logoColor=white"
84
- style="display: inline-block; vertical-align: middle;"
85
- />
86
- </a> -->
87
-
88
-
89
- <p align="center">
90
- <img src="./logo-faust.webp" alt="Faust-1 Logo" width="220">
91
- </p>
92
 
93
- # Faust-1 German-First Large Language Model (1.6B)
94
 
95
- Faust-1 is a German-first large language model with 1.6B parameters, trained entirely from scratch. Model development comprises large-scale data collection and synthetic data generation, followed by data cleaning, normalization, and deduplication to reduce contamination and redundancy. Pre-training is performed on a predominantly German corpus using a decoder-only language modeling objective, resulting in a foundation model for the German language that captures lexical, syntactic, and semantic regularities at scale.
96
 
97
- Following pre-training, the model undergoes supervised post-training (instruction tuning) using labeled input–output pairs to adapt the base model for conversational and task-oriented use. In later stages, preference-based optimization, including Direct Preference Optimization (DPO), is applied to improve response quality, stability, and alignment with human expectations, while preserving the efficiency constraints required for small-scale and local deployment.
98
 
99
- Demo: [faust.tabularis.ai](https://faust.tabularis.ai)
100
101
 
102
- > [!TIP]
103
- > **Designed for local and cost-efficient deployment.**
104
- > Faust-1 is deliberately sized and optimized to run on **consumer-grade hardware** and **does not require expensive data-center GPUs**.
105
- >
106
- > **Typical deployment examples:**
107
- > - **Laptop / Desktop (CPU or small GPU):**
108
- > Runs on modern CPUs or entry-level GPUs (e.g. Apple Silicon, RTX 3060/4060, RX 6600) using optimized runtimes such as GGUF, MLX, or ONNX.
109
- > - **Single-GPU workstation:**
110
- > Efficiently serves interactive workloads on a single consumer GPU with low VRAM requirements compared to larger multilingual models.
111
- > - **On-device / privacy-sensitive setups:**
112
- > Suitable for local assistants, offline document analysis, and private RAG pipelines where data must not leave the machine.
113
- >
114
- > This makes Faust-1 practical for **researchers, developers, and small teams** who want strong German language performance without cloud dependency or high inference costs.
115
- ---
116
-
117
- ## Model summary
118
 
119
- - Repository: tabularisai/Faust-1
120
- - Model type: decoder-only causal language model MoE
121
- - Parameters: 1.6B
122
- - Interface: conversational / instruction (chat template provided)
123
- - Primary language: German (~90%)
124
- - Custom State-of-the-Art tokenizer for German language
125
 
126
- ---
127
 
128
- ## Quickstart
 
 
 
129
 
130
- ### Conversational usage (recommended)
131
 
132
- ```python
133
- from transformers import AutoTokenizer, AutoModelForCausalLM
134
- import torch
135
 
136
- model_id = "tabularisai/Faust-1"
 
 
 
137
 
138
- tokenizer = AutoTokenizer.from_pretrained(model_id)
139
- model = AutoModelForCausalLM.from_pretrained(
140
- model_id,
141
- torch_dtype=torch.float16,
142
- device_map="auto",
143
- )
144
 
145
- messages = [
146
- {"role": "user", "content": "Gib mir eine kurze Einführung in große Sprachmodelle (LLM)."}
147
- ]
148
-
149
- inputs = tokenizer.apply_chat_template(
150
- messages,
151
- add_generation_prompt=True,
152
- return_tensors="pt",
153
- ).to(model.device)
154
-
155
- outputs = model.generate(
156
- inputs,
157
- max_new_tokens=256,
158
- temperature=0.6,
159
- do_sample=True,
160
- )
161
-
162
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
163
- ```
164
-
165
- ---
166
-
167
- ## Conditional Generation
168
 
169
  ```python
170
- !pip install git+https://github.com/tabularis-ai/guidegen.git
171
-
172
- import sys
173
- import os
174
- import json
175
- import time
176
-
177
- import guidegen as gg
178
- from pydantic import BaseModel, Field
179
- from typing import Literal, List
180
-
181
- # Hugging Face access token - set via environment variable or .env file
182
- # You can set it with: export HUGGINGFACE_HUB_TOKEN=your_token_here
183
- # Or create a .env file with: HUGGINGFACE_HUB_TOKEN=your_token_here
184
-
185
- MODEL_NAME = "tabularisai/Faust-1"
186
-
187
-
188
- # --- Schema ---
189
- class EmailSummary(BaseModel):
190
- """Structured summary of an email."""
191
- Absender: str = Field(description="Der Name des Absenders.")
192
- Betreff: str = Field(description="Worum geht es in der E-Mail? (max 5 Wörter)")
193
- Zusammenfassung: str = Field(description="Kurze Zusammenfassung (max 2 Sätze).")
194
- Prioritaet: Literal["hoch", "mittel", "niedrig"] = Field(description="Wie wichtig die E-Mail ist.")
195
- # AntwortNoetig: bool = Field(description="Muss man auf die E-Mail antworten?")
196
-
197
-
198
- # --- Input ---
199
- email_text = """Hallo Jens,
200
-
201
- wir hatten uns bei CampusFounders im Rahmen unserer Pre-Seed-Runde kennengelernt.
202
- Seitdem haben wir große Fortschritte gemacht und bereiten aktuell unsere Seed-Runde vor.
203
-
204
- Wir entwickeln eine Infrastruktur für hocheffiziente, lokal trainierbare KI-Modelle – vollständig ohne Cloud.
205
- Sehr gern würden wir uns mit dir austauschen und prüfen, ob ein Intro zu US-VCs oder ein Gespräch mit Crestlight möglich wäre.
206
-
207
- Anbei ein kurzer OnePager zur Weiterleitung.
208
-
209
- Beste Grüße
210
- Ricard"""
211
-
212
 
 
 
213
 
214
- # --- Prompt ---
215
- prompt = f"""
216
- Du bist ein intelligenter Assistent, der E-Mails analysiert und als JSON zusammenfasst.
217
- Halte die Zusammenfassung kurz (1-2 Sätze). Betreff maximal 5 Wörter.
218
-
219
- --- Beispiel ---
220
- E-Mail-Text:
221
- Sehr geehrte Damen und Herren, ich wollte nur nachfragen, ob meine Bestellung #12345 schon versandt wurde. Vielen Dank, Max Mustermann
222
- JSON-Antwort:
223
- {{
224
- "Absender": "Max Mustermann",
225
- "Betreff": "Bestellstatus Anfrage",
226
- "Zusammenfassung": "Anfrage zum Versandstatus der Bestellung #12345.",
227
- "Prioritaet": "mittel",
228
- }}
229
- --- Ende Beispiel ---
230
-
231
- Jetzt analysiere die folgende E-Mail und erstelle das JSON-Objekt.
232
-
233
- E-Mail-Text:
234
- {email_text}
235
- """
236
-
237
-
238
- def main():
239
- print("=" * 60)
240
- print("EMAIL SUMMARIZATION WITH GUIDEGEN")
241
- print("=" * 60)
242
-
243
- print(f"\nLoading model: {MODEL_NAME}")
244
- load_start = time.time()
245
-
246
- gen = gg.GuideGen(
247
- MODEL_NAME,
248
- verbose=True,
249
- use_chat_template=True,
250
- enable_thinking=False,
251
- )
252
-
253
- load_time = time.time() - load_start
254
- print(f"Model loaded in {load_time:.2f}s")
255
-
256
- # --- Generate ---
257
- print("\nGenerating structured summary...")
258
- gen_start = time.time()
259
-
260
- options = gg.GuideGenOptions(
261
- temperature=0.6,
262
- max_tokens=400,
263
- do_sample=False,
264
- )
265
-
266
- summary = gen.generate(prompt, EmailSummary, options=options)
267
-
268
- gen_time = time.time() - gen_start
269
- print(f"Generation complete in {gen_time:.2f}s")
270
-
271
- # --- Output ---
272
- print("\n--- Email Summary (JSON) ---")
273
- print(json.dumps(summary.model_dump(), indent=2, ensure_ascii=False))
274
- print(f"\n Model load: {load_time:.2f}s | Generation: {gen_time:.2f}s | Total: {load_time + gen_time:.2f}s")
275
- ```
276
-
277
- ---
278
-
279
- ## Training focus
280
-
281
- ### German-first data distribution
282
-
283
- Faust-1 is trained from scratch with a German-dominant corpus. German syntax, compounding, morphology, and typical reasoning patterns are treated as the default operating regime rather than an edge case.
284
-
285
- ### Verified synthetic data
286
-
287
- A substantial portion of the training signal comes from synthetic data. To keep this signal usable, generation is paired with explicit verification and filtering:
288
-
289
- - LLM-as-judge style evaluations
290
- - rule-based and programmatic checks
291
- - consistency and self-agreement filtering
292
-
293
- This allows broad coverage of instruction-following and reasoning patterns while maintaining quality control.
294
-
295
- ---
296
-
297
- ## Tokenizer optimized for German
298
-
299
- Faust-1 uses a custom tokenizer optimized for German morphology and compounding. Token efficiency is treated as a deployment constraint, not just a preprocessing detail.
300
-
301
- ![Tokenizer efficiency on German language](tokenizer_bench.png)
302
-
303
- Lower token counts on German text translate directly into more usable context, lower inference cost, and less fragmentation on compound-heavy inputs.
304
-
305
-
306
- <img src="tokenizer_faust.png" alt="Faust-1 vs OpenAI Tokenizers" width="800">
307
-
308
-
309
- ---
310
-
311
- ## German benchmark performance
312
-
313
- Faust-1 is evaluated on a set of standard German-language benchmarks:
314
-
315
- - ARC_de
316
- - GSM8K_de
317
- - HellaSwag_de
318
- - MMLU_de
319
- - TruthfulQA_de
320
-
321
- ![German benchmark performance](faust_bench.png)
322
-
323
- The target is best-in-class performance within the 1–2B parameter range for German-focused models, using benchmarks that are easy to reproduce in Hugging Face-based evaluation pipelines.
324
-
325
- ---
326
-
327
- ## Deployment examples
328
-
329
- Faust-1 can be deployed with common inference stacks that support decoder-only language models.
330
-
331
- vLLM (OpenAI-compatible API)
332
- ```sh
333
- vllm serve tabularisai/Faust-1 --dtype float16
334
- ```
335
-
336
- SGLang
337
- ```sh
338
- python -m sglang.launch_server \
339
- --model-path tabularisai/Faust-1 \
340
- --dtype float16
341
- ```
342
 
343
- llama.cpp (GGUF, local / on-device)
344
- ```sh
345
- ./llama-cli \
346
- -m faust_1_q8_0.gguf \
347
- -p "Erkläre kurz, was ein großes Sprachmodell ist."
348
  ```
349
 
350
- The repository includes a prebuilt Q8_0 GGUF file for efficient local inference.
351
-
352
- ---
353
-
354
- ## Intended use
355
 
356
- - German conversational assistants
357
- - research and benchmarking on German NLP tasks
358
- - local and privacy-sensitive deployments
359
- - on-device or edge experimentation
360
-
361
- ---
362
-
363
- ## Roadmap
364
-
365
- - Reasoning-focused variant (coming soon)
366
- - Agent-oriented variant (coming soon)
367
-
368
- ---
369
 
370
  ## Citation
371
 
372
- A technical paper describing training methodology, tokenizer design, and evaluation is in preparation.
373
 
 
374
 
375
- Developed by [tabularis.ai](https://tabularis.ai) in Tübingen.
 
 
1
  ---
 
2
  language:
3
+ - de
4
+ - en
5
+ license: apache-2.0
6
+ library_name: transformers
7
+ base_model:
8
+ - tabularisai/Faust-1
9
+ - Qwen/Qwen3-1.7B
10
  tags:
11
+ - merge
12
+ - german
13
+ - medical
14
+ - instruction-following
15
+ - attention-surgery
16
+ pipeline_tag: text-generation
17
  ---
18
 
19
+ # Faust-1-Merged
20
 
21
+ **German language model with enhanced instruction following via attention surgery.**
 
22
 
23
+ ## What is this?
24
 
25
+ This is [tabularisai/Faust-1](https://huggingface.co/tabularisai/Faust-1) (1.7B, Qwen3 architecture, custom German tokenizer) with its self-attention weights partially replaced by those of the [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) base model, improving instruction following while preserving Faust's German language capabilities.
26
 
27
+ ## Merge Method: Attention Surgery
28
 
29
+ Unlike traditional model merging (SLERP, TIES, DARE), this uses **targeted attention-only surgery** with a stepped alpha schedule:
30
 
31
+ | Layer Range | Alpha | Effect |
32
+ |------------|-------|--------|
33
+ | 0-6 (early) | 0.1 | Light touch — protect embedding-adjacent layers |
34
+ | 7-13 (mid-early) | 0.3 | Moderate blend |
35
+ | 14-20 (mid-late) | 0.6 | Strong instruction signal |
36
+ | 21-27 (late) | 0.8 | Maximum instruction following |
37
 
38
+ **Key insight:** Only self-attention weights are modified. All MLP weights (which store factual knowledge and vocabulary) remain 100% Faust. This preserves German language quality while importing Qwen3's instruction-following behavior from its attention routing.
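
The stepped schedule above can be sketched as a plain state-dict blend. This is a minimal illustration of the idea, not the actual surgery script; the parameter names follow the usual Qwen3 layout (e.g. `model.layers.12.self_attn.q_proj.weight`), and `faust_sd`/`qwen_sd` are assumed to be the two models' state dicts:

```python
# Minimal sketch of attention-only surgery with a stepped alpha schedule.
# Hypothetical helper, not the actual merge code.
import re

def quartile_alpha(layer: int) -> float:
    """Stepped alpha from the table above."""
    if layer <= 6:
        return 0.1
    if layer <= 13:
        return 0.3
    if layer <= 20:
        return 0.6
    return 0.8

def surgery(faust_sd: dict, qwen_sd: dict) -> dict:
    merged = {}
    pattern = re.compile(r"model\.layers\.(\d+)\.self_attn\.")
    for name, w in faust_sd.items():
        m = pattern.match(name)
        if m and name in qwen_sd:
            # Linear interpolation: only attention tensors move toward Qwen3.
            a = quartile_alpha(int(m.group(1)))
            merged[name] = (1.0 - a) * w + a * qwen_sd[name]
        else:
            # MLP, embeddings, norms, lm_head stay 100% Faust.
            merged[name] = w
    return merged
```

The arithmetic works the same whether the values are torch tensors or plain floats; a real script would load both checkpoints, blend, and save back to safetensors.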
39
 
40
+ ## Evaluation Results
41
 
42
+ Tested on 30 instruction-following tasks (deterministic, temperature=0):
43
 
44
+ | Model | Score | Accuracy |
45
+ |-------|-------|----------|
46
+ | **Faust-1-Merged** | **27/30** | **90%** |
47
+ | Faust-1 (original) | 25/30 | 83% |
48
 
49
+ ### Category Breakdown
50
 
51
+ | Category | Faust-1 | Faust-1-Merged |
52
+ |----------|---------|----------------|
53
+ | Format (lists, JSON, etc.) | 5/6 | 6/6 |
54
+ | Length control | 5/5 | 5/5 |
55
+ | Language (German, formal) | 3/4 | 4/4 |
56
+ | Constraints (forbidden words) | 4/5 | 4/5 |
57
+ | Structured output | 4/4 | 3/4 |
58
+ | Medical (Arztbrief) | 3/3 | 3/3 |
59
+ | Role playing | 2/3 | 2/3 |
60
 
61
+ ### Improvements over baseline:
62
+ - ✅ One-word answers (strict format compliance)
63
+ - ✅ No-English constraint (pure German output)
64
+ - ✅ Required word inclusion
65
 
66
+ ### Known limitations:
67
+ - ❌ "End with word": both models struggle
68
+ - ❌ "Refuse off-topic" — requires SFT for proper role boundaries
69
+ - ❌ Markdown tables sometimes missing proper separators
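
Checks of this kind are easy to script deterministically. A minimal sketch of two format checkers in the spirit of the categories above (hypothetical helpers, not the actual eval harness):

```python
# Hypothetical format-compliance checks, illustrating the kind of
# deterministic tests used in the categories above.
import re

def is_one_word(answer: str) -> bool:
    """Strict one-word format compliance."""
    return len(answer.strip().split()) == 1

def is_numbered_list(answer: str, n: int) -> bool:
    """Exactly n non-empty lines, numbered '1.' through 'n.'."""
    items = [l for l in answer.strip().splitlines() if l.strip()]
    return len(items) == n and all(
        re.match(rf"{i + 1}\.", l.strip()) for i, l in enumerate(items)
    )
```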
 
 
70
 
71
+ ## Usage
72
 
73
  ```python
74
+ from transformers import AutoModelForCausalLM, AutoTokenizer
75
 
76
+ model = AutoModelForCausalLM.from_pretrained("tabularisai/Faust-1-Merged", torch_dtype="auto")
77
+ tokenizer = AutoTokenizer.from_pretrained("tabularisai/Faust-1-Merged")
78
 
79
+ messages = [
80
+ {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
81
+ {"role": "user", "content": "Nenne mir 5 deutsche Städte als nummerierte Liste."}
82
+ ]
83
 
84
+ input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
85
+ inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
86
+ outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
87
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
 
88
  ```
89
 
90
+ ## Technical Details
91
 
92
+ - **Architecture:** Qwen3 (1.7B parameters)
93
+ - **Tokenizer:** Custom Faust German tokenizer (unchanged)
94
+ - **Modified layers:** 168 self-attention parameter tensors
95
+ - **Unmodified:** All MLP layers, embeddings, lm_head, layer norms
96
+ - **Method:** Per-quartile linear interpolation of attention weights
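
The tensor count is consistent with the schedule: the quartiles span layers 0-27, i.e. 28 decoder layers, and 168 / 28 = 6 attention tensors per layer. In the Qwen3 layout these would plausibly be the q/k/v/o projections plus the per-head q_norm and k_norm weights (an assumption about the exact six; the breakdown is not stated above):

```python
# Cross-check of the "168 self-attention parameter tensors" figure,
# assuming 6 attention tensors per Qwen3 layer (hypothetical breakdown).
attn_tensors_per_layer = ["q_proj", "k_proj", "v_proj", "o_proj", "q_norm", "k_norm"]
num_layers = 28  # the quartiles above span layers 0-27
print(num_layers * len(attn_tensors_per_layer))  # → 168
```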
97
 
98
  ## Citation
99
 
100
+ ```bibtex
101
+ @misc{faust1merged2026,
102
+ title={Faust-1-Merged: Attention Surgery for German Instruction Following},
103
+ author={Tabularis.AI},
104
+ year={2026},
105
+ url={https://huggingface.co/tabularisai/Faust-1-Merged}
106
+ }
107
+ ```
108
 
109
+ ## About Tabularis.AI
110
 
111
+ University of Tübingen spin-off specializing in privacy-first AI for regulated industries.
112
+ Products include EU PII Safeguard, Faust German language models, and GDPR-compliant on-premises deployment.
config.json CHANGED
@@ -50,11 +50,13 @@
50
  "num_key_value_heads": 8,
51
  "pad_token_id": 1,
52
  "rms_norm_eps": 1e-06,
53
- "rope_scaling": null,
54
- "rope_theta": 1000000,
 
 
55
  "sliding_window": null,
56
  "tie_word_embeddings": true,
57
- "transformers_version": "4.57.5",
58
  "use_cache": false,
59
  "use_sliding_window": false,
60
  "vocab_size": 100000
 
50
  "num_key_value_heads": 8,
51
  "pad_token_id": 1,
52
  "rms_norm_eps": 1e-06,
53
+ "rope_parameters": {
54
+ "rope_theta": 1000000,
55
+ "rope_type": "default"
56
+ },
57
  "sliding_window": null,
58
  "tie_word_embeddings": true,
59
+ "transformers_version": "5.2.0",
60
  "use_cache": false,
61
  "use_sliding_window": false,
62
  "vocab_size": 100000
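
The config diff above moves `rope_theta` under a nested `rope_parameters` block (the layout used alongside the bump to transformers 5.2.0). Code that reads the config JSON directly can stay compatible with both layouts via a small fallback (a sketch, not an official transformers API):

```python
# Read rope_theta from either the flat (transformers 4.x) or the nested
# "rope_parameters" (transformers 5.x) config layout.
def get_rope_theta(config: dict, default: float = 10000.0) -> float:
    rope = config.get("rope_parameters") or {}
    return rope.get("rope_theta", config.get("rope_theta", default))

old_cfg = {"rope_theta": 1000000, "rope_scaling": None}
new_cfg = {"rope_parameters": {"rope_theta": 1000000, "rope_type": "default"}}
assert get_rope_theta(old_cfg) == get_rope_theta(new_cfg) == 1000000
```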
generation_config.json CHANGED
@@ -9,5 +9,5 @@
9
  "temperature": 0.6,
10
  "top_k": 20,
11
  "top_p": 0.95,
12
- "transformers_version": "4.57.5"
13
  }
 
9
  "temperature": 0.6,
10
  "top_k": 20,
11
  "top_p": 0.95,
12
+ "transformers_version": "5.2.0"
13
  }
merge_config.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "method": "attention_surgery",
3
+ "base_model": "tabularisai/Faust-1",
4
+ "donor_model": "Qwen/Qwen3-1.7B",
5
+ "schedule": "stepped_quartile",
6
+ "alphas_per_quartile": {
7
+ "0-6": 0.1,
8
+ "7-13": 0.3,
9
+ "14-20": 0.6,
10
+ "21-27": 0.8
11
+ },
12
+ "components_modified": [
13
+ "self_attn"
14
+ ],
15
+ "components_preserved": [
16
+ "mlp",
17
+ "embed_tokens",
18
+ "lm_head",
19
+ "input_layernorm",
20
+ "post_attention_layernorm",
21
+ "model.norm"
22
+ ],
23
+ "eval_score": "27/30 (90%)",
24
+ "baseline_score": "25/30 (83%)",
25
+ "eval_settings": {
26
+ "temperature": 0,
27
+ "do_sample": false,
28
+ "max_new_tokens": 300
29
+ }
30
+ }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:771c4911227f2792ce43c5f4e285bb4ec67942b95fa15f376b20cd2227879de6
3
  size 3228455704
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0d9b6e280b9aecc623307361ef79f8c93031da8ab8425eebf0ca7c8a458723f3
3
  size 3228455704
tokenizer.json CHANGED
@@ -184,7 +184,32 @@
184
  }
185
  ]
186
  },
187
- "post_processor": null,
188
  "decoder": {
189
  "type": "ByteLevel",
190
  "add_prefix_space": true,
 
184
  }
185
  ]
186
  },
187
+ "post_processor": {
188
+ "type": "TemplateProcessing",
189
+ "single": [
190
+ {
191
+ "Sequence": {
192
+ "id": "A",
193
+ "type_id": 0
194
+ }
195
+ }
196
+ ],
197
+ "pair": [
198
+ {
199
+ "Sequence": {
200
+ "id": "A",
201
+ "type_id": 0
202
+ }
203
+ },
204
+ {
205
+ "Sequence": {
206
+ "id": "B",
207
+ "type_id": 1
208
+ }
209
+ }
210
+ ],
211
+ "special_tokens": {}
212
+ },
213
  "decoder": {
214
  "type": "ByteLevel",
215
  "add_prefix_space": true,
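
The new `post_processor` is a plain `TemplateProcessing` template with no special tokens: a single sequence keeps `type_id` 0, and in a pair the second sequence gets `type_id` 1. In effect (a plain-Python illustration of the template semantics, not the tokenizers API):

```python
# What the TemplateProcessing template above assigns as type_ids:
# "single": sequence A -> 0; "pair": A -> 0, B -> 1. No special tokens added.
def type_ids(len_a: int, len_b: int = 0) -> list:
    return [0] * len_a + [1] * len_b

assert type_ids(3) == [0, 0, 0]        # single sequence
assert type_ids(2, 2) == [0, 0, 1, 1]  # pair
```

Note that the tokenizer_config below keeps `model_input_names` limited to `input_ids` and `attention_mask` with `return_token_type_ids: false`, so these type_ids are not returned by default.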
tokenizer_config.json CHANGED
@@ -1,180 +1,20 @@
1
  {
2
- "added_tokens_decoder": {
3
- "0": {
4
- "content": "<|endoftext|>",
5
- "lstrip": false,
6
- "normalized": false,
7
- "rstrip": false,
8
- "single_word": false,
9
- "special": true
10
- },
11
- "1": {
12
- "content": "<|pad|>",
13
- "lstrip": false,
14
- "normalized": false,
15
- "rstrip": false,
16
- "single_word": false,
17
- "special": true
18
- },
19
- "2": {
20
- "content": "<|unk|>",
21
- "lstrip": false,
22
- "normalized": false,
23
- "rstrip": false,
24
- "single_word": false,
25
- "special": true
26
- },
27
- "3": {
28
- "content": "<|bos|>",
29
- "lstrip": false,
30
- "normalized": false,
31
- "rstrip": false,
32
- "single_word": false,
33
- "special": true
34
- },
35
- "4": {
36
- "content": "<|eos|>",
37
- "lstrip": false,
38
- "normalized": false,
39
- "rstrip": false,
40
- "single_word": false,
41
- "special": true
42
- },
43
- "5": {
44
- "content": "<|im_start|>",
45
- "lstrip": false,
46
- "normalized": false,
47
- "rstrip": false,
48
- "single_word": false,
49
- "special": true
50
- },
51
- "6": {
52
- "content": "<|im_end|>",
53
- "lstrip": false,
54
- "normalized": false,
55
- "rstrip": false,
56
- "single_word": false,
57
- "special": true
58
- },
59
- "7": {
60
- "content": "<|im_sep|>",
61
- "lstrip": false,
62
- "normalized": false,
63
- "rstrip": false,
64
- "single_word": false,
65
- "special": true
66
- },
67
- "8": {
68
- "content": "<|special_0|>",
69
- "lstrip": false,
70
- "normalized": false,
71
- "rstrip": false,
72
- "single_word": false,
73
- "special": true
74
- },
75
- "9": {
76
- "content": "<|special_1|>",
77
- "lstrip": false,
78
- "normalized": false,
79
- "rstrip": false,
80
- "single_word": false,
81
- "special": true
82
- },
83
- "10": {
84
- "content": "<|special_2|>",
85
- "lstrip": false,
86
- "normalized": false,
87
- "rstrip": false,
88
- "single_word": false,
89
- "special": true
90
- },
91
- "11": {
92
- "content": "<|special_3|>",
93
- "lstrip": false,
94
- "normalized": false,
95
- "rstrip": false,
96
- "single_word": false,
97
- "special": true
98
- },
99
- "12": {
100
- "content": "<|special_4|>",
101
- "lstrip": false,
102
- "normalized": false,
103
- "rstrip": false,
104
- "single_word": false,
105
- "special": true
106
- },
107
- "13": {
108
- "content": "<|special_5|>",
109
- "lstrip": false,
110
- "normalized": false,
111
- "rstrip": false,
112
- "single_word": false,
113
- "special": true
114
- },
115
- "14": {
116
- "content": "<|special_6|>",
117
- "lstrip": false,
118
- "normalized": false,
119
- "rstrip": false,
120
- "single_word": false,
121
- "special": true
122
- },
123
- "15": {
124
- "content": "<|special_7|>",
125
- "lstrip": false,
126
- "normalized": false,
127
- "rstrip": false,
128
- "single_word": false,
129
- "special": true
130
- },
131
- "16": {
132
- "content": "<|special_8|>",
133
- "lstrip": false,
134
- "normalized": false,
135
- "rstrip": false,
136
- "single_word": false,
137
- "special": true
138
- },
139
- "17": {
140
- "content": "<|special_9|>",
141
- "lstrip": false,
142
- "normalized": false,
143
- "rstrip": false,
144
- "single_word": false,
145
- "special": true
146
- }
147
- },
148
- "additional_special_tokens": [
149
- "<|im_start|>",
150
- "<|im_end|>",
151
- "<|im_sep|>",
152
- "<|special_0|>",
153
- "<|special_1|>",
154
- "<|special_2|>",
155
- "<|special_3|>",
156
- "<|special_4|>",
157
- "<|special_5|>",
158
- "<|special_6|>",
159
- "<|special_7|>",
160
- "<|special_8|>",
161
- "<|special_9|>"
162
- ],
163
  "bos_token": "<|bos|>",
164
  "clean_up_tokenization_spaces": false,
165
  "eos_token": "<|im_end|>",
166
- "extra_special_tokens": {},
167
  "max_length": 2048,
 
 
 
 
168
  "model_max_length": 8192,
169
  "pad_token": "<|pad|>",
 
170
  "stride": 0,
171
- "tokenizer_class": "PreTrainedTokenizerFast",
172
  "truncation_side": "right",
173
  "truncation_strategy": "longest_first",
174
- "unk_token": "<|unk|>",
175
- "return_token_type_ids": false,
176
- "model_input_names": [
177
- "input_ids",
178
- "attention_mask"
179
- ]
180
- }
 
1
  {
2
+ "backend": "tokenizers",
3
  "bos_token": "<|bos|>",
4
  "clean_up_tokenization_spaces": false,
5
  "eos_token": "<|im_end|>",
6
+ "is_local": false,
7
  "max_length": 2048,
8
+ "model_input_names": [
9
+ "input_ids",
10
+ "attention_mask"
11
+ ],
12
  "model_max_length": 8192,
13
  "pad_token": "<|pad|>",
14
+ "return_token_type_ids": false,
15
  "stride": 0,
16
+ "tokenizer_class": "TokenizersBackend",
17
  "truncation_side": "right",
18
  "truncation_strategy": "longest_first",
19
+ "unk_token": "<|unk|>"
20
+ }