---
library_name: transformers
license_link: https://huggingface.co/Qwen/Qwen3-1.7B/blob/main/LICENSE
pipeline_tag: text-generation
extra_gated_prompt: >
  ### FAUST-1 NON-COMMERCIAL LICENSE AGREEMENT

  Version 1.0 — January 2025

  "Faust-1" refers to the language model weights, code, and documentation made
  available by Tabularis AI GmbH ("Tabularis") under this agreement.

  1. License Grant

  You are granted a non-exclusive, non-transferable, royalty-free license to
  use, copy, and modify Faust-1 for non-commercial research and personal
  purposes only.

  2. Non-Commercial Use

  "Non-commercial" means academic research, personal projects, and educational
  use. Any use intended to generate revenue, provide commercial services, or
  benefit a for-profit entity requires a separate commercial license.

  3. Commercial Licensing

  For commercial use, please contact: info@tabularis.ai

  4. Attribution

  You must include "Built with Faust-1 by Tabularis AI" in any derivative work
  or publication.

  5. No Warranty

  Faust-1 is provided "as is" without warranties of any kind.

  6. Termination

  This license terminates automatically if you violate any terms.

  ---

  Access to this repository is approval-based.

  You must join our Discord server: https://discord.gg/7WqEKw652R
extra_gated_fields:
  Name: text
  Email: text
  Affiliation: text
  I have joined the Tabularis AI Discord server: checkbox
  I accept the Faust-1 Non-Commercial License Agreement: checkbox
extra_gated_description: |
  Faust-1 is for non-commercial use only.
  For commercial licensing contact info@tabularis.ai

  Approval requires Discord membership.
  Join: https://discord.gg/7WqEKw652R
extra_gated_button_content: Submit
language:
  - de
  - en
tags:
  - llama.cpp
  - synthetic data
---

<!-- <a href="https://faust.tabularis.ai/" target="_blank" style="margin: 2px;">
  <img
    alt="Faust-1 Demo"
    src="https://img.shields.io/badge/%E2%9C%A8%20Faust--1%20Demo-2b2b2b?style=flat&logo=ai&logoColor=white"
    style="display: inline-block; vertical-align: middle;"
  />
</a> -->

<p align="center">
  <img src="./logo-faust.webp" alt="Faust-1 Logo" width="220">
</p>

# Faust-1 — German-First Large Language Model (1.6B)

Faust-1 is a German-first large language model with 1.6B parameters, trained entirely from scratch. Model development comprises large-scale data collection and synthetic data generation, followed by data cleaning, normalization, and deduplication to reduce contamination and redundancy. Pre-training is performed on a predominantly German corpus using a decoder-only language modeling objective, resulting in a foundation model for the German language that captures lexical, syntactic, and semantic regularities at scale.

Following pre-training, the model undergoes supervised post-training (instruction tuning) on labeled input–output pairs to adapt the base model for conversational and task-oriented use. In later stages, preference-based optimization, including Direct Preference Optimization (DPO), is applied to improve response quality, stability, and alignment with human expectations, while preserving the efficiency constraints required for small-scale and local deployment.

Demo: [faust.tabularis.ai](https://faust.tabularis.ai)

> [!TIP]
> **Designed for local and cost-efficient deployment.**
> Faust-1 is deliberately sized and optimized to run on **consumer-grade hardware** and **does not require expensive data-center GPUs**.
>
> **Typical deployment examples:**
> - **Laptop / Desktop (CPU or small GPU):**
>   Runs on modern CPUs or entry-level GPUs (e.g. Apple Silicon, RTX 3060/4060, RX 6600) using optimized runtimes such as GGUF, MLX, or ONNX.
> - **Single-GPU workstation:**
>   Efficiently serves interactive workloads on a single consumer GPU with low VRAM requirements compared to larger multilingual models.
> - **On-device / privacy-sensitive setups:**
>   Suitable for local assistants, offline document analysis, and private RAG pipelines where data must not leave the machine.
>
> This makes Faust-1 practical for **researchers, developers, and small teams** who want strong German language performance without cloud dependency or high inference costs.

---

## Model summary

- Repository: tabularisai/Faust-1
- Model type: decoder-only causal language model (Mixture-of-Experts)
- Parameters: 1.6B
- Interface: conversational / instruction (chat template provided)
- Primary language: German (~90%)
- Custom state-of-the-art tokenizer for German

---

## Quickstart

### Conversational usage (recommended)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "tabularisai/Faust-1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Gib mir eine kurze Einführung in große Sprachmodelle (LLM)."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.6,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Training focus

### German-first data distribution

Faust-1 is trained from scratch with a German-dominant corpus. German syntax, compounding, morphology, and typical reasoning patterns are treated as the default operating regime rather than an edge case.

### Verified synthetic data

A substantial portion of the training signal comes from synthetic data. To keep this signal usable, generation is paired with explicit verification and filtering:

- LLM-as-judge style evaluations
- rule-based and programmatic checks
- consistency and self-agreement filtering

This allows broad coverage of instruction-following and reasoning patterns while maintaining quality control.
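
The rule-based layer of such a verification pipeline can be illustrated with a small sketch. The specific checks and thresholds below are hypothetical examples for illustration, not the checks actually used for Faust-1:

```python
# Illustrative rule-based filtering for synthetic (instruction, response) pairs.
# Checks and thresholds are hypothetical, not the Faust-1 production pipeline.

def passes_rule_checks(sample: dict) -> bool:
    """Return True if a synthetic pair survives basic programmatic checks."""
    instruction = sample.get("instruction", "")
    response = sample.get("response", "")

    # Reject empty or near-empty fields.
    if len(instruction.split()) < 3 or len(response.split()) < 3:
        return False

    # Reject responses that merely echo the instruction.
    if response.strip().lower() == instruction.strip().lower():
        return False

    # Reject degenerate repetition (same line repeated many times).
    lines = [ln for ln in response.splitlines() if ln.strip()]
    if lines and len(set(lines)) / len(lines) < 0.5:
        return False

    return True

data = [
    {"instruction": "Erkläre den Begriff Tokenizer.",
     "response": "Ein Tokenizer zerlegt Text in kleinere Einheiten, sogenannte Tokens."},
    {"instruction": "Erkläre den Begriff Tokenizer.",
     "response": "Erkläre den Begriff Tokenizer."},  # echoes the prompt -> filtered out
]
kept = [s for s in data if passes_rule_checks(s)]
print(len(kept))  # -> 1
```

In a full pipeline, samples passing these cheap checks would then go to the more expensive LLM-as-judge and self-agreement stages.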

---

## Tokenizer optimized for German

Faust-1 uses a custom tokenizer optimized for German morphology and compounding. Token efficiency is treated as a deployment constraint, not just a preprocessing detail.

Lower token counts on German text translate directly into more usable context, lower inference cost, and less fragmentation on compound-heavy inputs.

<img src="tokenizer_faust.png" alt="Faust-1 vs OpenAI Tokenizers" width="800">
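
One way to quantify this is tokenizer fertility: the average number of tokens produced per word, where lower is better. A minimal, tokenizer-agnostic sketch (the fixed-width splitter below is a hypothetical stand-in, not the Faust-1 tokenizer; it only mimics how a non-German-optimized tokenizer fragments long compounds):

```python
# Fertility = tokens per whitespace-separated word; lower means more usable
# context and cheaper inference. Any callable text -> list[str] can be measured.

def fertility(tokenize, text: str) -> float:
    words = text.split()
    return len(tokenize(text)) / len(words)

# Hypothetical stand-in: naively split every word into fixed-width chunks,
# mimicking heavy fragmentation of German compounds.
def naive_tokenize(text: str, width: int = 4) -> list[str]:
    return [w[i:i + width] for w in text.split() for i in range(0, len(w), width)]

text = "Donaudampfschifffahrtsgesellschaft betreibt Binnenschifffahrt"
print(round(fertility(naive_tokenize, text), 2))  # -> 5.33
```

A tokenizer with dedicated coverage of German compounds would score much closer to 1.0 on the same text.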

---

## German benchmark performance

Faust-1 is evaluated on a set of standard German-language benchmarks:

- ARC_de
- GSM8K_de
- HellaSwag_de
- MMLU_de
- TruthfulQA_de

The target is best-in-class performance within the 1–2B parameter range for German-focused models, using benchmarks that are easy to reproduce in Hugging Face-based evaluation pipelines.

---

## Deployment examples

Faust-1 can be deployed with common inference stacks that support decoder-only language models.

vLLM (OpenAI-compatible API)

```sh
vllm serve tabularisai/Faust-1 --dtype float16
```
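
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint (assuming the default port 8000):

```sh
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tabularisai/Faust-1",
    "messages": [{"role": "user", "content": "Was ist ein Sprachmodell?"}],
    "max_tokens": 128
  }'
```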

SGLang

```sh
python -m sglang.launch_server \
  --model-path tabularisai/Faust-1 \
  --dtype float16
```

llama.cpp (GGUF, local / on-device)

```sh
./llama-cli \
  -m faust_1_q8_0.gguf \
  -p "Erkläre kurz, was ein großes Sprachmodell ist."
```

The repository includes a prebuilt Q8_0 GGUF file for efficient local inference.

---

## Intended use

- German conversational assistants
- research and benchmarking on German NLP tasks
- local and privacy-sensitive deployments
- on-device or edge experimentation

---

## Roadmap

- Reasoning-focused variant (coming soon)
- Agent-oriented variant (coming soon)

---

## Citation

A technical paper describing training methodology, tokenizer design, and evaluation is in preparation.

Developed by [tabularis.ai](https://tabularis.ai) in Tübingen.