---
library_name: transformers
license_link: https://huggingface.co/Qwen/Qwen3-1.7B/blob/main/LICENSE
pipeline_tag: text-generation
extra_gated_prompt: >
  ### FAUST-1 NON-COMMERCIAL LICENSE AGREEMENT

  Version 1.0 — January 2025

  "Faust-1" refers to the language model weights, code, and documentation made
  available by Tabularis AI GmbH ("Tabularis") under this agreement.

  1. License Grant

  You are granted a non-exclusive, non-transferable, royalty-free license to
  use, copy, and modify Faust-1 for non-commercial research and personal
  purposes only.

  2. Non-Commercial Use

  "Non-commercial" means academic research, personal projects, and educational
  use. Any use intended to generate revenue, provide commercial services, or
  benefit a for-profit entity requires a separate commercial license.

  3. Commercial Licensing

  For commercial use, please contact: info@tabularis.ai

  4. Attribution

  You must include "Built with Faust-1 by Tabularis AI" in any derivative work
  or publication.

  5. No Warranty

  Faust-1 is provided "as is" without warranties of any kind.

  6. Termination

  This license terminates automatically if you violate any terms.

  ---

  Access to this repository is approval-based.

  You must join our Discord server: https://discord.gg/7WqEKw652R
extra_gated_fields:
  Name: text
  Email: text
  Affiliation: text
  I have joined the Tabularis AI Discord server: checkbox
  I accept the Faust-1 Non-Commercial License Agreement: checkbox
extra_gated_description: |
  Faust-1 is for non-commercial use only.
  For commercial licensing contact info@tabularis.ai

  Approval requires Discord membership.
  Join: https://discord.gg/7WqEKw652R
extra_gated_button_content: Submit
language:
  - de
  - en
tags:
  - llama.cpp
  - synthetic data
---

<!-- <a href="https://faust.tabularis.ai/" target="_blank" style="margin: 2px;">
  <img
    alt="Faust-1 Demo"
    src="https://img.shields.io/badge/%E2%9C%A8%20Faust--1%20Demo-2b2b2b?style=flat&logo=ai&logoColor=white"
    style="display: inline-block; vertical-align: middle;"
  />
</a> -->

<p align="center">
  <img src="./logo-faust.webp" alt="Faust-1 Logo" width="220">
</p>

# Faust-1 — German-First Large Language Model (1.6B)

Faust-1 is a German-first large language model with 1.6B parameters, trained entirely from scratch. Model development comprises large-scale data collection and synthetic data generation, followed by data cleaning, normalization, and deduplication to reduce contamination and redundancy. Pre-training is performed on a predominantly German corpus using a decoder-only language modeling objective, resulting in a foundation model for the German language that captures lexical, syntactic, and semantic regularities at scale.

Following pre-training, the model undergoes supervised post-training (instruction tuning) on labeled input–output pairs to adapt the base model for conversational and task-oriented use. In later stages, preference-based optimization, including Direct Preference Optimization (DPO), is applied to improve response quality, stability, and alignment with human expectations, while preserving the efficiency constraints required for small-scale and local deployment.

Demo: [faust.tabularis.ai](https://faust.tabularis.ai)

> [!TIP]
> **Designed for local and cost-efficient deployment.**
> Faust-1 is deliberately sized and optimized to run on **consumer-grade hardware** and **does not require expensive data-center GPUs**.
>
> **Typical deployment examples:**
> - **Laptop / Desktop (CPU or small GPU):**
>   Runs on modern CPUs or entry-level GPUs (e.g. Apple Silicon, RTX 3060/4060, RX 6600) using optimized runtimes such as GGUF, MLX, or ONNX.
> - **Single-GPU workstation:**
>   Efficiently serves interactive workloads on a single consumer GPU with low VRAM requirements compared to larger multilingual models.
> - **On-device / privacy-sensitive setups:**
>   Suitable for local assistants, offline document analysis, and private RAG pipelines where data must not leave the machine.
>
> This makes Faust-1 practical for **researchers, developers, and small teams** who want strong German language performance without cloud dependency or high inference costs.

---

## Model summary

- Repository: tabularisai/Faust-1
- Model type: decoder-only causal language model (Mixture-of-Experts)
- Parameters: 1.6B
- Interface: conversational / instruction (chat template provided)
- Primary language: German (~90%)
- Custom state-of-the-art tokenizer for German

---

## Quickstart

### Conversational usage (recommended)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "tabularisai/Faust-1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Gib mir eine kurze Einführung in große Sprachmodelle (LLM)."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.6,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Training focus

### German-first data distribution

Faust-1 is trained from scratch with a German-dominant corpus. German syntax, compounding, morphology, and typical reasoning patterns are treated as the default operating regime rather than an edge case.

### Verified synthetic data

A substantial portion of the training signal comes from synthetic data. To keep this signal usable, generation is paired with explicit verification and filtering:

- LLM-as-judge style evaluations
- rule-based and programmatic checks
- consistency and self-agreement filtering

This allows broad coverage of instruction-following and reasoning patterns while maintaining quality control.
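
The rule-based layer of such a verification pipeline can be illustrated with a small sketch. The specific checks and thresholds below are hypothetical examples for illustration, not the checks actually used for Faust-1:

```python
# Illustrative rule-based filtering for synthetic (instruction, response) pairs.
# Checks and thresholds are hypothetical, not the Faust-1 production pipeline.

def passes_rule_checks(sample: dict) -> bool:
    """Return True if a synthetic pair survives basic programmatic checks."""
    instruction = sample.get("instruction", "")
    response = sample.get("response", "")

    # Reject empty or near-empty fields.
    if len(instruction.split()) < 3 or len(response.split()) < 3:
        return False

    # Reject responses that merely echo the instruction.
    if response.strip().lower() == instruction.strip().lower():
        return False

    # Reject degenerate repetition (same line repeated many times).
    lines = [ln for ln in response.splitlines() if ln.strip()]
    if lines and len(set(lines)) / len(lines) < 0.5:
        return False

    return True

data = [
    {"instruction": "Erkläre den Begriff Tokenizer.",
     "response": "Ein Tokenizer zerlegt Text in kleinere Einheiten, sogenannte Tokens."},
    {"instruction": "Erkläre den Begriff Tokenizer.",
     "response": "Erkläre den Begriff Tokenizer."},  # echoes the prompt -> filtered out
]
kept = [s for s in data if passes_rule_checks(s)]
print(len(kept))  # -> 1
```

In a full pipeline, samples passing these cheap checks would then go to the more expensive LLM-as-judge and self-agreement stages.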

---

## Tokenizer optimized for German

Faust-1 uses a custom tokenizer optimized for German morphology and compounding. Token efficiency is treated as a deployment constraint, not just a preprocessing detail.

Lower token counts on German text translate directly into more usable context, lower inference cost, and less fragmentation on compound-heavy inputs.

<img src="tokenizer_faust.png" alt="Faust-1 vs OpenAI Tokenizers" width="800">
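
One way to quantify this is tokenizer fertility: the average number of tokens produced per word, where lower is better. A minimal, tokenizer-agnostic sketch (the fixed-width splitter below is a hypothetical stand-in, not the Faust-1 tokenizer; it only mimics how a non-German-optimized tokenizer fragments long compounds):

```python
# Fertility = tokens per whitespace-separated word; lower means more usable
# context and cheaper inference. Any callable text -> list[str] can be measured.

def fertility(tokenize, text: str) -> float:
    words = text.split()
    return len(tokenize(text)) / len(words)

# Hypothetical stand-in: naively split every word into fixed-width chunks,
# mimicking heavy fragmentation of German compounds.
def naive_tokenize(text: str, width: int = 4) -> list[str]:
    return [w[i:i + width] for w in text.split() for i in range(0, len(w), width)]

text = "Donaudampfschifffahrtsgesellschaft betreibt Binnenschifffahrt"
print(round(fertility(naive_tokenize, text), 2))  # -> 5.33
```

A tokenizer with dedicated coverage of German compounds would score much closer to 1.0 on the same text.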

---

## German benchmark performance

Faust-1 is evaluated on a set of standard German-language benchmarks:

- ARC_de
- GSM8K_de
- HellaSwag_de
- MMLU_de
- TruthfulQA_de

The target is best-in-class performance within the 1–2B parameter range for German-focused models, using benchmarks that are easy to reproduce in Hugging Face-based evaluation pipelines.

---

## Deployment examples

Faust-1 can be deployed with common inference stacks that support decoder-only language models.

vLLM (OpenAI-compatible API)

```sh
vllm serve tabularisai/Faust-1 --dtype float16
```
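
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint (assuming the default port 8000):

```sh
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tabularisai/Faust-1",
    "messages": [{"role": "user", "content": "Was ist ein Sprachmodell?"}],
    "max_tokens": 128
  }'
```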

SGLang

```sh
python -m sglang.launch_server \
  --model-path tabularisai/Faust-1 \
  --dtype float16
```

llama.cpp (GGUF, local / on-device)

```sh
./llama-cli \
  -m faust_1_q8_0.gguf \
  -p "Erkläre kurz, was ein großes Sprachmodell ist."
```

The repository includes a prebuilt Q8_0 GGUF file for efficient local inference.

---

## Intended use

- German conversational assistants
- research and benchmarking on German NLP tasks
- local and privacy-sensitive deployments
- on-device or edge experimentation

---

## Roadmap

- Reasoning-focused variant (coming soon)
- Agent-oriented variant (coming soon)

---

## Citation

A technical paper describing training methodology, tokenizer design, and evaluation is in preparation.

Developed by [tabularis.ai](https://tabularis.ai) in Tübingen.