	---
	language:
	- de
	- en
	license: mit
	base_model: deepset/gbert-base
	pipeline_tag: text-classification
	library_name: transformers
	tags:
	- klarki
	- eu-ai-act
	- compliance
	- german
	- text-classification
	- bert
	model-index:
	- name: klarki-bert-classifier
	  results:
	  - task:
	      type: text-classification
	      name: Text Classification
	    dataset:
	      name: KlarKI EU AI Act Regulatory Training Data
	      type: custom
	    metrics:
	    - type: f1
	      value: 0.954
	      name: Macro F1
	      verified: false
	---

	# KlarKI — EU AI Act Article Domain Classifier

	> 8-class text classification — maps document chunks to EU AI Act article domains (Articles 9–15 + unrelated)

	> [!NOTE]
	> Part of [KlarKI](https://github.com/s4nkar/klarki) — a local-first EU AI Act + GDPR compliance auditor for German SMEs.
	> All inference runs on-device. No data leaves your machine.

	---

	## Model Overview

	| Property | Value |
	|---|---|
	| Base model | [deepset/gbert-base](https://huggingface.co/deepset/gbert-base) |
	| Architecture | `BertForSequenceClassification` (Transformers) |
	| Parameters | ~110M |
	| Languages | German (primary), English |
	| Training samples | 5536 train / 981 validation |
	| License | MIT |
	| Part of | [KlarKI](https://github.com/s4nkar/klarki) audit pipeline |

	---

	## Quickstart

	### Option A — Via KlarKI (recommended)

	> [!TIP]
	> Use this if you want the full audit pipeline. The download script places all 5 models
	> exactly where KlarKI expects them.

	```bash
	git clone https://github.com/s4nkar/KlarKI-EU-AI-Act-compliance-auditor.git
	cd KlarKI-EU-AI-Act-compliance-auditor
	# Quote the requirement so the shell does not treat ">=" as a redirect
	pip install "huggingface-hub>=0.26.0"
	python scripts/download_pretrained.py --model bert
	./run.sh up
	```

	### Option B — Direct usage

	```python
	from transformers import pipeline

	classifier = pipeline("text-classification", model="s4nkar/klarki-bert-classifier")
	result = classifier("The system must maintain a risk management system throughout the entire lifecycle of the AI system.")
	print(result)
	# Example output (the exact score will vary): [{'label': 'risk_management', 'score': 0.97}]
	```

	---

	## Labels

	| Label | Description |
	|---|---|
	| `risk_management` | Article 9: Risk Management System |
	| `data_governance` | Article 10: Data and Data Governance |
	| `technical_documentation` | Article 11: Technical Documentation |
	| `record_keeping` | Article 12: Record-Keeping |
	| `transparency` | Article 13: Transparency and Provision of Information |
	| `human_oversight` | Article 14: Human Oversight |
	| `security` | Article 15: Accuracy, Robustness and Cybersecurity |
	| `unrelated` | Not related to EU AI Act Articles 9–15 |
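
Because each label corresponds to exactly one article, downstream code can translate a prediction into an article number. The following is a minimal sketch, assuming the model returns the label strings listed in the table above; `LABEL_TO_ARTICLE` and `article_for` are hypothetical helper names and not part of KlarKI.

```python
from typing import Optional

# Mapping copied from the label table; `unrelated` has no article number.
LABEL_TO_ARTICLE = {
    "risk_management": 9,
    "data_governance": 10,
    "technical_documentation": 11,
    "record_keeping": 12,
    "transparency": 13,
    "human_oversight": 14,
    "security": 15,
}


def article_for(label: str) -> Optional[int]:
    """Return the EU AI Act article number for a label, or None for `unrelated`."""
    if label == "unrelated":
        return None
    # An unknown label raises KeyError so typos do not pass silently.
    return LABEL_TO_ARTICLE[label]
```

For example, `article_for("record_keeping")` gives the article number for a record-keeping chunk, while `article_for("unrelated")` returns `None`.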

	---

	## Evaluation Results

	### Overall

	| Macro F1 | Val samples |
	|---|---|
	| 0.9540 | 981 |

	### Per-Class

	| Class | Precision | Recall | F1 | Support |
	|---|---|---|---|---|
	| `risk_management` | 0.9435 | 0.9512 | 0.9474 | 123 |
	| `data_governance` | 0.9593 | 0.9672 | 0.9633 | 122 |
	| `technical_documentation` | 0.9680 | 0.9680 | 0.9680 | 125 |
	| `record_keeping` | 0.9583 | 0.9426 | 0.9504 | 122 |
	| `transparency` | 0.9569 | 0.8952 | 0.9250 | 124 |
	| `human_oversight` | 0.9365 | 0.9672 | 0.9516 | 122 |
	| `security` | 0.9516 | 0.9593 | 0.9555 | 123 |
	| `unrelated` | 0.9593 | 0.9833 | 0.9712 | 120 |
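
Macro F1 is the unweighted mean of the per-class F1 scores, and each per-class F1 is the harmonic mean of precision and recall. The sketch below recomputes both from the numbers in the per-class table as a sanity check; `f1_from` is a hypothetical helper name, and small differences from the reported values are expected because the inputs are already rounded to four decimals.

```python
# Per-class F1 values copied from the per-class table.
per_class_f1 = {
    "risk_management": 0.9474,
    "data_governance": 0.9633,
    "technical_documentation": 0.9680,
    "record_keeping": 0.9504,
    "transparency": 0.9250,
    "human_oversight": 0.9516,
    "security": 0.9555,
    "unrelated": 0.9712,
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Macro F1: every class counts equally, regardless of its support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Recompute one row (transparency) from its precision and recall.
transparency_f1 = f1_from(0.9569, 0.8952)

print(f"macro F1 ~ {macro_f1:.3f}, transparency F1 ~ {transparency_f1:.3f}")
```

The `transparency` row has the lowest recall in the table, so it is the class most likely to be missed by the router.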

	---

	## Training Details

	| Property | Value |
	|---|---|
	| Base model | `deepset/gbert-base` |
	| Training epochs | 5 (with early stopping) |
	| Batch size | 16 |
	| Data split | 85% train / 15% validation, stratified, seed=42 |
	| Data generation | Async Ollama-grounded synthesis (phi3:mini) + real regulatory text |
	| Optimiser | AdamW |
	| Training environment | Docker container (Python 3.11, isolated from host) |
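
The training script itself is not part of this card, so the following is only an illustrative sketch of what a stratified 85/15 split with a fixed seed means: each label keeps roughly the same share in the training and validation sets, and the same seed reproduces the same split. The function name `stratified_split` and the toy data are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction=0.15, seed=42):
    """Split (text, label) pairs so each label keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    train, val = [], []
    for label in sorted(by_label):  # sorted so the result does not depend on dict order
        items = by_label[label]
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val


# Toy data: 100 samples for each of two labels.
samples = [(f"chunk {i}", "risk_management") for i in range(100)]
samples += [(f"chunk {i}", "unrelated") for i in range(100, 200)]

train, val = stratified_split(samples)
print(len(train), len(val))
```

In practice a library routine such as scikit-learn's `train_test_split` with `stratify=` and `random_state=42` would be the more common choice; the standard-library version above just keeps the example dependency-free.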

	---

	## Intended Use

	The model routes document chunks to the correct article gap analyser inside the KlarKI audit pipeline. Each 512-character chunk is assigned to one of seven article domains or marked `unrelated`.
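
The routing step can be sketched as follows. This is a simplified illustration, not KlarKI's actual chunker: it cuts text into fixed 512-character pieces without overlap or sentence-boundary handling, and `chunk_text`, `route_chunks` and `fake_classify` are hypothetical names. The stand-in classifier exists only so the example runs without downloading the model.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def chunk_text(text: str, size: int = 512) -> List[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def route_chunks(chunks: List[str], classify: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group chunks by predicted label, dropping chunks labelled `unrelated`."""
    routed: Dict[str, List[str]] = defaultdict(list)
    for chunk in chunks:
        label = classify(chunk)
        if label != "unrelated":
            routed[label].append(chunk)
    return dict(routed)


def fake_classify(chunk: str) -> str:
    """Stand-in for the real model call, for illustration only."""
    return "risk_management" if "risk" in chunk.lower() else "unrelated"


document = "Risk controls are reviewed quarterly. " * 40
chunks = chunk_text(document)
routed = route_chunks(chunks, fake_classify)
print({label: len(items) for label, items in routed.items()})
```

With the real model, `classify` could wrap the pipeline from the Quickstart, for example `lambda c: classifier(c)[0]["label"]`.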

	> [!WARNING]
	> This model is a decision-support tool, not a substitute for qualified legal advice.
	> EU AI Act compliance determinations should always be reviewed by a legal professional.

	---

	## Limitations

	- Trained primarily on German regulatory text; performance may degrade on highly informal language.
	- `unrelated` is a catch-all class; very short or ambiguous chunks may be misclassified.
	- Designed for 512-character chunks, not full documents; split longer texts into chunks before classification.
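
One way to soften the effect of short or ambiguous chunks is to flag low-confidence predictions for manual review instead of routing them automatically. The sketch below assumes predictions shaped like the pipeline output in the Quickstart (`{'label': ..., 'score': ...}`); the `needs_review` marker and both threshold values are hypothetical and would need tuning on real data.

```python
from typing import Dict, Union

REVIEW_LABEL = "needs_review"  # hypothetical marker, not one of the model's labels


def apply_threshold(prediction: Dict[str, Union[str, float]],
                    chunk: str,
                    min_score: float = 0.80,
                    min_chars: int = 40) -> str:
    """Return the predicted label, or REVIEW_LABEL for short or low-confidence chunks."""
    if len(chunk.strip()) < min_chars:
        # Very short chunks carry too little context to trust the prediction.
        return REVIEW_LABEL
    if float(prediction["score"]) < min_score:
        return REVIEW_LABEL
    return str(prediction["label"])
```

A chunk such as `"Ja."` would be flagged regardless of its score, while a full-length chunk is routed only when the top score reaches `min_score`.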

	---

	## Citation

	```bibtex
	@software{klarki2026,
	author = {Sankar},
	title = {KlarKI: Local-First EU AI Act and GDPR Compliance Auditor},
	year = {2026},
	url = {https://github.com/s4nkar/KlarKI-EU-AI-Act-compliance-auditor},
	note = {Open-source compliance tooling for German SMEs}
	}
	```

	---

	## About KlarKI

	KlarKI is an open-source, local-first EU AI Act + GDPR compliance auditor built for German SMEs.
	Upload a policy document and receive a scored gap analysis against Articles 9–15, computed entirely on your own hardware.

	Key features:
	- Deterministic legal decision hierarchy (actor detection, Annex III applicability gate)
	- Hybrid RAG retrieval (BM25 + ChromaDB vector + cross-encoder re-ranking)
	- LangGraph multi-agent gap analysis (three nodes per applicable article)
	- Bilingual EN/DE support
	- Fully local inference with no external API calls

	[GitHub](https://github.com/s4nkar/KlarKI-EU-AI-Act-compliance-auditor)  |  [All KlarKI Models](https://huggingface.co/s4nkar)