---
license: other
license_name: aigency-commercial
license_link: https://aigency.dev/license
language:
- tr
- en
library_name: aigency-api
pipeline_tag: text-generation
tags:
- turkish
- multimodal
- sovereign
- frontier-adjacent
- aigency
- ecloud
- production
inference: false
extra_gated_heading: AIGENCY V4 is offered via API
extra_gated_description: |
  Model weights are not distributed on HuggingFace. AIGENCY V4 is accessible
  via the eCloud production API at https://aigency.dev. This page is a
  reference card describing architecture, evaluation methodology, and
  benchmark results, and links to a live demo Space.
model-index:
- name: AIGENCY V4
  results:
  - task:
      type: text-generation
      name: Code generation
    dataset:
      type: openai_humaneval
      name: HumanEval (pass@1)
    metrics:
    - type: pass@1
      value: 84.15
      name: pass@1
      verified: false
  - task:
      type: text-generation
      name: Code generation extended
    dataset:
      type: humaneval-plus
      name: HumanEval+ (pass@1)
    metrics:
    - type: pass@1
      value: 79.88
      name: pass@1
      verified: false
  - task:
      type: text-generation
      name: Code generation
    dataset:
      type: mbpp
      name: MBPP (sanitized)
    metrics:
    - type: pass@1
      value: 84.82
      name: pass@1
      verified: false
  - task:
      type: text-generation
      name: Code generation extended
    dataset:
      type: mbpp-plus
      name: MBPP+
    metrics:
    - type: pass@1
      value: 78.04
      name: pass@1
      verified: false
  - task:
      type: text-generation
      name: Mathematical reasoning
    dataset:
      type: gsm8k
      name: GSM8K
    metrics:
    - type: accuracy
      value: 94.62
      name: accuracy
      verified: false
  - task:
      type: text-generation
      name: Multitask language understanding
    dataset:
      type: cais/mmlu
      name: MMLU (stratified n=1000)
    metrics:
    - type: accuracy
      value: 80.10
      name: accuracy
      verified: false
  - task:
      type: text-generation
      name: Multitask language understanding (Pro)
    dataset:
      type: TIGER-Lab/MMLU-Pro
      name: MMLU-Pro (n=1000)
    metrics:
    - type: accuracy
      value: 50.20
      name: accuracy
      verified: false
  - task:
      type: text-generation
      name: Scientific reasoning
    dataset:
      type: ai2_arc
      name: ARC-Challenge
    metrics:
    - type: accuracy
      value: 94.88
      name: accuracy
      verified: false
  - task:
      type: text-generation
      name: Graduate-level QA
    dataset:
      type: idavidrein/gpqa
      name: GPQA Diamond
    metrics:
    - type: accuracy
      value: 37.88
      name: accuracy
      verified: false
  - task:
      type: text-generation
      name: Truthfulness
    dataset:
      type: truthful_qa
      name: TruthfulQA MC1
    metrics:
    - type: accuracy
      value: 76.38
      name: accuracy
      verified: false
  - task:
      type: text-generation
      name: Instruction following
    dataset:
      type: google/IFEval
      name: IFEval (strict)
    metrics:
    - type: accuracy
      value: 80.22
      name: strict-prompt-level
      verified: false
  - task:
      type: text-generation
      name: Commonsense reasoning
    dataset:
      type: hellaswag
      name: HellaSwag (n=1000)
    metrics:
    - type: accuracy
      value: 88.60
      name: accuracy
      verified: false
  - task:
      type: text-generation
      name: Coreference reasoning
    dataset:
      type: winogrande
      name: WinoGrande XL
    metrics:
    - type: accuracy
      value: 74.66
      name: accuracy
      verified: false
  - task:
      type: text-generation
      name: Turkish reading comprehension
    dataset:
      type: facebook/belebele
      name: Belebele-TR (Turkish)
    metrics:
    - type: accuracy
      value: 87.33
      name: accuracy
      verified: false
  - task:
      type: text-generation
      name: Turkish extractive QA
    dataset:
      type: tquad
      name: TQuAD (F1 ≥ 0.5)
    metrics:
    - type: f1
      value: 82.40
      name: F1 ≥ 0.5
      verified: false
  - task:
      type: text-generation
      name: Turkish multitask understanding
    dataset:
      type: tr-mmlu
      name: TR-MMLU
    metrics:
    - type: accuracy
      value: 70.80
      name: accuracy
      verified: false
  - task:
      type: text-generation
      name: Turkish natural-language inference
    dataset:
      type: xnli
      name: XNLI-TR
    metrics:
    - type: accuracy
      value: 73.40
      name: accuracy
      verified: false
  - task:
      type: text-generation
      name: Turkish grammar
    dataset:
      type: tr-grammar-synthetic
      name: TR Grammar (synthetic 50/50)
    metrics:
    - type: accuracy
      value: 79.00
      name: accuracy
      verified: false
  - task:
      type: image-text-to-text
      name: Multimodal QA
    dataset:
      type: MMMU
      name: MMMU (val, n=30)
    metrics:
    - type: accuracy
      value: 53.33
      name: accuracy
      verified: false
  - task:
      type: image-text-to-text
      name: Chart QA
    dataset:
      type: HuggingFaceM4/ChartQA
      name: ChartQA (relaxed)
    metrics:
    - type: accuracy
      value: 67.68
      name: relaxed accuracy
      verified: false
  - task:
      type: image-text-to-text
      name: Document QA
    dataset:
      type: lmms-lab/DocVQA
      name: DocVQA (ANLS ≥ 0.5)
    metrics:
    - type: accuracy
      value: 79.17
      name: ANLS ≥ 0.5
      verified: false
  - task:
      type: image-text-to-text
      name: Visual mathematical reasoning
    dataset:
      type: AI4Math/MathVista
      name: MathVista (testmini)
    metrics:
    - type: accuracy
      value: 34.13
      name: accuracy
      verified: false
---

# AIGENCY V4

> **Sovereign, fully independent, multimodal — 128B parameters.**
> A globally competitive Turkish-first AI model: world-leading on Turkish
> reading comprehension and natural-language inference, frontier-level on
> grade-school math and scientific reasoning, and KVKK-resident.

[**🇹🇷 Türkçe README**](#türkçe) · [**🇬🇧 English README**](#english) · [**📄 Whitepaper (EN)**](https://github.com/ecloud-bh/aigency-v4-whitepaper/blob/main/AIGENCY-V4-Whitepaper-EN.pdf) · [**📄 Whitepaper (TR)**](https://github.com/ecloud-bh/aigency-v4-whitepaper/blob/main/AIGENCY-V4-Whitepaper-TR.pdf) · [**🌐 Try the demo**](https://huggingface.co/spaces/aigencydev/AIGENCY-V4-Demo) · [**🔗 API**](https://aigency.dev)

---

## English

### Model summary

**AIGENCY V4** is the multimodal successor to AIGENCY V3, developed by
**eCloud Yazılım Teknolojileri** and released to production in Q2 2026.
The model retains V3's four sovereignty principles — zero external parameter
dependency, sovereign data residency, transparent architectural documentation,
and Turkish morphological context fidelity — and adds a sovereign 8B-parameter
vision encoder for image, document, chart, and visual-math understanding.

| | |
|---|---|
| **Total parameters** | 128B (120B core + 8B vision encoder) |
| **Architecture** | Sovereign decoder-only transformer + side vision encoder |
| **Optimisations** | Adaptive LoRA+, Selective Layer Collapse, Localised MoE, 4-bit block quantization, chunked attention |
| **Context window** | 278K tokens (HBM 3-tier: STM 4k / ITM 64k / LTM 278k) |
| **Active inference memory** | ~6.5 GB GPU under 4-bit quantization |
| **Languages** | Turkish (primary), English |
| **Modalities** | Text, image (one image per request, 30 MB max, image/* MIME) |
| **Release version** | 1.0 production |
| **Release date** | April 2026 |
| **Licence** | API-only commercial — see https://aigency.dev/license |

### Distribution

**Weights are not distributed.** AIGENCY V4 is accessed exclusively through
the eCloud production API at `https://aigency.dev/api/v2`. This page provides
the architectural specification, the evaluation methodology, and the full
benchmark results. To try the model interactively, use the
[demo Space](https://huggingface.co/spaces/aigencydev/AIGENCY-V4-Demo). For
production access, see [aigency.dev](https://aigency.dev).
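
For quick experimentation against the API base URL above, here is a minimal request-builder sketch. Note that the `/chat` path, the `aigency-v4` model identifier, and the payload fields are illustrative assumptions, not the documented schema — consult [aigency.dev](https://aigency.dev) for the authoritative API reference.

```python
import json
import urllib.request

API_BASE = "https://aigency.dev/api/v2"  # production base URL from this card


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build a chat-style POST request.

    CAUTION: the endpoint path and payload schema below are hypothetical
    placeholders for illustration; the real schema is documented at
    https://aigency.dev.
    """
    payload = {
        "model": "aigency-v4",  # hypothetical model identifier
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{API_BASE}/chat",  # hypothetical path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request is then a single `urllib.request.urlopen(build_request(...))` call; separating request construction from transport keeps the payload easy to inspect and unit-test offline.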

### Evaluation

A comprehensive single-session evaluation was run against the production API
on **27 April 2026**: **13,344 real API calls** across **22 distinct
benchmarks**. Every result is reported with a Wilson 95% confidence interval
and an open dataset identifier, using deterministic subsampling (seed=42)
wherever a benchmark was subsampled.
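
The exact harness lives in the benchmark repository linked below; as a minimal illustration of what seed-42 deterministic subsampling means in practice (the function name is ours, not taken from that repo):

```python
import random


def deterministic_subsample(items, n, seed=42):
    """Return a reproducible size-n subsample of `items`.

    Using an isolated random.Random(seed) instance (rather than the global
    RNG) means the same seed always selects the same items, so reported
    scores and confidence intervals can be re-derived exactly.
    """
    rng = random.Random(seed)  # isolated RNG; global state untouched
    return rng.sample(list(items), min(n, len(items)))


# The same call yields the same selection every time:
a = deterministic_subsample(range(10000), 1000)
b = deterministic_subsample(range(10000), 1000)
assert a == b and len(a) == 1000
```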

#### Tier 1 — Critical benchmarks (full set)

| Benchmark | Accuracy | Wilson 95% CI | n | Errors |
|---|---|---|---|---|
| HumanEval (pass@1) | **0.8415** | [0.778, 0.889] | 164/164 | 0 |
| IFEval (strict) | **0.8022** | [0.767, 0.834] | 541/541 | 1 |
| GPQA Diamond | 0.3788 | [0.314, 0.448] | 198/198 | 0 |
| Belebele-TR | **0.8733** | [0.850, 0.893] | 900/900 | 0 |
| ARC-Challenge | **0.9488** | [0.935, 0.960] | 1172/1172 | 0 |
| TruthfulQA MC1 | **0.7638** | [0.734, 0.792] | 817/817 | 0 |
| GSM8K | **0.9462** | [0.933, 0.957] | 1319/1319 | 0 |
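
The Wilson intervals in these tables can be re-derived from the raw counts. For example, a HumanEval pass@1 of 0.8415 over n=164 corresponds to 138 correct completions; the sketch below (standalone, not the harness code) reproduces that row's [0.778, 0.889]:

```python
import math


def wilson_ci(correct: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (95% at z=1.96)."""
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half


# HumanEval row: 138/164 correct -> pass@1 = 0.8415
lo, hi = wilson_ci(138, 164)
print(round(lo, 3), round(hi, 3))  # 0.778 0.889
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small n, which matters for the n=24 and n=30 multimodal rows further down.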

#### Tier 2 — Mid-volume benchmarks

| Benchmark | Accuracy | Wilson 95% CI | n |
|---|---|---|---|
| MMLU (stratified) | **0.8010** | [0.775, 0.825] | 1000/1000 |
| MMLU-Pro | 0.5020 | [0.471, 0.533] | 1000/1000 |
| HellaSwag | **0.8860** | [0.865, 0.904] | 1000/1000 |
| WinoGrande XL | 0.7466 | [0.722, 0.770] | 1267/1267 |
| HumanEval+ (extended) | **0.7988** | [0.731, 0.853] | 164/164 |
| MBPP (sanitized) | **0.8482** | [0.799, 0.887] | 257/257 |
| MBPP+ | **0.7804** | [0.736, 0.819] | 378/378 |
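
The pass@1 figures above are single-sample scores: the fraction of tasks whose one completion passes all tests. For context, the standard unbiased pass@k estimator (shown below as a sketch; this card does not state which estimator the harness uses) reduces to exactly that fraction when one sample is drawn per task:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate.

    Probability that at least one of k samples drawn without replacement
    from n generations (c of which are correct) passes. With n == 1 this
    collapses to the plain success indicator.
    """
    if n - c < k:
        return 1.0  # fewer incorrect samples than draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)


# One sample per task: pass@1 is just the success fraction.
assert pass_at_k(1, 1, 1) == 1.0
assert pass_at_k(5, 2, 1) == 2 / 5
```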

#### Tier 3-A — Turkish (de facto global reference for Turkish)

| Benchmark | Accuracy | Wilson 95% CI | n |
|---|---|---|---|
| Belebele-TR | **0.8733** | [0.850, 0.893] | 900/900 |
| TQuAD (F1 ≥ 0.5) | **0.8240** | [0.788, 0.855] | 500/500 |
| TR-MMLU | **0.7080** | [0.667, 0.746] | 500/500 |
| XNLI-TR | **0.7340** | [0.694, 0.771] | 500/500 |
| TR Grammar (synthetic) | **0.7900** | [0.700, 0.858] | 100/100 |

> Frontier models do not consistently publish Turkish-specific scores;
> among the evaluations that are published, AIGENCY V4 stands as the
> **de facto Turkish reference**.
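
The TQuAD row counts an answer as correct when its token-level F1 against the gold span reaches 0.5. A minimal sketch of that metric (simple lowercase whitespace tokenisation; the actual harness may normalise punctuation and articles more aggressively):

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between two answer strings."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


# Under the card's rule, an item scores as correct when token_f1 >= 0.5:
assert token_f1("mustafa kemal atatürk", "atatürk") >= 0.5
```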

#### Tier 3-B — Multimodal (first production release)

| Benchmark | Accuracy | Wilson 95% CI | n |
|---|---|---|---|
| MMMU (val) | 0.5333 | [0.361, 0.698] | 30/30 |
| ChartQA (relaxed) | 0.6768 | [0.634, 0.717] | 492/500 |
| DocVQA (ANLS ≥ 0.5) | 0.7917 | [0.595, 0.908] | 24 |
| MathVista (testmini) | 0.3413 | [0.280, 0.408] | 208 |
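
The DocVQA row uses ANLS with a 0.5 threshold: per-answer similarity is 1 minus the normalised Levenshtein distance, zeroed out when it falls below the threshold. A self-contained sketch (lowercasing and whitespace-stripping are our simplifying assumptions):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def anls(prediction: str, gold: str, tau: float = 0.5) -> float:
    """Normalised Levenshtein similarity for one answer pair.

    Returns 1 - NL(pred, gold), clipped to 0 when below threshold tau,
    as in the DocVQA ANLS metric.
    """
    p, g = prediction.strip().lower(), gold.strip().lower()
    if not p and not g:
        return 1.0
    score = 1.0 - levenshtein(p, g) / max(len(p), len(g))
    return score if score >= tau else 0.0
```

An exact match scores 1.0, a near-miss keeps partial credit, and anything with similarity below 0.5 contributes nothing — which is why the card reports the metric as "ANLS ≥ 0.5".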

### Comparison with frontier models (April 2026)

| Benchmark | AIGENCY V4 | GPT-5 | Claude 4.6/4.7 | Gemini 3 Pro |
|---|---|---|---|---|
| GSM8K | **94.62** | 96.8 | ~96 | ~94 |
| ARC-Challenge | **94.88** | ~96 | ~96 | ~95 |
| HumanEval | 84.15 | 94.0 | 95.0 | 89.7 |
| MMLU | 80.10 | 94.2 | 88-93 | 92.4 |
| MMLU-Pro | 50.20 | ~85 | ~84 | ~81 |
| GPQA Diamond | 37.88 | 88-94 | 91.3-94.2 | 91.9 |
| MMMU | 53.33 | 79.1 | 84.1 | — |

V4 is **at frontier level on grade-school math and scientific reasoning**,
**upper-mid frontier on code generation**, **lower-mid frontier on general
academic knowledge and instruction following**, and **in active development on
graduate-level expert knowledge and multimodal understanding**. The V4.1
roadmap (Q4 2026) targets MMLU-Pro 0.65, GPQA Diamond 0.55, and an average
latency of 4 s.

### Operational performance (single session, 27 April 2026)

- Total API calls: 13,344
- Persistent error rate: 0.3%
- Latency: average 9.55 s · p50 4.39 s · p95 32.77 s · p99 33.59 s
- V4.1 latency target: average ≤ 4 s · p95 ≤ 15 s

### Reproducibility

The full evaluation harness, raw responses, scored items, summary JSON, and
the deterministic subsample seed are available at:

- **Benchmark code**: https://github.com/ecloud-bh/aigency-benchmarks
- **Evaluation results dataset**: https://huggingface.co/datasets/aigencydev/aigency-v4-evaluation
- **Whitepaper (EN/TR)**: https://github.com/ecloud-bh/aigency-v4-whitepaper

### Intended use

**Primary deployment domains:**

1. Public-sector and government workloads requiring KVKK residency
2. Legal and legal-tech (statute search, contract analysis — Tural model integration)
3. Education and higher education (Turkish academic writing, exam preparation, course assistants)
4. Banking, finance, and insurance (Turkish-heavy KYC/AML)
5. Healthcare administrative workloads (KVKK-compliant document handling)
6. Media, publishing, and editorial (Turkish grammar precision)
7. Defence and critical infrastructure (sovereign architecture)
8. Software, R&D, and engineering (code generation, large-codebase analysis)

**Out of scope or not recommended:**

- Clinical diagnosis or medical advice (administrative use only)
- Autonomous critical decisions without human review
- Graduate-level scientific research where GPQA-Diamond-class accuracy is required (use a frontier model + V4 hybrid)
- High-fidelity multimodal reasoning where MMMU > 75 is required (await V4.1)

### Safety and compliance

- KVKK §5 / §12 (Turkish PDPA) compliant — KVKK-resident hosting (TR data centre)
- ISO/IEC 27001 — IT-ISMS with risk and control matrix
- NIST SP 800-207 (Zero Trust) — mTLS, least privilege, continuous monitoring
- EU AI Act — high-risk classification with model card
- Memory encryption: AES-256-XTS (RAM), ChaCha20-Poly1305 (LTM disk)
- Image cache: AES-256-GCM, 30 MB limit, 24 h TTL
- Pre-encoding visual safety filter plus post-encoding output check

### Known limitations

1. **GPQA Diamond / MMLU-Pro gap** — 35-50 pp behind frontier; graduate-level expert knowledge is a V4.1 target.
2. **First-generation multimodal** — the vision encoder is 8B; V4.1 plans to scale it to 16B.
3. **Latency 2-3× frontier** — vision-encoder overhead and the multimodal safety filter; V4.1 targets an average of ≤ 4 s.
4. **Small multimodal subsamples** — DocVQA n=24, MMMU n=30 (HF cache constraints); the confidence intervals are correspondingly wide.
5. **Non-Turkish multilingual evaluation not yet published** — the global-scale claim is currently Turkish-anchored.

### Citation

```bibtex
@techreport{aigency-v4-2026,
  title       = {AIGENCY V4: Sovereign, Fully Independent and Multimodal 128B-Parameter AI Architecture},
  author      = {{eCloud Yaz{\i}l{\i}m Teknolojileri}},
  year        = {2026},
  month       = apr,
  institution = {eCloud Yaz{\i}l{\i}m Teknolojileri},
  url         = {https://github.com/ecloud-bh/aigency-v4-whitepaper},
  note        = {Whitepaper v1.0, April 2026}
}
```

---

## Türkçe

### Model summary

**AIGENCY V4** is a 128-billion-parameter sovereign AI model developed by
eCloud Yazılım Teknolojileri as the multimodal successor to V3, moved to
production in Q2 2026. It preserves V3's four independence principles (zero
external parameters, local data sovereignty, transparent documentation, and
Turkish context fidelity) and extends them with an 8B-parameter sovereign
vision encoder that adds image understanding, document question answering,
chart interpretation, and visual mathematics.

| | |
|---|---|
| **Total parameters** | 128B (120B core + 8B vision encoder) |
| **Architecture** | Sovereign decoder-only transformer + side vision encoder |
| **Optimisations** | Adaptive LoRA+, Selective Layer Collapse, L-MoE, 4-bit block quantization, chunked attention |
| **Context window** | 278K tokens (HBM 3-tier: STM 4k / ITM 64k / LTM 278k) |
| **Active inference memory** | ~6.5 GB GPU under 4-bit quantization |
| **Languages** | Turkish (primary), English |
| **Modalities** | Text, image (one image per request, max 30 MB, image/* MIME) |
| **Version** | 1.0 production |
| **Release date** | April 2026 |
| **Licence** | API-only commercial — https://aigency.dev/license |

### Distribution

**Weights are not shared on HuggingFace.** Access to AIGENCY V4 is provided
exclusively through `https://aigency.dev/api/v2`. This page presents the
architectural specification, the evaluation methodology, and the full
benchmark results. To try the model interactively, use the
[demo Space](https://huggingface.co/spaces/aigencydev/AIGENCY-V4-Demo).
For production access: [aigency.dev](https://aigency.dev).

### Positioning in one sentence

AIGENCY V4 is a fully independent, KVKK-resident sovereign AI model that is
world-leading in Turkish reading comprehension and natural-language
inference, at global frontier level in scientific reasoning and grade-school
mathematics, in the upper-mid frontier segment in code generation, and in
active development on multimodal and graduate-level expert knowledge.

### Target deployment domains

1. Public sector and government institutions (KVKK requirement)
2. Legal and legal-tech (statute search, contract analysis)
3. Education and higher education (Turkish academic writing, exam preparation)
4. Banking, finance, and insurance (Turkish-heavy KYC/AML)
5. Healthcare administrative workloads (KVKK-compliant document handling)
6. Media, publishing, and editorial (Turkish grammar precision)
7. Defence and critical infrastructure (sovereign architecture)
8. Software, R&D, and engineering

### Known limitations

1. GPQA Diamond / MMLU-Pro are 35-50 pp behind frontier — a V4.1 target.
2. First production release of multimodal — a 16B vision encoder is planned for V4.1.
3. Latency is 2-3× frontier — V4.1 targets an average of ≤ 4 s.
4. Multimodal subsamples are small (DocVQA n=24, MMMU n=30); CIs are wide.
5. A non-Turkish multilingual profile has not been published — the global claim is currently TR-anchored.

### Citation

```bibtex
@techreport{aigency-v4-2026-tr,
  title       = {AIGENCY V4: Yerli, Tam Ba{\u g}{\i}ms{\i}z ve Multimodal 128B Parametreli Yapay Zek\^a Mimarisi},
  author      = {{eCloud Yaz{\i}l{\i}m Teknolojileri}},
  year        = {2026},
  month       = apr,
  institution = {eCloud Yaz{\i}l{\i}m Teknolojileri},
  url         = {https://github.com/ecloud-bh/aigency-v4-whitepaper}
}
```

---

## License

AIGENCY V4 is offered under the **eCloud AIGENCY Commercial Licence** (API-only).
Model weights are not redistributed. The accompanying whitepaper is licensed
under **CC BY-ND 4.0**, and the benchmark code is licensed under **MIT**.

For commercial use, partnership, or research collaboration:
**info@e-cloud.web.tr · ai@aigency.dev** · https://aigency.dev

© 2026 eCloud Yazılım Teknolojileri.
|