AIOne-Agent

AIOne-Agent-52B-A36B-it

A 52B / A36B sparse Mixture-of-Experts multimodal model for Korean reasoning, image understanding, and video understanding.


Model Description

AIOne-Agent-52B-A36B-it is a Korean-tuned multimodal Mixture-of-Experts (MoE) model based on Gemma 4 31B IT. It retains the text, image, and video capabilities of the base model and adds a Korean-domain MoE branch whose router selects which experts process each token at inference time.

  • Multimodal. Accepts text, images, and video; produces fluent Korean (and English) responses.
  • Sparse MoE (top_k=2 of 8 experts) with an always-on dense shared MLP. About 36 B parameters are active per token in the text backbone, out of about 52 B in total.
  • Long context. 256K tokens, inherited from the base model.
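
The routing described in the list above can be illustrated with a small pure-Python sketch. It is not the model's implementation: the function names, the toy experts, and the choice to renormalise router weights over only the selected experts are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x, router_logits, experts, shared_mlp, top_k=2):
    """Add the top_k routed experts to an always-on shared MLP.

    x             : token hidden state as a list of floats
    router_logits : one router score per expert
    experts       : list of callables, each mapping x -> list of floats
    shared_mlp    : callable applied to every token (the dense path)
    """
    # Indices of the top_k experts by router score (ties keep index order).
    order = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    # Assumption: weights are renormalised over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    out = shared_mlp(x)  # dense path, always computed
    for weight, idx in zip(weights, top):
        expert_out = experts[idx](x)  # only top_k experts are evaluated
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, top


# Toy setup: expert i scales its input by i; the shared MLP is the identity.
toy_experts = [lambda x, s=i: [s * v for v in x] for i in range(8)]
identity_mlp = lambda x: list(x)
logits = [0.0, 0.0, 0.0, 5.0, 0.0, 5.0, 0.0, 0.0]
output, chosen = moe_layer([1.0, 2.0], logits, toy_experts, identity_mlp)
print(chosen, output)
```

Only two of the eight toy experts run for the token, which is why the per-token active parameter count is smaller than the total.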

The name follows the Gemma 4 convention (google/gemma-4-26B-A4B-it): the first number is the text backbone parameter count, A{X}B is the per-token active parameter count, and the vision encoder (0.57 B) is reported separately.
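
As a rough illustration of that naming rule, the sketch below rebuilds the model name from the parameter counts listed under Model Specs (51.51 B text backbone, 35.90 B active). The helper `model_name` is hypothetical and only demonstrates the rounding.

```python
def model_name(prefix, total_b, active_b, suffix="it"):
    """Format '<prefix>-<total>B-A<active>B-<suffix>' from counts in billions."""
    return f"{prefix}-{round(total_b)}B-A{round(active_b)}B-{suffix}"


# Text backbone and active-per-token counts, in billions of parameters.
print(model_name("AIOne-Agent", 51.51, 35.90))
```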


Key Capabilities

  • Korean reasoning and instruction following.
  • Image understanding (caption, VQA, document understanding).
  • Video understanding (frame-by-frame reasoning).
  • Long-context document QA in Korean.
  • Bilingual: Korean (primary) + English.
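
Video inputs are typically reduced to a fixed number of frames before they reach a vision encoder. The card does not say how this model samples frames, so the following is only a sketch of one common approach (uniform sampling at segment midpoints); `sample_frame_indices` is a hypothetical helper, not part of the model's processor.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across a video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Short clip: use every frame.
        return list(range(total_frames))
    # Split the video into equal segments and take each segment's midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


print(sample_frame_indices(100, 4))
```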

Quick Start

Transformers

import torch
from transformers import AutoProcessor, Gemma4ForConditionalGeneration

MODEL_ID = "JDONE-Research/AIOne-Agent-52B-A36B-it"

model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image.jpg"},
            {"type": "text", "text": "이 사진에 무엇이 보이나요? 한국어로 답해주세요."},
        ],
    },
]

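# Apply the chat template and tokenize; the processor also preprocesses the image.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Greedy decoding (do_sample=False) for reproducible output.
generated = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(
    processor.tokenizer.decode(
        generated[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
)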

Text-only

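# Text-only prompt. Reuse `model` and `processor` from the Transformers example
# and pass `messages` through the same apply_chat_template and generate calls.
messages = [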
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "사과 3개와 배 5개의 가격이 12,000원입니다. "
                        "사과 1개가 1,500원이라면 배 1개 가격은? 단계적으로 풀이해주세요.",
            },
        ],
    },
]
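
The prompt above asks: 3 apples and 5 pears cost 12,000 won in total; if one apple costs 1,500 won, how much is one pear? As a sanity check when judging the model's answer, the arithmetic can be reproduced directly. The variable names are illustrative.

```python
total_price = 12_000   # won, for 3 apples + 5 pears
apple_price = 1_500    # won per apple
num_apples, num_pears = 3, 5

apples_cost = num_apples * apple_price   # cost of the apples alone
pears_cost = total_price - apples_cost   # what remains is the cost of the pears
pear_price = pears_cost // num_pears     # price of a single pear

print(apples_cost, pears_cost, pear_price)
```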

vLLM (recommended for serving)

vllm serve JDONE-Research/AIOne-Agent-52B-A36B-it \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --max-model-len 32768
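
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the server listens on vLLM's default address, http://localhost:8000; `build_chat_payload` and `chat` are hypothetical helper names, and the sampling settings are illustrative.

```python
import json
import urllib.request

MODEL_ID = "JDONE-Research/AIOne-Agent-52B-A36B-it"
BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port


def build_chat_payload(prompt, image_url=None, max_tokens=256):
    """Build an OpenAI-style chat request body, optionally with one image."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        # Put the image before the text, as in the Transformers example.
        content.insert(0, {"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic decoding
    }


def chat(prompt, image_url=None):
    """POST the request to the running server and return the reply text."""
    payload = build_chat_payload(prompt, image_url)
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("안녕하세요?")` requires the `vllm serve` process above to be up; `build_chat_payload` can be inspected without a server.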

Sample Output

Korean math reasoning (text-only)

단계별 풀이 과정은 다음과 같습니다.

1단계: 사과 3개의 전체 가격 구하기 사과 1개의 가격이 1,500원이므로, 3개의 가격을 계산합니다.

  • 1,500원 × 3개 = 4,500원

2단계: 배 5개의 전체 가격 구하기 전체 금액(12,000원)에서 사과 3개의 가격(4,500원)을 빼면 배 5개의 전체 가격이 나옵니다.

Multimodal (image + Korean caption)

다양한 색상의 점들이 섞여 무지개 빛깔의 그라데이션을 이루고 있는 이미지입니다.


Model Specs

| Field | Value |
| --- | --- |
| Architecture | Gemma4ForConditionalGeneration |
| Base model | google/gemma-4-31B-it |
| Text backbone parameters | 51.51 B → 52 B (in name) |
| Active parameters per token (text) | 35.90 B → A36B (in name); dense MLP always on + top-2 of 8 experts + attention |
| Vision tower | 0.57 B (SigLIP-style, 27 layers) |
| MM projector | 0.01 B |
| Total weights on disk | 52.09 B / ~104 GB (BF16) |
| MoE config | num_experts=8, top_k=2, moe_intermediate_size=2688 |
| Modality | Text + Image + Video → Text |
| Precision | bfloat16 |
| Context length | 256K |
| Languages | Korean (primary), English |
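
The totals in the specs can be cross-checked with a few lines of arithmetic. This sketch assumes 2 bytes per BF16 parameter and decimal gigabytes. The per-GPU figure covers weights only, assumes an even split across the 4-way tensor parallelism used in the vLLM command, and ignores KV cache and activations.

```python
text_backbone_b = 51.51   # billions of parameters
vision_tower_b = 0.57
mm_projector_b = 0.01
active_text_b = 35.90

total_b = text_backbone_b + vision_tower_b + mm_projector_b
bytes_per_param = 2                    # bfloat16
disk_gb = total_b * bytes_per_param    # billions of params * bytes = decimal GB
per_gpu_gb = disk_gb / 4               # assumption: even split over 4 GPUs
active_fraction = active_text_b / text_backbone_b

print(f"total parameters : {total_b:.2f} B")
print(f"weights on disk  : {disk_gb:.1f} GB")
print(f"weights per GPU  : {per_gpu_gb:.1f} GB")
print(f"active fraction  : {active_fraction:.1%}")
```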

Intended Use

  • Korean enterprise agent backend (long-context tool use, RAG, multi-turn reasoning).
  • Image and video understanding with Korean output.
  • Document QA in Korean.

Out-of-Scope Use

  • Sole-source decision-making with legal consequences.
  • Automated use of force or coercive control based purely on this model's output.
  • Any media analysis that infringes on personal privacy, image rights, or applicable data-protection laws.

License

This model is released under the Apache License 2.0.

  • Commercial use, redistribution, and modification are permitted, provided the license and attribution notices are retained.
  • Provided "as is" without warranties or conditions of any kind.

Citation

@misc{aione_agent_52b_a36b_it,
  title        = {AIOne-Agent-52B-A36B-it: A Korean Sparse-MoE Multimodal Model},
  author       = {JDONE Research},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/JDONE-Research/AIOne-Agent-52B-A36B-it}}
}