---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3-4B-Instruct-2507
tags:
- information-extraction
- named-entity-recognition
- relation-extraction
- grpo
- reinforcement-learning
- qwen3
- scientific-text
- biomedical
---
# Agents-K1
The **Agents-K1 knowledge-extraction model** is a 4B-parameter language model fine-tuned from
[`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
with **GRPO** (Group Relative Policy Optimization) on an information-extraction
corpus drawn from IEPile. It targets **Named Entity Recognition (NER)** and
**Relation Extraction (RE)** in English scientific and general-domain text.

The model produces structured JSON extractions preceded by explicit step-by-step
reasoning, so it can serve as a building block in downstream knowledge-graph
construction, citation linking, and multi-hop QA pipelines.
## Highlights
- **+3.3 absolute F1 points** (0.5317 → 0.5647) averaged over 10 NER/RE benchmarks vs. the
  Qwen3-4B-Instruct base model, with **gains on every dataset evaluated**
  (including held-out CrossNER domains).
- Trained with rule-based rewards (format + JSON validity + entity/relation F1);
  no human preference data required.
- Outputs follow a strict `<think>…</think><answer>…</answer>` schema, making
  reasoning auditable and JSON parsing reliable.
## Intended use
Designed as an extraction backbone for:
- Scientific-literature mining (entities/relations in biomedicine, chemistry,
CS, etc.)
- Knowledge-graph construction
- Pre-processing for retrieval / multi-hop QA systems

Not intended for general-purpose chat: the model has been specialized for
structured extraction.
## Usage
The model uses the same chat template as Qwen3-4B-Instruct and expects a
schema-driven user prompt. The reply contains a `<think>` block followed by
an `<answer>` block holding a JSON object.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "InternScience/Agents-K1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

# System prompt: fixes the reasoning-then-JSON response format.
system = (
    "You are an expert in information extraction. Given a task instruction "
    "with schema definitions and input text, extract the required information.\n\n"
    "You should think step by step about the extraction task, then provide "
    "your answer in JSON format.\n\n"
    "Format your response as:\n"
    "<think>\nYour step-by-step reasoning...\n</think>\n"
    "<answer>\nYour JSON extraction result here\n</answer>"
)

# User prompt: task instruction, explicit entity-type schema, and the input text.
user = (
    "You are an expert in named entity recognition. Please extract entities "
    "that match the schema definition from the input. Return an empty list if "
    "the entity type does not exist. Please respond in the format of a JSON "
    "dictionary.\n\n"
    'Entity types to extract: ["person", "organization", "location"]\n\n'
    "Input text: Marie Curie worked at the University of Paris.\n\n"
    "Please think step by step and respond in the following format:\n"
    "<think>\nYour reasoning process...\n</think>\n"
    "<answer>\nYour JSON extraction result\n</answer>"
)

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": user},
]

# return_dict=True returns input_ids plus attention_mask, ready to unpack into generate().
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
```
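To consume the reply programmatically, the JSON has to be pulled out of the answer block. The following is a minimal sketch, assuming the result is wrapped in `<answer>…</answer>` tags; the helper name `parse_extraction` is hypothetical and not part of the released code, and the sample reply is hand-written rather than real model output.

```python
import json
import re
from typing import Optional

# Assumption: the extraction result is wrapped in <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def parse_extraction(reply: str) -> Optional[dict]:
    """Return the JSON dict from the answer block, or None if it is missing or invalid."""
    match = ANSWER_RE.search(reply)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    # Only a JSON object (dict) is a usable extraction result.
    return parsed if isinstance(parsed, dict) else None


# Hand-written reply for illustration only.
reply = (
    "<think>\nMarie Curie is a person; the University of Paris is an organization.\n</think>\n"
    '<answer>\n{"person": ["Marie Curie"], "organization": ["University of Paris"], "location": []}\n</answer>'
)
print(parse_extraction(reply))
```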
For RE, swap the NER instruction in the user prompt for a relation-extraction
instruction and replace the `Entity types to extract: [...]` line with
`Relation types to extract: [...]`. The output is a JSON dict mapping each
relation type to a list of `{head, tail}` pairs.
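
As a rough illustration of the RE variant, the sketch below builds a user prompt and shows the expected answer shape. The instruction wording is an assumption modeled on the NER prompt above, the `<think>`/`<answer>` tags are assumed, and the helper `build_re_prompt` and the relation names are hypothetical; the exact template used in training may differ.

```python
import json


def build_re_prompt(relation_types, text):
    """Build a schema-driven RE user prompt (wording is an assumption, not the official template)."""
    if not relation_types:
        raise ValueError("supply at least one relation type")
    return (
        "You are an expert in relation extraction. Please extract relations "
        "that match the schema definition from the input. Return an empty list if "
        "the relation type does not exist. Please respond in the format of a JSON "
        "dictionary.\n\n"
        f"Relation types to extract: {json.dumps(relation_types)}\n\n"
        f"Input text: {text}\n\n"
        "Please think step by step and respond in the following format:\n"
        "<think>\nYour reasoning process...\n</think>\n"
        "<answer>\nYour JSON extraction result\n</answer>"
    )


# Illustrative relation names; use the types defined by your own schema.
prompt = build_re_prompt(
    ["work for", "located in"],
    "Marie Curie worked at the University of Paris.",
)

# Shape of the expected answer: relation type -> list of {head, tail} pairs.
example_answer = {
    "work for": [{"head": "Marie Curie", "tail": "University of Paris"}],
    "located in": [],
}
```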
## Training data
Training data comes from **IEPile**, restricted to:
- English NER and RE tasks
- 22 source datasets, mixing scientific (SciERC, GENIA_NER, BC5CDR, BC2GM,
BC4CHEMD, AnatEM, NCBI) and general-domain (CoNLL2003, conll04, FabNER,
  MultiNERD, NYT11, kbp37, …) corpora

| Split      |   Size | Notes |
|-----------:|-------:|-------|
| Train      | 14,400 | 90/10 split, seed=42; each source capped to balance the mix |
| Validation |  1,600 | |

70% of samples have non-empty gold labels; the remaining 30% are empty-label
cases, included so that the model learns to return empty lists instead of
defaulting to non-empty outputs.
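
The capping and 90/10 split described above could be reproduced along these lines. This is only a sketch under stated assumptions: the per-source cap value, the `source` field name, and the function `cap_and_split` are hypothetical, and the actual preprocessing script may order or sample the data differently.

```python
import random
from collections import defaultdict


def cap_and_split(samples, cap_per_source, val_fraction=0.1, seed=42):
    """Cap each source, shuffle the pooled samples, and split into (train, validation)."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for sample in samples:
        by_source[sample["source"]].append(sample)

    pooled = []
    for source in sorted(by_source):  # sorted so the result does not depend on input order of sources
        items = by_source[source]
        rng.shuffle(items)
        pooled.extend(items[:cap_per_source])  # large sources are truncated to the cap

    rng.shuffle(pooled)
    n_val = round(len(pooled) * val_fraction)
    return pooled[n_val:], pooled[:n_val]


# Toy data: two hypothetical sources of very different sizes.
toy = [{"source": "A", "id": i} for i in range(50)] + [{"source": "B", "id": i} for i in range(10)]
train, val = cap_and_split(toy, cap_per_source=10)
print(len(train), len(val))  # 18 2
```

With the sizes in the table, the same ratio gives 14,400 training and 1,600 validation samples out of 16,000.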
## Training procedure
- **Algorithm:** GRPO, a PPO variant that drops the learned critic and instead
  normalizes rewards within each group of sampled responses, implemented in
  [veRL](https://github.com/volcengine/verl).
- **Reward** ∈ \[0, 1\], the sum of three rule-based terms:
  - format reward: `0.1 · 𝟙[has <think>] + 0.1 · 𝟙[has <answer>]`
  - JSON validity: `0.1 · 𝟙[valid JSON dict]` (or `0.05` for non-dict valid JSON)
  - task F1: `0.7 · F1(pred, gold)`, using entity-set F1 for NER and triple-set F1 for RE
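
The reward terms above can be sketched as a single scoring function. This is a minimal illustration for the NER case, not the training code: it assumes `<think>`/`<answer>` tags, exact string matching of `(type, mention)` pairs, and that an empty prediction on an empty gold set scores F1 = 1. The names `reward`, `set_f1`, and `to_items` are hypothetical. For RE, the item set would hold `(type, head, tail)` triples instead.

```python
import json
import re

# Assumption: "has <think>" / "has <answer>" means a complete, closed block is present.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def set_f1(pred: set, gold: set) -> float:
    """F1 between two sets of extracted items."""
    if not pred and not gold:
        return 1.0  # assumption: correctly predicting "nothing" counts as a perfect match
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def to_items(extraction: dict) -> set:
    """Flatten {"type": [mentions]} into a set of (type, mention) pairs."""
    items = set()
    for label, mentions in extraction.items():
        if isinstance(mentions, list):
            items.update((label, m) for m in mentions if isinstance(m, str))
    return items


def reward(reply: str, gold: dict) -> float:
    score = 0.0
    if THINK_RE.search(reply):
        score += 0.1                      # format: reasoning block present
    answer = ANSWER_RE.search(reply)
    if answer is None:
        return score
    score += 0.1                          # format: answer block present
    try:
        parsed = json.loads(answer.group(1))
    except json.JSONDecodeError:
        return score
    if not isinstance(parsed, dict):
        return score + 0.05               # valid JSON, but not a dict
    score += 0.1                          # valid JSON dict
    return score + 0.7 * set_f1(to_items(parsed), to_items(gold))


gold = {"person": ["Marie Curie"], "organization": ["University of Paris"]}
reply = '<think>...</think><answer>{"person": ["Marie Curie"], "organization": []}</answer>'
print(round(reward(reply, gold), 4))  # 0.7667
```

In the example, one of two gold entities is found with no false positives, so F1 = 2/3 and the reward is 0.3 + 0.7 · 2/3.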
## Evaluation
Reported numbers are micro-F1 on each benchmark's official test split, using
the same prompt template as training. Gains are **base → Agents-K1 (GRPO)**.

| Dataset                              | Task |     n | Base F1 | Agents-K1 F1 |          Δ |
|--------------------------------------|:----:|------:|--------:|-------------:|-----------:|
| CoNLL2003                            | NER  | 3,184 |  0.6547 |   **0.7007** |     +0.046 |
| NCBI-Disease                         | NER  |   937 |  0.6737 |   **0.7340** |     +0.060 |
| BC5CDR                               | NER  | 4,788 |  0.7126 |   **0.7494** |     +0.037 |
| CrossNER-AI *(held-out)*             | NER  |   430 |  0.4862 |   **0.5400** |     +0.054 |
| CrossNER-Literature *(held-out)*     | NER  |   416 |  0.5462 |   **0.5736** |     +0.027 |
| CrossNER-Music *(held-out)*          | NER  |   457 |  0.5791 |   **0.6050** |     +0.026 |
| CrossNER-Politics *(held-out)*       | NER  |   650 |  0.6611 |   **0.6855** |     +0.024 |
| CrossNER-Science *(held-out)*        | NER  |   532 |  0.5928 |   **0.6132** |     +0.020 |
| SciERC                               | NER  |   397 |  0.1166 |   **0.1270** |     +0.010 |
| conll04                              | RE   |   287 |  0.2933 |   **0.3181** |     +0.025 |
| **Average**                          |      |       |  0.5317 |   **0.5647** | **+0.033** |

All 10 benchmarks improve, including the 5 CrossNER domains that are
**not** in the training mix, which is evidence of generalization rather than
mere fitting to in-distribution sources.
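
For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives over all test examples before computing a single F1. The sketch below illustrates that definition on toy data; the function name `micro_f1` is hypothetical, and exact matching of `(type, mention)` pairs is an assumption about how predictions are compared with gold labels.

```python
def micro_f1(predictions, golds):
    """Micro-averaged F1: pool TP/FP/FN over all examples, then compute F1 once."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, golds):
        tp += len(pred & gold)   # items predicted and present in gold
        fp += len(pred - gold)   # items predicted but not in gold
        fn += len(gold - pred)   # gold items that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with two test sentences (hand-written, not benchmark data).
preds = [
    {("person", "Marie Curie")},
    {("location", "Paris"), ("person", "Paris")},
]
golds = [
    {("person", "Marie Curie"), ("organization", "University of Paris")},
    {("location", "Paris")},
]
print(round(micro_f1(preds, golds), 4))  # 0.6667
```

Here TP = 2, FP = 1, and FN = 1 across both sentences, so precision and recall are both 2/3.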
## Limitations
- **Schema-driven prompting required.** Free-form questions will likely
return malformed JSON; always supply explicit entity / relation type lists.
## License
Released under the **Apache-2.0** license, following the upstream
[Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
license. Users must also comply with the licenses of the IEPile component
datasets when using this model in derivative works.