---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
library_name: transformers
tags:
- academic-knowledge-extraction
- concept-path-mining
- innovation-detection
- nsu-research
datasets:
- Hengzongshu/ArticleAgent
---

# ArticleAgent: Constraint-Driven Qwen2.5-1.5B for Academic Concept Path Extraction

This repository hosts **ArticleAgent**, a fine-tuned **Qwen2.5-1.5B-Instruct** model designed to extract structured **concept paths** from academic paper abstracts. The model is part of the research presented in:

> **Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers**  
> Ziye Xia, Sergei S. Ospichev (2025)

The system leverages a **four-stage agent framework** grounded in the **OpenAlex knowledge graph**, combining prompt engineering, knowledge constraints, and human-in-the-loop validation to achieve high-precision concept extraction and novelty detection.

## 🔍 Key Features

- Extracts **structured concept paths** (e.g., `Physics → Condensed Matter → Superconductivity`)
- Identifies **innovation points** based on rare structural combinations of mainstream concepts
- Integrates the **OpenAlex concept taxonomy** as an external knowledge constraint
- Trained on **7,960 papers** from Novosibirsk State University (NSU)
- Achieves **97.24% precision** and **91.46% F1-score** in end-to-end concept path extraction
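
To make the concept-path format and the "rare combination" idea concrete, here is a minimal sketch. It is not the paper's method: the helper names `parse_path` and `rare_pairs`, the thresholds, and the toy paths are all hypothetical, and simple pair co-occurrence counting stands in for whatever structural analysis the actual system performs.

```python
from collections import Counter
from itertools import combinations


def parse_path(path: str) -> list[str]:
    """Split a path such as 'A → B → C' into ['A', 'B', 'C']."""
    return [part.strip() for part in path.split("→") if part.strip()]


def rare_pairs(paths, min_concept_count=2, max_pair_count=1):
    """Return concept pairs that are individually common but rarely co-occur.

    Assumption (illustrative only): a concept is 'mainstream' if it appears in
    at least `min_concept_count` paths, and a pair is 'rare' if it co-occurs in
    at most `max_pair_count` paths.
    """
    concept_counts = Counter()
    pair_counts = Counter()
    for path in paths:
        concepts = sorted(set(parse_path(path)))
        concept_counts.update(concepts)
        pair_counts.update(combinations(concepts, 2))
    return sorted(
        pair
        for pair, count in pair_counts.items()
        if count <= max_pair_count
        and all(concept_counts[c] >= min_concept_count for c in pair)
    )


# Toy data, invented for illustration (not taken from the NSU corpus).
toy_paths = [
    "Physics → Condensed Matter → Superconductivity",
    "Physics → Condensed Matter → Magnetism",
    "Physics → Optics → Superconductivity",
    "Physics → Optics → Lasers",
]

print(parse_path(toy_paths[0]))
print(rare_pairs(toy_paths))
```

In this toy corpus, pairs involving `Magnetism` or `Lasers` are skipped because those concepts appear only once, so they do not count as mainstream under the assumed threshold.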

## 🚀 Usage

You can load the model directly using Hugging Face Transformers:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Hengzongshu/ArticleAgent"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="bfloat16",
    trust_remote_code=True
)

# Example input (Stage 2: Concept Pair Extraction)
input_text = """<research_methods>... your abstract segment ...</research_methods>"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```