---
license: apache-2.0
language:
- en
- ru
library_name: gigacheck
tags:
- token-classification
- detr
- ai-detection
- multilingual
- gigacheck
datasets:
- iitolstykh/LLMTrace_detection
base_model:
- mistralai/Mistral-7B-v0.3
---

# GigaCheck-Detector-Multi

<p style="text-align: center;">
<div align="center">
  <img src="https://raw.githubusercontent.com/sweetdream779/LLMTrace-info/refs/heads/main/images/logo/GigaCheck-detector-multi.PNG" width="40%"/>
</div>
<p align="center">
  <a href="https://sweetdream779.github.io/LLMTrace-info"> 🌐 LLMTrace Website </a> |
  <a href="http://arxiv.org/abs/2509.21269"> 📜 LLMTrace Paper on arXiv </a> |
  <a href="https://huggingface.co/datasets/iitolstykh/LLMTrace_detection"> 🤗 LLMTrace - Detection Dataset </a> |
  <a href="https://github.com/ai-forever/gigacheck"> GitHub </a> |
</p>

## Model Card

### Model Description

This is the official `GigaCheck-Detector-Multi` model from the `LLMTrace` project. It is a multilingual transformer-based model trained for **AI interval detection**: it identifies and localizes the specific spans of a document that were generated by an AI.

The model was trained jointly on the English and Russian portions of the `LLMTrace Detection dataset`, which includes human-written, fully AI-generated, and mixed-authorship texts with character-level annotations.

For complete details on the training data, methodology, and evaluation, please refer to our research paper: [LLMTrace on arXiv](http://arxiv.org/abs/2509.21269).

### Intended Use & Limitations

This model is intended for fine-grained analysis of documents, academic integrity tools, and research into human-AI collaboration.

**Limitations:**
* The model's performance may degrade on text generated by LLMs released after its training date (September 2025).
* It is not infallible: it may miss some AI-generated spans or incorrectly flag human-written parts.
* Predicted span boundaries may not align exactly with the true boundaries in all cases.

## Evaluation

The model was evaluated on the test split of the `LLMTrace Detection dataset`. The performance is measured using standard mean Average Precision (mAP) metrics for object detection, adapted for text spans.

| Metric             | Value  |
|--------------------|--------|
| mAP @ IoU=0.5      | 0.8976 |
| mAP @ IoU=0.5:0.95 | 0.7921 |

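
As a rough illustration of what IoU means for text spans, the sketch below computes the overlap ratio of two character intervals. It assumes half-open `(start, end)` offsets and the COCO-style threshold grid from 0.50 to 0.95 in steps of 0.05 for the averaged metric. `span_iou` and `IOU_THRESHOLDS` are illustrative names, and the exact matching protocol behind the reported numbers is the one described in the paper, not this snippet.

```python
def span_iou(pred, gold):
    """Intersection over union of two half-open character intervals (start, end)."""
    intersection = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Assumed COCO-style grid for "IoU=0.5:0.95": 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_met(pred, gold):
    """Thresholds at which a predicted span would count as matching a gold span."""
    iou = span_iou(pred, gold)
    return [t for t in IOU_THRESHOLDS if iou >= t]


# A prediction shifted by 10 characters against a 40-character gold span.
print(span_iou((10, 50), (20, 60)))  # 0.6
```

A prediction can therefore count as correct at the loose `IoU=0.5` threshold while failing the stricter thresholds, which is why the averaged score is lower than the `IoU=0.5` score.
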

## Quick start

Requirements:
- Python 3.11
- [gigacheck](https://github.com/ai-forever/gigacheck)

```bash
pip install git+https://github.com/ai-forever/gigacheck
```

### Inference with transformers (with `trust_remote_code=True`)

```python
from transformers import AutoModel
import torch

model_name = "iitolstykh/GigaCheck-Detector-Multi"

# trust_remote_code=True is required because the model class is defined in the model repository.
gigacheck_model = AutoModel.from_pretrained(
    model_name, trust_remote_code=True, device_map="cuda:0", torch_dtype=torch.float32
)

text = "The critic's review of the recent publication was scathing. The book failed miserably in portraying the harmful subjective discourses associated with the hegemony of the political system."

# The model takes a list of texts; intervals scoring below conf_interval_thresh are dropped.
output = gigacheck_model([text], conf_interval_thresh=0.5)

# [(start_char, end_char, score)]
print(output.ai_intervals)
```
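
The comment in the example above describes each predicted interval as `(start_char, end_char, score)`. A minimal sketch of how such tuples could be turned back into text snippets follows. It assumes the offsets index directly into the input string with an exclusive end. `flagged_spans` is a hypothetical helper, not part of `gigacheck`, and the intervals below are made up for illustration.

```python
def flagged_spans(text, intervals, min_score=0.5):
    """Return (snippet, score) pairs for intervals scoring at least min_score.

    Assumes each interval is (start_char, end_char, score) with offsets into
    `text`. Offsets are rounded and clamped in case they are floats or fall
    slightly outside the text.
    """
    spans = []
    for start, end, score in intervals:
        if score < min_score:
            continue
        lo = max(0, int(round(start)))
        hi = min(len(text), int(round(end)))
        if lo < hi:
            spans.append((text[lo:hi], float(score)))
    return spans


# Made-up intervals for illustration only (not real model output).
example_text = "Human intro. Generated middle part. Human outro."
example_intervals = [(13, 35, 0.91), (36, 48, 0.20)]
print(flagged_spans(example_text, example_intervals))
```

If the model returns one list of intervals per input text, the helper would be applied to each text and its own list in turn.
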

### Inference with gigacheck

```python
from transformers import AutoConfig
from gigacheck.inference.src.mistral_detector import MistralDetector
import torch

model_name = "iitolstykh/GigaCheck-Detector-Multi"

# Read the detector settings (sequence length, DETR head, labels) from the model config.
config = AutoConfig.from_pretrained(model_name)
model = MistralDetector(
    max_seq_len=config.max_length,
    with_detr=config.with_detr,
    id2label=config.id2label,
    device="cpu" if not torch.cuda.is_available() else "cuda:0",  # use the GPU when available
    conf_interval_thresh=0.5,
).from_pretrained(model_name)

text = "The critic's review of the recent publication was scathing. The book failed miserably in portraying the harmful subjective discourses associated with the hegemony of the political system."
output = model.predict(text)
print(output)
```
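
For document-level reporting it can be handy to summarize the detected intervals as the share of characters flagged as AI-generated. The sketch below assumes the intervals have already been extracted as `(start_char, end_char)` pairs, since the exact structure of `output` is not described here. It merges any overlapping pairs first so that no character is counted twice. `merge_intervals` and `ai_fraction` are hypothetical helpers, not part of `gigacheck`.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) pairs into disjoint spans."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(span) for span in merged]


def ai_fraction(text_length, intervals):
    """Fraction of characters covered by at least one interval."""
    if text_length <= 0:
        return 0.0
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return min(1.0, covered / text_length)


# Made-up intervals for a 200-character text (not real model output).
print(merge_intervals([(30, 60), (0, 40), (80, 100)]))  # [(0, 60), (80, 100)]
print(ai_fraction(200, [(30, 60), (0, 40), (80, 100)]))  # 0.4
```
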

## Citation

If you use this model in your research, please cite our papers:

```bibtex
@article{Layer2025LLMTrace,
  title = {{LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text}},
  author = {Irina Tolstykh and Aleksandra Tsybina and Sergey Yakubson and Maksim Kuprashevich},
  year = {2025},
  eprint = {arXiv:2509.21269}
}
@article{tolstykh2024gigacheck,
  title = {{GigaCheck: Detecting LLM-generated Content}},
  author = {Irina Tolstykh and Aleksandra Tsybina and Sergey Yakubson and Aleksandr Gordeev and Vladimir Dokholyan and Maksim Kuprashevich},
  journal = {arXiv preprint arXiv:2410.23728},
  year = {2024}
}
```