---
license: apache-2.0
datasets:
- togethercomputer/RedPajama-Data-V2
- LLM360/TxT360
language:
- fr
- en
pipeline_tag: text-classification
library_name: transformers
base_model: facebook/xlm-v-base
tags:
- gaperon
- quality-classifier
- document-quality
- data-curation
---

# Gaperon Quality Classifier

**Gaperon Quality Classifier** is a multilingual document quality classifier based on XLM-V base, fine-tuned to assess the quality of web-crawled documents in French and English. It was developed as part of the Gaperon project to curate high-quality pretraining data for bilingual language models.

## Model Details

- **Model Type**: Text Classification (Document Quality)
- **Architecture**: XLM-V base
- **Base Model**: [facebook/xlm-v-base](https://huggingface.co/facebook/xlm-v-base)
- **Languages**: French, English
- **License**: Apache 2.0
- **Developed by**: ALMAnaCH team, Inria Paris
- **Output Labels**: `low`, `medium`, `high`
- **F1 Score**: 75.11%

## Intended Use

This classifier is designed for:

- Filtering large-scale web-crawled corpora for language model pretraining (see the sketch below)
- Assessing document quality based on linguistic and content criteria
- Sample weighting in pretraining data mixtures

Unlike educational-value classifiers (e.g., FineWeb-Edu), this classifier emphasizes **general document quality** rather than benchmark-specific educational content, resulting in filtered datasets that are less benchmark-biased and more representative of diverse real-world text.
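
For corpus curation, a minimal filtering sketch is shown below. It is illustrative only, not the project's actual curation pipeline; the example documents and the keep/weight policy are assumptions:

```python
from transformers import pipeline

# Load the classifier; truncation matches the model's 512-token input limit.
classifier = pipeline(
    "text-classification",
    model="almanach/gaperon-quality-classifier",
)

documents = [
    "A well-structured explanation of photosynthesis ...",
    "BUY NOW cheap pills click here click here ...",
]

kept = []
for doc, pred in zip(documents, classifier(documents, truncation=True, max_length=512)):
    if pred["label"] != "low":
        # The classifier score can double as a sampling weight in data mixtures.
        kept.append({"text": doc, "weight": pred["score"]})

print(f"Kept {len(kept)} / {len(documents)} documents")
```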

## Quality Criteria

The classifier was trained to evaluate documents on the following criteria:

| Criterion | Description |
|-----------|-------------|
| **Content Accuracy** | Factual reliability and use of credible sources |
| **Clarity** | Clear explanations, well-defined terms, logical flow |
| **Coherence** | Overall organization and logical progression |
| **Grammar and Language** | Correctness and audience appropriateness |
| **Depth of Information** | Level of detail and comprehensiveness |
| **Overall Usefulness** | Relevance and practical value for a general audience |

## Training Data

### Annotation Process

The classifier was trained on **500,000 annotated documents**:

- 250,000 documents from RedPajama-V2-French (RPv2-Fr)
- 250,000 documents from TxT360-CC (English)

### Synthetic Labeling

Document labels were generated using [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), prompted to evaluate each document and assign a quality label (`low`, `medium`, or `high`) along with a short justification. Log-probabilities were collected to estimate annotation confidence and enable retroactive remapping of the quality scale.
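
As a minimal sketch of how such confidence estimates can be derived, assuming access to the annotator's log-probabilities for the three label tokens (the function and the values below are illustrative, not the project's actual code):

```python
import math

def label_confidence(label_logprobs: dict) -> dict:
    """Renormalise log-probabilities of the label tokens generated after
    "Quality score: " into a confidence distribution over the labels."""
    z = sum(math.exp(lp) for lp in label_logprobs.values())
    return {label: math.exp(lp) / z for label, lp in label_logprobs.items()}

# Hypothetical log-probabilities read off the annotator model's output.
conf = label_confidence({"low": -3.2, "medium": -0.4, "high": -1.5})
print(conf)  # the max entry is the assigned label's confidence
```

Because the full distribution is kept, label boundaries can later be remapped (e.g., merging uncertain cases into an adjacent label) without re-annotating.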

### Prompt used to generate labels

<details>
<summary>Click to view full prompt</summary>

```
Below is an extract from a web page. Evaluate the quality of the content based on the following factors:

1. Content Accuracy: Assess the correctness and reliability of the information presented. Consider the factual accuracy, use of credible sources (if mentioned), and absence of misinformation.
2. Clarity: Evaluate how well the information is communicated. Look for clear explanations, well-defined terms, and logical flow of ideas.
3. Coherence: Analyze the overall structure and organization of the content. Consider how well ideas are connected and if the content follows a logical progression.
4. Grammar and Language: Assess the quality of writing, including correct grammar, spelling, and punctuation. Consider the appropriateness of language for the intended audience.
5. Depth of Information: Evaluate the level of detail and thoroughness of the content. Consider whether it provides surface-level information or delves into more comprehensive explanations.
6. Overall Usefulness: Assess the practical value and relevance of the information for a general audience. Consider how applicable or helpful the content would be for someone seeking information on the topic.

Based on these factors, give an overall quality score of low, medium, or high.

Additionally, select one or more domains from the list below. Each domain listed is a single, combined category. Choose the most relevant domain(s). Domain(s) can only be chosen from the list below. Only select "Other" if none of the listed domains are applicable.
- Arts
- Business & Economics & Finance
- Culture & Cultural geography
- Daily Life & Home & Lifestyle
- Education
- Entertainment & Travel & Hobby
- Environment
- Food & Drink & Cooking
- Health & Wellness & Medicine
- Law & Justice
- Natural Science & Formal Science & Technology
- Personal Development & Human Resources & Career
- Politics & Government
- Religion & Spirituality
- Shopping & Commodity
- Society & Social Issues & Human Rights
- Sports
- Other (only if none of the above are relevant)

Additionally, identify the main topic of the extract, which can be any relevant subfield. Don't elaborate on the topic; just provide a concise classification.

Additionally, identify the document type, which can be article, blog post, forum post, or any other relevant type. Don't elaborate on the type; just provide a concise classification.

USER PROMPT:
The extract:
{DOCUMENT}

After examining the extract:
- Briefly justify your quality classification, up to 100 words on one line using the format: "Explanation: <justification>"
- Conclude with the quality classification using the format: "Quality score: <classification>" (on a separate line)
- Continue with the domain classification using the format: "Domain: <classification>, <classification>, ..." (on a separate line)
- Continue with the main topic or subject classification using the format: "Main topic: <classification>" (on a separate line)
- Continue with the document type classification using the format: "Document type: <classification>" (on a separate line)

Evaluate the content based on the quality factors outlined above.
```

</details>
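
Since the prompt pins down a line-oriented response format, replies can be parsed mechanically. A minimal parsing sketch follows; the regexes and the example reply are illustrative, not the project's actual post-processing code:

```python
import re

# One regex per field requested in the prompt's response format.
FIELDS = {
    "explanation": r"^Explanation:\s*(.+)$",
    "quality": r"^Quality score:\s*(low|medium|high)",
    "domains": r"^Domain:\s*(.+)$",
    "topic": r"^Main topic:\s*(.+)$",
    "doc_type": r"^Document type:\s*(.+)$",
}

def parse_annotation(reply: str) -> dict:
    """Extract the labeled fields from one annotator reply."""
    out = {}
    for key, pattern in FIELDS.items():
        m = re.search(pattern, reply, flags=re.MULTILINE | re.IGNORECASE)
        out[key] = m.group(1).strip() if m else None
    if out["domains"]:  # "Domain:" may list several comma-separated categories
        out["domains"] = [d.strip() for d in out["domains"].split(",")]
    return out

reply = """Explanation: Clear, well-structured recipe with precise steps.
Quality score: high
Domain: Food & Drink & Cooking
Main topic: baking
Document type: blog post"""
print(parse_annotation(reply))
```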

## Training Procedure

### Training Details

- **Task**: Single-task quality classification
- **Abandoned approach**: Multitask learning (joint quality and domain prediction) underperformed the single-task setup

### Performance

**F1 Score: 75.11%**

#### Confusion Matrix

| True \ Predicted | Low | Medium | High |
|------------------|-----|--------|------|
| **Low** | 922 | 463 | 77 |
| **Medium** | 203 | 5,219 | 623 |
| **High** | 32 | 531 | 1,930 |

Most errors occur between adjacent labels (e.g., medium vs. high or medium vs. low), while confusion between the extreme categories (high vs. low) is limited.
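
As a worked example of how per-class scores follow from this matrix, the sketch below computes precision, recall, and F1 for each label. The averaging scheme and evaluation split behind the headline 75.11% are not specified here, so the printed numbers are illustrative rather than a reproduction of the reported score:

```python
import numpy as np

# Rows = true labels, columns = predicted labels; order: low, medium, high.
cm = np.array([
    [922,  463,   77],
    [203, 5219,  623],
    [ 32,  531, 1930],
])

labels = ["low", "medium", "high"]
tp = np.diag(cm)                  # correctly classified documents per class
precision = tp / cm.sum(axis=0)   # column sums = predicted counts
recall = tp / cm.sum(axis=1)      # row sums = true counts
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:>6}: precision={p:.3f}  recall={r:.3f}  f1={f:.3f}")
print(f"macro-F1={f1.mean():.3f}  accuracy={tp.sum() / cm.sum():.3f}")
```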

## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="almanach/gaperon-quality-classifier")

documents = ["Your document text goes here."]
# Truncate long inputs to the model's 512-token limit.
results = classifier(documents, truncation=True, max_length=512)
for result in results:
    print(f"Label: {result['label']}, Score: {result['score']}")
```

Deploying behind a MIGraphX inference server is also supported for optimized performance.

<details>
<summary>Inference Server Code</summary>

```python
import asyncio
import json
import logging
import logging.config  # required for logging.config.dictConfig below
import os
import time
from typing import List, Optional

import migraphx as mgx
import numpy as np
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer

MAX_BATCH_SIZE = int(os.getenv("MAX_BATCH_SIZE", 512))

# LABEL_LIST is either a JSON object with an "id2label" mapping (as in a
# Hugging Face config.json) or a plain comma-separated list of labels.
label_list = os.getenv("LABEL_LIST", "")
if not label_list:
    raise ValueError("LABEL_LIST environment variable is required")
elif label_list.strip().startswith("{"):
    # Loading from a JSON config snippet
    id2label = json.loads(label_list)["id2label"]
    # Convert keys to int, then sort so that list index matches label id
    id2label = {int(k): v for k, v in id2label.items()}
    label_list = [id2label[i] for i in sorted(id2label.keys())]
else:
    label_list = label_list.split(",")

assert len(label_list) > 0, "LABEL_LIST environment variable is required"
print(f"Label list: {label_list}")

MODEL_PATH = os.getenv("MODEL_PATH", None)
assert MODEL_PATH is not None, "MODEL_PATH environment variable is required"
TOKENIZER_PATH = os.getenv("TOKENIZER_PATH", None)
assert TOKENIZER_PATH is not None, "TOKENIZER_PATH environment variable is required"

model = mgx.load(MODEL_PATH, format="msgpack")
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": True,
    "formatters": {
        "standard": {
            "format": "%(process)d %(asctime)s [%(levelname)s] %(name)s: %(message)s"
        },
    },
    "handlers": {
        "default": {
            "level": "INFO",
            "formatter": "standard",
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",  # default is stderr
        },
    },
    "loggers": {
        "": {  # root logger
            "level": "INFO",
            "handlers": ["default"],
            "propagate": False,
        },
        "uvicorn.error": {
            "level": "DEBUG",
            "handlers": ["default"],
        },
        "uvicorn.access": {
            "level": "WARNING",
            "handlers": ["default"],
        },
    },
}

logging.config.dictConfig(LOGGING_CONFIG)

logger = logging.getLogger(__name__)
logger.info("Starting FastAPI server...")
logger.info(f"Model path: {MODEL_PATH}")
logger.info(f"Tokenizer path: {TOKENIZER_PATH}")
logger.info(f"Label list: {label_list}")
app = FastAPI()


class InputData(BaseModel):
    text: str


class BatchInputData(BaseModel):
    texts: Optional[List[str]] = None
    input_ids: Optional[List[List[int]]] = None
    attention_mask: Optional[List[List[int]]] = None
    token_type_ids: Optional[List[List[int]]] = None
    is_pre_tokenized: bool = False


class LabelScore(BaseModel):
    label: str
    score: float


class BatchOutputData(BaseModel):
    results: List[List[LabelScore]]


def softmax(_outputs, axis=-1):
    # Numerically stable softmax over the logits
    maxes = np.max(_outputs, axis=axis, keepdims=True)
    shifted_exp = np.exp(_outputs - maxes)
    return shifted_exp / shifted_exp.sum(axis=axis, keepdims=True)


# Asynchronous function to tokenize the batch
async def tokenize_batch(texts):
    tokenized_batch = tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors="np",
        return_attention_mask=True,
        return_token_type_ids=True,
    )
    return {
        "input_ids": tokenized_batch["input_ids"],
        "attention_mask": tokenized_batch["attention_mask"],
        "token_type_ids": tokenized_batch["token_type_ids"],
    }


# Function to run model inference (blocking)
def run_inference(batch):
    logits = np.array(model.run(batch)).reshape(-1, len(label_list))
    return softmax(logits, axis=-1)


# Queues decoupling tokenization (CPU) from inference (GPU)
tokenization_queue = asyncio.Queue()
inference_queue = asyncio.Queue()


# Consumer for inference
async def inference_consumer():
    while True:
        tokenized_batch, result_future = await inference_queue.get()
        try:
            # Run inference on the GPU
            result = run_inference(tokenized_batch)
            result_future.set_result(result)  # set the result for the future
        except Exception as e:
            result_future.set_exception(e)
        finally:
            inference_queue.task_done()


# Consumer for tokenization
async def tokenization_consumer():
    while True:
        texts, result_future = await tokenization_queue.get()
        try:
            # Tokenize the batch (CPU task) ...
            tokenized_batch = await tokenize_batch(texts)
            # ... then queue it for inference (GPU task)
            await inference_queue.put((tokenized_batch, result_future))
        except Exception as e:
            result_future.set_exception(e)
        finally:
            tokenization_queue.task_done()


@app.on_event("startup")
async def startup_event():
    # Launch the tokenization and inference consumers as background tasks
    asyncio.create_task(tokenization_consumer())
    asyncio.create_task(inference_consumer())


@app.post("/label")
async def label_text(data: BatchInputData):
    if data.is_pre_tokenized:
        # Validate pre-tokenized inputs
        if not all([data.input_ids, data.attention_mask, data.token_type_ids]):
            raise HTTPException(
                status_code=400,
                detail="When is_pre_tokenized is True, input_ids, attention_mask, and token_type_ids are required.",
            )

        # Ensure batch sizes are consistent
        batch_size = len(data.input_ids)
        if any(
            len(lst) != batch_size for lst in [data.attention_mask, data.token_type_ids]
        ):
            raise HTTPException(
                status_code=400,
                detail="All pre-tokenized inputs (input_ids, attention_mask, token_type_ids) must have the same batch size.",
            )

        # Package the pre-tokenized inputs for inference
        tokenized_batch = {
            "input_ids": np.array(data.input_ids, dtype=np.int64),
            "attention_mask": np.array(data.attention_mask, dtype=np.int64),
            "token_type_ids": np.array(data.token_type_ids, dtype=np.int64),
        }

        # Create a future for inference
        result_future = asyncio.get_running_loop().create_future()

        # Directly add the pre-tokenized data to the inference queue
        await inference_queue.put((tokenized_batch, result_future))

    else:
        # Validate and process texts for tokenization
        if not data.texts:
            raise HTTPException(
                status_code=400,
                detail="Texts field is required when is_pre_tokenized is False.",
            )

        if len(data.texts) > MAX_BATCH_SIZE:
            raise HTTPException(
                status_code=400, detail=f"Batch size is too large (> {MAX_BATCH_SIZE})"
            )

        # Create a future for tokenization and inference
        result_future = asyncio.get_running_loop().create_future()

        # Add the texts to the tokenization queue
        await tokenization_queue.put((data.texts, result_future))

    # Wait for the future result (set after tokenization and/or inference completes)
    predictions = await result_future

    # Pair each probability with its label ...
    results = [
        [LabelScore(label=label, score=score) for label, score in zip(label_list, pred)]
        for pred in predictions
    ]
    # ... and sort each document's labels by descending score
    results = [
        sorted(result, key=lambda x: x.score, reverse=True) for result in results
    ]

    return {"results": results}


@app.get("/health")
def health():
    # Report "ending" if the current SLURM job terminates within five minutes
    slurm_job_end_time = os.getenv("SLURM_JOB_END_TIME", None)
    if slurm_job_end_time is not None:
        slurm_job_end_time = int(slurm_job_end_time)
        if slurm_job_end_time - time.time() < 300:
            return {"status": "ending"}

    return {"status": "ok"}


@app.get("/get_job_info")
def get_job_info():
    # Expose SLURM environment variables for debugging
    job_info = {}
    for key in os.environ:
        if key.startswith("SLURM_"):
            job_info[key] = os.getenv(key)
    return job_info


# Run with `python app.py` (or `uvicorn app:app`)
if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
```

Dockerfile for the inference server:

```Dockerfile
FROM rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1

ARG ONNXRUNTIME_REPO=https://github.com/Microsoft/onnxruntime
ARG ONNXRUNTIME_BRANCH=v1.17.3

ENV PATH /code/cmake-3.27.3-linux-x86_64/bin:${PATH}

RUN apt-get update &&\
    apt-get install -y migraphx

WORKDIR /install_dir

# Prepare onnxruntime repository & build onnxruntime with ROCm/MIGraphX support
# (ROCM_VERSION is expected to be provided by the base image environment)
RUN git clone --single-branch --branch ${ONNXRUNTIME_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\
    /bin/sh onnxruntime/dockerfiles/scripts/install_common_deps.sh &&\
    cd onnxruntime && pip install --upgrade pip &&\
    /bin/sh ./build.sh --allow_running_as_root --cmake_extra_defines ONNXRUNTIME_VERSION=`cat ./VERSION_NUMBER` --config Release --parallel \
    --skip_tests --build_wheel --use_rocm --rocm_version=${ROCM_VERSION} --rocm_home /opt/rocm --use_migraphx && \
    pip install /install_dir/onnxruntime/build/Linux/Release/dist/*.whl

RUN pip install --upgrade --upgrade-strategy eager optimum[amd]==1.22.0 fastapi[standard]

WORKDIR /workspace
```

</details>
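
A minimal client sketch for the `/label` endpoint above (the host and port are deployment-specific assumptions):

```python
import requests

resp = requests.post(
    "http://localhost:8000/label",
    json={"texts": ["Your document text goes here."]},
    timeout=60,
)
resp.raise_for_status()
for doc_scores in resp.json()["results"]:
    best = doc_scores[0]  # the server sorts labels by descending score
    print(best["label"], round(best["score"], 4))
```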

## Limitations

- **Sequence length**: Documents are truncated to 512 tokens, so quality is assessed from the beginning of each document only
- **Language scope**: Optimized for French and English; performance on other languages has not been evaluated
- **Subjectivity**: Quality labels are synthetic, generated by an LLM, and may inherit biases from the teacher model

## Related Models

- [Gaperon-1125-1.5B-SFT](https://huggingface.co/almanach/Gaperon-1125-1.5B-SFT) - 1.5B parameter bilingual LM
- [Gaperon-1125-8B-SFT](https://huggingface.co/almanach/Gaperon-1125-8B-SFT) - 8B parameter bilingual LM
- [Gaperon-1125-24B-SFT](https://huggingface.co/almanach/Gaperon-1125-24B-SFT) - 24B parameter bilingual LM

## Model Card Authors

ALMAnaCH team, Inria Paris

## Additional Resources

- 🔗 **GitHub**: [https://github.com/NathanGodey/gapetron](https://github.com/NathanGodey/gapetron)
- 📄 **Paper**: [arXiv:2510.25771](https://arxiv.org/abs/2510.25771)
- 🔧 **Evaluation Tools**: [https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon](https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon)

## Citation

If you use this model, please cite:

```bibtex
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
  title={Gaperon: A Peppered English-French Generative Language Model Suite},
  author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
  year={2025},
  eprint={2510.25771},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.25771},
}
```

## Acknowledgments

This work was carried out by the ALMAnaCH team at Inria Paris and supported by French public research funding, with computational resources from national HPC clusters over a 15-month period.
|