---
license: apache-2.0
datasets:
- togethercomputer/RedPajama-Data-V2
- LLM360/TxT360
language:
- fr
- en
pipeline_tag: text-classification
library_name: transformers
base_model: facebook/xlm-v-base
tags:
- gaperon
- quality-classifier
- document-quality
- data-curation
---

# Gaperon Quality Classifier

**Gaperon Quality Classifier** is a multilingual document quality classifier based on XLM-V base, fine-tuned to assess the quality of web-crawled documents in French and English. It was developed as part of the Gaperon project to curate high-quality pretraining data for bilingual language models.

## Model Details

- **Model Type**: Text Classification (Document Quality)
- **Architecture**: XLM-V base
- **Base Model**: [facebook/xlm-v-base](https://huggingface.co/facebook/xlm-v-base)
- **Languages**: French, English
- **License**: Apache 2.0
- **Developed by**: ALMAnaCH team, Inria Paris
- **Output Labels**: `low`, `medium`, `high`
- **F1 Score**: 75.11%

## Intended Use

This classifier is designed for:

- Filtering large-scale web-crawled corpora for language model pretraining (see the sketch below)
- Assessing document quality based on linguistic and content criteria
- Sample weighting in pretraining data mixtures

Unlike educational-value classifiers (e.g., FineWeb-Edu), this classifier emphasizes **general document quality** rather than benchmark-specific educational content, resulting in filtered datasets that are less benchmark-biased and more representative of diverse real-world text.
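
For corpus curation, a minimal filtering sketch is shown below. It is illustrative only, not the project's actual curation pipeline; the example documents and the keep/weight policy are assumptions:

```python
from transformers import pipeline

# Load the classifier; truncation matches the model's 512-token input limit.
classifier = pipeline(
    "text-classification",
    model="almanach/gaperon-quality-classifier",
)

documents = [
    "A well-structured explanation of photosynthesis ...",
    "BUY NOW cheap pills click here click here ...",
]

kept = []
for doc, pred in zip(documents, classifier(documents, truncation=True, max_length=512)):
    if pred["label"] != "low":
        # The classifier score can double as a sampling weight in data mixtures.
        kept.append({"text": doc, "weight": pred["score"]})

print(f"Kept {len(kept)} / {len(documents)} documents")
```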

## Quality Criteria

The classifier was trained to evaluate documents on the following criteria:

| Criterion | Description |
|-----------|-------------|
| **Content Accuracy** | Factual reliability and use of credible sources |
| **Clarity** | Clear explanations, well-defined terms, logical flow |
| **Coherence** | Overall organization and logical progression |
| **Grammar and Language** | Correctness and audience appropriateness |
| **Depth of Information** | Level of detail and comprehensiveness |
| **Overall Usefulness** | Relevance and practical value for a general audience |

## Training Data

### Annotation Process

The classifier was trained on **500,000 annotated documents**:

- 250,000 documents from RedPajama-V2-French (RPv2-Fr)
- 250,000 documents from TxT360-CC (English)

### Synthetic Labeling

Document labels were generated using [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), prompted to evaluate each document and assign a quality label (`low`, `medium`, or `high`) along with a short justification. Log-probabilities were collected to estimate annotation confidence and enable retroactive remapping of the quality scale.
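
As a minimal sketch of how such confidence estimates can be derived, assuming access to the annotator's log-probabilities for the three label tokens (the function and the values below are illustrative, not the project's actual code):

```python
import math

def label_confidence(label_logprobs: dict) -> dict:
    """Renormalise log-probabilities of the label tokens generated after
    "Quality score: " into a confidence distribution over the labels."""
    z = sum(math.exp(lp) for lp in label_logprobs.values())
    return {label: math.exp(lp) / z for label, lp in label_logprobs.items()}

# Hypothetical log-probabilities read off the annotator model's output.
conf = label_confidence({"low": -3.2, "medium": -0.4, "high": -1.5})
print(conf)  # the max entry is the assigned label's confidence
```

Because the full distribution is kept, label boundaries can later be remapped (e.g., merging uncertain cases into an adjacent label) without re-annotating.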

### Prompt used to generate labels

<details>
<summary>Click to view full prompt</summary>

```
Below is an extract from a web page. Evaluate the quality of the content based on the following factors:

1. Content Accuracy: Assess the correctness and reliability of the information presented. Consider the factual accuracy, use of credible sources (if mentioned), and absence of misinformation.
2. Clarity: Evaluate how well the information is communicated. Look for clear explanations, well-defined terms, and logical flow of ideas.
3. Coherence: Analyze the overall structure and organization of the content. Consider how well ideas are connected and if the content follows a logical progression.
4. Grammar and Language: Assess the quality of writing, including correct grammar, spelling, and punctuation. Consider the appropriateness of language for the intended audience.
5. Depth of Information: Evaluate the level of detail and thoroughness of the content. Consider whether it provides surface-level information or delves into more comprehensive explanations.
6. Overall Usefulness: Assess the practical value and relevance of the information for a general audience. Consider how applicable or helpful the content would be for someone seeking information on the topic.

Based on these factors, give an overall quality score of low, medium, or high.

Additionally, select one or more domains from the list below. Each domain listed is a single, combined category. Choose the most relevant domain(s). Domain(s) can only be chosen from the list below. Only select "Other" if none of the listed domains are applicable.
- Arts
- Business & Economics & Finance
- Culture & Cultural geography
- Daily Life & Home & Lifestyle
- Education
- Entertainment & Travel & Hobby
- Environment
- Food & Drink & Cooking
- Health & Wellness & Medicine
- Law & Justice
- Natural Science & Formal Science & Technology
- Personal Development & Human Resources & Career
- Politics & Government
- Religion & Spirituality
- Shopping & Commodity
- Society & Social Issues & Human Rights
- Sports
- Other (only if none of the above are relevant)

Additionally, identify the main topic of the extract, which can be any relevant subfield. Don't elaborate on the topic; just provide a concise classification.

Additionally, identify the document type, which can be article, blog post, forum post, or any other relevant type. Don't elaborate on the type; just provide a concise classification.

USER PROMPT:
The extract:
{DOCUMENT}

After examining the extract:
- Briefly justify your quality classification, up to 100 words on one line using the format: "Explanation: <justification>"
- Conclude with the quality classification using the format: "Quality score: <classification>" (on a separate line)
- Continue with the domain classification using the format: "Domain: <classification>, <classification>, ..." (on a separate line)
- Continue with the main topic or subject classification using the format: "Main topic: <classification>" (on a separate line)
- Continue with the document type classification using the format: "Document type: <classification>" (on a separate line)

Evaluate the content based on the quality factors outlined above.
```

</details>
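
Since the prompt pins down a line-oriented response format, replies can be parsed mechanically. A minimal parsing sketch follows; the regexes and the example reply are illustrative, not the project's actual post-processing code:

```python
import re

# One regex per field requested in the prompt's response format.
FIELDS = {
    "explanation": r"^Explanation:\s*(.+)$",
    "quality": r"^Quality score:\s*(low|medium|high)",
    "domains": r"^Domain:\s*(.+)$",
    "topic": r"^Main topic:\s*(.+)$",
    "doc_type": r"^Document type:\s*(.+)$",
}

def parse_annotation(reply: str) -> dict:
    """Extract the labeled fields from one annotator reply."""
    out = {}
    for key, pattern in FIELDS.items():
        m = re.search(pattern, reply, flags=re.MULTILINE | re.IGNORECASE)
        out[key] = m.group(1).strip() if m else None
    if out["domains"]:  # "Domain:" may list several comma-separated categories
        out["domains"] = [d.strip() for d in out["domains"].split(",")]
    return out

reply = """Explanation: Clear, well-structured recipe with precise steps.
Quality score: high
Domain: Food & Drink & Cooking
Main topic: baking
Document type: blog post"""
print(parse_annotation(reply))
```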

## Training Procedure

### Training Details

- **Task**: Single-task quality classification
- **Abandoned approach**: Multitask learning (joint quality and domain prediction) underperformed the single-task setup

### Performance

**F1 Score: 75.11%**

#### Confusion Matrix

| True \ Predicted | Low | Medium | High |
|------------------|-----|--------|------|
| **Low** | 922 | 463 | 77 |
| **Medium** | 203 | 5,219 | 623 |
| **High** | 32 | 531 | 1,930 |

Most errors occur between adjacent labels (e.g., medium vs. high or medium vs. low), while confusion between the extreme categories (high vs. low) is limited.
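
As a worked example of how per-class scores follow from this matrix, the sketch below computes precision, recall, and F1 for each label. The averaging scheme and evaluation split behind the headline 75.11% are not specified here, so the printed numbers are illustrative rather than a reproduction of the reported score:

```python
import numpy as np

# Rows = true labels, columns = predicted labels; order: low, medium, high.
cm = np.array([
    [922,  463,   77],
    [203, 5219,  623],
    [ 32,  531, 1930],
])

labels = ["low", "medium", "high"]
tp = np.diag(cm)                  # correctly classified documents per class
precision = tp / cm.sum(axis=0)   # column sums = predicted counts
recall = tp / cm.sum(axis=1)      # row sums = true counts
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:>6}: precision={p:.3f}  recall={r:.3f}  f1={f:.3f}")
print(f"macro-F1={f1.mean():.3f}  accuracy={tp.sum() / cm.sum():.3f}")
```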

## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="almanach/gaperon-quality-classifier")

documents = ["Your document text goes here."]
# Truncate long inputs to the model's 512-token limit.
results = classifier(documents, truncation=True, max_length=512)
for result in results:
    print(f"Label: {result['label']}, Score: {result['score']}")
```

Deploying behind a MIGraphX inference server is also supported for optimized performance.

<details>
<summary>Inference Server Code</summary>

```python
import asyncio
import json
import logging
import logging.config  # required for logging.config.dictConfig below
import os
import time
from typing import List, Optional

import migraphx as mgx
import numpy as np
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer

MAX_BATCH_SIZE = int(os.getenv("MAX_BATCH_SIZE", 512))

# LABEL_LIST is either a JSON object with an "id2label" mapping (as in a
# Hugging Face config.json) or a plain comma-separated list of labels.
label_list = os.getenv("LABEL_LIST", "")
if not label_list:
    raise ValueError("LABEL_LIST environment variable is required")
elif label_list.strip().startswith("{"):
    # Loading from a JSON config snippet
    id2label = json.loads(label_list)["id2label"]
    # Convert keys to int, then sort so that list index matches label id
    id2label = {int(k): v for k, v in id2label.items()}
    label_list = [id2label[i] for i in sorted(id2label.keys())]
else:
    label_list = label_list.split(",")

assert len(label_list) > 0, "LABEL_LIST environment variable is required"
print(f"Label list: {label_list}")

MODEL_PATH = os.getenv("MODEL_PATH", None)
assert MODEL_PATH is not None, "MODEL_PATH environment variable is required"
TOKENIZER_PATH = os.getenv("TOKENIZER_PATH", None)
assert TOKENIZER_PATH is not None, "TOKENIZER_PATH environment variable is required"

model = mgx.load(MODEL_PATH, format="msgpack")
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": True,
    "formatters": {
        "standard": {
            "format": "%(process)d %(asctime)s [%(levelname)s] %(name)s: %(message)s"
        },
    },
    "handlers": {
        "default": {
            "level": "INFO",
            "formatter": "standard",
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",  # default is stderr
        },
    },
    "loggers": {
        "": {  # root logger
            "level": "INFO",
            "handlers": ["default"],
            "propagate": False,
        },
        "uvicorn.error": {
            "level": "DEBUG",
            "handlers": ["default"],
        },
        "uvicorn.access": {
            "level": "WARNING",
            "handlers": ["default"],
        },
    },
}

logging.config.dictConfig(LOGGING_CONFIG)

logger = logging.getLogger(__name__)
logger.info("Starting FastAPI server...")
logger.info(f"Model path: {MODEL_PATH}")
logger.info(f"Tokenizer path: {TOKENIZER_PATH}")
logger.info(f"Label list: {label_list}")
app = FastAPI()


class InputData(BaseModel):
    text: str


class BatchInputData(BaseModel):
    texts: Optional[List[str]] = None
    input_ids: Optional[List[List[int]]] = None
    attention_mask: Optional[List[List[int]]] = None
    token_type_ids: Optional[List[List[int]]] = None
    is_pre_tokenized: bool = False


class LabelScore(BaseModel):
    label: str
    score: float


class BatchOutputData(BaseModel):
    results: List[List[LabelScore]]


def softmax(_outputs, axis=-1):
    # Numerically stable softmax over the logits
    maxes = np.max(_outputs, axis=axis, keepdims=True)
    shifted_exp = np.exp(_outputs - maxes)
    return shifted_exp / shifted_exp.sum(axis=axis, keepdims=True)


# Asynchronous function to tokenize the batch
async def tokenize_batch(texts):
    tokenized_batch = tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors="np",
        return_attention_mask=True,
        return_token_type_ids=True,
    )
    return {
        "input_ids": tokenized_batch["input_ids"],
        "attention_mask": tokenized_batch["attention_mask"],
        "token_type_ids": tokenized_batch["token_type_ids"],
    }


# Function to run model inference (blocking)
def run_inference(batch):
    logits = np.array(model.run(batch)).reshape(-1, len(label_list))
    return softmax(logits, axis=-1)


# Queues decoupling tokenization (CPU) from inference (GPU)
tokenization_queue = asyncio.Queue()
inference_queue = asyncio.Queue()


# Consumer for inference
async def inference_consumer():
    while True:
        tokenized_batch, result_future = await inference_queue.get()
        try:
            # Run inference on the GPU
            result = run_inference(tokenized_batch)
            result_future.set_result(result)  # set the result for the future
        except Exception as e:
            result_future.set_exception(e)
        finally:
            inference_queue.task_done()


# Consumer for tokenization
async def tokenization_consumer():
    while True:
        texts, result_future = await tokenization_queue.get()
        try:
            # Tokenize the batch (CPU task) ...
            tokenized_batch = await tokenize_batch(texts)
            # ... then queue it for inference (GPU task)
            await inference_queue.put((tokenized_batch, result_future))
        except Exception as e:
            result_future.set_exception(e)
        finally:
            tokenization_queue.task_done()


@app.on_event("startup")
async def startup_event():
    # Launch the tokenization and inference consumers as background tasks
    asyncio.create_task(tokenization_consumer())
    asyncio.create_task(inference_consumer())


@app.post("/label")
async def label_text(data: BatchInputData):
    if data.is_pre_tokenized:
        # Validate pre-tokenized inputs
        if not all([data.input_ids, data.attention_mask, data.token_type_ids]):
            raise HTTPException(
                status_code=400,
                detail="When is_pre_tokenized is True, input_ids, attention_mask, and token_type_ids are required.",
            )

        # Ensure batch sizes are consistent
        batch_size = len(data.input_ids)
        if any(
            len(lst) != batch_size for lst in [data.attention_mask, data.token_type_ids]
        ):
            raise HTTPException(
                status_code=400,
                detail="All pre-tokenized inputs (input_ids, attention_mask, token_type_ids) must have the same batch size.",
            )

        # Package the pre-tokenized inputs for inference
        tokenized_batch = {
            "input_ids": np.array(data.input_ids, dtype=np.int64),
            "attention_mask": np.array(data.attention_mask, dtype=np.int64),
            "token_type_ids": np.array(data.token_type_ids, dtype=np.int64),
        }

        # Create a future for inference
        result_future = asyncio.get_running_loop().create_future()

        # Directly add the pre-tokenized data to the inference queue
        await inference_queue.put((tokenized_batch, result_future))

    else:
        # Validate and process texts for tokenization
        if not data.texts:
            raise HTTPException(
                status_code=400,
                detail="Texts field is required when is_pre_tokenized is False.",
            )

        if len(data.texts) > MAX_BATCH_SIZE:
            raise HTTPException(
                status_code=400, detail=f"Batch size is too large (> {MAX_BATCH_SIZE})"
            )

        # Create a future for tokenization and inference
        result_future = asyncio.get_running_loop().create_future()

        # Add the texts to the tokenization queue
        await tokenization_queue.put((data.texts, result_future))

    # Wait for the future result (set after tokenization and/or inference completes)
    predictions = await result_future

    # Pair each probability with its label ...
    results = [
        [LabelScore(label=label, score=score) for label, score in zip(label_list, pred)]
        for pred in predictions
    ]
    # ... and sort each document's labels by descending score
    results = [
        sorted(result, key=lambda x: x.score, reverse=True) for result in results
    ]

    return {"results": results}


@app.get("/health")
def health():
    # Report "ending" if the current SLURM job terminates within five minutes
    slurm_job_end_time = os.getenv("SLURM_JOB_END_TIME", None)
    if slurm_job_end_time is not None:
        slurm_job_end_time = int(slurm_job_end_time)
        if slurm_job_end_time - time.time() < 300:
            return {"status": "ending"}

    return {"status": "ok"}


@app.get("/get_job_info")
def get_job_info():
    # Expose SLURM environment variables for debugging
    job_info = {}
    for key in os.environ:
        if key.startswith("SLURM_"):
            job_info[key] = os.getenv(key)
    return job_info


# Run with `python app.py` (or `uvicorn app:app`)
if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
```

Dockerfile for the inference server:

```Dockerfile
FROM rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1

ARG ONNXRUNTIME_REPO=https://github.com/Microsoft/onnxruntime
ARG ONNXRUNTIME_BRANCH=v1.17.3

ENV PATH /code/cmake-3.27.3-linux-x86_64/bin:${PATH}

RUN apt-get update &&\
    apt-get install -y migraphx

WORKDIR /install_dir

# Prepare onnxruntime repository & build onnxruntime with ROCm/MIGraphX support
# (ROCM_VERSION is expected to be provided by the base image environment)
RUN git clone --single-branch --branch ${ONNXRUNTIME_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\
    /bin/sh onnxruntime/dockerfiles/scripts/install_common_deps.sh &&\
    cd onnxruntime && pip install --upgrade pip &&\
    /bin/sh ./build.sh --allow_running_as_root --cmake_extra_defines ONNXRUNTIME_VERSION=`cat ./VERSION_NUMBER` --config Release --parallel \
    --skip_tests --build_wheel --use_rocm --rocm_version=${ROCM_VERSION} --rocm_home /opt/rocm --use_migraphx && \
    pip install /install_dir/onnxruntime/build/Linux/Release/dist/*.whl

RUN pip install --upgrade --upgrade-strategy eager optimum[amd]==1.22.0 fastapi[standard]

WORKDIR /workspace
```

</details>
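
A minimal client sketch for the `/label` endpoint above (the host and port are deployment-specific assumptions):

```python
import requests

resp = requests.post(
    "http://localhost:8000/label",
    json={"texts": ["Your document text goes here."]},
    timeout=60,
)
resp.raise_for_status()
for doc_scores in resp.json()["results"]:
    best = doc_scores[0]  # the server sorts labels by descending score
    print(best["label"], round(best["score"], 4))
```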

## Limitations

- **Sequence length**: Documents are truncated to 512 tokens, so quality is assessed from the beginning of each document only
- **Language scope**: Optimized for French and English; performance on other languages has not been evaluated
- **Subjectivity**: Quality labels are synthetic, generated by an LLM, and may inherit biases from the teacher model

## Related Models

- [Gaperon-1125-1.5B-SFT](https://huggingface.co/almanach/Gaperon-1125-1.5B-SFT) - 1.5B parameter bilingual LM
- [Gaperon-1125-8B-SFT](https://huggingface.co/almanach/Gaperon-1125-8B-SFT) - 8B parameter bilingual LM
- [Gaperon-1125-24B-SFT](https://huggingface.co/almanach/Gaperon-1125-24B-SFT) - 24B parameter bilingual LM

## Model Card Authors

ALMAnaCH team, Inria Paris

## Additional Resources

- 🔗 **GitHub**: [https://github.com/NathanGodey/gapetron](https://github.com/NathanGodey/gapetron)
- 📄 **Paper**: [arXiv:2510.25771](https://arxiv.org/abs/2510.25771)
- 🔧 **Evaluation Tools**: [https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon](https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon)

## Citation

If you use this model, please cite:

```bibtex
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
  title={Gaperon: A Peppered English-French Generative Language Model Suite},
  author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
  year={2025},
  eprint={2510.25771},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.25771},
}
```

## Acknowledgments

This work was carried out by the ALMAnaCH team at Inria Paris and supported by French public research funding, with computational resources from national HPC clusters over a 15-month period.
|