CamemBERT Finetuned Progressive

CamemBERT fine-tuned on 22,000 French parliamentary Q&A pairs to produce better embeddings for Retrieval-Augmented Generation (RAG) systems.

🤖 Model Type: Feature extraction & embedding
🇫🇷 Language: French
📜 Training Data: 22,000 Q&A pairs from the French Parliament (2017-2025)
⚖️ Use Case: Legal/regulatory analysis, parliamentary response drafting, semantic search in administrative documents

Model Details

Description

This model is a fine-tuned version of camembert-base optimized for embedding French parliamentary questions and answers. It is designed to improve the performance of RAG systems in legal and administrative contexts, such as:

Automated drafting of parliamentary responses.
Retrieval of relevant legal articles or budgetary information.
Semantic search in administrative documents.

Training Data

The model was fine-tuned on a dataset of 22,000 Q&A pairs from the French Parliament (Assemblée Nationale and Sénat). The dataset includes questions from MPs and responses from the Government, covering topics such as:

Legal and regulatory issues.
Budgetary and financial matters.
Public policies and administrative procedures.

Training Procedure

Base Model: camembert-base
Fine-tuning Task: Masked Language Modeling (MLM)
Context Preservation: Special attention was given to maintaining the legal and administrative context (e.g., references to laws, decrees, and ministries).
Chunking: Texts were split into chunks with overlapping context to preserve semantic coherence.
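
The chunk size, overlap, and splitting unit used during training are not stated here. As a rough illustration of overlapping chunking only, the sketch below splits on whitespace; the function name `chunk_words` and the default values of 200 and 50 words are assumptions, not the settings used for this model.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks that share `overlap` words with the previous chunk.

    chunk_size and overlap are illustrative defaults, not the values used in training.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With `chunk_size=4` and `overlap=2`, each chunk repeats the last two words of the previous one, so a reference to a law or decree that straddles a boundary still appears whole in at least one chunk.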

Uses

Direct Use

This model generates 768-dimensional embeddings that can be used in:

RAG pipelines (e.g., with Qdrant, Weaviate, or Elasticsearch).
Semantic search in legal or administrative documents.
Automated drafting tools for parliamentary responses.
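
To make the semantic search use concrete, here is a minimal sketch of ranking documents by cosine similarity to a query embedding. It uses plain Python lists and toy 3-dimensional vectors in place of real 768-dimensional embeddings; `cosine_similarity` and `top_k` are hypothetical helper names. In a real pipeline a vector store such as Qdrant, Weaviate, or Elasticsearch would handle this step.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], documents: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Return (document index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(documents)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for model embeddings (illustrative values only).
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = top_k(query_vec, doc_vecs, k=2)
```

Here `ranking` lists document 1 first, since it points in the same direction as the query, followed by document 2.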

Downstream Use

Public Administration: Assisting civil servants in drafting responses to parliamentary questions.
Legal Tech: Enhancing search and analysis in legal databases.
Research: Studying French legislative language and discourse.

Out-of-Scope Use

Non-French Texts: The model is specialized for French and may underperform on other languages.
General-Purpose Embeddings: While it can be used for general French text, it is optimized for legal/administrative language.

Performance

Retrieval relevance: 15-20% improvement over camembert-base on a validation set of 2,000 parliamentary questions.
Robustness: Handles complex formulations, such as references to multiple laws or budgetary lines.

Metric	Change vs camembert-base
Retrieval Precision	+18%
Semantic Similarity	+15%
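
How "Retrieval Precision" was computed is not specified, nor whether the gains are relative or absolute. One common definition is precision@k, sketched below as an assumption; `precision_at_k` and `relative_improvement` are hypothetical helper names, and the numbers in the usage line are illustrative, not measured results.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def relative_improvement(baseline: float, finetuned: float) -> float:
    """Relative change of the fine-tuned score over a non-zero baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (finetuned - baseline) / baseline


# Illustrative values only: a baseline of 0.50 and a fine-tuned score of 0.59
# would correspond to a relative gain of about 18%.
gain = relative_improvement(0.50, 0.59)
```

Averaging `precision_at_k` over all validation questions for both models, then comparing the two averages, is one way a table like the one above could be produced.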

How to Use

Installation

pip install transformers torch
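
Once the packages are installed, embeddings can be extracted along the following lines. This is a sketch under stated assumptions: the Hub ID of this model is not given here, so `MODEL_ID` points at the base model camembert-base as a stand-in, and the pooling strategy is not documented, so attention-masked mean pooling followed by L2 normalization is an assumption rather than the author's confirmed method. Depending on the transformers version, the CamemBERT tokenizer may also require the `sentencepiece` package.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Stand-in: replace with this model's Hub ID, which is not stated here.
MODEL_ID = "camembert-base"


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


def embed(texts: list[str], model_id: str = MODEL_ID, max_length: int = 512) -> torch.Tensor:
    """Return one L2-normalized embedding per input text (pooling choice is an assumption)."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()
    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        output = model(**batch)
    pooled = mean_pool(output.last_hidden_state, batch["attention_mask"])
    return torch.nn.functional.normalize(pooled, p=2, dim=1)
```

Calling `embed(["Quelles mesures le Gouvernement envisage-t-il ?"])` should return a tensor of shape (1, 768), matching the embedding size given under Direct Use. Loading the tokenizer and model inside `embed` keeps the sketch short; for repeated calls they would normally be loaded once and reused.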

Safetensors

Model size: 0.3B params
Tensor type: F32