CamemBERT Finetuned Progressive
CamemBERT fine-tuned on 22,000 French parliamentary Q&A pairs to produce better embeddings for Retrieval-Augmented Generation (RAG) systems.
- 🤖 Model Type: Feature extraction & embedding
- 🇫🇷 Language: French
- 📜 Training Data: 22,000 Q&A pairs from the French Parliament (2017-2025)
- ⚖️ Use Case: Legal/regulatory analysis, parliamentary response drafting, semantic search in administrative documents
Model Details
Description
This model is a fine-tuned version of camembert-base optimized for embedding French parliamentary questions and answers. It is designed to improve the performance of RAG systems in legal and administrative contexts, such as:
- Automated drafting of parliamentary responses.
- Retrieval of relevant legal articles or budgetary information.
- Semantic search in administrative documents.
Training Data
The model was fine-tuned on a dataset of 22,000 Q&A pairs from the French Parliament (Assemblée Nationale and Sénat). The dataset includes questions from MPs and responses from the Government, covering topics such as:
- Legal and regulatory issues.
- Budgetary and financial matters.
- Public policies and administrative procedures.
Training Procedure
- Base Model: camembert-base
- Fine-tuning Task: Masked Language Modeling (MLM)
- Context Preservation: Special attention was given to maintaining the legal and administrative context (e.g., references to laws, decrees, and ministries).
- Chunking: Texts were split into chunks with overlapping context to preserve semantic coherence.
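The chunking step can be illustrated with a minimal sketch. The card does not state the chunk size, the overlap, or the unit used for splitting, so the word-based splitting and the default values below are assumptions chosen for illustration only; `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    chunk_size and overlap are illustrative defaults, not values from the model card.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With `chunk_size=4` and `overlap=2`, each chunk repeats the last two words of the previous one, so a reference to a law or decree that straddles a boundary appears whole in at least one chunk.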
Uses
Direct Use
This model generates 768-dimensional embeddings that can be used in:
- RAG pipelines (e.g., with Qdrant, Weaviate, or Elasticsearch).
- Semantic search in legal or administrative documents.
- Automated drafting tools for parliamentary responses.
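As a rough illustration of how such embeddings are used for semantic search, the sketch below ranks document vectors by cosine similarity to a query vector. It is plain Python for clarity; a real RAG pipeline would normally delegate this step to the vector store. The function names and the choice of cosine similarity are assumptions, since the card does not specify a similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]
```

The toy vectors in any test of this function would be 2- or 3-dimensional; with this model each vector would have 768 components, but the ranking logic is unchanged.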
Downstream Use
- Public Administration: Assisting civil servants in drafting responses to parliamentary questions.
- Legal Tech: Enhancing search and analysis in legal databases.
- Research: Studying French legislative language and discourse.
Out-of-Scope Use
- Non-French Texts: The model is specialized for French and may underperform on other languages.
- General-Purpose Embeddings: While it can be used for general French text, it is optimized for legal/administrative language.
Performance
- 15-20% improvement in retrieval relevance compared to camembert-base on a validation set of 2,000 parliamentary questions.
- Robustness: Handles complex formulations, such as references to multiple laws or budgetary lines.
| Metric | Score (vs camembert-base) |
|---|---|
| Retrieval Precision | +18% |
| Semantic Similarity | +15% |
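The card does not define how "Retrieval Precision" was computed, nor whether the reported gains are relative or absolute. One plausible reading, shown here purely as an assumption, is precision@k compared against the baseline as a relative change. Both function names are hypothetical, and the numbers in the usage note are toy values, not results from this model.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k


def relative_improvement(baseline_score, finetuned_score):
    """Relative change of the fine-tuned score over the baseline (0.25 means +25%)."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (finetuned_score - baseline_score) / baseline_score
```

For example, a toy baseline precision of 0.40 and a fine-tuned precision of 0.50 give `relative_improvement(0.40, 0.50)` of about 0.25, that is, +25% relative.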
How to Use
Installation
pip install transformers torch
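Once the packages are installed, embeddings can be extracted along the following lines. This is a minimal sketch under stated assumptions: the card does not give the model's Hub identifier or the pooling strategy, so `model_name` defaults to the base model as a placeholder (replace it with the fine-tuned model's identifier), and attention-masked mean pooling is one common choice rather than a confirmed detail of this model. `mean_pool` and `embed` are hypothetical helper names.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention mask is 1 (padding is ignored)."""
    kept = [vec for vec, mask in zip(token_vectors, attention_mask) if mask == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed(texts, model_name="camembert-base"):
    """Return one embedding (a list of floats) per input text.

    model_name is a placeholder: substitute the fine-tuned model's Hub ID.
    Mean pooling is an assumption; the card does not specify the pooling used.
    """
    # Imported lazily so mean_pool stays usable without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # shape: (batch, seq_len, hidden)

    # Pool each sequence separately, skipping padded positions.
    return [mean_pool(h.tolist(), m.tolist())
            for h, m in zip(hidden, batch["attention_mask"])]

# Example call (downloads model weights on first use):
# vectors = embed(["Quelles mesures le Gouvernement envisage-t-il ?"])
```

Pooling in plain Python keeps the logic easy to follow; for large batches, doing the masked average directly on tensors would be faster.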