CamemBERT Finetuned Progressive

CamemBERT fine-tuned on 22,000 French parliamentary Q&A pairs to produce better embeddings for Retrieval-Augmented Generation (RAG) systems.

  • 🤖 Model Type: Feature extraction & embedding
  • 🇫🇷 Language: French
  • 📜 Training Data: 22,000 Q&A pairs from the French Parliament (2017-2025)
  • ⚖️ Use Case: Legal/regulatory analysis, parliamentary response drafting, semantic search in administrative documents


Model Details

Description

This model is a fine-tuned version of camembert-base optimized for embedding French parliamentary questions and answers. It is designed to improve the performance of RAG systems in legal and administrative contexts, such as:

  • Automated drafting of parliamentary responses.
  • Retrieval of relevant legal articles or budgetary information.
  • Semantic search in administrative documents.

Training Data

The model was fine-tuned on a dataset of 22,000 Q&A pairs from the French Parliament (Assemblée Nationale and Sénat). The dataset includes questions from MPs and responses from the Government, covering topics such as:

  • Legal and regulatory issues.
  • Budgetary and financial matters.
  • Public policies and administrative procedures.

Training Procedure

  • Base Model: camembert-base
  • Fine-tuning Task: Masked Language Modeling (MLM)
  • Context Preservation: Special attention was given to maintaining the legal and administrative context (e.g., references to laws, decrees, and ministries).
  • Chunking: Texts were split into chunks with overlapping context to preserve semantic coherence.
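The chunking step can be illustrated with a minimal sketch. The card does not state the chunk size, the overlap, or the unit of splitting, so the word-based splitting and the `chunk_size` and `overlap` values below are assumptions chosen for illustration, and `chunk_text` is a hypothetical helper rather than the code used in training.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks that share `overlap` words.

    chunk_size and overlap are illustrative defaults, not the values
    used to train this model (the model card does not state them).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk repeats the last `overlap` words of the previous one, so a reference to a law or decree that straddles a boundary still appears intact in at least one chunk when it is shorter than the overlap.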

Uses

Direct Use

This model generates 768-dimensional embeddings that can be used in:

  • RAG pipelines (e.g., with Qdrant, Weaviate, or Elasticsearch).
  • Semantic search in legal or administrative documents.
  • Automated drafting tools for parliamentary responses.
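To show how such embeddings are typically used for retrieval, here is a small sketch that ranks documents by cosine similarity to a query vector. It uses plain Python lists and toy low-dimensional vectors in place of real 768-dimensional embeddings; `cosine` and `top_k` are hypothetical helper names, and a vector database such as those listed above would normally perform this step at scale.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors: no similarity
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k best (doc_index, score) pairs, highest score first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real 768-dimensional embeddings.
docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
best = top_k([1.0, 0.0], docs, k=2)
```

In a RAG pipeline, the text chunks behind the returned indices would then be passed to the generator as context.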

Downstream Use

  • Public Administration: Assisting civil servants in drafting responses to parliamentary questions.
  • Legal Tech: Enhancing search and analysis in legal databases.
  • Research: Studying French legislative language and discourse.

Out-of-Scope Use

  • Non-French Texts: The model is specialized for French and may underperform on other languages.
  • General-Purpose Embeddings: The model can be used on general French text, but it is optimized for legal/administrative language and may be less effective outside that domain.

Performance

  • 15-20% improvement in retrieval relevance compared to camembert-base on a validation set of 2,000 parliamentary questions.
  • Robustness: Handles complex formulations, such as references to multiple laws or budgetary lines.

Metric                 Score (vs camembert-base)
Retrieval Precision    +18%
Semantic Similarity    +15%
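The card does not define how retrieval precision was computed or whether the reported gains are relative or absolute. As one plausible reading, the sketch below computes precision@k for a single query and the relative change between a baseline and a fine-tuned score. Both function names are hypothetical, and this is not necessarily the evaluation protocol behind the figures above.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant.

    retrieved: ranked list of document ids (best first)
    relevant:  set of ids judged relevant for the query
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def relative_improvement(baseline, finetuned):
    """Relative change of the fine-tuned score over the baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (finetuned - baseline) / baseline
```

Averaging `precision_at_k` over all validation questions for each model, then passing the two averages to `relative_improvement`, would yield a figure comparable in form to the percentages reported above.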

How to Use

Installation

pip install transformers torch
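Once the packages are installed, embeddings can be extracted along the following lines. This is a sketch under stated assumptions: the card does not give the repository id of this model, so `MODEL_ID` is set to the base model `camembert-base` as a placeholder to be replaced, and mean pooling over token states followed by L2 normalization is a common choice for encoder models rather than a method confirmed by the card. Depending on the environment, the CamemBERT tokenizer may also require the `sentencepiece` package.

```python
# Placeholder: replace with this model's Hub repository id (not given in the card).
MODEL_ID = "camembert-base"


def embed(texts, model_id=MODEL_ID, max_length=512):
    """Return L2-normalized, mean-pooled embeddings for a list of strings.

    Mean pooling is an assumption; the card does not say how token
    states should be reduced to a single vector.
    """
    # Heavy imports are kept local so the module loads without them.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Average only over real tokens, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return torch.nn.functional.normalize(pooled, p=2, dim=1)
```

Calling `embed(["Quelles mesures le Gouvernement envisage-t-il pour les collectivités locales ?"])` should return a tensor with one row per input text and 768 columns, matching the embedding size stated under Direct Use. For repeated calls, loading the tokenizer and model once outside the function avoids reloading them each time.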