Semantic Highlight EN-FR v1

Bilingual English-French Semantic Highlight model for RAG systems. It identifies the relevant sentences in retrieved documents so that less context is passed to the LLM.

Architecture

Backbone: BAAI/bge-reranker-v2-m3 (XLM-RoBERTa, 568M params)
Multi-layer aggregation (layers 5, 11, 17, 23)
[SENT] sentinel tokens at sentence boundaries
Unified pruning + reranking head with BGE-M3 distillation
Segment embeddings (special/query/context)
Focal loss (alpha=0.75, gamma=2.0)
Flash Attention 2 (with SDPA fallback)
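
The sentinel and segment-embedding items above can be pictured with a small sketch. It is only an illustration under stated assumptions, not the model's actual preprocessing: the whitespace tokenization, the regex sentence splitter, the placement of one [SENT] token in front of each sentence, and the segment id numbering are all hypothetical. Only the [SENT] token name and the three segment types come from the list above.

```python
import re

SENT = "[SENT]"  # sentinel token named in the architecture list
# Assumed id assignment; the source only names the three segment types.
SEGMENT_IDS = {"special": 0, "query": 1, "context": 2}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_input(query, passage):
    """Return (tokens, segment_ids, sentinel_positions) for one query/passage pair."""
    tokens, segment_ids, sentinel_positions = [], [], []
    for word in query.split():
        tokens.append(word)
        segment_ids.append(SEGMENT_IDS["query"])
    for sentence in split_sentences(passage):
        # Assumption: one sentinel is placed in front of each sentence.
        sentinel_positions.append(len(tokens))
        tokens.append(SENT)
        segment_ids.append(SEGMENT_IDS["special"])
        for word in sentence.split():
            tokens.append(word)
            segment_ids.append(SEGMENT_IDS["context"])
    return tokens, segment_ids, sentinel_positions


tokens, segment_ids, sentinel_positions = build_input(
    "causes of warming ?",
    "CO2 emissions cause warming. Deforestation makes it worse.",
)
print(sentinel_positions)  # [4, 9]
```

In a design like this, the hidden states at the sentinel positions would be the natural place to read one relevance score per sentence, which is presumably what the pruning head consumes.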

Training datasets

CGCTG/semantic-highlight-en-annotations: English (source: MS MARCO)
CGCTG/semantic-highlight-fr-annotations: French (source: FrenchQA + Qwen3-8B-FP8)

Format: Open Provence with 3 splits (train, validation, test).

Usage

from inference import SemanticHighlighter

highlighter = SemanticHighlighter(
    model_path="CGCTG/semantic-highlight-en-fr-v1",
    threshold=0.5,
    device="auto",
)

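# The query and passage below are in French.
result = highlighter.highlight(
    # "What are the causes of global warming?"
    query="Quelles sont les causes du rechauffement climatique ?",
    # "Global warming is caused by CO2 emissions. Deforestation makes the
    # problem worse. Average temperatures have risen by 1.1 C."
    passage="Le rechauffement climatique est cause par les emissions de CO2. "
            "La deforestation aggrave le probleme. "
            "Les temperatures moyennes ont augmente de 1.1 C.",
)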

# Print each retained sentence with its relevance score.
for sent in result.highlighted_sentences:
    print(f"  [{sent.score:.3f}] {sent.text}")
print(f"Compression: {result.compression_ratio:.1%}")
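
To make the roles of `threshold` and `compression_ratio` more concrete, here is a minimal stand-alone sketch of threshold-based sentence selection. It does not call the model. The helper name `select_sentences`, the use of `>=` for the comparison, and the definition of compression as the fraction of characters removed are assumptions made for illustration; the library may define these differently.

```python
def select_sentences(sentences, scores, threshold=0.5):
    """Keep sentences whose score reaches the threshold.

    Returns the kept (sentence, score) pairs and an assumed compression
    ratio: the fraction of characters dropped from the passage.
    """
    kept = [(s, p) for s, p in zip(sentences, scores) if p >= threshold]
    original_len = sum(len(s) for s in sentences)
    kept_len = sum(len(s) for s, _ in kept)
    compression = 1.0 - kept_len / original_len if original_len else 0.0
    return kept, compression


# Hypothetical scores, not real model output.
sentences = ["aaaa", "bb", "cccc"]
scores = [0.9, 0.2, 0.5]
kept, compression = select_sentences(sentences, scores, threshold=0.5)
for text, score in kept:
    print(f"  [{score:.3f}] {text}")
print(f"Compression: {compression:.1%}")
```

Raising the threshold keeps fewer sentences and so compresses more, at the risk of dropping context the LLM needs.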

Training

Trained with Accelerate on A100 80 GB GPU hardware. Unified loss: FocalBCE (pruning) + MSE (BGE-M3 reranking distillation).
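
The unified loss can be sketched in plain Python as follows. This is a hedged illustration, not the training code: it uses the standard binary focal loss formulation with the alpha=0.75 and gamma=2.0 values listed under Architecture, and the `distill_weight` parameter and the simple sum of the two terms are assumptions, since the relative weighting is not stated here.

```python
import math


def focal_bce(probs, labels, alpha=0.75, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over per-sentence keep probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p              # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def mse(preds, targets):
    """Mean squared error between student and teacher scores."""
    return sum((a - b) ** 2 for a, b in zip(preds, targets)) / len(preds)


def unified_loss(keep_probs, keep_labels, rerank_score, teacher_score,
                 distill_weight=1.0):
    """Pruning term plus weighted distillation term (weighting is assumed)."""
    pruning = focal_bce(keep_probs, keep_labels)
    distillation = mse([rerank_score], [teacher_score])
    return pruning + distill_weight * distillation


# Hypothetical values for one query/passage pair with three sentences.
loss = unified_loss([0.9, 0.2, 0.6], [1, 0, 1], rerank_score=0.7, teacher_score=0.8)
print(f"{loss:.4f}")
```

With gamma=2.0, well-classified sentences contribute very little to the pruning term, so training focuses on the hard ones; alpha=0.75 gives extra weight to the positive (relevant) class.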

Parameter	Value
Effective batch size	32
Learning rate	2e-5
Epochs	3
Max sequence length	8192
Warmup ratio	5%
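
As a worked example of how the table values relate to each other, the sketch below computes an effective batch size and a warmup step count. The per-device batch size, the gradient accumulation steps, the number of processes, and the dataset size are hypothetical; only the effective batch size of 32, the 3 epochs, and the 5% warmup ratio come from the table.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_processes=1):
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_processes


def schedule_steps(num_examples, effective_batch, epochs=3, warmup_ratio=0.05):
    """Return (total optimizer steps, warmup steps)."""
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return total_steps, round(total_steps * warmup_ratio)


# One hypothetical way to reach the effective batch size of 32.
batch = effective_batch_size(per_device_batch=4, grad_accum_steps=8)
# Hypothetical dataset size, chosen only to keep the arithmetic simple.
total, warmup = schedule_steps(num_examples=32_000, effective_batch=batch)
print(batch, total, warmup)  # 32 3000 150
```

Gradient accumulation is a common way to reach a given effective batch size when long sequences (here up to 8192 tokens) limit how many examples fit on the device at once.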
