---
repo_id: Ennov/pv_ae_document_duplication_embed
language: en
license: cc-by-nc-4.0
tags:
  - pharmacovigilance
  - duplicate-detection
  - sentence-similarity
  - hybrid-embedding
pipeline_tag: sentence-similarity
---

PV Duplication Embedder — v1

Hybrid embedding model that detects duplicate pharmacovigilance adverse-event reports by jointly encoding free-text narratives and structured metadata (dates, MedDRA/VeDDRA terms, reporter info, product).

Architecture

| Component | Details |
|---|---|
| Text encoder | joe32140/ModernBERT-large-msmarco (frozen, 1024-dim) |
| Metadata encoder | Fitted sklearn pipeline → 512-dim dense vector |
| Fusion | MultiGatedFusionEncoder: per-dimension gated fusion, trained with triplet loss → 1024-dim output |
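
To make "per-dimension gated fusion" concrete, here is a minimal NumPy sketch of one common formulation: a sigmoid gate computed from both inputs decides, for each output dimension, how much of the text embedding versus the projected metadata vector to keep. This is an illustration only. The parameter names (`W_meta`, `W_gate`, `b_gate`), the single-gate layout, and the random weights are assumptions, and the actual MultiGatedFusionEncoder may be structured differently.

```python
import numpy as np

TEXT_DIM, META_DIM, OUT_DIM = 1024, 512, 1024  # dimensions listed in the table above

rng = np.random.default_rng(0)

# Hypothetical, randomly initialised parameters. The trained weights live in
# fusion_encoder.safetensors and are not necessarily organised this way.
W_meta = rng.normal(scale=0.02, size=(META_DIM, OUT_DIM))             # lifts metadata to OUT_DIM
W_gate = rng.normal(scale=0.02, size=(TEXT_DIM + META_DIM, OUT_DIM))  # gate projection
b_gate = np.zeros(OUT_DIM)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(text_emb, meta_emb):
    """Blend text and metadata embeddings with one gate value per output dimension."""
    meta_proj = meta_emb @ W_meta                                     # (N, OUT_DIM)
    gate_in = np.concatenate([text_emb, meta_emb], axis=1)            # (N, TEXT_DIM + META_DIM)
    gate = sigmoid(gate_in @ W_gate + b_gate)                         # (N, OUT_DIM), values in (0, 1)
    return gate * text_emb + (1.0 - gate) * meta_proj                 # convex blend per dimension


text_emb = rng.normal(size=(4, TEXT_DIM))
meta_emb = rng.normal(size=(4, META_DIM))
fused = gated_fusion(text_emb, meta_emb)
print(fused.shape)  # (4, 1024)
```

Because the gate lies strictly between 0 and 1, each fused coordinate falls between the corresponding text and projected-metadata coordinates, which is what lets the model weight the two sources differently per dimension.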

Artifacts in this repo

| File | Description |
|---|---|
| config.json | Architecture config + text model reference |
| metadata_pipeline.pkl | Fitted sklearn pipeline (cloudpickle) |
| fusion_encoder.safetensors | Fusion encoder weights (inference-only, no training state) |
| pv_duplication_embedder.py | Model entry point (load with trust_remote_code=True) |
| model/aggregator/pv_duplication/inference.py | Pure nn.Module fusion encoder (no Lightning dependency) |

Usage

Each input record must be a dict with exactly two keys:

| Key | Type | Description |
|---|---|---|
| text | str | Free-text adverse-event narrative |
| metadata | dict | Raw metadata fields consumed by the sklearn pipeline (dates, MedDRA/VeDDRA terms, reporter info, product) |
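
The record contract above can be checked before encoding with a small helper like the one below. `validate_record` is a hypothetical function written for this illustration and is not part of the repository. It only checks the two top-level keys and their types, not the individual metadata fields the sklearn pipeline expects.

```python
def validate_record(record):
    """Raise if `record` is not a dict with exactly the keys 'text' and 'metadata'."""
    if not isinstance(record, dict):
        raise TypeError(f"record must be a dict, got {type(record).__name__}")
    if set(record) != {"text", "metadata"}:
        found = sorted(map(str, record))
        raise ValueError(f"record keys must be exactly 'text' and 'metadata', got {found}")
    if not isinstance(record["text"], str):
        raise TypeError("'text' must be a str")
    if not isinstance(record["metadata"], dict):
        raise TypeError("'metadata' must be a dict")
    return record


# A well-formed record passes through unchanged.
ok = validate_record({"text": "Vomiting observed after 2 hours.", "metadata": {"pt_name": ["Vomiting"]}})
```

Running such a check over the whole batch first makes malformed inputs fail with a clear message rather than somewhere inside the metadata pipeline.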
from model.wrapper.pv_duplication.v1.pv_duplication_embedder import PVDuplicateEmbedder

model = PVDuplicateEmbedder.from_pretrained("Ennov/pv_ae_document_duplication_embed")

records = [
    {
        "text": "A 3-year-old Labrador received product X. Vomiting observed after 2 hours.",
        "metadata": {
            "recd_date": "2023-06-01",
            "pt_name": ["Vomiting"],
            "reporter_role": ["Veterinarian"],
            "prod_code": ["PROD123"],
            # ... other metadata fields expected by the sklearn pipeline
        }
    }
]

# Returns np.ndarray of shape (N, 1024), float32
embeddings = model.encode(records, batch_size=32)
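
Once embeddings are available, duplicate candidates are typically found by comparing them pairwise. The sketch below uses cosine similarity with a fixed cutoff. Both choices are assumptions made for illustration: this card does not state which distance the embeddings should be compared with or what threshold is appropriate, and `candidate_duplicates` is a hypothetical helper, not part of the model's API. A toy array stands in for the output of `model.encode`.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity for an (N, D) array of embeddings."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def candidate_duplicates(embeddings, threshold=0.9):
    """Return (i, j, score) for every pair i < j with similarity >= threshold."""
    sims = cosine_similarity_matrix(embeddings)
    i_idx, j_idx = np.triu_indices(len(embeddings), k=1)  # each unordered pair once
    keep = sims[i_idx, j_idx] >= threshold
    return [(int(i), int(j), float(sims[i, j])) for i, j in zip(i_idx[keep], j_idx[keep])]


# Toy stand-in for model.encode(records): rows 0 and 1 point in almost the same direction.
toy = np.array([[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.0, 1.0, 0.0]], dtype=np.float32)
pairs = candidate_duplicates(toy, threshold=0.9)
print(pairs)  # only the pair (0, 1) exceeds the threshold
```

The threshold of 0.9 is a placeholder; in practice it would need to be tuned on labelled duplicate and non-duplicate reports. The full similarity matrix is quadratic in the number of reports, so an approximate nearest-neighbour index is the usual substitute for large collections.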

Loading directly from the Hub

from huggingface_hub import snapshot_download
import sys

snapshot_dir = snapshot_download("Ennov/pv_ae_document_duplication_embed")
sys.path.insert(0, snapshot_dir)

from pv_duplication_embedder import PVDuplicateEmbedder
model = PVDuplicateEmbedder.from_pretrained(snapshot_dir)

Authentication

Set your Hugging Face token before pushing to the Hub:

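# Option A: environment variable (CI/CD friendly)
export HF_TOKEN="hf_..."

# Option B: interactive login (caches the token locally; recent huggingface_hub
# releases store it under ~/.cache/huggingface/token, older ones under ~/.huggingface/token)
huggingface-cli login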

Then push:

python -m model.wrapper.pv_duplication.v1.packaging save
python -m model.wrapper.pv_duplication.v1.packaging push --message "Release v1"

Training

Trained with TripletMarginLoss on VMD (veterinary medicine data) adverse-event report pairs. See model/aggregator/pv_duplication/ for the Lightning training module (GatedFusionEncoderModule, MultiGatedFusionEncoderModule).
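
For readers unfamiliar with the objective, a triplet margin loss takes an anchor, a positive (here, a duplicate report), and a negative (a non-duplicate), and penalises the model unless the anchor is closer to the positive than to the negative by at least a margin. The NumPy sketch below mirrors the standard formulation with Euclidean distance and a margin of 1.0, which are the defaults of PyTorch's `TripletMarginLoss`. The margin and distance actually used to train this model are not stated in this card, so treat those values as assumptions.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean of max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance d."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


anchor = np.array([[0.0, 0.0]])
duplicate = np.array([[0.0, 1.0]])      # distance 1 from the anchor
non_duplicate = np.array([[3.0, 0.0]])  # distance 3 from the anchor

# Duplicate is already closer than the non-duplicate by more than the margin: no penalty.
print(triplet_margin_loss(anchor, duplicate, non_duplicate))  # 0.0
# Roles swapped: 3 - 1 + 1 = 3, so the loss is positive.
print(triplet_margin_loss(anchor, non_duplicate, duplicate))  # 3.0
```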