PV Duplication Embedder v1

Hybrid embedding model that detects duplicate pharmacovigilance adverse-event reports by jointly encoding free-text narratives and structured metadata (dates, MedDRA/VeDDRA terms, reporter info, product).

Architecture

| Component | Details |
| --- | --- |
| Text encoder | joe32140/ModernBERT-large-msmarco (frozen, 1024-dim) |
| Metadata encoder | Fitted sklearn pipeline → 512-dim dense vector |
| Fusion | MultiGatedFusionEncoder: per-dimension gated fusion, trained with triplet loss → 1024-dim output |
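
To make "per-dimension gated fusion" concrete, here is a minimal NumPy sketch of the general technique, using the dimensions listed above (1024-dim text, 512-dim metadata, 1024-dim output). It is not the repo's MultiGatedFusionEncoder: the linear projection, the single sigmoid gate, and every name in it (`gated_fusion`, `w_proj`, `w_gate`, `b_gate`) are assumptions made for illustration. The actual layers live in model/aggregator/pv_duplication/inference.py.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(text_vec, meta_vec, w_proj, w_gate, b_gate):
    """Hypothetical per-dimension gated fusion (illustrative only).

    text_vec: (N, 1024) text embeddings
    meta_vec: (N, 512) metadata embeddings
    w_proj:   (512, 1024) projects metadata into the text space
    w_gate:   (2048, 1024) gate weights over [text ; projected metadata]
    b_gate:   (1024,) gate bias
    """
    meta_proj = meta_vec @ w_proj                           # (N, 1024)
    joint = np.concatenate([text_vec, meta_proj], axis=1)   # (N, 2048)
    gate = sigmoid(joint @ w_gate + b_gate)                 # (N, 1024), values between 0 and 1
    # Each output dimension is a weighted mix of the two modalities,
    # with its own gate value deciding the balance.
    return gate * text_vec + (1.0 - gate) * meta_proj

# Random stand-ins for real embeddings and trained weights.
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 1024)).astype(np.float32)
meta = rng.standard_normal((4, 512)).astype(np.float32)
w_proj = (rng.standard_normal((512, 1024)) * 0.01).astype(np.float32)
w_gate = (rng.standard_normal((2048, 1024)) * 0.01).astype(np.float32)
b_gate = np.zeros(1024, dtype=np.float32)

fused = gated_fusion(text, meta, w_proj, w_gate, b_gate)
print(fused.shape)
```

Because the gate is computed per output dimension rather than as one scalar per record, the model can lean on the narrative for some features and on the metadata for others.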

Artifacts in this repo

| File | Description |
| --- | --- |
| config.json | Architecture config + text model reference |
| metadata_pipeline.pkl | Fitted sklearn pipeline (cloudpickle) |
| fusion_encoder.safetensors | Fusion encoder weights (inference-only, no training state) |
| pv_duplication_embedder.py | Model entry point (load with trust_remote_code=True) |
| model/aggregator/pv_duplication/inference.py | Pure nn.Module fusion encoder (no Lightning dependency) |

Usage

Each input record must be a dict with exactly two keys:

| Key | Type | Description |
| --- | --- | --- |
| text | str | Free-text adverse-event narrative |
| metadata | dict | Raw metadata fields consumed by the sklearn pipeline (dates, MedDRA/VeDDRA terms, reporter info, product) |

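
A small pre-flight check can catch malformed records before they reach the model. The helper below is a sketch based only on the two-key schema described above. `validate_record` is a hypothetical name, not part of this repo, and it does not inspect the individual metadata fields, since the full field list is defined by the sklearn pipeline.

```python
def validate_record(record):
    """Raise if `record` does not match the two-key schema (text, metadata)."""
    if not isinstance(record, dict):
        raise TypeError(f"record must be a dict, got {type(record).__name__}")
    expected = {"text", "metadata"}
    if set(record) != expected:
        found = sorted(map(str, record))
        raise ValueError(f"record keys must be exactly {sorted(expected)}, got {found}")
    if not isinstance(record["text"], str):
        raise TypeError("'text' must be a str")
    if not isinstance(record["metadata"], dict):
        raise TypeError("'metadata' must be a dict")
    return record
```

Running each record through a check like this before encoding turns a schema mistake into an explicit error message instead of a failure deep inside the pipeline.
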
from model.wrapper.pv_duplication.v1.pv_duplication_embedder import PVDuplicateEmbedder

model = PVDuplicateEmbedder.from_pretrained("Ennov/pv_ae_document_duplication_embed")

records = [
    {
        "text": "A 3-year-old Labrador received product X. Vomiting observed after 2 hours.",
        "metadata": {
            "recd_date": "2023-06-01",
            "pt_name": ["Vomiting"],
            "reporter_role": ["Veterinarian"],
            "prod_code": ["PROD123"],
            # ... other metadata fields expected by the sklearn pipeline
        }
    }
]

# Returns np.ndarray of shape (N, 1024), float32
embeddings = model.encode(records, batch_size=32)
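
Once records are embedded, candidate duplicates can be found by comparing vectors pairwise. The sketch below uses cosine similarity with a fixed threshold. Both choices are assumptions: this card does not state which distance or cutoff to use, and since the fusion encoder was trained with a triplet margin loss, Euclidean distance may match the training objective more closely. The function names and the 0.9 threshold are placeholders, and the toy array stands in for the output of `model.encode`.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity for an (N, D) array of embeddings."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T

def candidate_duplicates(embeddings, threshold=0.9):
    """Return (i, j, score) for every pair i < j with similarity >= threshold.

    The 0.9 default is an arbitrary placeholder, not a tuned value.
    """
    sims = cosine_similarity_matrix(embeddings)
    i_idx, j_idx = np.triu_indices(len(embeddings), k=1)  # upper triangle, no diagonal
    keep = sims[i_idx, j_idx] >= threshold
    return [(int(i), int(j), float(sims[i, j]))
            for i, j in zip(i_idx[keep], j_idx[keep])]

# Toy stand-in for embeddings; rows 0 and 1 point in nearly the same direction.
toy = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]], dtype=np.float32)
pairs = candidate_duplicates(toy, threshold=0.9)
print(pairs)
```

The full similarity matrix grows quadratically with the number of reports, so a large collection would likely need an approximate nearest-neighbour index instead.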

Loading directly from the Hub

from huggingface_hub import snapshot_download
import sys

snapshot_dir = snapshot_download("Ennov/pv_ae_document_duplication_embed")
sys.path.insert(0, snapshot_dir)

from pv_duplication_embedder import PVDuplicateEmbedder
model = PVDuplicateEmbedder.from_pretrained(snapshot_dir)

Authentication

Set your Hugging Face token before pushing:

# Option A: environment variable (CI/CD friendly)
export HF_TOKEN="hf_..."

# Option B: interactive login (persists the token locally; recent huggingface_hub versions store it at ~/.cache/huggingface/token)
huggingface-cli login

Then push:

python -m model.wrapper.pv_duplication.v1.packaging save
python -m model.wrapper.pv_duplication.v1.packaging push --message "Release v1"

Training

Trained with TripletMarginLoss on VMD (veterinary medicine data) adverse-event report pairs. See model/aggregator/pv_duplication/ for the Lightning training module (GatedFusionEncoderModule, MultiGatedFusionEncoderModule).
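
For readers unfamiliar with the objective, the triplet margin loss is max(d(a, p) - d(a, n) + margin, 0), where a is an anchor report, p a duplicate of it, and n a non-duplicate. The pure-Python sketch below illustrates the formula with Euclidean distance and a margin of 1.0, which are the PyTorch defaults for TripletMarginLoss. The margin and distance actually used to train this model are not stated here, so treat both as assumptions. PyTorch also adds a small epsilon inside the distance, which is omitted.

```python
import math

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance d."""
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)

anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

# Duplicate is already much closer than the non-duplicate: no penalty.
satisfied = triplet_margin_loss(anchor, positive=near, negative=far)
# Roles swapped: the "duplicate" is farther away, so the loss is positive.
violated = triplet_margin_loss(anchor, positive=far, negative=near)
print(satisfied, violated)
```

The loss is zero only when the non-duplicate is at least `margin` farther from the anchor than the duplicate, which is what pushes duplicate reports together in the embedding space.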
