PV Duplication Embedder — v1

Hybrid embedding model that detects duplicate pharmacovigilance adverse-event reports by jointly encoding free-text narratives and structured metadata (dates, MedDRA/VeDDRA terms, reporter info, product).

Architecture

| Component | Details |
| --- | --- |
| Text encoder | `joe32140/ModernBERT-large-msmarco` (frozen, 1024-dim) |
| Metadata encoder | Fitted `sklearn` pipeline → 512-dim dense vector |
| Fusion | `MultiGatedFusionEncoder`: per-dimension gated fusion, trained with triplet loss → 1024-dim output |
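
To give a rough feel for what per-dimension gated fusion can look like, here is a minimal NumPy sketch of a forward pass. It projects the 512-dim metadata vector to 1024 dims, computes one sigmoid gate per output dimension from the concatenated inputs, and blends the two sources. The function name `gated_fusion`, the weight shapes, and the random weights are all assumptions made for illustration; the actual `MultiGatedFusionEncoder` may be structured differently.

```python
import numpy as np

TEXT_DIM, META_DIM = 1024, 512  # dimensions taken from the architecture table


def gated_fusion(text_vec, meta_vec, w_proj, w_gate, b_gate):
    """Blend text and metadata vectors with a per-dimension sigmoid gate.

    Hypothetical sketch, not the repository's MultiGatedFusionEncoder.
    """
    meta_proj = meta_vec @ w_proj                            # (N, 1024)
    gate_in = np.concatenate([text_vec, meta_vec], axis=1)   # (N, 1536)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate + b_gate)))  # (N, 1024), each value in (0, 1)
    # gate near 1 keeps the text dimension, gate near 0 keeps the metadata dimension
    return gate * text_vec + (1.0 - gate) * meta_proj


# Random stand-ins for learned weights and encoder outputs (illustration only)
rng = np.random.default_rng(0)
w_proj = rng.normal(scale=0.02, size=(META_DIM, TEXT_DIM))
w_gate = rng.normal(scale=0.02, size=(TEXT_DIM + META_DIM, TEXT_DIM))
b_gate = np.zeros(TEXT_DIM)

text_vec = rng.normal(size=(2, TEXT_DIM))
meta_vec = rng.normal(size=(2, META_DIM))
fused = gated_fusion(text_vec, meta_vec, w_proj, w_gate, b_gate)
print(fused.shape)
```

Because the gate is a vector rather than a scalar, each output dimension can weight the narrative and the metadata differently.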

Artifacts in this repo

| File | Description |
| --- | --- |
| `config.json` | Architecture config + text model reference |
| `metadata_pipeline.pkl` | Fitted sklearn pipeline (cloudpickle) |
| `fusion_encoder.safetensors` | Fusion encoder weights (inference-only, no training state) |
| `pv_duplication_embedder.py` | Model entry point (load with `trust_remote_code=True`) |
| `model/aggregator/pv_duplication/inference.py` | Pure `nn.Module` fusion encoder (no Lightning dependency) |

Usage

Each input record must be a dict with exactly two keys:

| Key | Type | Description |
| --- | --- | --- |
| `text` | `str` | Free-text adverse-event narrative |
| `metadata` | `dict` | Raw metadata fields consumed by the sklearn pipeline (dates, MedDRA/VeDDRA terms, reporter info, product) |
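
A small pre-flight check can catch malformed records before they reach the model. The sketch below enforces the two-key layout from the table above. `validate_record` is a hypothetical helper that is not part of this repository, and the model card does not say whether the model performs similar checks itself.

```python
def validate_record(record):
    """Raise if a record does not match the two-key layout described above.

    Hypothetical helper for illustration; not shipped with the model.
    """
    if not isinstance(record, dict):
        raise TypeError("record must be a dict")
    if set(record) != {"text", "metadata"}:
        found = sorted(map(str, record))
        raise ValueError(f"expected exactly the keys 'text' and 'metadata', got {found}")
    if not isinstance(record["text"], str):
        raise TypeError("'text' must be a str")
    if not isinstance(record["metadata"], dict):
        raise TypeError("'metadata' must be a dict")
    return record


# A well-formed record passes through unchanged
ok = validate_record({"text": "Vomiting observed after 2 hours.", "metadata": {"pt_name": ["Vomiting"]}})
```

It only checks the outer structure; which metadata fields the fitted sklearn pipeline requires is not validated here.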

from model.wrapper.pv_duplication.v1.pv_duplication_embedder import PVDuplicateEmbedder

model = PVDuplicateEmbedder.from_pretrained("Ennov/pv_ae_document_duplication_embed")

records = [
    {
        "text": "A 3-year-old Labrador received product X. Vomiting observed after 2 hours.",
        "metadata": {
            "recd_date": "2023-06-01",
            "pt_name": ["Vomiting"],
            "reporter_role": ["Veterinarian"],
            "prod_code": ["PROD123"],
            # ... other metadata fields expected by the sklearn pipeline
        }
    }
]

# Returns np.ndarray of shape (N, 1024), float32
embeddings = model.encode(records, batch_size=32)
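
The model card does not specify how embeddings should be compared to flag duplicates. One plausible approach, sketched below, is pairwise cosine similarity with a cutoff. The helper `find_duplicate_pairs` and the `0.9` threshold are assumptions for illustration only; a usable threshold would have to be tuned on labelled duplicate pairs.

```python
import numpy as np


def find_duplicate_pairs(embeddings, threshold=0.9):
    """Return (i, j, similarity) for pairs whose cosine similarity >= threshold.

    Hypothetical helper; the threshold is an arbitrary placeholder.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit.T                             # (N, N) cosine similarities
    i_idx, j_idx = np.triu_indices(len(embeddings), k=1)  # each unordered pair once
    keep = sims[i_idx, j_idx] >= threshold
    return [(int(i), int(j), float(sims[i, j]))
            for i, j in zip(i_idx[keep], j_idx[keep])]


# Toy stand-in for the (N, 1024) array returned by model.encode
toy = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]], dtype=np.float32)
pairs = find_duplicate_pairs(toy)
print(pairs)  # only rows 0 and 1 point in nearly the same direction
```

With real data, the call would be `find_duplicate_pairs(embeddings)` on the array from `model.encode`. The full N×N similarity matrix grows quadratically, so large report collections would likely need an approximate nearest-neighbour index instead.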

Loading directly from the Hub

from huggingface_hub import snapshot_download
import sys

snapshot_dir = snapshot_download("Ennov/pv_ae_document_duplication_embed")
sys.path.insert(0, snapshot_dir)

from pv_duplication_embedder import PVDuplicateEmbedder
model = PVDuplicateEmbedder.from_pretrained(snapshot_dir)

Authentication

Set your Hugging Face token before pushing:

# Option A: environment variable (CI/CD friendly)
export HF_TOKEN="hf_..."

# Option B: interactive login (token is cached locally; recent huggingface_hub
# versions store it under ~/.cache/huggingface/token by default)
huggingface-cli login

Then push:

python -m model.wrapper.pv_duplication.v1.packaging save
python -m model.wrapper.pv_duplication.v1.packaging push --message "Release v1"

Training

Trained with `TripletMarginLoss` on VMD (veterinary medicine data) adverse-event report pairs. See `model/aggregator/pv_duplication/` for the Lightning training modules (`GatedFusionEncoderModule`, `MultiGatedFusionEncoderModule`).
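
For readers unfamiliar with the objective, the triplet margin loss for an anchor, a positive (duplicate) and a negative (non-duplicate) is `max(d(a, p) - d(a, n) + margin, 0)`. The pure-Python sketch below uses Euclidean distance and a margin of `1.0`, which is the default of PyTorch's `torch.nn.TripletMarginLoss`. The margin and distance actually used to train this model are not stated here, so both are assumptions.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss for a single (anchor, positive, negative) triple.

    Illustrative only; omits the small epsilon PyTorch adds inside the distance.
    """
    d_pos = math.dist(anchor, positive)  # distance to the duplicate report
    d_neg = math.dist(anchor, negative)  # distance to the unrelated report
    return max(d_pos - d_neg + margin, 0.0)


# Negative is far away: the margin is already satisfied, so the loss is zero
easy = triplet_margin_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
# Negative is only 0.5 farther than the positive: the loss is the remaining gap
hard = triplet_margin_loss((0.0, 0.0), (0.0, 1.0), (0.0, 1.5))
print(easy, hard)
```

Minimising this loss pulls duplicate reports together and pushes non-duplicates at least `margin` farther from the anchor than the duplicates.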

Safetensors

Model size: 4.2M params
Tensor type: F32