repo_id: Ennov/pv_ae_document_duplication_embed
language: en
license: cc-by-nc-4.0
tags:
  - pharmacovigilance
  - duplicate-detection
  - sentence-similarity
  - hybrid-embedding
pipeline_tag: sentence-similarity

PV Duplication Embedder — v1

Hybrid embedding model that detects duplicate pharmacovigilance adverse-event reports by jointly encoding free-text narratives and structured metadata (dates, MedDRA/VeDDRA terms, reporter info, product).

Architecture

Component	Details
Text encoder	`joe32140/ModernBERT-large-msmarco` (frozen, 1024-dim)
Metadata encoder	Fitted `sklearn` pipeline → 512-dim dense vector
Fusion	`MultiGatedFusionEncoder` — per-dimension gated fusion, trained with triplet loss → 1024-dim output
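
To make the fusion row concrete, here is a minimal NumPy sketch of per-dimension gated fusion over one batch. It is an illustration under assumptions, not the repository's `MultiGatedFusionEncoder`: the linear projection of the 512-dim metadata vector to 1024 dims, the single sigmoid gate, the random weights, and the names `gated_fusion`, `W_proj`, `W_gate`, and `b_gate` are all hypothetical. Only the 1024/512/1024 dimensions come from the table above.

```python
import numpy as np

TEXT_DIM, META_DIM, OUT_DIM = 1024, 512, 1024  # dimensions from the table above


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(text_emb, meta_emb, W_proj, W_gate, b_gate):
    """Fuse text and metadata embeddings with one gate value per output dimension.

    Assumed form (hypothetical, not the trained model):
        m'  = meta_emb @ W_proj                       # lift metadata to OUT_DIM
        g   = sigmoid([text_emb, m'] @ W_gate + b)    # gate in (0, 1) per dimension
        out = g * text_emb + (1 - g) * m'
    """
    meta_proj = meta_emb @ W_proj
    gate_input = np.concatenate([text_emb, meta_proj], axis=-1)
    gate = sigmoid(gate_input @ W_gate + b_gate)
    return gate * text_emb + (1.0 - gate) * meta_proj


rng = np.random.default_rng(0)

# Random stand-ins for trained weights.
W_proj = rng.normal(scale=0.02, size=(META_DIM, OUT_DIM))
W_gate = rng.normal(scale=0.02, size=(TEXT_DIM + OUT_DIM, OUT_DIM))
b_gate = np.zeros(OUT_DIM)

text_emb = rng.normal(size=(4, TEXT_DIM))  # stand-in for the frozen text encoder output
meta_emb = rng.normal(size=(4, META_DIM))  # stand-in for the sklearn pipeline output

fused = gated_fusion(text_emb, meta_emb, W_proj, W_gate, b_gate)
print(fused.shape)
```

Because the gate lies strictly between 0 and 1, each output dimension in this sketch is a convex combination of the text value and the projected metadata value, so the model can weight narrative and structured evidence differently per dimension.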

Artifacts in this repo

File	Description
`config.json`	Architecture config + text model reference
`metadata_pipeline.pkl`	Fitted sklearn pipeline (cloudpickle)
`fusion_encoder.safetensors`	Fusion encoder weights (inference-only, no training state)
`pv_duplication_embedder.py`	Model entry point (load with `trust_remote_code=True`)
`model/aggregator/pv_duplication/inference.py`	Pure `nn.Module` fusion encoder (no Lightning dependency)
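
A quick way to confirm that a local copy of the repo is complete is to check for the files listed above. The helper below is a hypothetical convenience, not part of the repository; the name `missing_artifacts` is an assumption, and the file names are copied from the table.

```python
from pathlib import Path

# File names taken from the artifact table above.
EXPECTED_ARTIFACTS = [
    "config.json",
    "metadata_pipeline.pkl",
    "fusion_encoder.safetensors",
    "pv_duplication_embedder.py",
    "model/aggregator/pv_duplication/inference.py",
]


def missing_artifacts(snapshot_dir):
    """Return the expected artifact paths that are absent from snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

An empty list means every documented artifact is present; anything else names the files to re-download.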

Usage

Each input record must be a dict with exactly two keys:

Key	Type	Description
`text`	`str`	Free-text adverse-event narrative
`metadata`	`dict`	Raw metadata fields consumed by the sklearn pipeline (dates, MedDRA/VeDDRA terms, reporter info, product)
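
The record shape described above can be checked before calling the model. This is a minimal sketch: `validate_record` is a hypothetical helper, not part of the package, and it checks only the two documented keys and their types, not the individual metadata fields.

```python
def validate_record(record):
    """Return record unchanged if it matches the documented shape, else raise."""
    if not isinstance(record, dict):
        raise TypeError(f"record must be a dict, got {type(record).__name__}")
    if set(record) != {"text", "metadata"}:
        found = sorted(map(str, record))
        raise ValueError(
            f"record must have exactly the keys 'text' and 'metadata', got {found}"
        )
    if not isinstance(record["text"], str):
        raise TypeError("'text' must be a str")
    if not isinstance(record["metadata"], dict):
        raise TypeError("'metadata' must be a dict")
    return record
```

Running every record through a check like this before encoding surfaces malformed input with a clear message, which is easier to debug than a failure inside the sklearn pipeline.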

from model.wrapper.pv_duplication.v1.pv_duplication_embedder import PVDuplicateEmbedder

model = PVDuplicateEmbedder.from_pretrained("Ennov/pv_ae_document_duplication_embed")

records = [
    {
        "text": "A 3-year-old Labrador received product X. Vomiting observed after 2 hours.",
        "metadata": {
            "recd_date": "2023-06-01",
            "pt_name": ["Vomiting"],
            "reporter_role": ["Veterinarian"],
            "prod_code": ["PROD123"],
            # ... other metadata fields expected by the sklearn pipeline
        }
    }
]

# Returns np.ndarray of shape (N, 1024), float32
embeddings = model.encode(records, batch_size=32)
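
Once embeddings are available, candidate duplicates can be found by comparing them pairwise. The sketch below is an assumption-laden illustration: this card does not state which distance or threshold to use. Euclidean distance is chosen here because PyTorch's `TripletMarginLoss` defaults to it (`p=2`), assuming that is the loss named in the Training section. The helper name `candidate_duplicates`, the threshold value, and the toy 2-dim vectors (standing in for real 1024-dim model output) are all hypothetical.

```python
import numpy as np


def candidate_duplicates(embeddings, threshold):
    """Return (i, j, distance) for every pair i < j closer than threshold."""
    emb = np.asarray(embeddings, dtype=np.float64)
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(emb * emb, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (emb @ emb.T)
    dist = np.sqrt(np.maximum(sq_dist, 0.0))  # clip tiny negatives from rounding
    i_idx, j_idx = np.triu_indices(len(emb), k=1)  # each unordered pair once
    keep = dist[i_idx, j_idx] < threshold
    return [
        (int(i), int(j), float(dist[i, j]))
        for i, j in zip(i_idx[keep], j_idx[keep])
    ]


# Toy stand-in for model output: rows 0 and 1 are close, row 2 is far away.
toy_embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(candidate_duplicates(toy_embeddings, threshold=0.5))
```

The full pairwise matrix grows quadratically with the number of reports, so for large collections an approximate nearest-neighbor index would be a more practical choice. Any threshold would need tuning on labeled duplicate pairs.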

Loading directly from the Hub

import sys

from huggingface_hub import snapshot_download

# Download the full repository, or reuse the locally cached copy.
snapshot_dir = snapshot_download("Ennov/pv_ae_document_duplication_embed")

# Make the Python files shipped in the repo importable.
sys.path.insert(0, snapshot_dir)

from pv_duplication_embedder import PVDuplicateEmbedder

model = PVDuplicateEmbedder.from_pretrained(snapshot_dir)

Authentication

Set your Hugging Face token before pushing:

# Option A: environment variable (CI/CD friendly)
export HF_TOKEN="hf_..."

# Option B: interactive login (recent huggingface_hub versions store the token
# at ~/.cache/huggingface/token; older versions used ~/.huggingface/token)
huggingface-cli login

Then push:

python -m model.wrapper.pv_duplication.v1.packaging save
python -m model.wrapper.pv_duplication.v1.packaging push --message "Release v1"

Training

Trained with `TripletMarginLoss` on VMD (veterinary medicine data) adverse-event report pairs. See `model/aggregator/pv_duplication/` for the Lightning training modules (`GatedFusionEncoderModule`, `MultiGatedFusionEncoderModule`).
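
For readers unfamiliar with the objective, a triplet margin loss pulls an anchor embedding toward a duplicate (positive) and pushes it away from a non-duplicate (negative) until the gap exceeds a margin. The NumPy sketch below shows the standard formula with Euclidean distance. It is not the training code: the margin used for this model is not stated here (`1.0` is PyTorch's default), and the small `eps` term that PyTorch adds inside its distance is omitted.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Batch mean of max(d(a, p) - d(a, n) + margin, 0) with Euclidean d."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])       # distance 1 from the anchor
easy_negative = np.array([[3.0, 4.0]])  # distance 5: already beyond the margin
hard_negative = np.array([[0.0, 0.5]])  # distance 0.5: closer than the positive

print(triplet_margin_loss(anchor, positive, easy_negative))
print(triplet_margin_loss(anchor, positive, hard_negative))
```

The easy negative contributes zero loss because it already sits more than the margin farther away than the positive, while the hard negative is penalized. In this formulation only triplets that violate the margin drive the gradient.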