PV Duplication Embedder v1
Hybrid embedding model that detects duplicate pharmacovigilance adverse-event reports by jointly encoding free-text narratives and structured metadata (dates, MedDRA/VeDDRA terms, reporter info, product).
Architecture
| Component | Details |
|---|---|
| Text encoder | joe32140/ModernBERT-large-msmarco (frozen, 1024-dim) |
| Metadata encoder | Fitted sklearn pipeline → 512-dim dense vector |
| Fusion | MultiGatedFusionEncoder: per-dimension gated fusion, trained with triplet loss → 1024-dim output |
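To make "per-dimension gated fusion" concrete, here is a minimal, dependency-free sketch of the general idea: each output dimension gets its own gate in (0, 1) that decides how much of the text signal versus the metadata signal passes through. This is a generic illustration, not the actual MultiGatedFusionEncoder. The function names are hypothetical, and it assumes the metadata vector has already been projected to the same length as the text vector and that the gate logits come from a learned layer.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(text_vec, meta_vec, gate_logits):
    """Blend two equal-length vectors using one gate per dimension.

    Hypothetical helper: in a trained model the gate logits would be
    produced by a learned layer rather than passed in by hand.
    """
    if not (len(text_vec) == len(meta_vec) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for t, m, z in zip(text_vec, meta_vec, gate_logits):
        g = sigmoid(z)  # gate for this dimension only
        fused.append(g * t + (1.0 - g) * m)
    return fused
```

A gate logit of 0 weights both sources equally, while a large positive logit lets the text dimension through almost unchanged.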
Artifacts in this repo
| File | Description |
|---|---|
| config.json | Architecture config + text model reference |
| metadata_pipeline.pkl | Fitted sklearn pipeline (cloudpickle) |
| fusion_encoder.safetensors | Fusion encoder weights (inference-only, no training state) |
| pv_duplication_embedder.py | Model entry point (load with trust_remote_code=True) |
| model/aggregator/pv_duplication/inference.py | Pure nn.Module fusion encoder (no Lightning dependency) |
Usage
Each input record must be a dict with exactly two keys:
| Key | Type | Description |
|---|---|---|
| text | str | Free-text adverse-event narrative |
| metadata | dict | Raw metadata fields consumed by the sklearn pipeline (dates, MedDRA/VeDDRA terms, reporter info, product) |
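The record contract above can be checked before encoding. The helper below is a small sketch of such a check. `validate_record` is a hypothetical name, not part of this repo, and it only verifies the two top-level keys and their types, not the individual metadata fields the sklearn pipeline expects.

```python
def validate_record(record):
    """Check that a record is a dict with exactly the keys 'text' and 'metadata'.

    Hypothetical helper for illustration; it does not inspect the contents
    of the metadata dict.
    """
    if not isinstance(record, dict):
        raise TypeError("record must be a dict")
    expected = {"text", "metadata"}
    if set(record) != expected:
        raise ValueError(
            f"record must have exactly the keys {sorted(expected)}, "
            f"got {sorted(map(str, record))}"
        )
    if not isinstance(record["text"], str):
        raise TypeError("'text' must be a str")
    if not isinstance(record["metadata"], dict):
        raise TypeError("'metadata' must be a dict")
    return record
```

Running this over a batch before calling the model can surface malformed records early, with a clearer error than a failure inside the pipeline.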
from model.wrapper.pv_duplication.v1.pv_duplication_embedder import PVDuplicateEmbedder
model = PVDuplicateEmbedder.from_pretrained("Ennov/pv_ae_document_duplication_embed")
records = [
{
"text": "A 3-year-old Labrador received product X. Vomiting observed after 2 hours.",
"metadata": {
"recd_date": "2023-06-01",
"pt_name": ["Vomiting"],
"reporter_role": ["Veterinarian"],
"prod_code": ["PROD123"],
# ... other metadata fields expected by the sklearn pipeline
}
}
]
# Returns np.ndarray of shape (N, 1024), float32
embeddings = model.encode(records, batch_size=32)
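Once embeddings are available, candidate duplicates can be found by comparing them pairwise. The sketch below uses cosine similarity with a cutoff. Both choices are assumptions: this card does not state which metric or threshold to use, the function names are hypothetical, and the 0.9 default is purely illustrative and would need tuning on labelled data. It is written without dependencies so it runs on plain lists as well as on the rows of the returned array.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_duplicates(embeddings, threshold=0.9):
    """Return (i, j, score) for every pair whose similarity reaches the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

The pairwise loop is quadratic in the number of records, so for large report sets a vectorized or approximate nearest-neighbour search would be a better fit.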
Loading directly from the Hub
from huggingface_hub import snapshot_download
import sys
snapshot_dir = snapshot_download("Ennov/pv_ae_document_duplication_embed")
sys.path.insert(0, snapshot_dir)
from pv_duplication_embedder import PVDuplicateEmbedder
model = PVDuplicateEmbedder.from_pretrained(snapshot_dir)
Authentication
Set your Hugging Face token before pushing to the Hub:
# Option A: environment variable (CI/CD friendly)
export HF_TOKEN="hf_..."
# Option B: interactive login (token is cached locally; recent huggingface_hub versions store it at ~/.cache/huggingface/token)
huggingface-cli login
Then push:
python -m model.wrapper.pv_duplication.v1.packaging save
python -m model.wrapper.pv_duplication.v1.packaging push --message "Release v1"
Training
Trained with TripletMarginLoss on VMD (veterinary medicine data) adverse-event report pairs.
See model/aggregator/pv_duplication/ for the Lightning training module (GatedFusionEncoderModule, MultiGatedFusionEncoderModule).
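For readers unfamiliar with the triplet objective, the sketch below computes it for a single (anchor, positive, negative) triple using Euclidean distance, following the formula documented for `torch.nn.TripletMarginLoss`: `max(d(a, p) - d(a, n) + margin, 0)`. It is a plain-Python illustration, not the training code. The margin used to train this model is not stated here, so the default of 1.0 simply mirrors PyTorch's default, and the small epsilon PyTorch adds inside its distance is omitted.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Loss for one triple: zero once the negative is at least `margin`
    farther from the anchor than the positive is.

    margin=1.0 mirrors PyTorch's default; the value used to train this
    model is not documented here.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)
```

In this setting the anchor and positive would be reports describing the same event, and the negative a report describing a different one, so minimizing the loss pulls duplicates together and pushes non-duplicates apart.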