| --- |
| |
| |
| |
| repo_id: "Ennov/pv_ae_document_duplication_embed" |
|
|
| language: en |
| license: "cc-by-nc-4.0" |
|
|
| |
| tags: |
| - pharmacovigilance |
| - duplicate-detection |
| - sentence-similarity |
| - hybrid-embedding |
|
|
| pipeline_tag: sentence-similarity |
|
|
| |
| |
| |
| |
| --- |
| |
| # PV Duplication Embedder v1 |
|
|
| Hybrid embedding model that detects duplicate pharmacovigilance adverse-event reports by jointly encoding |
| **free-text narratives** and **structured metadata** (dates, MedDRA/VeDDRA terms, reporter info, product). |
|
|
| ## Architecture |
|
|
| | Component | Details | |
| |-----------|---------| |
| | Text encoder | `joe32140/ModernBERT-large-msmarco` (frozen, 1024-dim) | |
| | Metadata encoder | Fitted `sklearn` pipeline → 512-dim dense vector |
| | Fusion | `MultiGatedFusionEncoder`: per-dimension gated fusion, trained with triplet loss → 1024-dim output |
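
To make the fusion row concrete, here is a minimal NumPy sketch of per-dimension gated fusion. It is not the repository's `MultiGatedFusionEncoder`: the linear projection of the 512-dim metadata vector to 1024 dims, the single sigmoid gate, the random weights, and every name used (`gated_fusion`, `W_meta`, `W_gate`, `b_gate`) are assumptions chosen for illustration.

```python
import numpy as np

TEXT_DIM, META_DIM = 1024, 512  # dimensions taken from the architecture table


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(text_emb, meta_emb, W_meta, W_gate, b_gate):
    """Blend text and metadata embeddings with one gate value per output dimension.

    text_emb: (N, 1024), meta_emb: (N, 512) -> fused: (N, 1024)
    """
    meta_proj = meta_emb @ W_meta                             # (N, 1024): project metadata to the text width
    gate_in = np.concatenate([text_emb, meta_proj], axis=1)   # (N, 2048): the gate sees both inputs
    gate = sigmoid(gate_in @ W_gate + b_gate)                 # (N, 1024): each entry lies in (0, 1)
    return gate * text_emb + (1.0 - gate) * meta_proj         # convex mix, dimension by dimension


rng = np.random.default_rng(0)
# Random stand-ins for learned parameters (hypothetical, not the released weights)
W_meta = rng.normal(scale=0.02, size=(META_DIM, TEXT_DIM))
W_gate = rng.normal(scale=0.02, size=(2 * TEXT_DIM, TEXT_DIM))
b_gate = np.zeros(TEXT_DIM)

text_emb = rng.normal(size=(4, TEXT_DIM))
meta_emb = rng.normal(size=(4, META_DIM))
fused = gated_fusion(text_emb, meta_emb, W_meta, W_gate, b_gate)
print(fused.shape)  # (4, 1024)
```

Because the gate is a vector rather than a scalar, each output dimension can lean on the narrative or on the metadata independently, which is what "per-dimension" refers to in this sketch.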
|
|
| ## Artifacts in this repo |
|
|
| | File | Description | |
| |------|-------------| |
| | `config.json` | Architecture config + text model reference | |
| | `metadata_pipeline.pkl` | Fitted sklearn pipeline (cloudpickle) | |
| | `fusion_encoder.safetensors` | Fusion encoder weights (inference-only, no training state) | |
| | `pv_duplication_embedder.py` | Model entry point (load with `trust_remote_code=True`) | |
| | `model/aggregator/pv_duplication/inference.py` | Pure `nn.Module` fusion encoder (no Lightning dependency) | |
|
|
| ## Usage |
|
|
| Each input record must be a dict with exactly two keys: |
|
|
| | Key | Type | Description | |
| |-----|------|-------------| |
| | `text` | `str` | Free-text adverse-event narrative | |
| | `metadata` | `dict` | Raw metadata fields consumed by the sklearn pipeline (dates, MedDRA/VeDDRA terms, reporter info, product) | |
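
Since the two-key schema is strict, it can help to check records before encoding them. The helper below is a hypothetical sketch (`validate_record` is not part of this repository) and only checks the outer shape of a record; it does not know which metadata fields the sklearn pipeline expects.

```python
def validate_record(record):
    """Raise if `record` does not match the two-key schema described above."""
    if not isinstance(record, dict):
        raise TypeError(f"record must be a dict, got {type(record).__name__}")
    expected = {"text", "metadata"}
    if set(record) != expected:
        missing = sorted(expected - set(record), key=str)
        extra = sorted(set(record) - expected, key=str)
        raise ValueError(f"bad keys: missing={missing}, unexpected={extra}")
    if not isinstance(record["text"], str):
        raise TypeError("'text' must be a str")
    if not isinstance(record["metadata"], dict):
        raise TypeError("'metadata' must be a dict")
    return record


# Passes: exactly the two expected keys with the expected types
validate_record({"text": "Vomiting observed after 2 hours.", "metadata": {"recd_date": "2023-06-01"}})
```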
|
|
| ```python |
| from model.wrapper.pv_duplication.v1.pv_duplication_embedder import PVDuplicateEmbedder |
| |
| model = PVDuplicateEmbedder.from_pretrained("Ennov/pv_ae_document_duplication_embed") |
| |
| records = [ |
| { |
| "text": "A 3-year-old Labrador received product X. Vomiting observed after 2 hours.", |
| "metadata": { |
| "recd_date": "2023-06-01", |
| "pt_name": ["Vomiting"], |
| "reporter_role": ["Veterinarian"], |
| "prod_code": ["PROD123"], |
| # ... other metadata fields expected by the sklearn pipeline |
| } |
| } |
| ] |
| |
| # Returns np.ndarray of shape (N, 1024), float32 |
| embeddings = model.encode(records, batch_size=32) |
| ``` |
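
Once embeddings are available, candidate duplicates can be found by comparing them pairwise. This card does not state which similarity measure or decision threshold the authors use, so the sketch below assumes cosine similarity and an arbitrary threshold of 0.9; `find_duplicate_pairs` is a hypothetical helper, and the toy vectors stand in for the `(N, 1024)` output of `model.encode`.

```python
import numpy as np


def find_duplicate_pairs(embeddings, threshold=0.9):
    """Return (i, j, similarity) for every pair i < j with cosine similarity >= threshold."""
    emb = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)        # guard against zero vectors
    sim = unit @ unit.T                             # (N, N) cosine similarity matrix
    i_idx, j_idx = np.triu_indices(len(emb), k=1)   # upper triangle: each pair once
    keep = sim[i_idx, j_idx] >= threshold
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(i_idx[keep], j_idx[keep])]


# Toy 3-dim vectors: rows 0 and 1 point almost the same way, row 2 is orthogonal to row 0
toy = np.array([[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.0, 1.0, 0.0]])
pairs = find_duplicate_pairs(toy)
print(pairs)  # only the pair (0, 1) passes the threshold
```

The full similarity matrix grows quadratically with the number of reports, so this brute-force form only suits small batches; a real threshold would need to be tuned on labelled duplicates.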
|
|
| ### Loading directly from the Hub |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| import sys |
| |
| snapshot_dir = snapshot_download("Ennov/pv_ae_document_duplication_embed") |
| sys.path.insert(0, snapshot_dir) |
| |
| from pv_duplication_embedder import PVDuplicateEmbedder |
| model = PVDuplicateEmbedder.from_pretrained(snapshot_dir) |
| ``` |
|
|
| ## Authentication |
|
|
| Set your Hugging Face token before pushing: |
|
|
| ```bash |
| # Option A: environment variable (CI/CD friendly) |
| export HF_TOKEN="hf_..." |
| |
| # Option B: interactive login (persists to ~/.cache/huggingface/token in recent huggingface_hub versions) |
| huggingface-cli login |
| ``` |
|
|
| Then push: |
|
|
| ```bash |
| python -m model.wrapper.pv_duplication.v1.packaging save |
| python -m model.wrapper.pv_duplication.v1.packaging push --message "Release v1" |
| ``` |
|
|
| ## Training |
|
|
| The fusion encoder was trained with `TripletMarginLoss` on VMD (veterinary medicine data) adverse-event report pairs; the text encoder stayed frozen. |
| See `model/aggregator/pv_duplication/` for the Lightning training module (`GatedFusionEncoderModule`, `MultiGatedFusionEncoderModule`). |
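
For readers unfamiliar with the objective, a triplet margin loss pushes an anchor report closer to a known duplicate than to an unrelated report by at least a margin. The NumPy sketch below shows the formula only; the margin of 1.0, the Euclidean distance, and the toy vectors are assumptions, since this card does not state the training hyperparameters.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Batch mean of max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance d."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


anchor = np.array([[0.0, 0.0]])
duplicate = np.array([[0.0, 1.0]])   # distance 1 from the anchor
unrelated = np.array([[3.0, 4.0]])   # distance 5 from the anchor

print(triplet_margin_loss(anchor, duplicate, unrelated))  # 0.0, since 1 - 5 + 1 < 0
print(triplet_margin_loss(anchor, unrelated, duplicate))  # 5.0, since 5 - 1 + 1 = 5
```

The loss is zero once the duplicate is closer than the unrelated report by more than the margin, so only triplets that still violate the margin contribute gradient.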