---
# ── Required ──────────────────────────────────────────────────────────────────
# HuggingFace Hub repo: must match the --repo argument passed to packaging.py push
# Format: <organization>/<model-name>
repo_id: "Ennov/pv_ae_document_duplication_embed"
language: en
license: "cc-by-nc-4.0" # Creative Commons Attribution-NonCommercial 4.0 International
# ── Discovery tags ─────────────────────────────────────────────────────────────
tags:
- pharmacovigilance
- duplicate-detection
- sentence-similarity
- hybrid-embedding
pipeline_tag: sentence-similarity
# ── Optional: link to evaluation results ──────────────────────────────────────
# model-index:
# - name: pv_duplication_embedder
# results: []
---
# PV Duplication Embedder - v1
Hybrid embedding model that detects duplicate pharmacovigilance adverse-event reports by jointly encoding
**free-text narratives** and **structured metadata** (dates, MedDRA/VeDDRA terms, reporter info, product).
## Architecture
| Component | Details |
|-----------|---------|
| Text encoder | `joe32140/ModernBERT-large-msmarco` (frozen, 1024-dim) |
| Metadata encoder | Fitted `sklearn` pipeline → 512-dim dense vector |
| Fusion | `MultiGatedFusionEncoder`: per-dimension gated fusion, trained with triplet loss → 1024-dim output |
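
To illustrate what "per-dimension gated fusion" can look like, here is a minimal NumPy sketch. It is a generic formulation and not the actual `MultiGatedFusionEncoder`: the projection of the metadata vector to 1024 dimensions, the single sigmoid gate, and the names `gated_fusion`, `w_proj`, `w_gate`, and `b_gate` are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

TEXT_DIM, META_DIM = 1024, 512  # dimensions taken from the table above


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(text_emb, meta_emb, w_proj, w_gate, b_gate):
    """Generic gated fusion (hypothetical, not the repo's implementation)."""
    # Project metadata into the text embedding space: (N, 512) -> (N, 1024).
    meta_proj = meta_emb @ w_proj
    # Compute one gate value per output dimension from both inputs.
    gate_in = np.concatenate([text_emb, meta_proj], axis=1)
    gate = sigmoid(gate_in @ w_gate + b_gate)  # values in (0, 1)
    # Per-dimension convex combination of the two views.
    return gate * text_emb + (1.0 - gate) * meta_proj


rng = np.random.default_rng(0)
w_proj = rng.normal(scale=0.02, size=(META_DIM, TEXT_DIM))
w_gate = rng.normal(scale=0.02, size=(2 * TEXT_DIM, TEXT_DIM))
b_gate = np.zeros(TEXT_DIM)

text_emb = rng.normal(size=(4, TEXT_DIM))  # stand-in for text encoder output
meta_emb = rng.normal(size=(4, META_DIM))  # stand-in for metadata pipeline output
fused = gated_fusion(text_emb, meta_emb, w_proj, w_gate, b_gate)
print(fused.shape)
```

Because the gate is computed per dimension, each output coordinate can lean on the narrative or on the metadata independently.
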
## Artifacts in this repo
| File | Description |
|------|-------------|
| `config.json` | Architecture config + text model reference |
| `metadata_pipeline.pkl` | Fitted sklearn pipeline (cloudpickle) |
| `fusion_encoder.safetensors` | Fusion encoder weights (inference-only, no training state) |
| `pv_duplication_embedder.py` | Model entry point (load with `trust_remote_code=True`) |
| `model/aggregator/pv_duplication/inference.py` | Pure `nn.Module` fusion encoder (no Lightning dependency) |
## Usage
Each input record must be a dict with exactly two keys:
| Key | Type | Description |
|-----|------|-------------|
| `text` | `str` | Free-text adverse-event narrative |
| `metadata` | `dict` | Raw metadata fields consumed by the sklearn pipeline (dates, MedDRA/VeDDRA terms, reporter info, product) |
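
Since the schema is strict, it can help to check records before encoding. The helper below is a hypothetical sketch and is not shipped in this repo. It only checks the two documented keys and their types, not the individual metadata fields expected by the sklearn pipeline.

```python
def validate_record(record):
    """Raise if `record` does not match the documented two-key schema."""
    if not isinstance(record, dict):
        raise TypeError(f"record must be a dict, got {type(record).__name__}")
    if set(record) != {"text", "metadata"}:
        raise ValueError(
            f"record must have exactly the keys 'text' and 'metadata', "
            f"got {sorted(record)}"
        )
    if not isinstance(record["text"], str):
        raise TypeError("record['text'] must be a str")
    if not isinstance(record["metadata"], dict):
        raise TypeError("record['metadata'] must be a dict")
    return record


record = {
    "text": "Vomiting observed after 2 hours.",
    "metadata": {"pt_name": ["Vomiting"]},
}
validate_record(record)  # returns the record unchanged when it is well formed
```
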
```python
from model.wrapper.pv_duplication.v1.pv_duplication_embedder import PVDuplicateEmbedder
model = PVDuplicateEmbedder.from_pretrained("Ennov/pv_ae_document_duplication_embed")
records = [
    {
        "text": "A 3-year-old Labrador received product X. Vomiting observed after 2 hours.",
        "metadata": {
            "recd_date": "2023-06-01",
            "pt_name": ["Vomiting"],
            "reporter_role": ["Veterinarian"],
            "prod_code": ["PROD123"],
            # ... other metadata fields expected by the sklearn pipeline
        }
    }
]
# Returns np.ndarray of shape (N, 1024), float32
embeddings = model.encode(records, batch_size=32)
```
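
Once you have the embedding matrix, candidate duplicates can be found by comparing rows pairwise. The sketch below assumes cosine similarity and a threshold of 0.9. This card states neither, so treat both as placeholders to tune on labelled pairs. `candidate_duplicates` is a hypothetical helper, and the small array stands in for the output of `model.encode`.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of an (N, D) array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T


def candidate_duplicates(embeddings, threshold=0.9):
    """Return (i, j, similarity) for every pair i < j at or above threshold."""
    sims = cosine_similarity_matrix(embeddings)
    i_idx, j_idx = np.triu_indices(len(embeddings), k=1)  # upper triangle only
    keep = sims[i_idx, j_idx] >= threshold
    return [
        (int(i), int(j), float(sims[i, j]))
        for i, j in zip(i_idx[keep], j_idx[keep])
    ]


# Stand-in for `embeddings = model.encode(records)`; real rows are 1024-dim.
embeddings = np.array(
    [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.0, 1.0, 0.0]], dtype=np.float32
)
pairs = candidate_duplicates(embeddings)
print(pairs)  # only rows 0 and 1 are similar enough to be flagged
```

The full similarity matrix is quadratic in the number of reports, so for large collections an approximate nearest-neighbour index is likely a better fit than this brute-force comparison.
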
### Loading directly from the Hub
```python
from huggingface_hub import snapshot_download
import sys
snapshot_dir = snapshot_download("Ennov/pv_ae_document_duplication_embed")
sys.path.insert(0, snapshot_dir)
from pv_duplication_embedder import PVDuplicateEmbedder
model = PVDuplicateEmbedder.from_pretrained(snapshot_dir)
```
## Authentication
Pushing to the Hub requires a Hugging Face token with write access. Set it before pushing:
```bash
# Option A: environment variable (CI/CD friendly)
export HF_TOKEN="hf_..."
# Option B: interactive login (persists the token locally; recent huggingface_hub versions use ~/.cache/huggingface/token)
huggingface-cli login
```
Then push:
```bash
python -m model.wrapper.pv_duplication.v1.packaging save
python -m model.wrapper.pv_duplication.v1.packaging push --message "Release v1"
```
## Training
Trained with `TripletMarginLoss` on VMD (veterinary medicine data) adverse-event report pairs.
See `model/aggregator/pv_duplication/` for the Lightning training module (`GatedFusionEncoderModule`, `MultiGatedFusionEncoderModule`).
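
For readers unfamiliar with the objective, the following NumPy sketch shows what a triplet margin loss computes, assuming Euclidean distance and mean reduction as in PyTorch's `torch.nn.TripletMarginLoss` defaults. The margin of 1.0 is that library default and not necessarily the value used to train this model. The sketch also omits the small epsilon PyTorch adds inside its distance computation.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean over the batch of max(d(a, p) - d(a, n) + margin, 0)."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor to duplicate
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor to non-duplicate
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


anchor = np.array([[0.0, 0.0]])
near = np.array([[0.0, 1.0]])  # distance 1 from the anchor
far = np.array([[3.0, 4.0]])   # distance 5 from the anchor

# Duplicate is close, non-duplicate is far: the margin is satisfied.
loss_satisfied = triplet_margin_loss(anchor, near, far)
# Roles swapped: the "duplicate" is farther away than the "non-duplicate".
loss_violated = triplet_margin_loss(anchor, far, near)
print(loss_satisfied, loss_violated)
```

The loss is zero once a duplicate sits closer to the anchor than a non-duplicate by at least the margin, which pushes duplicate reports together in the fused embedding space.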