	---
	# ── Required ──────────────────────────────────────────────────────────────────
	# HuggingFace Hub repo: must match the --repo argument passed to packaging.py push
	# Format: <organization>/<model-name>
	repo_id: "Ennov/pv_ae_document_duplication_embed"

	language: en
	license: "cc-by-nc-4.0" # Creative Commons Attribution-NonCommercial 4.0 International

	# ── Discovery tags ─────────────────────────────────────────────────────────────
	tags:
	- pharmacovigilance
	- duplicate-detection
	- sentence-similarity
	- hybrid-embedding

	pipeline_tag: sentence-similarity

	# ── Optional: link to evaluation results ──────────────────────────────────────
	# model-index:
	#   - name: pv_duplication_embedder
	#     results: []
	---

	# PV Duplication Embedder — v1

	Hybrid embedding model that detects duplicate pharmacovigilance adverse-event reports by jointly encoding
	free-text narratives and structured metadata (dates, MedDRA/VeDDRA terms, reporter info, product).

	## Architecture

	| Component | Details |
	|-----------|---------|
	| Text encoder | `joe32140/ModernBERT-large-msmarco` (frozen, 1024-dim) |
	| Metadata encoder | Fitted `sklearn` pipeline → 512-dim dense vector |
	| Fusion | `MultiGatedFusionEncoder`: per-dimension gated fusion, trained with triplet loss → 1024-dim output |
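
As a rough illustration of what "per-dimension gated fusion" usually means, the sketch below blends two vectors of equal width with one sigmoid gate per dimension. It is not the `MultiGatedFusionEncoder` implementation: the function name `gated_fusion`, the `gate_logits` argument, and the assumption that the metadata vector has already been projected to the text width are all hypothetical. In a trained model the gate logits would come from learned layers rather than being passed in.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_vec, meta_vec, gate_logits):
    """Blend two equal-length vectors using one gate per dimension.

    Hypothetical helper for illustration only: a gate near 1 keeps the
    text value for that dimension, a gate near 0 keeps the metadata value.
    """
    if not (len(text_vec) == len(meta_vec) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for t, m, logit in zip(text_vec, meta_vec, gate_logits):
        g = sigmoid(logit)  # gate value for this dimension
        fused.append(g * t + (1.0 - g) * m)
    return fused
```

With all gate logits at zero every gate is 0.5, so the output is the element-wise mean of the two inputs; strongly positive logits pass the text vector through almost unchanged.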

	## Artifacts in this repo

	| File | Description |
	|------|-------------|
	| `config.json` | Architecture config + text model reference |
	| `metadata_pipeline.pkl` | Fitted sklearn pipeline (cloudpickle) |
	| `fusion_encoder.safetensors` | Fusion encoder weights (inference-only, no training state) |
	| `pv_duplication_embedder.py` | Model entry point (load with `trust_remote_code=True`) |
	| `model/aggregator/pv_duplication/inference.py` | Pure `nn.Module` fusion encoder (no Lightning dependency) |
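
Before loading from a local snapshot, it can help to confirm that every file in the table above is present. The following is a minimal sketch using only the standard library; `EXPECTED_ARTIFACTS` and `missing_artifacts` are hypothetical names that are not part of this repo, and the file list is simply copied from the table.

```python
from pathlib import Path

# File names copied from the artifact table; adjust if the repo layout changes.
EXPECTED_ARTIFACTS = (
    "config.json",
    "metadata_pipeline.pkl",
    "fusion_encoder.safetensors",
    "pv_duplication_embedder.py",
    "model/aggregator/pv_duplication/inference.py",
)


def missing_artifacts(snapshot_dir):
    """Return the expected artifact paths that are absent from snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

An empty list means the snapshot contains all listed files; anything else names the files to re-download.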

	## Usage

	Each input record must be a dict with exactly two keys:

	| Key | Type | Description |
	|-----|------|-------------|
	| `text` | `str` | Free-text adverse-event narrative |
	| `metadata` | `dict` | Raw metadata fields consumed by the sklearn pipeline (dates, MedDRA/VeDDRA terms, reporter info, product) |
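
The record contract above can be checked before calling the model. This is a small sketch under the assumption that only the top-level shape matters; `record_problems` is a hypothetical helper, not part of the package, and it does not validate the individual metadata fields that the sklearn pipeline expects.

```python
def record_problems(record):
    """Return a list of human-readable problems; an empty list means the shape is valid."""
    if not isinstance(record, dict):
        return [f"record must be a dict, got {type(record).__name__}"]
    problems = []
    keys = set(record)
    if keys != {"text", "metadata"}:
        found = sorted(map(str, keys))
        problems.append(f"keys must be exactly 'text' and 'metadata', got {found}")
    if "text" in record and not isinstance(record["text"], str):
        problems.append("'text' must be a str")
    if "metadata" in record and not isinstance(record["metadata"], dict):
        problems.append("'metadata' must be a dict")
    return problems
```

Running it over a batch and rejecting any record with a non-empty result keeps malformed inputs from reaching the encoder.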

	```python
	from model.wrapper.pv_duplication.v1.pv_duplication_embedder import PVDuplicateEmbedder

	model = PVDuplicateEmbedder.from_pretrained("Ennov/pv_ae_document_duplication_embed")

	records = [
	    {
	        "text": "A 3-year-old Labrador received product X. Vomiting observed after 2 hours.",
	        "metadata": {
	            "recd_date": "2023-06-01",
	            "pt_name": ["Vomiting"],
	            "reporter_role": ["Veterinarian"],
	            "prod_code": ["PROD123"],
	            # ... other metadata fields expected by the sklearn pipeline
	        }
	    }
	]

	# Returns np.ndarray of shape (N, 1024), float32
	embeddings = model.encode(records, batch_size=32)
	```
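
This card does not say how the embeddings should be compared or which threshold marks a duplicate. One common choice, assumed here, is cosine similarity with a cut-off tuned on labelled data. The sketch below works on any sequence of equal-length vectors, including the rows of the array returned by `encode`; `candidate_duplicates` and the default `threshold=0.9` are illustrative, not values recommended by the authors.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_duplicates(embeddings, threshold=0.9):
    """Return (i, j, score) for every pair whose similarity reaches the threshold.

    The threshold is a placeholder assumption and should be tuned on real data.
    """
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

The pairwise loop is quadratic in the number of records, so for large batches a vectorised or approximate nearest-neighbour search would be a better fit.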

	### Loading directly from the Hub

	```python
	from huggingface_hub import snapshot_download
	import sys

	snapshot_dir = snapshot_download("Ennov/pv_ae_document_duplication_embed")
	sys.path.insert(0, snapshot_dir)

	from pv_duplication_embedder import PVDuplicateEmbedder
	model = PVDuplicateEmbedder.from_pretrained(snapshot_dir)
	```

	## Authentication

	Pushing requires a Hugging Face token with write access to the repo. Set it before pushing:

	```bash
	# Option A: environment variable (CI/CD friendly)
	export HF_TOKEN="hf_..."

	# Option B: interactive login (the token is cached locally; recent
	# huggingface_hub versions store it at ~/.cache/huggingface/token)
	huggingface-cli login
	```

	Then push:

	```bash
	python -m model.wrapper.pv_duplication.v1.packaging save
	python -m model.wrapper.pv_duplication.v1.packaging push --message "Release v1"
	```

	## Training

	Trained with `TripletMarginLoss` on VMD (veterinary medicine data) adverse-event report pairs.
	See `model/aggregator/pv_duplication/` for the Lightning training module (`GatedFusionEncoderModule`, `MultiGatedFusionEncoderModule`).
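
For readers unfamiliar with the objective, the triplet margin loss for one (anchor, positive, negative) triplet is `max(d(a, p) - d(a, n) + margin, 0)`. The sketch below spells this out in plain Python with Euclidean distance. It is an illustration, not the training code: `margin=1.0` is PyTorch's default for `TripletMarginLoss`, the margin actually used for this model is not stated here, and the small epsilon PyTorch adds inside the distance is omitted.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Loss for a single triplet.

    Zero once the negative is at least `margin` farther from the anchor
    than the positive is; positive otherwise.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)
```

Minimising this loss pulls reports labelled as duplicates toward each other in embedding space and pushes non-duplicates apart until the margin is satisfied.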