PACT-ClinicalTrials-Pop-256

See what patients like these have been enrolled in, across every intervention ever tested.

PACT (Population-Aware Clinical Trial embeddings) is the population-similarity model in the OntologerMed suite. It embeds any clinical trial into a 256-dimensional vector shaped by the enrolled patient population. The model was trained on 563,845 ClinicalTrials.gov entries grouped by MeSH condition terms.

A bariatric surgery trial, a GLP-1 agonist trial, and a behavioural intervention trial all enrolled adults with obesity. They use different language, test different interventions, and share almost no vocabulary. PACT places them close together because it is trained to recognise that they enrolled the same kind of patient.

PACT is also known as OntologerMed PathFinder in the Commercializer.ai platform.

Model Overview

Property	Value
Model name	PACT-ClinicalTrials-Pop-256
Product name	OntologerMed PathFinder
Base model	NeuML/pubmedbert-base-embeddings (PubMedBERT)
Output dimension	256
Training method	Contrastive learning — Triplet Loss
Training corpus	563,845 ClinicalTrials.gov entries
Grouping signal	4,960 standardised MeSH condition terms
License	Apache 2.0
HuggingFace	`Ontologer/PACT-ClinicalTrials-Pop-256`

How It Works

PACT is trained to cluster trials by who they enrolled, not what treatment they tested.

Same population, different interventions → close vectors. A semaglutide obesity trial and a phentermine obesity trial use different drugs with different mechanisms — but they both enroll adults with BMI ≥ 30. PACT places them at 0.96 cosine similarity.
Same drug, different populations → far apart. A semaglutide obesity trial and a semaglutide diabetes trial share a drug, a manufacturer, and significant text overlap. But they enroll fundamentally different patients. PACT separates them.
Cross-drug, cross-modality clustering: A drug trial, a device trial, a surgical trial, and a behavioural trial can all sit close together if they enrolled patients with the same condition — something no keyword search can surface.

Architecture

Trial text → PubMedBERT (384-dim) → Dense projection (384 → 256) → L2 normalize → vector

Base: PubMedBERT — pre-trained on 30M+ PubMed abstracts
Projection: Single dense layer reducing 384 → 256 dimensions
Normalisation: L2, enabling cosine similarity via dot product
Training signal: Triplet loss — anchor and positive share a MeSH condition term; negative has a different condition term
Triplets: 10,000 generated from 563K trials across 4,960 distinct MeSH condition categories
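
The pipeline above can be sketched at the shape level in a few lines of NumPy. This is an illustration under stated assumptions, not the released model: the random `pooled` rows stand in for pooled PubMedBERT outputs, and `W` and `b` are hypothetical, untrained stand-ins for the dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins (assumptions): pooled encoder outputs and an untrained dense layer.
pooled = rng.normal(size=(3, 384))            # 3 trials, 384-dim each
W = rng.normal(scale=0.05, size=(384, 256))   # hypothetical projection weights
b = np.zeros(256)                             # hypothetical bias

def project_and_normalize(x, W, b):
    """Dense projection 384 -> 256 followed by row-wise L2 normalisation."""
    z = x @ W + b
    return z / np.linalg.norm(z, axis=1, keepdims=True)

vectors = project_and_normalize(pooled, W, b)

# After L2 normalisation, a plain dot product equals the cosine similarity
# of the unnormalised projections.
dot = vectors[0] @ vectors[1]
raw = pooled @ W + b
cosine = raw[0] @ raw[1] / (np.linalg.norm(raw[0]) * np.linalg.norm(raw[1]))
```

Because every output row has unit length, a nearest-neighbour search only needs a matrix product rather than a separate cosine computation.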

Validation

Similarity matrix on five representative trials:

	GLP-1 obesity (sema)	GLP-1 obesity (lira)	PD-1 lung cancer	GLP-1 obesity (tirz)	KRAS solid tumours
GLP-1 obesity (sema)	1.00	0.93	—	0.96	—
GLP-1 obesity (lira)	0.93	1.00	—	0.92	—
PD-1 lung cancer	—	—	1.00	—	0.78
GLP-1 obesity (tirz)	0.96	0.92	—	1.00	—
KRAS solid tumours	—	—	0.78	—	1.00

The three obesity trials cluster tightly at 0.92–0.96 despite using different drugs with different mechanisms. MOAt behaves differently here: where PACT clusters them together (same population), MOAt separates semaglutide and tirzepatide (different mechanism). The two oncology trials cluster at 0.78. Cross-group similarities, the cells marked with a dash in the table, are low and not reported individually. The model learned population, not drug.
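
A matrix like the one above is a pairwise cosine-similarity computation over the trial vectors. The sketch below shows the calculation on toy 3-dimensional vectors, which are placeholders and not PACT outputs; with real embeddings the rows would come from `model.encode`.

```python
import numpy as np

def similarity_matrix(vectors):
    """Pairwise cosine similarity for row vectors (normalised defensively)."""
    v = np.asarray(vectors, dtype=np.float64)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return v @ v.T

# Toy stand-ins for embeddings: two nearby vectors and one orthogonal to both.
toy = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.0, 1.0],
])
sim = similarity_matrix(toy)
print(np.round(sim, 2))
```

The diagonal is always 1.00 and the matrix is symmetric, which matches the layout of the validation table.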

Training Details

Parameter	Value
Epochs	10
Batch size	32
Learning rate	2e-5
Loss margin	0.5
Hardware	NVIDIA DGX Spark (Blackwell GB10)
Triplet grouping	MeSH condition term (`clintrials_condition_mesh_terms.term`)
Distinct MeSH terms	4,960 condition terms (each with ≥2 trials)
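
The triplet setup described by these parameters can be sketched as follows. Two things here are assumptions rather than details from the source: the distance metric (Euclidean is used) and the simplification that each trial carries a single MeSH condition term. The trial ids and helper names are hypothetical.

```python
import math
import random
from collections import defaultdict

MARGIN = 0.5  # loss margin from the table above

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=MARGIN, distance=euclidean):
    """max(0, d(a, p) - d(a, n) + margin); the distance metric is an assumption."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

def sample_triplets(trials, n, seed=0):
    """Draw (anchor, positive, negative) ids from (trial_id, mesh_term) pairs."""
    rng = random.Random(seed)
    by_term = defaultdict(list)
    for trial_id, term in trials:
        by_term[term].append(trial_id)
    # Only terms with >= 2 trials can supply an anchor/positive pair.
    eligible = [t for t, ids in by_term.items() if len(ids) >= 2]
    if not eligible or len(by_term) < 2:
        raise ValueError("need a term with >= 2 trials and at least 2 distinct terms")
    triplets = []
    for _ in range(n):
        term = rng.choice(eligible)
        anchor, positive = rng.sample(by_term[term], 2)
        other = rng.choice([t for t in by_term if t != term])
        triplets.append((anchor, positive, rng.choice(by_term[other])))
    return triplets

# Hypothetical trial ids and MeSH terms, for illustration only.
toy_trials = [("T1", "Obesity"), ("T2", "Obesity"), ("T3", "Obesity"),
              ("T4", "Lung Neoplasms"), ("T5", "Lung Neoplasms")]
triplets = sample_triplets(toy_trials, n=4)
```

The loss is zero once the negative sits at least `margin` further from the anchor than the positive does, so only triplets that violate this ordering contribute to training.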

Business Use & Applications

PACT is purpose-built for population-level intelligence — mapping the full landscape of what has been tried in a patient cohort, regardless of drug class or modality.

Pharmaceutical & Biotech R&D

Therapeutic landscape mapping — see every approach ever tested in your target patient population
- Find every trial ever run in adults with moderate-to-severe RA — drugs, biologics, devices, behavioural — not just those that mention your drug class
- Identify which interventions have been tried and abandoned in your population, and which remain underexplored
- Surface standard-of-care precedents and comparator choices used across the full population history
Standard-of-care discovery — identify what comparators exist for your target indication
- Retrieve all trials in a population to map the approved and investigational comparator landscape
- Find which SOC definitions have been used in similar trials to inform your own comparator arm design
- Identify gaps where no strong comparator exists, a potential opportunity for accelerated regulatory approval
Portfolio gap analysis — find underserved populations with few trials
- Embed a patient profile and retrieve the density of historical trials — populations with sparse coverage are underserved market opportunities
- Identify sub-populations (age groups, comorbidity combinations, prior treatment histories) that have been excluded from most trials

Clinical Trial Matching & Digital Health

Patient-to-trial matching at population level — find every trial a patient population has historically been enrolled in
- Given a patient profile, retrieve the full history of trials that enrolled similar patients — across all interventions and indications
- Identify active trials enrolling this population that the patient may currently qualify for
- Power eligibility screening tools with a population-similarity first pass before detailed criteria review
Trial search and discovery — move beyond keyword search for patients and clinicians
- Allow clinicians to search by patient type rather than drug name or keyword
- Surface trials in adjacent indications or with different interventions that the patient profile historically matches
- Enable patient-facing portals to present all relevant trials — not just those with obvious keyword overlap
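
The population-similarity first pass mentioned above for eligibility screening could look like the sketch below: shortlist the most similar trials, then keep only those with an allowed recruitment status. The metadata fields `trial_id` and `status`, the status value, and the toy 2-dimensional vectors are all hypothetical.

```python
import numpy as np

def first_pass(query_vec, index_vecs, metadata, k=50, statuses=("Recruiting",)):
    """Shortlist the k most similar trials, then keep those with an allowed status."""
    sims = index_vecs @ query_vec
    order = np.argsort(sims)[::-1][:k]
    return [(metadata[i]["trial_id"], float(sims[i]))
            for i in order if metadata[i]["status"] in statuses]

# Toy unit vectors and metadata (placeholders, not real trials).
index_vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
metadata = [
    {"trial_id": "TRIAL-A", "status": "Completed"},
    {"trial_id": "TRIAL-B", "status": "Recruiting"},
    {"trial_id": "TRIAL-C", "status": "Recruiting"},
    {"trial_id": "TRIAL-D", "status": "Recruiting"},
]
shortlist = first_pass(np.array([1.0, 0.0]), index_vecs, metadata, k=3)
```

The shortlist is only a candidate set. Detailed inclusion and exclusion criteria still need to be reviewed for each trial.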

Competitive Intelligence

What else is being tested in your target population — full modality coverage
- Identify competitor programmes across drug, device, surgical, and digital health interventions targeting the same patients
- Find early-phase trials in your target population before they attract wider competitive attention
- Monitor pipeline activity in your population in near-real-time as new trials register
Enrolment benchmarking — understand the historical enrolment landscape for your patient type
- Find all trials that enrolled similar patients to inform site selection, recruitment strategy, and timeline estimates
- Identify which enrolment criteria were commonly used for your population to inform your own protocol design
- Benchmark projected enrolment rates against historical trials in the same population

Investment & Due Diligence

Population-level pipeline assessment — understand how competitive a patient market is
- Retrieve all trials in a patient population to assess market crowding and differentiation opportunity
- Identify whether a company's target population has been heavily studied (commodity) or is underexplored (opportunity)
- Map the full competitive trial landscape in a population for a therapeutic area investment thesis

Academic & Evidence Synthesis

Systematic review support — retrieve trials by population for meta-analysis and evidence synthesis
- Find all trials enrolling a defined patient population as a starting point for systematic review inclusion screening
- Group trials by population similarity rather than keyword to capture conceptually related studies with different terminology
- Identify population definitions used across trials in a disease area to support meta-analysis harmonisation
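
Grouping trials by population similarity rather than keyword can be approximated with a simple greedy pass over the embeddings. This is one possible approach, not a method described by the source; the 0.9 threshold is an arbitrary illustrative value and the toy vectors are placeholders.

```python
import numpy as np

def group_by_similarity(vectors, threshold=0.9):
    """Greedy grouping: a trial joins the first group whose seed it matches."""
    seeds, groups = [], []
    for i, vec in enumerate(vectors):
        for g, seed in enumerate(seeds):
            if float(vec @ vectors[seed]) >= threshold:
                groups[g].append(i)
                break
        else:
            # No existing seed is close enough, so this trial starts a new group.
            seeds.append(i)
            groups.append([i])
    return groups

# Toy unit vectors standing in for trial embeddings.
raw = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)
groups = group_by_similarity(unit)
```

The result depends on input order and on the threshold, so the groups are best treated as a screening aid rather than a final inclusion decision.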

Example Queries

PACT is used via embedding + nearest-neighbor lookup. Below are illustrative examples.

Example 1: Therapeutic Landscape Query

Query: "What has been tested in patients with obesity?"

Input text:

Phase 3 RCT of semaglutide 2.4mg SC weekly vs placebo in 1,961 adults with BMI ≥ 30. Primary endpoint: % body weight loss at 68 weeks.

Top PACT neighbors (sample across all modalities):

Trial	Intervention type	Cosine Sim
Liraglutide 3.0mg obesity RCT	GLP-1 agonist	0.96
Tirzepatide SURMOUNT-1	GIP/GLP-1 dual agonist	0.96
Bariatric surgery RCT (sleeve vs bypass)	Surgical	0.91
Orlistat weight management trial	Lipase inhibitor	0.89
Phentermine/topiramate obesity trial	CNS stimulant combination	0.88
Intensive lifestyle intervention RCT	Behavioural	0.85
Naltrexone/bupropion obesity trial	CNS combination	0.84

Insight: A single query surfaces the full therapeutic landscape — drugs, surgery, and behavioural approaches — that keyword search on a drug name would entirely miss.

Example 2: Sub-Population Gap Analysis

Query: "Have trials enrolled elderly patients with obesity and T2D?"

Input: Trial text specifying adults aged ≥70 with BMI ≥ 28 and Type 2 diabetes

PACT output: Low neighbor density. Few historical trials have enrolled this specific combination (elderly + obese + T2D). Most obesity trials cap at age 65; most T2D trials exclude BMI ≥ 45.

Insight: A sparse neighborhood suggests an underserved population. A trial specifically targeting this demographic would face limited historical precedent but also limited competition.
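
Neighbor density can be made concrete as a count of indexed trials above a similarity cutoff. The sketch below assumes L2-normalised vectors; the 0.85 cutoff is a hypothetical value that would need calibrating against known dense and sparse populations.

```python
import numpy as np

def neighbor_density(query_vec, index_vecs, threshold=0.85):
    """Count indexed trials whose cosine similarity to the query is >= threshold."""
    sims = index_vecs @ query_vec
    return int(np.count_nonzero(sims >= threshold))

# Toy unit vectors standing in for an index of trial embeddings.
raw = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [1.0, 1.0]])
index_vecs = raw / np.linalg.norm(raw, axis=1, keepdims=True)
density = neighbor_density(np.array([1.0, 0.0]), index_vecs)
```

Comparing this count across several candidate populations is more informative than reading a single value in isolation.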

Usage

from sentence_transformers import SentenceTransformer
import numpy as np

# Load model
model = SentenceTransformer("Ontologer/PACT-ClinicalTrials-Pop-256")

# Embed a trial (or a patient profile description)
trial_text = """
Phase 3 RCT enrolling adults aged 18-65 with BMI ≥ 30 and no prior
bariatric surgery. Excludes patients with T1D or active cardiovascular disease.
Primary endpoint: weight loss at 52 weeks.
"""
query_vector = model.encode([trial_text])  # shape: (1, 256)

# Prerequisites, prepared beforehand and aligned row-for-row:
#   trial_index_vectors: (n_trials, 256) array of PACT embeddings for your corpus
#   trial_metadata: list of dicts with "title" and "intervention_type" keys

# Find population-similar trials across all modalities.
# Vectors are L2-normalised, so the dot product is the cosine similarity.
similarities = np.dot(query_vector, trial_index_vectors.T)[0]  # shape: (n_trials,)
top_k = np.argsort(similarities)[::-1][:20]

for i in top_k:
    meta = trial_metadata[i]
    print(f"sim={similarities[i]:.3f} | {meta['title']} | {meta['intervention_type']}")
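
For a large index, sorting every similarity score is unnecessary when only the top 20 are needed. The sketch below is an optional variation on the lookup above using `np.argpartition`; the helper name `top_k_indices` and the toy data are hypothetical.

```python
import numpy as np

def top_k_indices(query_vec, index_vecs, k=20):
    """Indices of the k highest dot products, best first, without a full sort."""
    sims = index_vecs @ query_vec
    k = min(k, sims.shape[0])
    # argpartition places the k largest scores first, in arbitrary order.
    part = np.argpartition(-sims, k - 1)[:k]
    # Sort only those k candidates by descending similarity.
    return part[np.argsort(-sims[part])]

# Toy check: identity rows make each similarity equal to one query component.
index_vecs = np.eye(4)
query_vec = np.array([0.1, 0.9, 0.3, 0.5])
best_two = top_k_indices(query_vec, index_vecs, k=2)
```

The result matches what a full `argsort` would return for the first k positions, provided there are no ties at the cutoff.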

Index size: 256 dims × 4 bytes (float32) × 563,845 trials ≈ 577 MB (about 551 MiB).
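
The arithmetic behind the index-size estimate can be checked directly. The float16 line is an added comparison, not a configuration mentioned by the source, and it assumes the reduced precision is acceptable for retrieval.

```python
N_TRIALS = 563_845
DIM = 256

def index_bytes(n_vectors, dim, bytes_per_value):
    """Raw storage for a dense index, excluding metadata and container overhead."""
    return n_vectors * dim * bytes_per_value

float32_bytes = index_bytes(N_TRIALS, DIM, 4)
float16_bytes = index_bytes(N_TRIALS, DIM, 2)

print(f"float32: {float32_bytes / 1e6:.0f} MB ({float32_bytes / 2**20:.0f} MiB)")
print(f"float16: {float16_bytes / 1e6:.0f} MB ({float16_bytes / 2**20:.0f} MiB)")
```

The gap between the MB and MiB figures comes only from the unit definition (10^6 versus 2^20 bytes).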

Part of the OntologerMed Suite

Model	Role
OntologerMed-ClinicalTrials-Instruct	Generative LM — reasoning, extraction, and summarisation over trial text
FATE-ClinicalTrials-Outcome-256 (TrialPulse)	Outcome-shaped embedding — similarity by historical success/failure pattern
MOAt-ClinicalTrials-MoA-256 (TargetLens)	Mechanism-of-action embedding — similarity by biological pathway
PACT-ClinicalTrials-Pop-256 (PathFinder)	Population embedding — similarity by patient demographics and disease
ORACLE-ClinicalTrials-SuccessProb-v1	Classifier — probability estimate combining all three embedding dimensions

ORACLE uses PACT embeddings as one of three inputs — alongside FATE and MOAt — to generate a combined probability-of-success score.

Limitations

Population, not treatment: Nearby vectors indicate that trials enrolled similar patients, not that the treatments are comparable or substitutable. Never use PACT as a basis for treatment recommendations.
MeSH condition grouping: Trials with rare, compound, or non-standard conditions may not map cleanly to MeSH terms used during training. Validate top neighbors against source metadata.
English only: Trained on English-language trial records.
512-token limit: PubMedBERT truncates at 512 tokens. Place the most population-defining text (eligibility criteria, condition) early in the input.
Not medical advice: Population similarity does not imply that treatments tested in a similar cohort are safe or appropriate for any specific patient.
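
Given the 512-token limit, one way to keep population-defining text inside the window is to assemble the input with those fields first. The sketch below assumes a trial record with hypothetical `condition`, `eligibility`, and `summary` fields; real records will use different names.

```python
def build_input(record, order=("condition", "eligibility", "summary")):
    """Join available fields, population-defining ones first, into one string."""
    parts = [record[key].strip() for key in order if record.get(key)]
    return "\n".join(parts)

# Hypothetical record; the field names and text are for illustration only.
record = {
    "summary": "Phase 3 RCT of a weight-management intervention versus placebo.",
    "eligibility": "Adults aged 18-65 with BMI ≥ 30; no prior bariatric surgery.",
    "condition": "Obesity",
}
text = build_input(record)
```

If the combined text still exceeds the limit, the lower-priority summary is what gets truncated, not the condition or eligibility criteria.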

Citation

@misc{pact-clinicaltrials-2026,
  title        = {PACT-ClinicalTrials-Pop-256: Population-Similarity Embeddings for Clinical Trial Intelligence},
  author       = {Mishra, Sid},
  year         = {2026},
  note         = {Contrastive triplet embedding model trained on 563,845 ClinicalTrials.gov entries, grouped by MeSH condition terms.},
  howpublished = {\url{https://huggingface.co/Ontologer/PACT-ClinicalTrials-Pop-256}}
}

About the Author

Sid Mishra — Founder, Ontologer · Convixion AI

Sid, based in Singapore, is the founder of several AI-native and AI-powered startups and initiatives. He founded Ontologer as the dedicated AI research arm of Convixion AI, with a focus on building domain-specific language models from the ground up, including data pipelines, training infrastructure, evaluation frameworks, and production deployment.

Ontologer builds LLM and embedding models for use within Convixion AI's Commercializer.ai platform. PACT is part of the OntologerMed suite, a family of purpose-built models for clinical trial intelligence. Ontologer performs every step of model development in-house: dataset curation, training infrastructure, evaluation, and production deployment.

Collaboration & Custom Work

Sid is open to collaborating on:

Custom domain-adapted embedding models — contrastive/triplet training on proprietary datasets for specialised retrieval tasks
End-to-end LLM and embedding pipelines — from data curation to training to production deployment
Evaluation framework design — task-specific benchmarks and retrieval evaluation pipelines
RAG + embedding system design — pairing domain-adapted models with retrieval systems for production use
Custom model architecture consulting — base model selection, training strategy, hardware planning


Site	ontologer.com
Email	sid@ontologer.com · sid@convixion.ai
LinkedIn	linkedin.com/in/sid-m-427b9865
