PACT-ClinicalTrials-Pop-256
What patients like these have been enrolled in, across every intervention ever tested.
PACT (Population-Aware Clinical Trial embeddings) is the population-similarity model in the OntologerMed suite. It embeds any clinical trial into a 256-dimensional vector shaped by the patient population enrolled, trained on 563,845 ClinicalTrials.gov entries grouped by MeSH condition terms.
A bariatric surgery trial, a GLP-1 agonist trial, and a behavioural intervention trial all enrolled adults with obesity. They use completely different language, test completely different interventions, and share almost no words in common. PACT places them close together, because it understands they enrolled the same kind of patient.
Also known as OntologerMed PathFinder in the Commercializer.ai platform.
Model Overview
| Property | Value |
|---|---|
| Model name | PACT-ClinicalTrials-Pop-256 |
| Product name | OntologerMed PathFinder |
| Base model | NeuML/pubmedbert-base-embeddings (PubMedBERT) |
| Output dimension | 256 |
| Training method | Contrastive learning (triplet loss) |
| Training corpus | 563,845 ClinicalTrials.gov entries |
| Grouping signal | 4,960 standardised MeSH condition terms |
| License | Apache 2.0 |
| HuggingFace | Ontologer/PACT-ClinicalTrials-Pop-256 |
How It Works
PACT is trained to cluster trials by who they enrolled, not what treatment they tested.
- Same population, different interventions → close vectors. A semaglutide obesity trial and a phentermine obesity trial use different drugs with different mechanisms, but they both enroll adults with BMI ≥ 30. PACT places them at 0.96 cosine similarity.
- Same drug, different populations → far apart. A semaglutide obesity trial and a semaglutide diabetes trial share a drug, a manufacturer, and significant text overlap. But they enroll fundamentally different patients. PACT separates them.
- Cross-drug, cross-modality clustering: A drug trial, a device trial, a surgical trial, and a behavioural trial can all sit close together if they enrolled patients with the same condition, something no keyword search can surface.
Architecture
Trial text → PubMedBERT (384-dim) → Dense projection (384 → 256) → L2 normalize → vector
- Base: PubMedBERT, pre-trained on 30M+ PubMed abstracts
- Projection: Single dense layer reducing 384 → 256 dimensions
- Normalisation: L2, enabling cosine similarity via dot product
- Training signal: Triplet loss, where anchor and positive share a MeSH condition term and the negative has a different condition term
- Triplets: 10,000 generated from 563K trials across 4,960 distinct MeSH condition categories
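The projection and normalisation steps listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the weights are random stand-ins for the learned layer, the names `project_and_normalize`, `W`, and `b` are hypothetical, and the 384-dimensional inputs simulate pooled encoder outputs rather than calling PubMedBERT.

```python
import numpy as np

def project_and_normalize(hidden, W, b):
    """Apply a dense projection, then L2-normalise each row."""
    projected = hidden @ W + b                       # (n, 384) @ (384, 256) -> (n, 256)
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / np.clip(norms, 1e-12, None)   # guard against division by zero

rng = np.random.default_rng(0)
hidden = rng.normal(size=(3, 384))       # stand-in for pooled encoder outputs
W = rng.normal(size=(384, 256)) * 0.05   # random stand-in for the learned weights
b = np.zeros(256)

vectors = project_and_normalize(hidden, W, b)

# Because every row has unit length, the dot product equals cosine similarity.
dot = vectors @ vectors.T
```

Normalising at embedding time is what lets a plain matrix product serve as the similarity search, with no per-query division by vector norms.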
Validation
Similarity matrix on five representative trials:
| | GLP-1 obesity (sema) | GLP-1 obesity (lira) | PD-1 lung cancer | GLP-1 obesity (tirz) | KRAS solid tumours |
|---|---|---|---|---|---|
| GLP-1 obesity (sema) | 1.00 | 0.93 | - | 0.96 | - |
| GLP-1 obesity (lira) | 0.93 | 1.00 | - | 0.92 | - |
| PD-1 lung cancer | - | - | 1.00 | - | 0.78 |
| GLP-1 obesity (tirz) | 0.96 | 0.92 | - | 1.00 | - |
| KRAS solid tumours | - | - | 0.78 | - | 1.00 |
The three obesity trials cluster tightly at 0.92 to 0.96 despite using different drugs with different mechanisms. Compare this to MOAt: where PACT clusters them together (same population), MOAt separates semaglutide and tirzepatide (different mechanism). The two oncology trials cluster at 0.78. Cross-group similarity is low (those cells are not shown in the table). The model learned population, not drug.
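One way to summarise a matrix like this is to compare mean within-group and cross-group similarity. The sketch below is illustrative: `mean_similarity` is a hypothetical helper, the within-group values are copied from the table, and the cross-group values of 0.10 are invented placeholders because the table does not report them.

```python
from itertools import combinations

def mean_similarity(sim, labels, same_group):
    """Average pairwise similarity over same-group or cross-group pairs."""
    values = [
        sim[i][j]
        for i, j in combinations(range(len(labels)), 2)
        if (labels[i] == labels[j]) == same_group
    ]
    return sum(values) / len(values) if values else float("nan")

# Row/column order follows the table: sema, lira, PD-1, tirz, KRAS.
# The 0.10 entries are placeholders, NOT reported results.
labels = ["obesity", "obesity", "oncology", "obesity", "oncology"]
sim = [
    [1.00, 0.93, 0.10, 0.96, 0.10],
    [0.93, 1.00, 0.10, 0.92, 0.10],
    [0.10, 0.10, 1.00, 0.10, 0.78],
    [0.96, 0.92, 0.10, 1.00, 0.10],
    [0.10, 0.10, 0.78, 0.10, 1.00],
]

within = mean_similarity(sim, labels, same_group=True)
across = mean_similarity(sim, labels, same_group=False)
```

A large gap between `within` and `across` on held-out trials is the pattern the validation claim describes; with real embeddings the placeholder cells would be replaced by measured values.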
Training Details
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Loss margin | 0.5 |
| Hardware | NVIDIA DGX Spark (Blackwell GB10) |
| Triplet grouping | MeSH condition term (clintrials_condition_mesh_terms.term) |
| Distinct MeSH terms | 4,960 condition terms (each with ≥2 trials) |
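The triplet setup can be sketched as follows. This is a minimal illustration, not the training code: `sample_triplets` and `triplet_loss` are hypothetical helpers, each trial is assumed to carry a single MeSH term (real records may list several), and cosine distance is an assumption because the card states only the margin of 0.5, not the distance metric.

```python
import math
import random
from collections import defaultdict

def sample_triplets(trial_terms, n_triplets, seed=0):
    """Sample (anchor, positive, negative) trial IDs grouped by MeSH term."""
    rng = random.Random(seed)
    by_term = defaultdict(list)
    for trial_id, term in trial_terms.items():
        by_term[term].append(trial_id)
    # Only terms with at least two trials can supply an anchor/positive pair.
    eligible = [t for t, ids in by_term.items() if len(ids) >= 2]
    if not eligible or len(by_term) < 2:
        raise ValueError("need a term with >=2 trials and at least two terms")
    triplets = []
    for _ in range(n_triplets):
        term = rng.choice(eligible)
        anchor, positive = rng.sample(by_term[term], 2)
        other = rng.choice([t for t in by_term if t != term])
        negative = rng.choice(by_term[other])
        triplets.append((anchor, positive, negative))
    return triplets

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(a, p) - d(a, n) + margin), using cosine distance (assumed)."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)

# Toy corpus: trial ID -> MeSH condition term.
trial_terms = {"T1": "Obesity", "T2": "Obesity",
               "T3": "Lung Neoplasms", "T4": "Lung Neoplasms"}
triplets = sample_triplets(trial_terms, n_triplets=5)
```

The loss is zero once the negative sits at least `margin` further from the anchor than the positive does, so only triplets that violate the margin contribute gradient.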
Business Use & Applications
PACT is purpose-built for population-level intelligence: mapping the full landscape of what has been tried in a patient cohort, regardless of drug class or modality.
Pharmaceutical & Biotech R&D
Therapeutic landscape mapping: see every approach ever tested in your target patient population
- Find every trial ever run in adults with moderate-to-severe RA (drugs, biologics, devices, behavioural), not just those that mention your drug class
- Identify which interventions have been tried and abandoned in your population, and which remain underexplored
- Surface standard-of-care precedents and comparator choices used across the full population history
Standard-of-care discovery: identify what comparators exist for your target indication
- Retrieve all trials in a population to map the approved and investigational comparator landscape
- Find which SOC definitions have been used in similar trials to inform your own comparator arm design
- Identify gaps where no strong comparator exists, a potential regulatory accelerated approval opportunity
Portfolio gap analysis: find underserved populations with few trials
- Embed a patient profile and retrieve the density of historical trials; populations with sparse coverage are underserved market opportunities
- Identify sub-populations (age groups, comorbidity combinations, prior treatment histories) that have been excluded from most trials
Clinical Trial Matching & Digital Health
Patient-to-trial matching at population level: find every trial a patient population has historically been enrolled in
- Given a patient profile, retrieve the full history of trials that enrolled similar patients, across all interventions and indications
- Identify active trials enrolling this population that the patient may currently qualify for
- Power eligibility screening tools with a population-similarity first pass before detailed criteria review
Trial search and discovery: move beyond keyword search for patients and clinicians
- Allow clinicians to search by patient type rather than drug name or keyword
- Surface trials in adjacent indications or with different interventions that the patient profile historically matches
- Enable patient-facing portals to present all relevant trials, not just those with obvious keyword overlap
Competitive Intelligence
What else is being tested in your target population: full modality coverage
- Identify competitor programmes across drug, device, surgical, and digital health interventions targeting the same patients
- Find early-phase trials in your target population before they enter the public competitive radar
- Monitor pipeline activity in your population in near-real-time as new trials register
Enrolment benchmarking: understand the historical enrolment landscape for your patient type
- Find all trials that enrolled similar patients to inform site selection, recruitment strategy, and timeline estimates
- Identify which enrolment criteria were commonly used for your population to inform your own protocol design
- Benchmark projected enrolment rates against historical trials in the same population
Investment & Due Diligence
Population-level pipeline assessment: understand how competitive a patient market is
- Retrieve all trials in a patient population to assess market crowding and differentiation opportunity
- Identify whether a company's target population has been heavily studied (commodity) or is underexplored (opportunity)
- Map the full competitive trial landscape in a population for a therapeutic area investment thesis
Academic & Evidence Synthesis
Systematic review support: retrieve trials by population for meta-analysis and evidence synthesis
- Find all trials enrolling a defined patient population as a starting point for systematic review inclusion screening
- Group trials by population similarity rather than keyword to capture conceptually related studies with different terminology
- Identify population definitions used across trials in a disease area to support meta-analysis harmonisation
Example Queries
PACT is used via embedding + nearest-neighbor lookup. Below are illustrative examples.
Example 1: Therapeutic Landscape Query
Query: "What has been tested in patients with obesity?"
Input text:
Phase 3 RCT of semaglutide 2.4mg SC weekly vs placebo in 1,961 adults with BMI ≥ 30. Primary endpoint: % body weight loss at 68 weeks.
Top PACT neighbors (sample across all modalities):
| Trial | Intervention type | Cosine Sim |
|---|---|---|
| Liraglutide 3.0mg obesity RCT | GLP-1 agonist | 0.96 |
| Tirzepatide SURMOUNT-1 | GIP/GLP-1 dual agonist | 0.96 |
| Bariatric surgery RCT (sleeve vs bypass) | Surgical | 0.91 |
| Orlistat weight management trial | Lipase inhibitor | 0.89 |
| Phentermine/topiramate obesity trial | CNS stimulant combination | 0.88 |
| Intensive lifestyle intervention RCT | Behavioural | 0.85 |
| Naltrexone/bupropion obesity trial | CNS combination | 0.84 |
Insight: A single query surfaces the full therapeutic landscape (drugs, surgery, and behavioural approaches) that keyword search on a drug name would entirely miss.
Example 2: Sub-Population Gap Analysis
Query: "Have trials enrolled elderly patients with obesity and T2D?"
Input: Trial text specifying adults aged ≥70 with BMI ≥ 28 and Type 2 diabetes
PACT output: Low neighbor density. Few historical trials have enrolled this specific combination (elderly + obese + T2D). Most obesity trials cap at age 65; most T2D trials exclude BMI ≥ 45.
Insight: Sparse neighborhood = underserved population. A trial specifically targeting this demographic would face limited historical precedent but also limited competition.
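"Neighbor density" is not defined precisely here; one plausible reading is the number of indexed trials whose similarity to the query exceeds a cutoff. The sketch below assumes that reading. The helper `neighbor_density` and the 0.85 threshold are hypothetical, and the 2-dimensional unit vectors are toy data standing in for 256-dimensional PACT embeddings.

```python
def neighbor_density(query, index, threshold=0.85):
    """Count indexed unit vectors whose dot product with the query meets the threshold."""
    count = 0
    for vec in index:
        sim = sum(q * v for q, v in zip(query, vec))  # cosine, since inputs are unit length
        if sim >= threshold:
            count += 1
    return count

# Toy unit vectors; similarities to the query are 1.0, 0.6 and 0.0.
query = [1.0, 0.0]
index = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

density = neighbor_density(query, index)
```

Comparing the count for a sub-population query against the count for its parent population (for example, all adults with obesity) gives a relative measure of how sparsely that sub-population has been studied. Against the full index, a vectorised dot product as in the Usage section would be the practical choice.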
Usage

    from sentence_transformers import SentenceTransformer
    import numpy as np

    # Load model
    model = SentenceTransformer("Ontologer/PACT-ClinicalTrials-Pop-256")

    # Embed a trial (or a patient profile description)
    trial_text = """
    Phase 3 RCT enrolling adults aged 18-65 with BMI ≥ 30 and no prior
    bariatric surgery. Excludes patients with T1D or active cardiovascular disease.
    Primary endpoint: weight loss at 52 weeks.
    """
    query_vector = model.encode([trial_text])  # shape: (1, 256)

    # trial_index_vectors: (n_trials, 256) array of precomputed PACT embeddings.
    # trial_metadata: list of dicts aligned with its rows.
    # Both are assumed to exist already, built by embedding your own trial corpus.
    # Find population-similar trials across all modalities
    similarities = np.dot(query_vector, trial_index_vectors.T)
    top_k = np.argsort(similarities[0])[::-1][:20]
    for i in top_k:
        print(f"sim={similarities[0][i]:.3f} | {trial_metadata[i]['title']} | {trial_metadata[i]['intervention_type']}")

Index size: 256 dims × 4 bytes × 563,845 trials ≈ 577 MB (about 550 MiB) when stored as float32.
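The index-size arithmetic can be checked directly. The sketch assumes a flat, uncompressed float32 index with one vector per trial and no metadata overhead.

```python
n_trials = 563_845          # training corpus size stated in this card
dims = 256                  # PACT output dimension
bytes_per_float32 = 4

index_bytes = n_trials * dims * bytes_per_float32

size_mb = index_bytes / 1e6        # decimal megabytes
size_mib = index_bytes / 2**20     # binary mebibytes

print(f"{index_bytes:,} bytes = {size_mb:.1f} MB = {size_mib:.1f} MiB")
```

The figure of roughly 550 corresponds to binary mebibytes; in decimal megabytes the same index is about 577 MB.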
Part of the OntologerMed Suite
| Model | Role |
|---|---|
| OntologerMed-ClinicalTrials-Instruct | Generative LM: reasoning, extraction, and summarisation over trial text |
| FATE-ClinicalTrials-Outcome-256 (TrialPulse) | Outcome-shaped embedding: similarity by historical success/failure pattern |
| MOAt-ClinicalTrials-MoA-256 (TargetLens) | Mechanism-of-action embedding: similarity by biological pathway |
| PACT-ClinicalTrials-Pop-256 (PathFinder) | Population embedding: similarity by patient demographics and disease |
| ORACLE-ClinicalTrials-SuccessProb-v1 | Classifier: probability estimate combining all three embedding dimensions |
ORACLE uses PACT embeddings as one of three inputs, alongside FATE and MOAt, to generate a combined probability-of-success score.
Limitations
- Population, not treatment: Nearby vectors indicate that trials enrolled similar patients, not that the treatments are comparable or substitutable. Never use as a basis for treatment recommendation.
- MeSH condition grouping: Trials with rare, compound, or non-standard conditions may not map cleanly to MeSH terms used during training. Validate top neighbors against source metadata.
- English only: Trained on English-language trial records.
- 512-token limit: PubMedBERT truncates at 512 tokens. Place the most population-defining text (eligibility criteria, condition) early in the input.
- Not medical advice: Population similarity does not imply that treatments tested in a similar cohort are safe or appropriate for any specific patient.
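To work within the 512-token limit noted above, the input can be assembled so that condition and eligibility text come first and any truncation falls on the less population-specific remainder. This is a sketch under assumptions: `build_input` and its field names are hypothetical, and the word cap is only a crude proxy for the tokenizer's actual token count.

```python
def build_input(condition, eligibility, description="", max_words=350):
    """Order fields so population-defining text survives truncation."""
    parts = [
        f"Condition: {condition}",
        f"Eligibility: {eligibility}",
        description,  # least population-specific text goes last
    ]
    text = " ".join(p.strip() for p in parts if p and p.strip())
    words = text.split()
    # Crude cap: subword tokenizers usually emit more tokens than words.
    return " ".join(words[:max_words])

text = build_input(
    condition="Obesity",
    eligibility="Adults aged 18-65 with BMI >= 30 and no prior bariatric surgery",
    description="Phase 3 RCT. Primary endpoint: weight loss at 52 weeks.",
)
```

The resulting string can be passed to `model.encode` in place of raw trial text.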
Citation
@misc{pact-clinicaltrials-2026,
title = {PACT-ClinicalTrials-Pop-256: Population-Similarity Embeddings for Clinical Trial Intelligence},
author = {Mishra, Sid},
year = {2026},
note = {Contrastive triplet embedding model trained on 563,845 ClinicalTrials.gov entries, grouped by MeSH condition terms.},
howpublished = {\url{https://huggingface.co/Ontologer/PACT-ClinicalTrials-Pop-256}}
}
About the Author
Sid Mishra, Founder, Ontologer · Convixion AI
Sid is the founder of several AI-native and AI-powered startups and initiatives, based in Singapore. He founded Ontologer as the dedicated AI research arm of Convixion AI, with a focus on building domain-specific language models from the ground up, including data pipelines, training infrastructure, evaluation frameworks, and production deployment.
Ontologer generates novel LLM and embedding models purpose-built for use within Convixion AI's Commercializer.ai platform. PACT is part of the OntologerMed suite, a family of purpose-built models for clinical trial intelligence. Ontologer performs every step of model development in-house: dataset curation, training infrastructure, evaluation, and production deployment.
Collaboration & Custom Work
Sid is open to collaborating on:
- Custom domain-adapted embedding models: contrastive/triplet training on proprietary datasets for specialised retrieval tasks
- End-to-end LLM and embedding pipelines: from data curation to training to production deployment
- Evaluation framework design: task-specific benchmarks and retrieval evaluation pipelines
- RAG + embedding system design: pairing domain-adapted models with retrieval systems for production use
- Custom model architecture consulting: base model selection, training strategy, hardware planning
| Site | ontologer.com |
| Email | sid@ontologer.com · sid@convixion.ai |
| LinkedIn | linkedin.com/in/sid-m-427b9865 |