FedProc-180M v0
A compact, multi-task transformer for federal procurement text. One shared ModernBERT-base encoder, four task heads. Trained for the open FedProc-Bench.
Important framing: this is a research model for NLP classification and extraction tasks on federal contracting text. It does not determine bid eligibility, compliance, legal status, or any government-recognized designation. Not affiliated with the U.S. government, GSA, SAM.gov, or acquisition.gov.
What it does
Four heads share a single ModernBERT-base backbone:
| Head | Task | Output |
|---|---|---|
| 1 | Notice type | one of 8 SAM.gov notice-type buckets |
| 2 | NAICS sector | one of 20 top-level NAICS sectors |
| 3 | Set-aside | multi-label over SBA / SDVOSB / WOSB / EDWOSB / 8(a) / HUBZone / SDB |
| 4 | FAR / DFARS clause spans | token-level BIO over CLAUSE entities |
The headline task is task 4. Frontier LLMs are known to invent FAR clause numbers when extracting citations from procurement text. This model is specialized for that task.
Headline results (v0 test split, 243 records)
The test split is 155 FAR clauses + 23 real SAM solicitations + 65 Claude-generated synthetic excerpts. The fairest comparison is on the real FAR slice, whose text was not generated by any of the compared model families:
| System | T4 entity F1 | T4 hallucination rate on real FAR text |
|---|---|---|
| Claude Sonnet 4.6 | 0.984 | 0.0% (0 / 182) |
| GPT-4o | 0.937 | 11.0% (23 / 209) |
| Claude Haiku 4.5 | 0.804 | 32.1% (88 / 274) |
| FedProc-180M (this model) | 0.800 | 13.8% (22 / 159) |
On the headline metric, FedProc-180M matches Haiku's F1 (0.800 vs. 0.804) while inventing clause numbers at less than half Haiku's rate (13.8% vs. 32.1%), at roughly 50× lower latency and three orders of magnitude lower per-call cost.
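The rates in the table can be recomputed from the counts in parentheses. The sketch below does that arithmetic; it treats each denominator as the number of clause numbers that system predicted on the real FAR slice, which is an assumption, since the card does not define the denominator.

```python
# Counts copied from the table above: (invented clause numbers, total).
# Assumption: "total" is the number of clause numbers each system predicted.
counts = {
    "Claude Sonnet 4.6": (0, 182),
    "GPT-4o": (23, 209),
    "Claude Haiku 4.5": (88, 274),
    "FedProc-180M": (22, 159),
}

# Hallucination rate = invented / total, per system.
rates = {name: invented / total for name, (invented, total) in counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")

# Ratio of FedProc-180M's rate to Haiku's rate; below 0.5 means "less than half".
ratio = rates["FedProc-180M"] / rates["Claude Haiku 4.5"]
print(f"FedProc-180M / Haiku: {ratio:.2f}")
```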
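Aggregate scores on the full test split. The Aggregate column is the mean of per-task macro-F1 (entity F1 for task 4), so it is not the mean of the accuracy and micro-F1 columns shown here: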
| System | Aggregate | T1 acc | T2 acc | T3 micro-F1 | T4 F1 |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 0.911 | 0.807 | 0.897 | 0.868 | 0.991 |
| GPT-4o | 0.896 | 0.795 | 0.920 | 0.857 | 0.970 |
| Claude Haiku 4.5 | 0.851 | 0.727 | 0.897 | 0.829 | 0.916 |
| FedProc-180M | 0.497 | 0.716 | 0.218 | 0.186 | 0.921 |
Why the aggregate looks low: in v0, tasks 2 and 3 see only real solicitation titles (no description text), because the SAM.gov API daily quota blocked the description fetch during the v0 run. Task 4, which draws its text from FAR clauses and synthetic excerpts, is the head where this model is competitive. A v0.1 retrain with full SAM description text should improve tasks 1, 2, and 3 substantially.
How to use
import json, torch
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download
# Pull the model package (model.pt + tokenizer + thresholds + this README)
local = snapshot_download("raihan-js/fedproc-180m-v0")
# Reconstruct the multi-task model class; the code lives in the FedProc-Bench
# repo at github.com/raihanjs/fedproc-bench, in models/multi_task/.
# Quick path: just clone that repo and `pip install -r requirements.txt`.
from models.multi_task.model import FedProcModel
from models.multi_task.labels import (
    TASK1_ID_TO_LABEL, TASK2_ID_TO_LABEL, TASK3_LABELS, N_TASK3,
)
tok = AutoTokenizer.from_pretrained(local)
model = FedProcModel(encoder_name="answerdotai/ModernBERT-base")
model.load_state_dict(torch.load(f"{local}/model.pt", map_location="cpu", weights_only=True))
model.eval()
thresholds = json.loads(open(f"{local}/task3_thresholds.json").read())["thresholds"]
text = "This Solicitation is set aside for HUBZone small business concerns. " \
       "Clauses include 52.219-3, 52.204-7, and DFARS 252.225-7042."
enc = tok(text, return_tensors="pt", return_offsets_mapping=True, truncation=True, max_length=512)
with torch.no_grad():
    out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
# Task 1 / 2: argmax over softmax logits
notice = TASK1_ID_TO_LABEL[out.task1_logits.argmax(-1).item()]
sector = TASK2_ID_TO_LABEL[out.task2_logits.argmax(-1).item()]
# Task 3: per-class threshold
probs3 = torch.sigmoid(out.task3_logits)[0].tolist()
setasides = [TASK3_LABELS[i] for i in range(N_TASK3) if probs3[i] >= thresholds[TASK3_LABELS[i]]]
# Task 4: decode BIO spans into clause numbers using offset_mapping
# (see models/multi_task/_decode_bio in benchmark/run_baselines.py)
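The task-4 decoding step left as a comment above can be sketched in plain Python. This is a minimal illustration, not the repo's `_decode_bio`: the function name `decode_bio_spans`, the tag strings `B-CLAUSE` / `I-CLAUSE` / `O`, and the hand-written tokenization in the example are all assumptions made for this sketch.

```python
def decode_bio_spans(text, tags, offsets):
    """Merge token-level BIO tags into character spans and return the span texts.

    tags:    one tag per token: "B-CLAUSE", "I-CLAUSE", or "O" (assumed names)
    offsets: one (start, end) character pair per token; special tokens are (0, 0)
    """
    spans = []
    current = None  # [start, end] of the span currently being built
    for tag, (start, end) in zip(tags, offsets):
        if start == end:            # special or empty token: skip it
            continue
        if tag == "B-CLAUSE":       # a new span starts; close any open one
            if current:
                spans.append(current)
            current = [start, end]
        elif tag == "I-CLAUSE" and current:
            current[1] = end        # extend the open span
        else:                       # "O", or a stray "I-" with no open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [text[s:e] for s, e in spans]


# Hypothetical tokenization of a short example, with per-token tags.
text = "Clauses include 52.219-3 and DFARS 252.225-7042."
offsets = [(0, 0), (0, 7), (8, 15), (16, 18), (18, 22), (22, 24),
           (25, 28), (29, 34), (35, 42), (42, 47), (47, 48), (0, 0)]
tags = ["O", "O", "O", "B-CLAUSE", "I-CLAUSE", "I-CLAUSE",
        "O", "O", "B-CLAUSE", "I-CLAUSE", "O", "O"]

clauses = decode_bio_spans(text, tags, offsets)
print(clauses)
```

With the real model, the tags would come from the per-token argmax of the task-4 head mapped through its label list, and the offsets from `enc["offset_mapping"][0].tolist()`.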
Training data
- 150 real SAM.gov solicitations (metadata only; descriptions blocked by daily quota during v0)
- 1,032 FAR and DFARS clauses pulled from the ECFR API (acquisition.gov data, public domain)
- 433 Claude-generated synthetic solicitation excerpts that cite real cached clauses, used both for class balance (rare set-asides) and to provide task-4 training text. Validation rejected ~13% of generations that didn't cite their pinned clauses.
Train / val / test split: 1129 / 243 / 243 records. Stratified per source. See the dataset card for full provenance and label maps.
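A per-source stratified split can be sketched as follows. The 15% holdout with round-up, the `source` field, and its values (`sam`, `far`, `synthetic`) are assumptions for illustration; they happen to reproduce the 1129 / 243 / 243 sizes and the 155 / 23 / 65 test composition reported in this card, but the actual split code may differ.

```python
import random
from collections import Counter


def stratified_split(records, holdout_pct=15, seed=0):
    """Split records into train/val/test, stratified on rec["source"]."""
    by_source = {}
    for rec in records:
        by_source.setdefault(rec["source"], []).append(rec)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for source in sorted(by_source):
        group = list(by_source[source])
        rng.shuffle(group)
        k = -(-len(group) * holdout_pct // 100)  # ceil(len * pct / 100)
        splits["test"] += group[:k]
        splits["val"] += group[k:2 * k]
        splits["train"] += group[2 * k:]
    return splits


# Placeholder records carrying only the per-source counts listed above.
source_counts = {"sam": 150, "far": 1032, "synthetic": 433}
records = [{"id": f"{src}-{i}", "source": src}
           for src, n in source_counts.items() for i in range(n)]

splits = stratified_split(records)
sizes = {name: len(part) for name, part in splits.items()}
test_sources = Counter(rec["source"] for rec in splits["test"])
print(sizes)
print(dict(test_sources))
```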
Training setup
- Encoder: answerdotai/ModernBERT-base (149M parameters)
- Batch size 16, max sequence 512 tokens
- AdamW, lr 5e-5, weight decay 0.01, 6% linear warmup
- Mixed precision (AMP), 6 epochs
- Task loss weights: 1 / 1 / 1 / 1
- Single RTX 3060 (12GB), ~4.3 minutes wall time
- Code: models/multi_task/ in the FedProc-Bench repo
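To get a feel for the schedule these settings imply, the sketch below derives the step counts and the learning rate at a few points. The card states only the warmup fraction, so several things here are assumptions: the last partial batch is kept, there is no gradient accumulation, and the rate decays linearly to zero after warmup.

```python
import math

# From the card: 1,129 training records, batch size 16, 6 epochs,
# peak lr 5e-5, 6% linear warmup.
# Assumptions: partial last batch kept, no gradient accumulation,
# linear decay to zero after warmup.
steps_per_epoch = math.ceil(1129 / 16)
total_steps = steps_per_epoch * 6
warmup_steps = round(total_steps * 0.06)


def lr_at(step, base_lr=5e-5):
    """Learning rate after `step` optimizer updates."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Linear decay from base_lr down to 0 at total_steps.
    return base_lr * max(0, total_steps - step) / (total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, f"{lr_at(step):.2e}")
```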
Calibration
Task 3 (multi-label set-aside) needed per-class threshold calibration. The
default threshold of 0.5 produces all-zero predictions because the sigmoid
outputs stay in the 0.0 to 0.4 range. Calibrated thresholds (Youden-style
F1-max on val) live in task3_thresholds.json in this repo. They lift task-3
micro-F1 from 0.000 to 0.293 on val without any retraining.
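One way to do this kind of calibration is a per-class grid sweep that keeps the threshold with the best validation F1. The sketch below is an illustration under assumptions, not the code that produced task3_thresholds.json: the function names, the 0.01 grid, and the toy validation scores are all made up for the example.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from two equal-length sequences of truthy/falsy values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_thresholds(probs, labels, class_names, grid=None):
    """Pick, per class, the lowest threshold on `grid` that maximizes F1.

    probs, labels: one row per validation record, one column per class.
    """
    grid = grid or [i / 100 for i in range(1, 100)]
    thresholds = {}
    for c, name in enumerate(class_names):
        y_true = [row[c] for row in labels]
        scores = [row[c] for row in probs]
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            f1 = f1_score(y_true, [s >= t for s in scores])
            if f1 > best_f1:  # strict: ties keep the lower threshold
                best_t, best_f1 = t, f1
        thresholds[name] = best_t
    return thresholds


# Toy validation set where every sigmoid output is below 0.5.
class_names = ["HUBZone", "WOSB"]
probs = [[0.31, 0.05], [0.28, 0.22], [0.08, 0.19], [0.04, 0.03]]
labels = [[1, 0], [1, 1], [0, 1], [0, 0]]

thresholds = calibrate_thresholds(probs, labels, class_names)
print(thresholds)
```

The returned dict is keyed by label name, the same shape the usage code reads from the `thresholds` key of task3_thresholds.json.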
Known limitations
- v0 was trained on only 1,129 records because the SAM.gov daily quota blocked the planned 10k-record fetch. v0.1 retrain with full data is queued.
- Task 2 (NAICS sector) is essentially fed only solicitation titles in v0 because description text was unavailable. It is the weakest head.
- The model has not been evaluated on solicitations from before its ECFR snapshot date (2026-05-08) for drift. FAR clauses are revised periodically.
- The benchmark's test split contains 65 Claude-generated synthetic records; evaluation on those is biased in favor of Claude-family models. The per-source breakdown in the leaderboard makes this explicit.
License
Apache 2.0.
Citation
@software{fedproc_bench_v0,
  author = {Raihan Sikder},
  title  = {FedProc-Bench v0: a multi-task benchmark for federal procurement NLP},
  year   = {2026},
  url    = {https://huggingface.co/datasets/raihan-js/fedproc-bench}
}
Model tree for raihan-js/fedproc-180m-v0
Base model: answerdotai/ModernBERT-base
Evaluation results
- Entity F1 on FedProc-Bench v0 test set (self-reported): 0.921
- Hallucination rate on FedProc-Bench v0 test set (self-reported): 0.055