FedProc-180M v0

A compact, multi-task transformer for federal procurement text. One shared ModernBERT-base encoder, four task heads. Trained for the open FedProc-Bench.

Important framing — this is a research model for NLP classification and extraction tasks on federal contracting text. It does not determine bid eligibility, compliance, legal status, or any government-recognized designation. Not affiliated with the U.S. government, GSA, SAM.gov, or acquisition.gov.

What it does

Four heads share a single ModernBERT-base backbone:

Head	Task	Output
1	Notice type	one of 8 SAM.gov notice-type buckets
2	NAICS sector	one of 20 top-level NAICS sectors
3	Set-aside	multi-label over SBA / SDVOSB / WOSB / EDWOSB / 8(a) / HUBZone / SDB
4	FAR / DFARS clause spans	token-level BIO over CLAUSE entities

The headline task is task 4. Frontier LLMs are known to invent FAR clause numbers when extracting citations from procurement text. This model is specialized for that task.

Headline results (v0 test split, 243 records)

The test split is 155 FAR clauses + 23 real SAM solicitations + 65 Claude-generated synthetic excerpts. The fairest comparison is on the real FAR slice, whose text was not generated by any of the compared model families:

System	T4 entity F1	T4 hallucination rate on real FAR text
Claude Sonnet 4.6	0.984	0.0% (0 / 182)
GPT-4o	0.937	11.0% (23 / 209)
Claude Haiku 4.5	0.804	32.1% (88 / 274)
FedProc-180M (this model)	0.800	13.8% (22 / 159)

On the real FAR slice, FedProc-180M roughly matches Haiku's F1 (0.800 vs 0.804) with less than half its hallucination rate (13.8% vs 32.1%), at roughly 50× lower latency and three orders of magnitude lower per-call cost.
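
The card does not spell out how the hallucination rate is computed. The fractions in the table (for example 22 / 159) are consistent with "invented clause numbers divided by all predicted clause numbers", so the sketch below assumes that definition. The function name and the tiny reference set are hypothetical, and `52.219-99` is a made-up clause number used only to illustrate an invented prediction.

```python
def hallucination_rate(predicted, known_clauses):
    """Share of predicted clause numbers that are absent from the reference set.

    Assumed definition: invented / predicted. Returns (rate, invented, total).
    """
    predicted = list(predicted)
    if not predicted:
        return 0.0, 0, 0
    invented = [c for c in predicted if c not in known_clauses]
    return len(invented) / len(predicted), len(invented), len(predicted)


# Toy reference set; a real run would use the cached FAR / DFARS clause list.
known = {"52.219-3", "52.204-7", "252.225-7042"}
preds = ["52.219-3", "52.204-7", "52.219-99", "252.225-7042"]

rate, invented, total = hallucination_rate(preds, known)
print(f"{rate:.1%} ({invented} / {total})")  # 25.0% (1 / 4)
```

Under this definition the denominator is the number of clause numbers a system emits, which would explain why it differs per system in the table above.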

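Aggregate scores. The Aggregate column is the mean of per-task macro-F1 (entity F1 for task 4), so it need not equal the mean of the accuracy and micro-F1 columns shown beside it: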

System	Aggregate	T1 acc	T2 acc	T3 micro-F1	T4 F1
Claude Sonnet 4.6	0.911	0.807	0.897	0.868	0.991
GPT-4o	0.896	0.795	0.920	0.857	0.970
Claude Haiku 4.5	0.851	0.727	0.897	0.829	0.916
FedProc-180M	0.497	0.716	0.218	0.186	0.921

Why the aggregate looks low: in v0, tasks 2 and 3 see only real solicitation titles (no description text), because the SAM.gov API daily quota blocked the description fetch during the v0 run. Task 4, which trains on full clause text from the FAR and synthetic sources, is the head where this model is competitive. A v0.1 retrain with full SAM description text should substantially improve tasks 1, 2, and 3.

How to use

import json, torch
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

# Pull the model package (model.pt + tokenizer + thresholds + this README)
local = snapshot_download("raihan-js/fedproc-180m-v0")

# Reconstruct the multi-task model class — code lives in the FedProc-Bench
# repo at github.com/raihanjs/fedproc-bench, in models/multi_task/.
# Quick path: just clone that repo and `pip install -r requirements.txt`.
from models.multi_task.model import FedProcModel
from models.multi_task.labels import (
    TASK1_ID_TO_LABEL, TASK2_ID_TO_LABEL, TASK3_LABELS, N_TASK3,
)

tok = AutoTokenizer.from_pretrained(local)
model = FedProcModel(encoder_name="answerdotai/ModernBERT-base")
model.load_state_dict(torch.load(f"{local}/model.pt", map_location="cpu", weights_only=True))
model.eval()

# Per-class task-3 thresholds, keyed by set-aside label
with open(f"{local}/task3_thresholds.json") as f:
    thresholds = json.load(f)["thresholds"]

text = "This Solicitation is set aside for HUBZone small business concerns. " \
       "Clauses include 52.219-3, 52.204-7, and DFARS 252.225-7042."

enc = tok(text, return_tensors="pt", return_offsets_mapping=True, truncation=True, max_length=512)
with torch.no_grad():
    out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])

# Task 1 / 2: argmax over the class logits (softmax would not change the argmax)
notice = TASK1_ID_TO_LABEL[out.task1_logits.argmax(-1).item()]
sector = TASK2_ID_TO_LABEL[out.task2_logits.argmax(-1).item()]
# Task 3: per-class threshold
probs3 = torch.sigmoid(out.task3_logits)[0].tolist()
setasides = [TASK3_LABELS[i] for i in range(N_TASK3) if probs3[i] >= thresholds[TASK3_LABELS[i]]]
# Task 4: decode BIO spans into clause numbers using offset_mapping
# (see models/multi_task/_decode_bio in benchmark/run_baselines.py)
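
The task-4 step above is left as a comment. A minimal sketch of BIO decoding with character offsets might look like the following. It is not the repo's `_decode_bio`; the tag names (`O`, `B-CLAUSE`, `I-CLAUSE`), their ids, and the function name are assumptions, and the per-token tag ids would come from the argmax of the model's task-4 logits.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed tag inventory; the real id-to-tag map lives in the FedProc-Bench repo.
ID_TO_TAG: Dict[int, str] = {0: "O", 1: "B-CLAUSE", 2: "I-CLAUSE"}


def decode_bio_spans(
    text: str,
    tag_ids: Sequence[int],
    offsets: Sequence[Tuple[int, int]],
    id_to_tag: Dict[int, str] = ID_TO_TAG,
) -> List[str]:
    """Merge B-/I- tagged tokens into character spans and return their text."""
    spans: List[str] = []
    start = end = None
    for tag_id, (tok_start, tok_end) in zip(tag_ids, offsets):
        # Special tokens have empty offsets; treat them as outside any span.
        tag = "O" if tok_start == tok_end else id_to_tag[tag_id]
        if tag == "B-CLAUSE" or (tag == "I-CLAUSE" and start is None):
            # A new span starts (an orphan I- tag is treated as a B- tag).
            if start is not None:
                spans.append(text[start:end])
            start, end = tok_start, tok_end
        elif tag == "I-CLAUSE":
            end = tok_end  # extend the open span
        elif start is not None:
            spans.append(text[start:end])
            start = end = None
    if start is not None:
        spans.append(text[start:end])
    # Some tokenizers include the leading space in a token's offsets.
    return [s.strip() for s in spans]


# Toy example with hand-written offsets standing in for enc["offset_mapping"][0].
toy_text = "Clauses include 52.219-3 and 52.204-7."
toy_offsets = [(0, 0), (0, 7), (8, 15), (16, 22), (22, 24),
               (25, 28), (29, 31), (31, 37), (37, 38), (0, 0)]
toy_tags = [0, 0, 0, 1, 2, 0, 1, 2, 0, 0]
print(decode_bio_spans(toy_text, toy_tags, toy_offsets))
# ['52.219-3', '52.204-7']
```

Working from character offsets rather than token strings keeps the extracted clause numbers byte-identical to the input text, which matters when checking them against a reference clause list.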

Training data

150 real SAM.gov solicitations (metadata only — descriptions blocked by daily quota during v0)
1,032 FAR and DFARS clauses pulled from the ECFR API (acquisition.gov data, public domain)
433 Claude-generated synthetic solicitation excerpts that cite real cached clauses, used both for class balance (rare set-asides) and to provide task-4 training text. Validation rejected ~13% of generations that didn't cite their pinned clauses.

Train / val / test split: 1129 / 243 / 243 records. Stratified per source. See the dataset card for full provenance and label maps.
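
A per-source stratified split can be sketched as below. The card gives only the resulting sizes, so the 70 / 15 / 15 fractions, the seed, the rounding rule, and the `source` field name are all assumptions; this sketch is not expected to reproduce the published split exactly.

```python
import random
from collections import defaultdict


def stratified_split(records, key="source", val_frac=0.15, test_frac=0.15, seed=0):
    """Split records into train / val / test, keeping each source's share."""
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec[key]].append(rec)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for source in sorted(by_source):  # sorted for a deterministic order
        group = by_source[source]
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        n_test = round(len(group) * test_frac)
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])
    return train, val, test


# Toy records mirroring the three source sizes listed under "Training data".
records = (
    [{"id": f"sam-{i}", "source": "sam"} for i in range(150)]
    + [{"id": f"far-{i}", "source": "far"} for i in range(1032)]
    + [{"id": f"syn-{i}", "source": "synthetic"} for i in range(433)]
)
train, val, test = stratified_split(records)
print(len(train), len(val), len(test))
```

Splitting inside each source group, rather than over the pooled records, is what keeps the small SAM slice represented in val and test.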

Training setup

Encoder: answerdotai/ModernBERT-base (149M parameters)
Batch size 16, max sequence 512 tokens
AdamW, lr 5e-5, weight decay 0.01, 6% linear warmup
Mixed precision (AMP), 6 epochs
Task loss weights: 1 / 1 / 1 / 1
Single RTX 3060 (12GB), ~4.3 minutes wall time
Code: models/multi_task/ in the FedProc-Bench repo
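
To make the schedule above concrete, here is a small sketch of the learning rate per optimizer step. The card states only "6% linear warmup"; the linear decay to zero afterwards, the step count (no dropped last batch, no gradient accumulation), and the function name are assumptions.

```python
import math

TRAIN_RECORDS = 1129
BATCH_SIZE = 16
EPOCHS = 6
BASE_LR = 5e-5
WARMUP_FRAC = 0.06

# Assumes the last partial batch is kept and there is no gradient accumulation.
steps_per_epoch = math.ceil(TRAIN_RECORDS / BATCH_SIZE)  # 71
total_steps = steps_per_epoch * EPOCHS                   # 426
warmup_steps = int(WARMUP_FRAC * total_steps)            # 25


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then (assumed) linear decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    remaining = max(1, total_steps - warmup_steps)
    return BASE_LR * max(0.0, (total_steps - step) / remaining)


for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

With only about 426 optimizer steps in total, the warmup covers roughly the first 25 steps, which fits the reported wall time of a few minutes.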

Calibration

Task 3 (multi-label set-aside) needed per-class threshold calibration. The default 0.5 threshold produces all-zero predictions because the sigmoid outputs stay in the 0.0–0.4 range. Calibrated thresholds (chosen per class to maximize F1 on val, in the spirit of Youden-style cutoff selection) live in task3_thresholds.json in this repo. They lift task-3 micro-F1 from 0.000 to 0.293 on val without any retraining.
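
The calibration described above can be sketched as a grid search per class. This is an illustration under assumptions, not the script that produced task3_thresholds.json: the grid, the tie-breaking rule (lowest threshold wins), the fallback to 0.5 for classes with no positive F1, and the function names are all hypothetical.

```python
def f1_at(probs, labels, thr):
    """Binary F1 when predicting positive for every prob >= thr."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < thr and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_thresholds(val_probs, val_labels, label_names, grid=None):
    """Pick, for each class, the grid threshold with the highest val F1."""
    grid = grid or [i / 100 for i in range(1, 100)]
    thresholds = {}
    for j, name in enumerate(label_names):
        col_p = [row[j] for row in val_probs]
        col_y = [row[j] for row in val_labels]
        best = max(grid, key=lambda t: f1_at(col_p, col_y, t))
        # Keep the default when no threshold yields a positive F1.
        thresholds[name] = best if f1_at(col_p, col_y, best) > 0 else 0.5
    return thresholds


# Toy val set for one class whose probabilities never reach 0.5.
toy_probs = [[0.05], [0.10], [0.30], [0.35]]
toy_labels = [[0], [0], [1], [1]]
toy_thr = calibrate_thresholds(toy_probs, toy_labels, ["HUBZone"])
print(toy_thr)
```

On the toy data, a 0.5 cutoff predicts nothing and scores an F1 of 0, while any threshold above 0.10 and at most 0.30 separates the two positives cleanly. This is the same failure mode the section describes for the default threshold.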

Known limitations

v0 was trained on only 1,129 records because the SAM.gov daily quota blocked the planned 10k-record fetch. v0.1 retrain with full data is queued.
Task 2 (NAICS sector) is essentially fed only solicitation titles in v0 because description text was unavailable. It is the weakest head.
The model has not been evaluated on solicitations from before its ECFR snapshot date (2026-05-08) for drift. FAR clauses are revised periodically.
The benchmark's test split contains 65 Claude-generated synthetic records; evaluation on those is biased in favor of Claude-family models. The per-source breakdown in the leaderboard makes this explicit.

License

Apache 2.0.

Citation

@software{fedproc_bench_v0,
  author  = {Raihan Sikder},
  title   = {FedProc-Bench v0: a multi-task benchmark for federal procurement NLP},
  year    = {2026},
  url     = {https://huggingface.co/datasets/raihan-js/fedproc-bench}
}

Evaluation results (self-reported, FedProc-Bench v0 test set)

Entity F1: 0.921
Hallucination rate: 0.055