+---
+license: apache-2.0
+language:
+- en
+library_name: transformers
+base_model: answerdotai/ModernBERT-large
+pipeline_tag: text-classification
+tags:
+- finance
+- earnings-calls
+- multi-task
+- regression
+- distillation
+- modernbert
+- sec
+- quantitative-finance
+inference: false
+---
+# binomial-marks-1
+**An earnings-call NLP scorer that produces 23 structured signals per transcript.**
+Distilled from a frontier reasoning model (Grok-4.1-fast-reasoning) and validated against
+Claude Opus 4.7 and GPT-5.5, it is a 395M-parameter ModernBERT-large fine-tune.
+Built by [Binomial AI Research](https://binomial.ai). Part of the *specialist zoo* — a
+roster of small, deployable AI models for quantitative finance. Each model is named after
+a thinker who shaped how markets are understood. **marks-1** is named after Howard Marks
+(Oaktree), whose memos parse market sentiment, tone, and the gap between what's said and
+what's meant.
+---
+## What it does
+Given the text of an earnings call (with light metadata), `binomial-marks-1` returns
+**23 structured numbers** per call:
+**10 topics, two outputs each** (a mention probability: was the topic discussed? and a direction score from −2 to +2: if so, which way?)
+| Topic | What −2 / +2 mean |
+|---|---|
+| `guidance` | lowered hard / raised significantly |
+| `revenue_growth` | decelerating / accelerating |
+| `margins` | compressing / expanding |
+| `demand` | softening / strong |
+| `buybacks` | paused or reduced / new or upsized |
+| `dividends` | cut or skipped / raised or initiated |
+| `m_and_a` | divestiture / strategic acquisition |
+| `headcount` | layoffs / aggressive hiring |
+| `macro_exposure` | clear headwind / clear tailwind |
+| `competition` | losing share / gaining share |
+**3 tone scores** (each: 1 to 5, low to high)
+| Dimension | What it measures |
+|---|---|
+| `mgmt_confidence` | directness in prepared remarks (1 = uncertain "we hope" → 5 = "we will deliver X by Y") |
+| `mgmt_defensiveness` | evasion in Q&A (1 = open → 5 = deflects, pivots, refuses to commit) |
+| `analyst_skepticism` | analyst pushback (1 = congratulatory → 5 = re-asking the same question) |
+Quants consume the 23 outputs as features in factor models, screening filters, or
+event-study triggers. The model outputs structure, not opinions — buy/sell logic is the
+consumer's responsibility.
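Because the outputs are meant to be used as features, and the limitations section notes that the tone scores carry rank information rather than a meaningful absolute scale, one common preprocessing step is to standardize each feature across all companies in the same period. The following is a minimal sketch of that idea, not part of the package: `cross_sectional_zscore` is a hypothetical helper and the values are made up for illustration.

```python
from statistics import mean, pstdev

def cross_sectional_zscore(scores: dict[str, float]) -> dict[str, float]:
    """Standardize one feature across tickers within a single period."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        # Every ticker has the same value, so the feature carries no signal.
        return {ticker: 0.0 for ticker in scores}
    return {ticker: (value - mu) / sigma for ticker, value in scores.items()}

# Illustrative values only, not real model output.
confidence_q4 = {"AAA": 4.6, "BBB": 4.2, "CCC": 4.8, "DDD": 3.9}
z = cross_sectional_zscore(confidence_q4)
```

Rank-based transforms (percentile ranks per period) are an equally reasonable choice when only the ordering is trusted.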
+---
+## Quick start
+### One-liner via the convenience helper
+```bash
+pip install binomial-marks
+```
+```python
+from binomial_marks import score
+result = score(
+    transcript="Operator: Welcome to NVIDIA's Q4 2025 earnings call...",
+    ticker="NVDA",
+    sector="Technology",
+    country="US",
+    year=2025, quarter=4,
+)
+# {
+#   "topics": {
+#     "guidance":       {"mentioned": True, "mention_prob": 0.94, "score": +1.7},
+#     "revenue_growth": {"mentioned": True, "mention_prob": 0.97, "score": +1.5},
+#     ...
+#   },
+#   "mgmt_confidence":     4.6,
+#   "mgmt_defensiveness":  1.4,
+#   "analyst_skepticism":  1.8,
+# }
+```
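For factor models or screens it is often convenient to flatten the nested result into a single row of 23 numbers. The sketch below assumes the result contains all ten topics keyed by the names in the topic table, each with `mention_prob` and `score`, which the example output above only shows in abbreviated form. `flatten_result` is a hypothetical helper, not part of `binomial_marks`.

```python
TOPICS = [
    "guidance", "revenue_growth", "margins", "demand", "buybacks",
    "dividends", "m_and_a", "headcount", "macro_exposure", "competition",
]
TONES = ["mgmt_confidence", "mgmt_defensiveness", "analyst_skepticism"]

def flatten_result(result: dict) -> dict[str, float]:
    """Turn the nested score() output into one flat row of 23 features."""
    row = {}
    for topic in TOPICS:
        entry = result["topics"][topic]
        row[f"{topic}_mention_prob"] = float(entry["mention_prob"])
        row[f"{topic}_score"] = float(entry["score"])
    for tone in TONES:
        row[tone] = float(result[tone])
    return row

# Hand-built stand-in for a score() result; values are illustrative only.
example = {
    "topics": {t: {"mentioned": True, "mention_prob": 0.9, "score": 1.0} for t in TOPICS},
    "mgmt_confidence": 4.6,
    "mgmt_defensiveness": 1.4,
    "analyst_skepticism": 1.8,
}
row = flatten_result(example)
```

How the direction score should be treated when a topic is not mentioned (zero, missing, or left as-is) is a modeling choice left to the consumer.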
+### Direct via `transformers`
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+tok   = AutoTokenizer.from_pretrained("BinomialTechnologies/binomial-marks-1")
+model = AutoModel.from_pretrained(
+    "BinomialTechnologies/binomial-marks-1",
+    trust_remote_code=True,
+    torch_dtype=torch.bfloat16,
+).eval().cuda()
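+# Full call text goes here; shortened for illustration.
+transcript = "Operator: Welcome to NVIDIA's Q4 2025 earnings call..."
+# Metadata is passed as a bracketed prefix ahead of the transcript.
+prefix = "[SECTOR: Technology] [COUNTRY: US] [TICKER: NVDA] [QUARTER: Q4 2025]\n\n"
+inputs = tok(prefix + transcript, return_tensors="pt",
+             truncation=True, max_length=16384).to("cuda")
+with torch.no_grad():
+    out = model.predict(**inputs)  # custom method loaded via trust_remote_code
+# out["topic_score"]: shape (1, 10), the 10 topic directions
+# out["tone_score"]:  shape (1, 3),  the 3 tone dimensions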
+```
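When calling the model directly, the metadata prefix has to be built by hand. A small helper keeps the format identical to the string used above. `build_prefix` is a hypothetical name for illustration; only the bracketed format itself comes from the example.

```python
def build_prefix(sector: str, country: str, ticker: str, year: int, quarter: int) -> str:
    """Format call metadata as the bracketed prefix shown above."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"quarter must be 1-4, got {quarter}")
    return (
        f"[SECTOR: {sector}] [COUNTRY: {country}] "
        f"[TICKER: {ticker}] [QUARTER: Q{quarter} {year}]\n\n"
    )

prefix = build_prefix("Technology", "US", "NVDA", 2025, 4)
```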
+### Batched
+```python
+from binomial_marks import MarksScorer
+scorer = MarksScorer()                              # loads model once
+results = scorer.score_batch([
+    {"transcript": ..., "ticker": "NVDA", "sector": "Technology", "year": 2025, "quarter": 4},
+    {"transcript": ..., "ticker": "AAPL", "sector": "Technology", "year": 2025, "quarter": 1},
+])
+```
+---
+## Architecture
+```
+ModernBERT-large encoder (395M, 8192 native ctx → extended to 16384 via YaRN-2x)
+    ↓
+[CLS] embedding ⊕ masked mean pool         (concat → 2H = 2048 dim)
+    ↓
+3 × 2-layer MLP heads (Linear → GELU → Dropout → Linear)
+    ↓
+23 outputs:
+  10 × topic_mentioned (binary, BCE-with-logits)
+  10 × topic_score     (regression, MSE, clamped to [-2, +2] at inference)
+   3 × tone_score      (regression, MSE, clamped to [1, 5] at inference)
+```
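The pooling step in the diagram can be illustrated on plain Python lists. This is a dependency-free sketch of "first-token vector concatenated with a masked mean", not the model's actual code. It assumes the mean runs over every non-padding position (including the first token), which the diagram does not specify.

```python
def cls_plus_masked_mean(hidden: list[list[float]], mask: list[int]) -> list[float]:
    """Concatenate the first-token vector with the mean over non-padding tokens."""
    cls_vec = hidden[0]
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    dim = len(cls_vec)
    mean_vec = [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
    return cls_vec + mean_vec  # length 2H

# Toy sequence: 3 real tokens + 1 padding token, hidden size H = 2.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
pooled = cls_plus_masked_mean(hidden, mask)  # [1.0, 0.0, 3.0, 2.0]
```

With H = 1024 for ModernBERT-large, the same concatenation yields the 2048-dimensional vector that feeds the three heads.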
+Key details:
+- **YaRN RoPE extension** (β_fast=32, β_slow=1) on the global attention layers, scaling
+  ModernBERT-large from native 8192 → 16384 tokens. Local sliding-window layers (128
+  tokens) are unmodified.
+- **Conditioning prefix** `[SECTOR][COUNTRY][TICKER][QUARTER]` lets the model interpret
+  language sector-specifically (e.g., "margins compressing" reads differently in software
+  vs. retail).
+- **fp32 loss math** (forward in bf16, loss in fp32) — required for stable training at
+  16k context.
+- **Weighted multi-task loss**: `0.5 × topic_mentioned + 1.5 × topic_score + 0.2 × tone_score`.
+  Tone weight is low because the teacher's tone labels were saturated (~50% std).
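The weighted loss can be written out as a short, dependency-free sketch. Only the three weights and the loss types (BCE-with-logits for mentions, MSE for the two regressions) come from the description above. The mean reduction, the absence of any masking for unmentioned topics, and the function names are assumptions made for illustration, and Python floats do not reproduce the bf16/fp32 precision handling.

```python
import math

WEIGHTS = {"topic_mentioned": 0.5, "topic_score": 1.5, "tone_score": 0.2}

def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def mse(pred: float, target: float) -> float:
    """Squared error for a single prediction."""
    return (pred - target) ** 2

def _mean(values: list[float]) -> float:
    return sum(values) / len(values)

def multitask_loss(mention_logits, mention_targets,
                   score_preds, score_targets,
                   tone_preds, tone_targets) -> float:
    """Weighted sum of the three per-head losses (mean-reduced; an assumption)."""
    l_mention = _mean([bce_with_logits(z, y) for z, y in zip(mention_logits, mention_targets)])
    l_score = _mean([mse(p, y) for p, y in zip(score_preds, score_targets)])
    l_tone = _mean([mse(p, y) for p, y in zip(tone_preds, tone_targets)])
    return (WEIGHTS["topic_mentioned"] * l_mention
            + WEIGHTS["topic_score"] * l_score
            + WEIGHTS["tone_score"] * l_tone)

# One topic and one tone dimension, each off by exactly 1.0:
# 0.5 * ln(2) + 1.5 * 1 + 0.2 * 1 ≈ 2.047
loss = multitask_loss([0.0], [1.0], [1.0], [2.0], [4.0], [5.0])
```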
+---
+## Training data
+- **99,539 earnings call transcripts** across 2,749 unique tickers, dated 2012-05 to
+  2026-03. Sources: institutional buy-side providers (FMP).
+- **Sector/country/industry metadata** via FMP `/profile` (Yahoo-style GICS).
+- **Labels** distilled from `grok-4-1-fast-reasoning` (xAI) with `reasoning_effort: low`
+  on the entire training corpus. No human annotation. Cost: ~$140 for the full label
+  pass.
+- **80/20 random split** (seed 42), keyed on `(ticker, year, quarter)`. Pure NLP
+  imitation — no temporal split needed since labels come from the LLM, not from market
+  reactions.
+The labels themselves are released as a separate dataset (forthcoming): `BinomialTechnologies/marks-labels-v1`.
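A keyed split of this kind can be reproduced along the following lines. The source specifies only the 80/20 ratio, the seed, and the `(ticker, year, quarter)` key. The exact shuffling procedure below is an assumption, and `split_keys` is a hypothetical helper.

```python
import random

def split_keys(keys, train_frac=0.8, seed=42):
    """Shuffle unique (ticker, year, quarter) keys and cut them into train/test."""
    unique = sorted(set(keys))          # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(unique)
    cut = int(len(unique) * train_frac)
    return set(unique[:cut]), set(unique[cut:])

# Toy keys for illustration only.
keys = [(t, y, q) for t in ("AAA", "BBB") for y in (2023, 2024) for q in (1, 2, 3, 4)]
keys.append(("AAA", 2023, 1))  # a duplicate key must not straddle the split
train, test = split_keys(keys)
```

Splitting on the key rather than on rows ensures that two copies of the same call never end up on opposite sides of the split.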
+---
+## Eval — cross-LLM agreement on a 2,000-call benchmark
+The benchmark sample is 2,000 calls held out from training, scored by **five models**:
+four frontier LLMs (Grok-4.1-fast-reasoning, Claude Opus 4.7, GPT-5.5 with low reasoning,
+and DeepSeek V4-Pro) plus `marks-1` itself. Pairwise Spearman rank correlation, averaged
+across the 10 topic-direction dimensions:
+|                | vs Opus  | vs GPT-5.5 | vs Grok  | vs DeepSeek |
+| ---            | ---      | ---        | ---      | ---         |
+| **Opus 4.7**   | —        | 0.886      | 0.832    | 0.803       |
+| **GPT-5.5**    | 0.886    | —          | 0.871    | 0.827       |
+| **Grok**       | 0.832    | 0.871      | —        | 0.807       |
+| **DeepSeek V4**| 0.803    | 0.827      | 0.807    | —           |
+| **marks-1**    | **0.697**| **0.696**  | **0.677**| **0.627**   |
+| | Frontier ↔ Frontier (6 pairs) | marks-1 ↔ Frontier (4 pairs) |
+|---|---|---|
+| Mean topic-score Spearman | **0.838** | **0.674** |
+| Mean tone Spearman | **0.61** *(see note)* | **0.62** |
+| Mean *mentioned* MAE | **0.05** | **0.10** |
+**Note on tone**: DeepSeek V4 reads management mood/aggression differently from Western
+frontier models (its tone Spearman vs the others is 0.50-0.55, vs Opus↔GPT-5.5 at 0.78).
+Excluding DeepSeek, frontier tone agreement is **0.72** — and marks-1 still hits 0.67
+against that subset.
+**marks-1 reproduces ≈80% of the agreement that frontier reasoners have with each other**
+on topic-direction scoring (mean Spearman 0.674 vs 0.838), at a fraction of the inference
+cost (~50–200ms on CPU vs multi-second LLM API calls).
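The summary figures follow directly from the pairwise table. As a quick check, the snippet below recomputes the two means and their ratio from the published topic-score Spearman values. Nothing here is new data.

```python
from statistics import mean

# Topic-score Spearman values copied from the pairwise table above.
FRONTIER_PAIRS = {
    ("Opus", "GPT-5.5"): 0.886,
    ("Opus", "Grok"): 0.832,
    ("Opus", "DeepSeek"): 0.803,
    ("GPT-5.5", "Grok"): 0.871,
    ("GPT-5.5", "DeepSeek"): 0.827,
    ("Grok", "DeepSeek"): 0.807,
}
MARKS_VS = {"Opus": 0.697, "GPT-5.5": 0.696, "Grok": 0.677, "DeepSeek": 0.627}

frontier_mean = mean(FRONTIER_PAIRS.values())   # 6 frontier pairs
marks_mean = mean(MARKS_VS.values())            # 4 marks-1 pairs
ratio = marks_mean / frontier_mean
print(f"{frontier_mean:.3f} {marks_mean:.3f} {ratio:.2f}")  # prints: 0.838 0.674 0.80
```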
+### Per-topic Spearman vs. Claude Opus 4.7
+| Topic | marks-1 ↔ Opus | Opus ↔ GPT-5.5 (ceiling) | Δ |
+|---|---|---|---|
+| `dividends` | 0.84 | 0.89 | **-0.05** ✓ |
+| `demand` | 0.82 | 0.94 | -0.12 |
+| `revenue_growth` | 0.80 | 0.94 | -0.14 |
+| `buybacks` | 0.77 | 0.94 | -0.17 |
+| `guidance` | 0.76 | 0.91 | -0.15 |
+| `m_and_a` | 0.71 | 0.83 | -0.12 |
+| `macro_exposure` | 0.66 | 0.89 | -0.23 |
+| `margins` | 0.63 | 0.91 | -0.28 |
+| `competition` | 0.59 | 0.81 | -0.22 |
+| **`headcount`** | **0.39** | 0.81 | **-0.42** ⚠ |
+**Headcount is the weakest dimension.** Layoff/hiring signal is harder to parse than
+direction-of-growth signals. v2 will revisit.
+### vs. teacher (eval/overall on 20k held-out test split)
+```
+eval/overall:               0.7425
+eval/mentioned_macro_f1:    0.9092
+eval/score_macro_spearman:  0.6658
+eval/tone_macro_spearman:   0.6524
+```
+---
+## Inference
+- **Latency target**: 50ms/call on CPU, sub-10ms on a modern GPU.
+- **Batched throughput** on A100/H100/B200 (bf16, max_length=16384):
+  ~12 calls/sec/instance (single-stream).
+- **Deterministic output**: a pure encoder forward pass plus MLP heads, with no sampling.
+For deployment: the model is a regular `transformers` model. Wrap in FastAPI, deploy on
+HF Inference Endpoints, or run as a subprocess in your data pipeline.
+---
+## Limitations and known gaps
+1. **`headcount` dimension is unreliable** (Spearman 0.39 vs Opus 4.7, roughly half the
+   0.73 mean of the other 9 topics). Treat with skepticism.
+2. **Tone labels are partly mode-collapsed** in the teacher (Grok defaults `mgmt_confidence`
+   to 4-5/5 and `mgmt_defensiveness` to 1-2/5). The model picks up rank order but the
+   absolute scale is uninformative — quants should normalize cross-sectionally.
+3. **English-only**. Trained on English transcripts; non-English calls (translated) work
+   but degrade. Top non-US training countries: GB, DE, FR, JP, SE, CH, CN.
+4. **Truncates at 16,384 tokens** (~50k characters). Covers ~p99 of earnings calls;
+   the very longest (Asian conglomerates with 8h+ analyst days) lose middle content via
+   head+tail truncation.
+5. **Pure NLP scorer — not an alpha model.** Outputs are *features*; the trading rule is
+   the consumer's responsibility.
+6. **Distilled, not original judgment.** marks-1 reproduces the teacher's biases,
+   including any systematic miscalibration. The cross-LLM benchmark documents the residual
+   disagreement.
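Head+tail truncation (limitation 4) keeps the opening and closing parts of an over-long call and drops the middle. Note that `truncation=True` in the direct `transformers` example cuts from the right by default, so reproducing head+tail behaviour on that path would require truncating the token ids beforehand. The sketch below shows the idea on a plain list of token ids. The 50/50 split (`head_frac`) and the function name are assumptions, and special tokens are ignored for brevity.

```python
def head_tail_truncate(token_ids: list[int], max_length: int,
                       head_frac: float = 0.5) -> list[int]:
    """Keep the start and end of an over-long sequence, dropping the middle."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    n_head = int(max_length * head_frac)
    n_tail = max_length - n_head
    tail = token_ids[-n_tail:] if n_tail > 0 else []
    return token_ids[:n_head] + tail

ids = list(range(20))
short = head_tail_truncate(ids, max_length=8)  # [0, 1, 2, 3, 16, 17, 18, 19]
```

For earnings calls, the head typically covers prepared remarks and the tail covers late Q&A, so the right ratio depends on which of the tone dimensions matters most to the consumer.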
+---
+## Tier
+**Tier 2: research preview.** v1 of the model. Eval against four frontier LLMs is
+documented above; absolute calibration may shift in v2 with a larger / cleaner label set.
+Production users should run their own validation against return data.
+---
+## Citation
+```bibtex
+@misc{binomialmarks2026,
+  author = {Binomial AI Research},
+  title  = {binomial-marks-1: An earnings-call NLP scorer for quantitative finance},
+  year   = {2026},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/BinomialTechnologies/binomial-marks-1}},
+}
+```
+---
+## License
+Apache 2.0. Use freely; we'd appreciate a citation if you build on it.