---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- finance
- earnings-calls
- multi-task
- regression
- sec
- quantitative-finance
inference: false
---

# binomial-marks-1

**An earnings-call NLP scorer that produces 23 structured signals per transcript.**

Built by [Binomial AI Research](https://www.binomialtec.com/pages/research.html). Part of the *specialist zoo*: a
roster of small, deployable AI models for quantitative finance. Each model is named after
a thinker who shaped how markets are understood. **marks-1** is named after Howard Marks
(Oaktree), whose memos parse market sentiment, tone, and the gap between what's said and
what's meant.

---

## Headline numbers

- **~82% of frontier-LLM consensus** on topic-direction scoring (mean Spearman vs the frontier
  panel: 0.690, vs the 0.838 ceiling that frontier reasoners hit with each other).
- **Frontier parity on tone**: the marks-1 ↔ frontier mean Spearman of **0.62** is statistically
  tied with frontier ↔ frontier at **0.61** (DeepSeek included) and within 0.05 of the
  Western-frontier subset.
- **F1 = 0.91** on the binary topic-mention heads, i.e. it agrees with the teacher 9 times out
  of 10 on whether a topic was discussed at all.
- **6 of 10 topics ≥ 0.71 Spearman** with Claude Opus 4.7. `dividends` hits **0.84**, only
  **0.05** below the frontier-frontier ceiling of 0.89.
- **~50ms / call on CPU**, sub-10ms on a modern GPU, ~12 calls/sec batched on
  A100/H100/B200, vs **multi-second** latency for a comparable LLM API call. **Two
  orders of magnitude** faster, deterministic, and runs offline.
- **23 outputs in a single forward pass**: no chained LLM calls, no JSON parsing, no
  retry logic.
- **16,384-token context window** covers ~p99 of earnings calls; conditioned on
  `(country, sector, ticker, quarter)` so the same words read correctly in context.
- **Apache 2.0**: deployable anywhere, no API key, no vendor lock-in.

---

## What it does

Given the text of an earnings call (with light metadata), `binomial-marks-1` returns
**23 structured numbers** per call:

**10 topic-direction scores** (each: was the topic discussed? if so, what direction?)

| Topic | What −2 / +2 mean |
|---|---|
| `guidance` | lowered hard / raised significantly |
| `revenue_growth` | decelerating / accelerating |
| `margins` | compressing / expanding |
| `demand` | softening / strong |
| `buybacks` | paused or reduced / new or upsized |
| `dividends` | cut or skipped / raised or initiated |
| `m_and_a` | divestiture / strategic acquisition |
| `headcount` | layoffs / aggressive hiring |
| `macro_exposure` | clear headwind / clear tailwind |
| `competition` | losing share / gaining share |

**3 tone scores** (each: 1 to 5, low to high)

| Dimension | What it measures |
|---|---|
| `mgmt_confidence` | directness in prepared remarks (1 = uncertain "we hope" → 5 = "we will deliver X by Y") |
| `mgmt_defensiveness` | evasion in Q&A (1 = open → 5 = deflects, pivots, refuses to commit) |
| `analyst_skepticism` | analyst pushback (1 = congratulatory → 5 = re-asking the same question) |

The model is conditioned on **country, sector, ticker, and quarter** at inference, so the
same words read differently in the right context: *"margins compressing"* in software
isn't the same signal as in retail; *"demand softening"* in a Chinese consumer name isn't
the same as in a US one. This conditioning is the difference between a generic sentiment
scorer and one that reads earnings calls the way an analyst does.

Quants consume the 23 outputs as features in factor models, screening filters, or
event-study triggers. The model outputs structure, not opinions; buy/sell logic is the
consumer's responsibility.
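As a minimal sketch of that hand-off, here is one way to flatten a scored call into a flat
feature row. It uses the `score()` helper shown in Quick start below; the column naming and
the `pandas` layout are our own illustrative choices, not part of the package:

```python
import pandas as pd
from binomial_marks import score  # convenience helper, see Quick start below

TONES = ("mgmt_confidence", "mgmt_defensiveness", "analyst_skepticism")

def to_feature_row(transcript: str, *, ticker: str, sector: str,
                   country: str, year: int, quarter: int) -> dict:
    """Flatten the 23 outputs of one call into a single flat feature row."""
    r = score(transcript=transcript, ticker=ticker, sector=sector,
              country=country, year=year, quarter=quarter)
    row = {"ticker": ticker, "year": year, "quarter": quarter}
    for topic, t in r["topics"].items():
        # direction is only meaningful when the topic was actually discussed
        row[f"{topic}_score"] = t["score"] if t["mentioned"] else None
        row[f"{topic}_mention_prob"] = t["mention_prob"]
    row.update({k: r[k] for k in TONES})
    return row

row = to_feature_row(
    "Operator: Welcome to NVIDIA's Q4 2025 earnings call...",
    ticker="NVDA", sector="Technology", country="US", year=2025, quarter=4,
)
features = pd.DataFrame([row])  # ready for a factor model, screen, or event study
```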

---

## Quick start

### One-liner via the convenience helper

```bash
pip install binomial-marks
```

```python
from binomial_marks import score

result = score(
    transcript="Operator: Welcome to NVIDIA's Q4 2025 earnings call...",
    ticker="NVDA",
    sector="Technology",
    country="US",
    year=2025, quarter=4,
)
# {
#   "topics": {
#     "guidance":       {"mentioned": True, "mention_prob": 0.94, "score": +1.7},
#     "revenue_growth": {"mentioned": True, "mention_prob": 0.97, "score": +1.5},
#     ...
#   },
#   "mgmt_confidence":     4.6,
#   "mgmt_defensiveness":  1.4,
#   "analyst_skepticism":  1.8,
# }
```

### Direct via `transformers`

```python
from transformers import AutoTokenizer, AutoModel
import torch

tok   = AutoTokenizer.from_pretrained("BinomialTechnologies/binomial-marks-1")
model = AutoModel.from_pretrained(
    "BinomialTechnologies/binomial-marks-1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

prefix = "[SECTOR: Technology] [COUNTRY: US] [TICKER: NVDA] [QUARTER: Q4 2025]\n\n"
inputs = tok(prefix + transcript, return_tensors="pt",
             truncation=True, max_length=16384).to("cuda")

with torch.no_grad():
    out = model.predict(**inputs)
# out["topic_score"]: shape (1, 10), the 10 topic directions
# out["tone_score"]:  shape (1, 3),  the 3 tone dimensions
```

### Batched

```python
from binomial_marks import MarksScorer
scorer = MarksScorer()                              # loads model once
results = scorer.score_batch([
    {"transcript": ..., "ticker": "NVDA", "sector": "Technology", "year": 2025, "quarter": 4},
    {"transcript": ..., "ticker": "AAPL", "sector": "Technology", "year": 2025, "quarter": 1},
])
```

---

## Training

`binomial-marks-1` is trained on **80,000+ earnings call transcripts** spanning 2,700+
unique tickers across global markets (2012–2026), each tagged with country, sector, and
industry metadata. Labels are distilled from frontier reasoning models, and the model is
benchmarked against the same set of frontier systems on a held-out 2,000-call sample.

The split is `(ticker, year, quarter)`-keyed; because this is a pure NLP imitation task
(labels come from language models, not market outcomes), a temporal split is unnecessary.
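A minimal sketch of a key-based split like this one; the hashing scheme and the 5% holdout
fraction here are illustrative assumptions, not the actual training code:

```python
import hashlib

def split_of(ticker: str, year: int, quarter: int, holdout_frac: float = 0.05) -> str:
    """Deterministically bucket a call by its (ticker, year, quarter) key, so the
    same call always lands on the same side of the train/holdout boundary."""
    key = f"{ticker}|{year}|Q{quarter}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
    return "holdout" if bucket < holdout_frac * 10_000 else "train"

print(split_of("NVDA", 2025, 4))  # same key -> same bucket, every run
```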

---

## Eval: cross-LLM agreement on a 2,000-call benchmark

The benchmark is 2,000 calls held out from training, scored by **five systems**
(Grok-4.1-fast-reasoning, Claude Opus 4.7, GPT-5.5 low-reasoning, DeepSeek V4-Pro, and
`marks-1` itself). Pairwise Spearman rank correlation across the 10 topic-direction
dimensions:

|                | vs Opus  | vs GPT-5.5 | vs Grok  | vs DeepSeek |
| ---            | ---      | ---        | ---      | ---         |
| **Opus 4.7**   | –        | 0.886      | 0.832    | 0.803       |
| **GPT-5.5**    | 0.886    | –          | 0.871    | 0.827       |
| **Grok**       | 0.832    | 0.871      | –        | 0.807       |
| **DeepSeek V4**| 0.803    | 0.827      | 0.807    | –           |
| **marks-1**    | **0.711**| **0.712**  | **0.692**| **0.644**   |

| | Frontier ↔ Frontier (6 pairs) | marks-1 ↔ Frontier (4 pairs) |
|---|---|---|
| Mean topic-score Spearman | **0.838** | **0.690** |
| Mean tone Spearman | **0.61** *(see note)* | **0.62** |
| Mean *mentioned* MAE | **0.05** | **0.10** |

**Note on tone**: DeepSeek V4 reads management mood/aggression differently from Western
frontier models (its tone Spearman vs the others is 0.50–0.55, vs Opus↔GPT-5.5 at 0.78).
Excluding DeepSeek, frontier tone agreement is **0.72**, and marks-1 still hits 0.67
against that subset.

**marks-1 reproduces ≈82% of the agreement that frontier reasoners have with each other**
on financial NLP scoring, at a fraction of the inference cost (~50ms on CPU vs
multi-second LLM API calls).
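A minimal sketch of how pairwise agreement numbers like these can be computed, assuming
each system's topic scores on the benchmark are available as `(n_calls, 10)` arrays; the
system names and the random toy data are illustrative stand-ins:

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

def mean_pairwise_spearman(scores: dict[str, np.ndarray]) -> dict[tuple[str, str], float]:
    """scores maps system name -> (n_calls, 10) topic-direction array.
    Returns, per system pair, the Spearman rank correlation averaged
    over the 10 topic dimensions."""
    out = {}
    for a, b in combinations(scores, 2):
        per_topic = [spearmanr(scores[a][:, k], scores[b][:, k]).correlation
                     for k in range(scores[a].shape[1])]
        out[(a, b)] = float(np.mean(per_topic))
    return out

# toy stand-in: real inputs would be the five systems' scores on the 2,000 calls
rng = np.random.default_rng(0)
toy = {name: rng.normal(size=(2000, 10)) for name in ("opus", "gpt", "marks1")}
print(mean_pairwise_spearman(toy))
```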

### Per-topic Spearman vs. Claude Opus 4.7

| Topic | marks-1 ↔ Opus | Opus ↔ GPT-5.5 (ceiling) | Δ |
|---|---|---|---|
| `dividends` | 0.84 | 0.89 | **-0.05** ✓ |
| `demand` | 0.83 | 0.94 | -0.11 |
| `revenue_growth` | 0.81 | 0.94 | -0.13 |
| `buybacks` | 0.79 | 0.94 | -0.15 |
| `guidance` | 0.77 | 0.91 | -0.14 |
| `m_and_a` | 0.72 | 0.83 | -0.11 |
| `margins` | 0.69 | 0.91 | -0.22 |
| `macro_exposure` | 0.67 | 0.89 | -0.22 |
| `competition` | 0.60 | 0.81 | -0.21 |
| `headcount` | 0.41 | 0.81 | -0.40 |

**On the US-large-cap-heavy 2k benchmark, `headcount` is still the weakest dimension at
0.41 vs Opus.** Off the US distribution, the gap closes substantially: on a 4,777-call
non-US/EM-heavy holdout, `headcount` Spearman lands at **0.66**, inside the same
range as the other 9 topics. The 2k-benchmark gap reflects that headcount language in
US large-cap calls is heavily templated (hiring framed as routine HR commentary), so
rank-order signal is genuinely scarce in that slice. That is a property of the data,
not a model defect.

---

## Inference

- **Latency**: ~50ms/call on CPU, sub-10ms on modern GPUs.
- **Batched throughput** (bf16, max_length=16384): ~12 calls/sec/instance on A100/H100/B200.
- **Output is deterministic**: the same input always returns the same 23 numbers.
- **Context window**: 16,384 tokens (~50k characters). Covers ~p99 of earnings calls.

For deployment: the model is a standard `transformers` model. Wrap in FastAPI, deploy on
HF Inference Endpoints, or run as a subprocess in your data pipeline.
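A minimal FastAPI wrapper sketch, assuming the `binomial_marks` helper package from Quick
start; the route name and request schema are our own choices, not a packaged API:

```python
# serve.py -- run with: uvicorn serve:app --port 8000
from fastapi import FastAPI
from pydantic import BaseModel

from binomial_marks import MarksScorer

app = FastAPI()
scorer = MarksScorer()  # load the model once at startup

class CallRequest(BaseModel):
    transcript: str
    ticker: str
    sector: str
    country: str
    year: int
    quarter: int

@app.post("/score")
def score_call(req: CallRequest) -> dict:
    # one forward pass -> all 23 structured outputs
    return scorer.score_batch([req.model_dump()])[0]
```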

---

## Limitations and known gaps

1. **Tone has rank-order signal but absolute levels drift.** Quants should normalize
   cross-sectionally rather than thresholding raw values; see the sketch after this list.
2. **English transcripts only.** Non-English calls (translated) work but degrade. Top
   non-US training countries: GB, DE, FR, JP, SE, CH, CN.
3. **Truncates at 16,384 tokens.** Covers ~p99 of calls; the very longest (Asian
   conglomerates with 8h+ analyst days) lose middle content via head+tail truncation,
   also sketched below.
4. **Pure NLP scorer, not an alpha model.** Outputs are *features*; the trading rule is
   the consumer's responsibility.
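A minimal sketch of the cross-sectional normalization suggested in point 1, assuming a
`pandas` table with one row per `(ticker, year, quarter)`; the z-scoring choice and column
names are illustrative, not part of the package:

```python
import pandas as pd

def cross_sectional_z(df: pd.DataFrame, col: str) -> pd.Series:
    """Z-score a tone column within each (year, quarter) cross-section, so
    drifting absolute levels cancel out and only relative rank/dispersion remains."""
    g = df.groupby(["year", "quarter"])[col]
    return (df[col] - g.transform("mean")) / g.transform("std")

df = pd.DataFrame({
    "ticker":  ["NVDA", "AAPL", "MSFT", "NVDA", "AAPL", "MSFT"],
    "year":    [2025] * 6,
    "quarter": [3, 3, 3, 4, 4, 4],
    "mgmt_confidence": [4.6, 3.9, 4.1, 4.8, 4.0, 4.3],
})
df["mgmt_confidence_z"] = cross_sectional_z(df, "mgmt_confidence")
```

And a sketch of the head+tail truncation described in point 3. The 75/25 head/tail split
here is an assumption; the card only says that middle content is dropped:

```python
def head_tail_truncate(ids: list[int], max_len: int = 16384,
                       head_frac: float = 0.75) -> list[int]:
    """Keep the first ~75% and the last ~25% of the token budget when a call
    exceeds the context window; the middle of the transcript is what gets dropped."""
    if len(ids) <= max_len:
        return ids
    n_head = int(max_len * head_frac)
    return ids[:n_head] + ids[-(max_len - n_head):]
```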

---

## Tier

**Tier 2 β€” research preview.** v1 of the model. Eval against frontier LLMs is documented
above; absolute calibration may shift in v2 with a larger label set. Production users
should run their own validation against return data.

---

## Citation

```bibtex
@misc{binomialmarks2026,
  author = {Binomial AI Research},
  title  = {binomial-marks-1: An earnings-call NLP scorer for quantitative finance},
  year   = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/BinomialTechnologies/binomial-marks-1}},
}
```

---

## License

Apache 2.0. Use freely; we'd appreciate a citation if you build on it.