Commit c2b1993 (verified) by ilayibrahimzadeh · 1 parent: 8361ea4

Add full model card

  1. README.md +294 -3
README.md CHANGED
@@ -1,3 +1,294 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ library_name: transformers
6
+ base_model: answerdotai/ModernBERT-large
7
+ pipeline_tag: text-classification
8
+ tags:
9
+ - finance
10
+ - earnings-calls
11
+ - multi-task
12
+ - regression
13
+ - distillation
14
+ - modernbert
15
+ - sec
16
+ - quantitative-finance
17
+ inference: false
18
+ ---
19
+
20
+ # binomial-marks-1
21
+
22
+ **An earnings-call NLP scorer that produces 23 structured signals per transcript.**
23
+ Distilled from a frontier reasoning model (Grok-4.1-fast-reasoning, validated against
24
+ Claude Opus 4.7 and GPT-5.5) into a 395M-parameter ModernBERT-large fine-tune.
25
+
26
+ Built by [Binomial AI Research](https://binomial.ai). Part of the *specialist zoo*: a
27
+ roster of small, deployable AI models for quantitative finance. Each model is named after
28
+ a thinker who shaped how markets are understood. **marks-1** is named after Howard Marks
29
+ (Oaktree), whose memos parse market sentiment, tone, and the gap between what's said and
30
+ what's meant.
31
+
32
+ ---
33
+
34
+ ## What it does
35
+
36
+ Given the text of an earnings call (with light metadata), `binomial-marks-1` returns
37
+ **23 structured numbers** per call:
38
+
39
+ **10 topic-direction scores** (each: was the topic discussed? if so, what direction?)
40
+
41
+ | Topic | What -2 / +2 mean |
42
+ |---|---|
43
+ | `guidance` | lowered hard / raised significantly |
44
+ | `revenue_growth` | decelerating / accelerating |
45
+ | `margins` | compressing / expanding |
46
+ | `demand` | softening / strong |
47
+ | `buybacks` | paused or reduced / new or upsized |
48
+ | `dividends` | cut or skipped / raised or initiated |
49
+ | `m_and_a` | divestiture / strategic acquisition |
50
+ | `headcount` | layoffs / aggressive hiring |
51
+ | `macro_exposure` | clear headwind / clear tailwind |
52
+ | `competition` | losing share / gaining share |
53
+
54
+ **3 tone scores** (each: 1 to 5, low to high)
55
+
56
+ | Dimension | What it measures |
57
+ |---|---|
58
+ | `mgmt_confidence` | directness in prepared remarks (1 = uncertain "we hope" → 5 = "we will deliver X by Y") |
59
+ | `mgmt_defensiveness` | evasion in Q&A (1 = open → 5 = deflects, pivots, refuses to commit) |
60
+ | `analyst_skepticism` | analyst pushback (1 = congratulatory → 5 = re-asking the same question) |
61
+
62
+ Quants consume the 23 outputs as features in factor models, screening filters, or
63
+ event-study triggers. The model outputs structure, not opinions; buy/sell logic is the
64
+ consumer's responsibility.
65
+
66
+ ---
67
+
68
+ ## Quick start
69
+
70
+ ### One-liner via the convenience helper
71
+
72
+ ```bash
73
+ pip install binomial-marks
74
+ ```
75
+
76
+ ```python
77
+ from binomial_marks import score
78
+
79
+ result = score(
80
+     transcript="Operator: Welcome to NVIDIA's Q4 2025 earnings call...",
81
+     ticker="NVDA",
82
+     sector="Technology",
83
+     country="US",
84
+     year=2025, quarter=4,
85
+ )
86
+ # {
87
+ #   "topics": {
88
+ #     "guidance": {"mentioned": True, "mention_prob": 0.94, "score": +1.7},
89
+ #     "revenue_growth": {"mentioned": True, "mention_prob": 0.97, "score": +1.5},
90
+ #     ...
91
+ #   },
92
+ #   "mgmt_confidence": 4.6,
93
+ #   "mgmt_defensiveness": 1.4,
94
+ #   "analyst_skepticism": 1.8,
95
+ # }
96
+ ```
97
+
98
+ ### Direct via `transformers`
99
+
100
+ ```python
101
+ from transformers import AutoTokenizer, AutoModel
102
+ import torch
103
+
104
+ tok = AutoTokenizer.from_pretrained("BinomialTechnologies/binomial-marks-1")
105
+ model = AutoModel.from_pretrained(
106
+     "BinomialTechnologies/binomial-marks-1",
107
+     trust_remote_code=True,
108
+     torch_dtype=torch.bfloat16,
109
+ ).eval().cuda()
110
+
111
+ prefix = "[SECTOR: Technology] [COUNTRY: US] [TICKER: NVDA] [QUARTER: Q4 2025]\n\n"
112
+ inputs = tok(prefix + transcript, return_tensors="pt",
113
+              truncation=True, max_length=16384).to("cuda")
114
+
115
+ with torch.no_grad():
116
+     out = model.predict(**inputs)
117
+ # out["topic_score"]: shape (1, 10), the 10 topic directions
118
+ # out["tone_score"]: shape (1, 3), the 3 tone dimensions
119
+ ```
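
The conditioning prefix in the snippet above is a plain string. A minimal sketch of a helper that assembles it, assuming the bracketed format shown above is exactly what the model expects; `build_prefix` is a hypothetical name and is not part of `binomial_marks`:

```python
def build_prefix(sector: str, country: str, ticker: str, year: int, quarter: int) -> str:
    """Assemble the metadata prefix in the format used in the snippet above."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"quarter must be 1-4, got {quarter}")
    return (
        f"[SECTOR: {sector}] [COUNTRY: {country}] "
        f"[TICKER: {ticker}] [QUARTER: Q{quarter} {year}]\n\n"
    )


# Hypothetical usage: prepend the result to the transcript before tokenizing.
prefix = build_prefix("Technology", "US", "NVDA", 2025, 4)
print(repr(prefix))
```

Keeping the prefix in one function makes it harder for batch jobs to drift from the single-call format.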
120
+
121
+ ### Batched
122
+
123
+ ```python
124
+ from binomial_marks import MarksScorer
125
+ scorer = MarksScorer() # loads model once
126
+ results = scorer.score_batch([
127
+     {"transcript": ..., "ticker": "NVDA", "sector": "Technology", "year": 2025, "quarter": 4},
128
+     {"transcript": ..., "ticker": "AAPL", "sector": "Technology", "year": 2025, "quarter": 1},
129
+ ])
130
+ ```
131
+
132
+ ---
133
+
134
+ ## Architecture
135
+
136
+ ```
137
+ ModernBERT-large encoder (395M, 8192 native ctx → extended to 16384 via YaRN-2x)
138
+         ↓
139
+ [CLS] embedding ⊕ masked mean pool (concat → 2H = 2048 dim)
140
+         ↓
141
+ 3 × 2-layer MLP heads (Linear → GELU → Dropout → Linear)
142
+         ↓
143
+ 23 outputs:
144
+     10 × topic_mentioned (binary, BCE-with-logits)
145
+     10 × topic_score (regression, MSE, clamped to [-2, +2] at inference)
146
+     3 × tone_score (regression, MSE, clamped to [1, 5] at inference)
147
+ ```
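
The pooling step and the inference-time clamping in the diagram can be illustrated on toy data. The sketch below uses plain Python lists in place of tensors, and the function names are illustrative rather than taken from the model's code:

```python
from typing import List


def masked_mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average token vectors where mask == 1, ignoring padding positions."""
    dim = len(hidden[0])
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def pooled_features(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Concatenate the first-token ([CLS]) vector with the masked mean: H -> 2H."""
    return hidden[0] + masked_mean_pool(hidden, mask)


def clamp(value: float, low: float, high: float) -> float:
    """Restrict a regression output to its valid range at inference."""
    return max(low, min(high, value))


# Toy sequence: 3 real tokens + 1 padding token, hidden size H = 2.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

features = pooled_features(hidden, mask)  # length 2H = 4
topic_score = clamp(2.7, -2.0, 2.0)       # topic range is [-2, +2]
tone_score = clamp(0.4, 1.0, 5.0)         # tone range is [1, 5]
print(features, topic_score, tone_score)
```

With the real encoder, H is 1024, so the concatenated feature vector has 2048 dimensions, as stated in the diagram.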
148
+
149
+ Key details:
150
+ - **YaRN RoPE extension** (β_fast=32, β_slow=1) on the global attention layers, scaling
151
+ ModernBERT-large from native 8192 → 16384 tokens. Local sliding-window layers (128
152
+ tokens) are unmodified.
153
+ - **Conditioning prefix** `[SECTOR][COUNTRY][TICKER][QUARTER]` lets the model interpret
154
+ language sector-specifically (e.g., "margins compressing" reads differently in software
155
+ vs. retail).
156
+ - **fp32 loss math** (forward in bf16, loss in fp32), required for stable training at
157
+ 16k context.
158
+ - **Weighted multi-task loss**: `topic_mentioned 0.5 + topic_score 1.5 + tone_scores 0.2`.
159
+ Tone weight is low because the teacher's tone labels were saturated (~50% std).
160
+
161
+ ---
162
+
163
+ ## Training data
164
+
165
+ - **99,539 earnings call transcripts** across 2,749 unique tickers, dated 2012-05 to
166
+ 2026-03. Sources: institutional buy-side providers (FMP).
167
+ - **Sector/country/industry metadata** via FMP `/profile` (Yahoo-style GICS).
168
+ - **Labels** distilled from `grok-4-1-fast-reasoning` (xAI) with `reasoning_effort: low`
169
+ on the entire training corpus. No human annotation. Cost: ~$140 for the full label
170
+ pass.
171
+ - **80/20 random split** (seed 42), keyed on `(ticker, year, quarter)`. Pure NLP
172
+ imitation: no temporal split is needed since labels come from the LLM, not from market
173
+ reactions.
174
+
175
+ The labels themselves will be released as a separate dataset (forthcoming): `BinomialTechnologies/marks-labels-v1`.
176
+
177
+ ---
178
+
179
+ ## Eval: cross-LLM agreement on a 2,000-call benchmark
180
+
181
+ The benchmark sample is 2,000 calls held out from training, scored by **five models**
182
+ (Grok-4.1-fast-reasoning, Claude Opus 4.7, GPT-5.5 with low reasoning, DeepSeek V4-Pro,
183
+ and `marks-1` itself). Pairwise Spearman rank correlation across the 10 topic-direction
184
+ dimensions:
185
+
186
+ | | vs Opus | vs GPT-5.5 | vs Grok | vs DeepSeek |
187
+ | --- | --- | --- | --- | --- |
188
+ | **Opus 4.7** | - | 0.886 | 0.832 | 0.803 |
189
+ | **GPT-5.5** | 0.886 | - | 0.871 | 0.827 |
190
+ | **Grok** | 0.832 | 0.871 | - | 0.807 |
191
+ | **DeepSeek V4** | 0.803 | 0.827 | 0.807 | - |
192
+ | **marks-1** | **0.697**| **0.696** | **0.677**| **0.627** |
193
+
194
+ | | Frontier ↔ Frontier (6 pairs) | marks-1 ↔ Frontier (4 pairs) |
195
+ |---|---|---|
196
+ | Mean topic-score Spearman | **0.838** | **0.674** |
197
+ | Mean tone Spearman | **0.61** *(see note)* | **0.62** |
198
+ | Mean *mentioned* MAE | **0.05** | **0.10** |
199
+
200
+ **Note on tone**: DeepSeek V4 reads management mood/aggression differently from Western
201
+ frontier models (its tone Spearman vs the others is 0.50-0.55, vs Opus ↔ GPT-5.5 at 0.78).
202
+ Excluding DeepSeek, frontier tone agreement is **0.72**, and marks-1 still reaches 0.67
203
+ against that subset.
204
+
205
+ **marks-1 reproduces ≈80% of the agreement that frontier reasoners have with each other**
206
+ on topic-direction scoring (mean Spearman 0.674 vs 0.838), at a fraction of the inference cost (~50-200 ms on CPU vs
207
+ multi-second LLM API calls).
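
The roughly 80% figure can be checked from the pairwise table above. The short script below averages the six frontier pairs and the four marks-1 pairs, then takes the ratio; the dictionary layout is only for illustration:

```python
from statistics import mean

# Pairwise topic-score Spearman values copied from the table above.
frontier_pairs = {
    ("Opus 4.7", "GPT-5.5"): 0.886,
    ("Opus 4.7", "Grok"): 0.832,
    ("Opus 4.7", "DeepSeek V4"): 0.803,
    ("GPT-5.5", "Grok"): 0.871,
    ("GPT-5.5", "DeepSeek V4"): 0.827,
    ("Grok", "DeepSeek V4"): 0.807,
}
marks_pairs = {"Opus 4.7": 0.697, "GPT-5.5": 0.696, "Grok": 0.677, "DeepSeek V4": 0.627}

frontier_mean = mean(list(frontier_pairs.values()))
marks_mean = mean(list(marks_pairs.values()))
ratio = marks_mean / frontier_mean

print(f"frontier mean: {frontier_mean:.3f}")
print(f"marks-1 mean:  {marks_mean:.3f}")
print(f"ratio:         {ratio:.1%}")
```

The two means match the summary row (0.838 and 0.674), and their ratio lands just above 0.80.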
208
+
209
+ ### Per-topic Spearman vs. Claude Opus 4.7
210
+
211
+ | Topic | marks-1 ↔ Opus | Opus ↔ GPT-5.5 (ceiling) | Δ |
212
+ |---|---|---|---|
213
+ | `dividends` | 0.84 | 0.89 | **-0.05** ✓ |
214
+ | `demand` | 0.82 | 0.94 | -0.12 |
215
+ | `revenue_growth` | 0.80 | 0.94 | -0.14 |
216
+ | `buybacks` | 0.77 | 0.94 | -0.17 |
217
+ | `guidance` | 0.76 | 0.91 | -0.15 |
218
+ | `m_and_a` | 0.71 | 0.83 | -0.12 |
219
+ | `macro_exposure` | 0.66 | 0.89 | -0.23 |
220
+ | `margins` | 0.63 | 0.91 | -0.28 |
221
+ | `competition` | 0.59 | 0.81 | -0.22 |
222
+ | **`headcount`** | **0.39** | 0.81 | **-0.42** ⚠ |
223
+
224
+ **Headcount is the weakest dimension.** Layoff/hiring signal is harder to parse than
225
+ direction-of-growth signals. v2 will revisit.
226
+
227
+ ### vs. teacher (eval/overall on 20k held-out test split)
228
+
229
+ ```
230
+ eval/overall: 0.7425
231
+ eval/mentioned_macro_f1: 0.9092
232
+ eval/score_macro_spearman: 0.6658
233
+ eval/tone_macro_spearman: 0.6524
234
+ ```
235
+
236
+ ---
237
+
238
+ ## Inference
239
+
240
+ - **Latency target**: 50ms/call on CPU, sub-10ms on a modern GPU.
241
+ - **Batched throughput** on A100/H100/B200 (bf16, max_length=16384):
242
+ ~12 calls/sec/instance (single-stream).
243
+ - **Deterministic output**: pure encoder forward pass plus linear projections.
244
+
245
+ For deployment: the model is a regular `transformers` model. Wrap in FastAPI, deploy on
246
+ HF Inference Endpoints, or run as a subprocess in your data pipeline.
247
+
248
+ ---
249
+
250
+ ## Limitations and known gaps
251
+
252
+ 1. **`headcount` dimension is unreliable** (Spearman 0.39 vs Claude Opus 4.7, roughly half
253
+ the level of the other 9 topics). Treat with skepticism.
254
+ 2. **Tone labels are partly mode-collapsed** in the teacher (Grok defaults `mgmt_confidence`
255
+ to 4-5/5 and `mgmt_defensiveness` to 1-2/5). The model picks up rank order but the
256
+ absolute scale is uninformative; quants should normalize cross-sectionally.
257
+ 3. **English-only**. Trained on English transcripts; non-English calls (translated) work
258
+ but degrade. Top non-US training countries: GB, DE, FR, JP, SE, CH, CN.
259
+ 4. **Truncates at 16,384 tokens** (~50k characters). Covers ~p99 of earnings calls;
260
+ the very longest (Asian conglomerates with 8h+ analyst days) lose middle content via
261
+ head+tail truncation.
262
+ 5. **Pure NLP scorer, not an alpha model.** Outputs are *features*; the trading rule is
263
+ the consumer's responsibility.
264
+ 6. **Distilled, not original judgment.** marks-1 reproduces the teacher's biases,
265
+ including any systematic miscalibration. The cross-LLM benchmark documents the residual
266
+ disagreement.
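
Limitation 2 suggests normalizing tone scores cross-sectionally. A minimal sketch of one way to do that is to z-score a tone dimension across all calls in the same period. The input layout and the function name are assumptions for illustration, and the numbers are made up rather than real model output:

```python
from statistics import mean, pstdev
from typing import Dict


def cross_sectional_zscore(scores: Dict[str, float]) -> Dict[str, float]:
    """Z-score one tone dimension across tickers within a single period."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # No dispersion in this cross-section: every name is "average".
        return {ticker: 0.0 for ticker in scores}
    return {ticker: (value - mu) / sigma for ticker, value in scores.items()}


# Illustrative mgmt_confidence values for one quarter (placeholder tickers).
confidence = {"AAA": 4.6, "BBB": 4.2, "CCC": 4.8, "DDD": 4.4}
z = cross_sectional_zscore(confidence)
ranked = sorted(z, key=z.get, reverse=True)
print(ranked)
```

Because the raw scores cluster near the top of the 1 to 5 scale, the z-scores (or simply the ranks) carry the usable signal, not the absolute level.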
267
+
268
+ ---
269
+
270
+ ## Tier
271
+
272
+ **Tier 2: research preview.** v1 of the model. Eval against four frontier LLMs is
273
+ documented above; absolute calibration may shift in v2 with a larger / cleaner label set.
274
+ Production users should run their own validation against return data.
275
+
276
+ ---
277
+
278
+ ## Citation
279
+
280
+ ```bibtex
281
+ @misc{binomialmarks2026,
282
+ author = {Binomial AI Research},
283
+ title = {binomial-marks-1: An earnings-call NLP scorer for quantitative finance},
284
+ year = {2026},
285
+ publisher = {HuggingFace},
286
+ howpublished = {\url{https://huggingface.co/BinomialTechnologies/binomial-marks-1}},
287
+ }
288
+ ```
289
+
290
+ ---
291
+
292
+ ## License
293
+
294
+ Apache 2.0. Use freely; we'd appreciate a citation if you build on it.