---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- finance
- earnings-calls
- multi-task
- regression
- sec
- quantitative-finance
inference: false
---

# binomial-marks-1

**An earnings-call NLP scorer that produces 23 structured signals per transcript.**

Built by [Binomial AI Research](https://www.binomialtec.com/pages/research.html). Part of the *specialist zoo*: a roster of small, deployable AI models for quantitative finance. Each model is named after a thinker who shaped how markets are understood. **marks-1** is named after Howard Marks (Oaktree), whose memos parse market sentiment, tone, and the gap between what's said and what's meant.

---

## Headline numbers

- **~82% of frontier-LLM consensus** on topic-direction scoring (mean Spearman vs the frontier panel: 0.690, vs the 0.838 ceiling that frontier reasoners hit with each other).
- **Frontier parity on tone**: the marks-1 ↔ frontier mean Spearman of **0.62** is statistically tied with frontier ↔ frontier at **0.61** (DeepSeek included) and within 0.05 of the Western-frontier subset.
- **F1 = 0.91** on the binary topic-mention heads; it agrees with the teacher roughly 9 times out of 10 on whether a topic was discussed at all.
- **6 of 10 topics ≥ 0.71 Spearman** with Claude Opus 4.7. `dividends` hits **0.84**, only **0.05** below the frontier ↔ frontier ceiling of 0.89.
- **~50 ms/call on CPU**, sub-10 ms on a modern GPU, ~12 calls/sec batched on A100/H100/B200, vs **multi-second** latency for a comparable LLM API call. **Two orders of magnitude** faster, deterministic, and runs offline.
- **23 outputs in a single forward pass**: no chained LLM calls, no JSON parsing, no retry logic.
- **16,384-token context window** covers ~p99 of earnings calls; conditioned on `(country, sector, ticker, quarter)` so the same words read correctly in context.
- **Apache 2.0**: deployable anywhere, no API key, no vendor lock-in.

---

## What it does

Given the text of an earnings call (with light metadata), `binomial-marks-1` returns **23 structured numbers** per call:

**10 topic-direction scores** (each: was the topic discussed? if so, what direction?)

| Topic | What −2 / +2 mean |
|---|---|
| `guidance` | lowered hard / raised significantly |
| `revenue_growth` | decelerating / accelerating |
| `margins` | compressing / expanding |
| `demand` | softening / strong |
| `buybacks` | paused or reduced / new or upsized |
| `dividends` | cut or skipped / raised or initiated |
| `m_and_a` | divestiture / strategic acquisition |
| `headcount` | layoffs / aggressive hiring |
| `macro_exposure` | clear headwind / clear tailwind |
| `competition` | losing share / gaining share |

**3 tone scores** (each: 1 to 5, low to high)

| Dimension | What it measures |
|---|---|
| `mgmt_confidence` | directness in prepared remarks (1 = uncertain "we hope" → 5 = "we will deliver X by Y") |
| `mgmt_defensiveness` | evasion in Q&A (1 = open → 5 = deflects, pivots, refuses to commit) |
| `analyst_skepticism` | analyst pushback (1 = congratulatory → 5 = re-asking the same question) |

The model is conditioned on **country, sector, ticker, and quarter** at inference, so the same words read differently in the right context: *"margins compressing"* in software isn't the same signal as in retail, and *"demand softening"* in a Chinese consumer name isn't the same as in a US one. This conditioning is the difference between a generic sentiment scorer and one that reads earnings calls the way an analyst does.

Quants consume the 23 outputs as features in factor models, screening filters, or event-study triggers. The model outputs structure, not opinions; buy/sell logic is the consumer's responsibility.
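
As an illustration, here is a minimal sketch of flattening one call's output into a flat feature row. It assumes the `score` helper and the output layout from the quick start below; the function and constant names are hypothetical.

```python
# Hypothetical sketch: flatten the 23 signals into one {name: float} row.
# Assumes the output layout shown in the quick-start example below.
TOPICS = ["guidance", "revenue_growth", "margins", "demand", "buybacks",
          "dividends", "m_and_a", "headcount", "macro_exposure", "competition"]
TONES = ["mgmt_confidence", "mgmt_defensiveness", "analyst_skepticism"]

def to_feature_row(result: dict) -> dict:
    row = {}
    for topic in TOPICS:
        t = result["topics"][topic]
        row[f"{topic}_mention_prob"] = t["mention_prob"]
        # Direction only means something when the topic came up;
        # 0.0 is one reasonable neutral fill for unmentioned topics.
        row[f"{topic}_score"] = t["score"] if t["mentioned"] else 0.0
    for tone in TONES:
        row[tone] = result[tone]
    return row  # 23 features, ready for a factor model or a screen
```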

---

## Quick start

### One-liner via the convenience helper

```bash
pip install binomial-marks
```

```python
from binomial_marks import score

result = score(
    transcript="Operator: Welcome to NVIDIA's Q4 2025 earnings call...",
    ticker="NVDA",
    sector="Technology",
    country="US",
    year=2025, quarter=4,
)
# {
#   "topics": {
#     "guidance": {"mentioned": True, "mention_prob": 0.94, "score": +1.7},
#     "revenue_growth": {"mentioned": True, "mention_prob": 0.97, "score": +1.5},
#     ...
#   },
#   "mgmt_confidence": 4.6,
#   "mgmt_defensiveness": 1.4,
#   "analyst_skepticism": 1.8,
# }
```

### Direct via `transformers`

```python
from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("BinomialTechnologies/binomial-marks-1")
model = AutoModel.from_pretrained(
    "BinomialTechnologies/binomial-marks-1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

transcript = "Operator: Welcome to NVIDIA's Q4 2025 earnings call..."
prefix = "[SECTOR: Technology] [COUNTRY: US] [TICKER: NVDA] [QUARTER: Q4 2025]\n\n"
inputs = tok(prefix + transcript, return_tensors="pt",
             truncation=True, max_length=16384).to("cuda")

with torch.no_grad():
    out = model.predict(**inputs)
# out["topic_score"]: shape (1, 10), the 10 topic directions
# out["tone_score"]: shape (1, 3), the 3 tone dimensions
```
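
To attach names to those raw tensors, a mapping like the following works, on the assumption that the head order matches the topic and tone tables above (verify against the model's config before relying on it):

```python
# Assumption: output heads follow the table order above -- verify first.
TOPICS = ["guidance", "revenue_growth", "margins", "demand", "buybacks",
          "dividends", "m_and_a", "headcount", "macro_exposure", "competition"]
TONES = ["mgmt_confidence", "mgmt_defensiveness", "analyst_skepticism"]

topic_scores = dict(zip(TOPICS, out["topic_score"][0].float().tolist()))
tone_scores = dict(zip(TONES, out["tone_score"][0].float().tolist()))
```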

### Batched

```python
from binomial_marks import MarksScorer

scorer = MarksScorer()  # loads model once
results = scorer.score_batch([
    {"transcript": ..., "ticker": "NVDA", "sector": "Technology", "year": 2025, "quarter": 4},
    {"transcript": ..., "ticker": "AAPL", "sector": "Technology", "year": 2025, "quarter": 1},
])
```

---

## Training

`binomial-marks-1` is trained on **80,000+ earnings call transcripts** spanning 2,700+ unique tickers across global markets (2012–2026), each tagged with country, sector, and industry metadata. Labels are distilled from frontier reasoning models, and the model is benchmarked against the same set of frontier systems on a held-out 2,000-call sample.

The train/eval split is keyed on `(ticker, year, quarter)`. Because this is a pure NLP imitation task (labels come from language models, not market outcomes), a temporal split is unnecessary.
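
For illustration, a key-based split like this can be implemented with a deterministic hash. This is a sketch of the idea, not the actual training pipeline:

```python
import hashlib

def in_holdout(ticker: str, year: int, quarter: int, frac: float = 0.025) -> bool:
    """Deterministically route a call to the holdout by hashing its key."""
    key = f"{ticker}|{year}|Q{quarter}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return bucket < frac  # ~2.5% of 80k+ calls -> a ~2,000-call eval sample

print(in_holdout("NVDA", 2025, 4))
```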

---

## Eval: cross-LLM agreement on a 2,000-call benchmark

The benchmark is 2,000 calls held out from training, scored by **five systems** (Grok-4.1-fast-reasoning, Claude Opus 4.7, GPT-5.5 low-reasoning, DeepSeek V4-Pro, and `marks-1` itself). Pairwise Spearman rank correlation across the 10 topic-direction dimensions:

| | vs Opus | vs GPT-5.5 | vs Grok | vs DeepSeek |
| --- | --- | --- | --- | --- |
| **Opus 4.7** | – | 0.886 | 0.832 | 0.803 |
| **GPT-5.5** | 0.886 | – | 0.871 | 0.827 |
| **Grok** | 0.832 | 0.871 | – | 0.807 |
| **DeepSeek V4** | 0.803 | 0.827 | 0.807 | – |
| **marks-1** | **0.711** | **0.712** | **0.692** | **0.644** |

| | Frontier ↔ Frontier (6 pairs) | marks-1 ↔ Frontier (4 pairs) |
|---|---|---|
| Mean topic-score Spearman | **0.838** | **0.690** |
| Mean tone Spearman | **0.61** *(see note)* | **0.62** |
| Mean *mentioned* MAE | **0.05** | **0.10** |

**Note on tone**: DeepSeek V4 reads management mood/aggression differently from Western frontier models (its tone Spearman vs the others is 0.50–0.55, vs Opus ↔ GPT-5.5 at 0.78). Excluding DeepSeek, frontier tone agreement is **0.72**, and marks-1 still hits 0.67 against that subset.

**marks-1 reproduces ≈82% of the agreement that frontier reasoners have with each other** on financial NLP scoring, at a fraction of the inference cost (~50 ms on CPU vs multi-second LLM API calls).
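
For reference, mean pairwise agreement of this kind can be computed with `scipy.stats.spearmanr`. The sketch below uses random stand-in data, not the actual benchmark scores:

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

# scores[system]: (n_calls, 10) array of topic-direction scores.
rng = np.random.default_rng(0)  # stand-in data, for illustration only
scores = {name: rng.normal(size=(2000, 10))
          for name in ["opus", "gpt", "grok", "deepseek"]}

def mean_spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Average Spearman across the 10 topic dimensions for two systems."""
    return float(np.mean([spearmanr(a[:, d], b[:, d])[0]
                          for d in range(a.shape[1])]))

pair_means = [mean_spearman(scores[x], scores[y])
              for x, y in combinations(scores, 2)]  # the 6 frontier pairs
print(round(float(np.mean(pair_means)), 3))
```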

### Per-topic Spearman vs. Claude Opus 4.7

| Topic | marks-1 ↔ Opus | Opus ↔ GPT-5.5 (ceiling) | Δ |
|---|---|---|---|
| `dividends` | 0.84 | 0.89 | **-0.05** |
| `demand` | 0.83 | 0.94 | -0.11 |
| `revenue_growth` | 0.81 | 0.94 | -0.13 |
| `buybacks` | 0.79 | 0.94 | -0.15 |
| `guidance` | 0.77 | 0.91 | -0.14 |
| `m_and_a` | 0.72 | 0.83 | -0.11 |
| `margins` | 0.69 | 0.91 | -0.22 |
| `macro_exposure` | 0.67 | 0.89 | -0.22 |
| `competition` | 0.60 | 0.81 | -0.21 |
| `headcount` | 0.41 | 0.81 | -0.40 |

**On the US-large-cap-heavy 2k benchmark, `headcount` is still the weakest dimension at 0.41 vs Opus.** Off the US distribution, the gap closes substantially: on a 4,777-call non-US/EM-heavy holdout, `headcount` Spearman lands at **0.66**, inside the same range as the other 9 topics. The 2k-benchmark gap reflects that headcount language in US large-cap calls is heavily templated (hiring framed as routine HR commentary), so rank-order signal is genuinely scarce in that slice; it is not a model defect.

---

## Inference

- **Latency**: ~50 ms/call on CPU, sub-10 ms on modern GPUs.
- **Batched throughput** (bf16, max_length=16384): ~12 calls/sec/instance on A100/H100/B200.
- **Output is deterministic**: the same input always returns the same 23 numbers (see the quick check below).
- **Context window**: 16,384 tokens (~50k characters). Covers ~p99 of earnings calls.
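
The determinism claim is easy to verify locally with the `score` helper from the quick start:

```python
from binomial_marks import score

call = dict(
    transcript="Operator: Welcome to NVIDIA's Q4 2025 earnings call...",
    ticker="NVDA", sector="Technology", country="US", year=2025, quarter=4,
)
# Two identical inputs must yield exactly the same 23 numbers.
assert score(**call) == score(**call)
```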

For deployment: the model is a standard `transformers` model. Wrap it in FastAPI, deploy it on HF Inference Endpoints, or run it as a subprocess in your data pipeline.
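
As one example, a minimal FastAPI wrapper could look like this. It is a sketch, not a shipped API: the endpoint path and payload shape are illustrative, and it assumes the `binomial_marks` helper package.

```python
from fastapi import FastAPI
from pydantic import BaseModel
from binomial_marks import MarksScorer

app = FastAPI()
scorer = MarksScorer()  # load the model once at startup, reuse per request

class Call(BaseModel):
    transcript: str
    ticker: str
    sector: str
    year: int
    quarter: int

@app.post("/score")
def score_call(call: Call) -> dict:
    """Return the 23 structured signals for one transcript."""
    return scorer.score_batch([call.model_dump()])[0]
```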

---

## Limitations and known gaps

1. **Tone has rank-order signal, but absolute levels drift.** Quants should normalize cross-sectionally rather than thresholding raw values; see the sketch after this list.
2. **English transcripts only.** Translated non-English calls work, but accuracy degrades. Top non-US training countries: GB, DE, FR, JP, SE, CH, CN.
3. **Truncates at 16,384 tokens.** Covers ~p99 of calls; the very longest (Asian conglomerates with 8h+ analyst days) lose middle content via head+tail truncation.
4. **Pure NLP scorer, not an alpha model.** Outputs are *features*; the trading rule is the consumer's responsibility.
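
A minimal sketch of the cross-sectional normalization recommended in point 1, assuming a pandas DataFrame with one row per call, the raw tone columns, and `year`/`quarter` columns:

```python
import pandas as pd

TONES = ["mgmt_confidence", "mgmt_defensiveness", "analyst_skepticism"]

def zscore_tones(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score the tone columns within each quarter's cross-section."""
    out = df.copy()
    grouped = df.groupby(["year", "quarter"])[TONES]
    out[TONES] = (df[TONES] - grouped.transform("mean")) / grouped.transform("std")
    return out  # compare ranks within a quarter, not raw levels across time
```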

---

## Tier

**Tier 2: research preview.** This is v1 of the model. The eval against frontier LLMs is documented above; absolute calibration may shift in v2 with a larger label set. Production users should run their own validation against return data.

---

## Citation

```bibtex
@misc{binomialmarks2026,
  author       = {Binomial AI Research},
  title        = {binomial-marks-1: An earnings-call NLP scorer for quantitative finance},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/BinomialTechnologies/binomial-marks-1}},
}
```

---

## License

Apache 2.0. Use freely; we'd appreciate a citation if you build on it.