CoFrGeNet-F: Continued Fraction Language Model

An open-source implementation of the CoFrGeNet-F architecture from IBM Research's paper arXiv:2601.21766. CoFrGeNet-F replaces standard Transformer FFN layers with continued fraction networks. This repo contains model weights for two CoFrGeNet-F experiments and the standard Transformer baseline, all trained on identical data with identical hyperparameters.


The Mathematics of Continued Fractions

At the heart of this architecture lies the generalized continued fraction:

$$\tilde{f}(a_1, a_2, \ldots, a_d) \;=\; \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{\ddots + \cfrac{1}{a_d}}}}$$

where each coefficient is a learned linear function of the input. This recursive structure is computed efficiently via continuant polynomials:

$$K_0 = 1, \qquad K_1 = a_d, \qquad K_k = a_{d-k+1} \cdot K_{k-1} + K_{k-2}$$

The continued fraction then evaluates to a simple ratio of consecutive continuants:

$$\tilde{f} = \frac{K_{d-1}}{K_d}$$
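As a concrete illustration (a minimal sketch, not the repository's actual code), the bottom-up continuant recurrence and the final ratio can be written as:

```python
def continuants(a):
    """Continuants K_0..K_d for coefficients a = [a_1, ..., a_d].

    The recurrence consumes coefficients from the innermost term outward:
    K_0 = 1, K_1 = a_d, K_k = a_{d-k+1} * K_{k-1} + K_{k-2}.
    """
    d = len(a)
    K = [1.0, float(a[-1])]
    for k in range(2, d + 1):
        K.append(a[d - k] * K[-1] + K[-2])
    return K

def cf_value(a):
    K = continuants(a)
    return K[-2] / K[-1]  # f = K_{d-1} / K_d
```

For example, `cf_value([1.0, 1.0, 1.0])` evaluates 1/(1 + 1/(1 + 1/1)) = 2/3, matching direct nested evaluation.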

Efficient Gradients via Proposition 1

The continuant formulation yields gradients that require only one division instead of d:

$$\frac{\partial \tilde{f}}{\partial a_k} = (-1)^{k} \left( \frac{K_{d-k}}{K_d} \right)^{2}$$

By caching the reciprocal of the final continuant and reusing it across all depth levels, we reduce the cost from O(d) divisions to O(1).
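A minimal sketch of this cached-reciprocal gradient (not the repo's implementation; it recomputes the continuants inline so the snippet is self-contained):

```python
def cf_grad(a):
    """Gradients of f = K_{d-1}/K_d w.r.t. a_1..a_d, per Proposition 1.

    Builds the continuants once, then reuses a single cached 1/K_d,
    so the whole gradient costs one division instead of d.
    """
    d = len(a)
    K = [1.0, float(a[-1])]                  # K_0 = 1, K_1 = a_d
    for k in range(2, d + 1):
        K.append(a[d - k] * K[-1] + K[-2])   # K_k = a_{d-k+1} K_{k-1} + K_{k-2}
    inv_Kd = 1.0 / K[d]                      # the only division
    return [(-1) ** k * (K[d - k] * inv_Kd) ** 2 for k in range(1, d + 1)]
```

The signs alternate with k: increasing the outermost coefficient a_1 always shrinks the fraction, increasing a_2 grows it, and so on. The result agrees with central finite differences of the nested evaluation.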


The Cffn Layer

Each Continued Fraction FFN (Cffn) replaces the standard two-layer FFN. Where a standard Transformer block uses a Linear, GELU, Linear stack with a 4x expansion, the Cffn instead computes:

$$y = U x + \sum_{j=1}^{L} V_j \cdot z_j$$

Direct linear path:

$$U \in \mathbb{R}^{p \times p}$$

Gating projection:

$$\hat{x} = \sigma(G x) \odot x, \qquad G \in \mathbb{R}^{p \times p}$$

Continued fraction ladders (each computes p independent continued fractions element-wise):

$$z_j = \tilde{f}\!\left( \hat{x} \odot W^{(j)} \right), \qquad W^{(j)} \in \mathbb{R}^{p \times d}, \quad j = 1, \ldots, L$$

Combination weights:

$$V \in \mathbb{R}^{p \times L}$$
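Putting the pieces together, here is a NumPy sketch of the Cffn forward pass under the shapes above (an illustration, not the repo's code; the V_j · z_j combination is read as element-wise, which the shapes require):

```python
import numpy as np

def cf_eval(A):
    # A: (p, d) coefficient rows; evaluates p continued fractions element-wise
    # via the continuant recurrence K_k = a_{d-k+1} * K_{k-1} + K_{k-2}.
    p, d = A.shape
    K_prev, K = np.ones(p), A[:, -1].copy()
    for k in range(2, d + 1):
        K_prev, K = K, A[:, d - k] * K + K_prev
    return K_prev / K                                    # K_{d-1} / K_d

def cffn_forward(x, U, G, V, W):
    # x: (p,)   U, G: (p, p)   V: (p, L)   W: (L, p, d)
    x_hat = (1.0 / (1.0 + np.exp(-(G @ x)))) * x         # gating: sigmoid(Gx) * x
    z = np.stack([cf_eval(x_hat[:, None] * Wj) for Wj in W])  # (L, p) ladder outputs
    return U @ x + (V * z.T).sum(axis=1)                 # direct path + combined ladders
```

Each ladder j scales the gated input into a (p, d) coefficient grid, so a single `cf_eval` call yields p independent depth-d continued fractions.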

Parameter Count per Cffn Layer

$$2p^2 + L \cdot p \cdot (d + 1)$$

At p=768: 1,193,472 params vs a standard FFN's 4,718,592, a 3.95x reduction per layer.
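These counts follow directly from the formula (plain arithmetic, not repo code): U and G contribute p^2 each, the L ladders contribute L·p·d, and the combination weights V contribute L·p.

```python
p, L, d = 768, 3, 5

cffn_params = 2 * p**2 + L * p * (d + 1)  # U + G, ladders W, combination weights V
ffn_params = 8 * p**2                     # standard FFN: p -> 4p -> p

print(cffn_params)                        # 1193472
print(ffn_params)                         # 4718592
print(round(ffn_params / cffn_params, 2)) # 3.95
```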


Experiments

All models trained on FineWeb-Edu 10BT (~10B tokens) with identical hyperparameters. Evaluated on the same NVIDIA H200 with identical code.

Experiment 1: Parameter-Efficient (82M vs 124M)

Can CoFrGeNet-F match baseline quality with 34% fewer parameters?

| Model | Config | Parameters |
|---|---|---|
| Baseline | 12L, 768d, 12h, standard FFN | 124.3M |
| CoFrGeNet-F | 12L, 768d, 12h, Cffn (L=3, d=5) | 82.0M |

Experiment 2: Iso-Parameter (128M vs 124M)

With equal parameter budget, does CoFrGeNet-F match or beat the baseline? By widening the hidden dimension from 768 to 1024, the Cffn model reaches ~128M parameters. Because Cffn scales as:

$$\text{Cffn}(p) = 2p^2 + L \cdot p \cdot (d + 1)$$

while the standard FFN scales as:

$$\text{FFN}(p) = 8p^2$$

the Cffn model can use a wider hidden dimension within the same parameter budget, giving it richer representations and more attention heads.
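A quick sanity check on the widening argument, computed from the two scaling formulas above:

```python
def cffn_per_layer(p, L=3, d=5):
    return 2 * p**2 + L * p * (d + 1)

def ffn_per_layer(p):
    return 8 * p**2

# Even after widening to p = 1024, a Cffn layer is cheaper than the
# baseline's FFN layer at p = 768; that slack is what pays for the wider
# residual stream and the extra attention heads.
print(ffn_per_layer(768))    # 4718592
print(cffn_per_layer(1024))  # 2115584
```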

| Model | Config | Parameters |
|---|---|---|
| Baseline | 12L, 768d, 12h, standard FFN | 124.3M |
| CoFrGeNet-F | 12L, 1024d, 16h, Cffn (L=3, d=5) | 128.3M |

Results

| Metric | Baseline (124M) | CoFrGeNet-F (82M) | CoFrGeNet-F (128M) |
|---|---|---|---|
| WikiText-2 PPL | 40.79 | 110.32 | 82.46 |
| WikiText-103 PPL | 40.79 | 110.32 | 82.46 |
| LAMBADA PPL | 37.45 | 166.57 | 111.26 |
| LAMBADA Accuracy | 19.06% | 8.77% | 11.41% |
| Throughput | 452,622 tok/s | 103,455 tok/s | 128,206 tok/s |
| Generation Speed | 3.68 ms/tok | 10.92 ms/tok | 10.50 ms/tok |

Analysis

The 128M model improves markedly over the 82M (WikiText-2 PPL 110.32 to 82.46, LAMBADA accuracy 8.77% to 11.41%), confirming that the architecture benefits from scale. However, the baseline still wins at this scale. The paper's strongest results were at GPT-2 XL scale (985M CoFrGeNet-F vs 1.5B GPT-2 XL), suggesting the gap may close at larger model sizes.

Experiment 3: More Ladders (128M, L=8) (In Progress)

Each ladder is an independent rational approximation. More ladders give a richer function space at negligible parameter cost (+0.3%). Training on H200, ETA ~March 10, 2026. Results will be added here upon completion.


Training Details

| Hyperparameter | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~10B tokens) |
| Tokenizer | GPT-2 (tiktoken) |
| Optimizer | AdamW |
| Learning rate | 6e-4 peak, cosine decay to 0 |
| Warmup | 700 steps |
| Weight decay | 0.1 |
| Batch size | 524,288 tokens per update |
| Total steps | 19,073 (one epoch) |
| Precision | bfloat16 |
| Seed | 42 |

Dyadic Training Schedule

The paper's dyadic schedule is critical: without it, performance degrades 10-80%. Each depth level i unfreezes at:

$$s_i = \left(1 - \frac{1}{2^i}\right) \times S_{\text{total}}$$

| Depth | Unfreeze Step | % of Training |
|---|---|---|
| 0 (linear components U, G, V) | 0 | 0% |
| 1 (ladder column 0) | 9,537 | 50% |
| 2 (ladder column 1) | 14,305 | 75% |
| 3 (ladder column 2) | 16,689 | 87.5% |
| 4 (ladder column 3) | 17,881 | 93.75% |
| 5 (ladder column 4) | 18,477 | 96.875% |
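The unfreeze steps can be reproduced from the formula; rounding up to the next whole step matches the listed values (a sketch, since the repo's exact rounding convention is not stated):

```python
import math

S_TOTAL = 19073  # total optimizer steps (one epoch)

def unfreeze_step(i, s_total=S_TOTAL):
    # s_i = (1 - 1/2**i) * S_total, rounded up to a whole step
    return math.ceil((1.0 - 0.5 ** i) * s_total)

print([unfreeze_step(i) for i in range(6)])
# [0, 9537, 14305, 16689, 17881, 18477]
```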

Hardware

| Model | GPU | Throughput | Training Time |
|---|---|---|---|
| Baseline (124M) | RTX 5090 | ~141K tok/s | ~19.7 hours |
| CoFrGeNet-F (82M) | H200 NVL | ~74K tok/s | ~37.3 hours |
| CoFrGeNet-F (128M) | H200 NVL | ~114K tok/s | ~24.3 hours |

The 128M model used torch.compile for a ~2.3x throughput boost.


Repo Structure

```
cahlen/cofrgenet-f/
├── baseline/
│   ├── model.safetensors       # Standard Transformer (124M params)
│   └── eval_results.json
├── cofrgenet/
│   ├── model.safetensors       # CoFrGeNet-F 82M (Experiment 1)
│   └── eval_results.json
├── cofrgenet-128m/
│   ├── model.safetensors       # CoFrGeNet-F 128M (Experiment 2)
│   └── eval_results.json
└── src/                        # Full model source code
    ├── cofrgenet/
    └── baseline/
```

Usage

CoFrGeNet-F 128M (Experiment 2):

```python
from safetensors.torch import load_file
from src.cofrgenet.config import CoFrGeNetConfig
from src.cofrgenet.model import CoFrGeNetTransformer

config = CoFrGeNetConfig(n_embd=1024, n_head=16)
model = CoFrGeNetTransformer(config)
state_dict = load_file("cofrgenet-128m/model.safetensors")
# Strip the torch.compile prefix if present
cleaned = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
model.load_state_dict(cleaned, strict=False)
model.eval()
```

CoFrGeNet-F 82M (Experiment 1):

```python
config = CoFrGeNetConfig()  # defaults: 768d, 12h
model = CoFrGeNetTransformer(config)
state_dict = load_file("cofrgenet/model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()
```

Baseline (124M):

```python
from src.baseline.config import BaselineConfig
from src.baseline.model import BaselineTransformer

config = BaselineConfig()
model = BaselineTransformer(config)
state_dict = load_file("baseline/model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()
```

Source Code

Full implementation: github.com/cahlen/cofrgenet-f

Citation

```bibtex
@article{dey2025cofrgenet,
  title={CoFrGeNet: Continued Fraction based Geometric Network for target benchmarks},
  author={Dey, Subhajit and others},
  journal={arXiv preprint arXiv:2601.21766},
  year={2025}
}
```

License

MIT
