---
license: mit
datasets:
- OleehyO/latex-formulas-80M
- marsianin500/Speech2Latex
- Kyudan/MathBridge
language:
- en
metrics:
- cer
- exact_match
- bleu
base_model:
- Salesforce/codet5p-220m
pipeline_tag: text-generation
tags:
- LaTeX
- math
- text-to-LaTeX
- translation
---

# IntelliTeX: Natural Language → LaTeX (Experimental)

## Model summary

**IntelliTeX** is an experimental Small Language Model (SLM) study for **converting English, spoken-style math descriptions into a single LaTeX equation**. It is intended as a research artifact (training regimes, decoding constraints, stress tests), not a production-ready LaTeX authoring system.

- **Base model:** `Salesforce/codet5p-220m` (CodeT5+ 220M)
- **Primary task:** text → LaTeX equation generation (single equation output)
- **Primary language:** English

## What the model is for

**Intended use**

- Drafting LaTeX equations from short natural-language descriptions
- Prototyping or benchmarking compact models on domain-specific translation

**Not recommended**

- Fully automated formula generation without verification

## Training approach (experimental study)

We evaluated multiple training configurations to understand what improves a compact model most:

1. **LoRA fine-tuning**: rapid iteration and capability checks
2. **Full-parameter fine-tuning (FPFT)**: to measure the performance ceiling (LoRA often underperformed FPFT)
3. **Two-stage pipeline (continued pretraining → FPFT)**, inspired by CodeT5+ training recipes:
   - **Stage 1: domain-adaptive continued pretraining** on *TeXTeller* with span-denoising + causal LM objectives (~4B tokens, ~76k steps)
   - **Stage 2: supervised FPFT** on Speech2LaTeX text→LaTeX pairs

## Experiment Results

- **Full-parameter fine-tuning (FPFT) was the largest single driver of gains** in our experiments. In our report, **FPFT CodeT5+ 220M reached EM 0.467**, roughly **4×** higher than **Qwen2.5-Coder 32B Instruct (EM 0.121)** under the same evaluation setup.
- **On the main Speech2LaTeX (S2L) benchmark, FPFT CodeT5+ 220M outperformed a larger 0.5B FPFT baseline** in our report, indicating that **training regime and architecture** can matter more than parameter count for this task.
- **Stage 1 (domain-adaptive continued pretraining) primarily improved robustness** rather than average-case performance: it **did not materially change EM on the main S2L test set** (0.463 with Stage 1 vs 0.467 without), but helped more under stress conditions.
- **On MathBridge stress tests, CodeT5+ 220M with Stage 1 + FPFT closely matched a much larger 3B comparator** on the long-context and long-target subsets, and outperformed the FPFT-only model.

**Main benchmark (Speech2LaTeX test set)**

- Qwen2.5-Coder 32B: EM 0.121
- FPFT Qwen2.5-Coder 0.5B: EM 0.405
- Stage 1 + FPFT CodeT5+ 220M: EM 0.463
- FPFT CodeT5+ 220M: EM 0.467
- FPFT Qwen2.5-Coder 3B: EM 0.507

**Stress tests (MathBridge subsets)**

- **Long-context inputs** (source length > 115 chars):
  - FPFT CodeT5+ 220M: EM 0.150
  - Stage 1 + FPFT CodeT5+ 220M: EM 0.195
  - FPFT Qwen2.5-Coder 3B: EM 0.209
- **Long-target outputs** (target length > 60 chars):
  - FPFT CodeT5+ 220M: EM 0.049
  - FPFT Qwen2.5-Coder 3B: EM 0.070
  - Stage 1 + FPFT CodeT5+ 220M: EM 0.076

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "duanxianpi/IntelliTeX"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "the integral from zero to one of x squared dx"
prompt = f"Convert natural-language math into a STRICT LaTeX equation\n{text}"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_length=512)
print(tok.decode(out[0], skip_special_tokens=True))
# Output: $$\int_{0}^{1}x^{2}\,dx$$
```

### Running on Transformers.js

A live, in-browser demonstration built with the `transformers.js` library showcases this efficiency advantage on typical CPU hardware.
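To put a rough local number on the same CPU-speed claim, the sketch below times generation through the Python API (the `time_generate` helper is introduced here purely for illustration; absolute timings depend on your hardware, and the browser demo runs on `transformers.js` rather than this code):

```python
import time

def time_generate(generate_fn, n_runs=3):
    """Average wall-clock seconds over n_runs calls to generate_fn."""
    elapsed = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate_fn()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)

if __name__ == "__main__":
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_id = "duanxianpi/IntelliTeX"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    text = "the integral from zero to one of x squared dx"
    prompt = f"Convert natural-language math into a STRICT LaTeX equation\n{text}"
    inputs = tok(prompt, return_tensors="pt")

    # One warm-up call, then timed runs; numbers vary heavily by CPU.
    model.generate(**inputs, max_length=512)
    mean_s = time_generate(lambda: model.generate(**inputs, max_length=512))
    print(f"mean generation time: {mean_s:.2f}s")
```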
| Model | Demo |
| :--- | :---: |
| **IntelliTeX** | ![our](https://cdn-uploads.huggingface.co/production/uploads/68661fe1b3d1359fb3442418/1DEmNl2ZdvQf9GqSBDkuj.gif) |
| **Qwen2.5-Coder-0.5B-Instruct** | ![qwen](https://cdn-uploads.huggingface.co/production/uploads/68661fe1b3d1359fb3442418/4uTwYt1kPy8m8b0OcfW7b.gif) |

## Full Evaluation Results

### 1. Comprehensive performance on the S2L test dataset (2745 samples)

| Model Architecture | Method | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :--- | :---: | :---: | :---: | :---: |
| **SmolLM2 (135M)** | Base | 0.005 | 0.790 | 42.40 | 0.743 |
| | Base + Grammar | 0.011 | 0.822 | 7.90 | 0.279 |
| | LoRA | 0.126 | 0.957 | 0.90 | 0.823 |
| | LoRA + Grammar | 0.127 | 0.957 | 0.91 | 0.824 |
| **SmolLM2 (360M)** | Base | 0.107 | 0.695 | 9.38 | 0.802 |
| | Base + Grammar | 0.142 | 0.760 | 10.00 | 0.812 |
| | LoRA | 0.242 | 0.980 | 0.49 | 0.861 |
| | LoRA + Grammar | 0.243 | 0.980 | 0.49 | 0.862 |
| **CodeT5+ (220M)** | Base | 0.000 | 0.921 | 96.01 | 0.725 |
| | LoRA | 0.258 | 0.913 | 0.39 | 0.874 |
| | FPFT | 0.467 | 0.982 | 0.22 | 0.912 |
| | **Stage 1 + FPFT (IntelliTeX)** | 0.463 | 0.998 | 0.22 | 0.915 |
| **Qwen2.5-Coder (0.5B)** | Base | 0.161 | 0.974 | 1.27 | 0.830 |
| | Base + Grammar | 0.160 | 0.978 | 1.27 | 0.831 |
| | LoRA | 0.155 | 0.909 | 2.71 | 0.836 |
| | LoRA + Grammar | 0.155 | 0.967 | 1.75 | 0.838 |
| | FPFT | 0.405 | 0.990 | 0.24 | 0.902 |
| **Qwen2.5-Coder (3B)** | Base | 0.294 | 0.991 | 0.46 | 0.869 |
| | Base + Grammar | 0.293 | 0.996 | 0.45 | 0.870 |
| | FPFT | 0.507 | 0.997 | 0.18 | 0.919 |
| **Qwen2.5-Coder (32B)** | Base | 0.121 | 1.000 | 0.38 | 0.863 |

*Note: EM = Exact Match, CR = Compilable Rate, CER = Character Error Rate. Base = original instruct model, Grammar = structured decoding, Stage 1 = domain-adaptive continued pretraining.*

### 2. Stress Test Analysis

**Performance on Long-Context Inputs (source > 115 chars)**

*Demonstrates the model's ability to understand lengthy natural-language descriptions.*

| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :---: | :---: | :---: | :---: |
| CodeT5+ (220M) | 0.150 | 0.967 | 0.219 | 0.868 |
| **IntelliTeX (Stage 1 + FPFT)** | 0.195 | 0.997 | 0.211 | 0.873 |
| Qwen2.5-Coder (0.5B) | 0.129 | 0.976 | 0.292 | 0.859 |
| Qwen2.5-Coder (3B) | 0.209 | 0.996 | 0.199 | 0.874 |

**Performance on Long-Sequence Generation (target > 60 chars)**

*Demonstrates the model's ability to generate complex, long LaTeX formulas.*

| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :---: | :---: | :---: | :---: |
| CodeT5+ (220M) | 0.049 | 0.940 | 0.297 | 0.827 |
| **IntelliTeX (Stage 1 + FPFT)** | 0.076 | 0.991 | 0.312 | 0.828 |
| Qwen2.5-Coder (0.5B) | 0.037 | 0.967 | 0.394 | 0.816 |
| Qwen2.5-Coder (3B) | 0.070 | 0.988 | 0.350 | 0.822 |
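The EM and CER columns above can be approximated offline. A minimal sketch, assuming EM means string equality after whitespace normalization and CER means character-level Levenshtein distance divided by reference length (the report's exact normalization and tokenization may differ):

```python
def exact_match(pred: str, ref: str) -> bool:
    # Compare predictions after stripping all whitespace.
    norm = lambda s: "".join(s.split())
    return norm(pred) == norm(ref)

def cer(pred: str, ref: str) -> float:
    # Character-level Levenshtein distance (single-row DP), divided by
    # the reference length.
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)

print(exact_match(r"$$\int_{0}^{1}x^{2}\,dx$$",
                  r"$$\int_{0}^{1} x^{2}\,dx$$"))  # True
print(round(cer("abc", "abd"), 3))  # 0.333
```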