---
license: mit
datasets:
- OleehyO/latex-formulas-80M
- marsianin500/Speech2Latex
- Kyudan/MathBridge
language:
- en
metrics:
- cer
- exact_match
- bleu
base_model:
- Salesforce/codet5p-220m
pipeline_tag: text2text-generation
tags:
- LaTeX
- math
- text-to-LaTeX
- translation
---
# IntelliTeX: Natural Language → LaTeX (Experimental)
## Model summary
**IntelliTeX** is an experimental Small Language Model (SLM) study for **converting spoken-style English math descriptions into a single LaTeX equation**. It is intended as a research artifact (training regimes, decoding constraints, stress tests), not a production-ready LaTeX authoring system.
- **Base model:** `Salesforce/codet5p-220m` (CodeT5+ 220M)
- **Primary task:** text → LaTeX equation generation (single equation output)
- **Primary language:** English
## What the model is for
**Intended use**
- Drafting LaTeX equations from short natural-language descriptions
- Prototyping or benchmarking compact models on domain-specific translation
**Not recommended**
- Fully automated formula generation without verification
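Before accepting generated output, a lightweight structural sanity check can catch obviously malformed LaTeX. The sketch below is illustrative only (`looks_well_formed` is a hypothetical helper, not part of this repository) and is no substitute for actually compiling the result:

```python
import re

def looks_well_formed(latex: str) -> bool:
    """Cheap structural checks on a LaTeX snippet (not a full compiler)."""
    # Braces must balance and the running depth must never go negative.
    depth = 0
    for ch in latex:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    if depth != 0:
        return False
    # \left and \right must occur in equal numbers.
    if len(re.findall(r"\\left\b", latex)) != len(re.findall(r"\\right\b", latex)):
        return False
    # \begin{env} / \end{env} must nest like a stack.
    stack = []
    for kind, env in re.findall(r"\\(begin|end)\{([^}]*)\}", latex):
        if kind == "begin":
            stack.append(env)
        elif not stack or stack.pop() != env:
            return False
    return not stack

print(looks_well_formed(r"$$\int_{0}^{1}x^{2}\,dx$$"))  # True
print(looks_well_formed(r"\frac{1}{2"))                 # False
```

A check like this only rejects structurally broken output; verifying that the formula *means* what the user said still requires a human in the loop.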
## Training approach (experimental study)
We evaluated multiple training configurations to understand what improves a compact model most:
1. **LoRA fine-tuning**: rapid iteration and capability checks
2. **Full-parameter fine-tuning (FPFT)**: to measure the performance ceiling (LoRA often underperformed FPFT)
3. **Two-stage pipeline (continued pretraining → FPFT)** inspired by CodeT5+ training recipes:
- **Stage 1: domain-adaptive continued pretraining** on *TeXTeller* with span-denoising + causal LM objectives
- ~4B tokens, ~76k steps
- **Stage 2: supervised FPFT** on Speech2LaTeX text→LaTeX pairs
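Stage 1's span-denoising objective follows the T5-family recipe of masking contiguous spans and training the model to reconstruct them behind sentinel tokens. A simplified sketch of the input/target construction (the span count, span length, and sentinel naming here are illustrative assumptions, not the project's actual preprocessing):

```python
import random

def span_corrupt(tokens, rng, n_spans=2, span_len=2):
    """T5-style span corruption (simplified): replace a few fixed-length
    spans with sentinel tokens; the target reproduces the dropped spans."""
    starts = sorted(rng.sample(range(0, len(tokens) - span_len), n_spans))
    # Discard overlapping picks for simplicity.
    kept, last_end = [], -1
    for s in starts:
        if s > last_end:
            kept.append(s)
            last_end = s + span_len - 1
    inp, tgt, pos = [], [], 0
    for i, s in enumerate(kept):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[pos:s])
        inp.append(sentinel)          # corrupted input keeps a placeholder
        tgt.append(sentinel)          # target names the span it restores
        tgt.extend(tokens[s:s + span_len])
        pos = s + span_len
    inp.extend(tokens[pos:])
    return inp, tgt

rng = random.Random(0)
toks = r"\int _ { 0 } ^ { 1 } x ^ { 2 } \, d x".split()
inp, tgt = span_corrupt(toks, rng)
print(" ".join(inp))
print(" ".join(tgt))
```

In the real pipeline this runs over tokenizer IDs rather than whitespace-split strings, and the causal-LM objective is mixed in alongside it.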
## Experiment Results
- **Full-parameter fine-tuning (FPFT) was the largest single driver of gains** in our experiments. In our report, **FPFT CodeT5+ 220M reached EM 0.467**, roughly **4×** the **Qwen2.5-Coder 32B Instruct** baseline (**EM 0.121**) under the same evaluation setup.
- **On the main Speech2LaTeX (S2L) benchmark, FPFT CodeT5+ 220M outperformed a larger 0.5B FPFT baseline** in our report, indicating that **training regime and architecture** can matter more than parameter count for this task.
- **Stage 1 (domain-adaptive continued pretraining) primarily improved robustness** rather than average-case performance: it **did not materially change EM on the main S2L test set** (e.g., **0.467** vs **0.463**), but helped more on stress conditions.
- **On MathBridge stress tests, CodeT5+ 220M with Stage 1 + FPFT closely matched a much larger 3B comparator** on long-context and long-target subsets, and outperformed the model with only FPFT.
**Main benchmark (Speech2LaTeX test set)**
- Qwen2.5-Coder 32B: EM 0.121
- FPFT Qwen2.5-Coder 0.5B: EM 0.405
- Stage 1 + FPFT CodeT5+ 220M: EM 0.463
- FPFT CodeT5+ 220M: EM 0.467
- FPFT Qwen2.5-Coder 3B: EM 0.507
**Stress tests (MathBridge subsets)**
- **Long-context inputs** (source length > 115 chars):
- FPFT CodeT5+ 220M: EM 0.150
- Stage 1 + FPFT CodeT5+ 220M: EM 0.195
- FPFT Qwen2.5-Coder 3B: EM 0.209
- **Long-target outputs** (target length > 60 chars):
- FPFT CodeT5+ 220M: EM 0.049
- FPFT Qwen2.5-Coder 3B: EM 0.070
- Stage 1 + FPFT CodeT5+ 220M: EM 0.076
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "duanxianpi/IntelliTeX"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Build the prompt: task instruction followed by the spoken-style description.
text = "the integral from zero to one of x squared dx"
prompt = f"Convert natural-language math into a STRICT LaTeX equation\n{text}"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_length=512)
print(tok.decode(out[0], skip_special_tokens=True))
# Output: $$\int_{0}^{1}x^{2}\,dx$$
```
### Running on Transformers.js
A live, in-browser demonstration using the `transformers.js` library showcases the model's inference-speed advantage over a larger baseline on typical CPU hardware.
| Model | Demo |
| :--- | :---: |
| **IntelliTeX** | ![our](https://cdn-uploads.huggingface.co/production/uploads/68661fe1b3d1359fb3442418/1DEmNl2ZdvQf9GqSBDkuj.gif) |
| **Qwen2.5-Coder-0.5B-Instruct** | ![qwen](https://cdn-uploads.huggingface.co/production/uploads/68661fe1b3d1359fb3442418/4uTwYt1kPy8m8b0OcfW7b.gif) |
## Full Evaluation Results
### 1. Comprehensive performance on the S2L test dataset (2745 samples)
| Model Architecture | Method | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :--- | :---: | :---: | :---: | :---: |
| **SmolLM2 (135M)** | Base | 0.005 | 0.790 | 42.40 | 0.743 |
| | Base + Grammar | 0.011 | 0.822 | 7.90 | 0.279 |
| | LoRA | 0.126 | 0.957 | 0.90 | 0.823 |
| | LoRA + Grammar | 0.127 | 0.957 | 0.91 | 0.824 |
| **SmolLM2 (360M)** | Base | 0.107 | 0.695 | 9.38 | 0.802 |
| | Base + Grammar | 0.142 | 0.760 | 10.00 | 0.812 |
| | LoRA | 0.242 | 0.980 | 0.49 | 0.861 |
| | LoRA + Grammar | 0.243 | 0.980 | 0.49 | 0.862 |
| **CodeT5+ (220M)** | Base | 0.000 | 0.921 | 96.01 | 0.725 |
| | LoRA | 0.258 | 0.913 | 0.39 | 0.874 |
| | FPFT | 0.467 | 0.982 | 0.22 | 0.912 |
| | **Stage 1 + FPFT (IntelliTeX)** | 0.463 | 0.998 | 0.22 | 0.915 |
| **Qwen2.5-Coder (0.5B)** | Base | 0.161 | 0.974 | 1.27 | 0.830 |
| | Base + Grammar | 0.160 | 0.978 | 1.27 | 0.831 |
| | LoRA | 0.155 | 0.909 | 2.71 | 0.836 |
| | LoRA + Grammar | 0.155 | 0.967 | 1.75 | 0.838 |
| | FPFT | 0.405 | 0.990 | 0.24 | 0.902 |
| **Qwen2.5-Coder (3B)** | Base | 0.294 | 0.991 | 0.46 | 0.869 |
| | Base + Grammar | 0.293 | 0.996 | 0.45 | 0.870 |
| | FPFT | 0.507 | 0.997 | 0.18 | 0.919 |
| **Qwen2.5-Coder (32B)** | Base | 0.121 | 1.000 | 0.38 | 0.863 |
*Note: EM = Exact Match, CR = Compilable Rate, CER = Character Error Rate. Base = Original Instruct Model, Grammar = Structured Decoding, Stage 1 = Domain-Adaptive Pre-training.*
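The EM and CER columns can be computed with a short evaluation sketch. The helpers below are illustrative (`exact_match` applies only trivial whitespace normalization; the report's exact normalization may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def exact_match(pred: str, ref: str) -> bool:
    # EM after collapsing whitespace runs.
    return " ".join(pred.split()) == " ".join(ref.split())

def cer(pred: str, ref: str) -> float:
    # Character Error Rate: edit distance over reference length.
    return levenshtein(pred, ref) / max(1, len(ref))

print(exact_match(r"\int_0^1 x^2 dx", r"\int_0^1  x^2 dx"))  # True
print(round(cer(r"x^{2}", r"x^{3}"), 2))  # 0.2
```

Compilable Rate (CR) additionally requires running each prediction through a LaTeX compiler, which is omitted here.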
### 2. Stress Test Analysis
**Performance on Long Context Inputs (Source > 115 chars)**
*Demonstrates the model's ability to understand lengthy natural language descriptions.*
| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :---: | :---: | :---: | :---: |
| CodeT5+ (220M) | 0.150 | 0.967 | 0.219 | 0.868 |
| **IntelliTeX (Stage 1 + FPFT)** | 0.195 | 0.997 | 0.211 | 0.873 |
| Qwen2.5-Coder (0.5B) | 0.129 | 0.976 | 0.292 | 0.859 |
| Qwen2.5-Coder (3B) | 0.209 | 0.996 | 0.199 | 0.874 |
**Performance on Long Sequence Generation (Target > 60 chars)**
*Demonstrates the model's ability to generate complex, long LaTeX formulas.*
| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :---: | :---: | :---: | :---: |
| CodeT5+ (220M) | 0.049 | 0.940 | 0.297 | 0.827 |
| **IntelliTeX (Stage 1 + FPFT)** | 0.076 | 0.991 | 0.312 | 0.828 |
| Qwen2.5-Coder (0.5B) | 0.037 | 0.967 | 0.394 | 0.816 |
| Qwen2.5-Coder (3B) | 0.070 | 0.988 | 0.350 | 0.822 |