---
license: mit
datasets:
- OleehyO/latex-formulas-80M
- marsianin500/Speech2Latex
- Kyudan/MathBridge
language:
- en
metrics:
- cer
- exact_match
- bleu
base_model:
- Salesforce/codet5p-220m
pipeline_tag: text-generation
tags:
- LaTeX
- math
- text-to-LaTeX
- translation
---

# IntelliTeX: Natural Language → LaTeX (Experimental)

## Model summary

**IntelliTeX** is an experimental Small Language Model (SLM) study for **converting English, spoken-style math descriptions into a single LaTeX equation**. It is intended as a research artifact (training regimes, decoding constraints, stress tests), not a production-ready LaTeX authoring system.

- **Base model:** `Salesforce/codet5p-220m` (CodeT5+ 220M)
- **Primary task:** text → LaTeX equation generation (single equation output)
- **Primary language:** English

## What the model is for

**Intended use**

- Drafting LaTeX equations from short natural-language descriptions
- Prototyping or benchmarking compact models on domain-specific translation

**Not recommended**

- Fully automated formula generation without verification

## Training approach (experimental study)

We evaluated multiple training configurations to understand what improves a compact model most:

1. **LoRA fine-tuning**: rapid iteration and capability checks
2. **Full-parameter fine-tuning (FPFT)**: to measure the performance ceiling (LoRA often underperformed FPFT)
3. **Two-stage pipeline (continued pretraining → FPFT)**, inspired by CodeT5+ training recipes:
   - **Stage 1: domain-adaptive continued pretraining** on *TeXTeller* with span-denoising + causal LM objectives (~4B tokens, ~76k steps)
   - **Stage 2: supervised FPFT** on Speech2LaTeX text→LaTeX pairs

## Experiment Results

- **Full-parameter fine-tuning (FPFT) was the largest single driver of gains** in our experiments. In our report, **FPFT CodeT5+ 220M reached EM 0.467**, roughly **4×** higher than **Qwen2.5-Coder 32B Instruct (EM 0.121)** under the same evaluation setup.
- **On the main Speech2LaTeX (S2L) benchmark, FPFT CodeT5+ 220M outperformed a larger 0.5B FPFT baseline** in our report, indicating that **training regime and architecture** can matter more than parameter count for this task.
- **Stage 1 (domain-adaptive continued pretraining) primarily improved robustness** rather than average-case performance: it **did not materially change EM on the main S2L test set** (0.463 with Stage 1 vs 0.467 without), but helped more under stress conditions.
- **On MathBridge stress tests, CodeT5+ 220M with Stage 1 + FPFT closely matched a much larger 3B comparator** on the long-context and long-target subsets, and outperformed the FPFT-only model.

**Main benchmark (Speech2LaTeX test set)**

- Qwen2.5-Coder 32B: EM 0.121
- FPFT Qwen2.5-Coder 0.5B: EM 0.405
- Stage 1 + FPFT CodeT5+ 220M: EM 0.463
- FPFT CodeT5+ 220M: EM 0.467
- FPFT Qwen2.5-Coder 3B: EM 0.507

**Stress tests (MathBridge subsets)**

- **Long-context inputs** (source length > 115 chars):
  - FPFT CodeT5+ 220M: EM 0.150
  - Stage 1 + FPFT CodeT5+ 220M: EM 0.195
  - FPFT Qwen2.5-Coder 3B: EM 0.209
- **Long-target outputs** (target length > 60 chars):
  - FPFT CodeT5+ 220M: EM 0.049
  - FPFT Qwen2.5-Coder 3B: EM 0.070
  - Stage 1 + FPFT CodeT5+ 220M: EM 0.076

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "duanxianpi/IntelliTeX"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "the integral from zero to one of x squared dx"
prompt = f"Convert natural-language math into a STRICT LaTeX equation\n{text}"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_length=512)
print(tok.decode(out[0], skip_special_tokens=True))
# Output: $$\int_{0}^{1}x^{2}\,dx$$
```

### Running on Transformers.js

A live, in-browser demonstration built with the `transformers.js` library showcases this efficiency advantage on typical CPU hardware.
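To put a rough local number on the same CPU-speed claim, the sketch below times generation through the Python API (the `time_generate` helper is introduced here purely for illustration; absolute timings depend on your hardware, and the browser demo runs on `transformers.js` rather than this code):

```python
import time

def time_generate(generate_fn, n_runs=3):
    """Average wall-clock seconds over n_runs calls to generate_fn."""
    elapsed = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate_fn()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)

if __name__ == "__main__":
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_id = "duanxianpi/IntelliTeX"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    text = "the integral from zero to one of x squared dx"
    prompt = f"Convert natural-language math into a STRICT LaTeX equation\n{text}"
    inputs = tok(prompt, return_tensors="pt")

    # One warm-up call, then timed runs; numbers vary heavily by CPU.
    model.generate(**inputs, max_length=512)
    mean_s = time_generate(lambda: model.generate(**inputs, max_length=512))
    print(f"mean generation time: {mean_s:.2f}s")
```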
| Model | Demo |
| :--- | :---: |
| **IntelliTeX** | ![our](https://cdn-uploads.huggingface.co/production/uploads/68661fe1b3d1359fb3442418/1DEmNl2ZdvQf9GqSBDkuj.gif) |
| **Qwen2.5-Coder-0.5B-Instruct** | ![qwen](https://cdn-uploads.huggingface.co/production/uploads/68661fe1b3d1359fb3442418/4uTwYt1kPy8m8b0OcfW7b.gif) |

## Full Evaluation Results

### 1. Comprehensive performance on the S2L test dataset (2745 samples)

| Model Architecture | Method | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :--- | :---: | :---: | :---: | :---: |
| **SmolLM2 (135M)** | Base | 0.005 | 0.790 | 42.40 | 0.743 |
| | Base + Grammar | 0.011 | 0.822 | 7.90 | 0.279 |
| | LoRA | 0.126 | 0.957 | 0.90 | 0.823 |
| | LoRA + Grammar | 0.127 | 0.957 | 0.91 | 0.824 |
| **SmolLM2 (360M)** | Base | 0.107 | 0.695 | 9.38 | 0.802 |
| | Base + Grammar | 0.142 | 0.760 | 10.00 | 0.812 |
| | LoRA | 0.242 | 0.980 | 0.49 | 0.861 |
| | LoRA + Grammar | 0.243 | 0.980 | 0.49 | 0.862 |
| **CodeT5+ (220M)** | Base | 0.000 | 0.921 | 96.01 | 0.725 |
| | LoRA | 0.258 | 0.913 | 0.39 | 0.874 |
| | FPFT | 0.467 | 0.982 | 0.22 | 0.912 |
| | **Stage 1 + FPFT (IntelliTeX)** | 0.463 | 0.998 | 0.22 | 0.915 |
| **Qwen2.5-Coder (0.5B)** | Base | 0.161 | 0.974 | 1.27 | 0.830 |
| | Base + Grammar | 0.160 | 0.978 | 1.27 | 0.831 |
| | LoRA | 0.155 | 0.909 | 2.71 | 0.836 |
| | LoRA + Grammar | 0.155 | 0.967 | 1.75 | 0.838 |
| | FPFT | 0.405 | 0.990 | 0.24 | 0.902 |
| **Qwen2.5-Coder (3B)** | Base | 0.294 | 0.991 | 0.46 | 0.869 |
| | Base + Grammar | 0.293 | 0.996 | 0.45 | 0.870 |
| | FPFT | 0.507 | 0.997 | 0.18 | 0.919 |
| **Qwen2.5-Coder (32B)** | Base | 0.121 | 1.000 | 0.38 | 0.863 |

*Note: EM = Exact Match, CR = Compilable Rate, CER = Character Error Rate. Base = original instruct model, Grammar = structured decoding, Stage 1 = domain-adaptive continued pretraining.*

### 2. Stress Test Analysis

**Performance on Long-Context Inputs (source > 115 chars)**

*Demonstrates the model's ability to understand lengthy natural-language descriptions.*

| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :---: | :---: | :---: | :---: |
| CodeT5+ (220M) | 0.150 | 0.967 | 0.219 | 0.868 |
| **IntelliTeX (Stage 1 + FPFT)** | 0.195 | 0.997 | 0.211 | 0.873 |
| Qwen2.5-Coder (0.5B) | 0.129 | 0.976 | 0.292 | 0.859 |
| Qwen2.5-Coder (3B) | 0.209 | 0.996 | 0.199 | 0.874 |

**Performance on Long-Sequence Generation (target > 60 chars)**

*Demonstrates the model's ability to generate complex, long LaTeX formulas.*

| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :---: | :---: | :---: | :---: |
| CodeT5+ (220M) | 0.049 | 0.940 | 0.297 | 0.827 |
| **IntelliTeX (Stage 1 + FPFT)** | 0.076 | 0.991 | 0.312 | 0.828 |
| Qwen2.5-Coder (0.5B) | 0.037 | 0.967 | 0.394 | 0.816 |
| Qwen2.5-Coder (3B) | 0.070 | 0.988 | 0.350 | 0.822 |
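The EM and CER columns above can be approximated offline. A minimal sketch, assuming EM means string equality after whitespace normalization and CER means character-level Levenshtein distance divided by reference length (the report's exact normalization and tokenization may differ):

```python
def exact_match(pred: str, ref: str) -> bool:
    # Compare predictions after stripping all whitespace.
    norm = lambda s: "".join(s.split())
    return norm(pred) == norm(ref)

def cer(pred: str, ref: str) -> float:
    # Character-level Levenshtein distance (single-row DP), divided by
    # the reference length.
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)

print(exact_match(r"$$\int_{0}^{1}x^{2}\,dx$$",
                  r"$$\int_{0}^{1} x^{2}\,dx$$"))  # True
print(round(cer("abc", "abd"), 3))  # 0.333
```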