---
license: mit
datasets:
- OleehyO/latex-formulas-80M
- marsianin500/Speech2Latex
- Kyudan/MathBridge
language:
- en
metrics:
- cer
- exact_match
- bleu
base_model:
- Salesforce/codet5p-220m
pipeline_tag: text2text-generation
tags:
- LaTeX
- math
- text-to-LaTeX
- translation
---

# IntelliTeX: Natural Language → LaTeX (Experimental)

## Model summary

**IntelliTeX** is an experimental Small Language Model (SLM) study for **converting English, spoken-style math descriptions into a single LaTeX equation**. It is intended as a research artifact (training regimes, decoding constraints, stress tests), not a production-ready LaTeX authoring system.

- **Base model:** `Salesforce/codet5p-220m` (CodeT5+ 220M)
- **Primary task:** text → LaTeX equation generation (single-equation output)
- **Primary language:** English

## What the model is for

**Intended use**
- Drafting LaTeX equations from short natural-language descriptions
- Prototyping or benchmarking compact models on domain-specific translation

**Not recommended**
- Fully automated formula generation without verification

## Training approach (experimental study)

We evaluated multiple training configurations to understand what improves a compact model most:

1. **LoRA fine-tuning:** rapid iteration and capability checks
2. **Full-parameter fine-tuning (FPFT):** to measure the performance ceiling (LoRA often underperformed FPFT)
3. **Two-stage pipeline (continued pretraining → FPFT)**, inspired by CodeT5+ training recipes; a rough sketch of the Stage 2 setup follows this list:
   - **Stage 1: domain-adaptive continued pretraining** on *TeXTeller* with span-denoising + causal LM objectives (~4B tokens, ~76k steps)
   - **Stage 2: supervised FPFT** on Speech2LaTeX text→LaTeX pairs
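
As a rough illustration of the Stage 2 (supervised FPFT) setup, the sketch below fine-tunes CodeT5+ on text→LaTeX pairs with Hugging Face `transformers`. The column names and hyperparameters are assumptions for illustration, not the exact recipe from the report.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tok = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5p-220m")

# Same prompt prefix as at inference time (see Usage below).
PROMPT = "Convert natural-language math into a STRICT LaTeX equation\n"

# Hypothetical column names; check the Speech2Latex card for the real schema.
ds = load_dataset("marsianin500/Speech2Latex", split="train")

def preprocess(batch):
    enc = tok([PROMPT + t for t in batch["text"]], max_length=256, truncation=True)
    enc["labels"] = tok(
        text_target=batch["latex"], max_length=256, truncation=True
    )["input_ids"]
    return enc

ds = ds.map(preprocess, batched=True, remove_columns=ds.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="codet5p-220m-fpft",
    per_device_train_batch_size=16,  # illustrative, not the reported setting
    learning_rate=5e-5,              # illustrative, not the reported setting
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```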

## Experiment Results

- **Full-parameter fine-tuning (FPFT) was the largest single driver of gains** in our experiments. In our report, **FPFT CodeT5+ 220M reached EM 0.467**, roughly **4×** the EM of **Qwen2.5-Coder 32B Instruct (0.121)** under the same evaluation setup.
- **On the main Speech2LaTeX (S2L) benchmark, FPFT CodeT5+ 220M (EM 0.467) outperformed the larger FPFT Qwen2.5-Coder 0.5B baseline (EM 0.405)** in our report, indicating that **training regime and architecture** can matter more than parameter count for this task.
- **Stage 1 (domain-adaptive continued pretraining) primarily improved robustness** rather than average-case performance: it left EM on the main S2L test set essentially unchanged (**0.463** with Stage 1 vs **0.467** without), but helped more under stress conditions.
- **On the MathBridge stress tests, CodeT5+ 220M with Stage 1 + FPFT closely matched a much larger 3B comparator** on the long-context and long-target subsets, and outperformed the FPFT-only variant on both.

**Main benchmark (Speech2LaTeX test set)**
- Qwen2.5-Coder 32B: EM 0.121
- FPFT Qwen2.5-Coder 0.5B: EM 0.405
- Stage 1 + FPFT CodeT5+ 220M: EM 0.463
- FPFT CodeT5+ 220M: EM 0.467
- FPFT Qwen2.5-Coder 3B: EM 0.507

**Stress tests (MathBridge subsets)**
- **Long-context inputs** (source length > 115 chars):
  - FPFT CodeT5+ 220M: EM 0.150
  - Stage 1 + FPFT CodeT5+ 220M: EM 0.195
  - FPFT Qwen2.5-Coder 3B: EM 0.209
- **Long-target outputs** (target length > 60 chars):
  - FPFT CodeT5+ 220M: EM 0.049
  - FPFT Qwen2.5-Coder 3B: EM 0.070
  - Stage 1 + FPFT CodeT5+ 220M: EM 0.076

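These stress subsets are plain length filters over MathBridge. A minimal sketch of how they could be reconstructed (the split and column names here are assumptions about the dataset schema, not confirmed by the report):

```python
from datasets import load_dataset

# Hypothetical split and column names; check the Kyudan/MathBridge card for the real schema.
mb = load_dataset("Kyudan/MathBridge", split="train")

long_context = mb.filter(lambda ex: len(ex["spoken_English"]) > 115)  # source > 115 chars
long_target = mb.filter(lambda ex: len(ex["equation"]) > 60)          # target > 60 chars
```
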
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "duanxianpi/IntelliTeX"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "the integral from zero to one of x squared dx"
prompt = f"Convert natural-language math into a STRICT LaTeX equation\n{text}"

inputs = tok(prompt, return_tensors="pt")

out = model.generate(
    **inputs,
    max_length=512,
)

print(tok.decode(out[0], skip_special_tokens=True))
# Output: $$\int_{0}^{1}x^{2}\,dx$$
```
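
Since the card recommends against fully automated use without verification, a lightweight compile check can gate the output. This is a minimal sketch assuming a local `pdflatex` install; it is not part of the model's API.

```python
import pathlib
import subprocess
import tempfile

def compiles(latex_eq: str) -> bool:
    """Return True if the generated equation compiles in a minimal LaTeX document."""
    doc = "\\documentclass{article}\n\\begin{document}\n%s\n\\end{document}\n" % latex_eq
    with tempfile.TemporaryDirectory() as tmp:
        (pathlib.Path(tmp) / "eq.tex").write_text(doc)
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "eq.tex"],
            cwd=tmp,
            capture_output=True,
        )
        return result.returncode == 0

print(compiles(r"$$\int_{0}^{1}x^{2}\,dx$$"))  # True if pdflatex is installed
```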

### Running on Transformers.js

A live, in-browser demo built with the `transformers.js` library showcases the compact model's speed advantage on typical CPU hardware.

| Model | In-browser demo |
| :--- | :---: |
| **IntelliTeX** |  |
| **Qwen2.5-Coder-0.5B-Instruct** |  |

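Transformers.js consumes ONNX weights. One way to produce them from Python is via `optimum`; this is a sketch assuming the default seq2seq export works for this checkpoint, not a documented part of this release.

```python
# Export the checkpoint to ONNX so it can be served in the browser.
from optimum.onnxruntime import ORTModelForSeq2SeqLM

ort_model = ORTModelForSeq2SeqLM.from_pretrained("duanxianpi/IntelliTeX", export=True)
ort_model.save_pretrained("intellitex-onnx")
```
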
## Full Evaluation Results

### 1. Comprehensive performance on the S2L test dataset (2,745 samples)

| Model Architecture | Method | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :--- | :---: | :---: | :---: | :---: |
| **SmolLM2 (135M)** | Base | 0.005 | 0.790 | 42.40 | 0.743 |
| | Base + Grammar | 0.011 | 0.822 | 7.90 | 0.279 |
| | LoRA | 0.126 | 0.957 | 0.90 | 0.823 |
| | LoRA + Grammar | 0.127 | 0.957 | 0.91 | 0.824 |
| **SmolLM2 (360M)** | Base | 0.107 | 0.695 | 9.38 | 0.802 |
| | Base + Grammar | 0.142 | 0.760 | 10.00 | 0.812 |
| | LoRA | 0.242 | 0.980 | 0.49 | 0.861 |
| | LoRA + Grammar | 0.243 | 0.980 | 0.49 | 0.862 |
| **CodeT5+ (220M)** | Base | 0.000 | 0.921 | 96.01 | 0.725 |
| | LoRA | 0.258 | 0.913 | 0.39 | 0.874 |
| | FPFT | 0.467 | 0.982 | 0.22 | 0.912 |
| | **Stage 1 + FPFT (IntelliTeX)** | 0.463 | 0.998 | 0.22 | 0.915 |
| **Qwen2.5-Coder (0.5B)** | Base | 0.161 | 0.974 | 1.27 | 0.830 |
| | Base + Grammar | 0.160 | 0.978 | 1.27 | 0.831 |
| | LoRA | 0.155 | 0.909 | 2.71 | 0.836 |
| | LoRA + Grammar | 0.155 | 0.967 | 1.75 | 0.838 |
| | FPFT | 0.405 | 0.990 | 0.24 | 0.902 |
| **Qwen2.5-Coder (3B)** | Base | 0.294 | 0.991 | 0.46 | 0.869 |
| | Base + Grammar | 0.293 | 0.996 | 0.45 | 0.870 |
| | FPFT | 0.507 | 0.997 | 0.18 | 0.919 |
| **Qwen2.5-Coder (32B)** | Base | 0.121 | 1.000 | 0.38 | 0.863 |

*Note: EM = Exact Match, CR = Compilable Rate, CER = Character Error Rate. Base = original instruct model, Grammar = structured decoding, Stage 1 = domain-adaptive continued pretraining.*

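For reference, a minimal sketch of how EM and CER can be computed from prediction/reference pairs. The whitespace-insensitive normalization here is an assumption; the report's exact protocol may differ.

```python
def exact_match(pred: str, ref: str) -> bool:
    # Whitespace-insensitive string equality; stricter or looser
    # LaTeX normalization would change the score.
    return "".join(pred.split()) == "".join(ref.split())

def cer(pred: str, ref: str) -> float:
    # Character Error Rate: Levenshtein edit distance / reference length.
    m, n = len(pred), len(ref)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(
                row[j] + 1,                          # deletion
                row[j - 1] + 1,                      # insertion
                prev + (pred[i - 1] != ref[j - 1]),  # substitution / match
            )
            prev = cur
    return row[n] / max(n, 1)

print(exact_match(r"\int_{0}^{1}x^{2}\,dx", r"\int_{0}^{1} x^{2} \, dx"))  # True
print(cer("x^{2}", "x^{3}"))  # 0.2
```
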
### 2. Stress Test Analysis

**Performance on Long-Context Inputs (source > 115 chars)**

*Demonstrates the model's ability to understand lengthy natural-language descriptions.*

| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :---: | :---: | :---: | :---: |
| CodeT5+ (220M) | 0.150 | 0.967 | 0.219 | 0.868 |
| **IntelliTeX (Stage 1 + FPFT)** | 0.195 | 0.997 | 0.211 | 0.873 |
| Qwen2.5-Coder (0.5B) | 0.129 | 0.976 | 0.292 | 0.859 |
| Qwen2.5-Coder (3B) | 0.209 | 0.996 | 0.199 | 0.874 |

**Performance on Long-Sequence Generation (target > 60 chars)**

*Demonstrates the model's ability to generate complex, long LaTeX formulas.*

| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :---: | :---: | :---: | :---: |
| CodeT5+ (220M) | 0.049 | 0.940 | 0.297 | 0.827 |
| **IntelliTeX (Stage 1 + FPFT)** | 0.076 | 0.991 | 0.312 | 0.828 |
| Qwen2.5-Coder (0.5B) | 0.037 | 0.967 | 0.394 | 0.816 |
| Qwen2.5-Coder (3B) | 0.070 | 0.988 | 0.350 | 0.822 |