---
license: mit
datasets:
- OleehyO/latex-formulas-80M
- marsianin500/Speech2Latex
- Kyudan/MathBridge
language:
- en
metrics:
- cer
- exact_match
- bleu
base_model:
- Salesforce/codet5p-220m
pipeline_tag: text2text-generation
tags:
- LaTeX
- math
- text-to-LaTeX
- translation
---

# IntelliTeX: Natural Language → LaTeX (Experimental)

## Model summary

**IntelliTeX** is an experimental Small Language Model (SLM) study for **converting English, spoken-style math descriptions into a single LaTeX equation**. It is intended as a research artifact (training regimes, decoding constraints, stress tests), not a production-ready LaTeX authoring system.

- **Base model:** `Salesforce/codet5p-220m` (CodeT5+ 220M)
- **Primary task:** text → LaTeX equation generation (single-equation output)
- **Primary language:** English

## What the model is for

**Intended use**
- Drafting LaTeX equations from short natural-language descriptions
- Prototyping or benchmarking compact models on domain-specific translation

**Not recommended**
- Fully automated formula generation without verification

## Training approach (experimental study)

We evaluated multiple training configurations to understand what improves a compact model most:

1. **LoRA fine-tuning:** rapid iteration and capability checks
2. **Full-parameter fine-tuning (FPFT):** to measure the performance ceiling (LoRA often underperformed FPFT)
3. **Two-stage pipeline (continued pretraining → FPFT)**, inspired by CodeT5+ training recipes; a rough sketch of the Stage 2 setup follows this list:
   - **Stage 1: domain-adaptive continued pretraining** on *TeXTeller* with span-denoising + causal LM objectives (~4B tokens, ~76k steps)
   - **Stage 2: supervised FPFT** on Speech2LaTeX text→LaTeX pairs
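
As a rough illustration of the Stage 2 (supervised FPFT) setup, the sketch below fine-tunes CodeT5+ on text→LaTeX pairs with Hugging Face `transformers`. The column names and hyperparameters are assumptions for illustration, not the exact recipe from the report.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tok = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5p-220m")

# Same prompt prefix as at inference time (see Usage below).
PROMPT = "Convert natural-language math into a STRICT LaTeX equation\n"

# Hypothetical column names; check the Speech2Latex card for the real schema.
ds = load_dataset("marsianin500/Speech2Latex", split="train")

def preprocess(batch):
    enc = tok([PROMPT + t for t in batch["text"]], max_length=256, truncation=True)
    enc["labels"] = tok(
        text_target=batch["latex"], max_length=256, truncation=True
    )["input_ids"]
    return enc

ds = ds.map(preprocess, batched=True, remove_columns=ds.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="codet5p-220m-fpft",
    per_device_train_batch_size=16,  # illustrative, not the reported setting
    learning_rate=5e-5,              # illustrative, not the reported setting
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```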

## Experiment Results

- **Full-parameter fine-tuning (FPFT) was the largest single driver of gains** in our experiments. In our report, **FPFT CodeT5+ 220M reached EM 0.467**, roughly **4×** the EM of **Qwen2.5-Coder 32B Instruct (0.121)** under the same evaluation setup.
- **On the main Speech2LaTeX (S2L) benchmark, FPFT CodeT5+ 220M (EM 0.467) outperformed the larger FPFT Qwen2.5-Coder 0.5B baseline (EM 0.405)** in our report, indicating that **training regime and architecture** can matter more than parameter count for this task.
- **Stage 1 (domain-adaptive continued pretraining) primarily improved robustness** rather than average-case performance: it left EM on the main S2L test set essentially unchanged (**0.463** with Stage 1 vs **0.467** without), but helped more under stress conditions.
- **On the MathBridge stress tests, CodeT5+ 220M with Stage 1 + FPFT closely matched a much larger 3B comparator** on the long-context and long-target subsets, and outperformed the FPFT-only variant on both.

**Main benchmark (Speech2LaTeX test set)**
- Qwen2.5-Coder 32B: EM 0.121
- FPFT Qwen2.5-Coder 0.5B: EM 0.405
- Stage 1 + FPFT CodeT5+ 220M: EM 0.463
- FPFT CodeT5+ 220M: EM 0.467
- FPFT Qwen2.5-Coder 3B: EM 0.507

**Stress tests (MathBridge subsets)**
- **Long-context inputs** (source length > 115 chars):
  - FPFT CodeT5+ 220M: EM 0.150
  - Stage 1 + FPFT CodeT5+ 220M: EM 0.195
  - FPFT Qwen2.5-Coder 3B: EM 0.209
- **Long-target outputs** (target length > 60 chars):
  - FPFT CodeT5+ 220M: EM 0.049
  - FPFT Qwen2.5-Coder 3B: EM 0.070
  - Stage 1 + FPFT CodeT5+ 220M: EM 0.076

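These stress subsets are plain length filters over MathBridge. A minimal sketch of how they could be reconstructed (the split and column names here are assumptions about the dataset schema, not confirmed by the report):

```python
from datasets import load_dataset

# Hypothetical split and column names; check the Kyudan/MathBridge card for the real schema.
mb = load_dataset("Kyudan/MathBridge", split="train")

long_context = mb.filter(lambda ex: len(ex["spoken_English"]) > 115)  # source > 115 chars
long_target = mb.filter(lambda ex: len(ex["equation"]) > 60)          # target > 60 chars
```
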
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "duanxianpi/IntelliTeX"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "the integral from zero to one of x squared dx"
prompt = f"Convert natural-language math into a STRICT LaTeX equation\n{text}"

inputs = tok(prompt, return_tensors="pt")

out = model.generate(
    **inputs,
    max_length=512,
)

print(tok.decode(out[0], skip_special_tokens=True))
# Output: $$\int_{0}^{1}x^{2}\,dx$$
```
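
Since the card recommends against fully automated use without verification, a lightweight compile check can gate the output. This is a minimal sketch assuming a local `pdflatex` install; it is not part of the model's API.

```python
import pathlib
import subprocess
import tempfile

def compiles(latex_eq: str) -> bool:
    """Return True if the generated equation compiles in a minimal LaTeX document."""
    doc = "\\documentclass{article}\n\\begin{document}\n%s\n\\end{document}\n" % latex_eq
    with tempfile.TemporaryDirectory() as tmp:
        (pathlib.Path(tmp) / "eq.tex").write_text(doc)
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "eq.tex"],
            cwd=tmp,
            capture_output=True,
        )
        return result.returncode == 0

print(compiles(r"$$\int_{0}^{1}x^{2}\,dx$$"))  # True if pdflatex is installed
```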

### Running on Transformers.js

A live, in-browser demo built with the `transformers.js` library showcases the compact model's speed advantage on typical CPU hardware.

| Model | In-browser demo |
| :--- | :---: |
| **IntelliTeX** |  |
| **Qwen2.5-Coder-0.5B-Instruct** |  |

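Transformers.js consumes ONNX weights. One way to produce them from Python is via `optimum`; this is a sketch assuming the default seq2seq export works for this checkpoint, not a documented part of this release.

```python
# Export the checkpoint to ONNX so it can be served in the browser.
from optimum.onnxruntime import ORTModelForSeq2SeqLM

ort_model = ORTModelForSeq2SeqLM.from_pretrained("duanxianpi/IntelliTeX", export=True)
ort_model.save_pretrained("intellitex-onnx")
```
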
## Full Evaluation Results

### 1. Comprehensive performance on the S2L test dataset (2,745 samples)

| Model Architecture | Method | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :--- | :---: | :---: | :---: | :---: |
| **SmolLM2 (135M)** | Base | 0.005 | 0.790 | 42.40 | 0.743 |
| | Base + Grammar | 0.011 | 0.822 | 7.90 | 0.279 |
| | LoRA | 0.126 | 0.957 | 0.90 | 0.823 |
| | LoRA + Grammar | 0.127 | 0.957 | 0.91 | 0.824 |
| **SmolLM2 (360M)** | Base | 0.107 | 0.695 | 9.38 | 0.802 |
| | Base + Grammar | 0.142 | 0.760 | 10.00 | 0.812 |
| | LoRA | 0.242 | 0.980 | 0.49 | 0.861 |
| | LoRA + Grammar | 0.243 | 0.980 | 0.49 | 0.862 |
| **CodeT5+ (220M)** | Base | 0.000 | 0.921 | 96.01 | 0.725 |
| | LoRA | 0.258 | 0.913 | 0.39 | 0.874 |
| | FPFT | 0.467 | 0.982 | 0.22 | 0.912 |
| | **Stage 1 + FPFT (IntelliTeX)** | 0.463 | 0.998 | 0.22 | 0.915 |
| **Qwen2.5-Coder (0.5B)** | Base | 0.161 | 0.974 | 1.27 | 0.830 |
| | Base + Grammar | 0.160 | 0.978 | 1.27 | 0.831 |
| | LoRA | 0.155 | 0.909 | 2.71 | 0.836 |
| | LoRA + Grammar | 0.155 | 0.967 | 1.75 | 0.838 |
| | FPFT | 0.405 | 0.990 | 0.24 | 0.902 |
| **Qwen2.5-Coder (3B)** | Base | 0.294 | 0.991 | 0.46 | 0.869 |
| | Base + Grammar | 0.293 | 0.996 | 0.45 | 0.870 |
| | FPFT | 0.507 | 0.997 | 0.18 | 0.919 |
| **Qwen2.5-Coder (32B)** | Base | 0.121 | 1.000 | 0.38 | 0.863 |

*Note: EM = Exact Match, CR = Compilable Rate, CER = Character Error Rate. Base = original instruct model, Grammar = structured decoding, Stage 1 = domain-adaptive continued pretraining.*

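For reference, a minimal sketch of how EM and CER can be computed from prediction/reference pairs. The whitespace-insensitive normalization here is an assumption; the report's exact protocol may differ.

```python
def exact_match(pred: str, ref: str) -> bool:
    # Whitespace-insensitive string equality; stricter or looser
    # LaTeX normalization would change the score.
    return "".join(pred.split()) == "".join(ref.split())

def cer(pred: str, ref: str) -> float:
    # Character Error Rate: Levenshtein edit distance / reference length.
    m, n = len(pred), len(ref)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(
                row[j] + 1,                          # deletion
                row[j - 1] + 1,                      # insertion
                prev + (pred[i - 1] != ref[j - 1]),  # substitution / match
            )
            prev = cur
    return row[n] / max(n, 1)

print(exact_match(r"\int_{0}^{1}x^{2}\,dx", r"\int_{0}^{1} x^{2} \, dx"))  # True
print(cer("x^{2}", "x^{3}"))  # 0.2
```
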
### 2. Stress Test Analysis

**Performance on Long-Context Inputs (source > 115 chars)**

*Demonstrates the model's ability to understand lengthy natural-language descriptions.*

| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :---: | :---: | :---: | :---: |
| CodeT5+ (220M) | 0.150 | 0.967 | 0.219 | 0.868 |
| **IntelliTeX (Stage 1 + FPFT)** | 0.195 | 0.997 | 0.211 | 0.873 |
| Qwen2.5-Coder (0.5B) | 0.129 | 0.976 | 0.292 | 0.859 |
| Qwen2.5-Coder (3B) | 0.209 | 0.996 | 0.199 | 0.874 |

**Performance on Long-Sequence Generation (target > 60 chars)**

*Demonstrates the model's ability to generate complex, long LaTeX formulas.*

| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| :--- | :---: | :---: | :---: | :---: |
| CodeT5+ (220M) | 0.049 | 0.940 | 0.297 | 0.827 |
| **IntelliTeX (Stage 1 + FPFT)** | 0.076 | 0.991 | 0.312 | 0.828 |
| Qwen2.5-Coder (0.5B) | 0.037 | 0.967 | 0.394 | 0.816 |
| Qwen2.5-Coder (3B) | 0.070 | 0.988 | 0.350 | 0.822 |