# Math2Visual: Visual Language Generation Model

This is the official model for generating **structured visual language** representations from math word problems, as proposed in:

**[ACL 2025 Findings Paper: Math2Visual](https://arxiv.org/abs/2506.03735)**

**[Project Video](https://youtu.be/jdPYVoHEPtk)**

**[Annotated Visual Language and Visual Dataset](https://huggingface.co/datasets/junling24/Math2Visual-Generating_Pedagogically_Meaningful_Visuals_for_Math_Word_Problems)**

**[GitHub Codebase](https://github.com/eth-lre/math2visual/tree/main)**

---
## Model Summary

This model takes a math word problem (MWP) and its equation (formula) as input and outputs a **visual language** string, which is then used to generate pedagogically meaningful visuals. The output follows a fixed, teacher-informed structure that describes the key mathematical relationships between entities, containers, and operations.

It is built by fine-tuning `meta-llama/Llama-3.1-8B` with LoRA using [PEFT](https://github.com/huggingface/peft), with 4-bit quantization (BitsAndBytes). The code for generating visuals from the visual language is available in our **[GitHub repository](https://github.com/eth-lre/math2visual/tree/main)**.
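
To give a rough sense of why 4-bit quantization matters for an 8B-parameter base model, the sketch below estimates the memory needed for the weights alone at different precisions. This is back-of-the-envelope arithmetic, not a measurement: the parameter count is rounded to 8 billion (an assumption), `weight_memory_gib` is a hypothetical helper, and the estimate ignores quantization constants, LoRA adapter weights, activations, and the KV cache.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Assumption: a round figure for an 8B-parameter model.
N_PARAMS = 8.0e9

for label, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the bf16 footprint, which is what makes single-GPU inference practical on smaller cards.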

---
## Example Use

### Install dependencies
```bash
pip install torch==2.5.1+cu121 torchvision==0.20.1+cu121 torchaudio==2.5.1+cu121 \
    bitsandbytes==0.45.0 inflect==7.3.1 lxml==5.3.0 ipython==8.25.0 python-dotenv==1.0.1 \
    git+https://github.com/huggingface/transformers.git@5fa35344755d8d9c29610b57d175efd03776ae9e \
    git+https://github.com/huggingface/peft.git@aa3f41f7529ed078e9225b2fc1edbb8c71f58f99
```

Use `-f https://download.pytorch.org/whl/torch_stable.html` for CUDA wheels if needed.
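
Because the install command pins exact versions, it can help to confirm what actually ended up in the environment before loading the model. The following is a minimal sketch using only the standard library; `EXPECTED` and `find_mismatches` are hypothetical names introduced here, and the Git-pinned packages (`transformers`, `peft`) are only checked for presence since their reported version strings are not stated in this README.

```python
from importlib import metadata

# Pins copied from the pip command above. None means "any version is
# accepted, the package just has to be installed".
EXPECTED = {
    "torch": "2.5.1+cu121",
    "bitsandbytes": "0.45.0",
    "inflect": "7.3.1",
    "lxml": "5.3.0",
    "python-dotenv": "1.0.1",
    "transformers": None,
    "peft": None,
}


def find_mismatches(expected):
    """Return {package: (installed, wanted)} for missing or mismatched packages."""
    problems = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed at all.
            problems[name] = (None, wanted)
            continue
        if wanted is not None and installed != wanted:
            problems[name] = (installed, wanted)
    return problems


if __name__ == "__main__":
    for name, (installed, wanted) in find_mismatches(EXPECTED).items():
        print(f"{name}: installed={installed}, expected={wanted or 'any'}")
```

If the script prints nothing, every listed package is present and the pinned ones match.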

---

### Run Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Load the base model in 4-bit (NF4) and attach the LoRA adapter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
base_model_id = "meta-llama/Llama-3.1-8B"
adapter_dir = "junling24/Math2Visual-Visual_Language_Generation"
base = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
model = PeftModel.from_pretrained(base, adapter_dir)
model.eval()
model.config.use_cache = True

# The tokenizer comes from the base model; pad on the left for generation
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
    trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)


# Prompt (the instruction text is abbreviated here with "...")
def create_prompt(mwp, formula=None):
    return (
        '''You are an expert at converting math story problem into a structured 'visual language'...'''
        f"Question: {mwp}\n"
        f"Formula: {formula}\n"
        "Answer in visual language:"
    )


mwp = "Janet has nine oranges, and Sharon has seven oranges. How many oranges do Janet and Sharon have together?"
formula = "9 + 7 = 16"
prompt = create_prompt(mwp, formula)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048, padding="max_length").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.7,
        repetition_penalty=1.15
    )

# Decode and drop the echoed prompt so only the generated text remains
visual_language = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):].strip()
print("Generated Visual Language:\n", visual_language)
```
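
The slice `[len(prompt):]` above assumes the decoded text reproduces the prompt character for character. If decoding ever changes the prompt slightly (for example, its whitespace), the slice can cut at the wrong position. The sketch below shows one more defensive way to separate the generated text; `extract_completion` and `ANSWER_MARKER` are hypothetical names introduced here, not part of the released code, and the fallback relies on the final line of the prompt template shown above.

```python
# Final line of the prompt template used in create_prompt above.
ANSWER_MARKER = "Answer in visual language:"


def extract_completion(decoded: str, prompt: str, marker: str = ANSWER_MARKER) -> str:
    """Return the generated text that follows the prompt."""
    # Fast path: the decoded text starts with the exact prompt.
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Fallback: keep whatever follows the last occurrence of the marker.
    _, found, tail = decoded.rpartition(marker)
    # If the marker is missing too, return the whole text, trimmed.
    return tail.strip() if found else decoded.strip()
```

With the variables from the inference script, this would be called as `extract_completion(tokenizer.decode(outputs[0], skip_special_tokens=True), prompt)`.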

---

## Citation

```bibtex
@inproceedings{wang2025math2visual,
  title={Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models},
  author={Wang, Junling and Rutkiewicz, Anna and Wang, April Yi and Sachan, Mrinmaya},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025},
  url={https://arxiv.org/abs/2506.03735}
}
```

---

## License & Contact

This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).

For research inquiries, please contact:

Junling Wang: wangjun [at] ethz [dot] ch