---
license: apache-2.0
language:
- en
tags:
- code
- execution
- prediction
- language-generalization
- no-compiler
- python
- javascript
- lua
- cobol
- synthetic-languages
- transformers
- qwen2
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-1.5B
library_name: transformers
---
# CaaLM/CaaLM-v1



## What is this?

CaaLM (Code as a Language Model) is a 1.5B parameter model that predicts the output of code without a compiler, runtime, or interpreter.

You give it code. It tells you what it would print.

The interesting part: it was never trained on a fixed set of languages. Instead, it was trained on real languages (Python, JavaScript, Lua, COBOL) alongside 200 synthetically generated fake programming languages, each with randomized syntax but consistent semantics. The goal was to teach the model what *execution* means, not what any specific language looks like.

This means it can predict the output of languages it has never seen before.
## Performance




**Overall: 96.2% (50/52 tests)**

| Category | Accuracy | Passed/Total |
|---|---|---|
| Real: Python | 100% | 10/10 |
| Real: JavaScript | 100% | 8/8 |
| Real: Lua | 100% | 6/6 |
| Real: COBOL | 75% | 3/4 |
| Novel Fake: Tier 1 (assign + print) | 100% | 8/8 |
| Novel Fake: Tier 2 (conditionals) | 86% | 6/7 |
| Novel Fake: Tier 3 (loops) | 100% | 4/4 |
| Edge Cases | 100% | 5/5 |
The novel fake language tests use languages that were never seen during training: completely invented syntax like `SCRIBBLE @x BECOMES 7` or `WONDER n > 10`. The model infers the semantics from context and gets them right.

### Known Failures

Two failures in the benchmark, both explainable:

- **COBOL zero-padding**: predicted `08` instead of `0008`. It got the value right but missed the `PIC 9(4)` padding format; a data-consistency issue.
- **If-without-else**: when a conditional has no else branch and the condition is false, the correct output is empty. The model predicted `NO`, hallucinating an else branch. Most training data had if/else pairs, so the model defaulted to that pattern.
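For illustration, here is a hypothetical if-without-else case in an invented syntax (not taken from the actual test set) where the correct prediction is empty output:

```
Code:
BIND n TO 5
WONDER n > 10
SHOUT YES
STOP

Output:
```

Since `n > 10` is false and there is no else branch, nothing should be printed; this is the shape of program where the model may hallucinate an else-branch output instead.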
## How It Works

Input format:

```
Code:
<your code here>

Output:
```

The model completes the `Output:` section with the predicted stdout.
### Example: Real Language

```
Code:
a = 10
b = 20
print(a + b)

Output:
30
```
### Example: Novel Fake Language (never seen during training)

```
Code:
SCRIBBLE @x BECOMES 7
SCRIBBLE @y BECOMES 3
YELL @x + @y

Output:
10
```

```
Code:
BIND n TO 15
WONDER n > 10
SHOUT YES
STOP

Output:
YES
```
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "CaaLM/CaaLM-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("CaaLM/CaaLM-v1")
model.eval()

def predict_output(code: str) -> str:
    """Predict the stdout of `code` without executing it."""
    prompt = f"Code:\n{code}\n\nOutput:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,  # greedy decoding for deterministic predictions
            pad_token_id=tokenizer.eos_token_id,
        )
    # Strip the prompt tokens; keep only the generated completion.
    return tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True,
    ).strip()

# Real language
print(predict_output("a = 6\nb = 7\nprint(a * b)"))
# → 42

# Novel fake language
print(predict_output("STORE X := 10\nSTORE Y := 5\nSPEAK X + Y"))
# → 15
```
## Training



### Data

Training data was split between real and synthetic languages:

**Real languages (8,000 examples total, 2,000 each):**
- Python: clean semantics, baseline
- JavaScript: type coercion, implicit behaviors
- Lua: minimal syntax, sparse
- COBOL: verbose, English-like, no conventional syntax markers

**Synthetic languages (120,000 examples total):**
- 200 procedurally generated fake languages
- Each language has randomized keywords, operators, variable styles, and block delimiters
- Semantics are consistent within each language, but syntax varies wildly across all 200
- Programs generated via a Python simulator, so outputs are ground truth from actual execution
- Three complexity tiers: assign + print (30%), conditionals (40%), loops (30%)

The spec for each fake language is discarded after data generation. The model only ever sees `(code, output)` pairs; it never gets a syntax guide.
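As a rough sketch of this scheme (function names and syntax pools here are illustrative, not the actual CaaLM data pipeline), a tier-1 generator might look like this:

```python
import random

def make_language(rng: random.Random) -> dict:
    """Sample a random surface syntax for one fake language."""
    return {
        "assign_kw": rng.choice(["SCRIBBLE", "BIND", "STORE", "LET"]),
        "assign_op": rng.choice(["BECOMES", "TO", ":=", "<-"]),
        "print_kw": rng.choice(["YELL", "SHOUT", "SPEAK", "EMIT"]),
        "var_prefix": rng.choice(["", "@", "$"]),
    }

def render_program(lang: dict, assigns: list, expr: tuple) -> str:
    """Render a tier-1 (assign + print) program in the sampled syntax."""
    p = lang["var_prefix"]
    lines = [
        f"{lang['assign_kw']} {p}{name} {lang['assign_op']} {value}"
        for name, value in assigns
    ]
    a, op, b = expr
    lines.append(f"{lang['print_kw']} {p}{a} {op} {p}{b}")
    return "\n".join(lines)

def simulate(assigns: list, expr: tuple) -> str:
    """Ground-truth output: execute the semantics, ignoring the syntax."""
    env = dict(assigns)
    a, op, b = expr
    ops = {"+": lambda x, y: x + y, "*": lambda x, y: x * y}
    return str(ops[op](env[a], env[b]))

rng = random.Random(0)
lang = make_language(rng)        # the spec is discarded after generation
assigns = [("x", 7), ("y", 3)]
code = render_program(lang, assigns, ("x", "+", "y"))
print(f"Code:\n{code}\n\nOutput:\n{simulate(assigns, ('x', '+', 'y'))}")
```

Because the label comes from actually executing the semantics, every training pair is ground truth no matter how alien the surface syntax looks.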
### Configuration

- **Base model:** Qwen/Qwen2.5-1.5B (base, not instruct)
- **Training method:** Full fine-tuning (no LoRA)
- **Loss masking:** Loss computed on output tokens only, not the prompt
- **Precision:** BF16
- **Optimizer:** AdamW (lr=2e-5, weight_decay=0.01)
- **Scheduler:** Cosine with 3% warmup
- **Batch size:** 8 per device × 4 gradient accumulation steps = 32 effective
- **Epochs:** 3
- **Max sequence length:** 512 tokens
- **Hardware:** NVIDIA A100 SXM4 40GB
- **Training time:** 66.5 minutes
- **Training cost:** ~$0.82
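For reference, here is a hedged reconstruction of how these settings would map onto Hugging Face `TrainingArguments`; the actual training script is not published, so treat this as illustrative of the list above rather than the original code:

```python
from transformers import TrainingArguments

# Reconstruction of the configuration listed above, not the original
# script. Output-only loss masking is handled outside TrainingArguments,
# typically by setting prompt-token labels to -100 in the data collator.
args = TrainingArguments(
    output_dir="caalm-v1",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # 8 × 4 = 32 effective batch size
    num_train_epochs=3,
    learning_rate=2e-5,              # AdamW is the default optimizer
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,               # 3% warmup
    bf16=True,
)
```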
## Supported Operations

The model reliably handles:

- Variable assignment and arithmetic
- Print / output statements
- Conditionals (if/else)
- While loops with accumulator patterns
- String output
- Basic error behavior (empty output when conditions are not met)
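For instance, the accumulator pattern below (written in a hypothetical invented syntax consistent with the earlier examples, not one of the actual training languages) is within its reach:

```
Code:
BIND total TO 0
BIND i TO 1
WHILST i < 5
BIND total TO total + i
BIND i TO i + 1
STOP
SHOUT total

Output:
10
```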
It does not handle functions, recursion, file I/O, complex data structures, pipes, or multi-line string manipulation. These may work in real languages thanks to Qwen's pretraining knowledge, but they are not guaranteed.
## Limitations

- No actual code execution: outputs are predictions, not guarantees
- If-without-else edge cases can produce hallucinated else branches
- COBOL numeric padding format is inconsistent
- Long programs (many steps) may degrade in accuracy as state complexity grows
- Novel fake languages with very unusual execution models (non-linear control flow, stack-based semantics) are untested
- The context window limits programs to ~512 tokens
## Why

The original motivation was to ask: can a language model learn what *execution* means as an abstract concept, independent of any specific language's syntax?

The novel fake language results suggest yes, at least for basic programs. The model sees `WONDER x > 10` for the first time and figures out it's a conditional. It sees `SCRIBBLE @x BECOMES 7` and figures out it's an assignment. It doesn't know these keywords; it infers them from the structure of the code and the patterns it learned during training.

Whether this scales to more complex programs, more alien execution models, or larger languages is an open question.
## Model Lineage

CaaLM-v1 is the first model in the CaaLM series, and a spiritual successor to the [LaaLM project](https://huggingface.co/LaaLM).

- **LaaLM-v1**: T5-base fine-tuned to simulate Linux shell commands (external state)
- **LaaLM-exp-v1**: Qwen 3B fine-tuned for conversational Linux terminal emulation (internal state)
- **CaaLM-v1**: Qwen 1.5B fine-tuned for language-agnostic code output prediction (current)
## License

Apache 2.0 (inherited from the Qwen 2.5 base model)