---
license: mit
tags:
- tensalang
- llm-inference
- mlir
- safetensors
language:
- en
---

# TensaLang Example Models

Example model weights for [TensaLang](https://github.com/BenChaliah/Tensa-Lang), a programming language for LLM inference.

## What is TensaLang?

TensaLang is a programming language for LLM inference. It lets you implement new models with ease and compile through MLIR to CUDA, CPU-SIMD, MLX, or ROCm. The runtime is the program.

```
fn attention_f16(q: Tensor<f32, [D]>,
                 key_cache: Tensor<f16, [L, SeqLen, KvDim]>,
                 value_cache: Tensor<f16, [L, SeqLen, KvDim]>,
                 layer: i32, pos: i32, H: i32, scale: f32) -> Tensor<f32, [D]>
    with tile=[8, 64], parallel=[h, t] {

    var att: Tensor<f32, [H, SeqLen]> = zeros([H, SeqLen])

    # Compute attention scores
    att[h, t] = if t > pos { -inf } else {
        sum(i) q[h * Dh + i] * (key_cache[layer, t, h * Dh + i] as f32) * scale
    }

    var weights: Tensor<f32, [H, SeqLen]> = softmax(att)
    # ... weighted sum over values
}
```
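
For readers unfamiliar with TensaLang's notation, here is a rough NumPy equivalent of the score-and-softmax steps above. This is a minimal sketch, not the actual runtime: the per-head dimension `Dh` and the `KvDim == H * Dh` cache layout are assumptions inferred from the kernel's signature.

```python
import numpy as np

def attention_scores(q, key_cache, layer, pos, H, scale):
    """Rough NumPy equivalent of the TensaLang kernel's score + softmax steps.

    q:         (D,) float32 query, with D == H * Dh (assumed)
    key_cache: (L, SeqLen, KvDim) float16 keys, KvDim == H * Dh (assumed layout)
    """
    _, seq_len, kv_dim = key_cache.shape
    dh = kv_dim // H                      # per-head dimension Dh (assumption)
    att = np.full((H, seq_len), -np.inf, dtype=np.float32)
    for h in range(H):
        for t in range(pos + 1):          # causal mask: t > pos stays -inf
            k = key_cache[layer, t, h * dh:(h + 1) * dh].astype(np.float32)
            att[h, t] = np.dot(q[h * dh:(h + 1) * dh], k) * scale
    # Row-wise softmax; masked positions (exp(-inf) == 0) get zero weight
    weights = np.exp(att - att.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights
```

The TensaLang version expresses the same computation declaratively and lets the compiler derive the tiling and parallelization from the `with tile=[...], parallel=[...]` clause.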

From the creator of [Datarus-R1-14B](https://huggingface.co/Datarus/Datarus-R1-14B).

## Example models

| Model | Parameters | Format | Description |
|-------|------------|--------|-------------|
| `llama2_7b_f16.safetensors` | 7B | FP16 | Llama2-7B |
| `qwen2.5_coder_0.5b_bf16.safetensors` | 0.5B | BF16 | Qwen2.5-Coder-0.5B-Instruct |
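
Because the files above use the safetensors format, their contents (tensor names, dtypes, shapes) can be listed without loading any weights: the format stores a JSON header before the raw data. A minimal sketch using only the Python standard library, following the published safetensors layout (the path is whatever you downloaded to):

```python
import json
import struct

def safetensors_header(path):
    """List tensors in a .safetensors file without loading the weights.

    Layout: an 8-byte little-endian header length, then that many bytes of
    JSON mapping tensor names to {"dtype", "shape", "data_offsets"}.
    """
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(n))
    header.pop("__metadata__", None)  # optional metadata block, if present
    return {name: (t["dtype"], t["shape"]) for name, t in header.items()}
```

Calling `safetensors_header("models/llama2_7b_f16.safetensors")` returns a mapping from each tensor name to its `(dtype, shape)` pair, which is handy for checking a download before pointing TensaLang at it.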

## Usage

```bash
# Clone TensaLang
git clone https://github.com/BenChaliah/Tensa-Lang.git
cd Tensa-Lang && ./build.sh

# Download models
huggingface-cli download BenChaliah/TensaLang-models --local-dir ./models

# Or download a specific model
huggingface-cli download BenChaliah/TensaLang-models llama2_7b_f16.safetensors --local-dir ./Llama2-assets
```

### Run Llama2

```bash
./bin/tensalang-run examples/llama2_manual_tiling_fp16.tl \
  --model Llama2-assets/llama2_7b_f16.safetensors \
  --tokenizer Llama2-assets/tokenizer.json \
  --prompt "Once upon a time" \
  --target cuda \
  --steps 128 \
  --fused-attention 2 \
  --cuda-arch sm_89
```

### Run Qwen2.5-Coder

```bash
./bin/tensalang-run examples/qwen25_coder_bf16.tl \
  --model Qwen25-assets/qwen2.5_coder_0.5b_bf16.safetensors \
  --tokenizer Qwen25-assets/tokenizer.json \
  --prompt "def quicksort(arr):" \
  --target cuda \
  --steps 64 \
  --cuda-arch sm_89
```

## Source

Weights converted from:

- [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
- [Qwen/Qwen2.5-Coder-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct)

## License

The model weights retain their original licenses. The TensaLang compiler is MIT licensed.

## Links

- [TensaLang GitHub](https://github.com/BenChaliah/Tensa-Lang)
- [Documentation](https://tensa-lang.org/docs.html)
- [Website](https://tensa-lang.org/)
- [Datarus-R1-14B](https://huggingface.co/Datarus/Datarus-R1-14B)