---
language:
- fr
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- binary-level
- bit-level
- causal-lm
- tokenizer-free
- base2
- binary
- TinyTransformerLM
license: apache-2.0
datasets:
- PhysiQuanty/CATIE-AQ-amazon_reviews_sentiment_analysis_BASE_2_VOCAB_4
---

# BinaryLLM (Proof of Concept)

A tokenizer-free, radix-2 (`vocab_size=4`) proof of concept for causal language modeling.

This repo requires `trust_remote_code=True` because it ships custom `modeling_*.py` / `configuration_*.py` files.

(This proof of concept is trained on French only; for BinaryLLM1 we plan 20 languages as well as scientific and mathematical knowledge.)

- 10 million parameters
- 2 billion training tokens
- 40k steps
- 1e-4 learning rate
- FP32 weights, FSDP training on 8 NVIDIA V100 GPUs

## Load (Python)

```python
from transformers import AutoModelForCausalLM

m = AutoModelForCausalLM.from_pretrained(
    "PhysiQuanty/Binary-LLM-POC",
    trust_remote_code=True,
)
m.eval()
```
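
As a quick sanity check of the parameter count listed above, you can sum the model's parameters (a minimal snippet, assuming the model is loaded as `m` as in the block above):

```python
# Total parameter count; should come out to roughly 10M.
n_params = sum(p.numel() for p in m.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```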
|
|
|
|
|
### Command

```bash
python3 inference.py --repo "PhysiQuanty/Binary-LLM-POC" --prompt "bonjour" --print_ids
```

### Example output

```text
[Seed] 295493869
[Device] cuda
[+] PROMPT IDS = [2, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 3, 2]

[Prompt]
bonjour

[Prompt IDs] len=59 | BOS=2 EOS=3

[Output]

[Final Output]

Voici un avis laissé par un client sur un produit. Diriez-vous qu'il est négatif ou positif ?
Avi

[Generated IDs]

[0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0,
1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1,
0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0,
0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0,
0, 1]
```
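
The prompt length adds up: `bonjour` is 7 UTF-8 bytes, i.e. 7 × 8 = 56 bit IDs, which together with the leading BOS (2) and the trailing EOS/BOS pair (3, 2) gives len=59. The decoded French output translates to: "Here is a review left by a customer about a product. Would you say it is negative or positive?" followed by the truncated start of "Avis" (review), echoing the sentiment-analysis prompts in the training dataset.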
|
|
|
|
|
|
|
|
## Inference (CLI)

This repo includes a minimal inference script that:

* encodes the prompt to radix-2 bits (UTF-8 bytes, MSB→LSB),
* runs a manual token-by-token loop (no `generate`),
* decodes the generated bits back to text (best-effort strict decode); see the sketch below.
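
A minimal end-to-end sketch of those three steps, assuming the framing visible in the example output above ([BOS, prompt bits, EOS, BOS] with BOS=2, EOS=3), greedy decoding, and a standard causal-LM forward interface. The bundled `inference.py` is the reference implementation; the helper names here are illustrative, not its actual API:

```python
import torch
from transformers import AutoModelForCausalLM

BOS, EOS = 2, 3  # marker IDs, as shown in the example output

def encode_bits(text: str) -> list[int]:
    """UTF-8 bytes -> bits (MSB first), framed as [BOS, bits, EOS, BOS]."""
    bits = [BOS]
    for byte in text.encode("utf-8"):
        bits.extend((byte >> k) & 1 for k in range(7, -1, -1))
    return bits + [EOS, BOS]

def decode_bits(bits: list[int]) -> str:
    """Keep only 0/1 IDs, regroup into bytes (MSB first), decode UTF-8.

    errors="replace" is a simple stand-in for the script's
    best-effort strict decode.
    """
    data = [b for b in bits if b in (0, 1)]
    out = bytearray()
    for i in range(0, len(data) - len(data) % 8, 8):
        byte = 0
        for b in data[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return out.decode("utf-8", errors="replace")

model = AutoModelForCausalLM.from_pretrained(
    "PhysiQuanty/Binary-LLM-POC", trust_remote_code=True
)
model.eval()

ids = torch.tensor([encode_bits("bonjour")])
generated: list[int] = []
with torch.no_grad():
    for _ in range(512):  # 512 bits = up to 64 bytes of output
        logits = model(input_ids=ids).logits[0, -1]
        next_id = int(logits.argmax())  # greedy; inference.py may sample
        if next_id == EOS:
            break
        generated.append(next_id)
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)

print(decode_bits(generated))
```

Note that each UTF-8 byte costs eight forward passes here, so sequences are 8× longer than they would be in a byte-level model.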
|
|
|
|
|
|
|
|
|
|
|
## Notes

* This model is **tokenizer-free**: the input prompt is encoded directly as base-2 bits (UTF-8 bytes, MSB→LSB).
* Some prompts may decode better than others, depending on the training distribution (e.g. frequent phrases decode more reliably).
|
|
|