## Vocab (288): Divisible by 32, 16, 8, 3
- 256 Byte Slots
- 32 Special Tokens
**Core Sequence**
1. PAD
2. EOS
3. BOS
**Conversational/Role**
4. SYSTEM
5. USER
6. ASSISTANT
**Reasoning/Planning**
7. SCRATCHPAD
8. PLAN
9. REFLECTION
10. SUMMARY
**Tool/Agent Tokens**
11. ACTION
12. TOOL
13. TOOL_RESULT
14. FUNCTION
**Coding Tokens**
15. CODE
16. CODE_BLOCK
17. PATCH
18. EXECUTION
**Retrieval / Knowledge**
19. SEARCH
20. RETRIEVAL
21. CONTEXT
22. CITATION
**Structured Output**
23. JSON
24. XML
25. TABLE
26. LIST
27. MARKDOWN
**Safety / Control**
28. ERROR
29. WARNING
30. POLICY
31. STOP
**Free**
32. RESERVED
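
The layout above can be sketched as a small lookup table. This is only an illustration: the notes do not say which IDs the byte slots and special tokens occupy, so the sketch assumes bytes take IDs 0..255 and the special tokens follow at 256..287 in the listed order. The `encode` and `decode` helpers are hypothetical names.

```python
# Assumption (not stated in the notes): byte slots use IDs 0..255 and the
# 32 special tokens follow at 256..287, in the order they are listed.
SPECIAL_TOKENS = [
    "PAD", "EOS", "BOS",
    "SYSTEM", "USER", "ASSISTANT",
    "SCRATCHPAD", "PLAN", "REFLECTION", "SUMMARY",
    "ACTION", "TOOL", "TOOL_RESULT", "FUNCTION",
    "CODE", "CODE_BLOCK", "PATCH", "EXECUTION",
    "SEARCH", "RETRIEVAL", "CONTEXT", "CITATION",
    "JSON", "XML", "TABLE", "LIST", "MARKDOWN",
    "ERROR", "WARNING", "POLICY", "STOP",
    "RESERVED",
]
NUM_BYTES = 256
SPECIAL_ID = {name: NUM_BYTES + i for i, name in enumerate(SPECIAL_TOKENS)}
VOCAB_SIZE = NUM_BYTES + len(SPECIAL_TOKENS)


def encode(text, bos=True, eos=True):
    """Turn text into byte IDs, optionally wrapped in BOS/EOS."""
    ids = list(text.encode("utf-8"))
    if bos:
        ids.insert(0, SPECIAL_ID["BOS"])
    if eos:
        ids.append(SPECIAL_ID["EOS"])
    return ids


def decode(ids):
    """Drop special tokens and decode the remaining byte IDs."""
    raw = bytes(i for i in ids if i < NUM_BYTES)
    return raw.decode("utf-8", errors="replace")
```

Keeping the byte range contiguous means a single `i < NUM_BYTES` check separates raw bytes from control tokens.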
## Question
Is there an 8th piece I can build that acts as an internal router? I want to keep the flow the same, but a safeguard would be great for when my unique model takes in things like common datasets, which can collapse parts of my system. It could be the model wrapper class itself (for example TrigramModel or MORPH), with classes and logic for routing data correctly, or even filtering it. It might also help the model grow. Since the codebook constantly updates, the model learns somewhat during training, so ideally a wrapper would ensure that the end output model doesn't get trained while the whole system or multiple pieces do. This could even live in a ModelTrainer class.
For GPU operations, is it possible to reserve a tensor/matrix of size 384 (divisible by 64/32/16/8/3) to act as a warp-like intermediary that pseudo-transforms or scales up a ternary model into efficient, compact, and precise float-like numerals? Things like gradients would pass through it, and it could change the actual parameters, possibly freezing them in place (converting from ternary to FP32, for example), without sacrificing much speed. The goal is to keep the same amount of VRAM. The space is mostly padding, with potential multiplication factors if the content (the ternary model) is resized into something else on the fly.
## Recover Info
The codebook is 32-dim and an FP32 layer. In the future kernel, this will be replaced with the S-Scale layer and scaled ternary over FP32. The idea is that the codebook dim is 2x the size of the scale. Example: TernaryScale32 (FP32 mimic) has a 64-dim codebook, while TernaryScale4 (the lowest, FP4) has an 8-dim codebook.
.float() and .dtype usage will have to be updated with the new kernel. Ternary is 2 bits, and the kernel pseudo-scales numbers to match pseudo-higher scales.
nn.Embedding and RMSNorm (which must be scaled) are being ternarized.
When weights are ternarized, they are reinitialized as 1, 0, -1. Maybe there's a way to calculate all parameters and scale them to ternary correctly
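
One commonly used way to map existing weights to ternary values instead of reinitializing them is absmean scaling, the scheme described for BitNet b1.58: divide by the mean absolute weight, round, and clip to {-1, 0, 1}. The sketch below is a plain-Python illustration of that idea, not this project's kernel; `ternarize_absmean` and `dequantize` are hypothetical names.

```python
def ternarize_absmean(weights, eps=1e-8):
    """Quantize floats to trits in {-1, 0, 1} plus one shared scale.

    Assumption: a single per-tensor scale (mean absolute value).
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    trits = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return trits, scale


def dequantize(trits, scale):
    """Approximate the original weights as trit * scale."""
    return [t * scale for t in trits]


weights = [0.9, -1.1, 0.05, -0.02, 2.0, -0.93]
trits, scale = ternarize_absmean(weights)
approx = dequantize(trits, scale)
```

The scale is stored once per tensor (or per group), so the trits keep their 2-bit footprint while the float scale restores the magnitude at compute time.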
The negative dynamic range isn't being used but it should have a reason to be.
GNN + COO Sparse Tensor
zeros_like - creates normal tensors
torch.bmm on GraphPool — batched matrix multiply: [B,1,K] @ [B,K,D] → [B,1,D]
It is still using .float(), and TernaryGraph now loops with O(n) operations; it also registers a buffer.
The model is now being packed as 5 trits ~ 8 bits.
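
Five trits fit in one byte because 3^5 = 243 ≤ 256. A minimal sketch of such a packing, treating each group of five trits as a base-3 number, is shown below. The digit order and the zero-padding of the last group are assumptions made for the example, and `pack5` / `unpack5` are hypothetical names.

```python
def pack5(trits):
    """Pack trits in {-1, 0, 1} into bytes, 5 trits per byte (base 3)."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        group = list(trits[i:i + 5])
        group += [0] * (5 - len(group))  # pad the last group with zero trits
        value = 0
        for t in reversed(group):        # first trit ends up least significant
            value = value * 3 + (t + 1)  # map -1, 0, 1 to digits 0, 1, 2
        out.append(value)                # value is at most 242
    return bytes(out)


def unpack5(data, count):
    """Recover the first `count` trits from packed bytes."""
    trits = []
    for byte in data:
        for _ in range(5):
            byte, digit = divmod(byte, 3)
            trits.append(digit - 1)
    return trits[:count]
```

The caller has to remember `count`, since padding trits are indistinguishable from real zeros.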
The question now is how to calculate loss: should it be per group or per individual weight?
I'm trying a unique way to grade losses: rather than adding losses together, maybe split them individually to grade each component's loss rather than the total loss. I'm not sure how this will work yet, or whether grouping helps with that. I have 10 potential modules that can incur losses, and 8 that are necessary.
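
One way to grade components separately is to keep each module's loss under its own name and track it over time, rather than only looking at the sum. The sketch below is a bookkeeping illustration under that assumption; the class name, the module names, and the moving-average choice are all hypothetical, and the values are plain floats (in training they could come from something like `loss.item()`).

```python
class ComponentLosses:
    """Track a smoothed loss per named component instead of one total."""

    def __init__(self, required, optional=(), momentum=0.9):
        self.required = tuple(required)   # losses that must be reported
        self.optional = tuple(optional)   # losses that may be absent
        self.momentum = momentum
        self.ema = {}                     # name -> exponential moving average

    def update(self, losses):
        missing = [n for n in self.required if n not in losses]
        if missing:
            raise KeyError(f"missing required losses: {missing}")
        known = set(self.required) | set(self.optional)
        unknown = [n for n in losses if n not in known]
        if unknown:
            raise KeyError(f"unknown loss components: {unknown}")
        for name, value in losses.items():
            value = float(value)
            prev = self.ema.get(name)
            if prev is None:
                self.ema[name] = value
            else:
                self.ema[name] = self.momentum * prev + (1 - self.momentum) * value
        return dict(self.ema)

    def worst(self):
        """Name of the component with the highest smoothed loss."""
        return max(self.ema, key=self.ema.get)
```

A trainer could use `worst()` to decide which component to inspect or reweight, while still summing the individual losses when a single scalar is needed for the backward pass.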
The most accurate and closest bit-to-trit packing is 34 trits ~ 54 bits. Efficiency is extremely high, ≈ 99.8% (34 trits carry about 53.89 bits of information out of 54), compared with ≈ 99.1% for 5 trits in 8 bits. The only benefit is that more of the raw ternary capacity is used with near-zero waste, leading to better memory use, but barely: for the 3B goal, for example, 34-trit packing saves only about 4.4 MB of RAM compared with 5-trit packing.
The main downside is that 54 bits isn't compatible with GPUs or CPUs, which rely on and favor 64-bit/8-bit alignment.
This leads to the packed version being more accurate but also slower to decode.
40 trits ~ 64 bits vs 5 trits ~ 8 bits is another example of speed: the 64-bit, 40-trit version is slower to decode, but the size is the same.
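
The trade-offs between these packings can be checked with a few lines of arithmetic. The sketch below computes how much of each container is used (a trit carries log2(3) ≈ 1.585 bits) and the packed size for a given weight count; the function names are hypothetical.

```python
import math

BITS_PER_TRIT = math.log2(3)  # information content of one trit


def packing_efficiency(trits, bits):
    """Fraction of the container's bits that carry trit information."""
    if 3 ** trits > 2 ** bits:
        raise ValueError(f"{trits} trits do not fit in {bits} bits")
    return trits * BITS_PER_TRIT / bits


def packed_bytes(num_weights, trits, bits):
    """Bytes needed to store num_weights trits in groups of `trits`."""
    groups = math.ceil(num_weights / trits)
    return groups * bits / 8


N = 3_000_000_000  # the 3B goal
eff_5_8 = packing_efficiency(5, 8)      # about 0.9906
eff_34_54 = packing_efficiency(34, 54)  # about 0.9979
eff_40_64 = packing_efficiency(40, 64)  # same ratio as 5 trits in 8 bits
saved = packed_bytes(N, 5, 8) - packed_bytes(N, 34, 54)  # a few MB
```

Because 40/64 reduces to 5/8, the two byte-aligned schemes store exactly the same number of bits per trit; only the decode granularity differs.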
Unpacking completely changes memory usage, and in training it usually dominates everything.
Think of it like this:
- Packed ternary = compact storage (cheap memory, expensive compute)
- Unpacked ternary = full tensor (expensive memory, cheap compute)
They are not equivalent in memory at all.
So an unpacked 3B model becomes:
- 3 GB (int8)
- 6 GB (fp16)
- 12 GB (fp32)
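
The figures above, and the packed size they are being compared against, follow from bytes-per-weight arithmetic. A small sketch, assuming decimal gigabytes (1 GB = 10^9 bytes), weights only, and the 5-trits-per-byte packing for the packed case; the helper names are hypothetical.

```python
BYTES_PER_WEIGHT = {"int8": 1, "fp16": 2, "fp32": 4}


def unpacked_gb(num_weights, dtype):
    """Size of an unpacked weight tensor in decimal GB."""
    return num_weights * BYTES_PER_WEIGHT[dtype] / 1e9


def packed_gb(num_weights, trits_per_byte=5):
    """Size of packed ternary weights in decimal GB (one byte per group)."""
    num_bytes = -(-num_weights // trits_per_byte)  # ceiling division
    return num_bytes / 1e9


N = 3_000_000_000
sizes = {dtype: unpacked_gb(N, dtype) for dtype in BYTES_PER_WEIGHT}
packed = packed_gb(N)
```

Under these assumptions the packed model is about 0.6 GB, so even the int8 unpacked form is five times larger, which is why unpacking for training dominates memory.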