	## Vocab(288) Divisible by 32, 16, 8, 3
	- 256 Byte Slots
	- 32 Special Tokens
	Core Sequence
	1. PAD
	2. EOS
	3. BOS
	Conversational/Role
	4. SYSTEM
	5. USER
	6. ASSISTANT
	Reasoning/Planning
	7. SCRATCHPAD
	8. PLAN
	9. REFLECTION
	10. SUMMARY
	Tool/Agent Tokens
	11. ACTION
	12. TOOL
	13. TOOL_RESULT
	14. FUNCTION
	Coding Tokens
	15. CODE
	16. CODE_BLOCK
	17. PATCH
	18. EXECUTION
	Retrieval / Knowledge
	19. SEARCH
	20. RETRIEVAL
	21. CONTEXT
	22. CITATION
	Structured Output
	23. JSON
	24. XML
	25. TABLE
	26. LIST
	27. MARKDOWN
	Safety / Control
	28. ERROR
	29. WARNING
	30. POLICY
	31. STOP
	Free
	32. RESERVED
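
The layout above can be sketched as an ID table. This is only an illustration: the notes do not say which IDs the byte slots and special tokens occupy, so placing bytes at 0-255 and special tokens at 256-287 is an assumption, and `encode_bytes` is a hypothetical helper.

```python
# Assumption: byte slots take IDs 0-255 and the 32 special tokens follow at
# 256-287. The notes do not fix the ID order, so this layout is hypothetical.
NUM_BYTE_SLOTS = 256

SPECIAL_TOKENS = [
    "PAD", "EOS", "BOS",                              # core sequence
    "SYSTEM", "USER", "ASSISTANT",                    # conversational / role
    "SCRATCHPAD", "PLAN", "REFLECTION", "SUMMARY",    # reasoning / planning
    "ACTION", "TOOL", "TOOL_RESULT", "FUNCTION",      # tool / agent
    "CODE", "CODE_BLOCK", "PATCH", "EXECUTION",       # coding
    "SEARCH", "RETRIEVAL", "CONTEXT", "CITATION",     # retrieval / knowledge
    "JSON", "XML", "TABLE", "LIST", "MARKDOWN",       # structured output
    "ERROR", "WARNING", "POLICY", "STOP",             # safety / control
    "RESERVED",                                       # free
]

SPECIAL_TOKEN_IDS = {
    name: NUM_BYTE_SLOTS + i for i, name in enumerate(SPECIAL_TOKENS)
}
VOCAB_SIZE = NUM_BYTE_SLOTS + len(SPECIAL_TOKENS)


def encode_bytes(text: str) -> list[int]:
    """Hypothetical helper: UTF-8 bytes wrapped in BOS ... EOS."""
    return [SPECIAL_TOKEN_IDS["BOS"], *text.encode("utf-8"), SPECIAL_TOKEN_IDS["EOS"]]
```

With this layout any raw byte is its own token ID, so no input can fall outside the vocabulary.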

	## Question
	Is there an 8th piece I can build that acts as an internal router? I want to keep the flow the same, but a safeguard would be great for when my unique model takes in things like common datasets, which can collapse parts of my system. It could be the model wrapper class itself (for example TrigramModel or MORPH), with classes and logic for routing data correctly, or even filtering it. It might also help the model grow. Since the codebook constantly updates, the model still learns somewhat during training, so ideally a wrapper would ensure that the end output model doesn't get trained while the whole system, or multiple pieces of it, does. This could even live in a ModelTrainer class.

	For GPU operations, is it possible to reserve a tensor/matrix of size 384 (divisible by 64/32/16/8/3) to act as a warp-like intermediary that pseudo-transforms or scales up a ternary model into efficient, compact, and precise float-like numerals? Things like gradients would pass through it, and it could change which actual parameters are frozen and stay that way (converting from ternary to FP32, for example), without sacrificing much speed. The goal is to keep the same amount of VRAM. The space is mostly padding, with potential multiplication factors if the content (the ternary model) is resized on the fly.


	## Recover Info
	The codebook is 32-dim and an FP32 layer. It will be replaced by the S-Scale layer and scaled ternary over FP32 in the future kernel. The idea is that the codebook dim is 2× the size of the scale. Example: TernaryScale32 (FP32 mimic) has a 64-dim codebook, while TernaryScale4 (the lowest, FP4) has an 8-dim codebook.

	.float() and .dtype() will have to be updated with the new kernel. Ternary is 2 bits, and the kernel pseudo-scales numbers to match pseudo-higher scales.

	nn.Embedding and RMSNorm (must be scaled) are being ternarized.

	When weights are ternarized, they are reinitialized as 1, 0, -1. Maybe there's a way to calculate all parameters and scale them to ternary correctly
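
One known way to derive ternary values from existing parameters, rather than reinitializing them, is absmean quantization (the scheme described for BitNet b1.58): divide by the mean absolute weight, round, and clip to [-1, 1]. The sketch below is a plain-Python illustration of that technique, not this project's kernel; `ternarize_absmean` and `dequantize` are hypothetical names, and a real implementation would work on tensors, likely with one scale per row or group.

```python
def ternarize_absmean(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Quantize weights to {-1, 0, 1} with one shared scale (absmean scheme)."""
    # The scale is the mean absolute value; eps avoids division by zero.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    trits = [max(-1, min(1, round(w / scale))) for w in weights]
    return trits, scale


def dequantize(trits: list[int], scale: float) -> list[float]:
    """Approximate reconstruction: each trit times the shared scale."""
    return [t * scale for t in trits]


weights = [0.9, -0.05, -1.4, 0.3, 0.0, -0.7]
trits, scale = ternarize_absmean(weights)
print(trits)  # [1, 0, -1, 1, 0, -1]
```

Because the scale is stored next to the trits, both signs carry information: -1 and 1 reconstruct to `-scale` and `+scale`.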

	The negative dynamic range isn't being used but it should have a reason to be.

	GNN + COO Sparse Tensor

	zeros_like - creates normal tensors
	torch.bmm on GraphPool — batched matrix multiply: [B,1,K] @ [B,K,D] → [B,1,D]
	TernaryGraph is still using .float(), and it now also loops with O(n) operations; it registers a buffer as well.

	The model is now being packed as 5 trits ~ 8 bits.
	The question now is how to calculate loss: should it be per group or per individual weight?
	I'm trying a unique way to grade losses: rather than adding the losses together, maybe split them individually to grade each component's loss rather than the total loss. I'm not sure how this will work yet, or whether grouping helps with that. I have 10 potential modules that can incur losses, 8 of which are necessary.
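
As a rough illustration of grading components separately, the sketch below keeps one loss history per module instead of a single summed value. Everything here is hypothetical: the class name, the module names, and the use of plain floats in place of tensors. It only covers the bookkeeping and does not decide how gradients should be combined.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentLossTracker:
    """Hypothetical bookkeeping: one loss history per module, no summing."""
    required: set[str]
    history: dict[str, list[float]] = field(default_factory=dict)

    def record(self, losses: dict[str, float]) -> None:
        # Required modules must report every step; optional ones may be absent.
        missing = self.required - set(losses)
        if missing:
            raise ValueError(f"missing required losses: {sorted(missing)}")
        for name, value in losses.items():
            self.history.setdefault(name, []).append(value)

    def report(self) -> dict[str, float]:
        """Mean loss per component over all recorded steps."""
        return {name: sum(v) / len(v) for name, v in self.history.items()}

    def worst(self) -> str:
        """Component with the highest mean loss (only meaningful if scales are comparable)."""
        means = self.report()
        return max(means, key=means.get)


# Module names below are placeholders, not the project's real modules.
tracker = ComponentLossTracker(required={"codebook", "decoder"})
tracker.record({"codebook": 0.5, "decoder": 1.0, "router": 0.1})
tracker.record({"codebook": 0.3, "decoder": 2.0})
print(tracker.report())
print(tracker.worst())  # decoder
```

Splitting the 8 necessary modules into `required` and leaving the other 2 optional is one possible way to map the counts above onto this structure.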

	The most accurate and closest bit-to-trit packing is 34 trits ~ 54 bits. Its efficiency is extremely high, ≈ 99.8% (34 · log2(3) ≈ 53.89 of 54 bits), and the conversion is lossless. The only benefit is that it makes fuller use of the raw ternary capacity with near-zero waste, which gives better memory use, but barely. For the 3B goal, for example, 34-trit packing saves only about 4.4 MB of RAM compared with 5-trit packing (≈ 595.6 MB vs 600 MB).
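
The arithmetic behind these packing ratios can be checked with a few lines. This is a back-of-the-envelope sketch: the function names are illustrative, MB means 10^6 bytes, and padding of the final partial group is ignored.

```python
import math


def packing_efficiency(trits: int, bits: int) -> float:
    """Fraction of the bit budget that carries ternary information."""
    assert 3 ** trits <= 2 ** bits, "trits do not fit in the bit budget"
    return trits * math.log2(3) / bits


def packed_megabytes(num_weights: int, trits: int, bits: int) -> float:
    """Packed storage in MB (10**6 bytes), ignoring the last partial group."""
    return num_weights * bits / trits / 8 / 1e6


for trits, bits in [(5, 8), (34, 54), (40, 64)]:
    print(f"{trits} trits / {bits} bits: {packing_efficiency(trits, bits):.2%}")

num_weights = 3_000_000_000
saving_mb = packed_megabytes(num_weights, 5, 8) - packed_megabytes(num_weights, 34, 54)
print(f"34-trit vs 5-trit saving for 3B weights: {saving_mb:.1f} MB")
```

Note that 5 trits per 8 bits and 40 trits per 64 bits have the same ratio (1.6 bits per trit), so they have identical efficiency.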

	The main downside is that 54 bits doesn't align with GPUs or CPUs, which rely on and favor 8-bit/64-bit boundaries.
	This leads to the packed version being more accurate (closer to the ideal density) but also slower to decode.
	40 trits ~ 64 bits vs 5 trits ~ 8 bits is another example of the speed trade-off: the 40-trit, 64-bit word is slower to decode, but the size is the same (both use 1.6 bits per trit).
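
A minimal sketch of the 5-trit byte format shows one reason it decodes quickly: a byte has only 243 valid values, so decoding can be a single table lookup, while a 40-trit word has 3^40 possible values and cannot be tabulated directly. The digit order (first trit in the least significant base-3 digit) and the function names are assumptions, not the project's actual format.

```python
def pack5(trits: list[int]) -> int:
    """Pack five values from {-1, 0, 1} into one byte (base 3, range 0..242)."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    byte = 0
    for t in reversed(trits):
        byte = byte * 3 + (t + 1)  # map -1/0/1 to base-3 digits 0/1/2
    return byte


def unpack5(byte: int) -> list[int]:
    """Inverse of pack5: peel off base-3 digits, least significant first."""
    assert 0 <= byte < 243
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits


# Precomputed table: decoding one byte becomes a single list lookup.
DECODE_TABLE = [unpack5(b) for b in range(243)]

packed = pack5([1, 0, -1, 1, 0])
print(packed)                # 140
print(DECODE_TABLE[packed])  # [1, 0, -1, 1, 0]
```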

	Unpacking completely changes memory usage, and in training it usually dominates everything.
	Think of it like this:
	- Packed ternary = compact storage (cheap memory, expensive compute)
	- Unpacked ternary = full tensor (expensive memory, cheap compute)
	They are not equivalent in memory at all.

	So a 3B model, once unpacked, becomes:
	- 3 GB (int8)
	- 6 GB (fp16)
	- 12 GB (fp32)
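
These figures follow from bytes per weight, and the same arithmetic gives the packed size for comparison. A small sketch, assuming GB means 10^9 bytes and counting weights only (no gradients, optimizer state, or activations); the function names are illustrative.

```python
import math

BYTES_PER_WEIGHT = {"int8": 1, "fp16": 2, "fp32": 4}


def unpacked_gb(num_weights: int, dtype: str) -> float:
    """Size of the fully unpacked weight tensor in GB (10**9 bytes)."""
    return num_weights * BYTES_PER_WEIGHT[dtype] / 1e9


def packed_gb(num_weights: int, trits_per_byte: int = 5) -> float:
    """Size of the packed form in GB, one byte per group of trits."""
    return math.ceil(num_weights / trits_per_byte) / 1e9


num_weights = 3_000_000_000
for dtype in BYTES_PER_WEIGHT:
    print(f"{dtype}: {unpacked_gb(num_weights, dtype):.1f} GB")
print(f"packed (5 trits per byte): {packed_gb(num_weights):.1f} GB")
```

For 3B weights the packed form is 0.6 GB, so even the int8 unpacked copy is five times larger.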