JiRack Tokenizer
- Compatible with Llama 3 • Optimized for Code
- A Llama 3-based tokenizer enhanced with fill-in-the-middle (FIM) marker tokens.
- Fully compatible with large open coding datasets such as BigCode's The Stack and StarCoder data, and Microsoft's NextCoder.
- Enables efficient training on large-scale coding data for superior code generation and understanding.
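FIM training rearranges a source file into prefix, suffix, and middle segments delimited by the marker tokens, so the model learns to fill in code between an existing prefix and suffix. A minimal sketch of building such a sample — the marker strings below are placeholders; substitute the actual special tokens shipped with the JiRack tokenizer:

```python
# Build a fill-in-the-middle (FIM) training sample from a code snippet.
# NOTE: these marker strings are illustrative placeholders, not the
# JiRack tokenizer's actual special tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def make_fim_sample(code: str, span_start: int, span_end: int) -> str:
    """Cut [span_start, span_end) out of `code` and emit it in PSM order:
    the model sees prefix and suffix first, then predicts the middle."""
    prefix = code[:span_start]
    middle = code[span_start:span_end]
    suffix = code[span_end:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

# Mask out the function body so the model must reconstruct it.
sample = make_fim_sample("def add(a, b):\n    return a + b\n", 15, 31)
print(sample)
```

In practice the span boundaries are sampled randomly per document during preprocessing, so the same file yields many different infilling tasks.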
Inventor: Konstantin Vladimirovich Grabko
Organization: CMS Manhattan JiRack Technology
Official Site: www.cmsmanhattan.com
Designed for Banking and Fintech Institutions
Banks and Fintech JiRack Architecture: Build Sovereign Financial Models from Scratch
- Leveraging the JiRack Tokenizer and our Open Dataset, we enable financial institutions to develop secure, internal AI models from the ground up. This approach ensures maximum data privacy and model sovereignty for high-stakes banking operations.
- Fixed pricing is available for FinTech.
- I recommend initializing the model with a 4K context window for initial stability, then scaling to an 8K context using the specialized JiRack 8K datasets. This two-stage approach establishes robust positional encoding before extending the model's long-range dependency modeling.
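The two-stage schedule above can be sketched at the configuration level with Hugging Face `transformers`. This is an illustrative fragment, not the JiRack training recipe: the linear RoPE scaling factor and the tiny layer sizes are assumptions chosen so the sketch stays small.

```python
from transformers import LlamaConfig

# Stage 1: train from scratch with a 4K context window for stable
# positional encoding. (Layer sizes here are deliberately tiny.)
stage1 = LlamaConfig(
    vocab_size=128259,              # matches the JiRack tokenizer vocab size
    hidden_size=256,
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=4096,   # 4K context
)

# Stage 2: continue training at 8K on the JiRack 8K datasets.
# Linear RoPE scaling with factor 2.0 stretches the positional
# encoding learned at 4K across the doubled window.
stage2 = LlamaConfig(
    vocab_size=128259,
    hidden_size=256,
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=8192,   # 8K context
    rope_scaling={"type": "linear", "factor": 2.0},
)
print("Context:", stage1.max_position_embeddings, "->", stage2.max_position_embeddings)
```

In a real run, stage 2 would load the stage 1 checkpoint with the extended config and resume training on long-context data.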
JiRack Corporate Tokenizer Solution
- Use JiRack models with trusted, high-quality coding datasets while maintaining full control over your code and data privacy.
- Excellent fit for Banks, Fintech companies, and any organization that requires strict data confidentiality and security.
- Update the JiRack model for corporate, privacy-focused coding.
JiRack Tokenizer Subscription
- All subscribed members will receive regular tokenizer updates optimized for the latest high-quality coding datasets.
Installation for Llama-compatible models in your chat script
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./jirack_code_tokenizer_fixed")
model = AutoModelForCausalLM.from_pretrained("your-model-name")

This step is required:

model.resize_token_embeddings(len(tokenizer))  # resize to the new vocabulary size
print("New embedding size for your chat script:", model.get_input_embeddings().weight.shape[0])
Test the tokenizer size:
(venv_ji) root@jirack2:# python -c '
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("./jirack_code_tokenizer_fixed")
print("Vocab size:", len(tok))
print("pad_token_id:", tok.pad_token_id)
print("eos_token_id:", tok.eos_token_id)
'
Vocab size: 128259
pad_token_id: 128001
eos_token_id: 128001