---
language:
- en
- ru
- fr
- de
- pt
- es
- hi
- th
tags:
- code
- coding
- llama
- llama 3
- llama 3 10b
- llama 3 8b
- onnx
- int8
- web-ui
- safetensors
license: other
license_name: cms-manhattan-jirack-v1.3
license_link: LICENSE
---

# JiRack 10B FP32, INT8, INT4

A fast and efficient coding assistant with a clean, modern built-in web UI. It is powered by Meta Llama 3.1 8B Instruct weights and a fully refactored architecture optimized for a 10B-scale model. The model was specifically designed for high-performance tuning with advanced ternary quantization options.

The next release will be JiRack Ternary 10B, a highly optimized ternary model delivering exceptional speed and efficiency using Microsoft ONNX Runtime.

- JiRack is a cloud model that saves money on cloud costs. It can also be used as an expert model for RAG in the cloud, with the ONNX JiRack Java server as an alternative.
- Under the updated license, the subscription is $1 per month per user for non-company users.

JiRack Android client demo: https://www.youtube.com/watch?v=SaO6Jfb8R68

CMS Manhattan RAG, email reply, and document and email analytics: https://www.youtube.com/watch?v=KRu2nLEh_6g&t=78s

I no longer read my emails myself; I ask JiRack to tell me the news! You are welcome to buy the CMS Manhattan AI front office solution.

## Training the model

The model is easy to train on a Blackwell GPU with 96 GB of VRAM, so you do not need a data center for fine-tuning. Alternatively, use QLoRA on an inexpensive GPU card. Let me know if you need the code!

## Quick Start

See JiRack 10B in action by running it with Docker.
### Run with Docker

Each command below uses the same container name and host port, so run only one variant at a time.

**Default CPU (INT8)**

    docker run -d \
      --name jirack_10b \
      -p 7869:7869 \
      --restart unless-stopped \
      cmsmanhattan/jirack_10b_int8:latest

**Default CPU (INT4)**

    docker run -d \
      --name jirack_10b \
      -p 7869:7869 \
      --restart unless-stopped \
      cmsmanhattan/jirack_10b_int4:latest

**Multi CPU**

    docker run -d \
      --name jirack_10b \
      -p 7869:7869 \
      --restart unless-stopped \
      --memory=20g \
      --cpus=12 \
      cmsmanhattan/jirack_10b_int8:latest

**GPU**

    docker run -d \
      --name jirack_10b \
      -p 7869:7869 \
      --gpus all \
      --restart unless-stopped \
      cmsmanhattan/jirack_10b_gpu_int8:latest

**Docker Compose**

    services:
      jirack:  # placeholder service name; Compose requires one under "services"
        image: cmsmanhattan/jirack_10b_int8:1.0.2
        container_name: jirack_onnx_service
        ports:
          - "7869:7869"
        volumes:
          - .:/app
          - ./web:/app/web
        environment:
          - MAX_TOKENS=1024
          - TEMPERATURE=0.7
          - TOP_P=0.9
          - DEFAULT_STREAM=False
          - INTRA_THREADS=4
          - USE_ENV_ALLOCATOR=1
        deploy:
          resources:
            limits:
              memory: 16g

## Access the UI

Once the container is running, open your browser and navigate to:

**`http://localhost:7869`**

This opens the **JiRack UI**, a clean web interface designed for chat.

## Changing the Port

The listening port can be changed directly from the **Settings** panel within the JiRack Chat UI.

## Licensing

- The **JiRack 10B model** is provided under a **commercial enterprise license**.
- All **JiRack UI clients** are provided under a commercial license.
- However, the UI clients can be used for free when running together with the official JiRack Docker containers, as long as they are not redistributed separately.

### Subscription Plans

### Ready to Deploy JiRack?

Get immediate access to the repositories, architecture blueprints, and deployment containers.

#### JiRack Enterprise price

- About $36 per user per year.

#### JiRack Private price

- About $12 per user per year.

For commercial licensing, cluster deployment, performance tuning, or enterprise use of JiRack 10B, please contact us.

- JiRack Android chat client with voice and Ollama API setup: https://huggingface.co/kgrabko/JiRackTernary_1b/resolve/main/app-release.apk (or Google Play)
- JiRack MS Windows 11 desktop chat client with Ollama API setup: https://huggingface.co/kgrabko/JiRackTernary_1b/resolve/main/jirack-chat.zip
- Live email chat with the model via support@cmsmanhattan.com

## Hardware Recommendations for AMD Systems

### Recommended Hardware for JiRack Coder10B INT8

The model runs as a single Docker container.

| Use Case             | CPU                   | GPU (ROCm)               | VRAM        | Expected Speed  | Recommendation |
|----------------------|-----------------------|--------------------------|-------------|-----------------|----------------|
| **Recommended**      | Ryzen 7 7700 / 9700X  | RX 7900 XTX / 7900 XT    | 24 GB VRAM  | 50-75 tokens/s  | Best choice    |
| **High Performance** | Ryzen 9 7950X / 9950X | RX 7900 XTX              | 24 GB+ VRAM | 65-90 tokens/s  | Excellent      |
| **Enterprise**       | EPYC 7003/9004 series | MI300X or 2x RX 7900 XTX | 48 GB+ VRAM | 90-140 tokens/s | For 32B model  |
| **Budget Option**    | Ryzen 5 7600 / 9600X  | RX 7800 XT (16 GB)       | 16 GB VRAM  | 35-50 tokens/s  | Acceptable     |

### Important Memory Notes

Even though the 10B INT8 model itself takes approximately **8–9 GB**, we recommend **at least 24 GB VRAM** for the following reasons:

- KV-cache consumption during generation (especially with long context)
- ONNX Runtime overhead and temporary buffers
- System stability and avoiding out-of-memory errors
- Room for larger context windows

**Minimum recommended:** 24 GB VRAM (RX 7900 series)

**Ideal:** 24–32 GB VRAM

For pure CPU inference (no GPU), we recommend at least **64 GB system RAM** (Ryzen 9 7950X/9950X).

---

The default model is also included in full FP32 precision, which is approximately 62 GB in size. It serves as the base for quantization, allowing us to find the optimal balance between model size and performance.
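
To make the KV-cache point in the memory notes above more concrete, here is a rough sizing sketch in Python. It is not taken from the JiRack code base. The layer count, KV-head count, and head dimension are assumptions borrowed from the public Llama 3.1 8B configuration, and the refactored 10B architecture may differ. The 9 GB weight figure is the upper end of the estimate above, treated loosely as GiB. The function names are hypothetical, and ONNX Runtime overhead is not modeled.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate the KV-cache size in bytes for a decoder-only transformer."""
    # Factor 2: every layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * bytes_per_value * batch_size)


def total_memory_gib(weights_gib, cache_bytes):
    """Add the weight footprint (GiB) to a KV-cache size given in bytes."""
    return weights_gib + cache_bytes / GIB


if __name__ == "__main__":
    # ASSUMPTION: 32 layers, 8 KV heads, head dimension 128 (public
    # Llama 3.1 8B values), with a 2-byte (FP16) cache. Not JiRack-specific.
    for context_len in (8_192, 32_768, 131_072):
        cache = kv_cache_bytes(32, 8, 128, context_len)
        total = total_memory_gib(9.0, cache)
        print(f"context {context_len:>7}: KV cache {cache / GIB:5.1f} GiB, "
              f"weights + cache {total:5.1f} GiB")
```

Under these assumed values the cache grows by 1 GiB for every 8,192 tokens of context, so long contexts can take more memory than the weights themselves. This is one reason the recommendation sits well above the 8–9 GB weight footprint.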

## 📧 Contact & Licensing

For joint venture opportunities, hardware integration, or licensing inquiries:

- **Email:** [grabko@cmsmanhattan.com](mailto:grabko@cmsmanhattan.com)
- **Phone:** +1 (516) 777-0945
- **Location:** New York, USA