# IBM Granite-20B-Code-Instruct - TevunahAi Ultra-Hybrid GPTQ with EoRA

## Model Details
| Property | Value |
|---|---|
| Base Model | ibm-granite/granite-20b-code-instruct-8k |
| Architecture | Granite Dense Decoder-only Transformer (GPT-BigCode) |
| Parameters | 20B |
| Context Length | 8,192 tokens |
| Languages | 116 programming languages |
| Quantization | TevunahAi Ultra-Hybrid GPTQ + EoRA |
| Original Size | ~40 GB (BF16) |
| Quantized Size | ~14-15 GB |
| Compression | ~63% reduction |
| Active VRAM | ~14.5 GB (with inference overhead) |
| License | Apache 2.0 |
## Architecture Breakdown
IBM Granite-20B-Code-Instruct is an enterprise-grade code model trained on 3T+ tokens from 116 programming languages:
### Layer Composition (52 total layers)
- 52 Transformer Decoder Layers: GPT-BigCode style architecture
- 48 Attention Heads: Multi-Query Attention (MQA) with 1 KV head
- Hidden Size: 6,144
- Intermediate Size: 24,576
- Vocab Size: 49,152
- GELU Activation: Standard Gaussian Error Linear Unit
- LayerNorm: Traditional layer normalization (not RMSNorm)
- Learned Position Embeddings: Absolute positional encoding (not RoPE)
### Why This Matters
- 116 programming languages: Comprehensive language coverage
- 3T+ training tokens: Massive code understanding
- Enterprise-ready: Apache 2.0 licensed for commercial use
- Multi-Query Attention: Efficient inference with 48:1 query-to-KV ratio
- Instruction-tuned: Optimized for following coding commands
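The 48:1 query-to-KV ratio has a concrete memory payoff: with MQA, all 48 query heads share a single KV head, so the KV cache shrinks 48-fold versus a full multi-head baseline. A back-of-envelope sketch using the figures from the spec tables (fp16 cache assumed):

```python
# Back-of-envelope KV-cache sizing for Granite-20B-Code.
# Figures from the spec tables; fp16 cache is an assumption.
layers = 52
heads = 48
hidden = 6144
head_dim = hidden // heads          # 128
seq_len = 8192                      # full 8K context
bytes_per_elem = 2                  # fp16

def kv_cache_bytes(kv_heads: int) -> int:
    # K and V each store kv_heads * head_dim values per token per layer.
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_elem

mqa = kv_cache_bytes(1)    # Multi-Query Attention: one shared KV head
mha = kv_cache_bytes(48)   # hypothetical full multi-head baseline

print(f"MQA cache: {mqa / 2**20:.0f} MiB")   # ~208 MiB
print(f"MHA cache: {mha / 2**30:.1f} GiB")   # ~9.8 GiB
print(f"ratio: {mha // mqa}:1")              # 48:1
```

At full context the MQA cache stays around 0.2 GiB, which is why the attention design matters for fitting this model on a single 24-32 GB card.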
## Quantization Strategy

**TevunahAi Ultra-Hybrid Mixed-Precision with EoRA Error Recovery**

This quantization uses EoRA (Eigenspace Low-Rank Approximation), NVIDIA's training-free technique for recovering quantization error through low-rank adapters computed during the quantization process.
| Component | Precision | EoRA Rank | Rationale |
|---|---|---|---|
| Layer 0 (all projections) | INT8 | 2048 | Maximum error correction at input - errors propagate through entire model |
| Layer 51 (all projections) | INT8 | 2048 | Maximum error correction at output - directly affects token prediction |
| Attention c_attn/c_proj (layers 1-50) | INT8 | 128 | Quality preservation for code understanding |
| MLP c_fc/c_proj (layers 1-42) | INT4 | 128 | Maximum compression in middle layers |
| MLP c_fc/c_proj (layers 43-50) | INT8 | 128 | Higher precision near output |
| Embeddings | BF16 | - | Preserved for token accuracy |
| LM Head | BF16 | - | Preserved for output quality |
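The idea behind EoRA can be illustrated numerically: quantization leaves a residual error `W - W_q`, and a low-rank adapter approximating that residual is added back at inference time. The toy sketch below uses plain power iteration to peel off a rank-1 term; actual EoRA projects the residual into an eigenspace weighted by calibration activations, which this deliberately does not reproduce:

```python
import random

random.seed(0)
m = n = 32
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

# Crude uniform quantizer (illustrative only, not GPTQ itself).
step = 0.2
Wq = [[round(w / step) * step for w in row] for row in W]
R = [[W[i][j] - Wq[i][j] for j in range(n)] for i in range(m)]  # residual error

def frob2(M):
    """Squared Frobenius norm."""
    return sum(x * x for row in M for x in row)

# Rank-1 approximation of the residual via power iteration.
v = [1.0] * n
for _ in range(50):
    u = [sum(R[i][j] * v[j] for j in range(n)) for i in range(m)]
    nu = max(sum(x * x for x in u) ** 0.5, 1e-12)
    u = [x / nu for x in u]
    v = [sum(R[i][j] * u[i] for i in range(m)) for j in range(n)]
    sigma = max(sum(x * x for x in v) ** 0.5, 1e-12)
    v = [x / sigma for x in v]

# Adding sigma * u v^T back recovers part of the quantization error.
R1 = [[sigma * u[i] * v[j] for j in range(n)] for i in range(m)]
corrected = frob2([[R[i][j] - R1[i][j] for j in range(n)] for i in range(m)])
print(f"error removed by rank-1 adapter: {1 - corrected / frob2(R):.1%}")
```

Higher adapter rank captures more of the residual's spectrum, which is why the boundary layers here get rank 2048 while the interior layers use rank 128.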
### Why EoRA-2048 Boundaries?
- Layer 0: First layer errors compound through all 51 subsequent layers
- Layer 51: Final layer directly determines next token prediction
- Rank 2048: Maximum error correction capacity (~8-10% recovery vs ~5% for rank 128)
- 208 layer-specific rules: not a blanket quantization; each projection is optimized individually
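The 208-rule figure follows from 52 layers × 4 quantized projections per layer. A sketch that reproduces the per-projection assignments from the precision table (module paths follow the GPT-BigCode naming convention; the actual rule format consumed by the quantizer may differ):

```python
def rule(layer: int, module: str) -> dict:
    """Precision/EoRA-rank assignment for one projection, per the table above."""
    if layer in (0, 51):                        # boundary layers: max error correction
        return {"bits": 8, "eora_rank": 2048}
    if module in ("attn.c_attn", "attn.c_proj"):
        return {"bits": 8, "eora_rank": 128}    # attention kept at INT8 throughout
    if 1 <= layer <= 42:                        # middle MLP layers: max compression
        return {"bits": 4, "eora_rank": 128}
    return {"bits": 8, "eora_rank": 128}        # MLP layers 43-50: near-output INT8

modules = ["attn.c_attn", "attn.c_proj", "mlp.c_fc", "mlp.c_proj"]
rules = {f"transformer.h.{l}.{m}": rule(l, m) for l in range(52) for m in modules}

print(len(rules))                             # 208 layer-specific rules
print(rules["transformer.h.20.mlp.c_fc"])     # {'bits': 4, 'eora_rank': 128}
print(rules["transformer.h.51.attn.c_proj"])  # {'bits': 8, 'eora_rank': 2048}
```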
### Calibration
- 1,800 samples (7x industry standard of 256)
- 2,048 sequence length
- Code-focused datasets: Code-Feedback, Evol-Instruct-Code, UltraChat, SlimOrca
- Premium calibration for superior quality retention on code tasks
## Performance Benchmarks

### Qualitative Code Tests (7/7 passed)

| Test | Result | Details |
|---|---|---|
| Basic Code Generation | ✅ PASS | Simple function with test assertion |
| LCS Algorithm (Python) | ✅ PASS | Correct DP implementation with backtracking |
| Debounce (JavaScript) | ✅ PASS | Proper clearTimeout + apply pattern |
| Code Explanation | ✅ PASS | Understood memoization decorator |
| Bug Fixing | ✅ PASS | Fixed binary search (// and <=) |
| SQL Generation | ✅ PASS | Clean JOIN with GROUP BY and LIMIT |
| Rust Generation | ✅ PASS | Idiomatic palindrome check |
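The bug-fixing row refers to the two classic binary-search mistakes: using `/` instead of `//` for the midpoint and `<` instead of `<=` for the loop bound. For reference, a correct implementation (not the model's verbatim output):

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:              # <= : the last remaining element is still checked
        mid = (lo + hi) // 2     # // : integer midpoint (float mid breaks indexing)
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([1, 3, 5, 7, 9], 9))   # 4
print(binary_search([1, 3, 5, 7, 9], 4))   # -1
```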
### Inference Performance
| Metric | Value |
|---|---|
| VRAM Usage | 14.5 GB |
| Generation Speed | 30-31 tok/s |
| Tests Passed | 7/7 |
### Code Generation Quality

The quantized model correctly generates:

```python
# LCS Algorithm - Generated correctly with DP + backtracking
def longest_common_subsequence(s1: str, s2: str) -> str:
    """Finds the longest common subsequence (LCS) of two strings."""
    m, n = len(s1), len(s2)
    lcs = [[0 for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    # ... backtracking to build result
```

```javascript
// Debounce - Generated correctly with proper JS patterns
function debounce(func, delay) {
  let timer;
  return function () {
    clearTimeout(timer);
    timer = setTimeout(() => func.apply(this, arguments), delay);
  };
}
```
## Usage

### GPTQModel (Recommended)

```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.from_quantized(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)

prompt = """# Python function to find the longest common subsequence of two strings
def longest_common_subsequence(s1: str, s2: str) -> str:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.2,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)

# Use the same generation code as above
```
### vLLM (Production)

```bash
pip install -U vllm

vllm serve TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ \
  --max-model-len 8192 \
  --trust-remote-code \
  --tensor-parallel-size 1
```
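Once the server is up, vLLM exposes an OpenAI-compatible HTTP API (default port 8000 assumed here). A minimal standard-library client sketch targeting the chat completions endpoint:

```python
import json
from urllib import request

# Request body in the OpenAI chat-completions shape that vLLM accepts.
payload = {
    "model": "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    "messages": [
        {"role": "user",
         "content": "Write a Python function that reverses a linked list."}
    ],
    "max_tokens": 512,
    "temperature": 0.2,
}

def query(url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """Send the payload to a running vLLM server and return the parsed reply."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # requires the server to be running
        return json.loads(resp.read())

print(json.dumps(payload, indent=2))
```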
## Installation

```bash
pip install gptqmodel "transformers>=4.48"
```
## Supported Programming Languages

Granite-20B-Code supports 116 programming languages including:

**Tier 1 (Excellent):** Python, JavaScript, TypeScript, Java, C++, C#, Go, Rust, PHP, Ruby

**Tier 2 (Strong):** Swift, Kotlin, Scala, R, Julia, MATLAB, SQL, Bash, PowerShell, Perl

**Tier 3 (Good):** Lua, Haskell, Erlang, Elixir, Clojure, F#, OCaml, Dart, Groovy, and 90+ more
## Known Issues

- Tokenizer regex warning: harmless and can be safely ignored, or silenced by passing `fix_mistral_regex=True` when loading the tokenizer
## Memory Requirements

### Inference (quantized model)
| Context | VRAM Required |
|---|---|
| Short (2K) | 14-15 GB |
| Medium (4K) | 16-18 GB |
| Full (8K) | 20-24 GB |
*Tested on an RTX 5000 Ada (32 GB); 14.5 GB active VRAM during inference.*

### Quantization (reproduction)
- Minimum: 32GB VRAM + 64GB RAM
- Used: RTX 5000 Ada (32GB) + Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5)
## Quantization Details
| Specification | Value |
|---|---|
| Method | GPTQ + Ultra-Hybrid + EoRA |
| Quantizer | GPTQModel |
| EoRA Boundary Rank | 2048 (layers 0 & 51) |
| EoRA Standard Rank | 128 (layers 1-50) |
| Calibration Samples | 1,800 (7x industry standard) |
| Sequence Length | 2,048 tokens |
| Group Size | 128 |
| desc_act | False |
| sym | True (symmetric quantization) |
| Bits (default) | 4 |
| Layer Rules | 208 custom precision rules |
## Use Cases

Ideal for:
- Code generation across 116 languages
- Code explanation and documentation
- Bug fixing and code review
- Code translation between languages
- SQL query generation
- Enterprise deployments (Apache 2.0)
- Resource-constrained environments (40 GB → 15 GB)
## Technical Specifications
| Specification | Value |
|---|---|
| Model Family | IBM Granite Code |
| Variant | 20B-Code-Instruct-8k |
| Total Parameters | 20B |
| Total Layers | 52 |
| Hidden Size | 6,144 |
| Intermediate Size | 24,576 |
| Attention Heads | 48 |
| KV Heads | 1 (MQA) |
| Activation | GELU |
| Normalization | LayerNorm |
| Position Encoding | Learned absolute |
| Context Length | 8,192 |
| Vocab Size | 49,152 |
| Training Data | 3T+ code tokens |
| Languages | 116 programming languages |
## Training Details
| Phase | Details |
|---|---|
| Phase 1 | 3T tokens from 116 programming languages |
| Phase 2 | 500B tokens high-quality code + natural language |
| Fine-tuning | Instruction tuning for code tasks |
## Acknowledgments
- IBM Research for developing the Granite code model family
- NVIDIA for the EoRA (Eigenspace Low-Rank Approximation) technique used in this quantization
- GPTQModel team for the excellent quantization framework
## License
Apache 2.0 - Enterprise-friendly open source license allowing commercial use, modification, and distribution.
## Citation

```bibtex
@software{granite_20b_code_gptq_2025,
  title  = {IBM Granite-20B-Code-Instruct - TevunahAi Ultra-Hybrid GPTQ with EoRA},
  author = {TevunahAi},
  year   = {2025},
  note   = {Ultra-Hybrid GPTQ with EoRA-2048 boundary layers for maximum code quality retention},
  url    = {https://huggingface.co/TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ}
}

@misc{granite_code_2024,
  title  = {Granite Code Models: A Family of Open Foundation Models for Code Intelligence},
  author = {IBM Research},
  year   = {2024},
  url    = {https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k}
}
```