# IBM Granite-20B-Code-Instruct - TevunahAi Ultra-Hybrid GPTQ with EoRA

## Model Details
| Property | Value |
|---|---|
| Base Model | ibm-granite/granite-20b-code-instruct-8k |
| Architecture | Granite Dense Decoder-only Transformer (GPT-BigCode) |
| Parameters | 20B |
| Context Length | 8,192 tokens |
| Languages | 116 programming languages |
| Quantization | TevunahAi Ultra-Hybrid GPTQ + EoRA |
| Original Size | ~40 GB (BF16) |
| Quantized Size | ~14-15 GB |
| Compression | ~63% reduction |
| Active VRAM | ~14.5 GB (with inference overhead) |
| License | Apache 2.0 |
## Architecture Breakdown
IBM Granite-20B-Code-Instruct is an enterprise-grade code model trained on 3T+ tokens from 116 programming languages:
### Layer Composition (52 total layers)
- 52 Transformer Decoder Layers: GPT-BigCode style architecture
- 48 Attention Heads: Multi-Query Attention (MQA) with 1 KV head
- Hidden Size: 6,144
- Intermediate Size: 24,576
- Vocab Size: 49,152
- GELU Activation: Standard Gaussian Error Linear Unit
- LayerNorm: Traditional layer normalization (not RMSNorm)
- Learned Position Embeddings: Absolute positional encoding (not RoPE)
### Why This Matters
- 116 programming languages: Comprehensive language coverage
- 3T+ training tokens: Massive code understanding
- Enterprise-ready: Apache 2.0 licensed for commercial use
- Multi-Query Attention: Efficient inference with 48:1 query-to-KV ratio
- Instruction-tuned: Optimized for following coding commands
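The 48:1 query-to-KV ratio has a concrete memory payoff: with MQA, all 48 query heads share a single KV head, so the KV cache shrinks 48-fold versus a full multi-head baseline. A back-of-envelope sketch using the figures from the spec tables (fp16 cache assumed):

```python
# Back-of-envelope KV-cache sizing for Granite-20B-Code.
# Figures from the spec tables; fp16 cache is an assumption.
layers = 52
heads = 48
hidden = 6144
head_dim = hidden // heads          # 128
seq_len = 8192                      # full 8K context
bytes_per_elem = 2                  # fp16

def kv_cache_bytes(kv_heads: int) -> int:
    # K and V each store kv_heads * head_dim values per token per layer.
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_elem

mqa = kv_cache_bytes(1)    # Multi-Query Attention: one shared KV head
mha = kv_cache_bytes(48)   # hypothetical full multi-head baseline

print(f"MQA cache: {mqa / 2**20:.0f} MiB")   # ~208 MiB
print(f"MHA cache: {mha / 2**30:.1f} GiB")   # ~9.8 GiB
print(f"ratio: {mha // mqa}:1")              # 48:1
```

At full context the MQA cache stays around 0.2 GiB, which is why the attention design matters for fitting this model on a single 24-32 GB card.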
## Quantization Strategy

**TevunahAi Ultra-Hybrid Mixed-Precision with EoRA Error Recovery**

This quantization uses EoRA (Eigenspace Low-Rank Approximation), NVIDIA's training-free technique for recovering quantization error through low-rank adapters computed during the quantization process.
| Component | Precision | EoRA Rank | Rationale |
|---|---|---|---|
| Layer 0 (all projections) | INT8 | 2048 | Maximum error correction at input - errors propagate through entire model |
| Layer 51 (all projections) | INT8 | 2048 | Maximum error correction at output - directly affects token prediction |
| Attention c_attn/c_proj (layers 1-50) | INT8 | 128 | Quality preservation for code understanding |
| MLP c_fc/c_proj (layers 1-42) | INT4 | 128 | Maximum compression in middle layers |
| MLP c_fc/c_proj (layers 43-50) | INT8 | 128 | Higher precision near output |
| Embeddings | BF16 | - | Preserved for token accuracy |
| LM Head | BF16 | - | Preserved for output quality |
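The idea behind EoRA can be illustrated numerically: quantization leaves a residual error `W - W_q`, and a low-rank adapter approximating that residual is added back at inference time. The toy sketch below uses plain power iteration to peel off a rank-1 term; actual EoRA projects the residual into an eigenspace weighted by calibration activations, which this deliberately does not reproduce:

```python
import random

random.seed(0)
m = n = 32
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

# Crude uniform quantizer (illustrative only, not GPTQ itself).
step = 0.2
Wq = [[round(w / step) * step for w in row] for row in W]
R = [[W[i][j] - Wq[i][j] for j in range(n)] for i in range(m)]  # residual error

def frob2(M):
    """Squared Frobenius norm."""
    return sum(x * x for row in M for x in row)

# Rank-1 approximation of the residual via power iteration.
v = [1.0] * n
for _ in range(50):
    u = [sum(R[i][j] * v[j] for j in range(n)) for i in range(m)]
    nu = max(sum(x * x for x in u) ** 0.5, 1e-12)
    u = [x / nu for x in u]
    v = [sum(R[i][j] * u[i] for i in range(m)) for j in range(n)]
    sigma = max(sum(x * x for x in v) ** 0.5, 1e-12)
    v = [x / sigma for x in v]

# Adding sigma * u v^T back recovers part of the quantization error.
R1 = [[sigma * u[i] * v[j] for j in range(n)] for i in range(m)]
corrected = frob2([[R[i][j] - R1[i][j] for j in range(n)] for i in range(m)])
print(f"error removed by rank-1 adapter: {1 - corrected / frob2(R):.1%}")
```

Higher adapter rank captures more of the residual's spectrum, which is why the boundary layers here get rank 2048 while the interior layers use rank 128.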
### Why EoRA-2048 Boundaries?
- Layer 0: First layer errors compound through all 51 subsequent layers
- Layer 51: Final layer directly determines next token prediction
- Rank 2048: Maximum error correction capacity (~8-10% recovery vs ~5% for rank 128)
- 208 layer-specific rules: not a blanket quantization; each projection is optimized individually
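The 208-rule figure follows from 52 layers × 4 quantized projections per layer. A sketch that reproduces the per-projection assignments from the precision table (module paths follow the GPT-BigCode naming convention; the actual rule format consumed by the quantizer may differ):

```python
def rule(layer: int, module: str) -> dict:
    """Precision/EoRA-rank assignment for one projection, per the table above."""
    if layer in (0, 51):                        # boundary layers: max error correction
        return {"bits": 8, "eora_rank": 2048}
    if module in ("attn.c_attn", "attn.c_proj"):
        return {"bits": 8, "eora_rank": 128}    # attention kept at INT8 throughout
    if 1 <= layer <= 42:                        # middle MLP layers: max compression
        return {"bits": 4, "eora_rank": 128}
    return {"bits": 8, "eora_rank": 128}        # MLP layers 43-50: near-output INT8

modules = ["attn.c_attn", "attn.c_proj", "mlp.c_fc", "mlp.c_proj"]
rules = {f"transformer.h.{l}.{m}": rule(l, m) for l in range(52) for m in modules}

print(len(rules))                             # 208 layer-specific rules
print(rules["transformer.h.20.mlp.c_fc"])     # {'bits': 4, 'eora_rank': 128}
print(rules["transformer.h.51.attn.c_proj"])  # {'bits': 8, 'eora_rank': 2048}
```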
### Calibration
- 1,800 samples (7x industry standard of 256)
- 2,048 sequence length
- Code-focused datasets: Code-Feedback, Evol-Instruct-Code, UltraChat, SlimOrca
- Premium calibration for superior quality retention on code tasks
## Performance Benchmarks

### Qualitative Code Tests (7/7 passed)

| Test | Result | Details |
|---|---|---|
| Basic Code Generation | ✅ PASS | Simple function with test assertion |
| LCS Algorithm (Python) | ✅ PASS | Correct DP implementation with backtracking |
| Debounce (JavaScript) | ✅ PASS | Proper clearTimeout + apply pattern |
| Code Explanation | ✅ PASS | Understood memoization decorator |
| Bug Fixing | ✅ PASS | Fixed binary search (// and <=) |
| SQL Generation | ✅ PASS | Clean JOIN with GROUP BY and LIMIT |
| Rust Generation | ✅ PASS | Idiomatic palindrome check |
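The bug-fixing row refers to the two classic binary-search mistakes: using `/` instead of `//` for the midpoint and `<` instead of `<=` for the loop bound. For reference, a correct implementation (not the model's verbatim output):

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:              # <= : the last remaining element is still checked
        mid = (lo + hi) // 2     # // : integer midpoint (float mid breaks indexing)
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([1, 3, 5, 7, 9], 9))   # 4
print(binary_search([1, 3, 5, 7, 9], 4))   # -1
```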
### Inference Performance
| Metric | Value |
|---|---|
| VRAM Usage | 14.5 GB |
| Generation Speed | 30-31 tok/s |
| Tests Passed | 7/7 |
### Code Generation Quality

The quantized model correctly generates:

```python
# LCS Algorithm - Generated correctly with DP + backtracking
def longest_common_subsequence(s1: str, s2: str) -> str:
    """Finds the longest common subsequence (LCS) of two strings."""
    m, n = len(s1), len(s2)
    lcs = [[0 for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    # ... backtracking to build result
```

```javascript
// Debounce - Generated correctly with proper JS patterns
function debounce(func, delay) {
  let timer;
  return function () {
    clearTimeout(timer);
    timer = setTimeout(() => func.apply(this, arguments), delay);
  };
}
```
## Usage

### GPTQModel (Recommended)

```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.from_quantized(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)

prompt = """# Python function to find the longest common subsequence of two strings
def longest_common_subsequence(s1: str, s2: str) -> str:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.2,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)

# Use the same generation code as above
```
### vLLM (Production)

```bash
pip install -U vllm

vllm serve TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ \
  --max-model-len 8192 \
  --trust-remote-code \
  --tensor-parallel-size 1
```
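Once the server is up, vLLM exposes an OpenAI-compatible HTTP API (default port 8000 assumed here). A minimal standard-library client sketch targeting the chat completions endpoint:

```python
import json
from urllib import request

# Request body in the OpenAI chat-completions shape that vLLM accepts.
payload = {
    "model": "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    "messages": [
        {"role": "user",
         "content": "Write a Python function that reverses a linked list."}
    ],
    "max_tokens": 512,
    "temperature": 0.2,
}

def query(url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """Send the payload to a running vLLM server and return the parsed reply."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # requires the server to be running
        return json.loads(resp.read())

print(json.dumps(payload, indent=2))
```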
## Installation

```bash
pip install gptqmodel "transformers>=4.48"
```
## Supported Programming Languages

Granite-20B-Code supports 116 programming languages including:

**Tier 1 (Excellent):** Python, JavaScript, TypeScript, Java, C++, C#, Go, Rust, PHP, Ruby

**Tier 2 (Strong):** Swift, Kotlin, Scala, R, Julia, MATLAB, SQL, Bash, PowerShell, Perl

**Tier 3 (Good):** Lua, Haskell, Erlang, Elixir, Clojure, F#, OCaml, Dart, Groovy, and 90+ more
## Known Issues

- Tokenizer regex warning: harmless and can be safely ignored, or silenced by passing `fix_mistral_regex=True` when loading the tokenizer
## Memory Requirements

### Inference (quantized model)
| Context | VRAM Required |
|---|---|
| Short (2K) | 14-15 GB |
| Medium (4K) | 16-18 GB |
| Full (8K) | 20-24 GB |
*Tested on an RTX 5000 Ada (32 GB); 14.5 GB active VRAM during inference.*

### Quantization (reproduction)
- Minimum: 32GB VRAM + 64GB RAM
- Used: RTX 5000 Ada (32GB) + Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5)
## Quantization Details
| Specification | Value |
|---|---|
| Method | GPTQ + Ultra-Hybrid + EoRA |
| Quantizer | GPTQModel |
| EoRA Boundary Rank | 2048 (layers 0 & 51) |
| EoRA Standard Rank | 128 (layers 1-50) |
| Calibration Samples | 1,800 (7x industry standard) |
| Sequence Length | 2,048 tokens |
| Group Size | 128 |
| desc_act | False |
| sym | True (symmetric quantization) |
| Bits (default) | 4 |
| Layer Rules | 208 custom precision rules |
## Use Cases

Ideal for:
- Code generation across 116 languages
- Code explanation and documentation
- Bug fixing and code review
- Code translation between languages
- SQL query generation
- Enterprise deployments (Apache 2.0)
- Resource-constrained environments (40 GB → 15 GB)
## Technical Specifications
| Specification | Value |
|---|---|
| Model Family | IBM Granite Code |
| Variant | 20B-Code-Instruct-8k |
| Total Parameters | 20B |
| Total Layers | 52 |
| Hidden Size | 6,144 |
| Intermediate Size | 24,576 |
| Attention Heads | 48 |
| KV Heads | 1 (MQA) |
| Activation | GELU |
| Normalization | LayerNorm |
| Position Encoding | Learned absolute |
| Context Length | 8,192 |
| Vocab Size | 49,152 |
| Training Data | 3T+ code tokens |
| Languages | 116 programming languages |
## Training Details
| Phase | Details |
|---|---|
| Phase 1 | 3T tokens from 116 programming languages |
| Phase 2 | 500B tokens high-quality code + natural language |
| Fine-tuning | Instruction tuning for code tasks |
## Acknowledgments
- IBM Research for developing the Granite code model family
- NVIDIA for the EoRA (Eigenspace Low-Rank Approximation) technique used in this quantization
- GPTQModel team for the excellent quantization framework
## License
Apache 2.0 - Enterprise-friendly open source license allowing commercial use, modification, and distribution.
## Citation

```bibtex
@software{granite_20b_code_gptq_2025,
  title  = {IBM Granite-20B-Code-Instruct - TevunahAi Ultra-Hybrid GPTQ with EoRA},
  author = {TevunahAi},
  year   = {2025},
  note   = {Ultra-Hybrid GPTQ with EoRA-2048 boundary layers for maximum code quality retention},
  url    = {https://huggingface.co/TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ}
}

@misc{granite_code_2024,
  title  = {Granite Code Models: A Family of Open Foundation Models for Code Intelligence},
  author = {IBM Research},
  year   = {2024},
  url    = {https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k}
}
```