IBM Granite-20B-Code-Instruct - TevunahAi Ultra-Hybrid GPTQ with EoRA

Model Details

| Property | Value |
|---|---|
| Base Model | ibm-granite/granite-20b-code-instruct-8k |
| Architecture | Granite Dense Decoder-only Transformer (GPT-BigCode) |
| Parameters | 20B |
| Context Length | 8,192 tokens |
| Languages | 116 programming languages |
| Quantization | TevunahAi Ultra-Hybrid GPTQ + EoRA |
| Original Size | ~40 GB (BF16) |
| Quantized Size | ~14-15 GB |
| Compression | ~63% reduction |
| Active VRAM | ~14.5 GB (with inference overhead) |
| License | Apache 2.0 |

Architecture Breakdown

IBM Granite-20B-Code-Instruct is an enterprise-grade code model trained on 3T+ tokens from 116 programming languages:

Layer Composition (52 total layers)

  • 52 Transformer Decoder Layers: GPT-BigCode style architecture
  • 48 Attention Heads: Multi-Query Attention (MQA) with 1 KV head
  • Hidden Size: 6,144
  • Intermediate Size: 24,576
  • Vocab Size: 49,152
  • GELU Activation: Standard Gaussian Error Linear Unit
  • LayerNorm: Traditional layer normalization (not RMSNorm)
  • Learned Position Embeddings: Absolute positional encoding (not RoPE)
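The Multi-Query Attention layout above has a concrete memory payoff: with 1 KV head instead of 48, the KV cache shrinks by 48x. A back-of-envelope sketch using the architecture numbers listed above (assuming a bf16 cache at 2 bytes per value):

```python
# KV-cache footprint per token, from the architecture numbers above.
# Back-of-envelope only; assumes a bf16 (2-byte) cache.
layers = 52
hidden = 6144
heads = 48
kv_heads = 1                      # Multi-Query Attention
head_dim = hidden // heads        # 128

bytes_per_value = 2               # bf16
# K and V each store kv_heads * head_dim values per layer per token
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value

mqa_cache_8k_mb = kv_bytes_per_token * 8192 / 2**20
mha_cache_8k_mb = mqa_cache_8k_mb * heads  # hypothetical full multi-head cache

print(f"MQA KV cache @ 8K context: {mqa_cache_8k_mb:.0f} MiB")   # 208 MiB
print(f"Full MHA would need:       {mha_cache_8k_mb / 1024:.1f} GiB")
```

At the full 8K context, the MQA cache stays around 200 MiB, versus roughly 10 GiB for an equivalent full multi-head cache.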

Why This Matters

  • 116 programming languages: Comprehensive language coverage
  • 3T+ training tokens: Massive code understanding
  • Enterprise-ready: Apache 2.0 licensed for commercial use
  • Multi-Query Attention: Efficient inference with 48:1 query-to-KV ratio
  • Instruction-tuned: Optimized for following coding commands

Quantization Strategy

TevunahAi Ultra-Hybrid Mixed-Precision with EoRA Error Recovery

This quantization uses EoRA (Eigenspace Low-Rank Approximation), NVIDIA's training-free technique for compensating quantization error with low-rank adapters computed during the quantization process.

| Component | Precision | EoRA Rank | Rationale |
|---|---|---|---|
| Layer 0 (all projections) | INT8 | 2048 | Maximum error correction at input - errors propagate through the entire model |
| Layer 51 (all projections) | INT8 | 2048 | Maximum error correction at output - directly affects token prediction |
| Attention c_attn/c_proj (layers 1-50) | INT8 | 128 | Quality preservation for code understanding |
| MLP c_fc/c_proj (layers 1-42) | INT4 | 128 | Maximum compression in middle layers |
| MLP c_fc/c_proj (layers 43-50) | INT8 | 128 | Higher precision near output |
| Embeddings | BF16 | - | Preserved for token accuracy |
| LM Head | BF16 | - | Preserved for output quality |

Why EoRA-2048 Boundaries?

  • Layer 0: First layer errors compound through all 51 subsequent layers
  • Layer 51: Final layer directly determines next token prediction
  • Rank 2048: Maximum error correction capacity (~8-10% recovery vs ~5% for rank 128)
  • 208 layer-specific rules: Not a blanket quantization - each projection optimized individually
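The intuition behind rank-dependent error recovery can be shown with a toy sketch. Real EoRA projects the quantization error into an eigenspace derived from calibration activations; the plain SVD of the weight residual below is a deliberate simplification, used only to illustrate why a higher adapter rank recovers more of the error:

```python
import numpy as np

# Toy illustration: low-rank compensation of quantization error.
# NOTE: real EoRA is activation-aware (eigenspace from calibration data);
# plain SVD of the residual is a simplification for illustration.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)

# Crude symmetric 4-bit per-tensor quantization
scale = np.abs(W).max() / 7.0
W_q = np.clip(np.round(W / scale), -8, 7) * scale
residual = W - W_q                     # the error the adapter tries to recover

def rank_k_correction(E: np.ndarray, r: int) -> np.ndarray:
    """Best rank-r Frobenius approximation of the residual via SVD (B @ A)."""
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vt[:r]

w_norm = np.linalg.norm(W)
base_err = np.linalg.norm(residual) / w_norm
err_16 = np.linalg.norm(residual - rank_k_correction(residual, 16)) / w_norm
err_128 = np.linalg.norm(residual - rank_k_correction(residual, 128)) / w_norm

assert err_128 < err_16 < base_err     # higher rank recovers more error
```

The corrected weight served at inference is effectively `W_q + B @ A`, so the adapter's rank caps how much of the residual can be reabsorbed, which is why the boundary layers get rank 2048 while interior layers make do with rank 128.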

Calibration

  • 1,800 samples (7x industry standard of 256)
  • 2,048 sequence length
  • Code-focused datasets: Code-Feedback, Evol-Instruct-Code, UltraChat, SlimOrca
  • Premium calibration for superior quality retention on code tasks
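The claimed 7x margin over a common 256-sample baseline follows directly from the settings above:

```python
# Token budget implied by the calibration settings above
samples, seq_len = 1800, 2048
total_tokens = samples * seq_len        # 3,686,400 calibration tokens
baseline_tokens = 256 * 2048            # a common 256-sample baseline
ratio = total_tokens / baseline_tokens  # ~7.0x
print(total_tokens, round(ratio, 2))
```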

Performance Benchmarks

Qualitative Code Tests (7/7 passed)

| Test | Result | Details |
|---|---|---|
| Basic Code Generation | ✅ PASS | Simple function with test assertion |
| LCS Algorithm (Python) | ✅ PASS | Correct DP implementation with backtracking |
| Debounce (JavaScript) | ✅ PASS | Proper clearTimeout + apply pattern |
| Code Explanation | ✅ PASS | Understood memoization decorator |
| Bug Fixing | ✅ PASS | Fixed binary search (// and <=) |
| SQL Generation | ✅ PASS | Clean JOIN with GROUP BY and LIMIT |
| Rust Generation | ✅ PASS | Idiomatic palindrome check |

Inference Performance

| Metric | Value |
|---|---|
| VRAM Usage | 14.5 GB |
| Generation Speed | 30-31 tok/s |
| Tests Passed | 7/7 |

Code Generation Quality

The quantized model correctly generates:

```python
# LCS Algorithm - Generated correctly with DP + backtracking
def longest_common_subsequence(s1: str, s2: str) -> str:
    """Finds the longest common subsequence (LCS) of two strings."""
    m, n = len(s1), len(s2)
    lcs = [[0 for _ in range(n + 1)] for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    # ... backtracking to build result
```

```javascript
// Debounce - Generated correctly with proper JS patterns
function debounce(func, delay) {
  let timer;
  return function() {
    clearTimeout(timer);
    timer = setTimeout(() => func.apply(this, arguments), delay);
  };
}
```

Usage

GPTQModel (Recommended)

```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.from_quantized(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)

prompt = """# Python function to find the longest common subsequence of two strings
def longest_common_subsequence(s1: str, s2: str) -> str:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.2,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)

# Use the same generation code as above
```

vLLM (Production)

```shell
pip install -U vllm

vllm serve TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ \
    --max-model-len 8192 \
    --trust-remote-code \
    --tensor-parallel-size 1
```

Installation

```shell
pip install gptqmodel "transformers>=4.48"
```

Note: the version constraint must be quoted, otherwise the shell interprets `>` as output redirection.

Supported Programming Languages

Granite-20B-Code supports 116 programming languages including:

Tier 1 (Excellent): Python, JavaScript, TypeScript, Java, C++, C#, Go, Rust, PHP, Ruby

Tier 2 (Strong): Swift, Kotlin, Scala, R, Julia, MATLAB, SQL, Bash, PowerShell, Perl

Tier 3 (Good): Lua, Haskell, Erlang, Elixir, Clojure, F#, OCaml, Dart, Groovy, and 90+ more

Known Issues

  • Tokenizer regex warning: Can be safely ignored or fixed with fix_mistral_regex=True when loading tokenizer

Memory Requirements

Inference (quantized model)

| Context | VRAM Required |
|---|---|
| Short (2K) | 14-15 GB |
| Medium (4K) | 16-18 GB |
| Full (8K) | 20-24 GB |

Tested on: RTX 5000 Ada (32GB) - 14.5 GB active VRAM during inference

Quantization (reproduction)

  • Minimum: 32GB VRAM + 64GB RAM
  • Used: RTX 5000 Ada (32GB) + Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5)

Quantization Details

| Specification | Value |
|---|---|
| Method | GPTQ + Ultra-Hybrid + EoRA |
| Quantizer | GPTQModel |
| EoRA Boundary Rank | 2048 (layers 0 & 51) |
| EoRA Standard Rank | 128 (layers 1-50) |
| Calibration Samples | 1,800 (7x industry standard) |
| Sequence Length | 2,048 tokens |
| Group Size | 128 |
| desc_act | False |
| sym | True (symmetric quantization) |
| Bits (default) | 4 |
| Layer Rules | 208 custom precision rules |
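The 208-rule count follows from 52 layers x 4 quantized projections. The precision schedule in the mixed-precision table above can be expressed as a lookup function; the module names and rule format below are illustrative, not the shipped GPTQModel configuration:

```python
# Sketch of the per-layer precision schedule described above.
# Module names and the (bits, rank) tuple format are illustrative only.
ATTN = ("attn.c_attn", "attn.c_proj")
MLP = ("mlp.c_fc", "mlp.c_proj")

def rule(layer: int, module: str) -> tuple[int, int]:
    """Return (bits, eora_rank) for one projection, per the hybrid scheme."""
    if layer in (0, 51):            # boundary layers: INT8 + rank-2048 EoRA
        return 8, 2048
    if module in ATTN:              # attention path: INT8 + rank-128
        return 8, 128
    if module in MLP:
        if 1 <= layer <= 42:        # middle-layer MLPs: INT4 for compression
            return 4, 128
        return 8, 128               # layers 43-50: INT8 near the output
    raise ValueError(f"unknown module: {module}")

# 52 layers x 4 projections = 208 layer-specific rules
n_rules = sum(1 for layer in range(52) for _ in ATTN + MLP)
```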

Use Cases

Ideal for:

  • πŸ’» Code generation across 116 languages
  • πŸ“ Code explanation and documentation
  • πŸ› Bug fixing and code review
  • πŸ”„ Code translation between languages
  • πŸ“Š SQL query generation
  • 🏒 Enterprise deployments (Apache 2.0)
  • πŸ”§ Resource-constrained environments (40GB β†’ 15GB)

Technical Specifications

| Specification | Value |
|---|---|
| Model Family | IBM Granite Code |
| Variant | 20B-Code-Instruct-8k |
| Total Parameters | 20B |
| Total Layers | 52 |
| Hidden Size | 6,144 |
| Intermediate Size | 24,576 |
| Attention Heads | 48 |
| KV Heads | 1 (MQA) |
| Activation | GELU |
| Normalization | LayerNorm |
| Position Encoding | Learned absolute |
| Context Length | 8,192 |
| Vocab Size | 49,152 |
| Training Data | 3T+ code tokens |
| Languages | 116 programming languages |

Training Details

| Phase | Details |
|---|---|
| Phase 1 | 3T tokens from 116 programming languages |
| Phase 2 | 500B tokens of high-quality code + natural language |
| Fine-tuning | Instruction tuning for code tasks |

Acknowledgments

  • IBM Research for developing the Granite code model family
  • NVIDIA for the EoRA (Eigenspace Low-Rank Approximation) technique used in this quantization
  • GPTQModel team for the excellent quantization framework

License

Apache 2.0 - Enterprise-friendly open source license allowing commercial use, modification, and distribution.

Citation

```bibtex
@software{granite_20b_code_gptq_2025,
  title = {IBM Granite-20B-Code-Instruct - TevunahAi Ultra-Hybrid GPTQ with EoRA},
  author = {TevunahAi},
  year = {2025},
  note = {Ultra-Hybrid GPTQ with EoRA-2048 boundary layers for maximum code quality retention},
  url = {https://huggingface.co/TevunahAi/Granite-20B-Code-Instruct-TevunahAi-GPTQ}
}

@misc{granite_code_2024,
  title = {Granite Code Models: A Family of Open Foundation Models for Code Intelligence},
  author = {IBM Research},
  year = {2024},
  url = {https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k}
}
```

https://huggingface.co/TevunahAi
