---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- industrial-code
- pretrained
- base-model
- verilog
- cuda
- triton
- chip-design
- cad
---
# InCoder-32B-Base: Code Foundation Model for Industrial Scenarios
<div align="center">

[🤗 Model](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base) | [GitHub](https://github.com/CSJianYang/Industrial-Coder) | [Paper](https://huggingface.co/papers/2603.16790) | [License](LICENSE)

</div>
## Model Summary
**InCoder-32B-Base** is the pre-trained base model of the InCoder family, the first 32B-parameter code foundation model purpose-built for industrial code intelligence. This is the base (non-instruction-tuned) checkpoint, suitable for code completion, fill-in-the-middle (FIM), and further fine-tuning.
For the instruction-tuned variant, see [IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder). For the reasoning variant, see [IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking).
Presented in the paper [InCoder-32B: Code Foundation Model for Industrial Scenarios](https://huggingface.co/papers/2603.16790), InCoder-32B unifies code intelligence across five industrial domains:
| Domain | Languages & Frameworks |
|---|---|
| **Chip Design** | Verilog, SystemVerilog, RTL |
| **GPU Kernel Optimization** | CUDA, Triton |
| **Embedded Systems** | C/C++, ARM Cortex-M4, STM32 |
| **Compiler Optimization** | x86-64 ASM, C/C++, LLVM-IR |
| **3D Modeling / CAD** | CadQuery, OpenCascade, Python |
---
## Model Architecture
InCoder-32B-Base adopts a standard decoder-only Transformer architecture:
| Hyperparameter | Value |
|---|---|
| Parameters | ~32B |
| Layers | 64 |
| Hidden Size | 5,120 |
| Attention Heads | 40 (8 KV heads, GQA) |
| Max Context Length | 131,072 (128K) |
| Positional Encoding | RoPE (θ = 500,000) |
| Precision | BFloat16 |
| Vocabulary Size | 76,800 |
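The hyperparameters above can be sanity-checked against the ~32B total with a back-of-the-envelope count. This is only a sketch: the SwiGLU-style FFN and its intermediate size of 27,648 are assumptions, not published values, and exact layer shapes may differ.

```python
# Rough parameter estimate from the architecture table.
# The FFN intermediate size below is an assumption for illustration.
L = 64                 # layers
d = 5120               # hidden size
n_q, n_kv = 40, 8      # query heads / KV heads (GQA)
head_dim = d // n_q    # 128
vocab = 76_800

# Attention: Q and output projections are d x d; K and V are shrunk by GQA.
attn = d * d + d * d + 2 * (n_kv * head_dim) * d   # per layer
# A SwiGLU-style FFN (gate, up, down) with assumed intermediate size 27,648.
ffn_inter = 27_648
ffn = 3 * d * ffn_inter                            # per layer
embed = vocab * d                                  # embedding matrix, counted once

total = L * (attn + ffn) + embed
print(f"~{total / 1e9:.1f}B parameters")  # → ~31.6B parameters
```

The estimate lands within a few percent of the stated 32B, which is the expected level of agreement given the assumed FFN width.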
---
## Training Pipeline: Code-Flow
InCoder-32B-Base is trained through a two-stage **Code-Flow** pipeline:
### Stage 1 – Pre-training & Annealing
- **Industrial Recall**: Data pipeline using rule-based filtering, FastText classifiers, and semantic retrieval for Verilog, CUDA, firmware C, and CadQuery.
- **Refinement**: OCR extraction from technical manuals, multi-level deduplication, and repository-level fork consolidation.
- **Training**: 15T total tokens using Autoregressive LM + Fill-in-the-Middle (FIM) objectives on 4,096 GPUs.
### Stage 2 – Mid-Training (Context Extension)
Context window extended progressively from 8K to 128K tokens:
- **8K → 32K**: Targets file-level tasks like completing RTL modules or kernel functions.
- **32K → 128K**: Unlocks long-context capabilities for extended debugging and cross-module projects.
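The large RoPE base (θ = 500,000) from the architecture table is what makes this extension workable: the slowest-rotating frequency must have a wavelength well beyond the 128K window so distant positions remain distinguishable. A small sketch, assuming head dimension 128 (5,120 hidden / 40 heads) and the standard RoPE frequency layout:

```python
import math

# Sketch: why RoPE base theta = 500,000 supports a 128K context.
# head_dim is inferred from the architecture table; the model's
# internal frequency layout may differ in detail.
theta = 500_000.0
head_dim = 128

# Standard RoPE inverse frequencies: theta^(-2i/d) for i = 0 .. d/2 - 1.
inv_freq = [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Wavelength of the slowest-rotating pair of dimensions, in tokens.
longest_wavelength = 2 * math.pi / inv_freq[-1]
print(f"longest wavelength: {longest_wavelength:,.0f} tokens")
```

The longest wavelength comes out in the millions of tokens, comfortably above the 131,072-token window.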
---
## Usage
### Installation
```bash
pip install transformers accelerate
```
### Code Completion
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
prompt = """// Synthesizable Verilog: UART transmitter (8N1 protocol)
module uart_tx (
input wire clk,
input wire rst_n,
input wire [7:0] data_in,
input wire tx_start,
output reg tx,
output reg tx_busy
);
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.2,
do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Fill-in-the-Middle (FIM)
InCoder-32B-Base supports FIM completion for code infilling tasks:
```python
prefix = """// CUDA kernel for RMS Normalization
__global__ void rms_norm_kernel(float* output, const float* input,
const float* weight, int N, float eps) {
int idx = blockIdx.x;
"""
suffix = """
output[idx * N + tid] = normalized * weight[tid];
}"""
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
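After a FIM generation, the infilled middle still has to be spliced back between the prefix and suffix to get a complete file. A minimal helper sketch; it assumes the output is decoded with `skip_special_tokens=False` so the `<|fim_middle|>` marker survives (the exact special tokens should be confirmed against the model's tokenizer config):

```python
# Sketch: stitch a FIM generation back into a complete source file.
# Assumes decoding with skip_special_tokens=False so the marker is present.
def splice_fim(generated: str, prefix: str, suffix: str) -> str:
    """Extract the infilled middle and return prefix + middle + suffix."""
    marker = "<|fim_middle|>"
    # Everything after the marker is the generated middle; if the marker
    # is missing, fall back to treating the whole output as the middle.
    middle = generated.split(marker, 1)[-1]
    return prefix + middle + suffix

# Tiny placeholder example:
print(splice_fim("<|fim_prefix|>a<|fim_suffix|>c<|fim_middle|>b", "a", "c"))  # → abc
```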
### Deployment with vLLM
```bash
vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Base \
--tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
```
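Once the server is running, it exposes vLLM's OpenAI-compatible API. A minimal client sketch, assuming the default host and port (`localhost:8000`); the network call is left commented out so the snippet can be read without a live server:

```python
import json
import urllib.request

# Build a request for vLLM's OpenAI-compatible /v1/completions endpoint.
# Host and port are assumptions (vLLM defaults to http://localhost:8000).
payload = {
    "model": "Multilingual-Multimodal-NLP/IndustrialCoder-Base",
    "prompt": "// Verilog: parameterized synchronous FIFO\nmodule fifo #(",
    "max_tokens": 256,
    "temperature": 0.2,
}
body = json.dumps(payload).encode()

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running server:
# print(json.loads(urllib.request.urlopen(req).read())["choices"][0]["text"])
```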
---
## Fine-tuning
We provide an SFT framework in the [GitHub repository](https://github.com/CSJianYang/Industrial-Coder/tree/main/sft). See the README for data preparation and training instructions.
---
## Model Family
| Model | Type | HuggingFace |
|---|---|---|
| InCoder-32B-Base | Pre-trained | [🤗 IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base) |
| InCoder-32B | Instruct | [🤗 IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) |
| InCoder-32B-Thinking | Reasoning | [🤗 IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) |
| InCoder-32B-FP8 | FP8 Quantized | [🤗 IndustrialCoder-32B-FP8](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-FP8) |
| InCoder-32B-AWQ-INT4 | AWQ INT4 | [🤗 IndustrialCoder-32B-AWQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-AWQ-INT4) |
| InCoder-32B-GPTQ-INT4 | GPTQ INT4 | [🤗 IndustrialCoder-32B-GPTQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-GPTQ-INT4) |
---
## Limitations & Disclaimers
This is a **base model**: it has not been instruction-tuned and does not follow conversational instructions. It is best suited for:
- Code completion and generation
- Fill-in-the-middle (FIM) tasks
- Further fine-tuning for downstream applications
Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.
---
## Citation
```bibtex
@article{yang2026incoder,
title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
and others},
journal={arXiv preprint arXiv:2603.16790},
year={2026}
}
```