---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- industrial-code
- reasoning
- thinking
- verilog
- cuda
- triton
- chip-design
- cad
---

# InCoder-32B-Thinking: Reasoning Code Model for Industrial Scenarios

<div align="center">

[Hugging Face](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking)
[GitHub](https://github.com/CSJianYang/Industrial-Coder)
[Paper](https://huggingface.co/papers/2603.16790)
[License](LICENSE)

</div>

## Model Summary

**InCoder-32B-Thinking** is the reasoning variant of the InCoder family. It extends [InCoder-32B](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) with chain-of-thought reasoning inside `<think>...</think>` tags, so the model decomposes a problem step by step before generating code. This is particularly effective for complex industrial tasks that require multi-step reasoning, such as debugging RTL modules, optimizing GPU kernels, or diagnosing embedded firmware issues.

For the instruction-tuned variant (without thinking), see [IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder). For the pre-trained base model, see [IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base).

---

## Key Results

### General Code Benchmarks

| Benchmark | InCoder-32B | InCoder-32B-Thinking |
|---|:---:|:---:|
| HumanEval+ | 89.6 | **91.5** |
| MBPP+ | 78.3 | **80.1** |
| BigCodeBench (Full) | 49.8 | **51.2** |
| LiveCodeBench (Pass@1) | 49.14 | **52.3** |

### Industrial Code Benchmarks

| Benchmark | Domain | InCoder-32B | InCoder-32B-Thinking |
|---|---|:---:|:---:|
| VeriScope Score | Chip Design | 80.7 | **82.3** |
| CAD-Coder Compile (%) | 3D Modeling | 82.0 | **84.0** |
| KernelBench L1 (%) | GPU Optimization | 22.2 | **24.0** |

> The thinking variant improves on every general and industrial benchmark listed here (by roughly 1.4 to 3.2 points), with the largest gains on tasks requiring multi-step reasoning.
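
The per-benchmark gains can be tabulated directly from the two tables. The sketch below uses only the scores listed there; the variable names are illustrative.

```python
# Scores copied from the tables above: (InCoder-32B, InCoder-32B-Thinking)
scores = {
    "HumanEval+": (89.6, 91.5),
    "MBPP+": (78.3, 80.1),
    "BigCodeBench (Full)": (49.8, 51.2),
    "LiveCodeBench (Pass@1)": (49.14, 52.3),
    "VeriScope Score": (80.7, 82.3),
    "CAD-Coder Compile (%)": (82.0, 84.0),
    "KernelBench L1 (%)": (22.2, 24.0),
}

# Absolute gain of the thinking variant, in points
gains = {name: round(thinking - base, 2) for name, (base, thinking) in scores.items()}

# Print benchmarks from largest to smallest gain
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} +{gain:.2f}")
```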

---

## Model Architecture

Same architecture as InCoder-32B, with thinking-aware post-training:

| Hyperparameter | Value |
|---|---|
| Parameters | ~32B |
| Layers | 64 |
| Hidden Size | 5,120 |
| Attention Heads | 40 (8 KV heads, GQA) |
| Max Context Length | 131,072 (128K) |
| Positional Encoding | RoPE (θ = 500,000) |
| Precision | BFloat16 |
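
As a rough worked example, the table is enough to estimate weight and KV-cache memory. The sketch makes three assumptions that the table does not state: the per-head dimension equals hidden size divided by the number of attention heads, "~32B" is taken as exactly 32e9 parameters, and every value is stored in 2 bytes (BFloat16). Activations and framework overhead are ignored.

```python
# Values from the architecture table
LAYERS = 64
HIDDEN_SIZE = 5120
ATTENTION_HEADS = 40
KV_HEADS = 8
MAX_CONTEXT = 131_072
BYTES_PER_VALUE = 2   # BFloat16
PARAMS = 32e9         # assumption: "~32B" taken as exactly 32e9

# Assumption: head_dim = hidden_size / attention_heads (not stated in the table)
head_dim = HIDDEN_SIZE // ATTENTION_HEADS           # 128
queries_per_kv_head = ATTENTION_HEADS // KV_HEADS   # 5 query heads share each KV head

# KV cache per token: keys + values, for every layer and every KV head
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * head_dim * BYTES_PER_VALUE
kv_gib_full_context = kv_bytes_per_token * MAX_CONTEXT / 2**30
weights_gib = PARAMS * BYTES_PER_VALUE / 2**30

print(f"KV cache per token:     {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache at 128K:       {kv_gib_full_context:.0f} GiB per sequence")
print(f"Weights in BFloat16:    {weights_gib:.1f} GiB")
```

Under these assumptions a single sequence at the full 128K context needs about 32 GiB of KV cache on top of roughly 60 GiB of weights, which is one reason to cap the context length when serving.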

---

## How Thinking Mode Works

InCoder-32B-Thinking generates a reasoning trace inside `<think>...</think>` tags before producing the final answer. This allows the model to:

1. **Decompose** complex problems into sub-tasks
2. **Reason** about constraints, edge cases, and hardware semantics
3. **Plan** the solution structure before writing code

Example output:

```
<think>
The user wants a UART transmitter module. Let me think through the design:
1. Need a state machine: IDLE -> START_BIT -> DATA_BITS -> STOP_BIT
2. 8N1 means: 8 data bits, no parity, 1 stop bit
3. Need a baud rate counter derived from the clock frequency
4. Shift register to serialize the 8-bit data LSB first
</think>
module uart_tx (
    input wire clk,
    ...
```

You can **disable** thinking mode to get direct answers (the model then behaves like the instruct variant):

```python
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```

---

## Usage

### Installation

```bash
pip install transformers accelerate
```

### Thinking Mode (default)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Optimize this CUDA kernel for better memory coalescing:\n__global__ void add(float *a, float *b, float *c, int N) {\n int i = threadIdx.x;\n if (i < N) c[i] = a[i] + b[i];\n}"}
]

# Thinking mode (default): the model reasons before answering
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    # temperature/top_p/top_k only take effect when sampling is enabled
    out = model.generate(
        **inputs, max_new_tokens=4096, do_sample=True,
        temperature=0.6, top_p=0.85, top_k=20,
    )

# Decode only the newly generated tokens (the prompt is sliced off)
output = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)

# Parse thinking and response; split once so a later "</think>" stays in the response
if "</think>" in output:
    thinking, response = output.split("</think>", 1)
    thinking = thinking.replace("<think>\n", "").strip()
    response = response.strip()
    print(f"Thinking:\n{thinking}\n\nResponse:\n{response}")
else:
    print(output)
```
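
The parsing step can also be wrapped in a small reusable helper. This is a sketch, not part of the model's API: `split_thinking` is a hypothetical name, and it assumes the trace is delimited by literal `<think>` and `</think>` strings as in the example output.

```python
from typing import Optional, Tuple


def split_thinking(output: str) -> Tuple[Optional[str], str]:
    """Split decoded text into (thinking, response).

    Returns (None, output) when no closing </think> tag is present,
    for example with thinking disabled or when generation was cut off.
    """
    head, sep, tail = output.partition("</think>")
    if not sep:
        return None, output.strip()
    # The opening tag may be missing from the decoded text (some chat
    # templates place it in the prompt), so remove it only if present.
    thinking = head.replace("<think>", "", 1).strip()
    return thinking, tail.strip()


thinking, response = split_thinking("<think>\nplan the FSM\n</think>\nmodule uart_tx;")
print(thinking)   # plan the FSM
print(response)   # module uart_tx;
```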

### Non-Thinking Mode

```python
# Disable thinking: direct answer without a reasoning trace
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```

### With Tool Calls

```python
tools = [{
    "type": "function",
    "function": {
        "name": "run_verilog_sim",
        "description": "Run Verilog simulation with Icarus Verilog",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Verilog source code"},
                "testbench": {"type": "string", "description": "Testbench code"}
            }
        }
    }
}]

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, tools=tools
)
```
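
After generation, the application still has to execute whatever tool call the model requested. The wire format of a tool call depends on the model's chat template and is not documented here, so the sketch below assumes the call has already been parsed into a dict with `name` and `arguments` keys. `dispatch_tool_call`, `TOOL_REGISTRY`, and the stub `run_verilog_sim` are hypothetical, and the `tool` role in the returned message is a common convention rather than something this model card confirms.

```python
import json
from typing import Any, Callable, Dict


def run_verilog_sim(code: str, testbench: str) -> str:
    """Stub for illustration; a real version would invoke a simulator."""
    return f"simulated {len(code)} chars of design with {len(testbench)} chars of testbench"


# Map tool names (as declared in `tools`) to local Python callables
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"run_verilog_sim": run_verilog_sim}


def dispatch_tool_call(call: Dict[str, Any]) -> Dict[str, str]:
    """Run one parsed tool call and wrap the result as a chat message."""
    name = call["name"]
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    arguments = call["arguments"]
    # Arguments are sometimes delivered as a JSON string rather than a dict
    if isinstance(arguments, str):
        arguments = json.loads(arguments)
    result = TOOL_REGISTRY[name](**arguments)
    return {"role": "tool", "name": name, "content": str(result)}


message = dispatch_tool_call({
    "name": "run_verilog_sim",
    "arguments": '{"code": "module m; endmodule", "testbench": "module tb; endmodule"}',
})
print(message["content"])
```

The returned message would be appended to `messages` before calling `apply_chat_template` again, so the model can read the simulation result.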

### Deployment with vLLM

```bash
vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Thinking \
    --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
```
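
Once the server is running, it can be queried over HTTP. The sketch below assumes vLLM's OpenAI-compatible endpoint on its default port 8000, and that the server accepts `top_k` as an extra sampling field; `build_chat_request` and `chat` are illustrative helper names. The sampling values are the thinking-mode ones recommended in this card, and only the standard library is used.

```python
import json
import urllib.request

MODEL_ID = "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking"
# Assumption: vLLM's default host and port
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, max_tokens: int = 8192) -> dict:
    """Build a chat-completions payload with thinking-mode sampling values."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.85,
        "top_k": 20,  # not in the OpenAI schema; assumed to be accepted by vLLM
        "max_tokens": max_tokens,
    }


def chat(prompt: str) -> str:
    """POST the request to the server and return the assistant message text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the payload needs no running server; calling chat() does.
payload = build_chat_request("Write an 8N1 UART transmitter in Verilog.", max_tokens=2048)
print(json.dumps(payload, indent=2))
```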

### Recommended Sampling Parameters

| Use case | temperature | top_p | top_k | max_new_tokens |
|---|:---:|:---:|:---:|:---:|
| Thinking (default) | 0.6 | 0.85 | 20 | 8192 |
| Non-thinking / precise | 0.2 | 0.95 | - | 4096 |
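
For local inference with `model.generate`, the table can be turned into keyword-argument presets. This is a minimal sketch: `SAMPLING_PRESETS` and `generation_kwargs` are hypothetical names, and `do_sample=True` is added because Transformers ignores `temperature`, `top_p`, and `top_k` under greedy decoding.

```python
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.85, "top_k": 20, "max_new_tokens": 8192},
    # The table lists no top_k for this mode, so the key is left out;
    # Transformers then falls back to the model's generation config.
    "non_thinking": {"temperature": 0.2, "top_p": 0.95, "max_new_tokens": 4096},
}


def generation_kwargs(mode: str = "thinking") -> dict:
    """Return keyword arguments for `model.generate` for the given mode."""
    if mode not in SAMPLING_PRESETS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(SAMPLING_PRESETS)}")
    # Build a fresh dict so callers can tweak values without mutating the presets
    return {"do_sample": True, **SAMPLING_PRESETS[mode]}


print(generation_kwargs("non_thinking"))
```

With a loaded model, a call would then look like `model.generate(**inputs, **generation_kwargs("thinking"))`.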

---

## Model Family

| Model | Type | HuggingFace |
|---|---|---|
| InCoder-32B-Base | Pre-trained | [🤗 IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base) |
| InCoder-32B | Instruct | [🤗 IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) |
| **InCoder-32B-Thinking** | **Reasoning** | [🤗 IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) |
| InCoder-32B-FP8 | FP8 Quantized | [🤗 IndustrialCoder-32B-FP8](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-FP8) |
| InCoder-32B-AWQ-INT4 | AWQ INT4 | [🤗 IndustrialCoder-32B-AWQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-AWQ-INT4) |
| InCoder-32B-GPTQ-INT4 | GPTQ INT4 | [🤗 IndustrialCoder-32B-GPTQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-GPTQ-INT4) |

---

## Limitations & Disclaimers

- The thinking trace may occasionally contain reasoning errors or hallucinated constraints, so always verify the final code output.
- For simple tasks, thinking mode adds latency; use `enable_thinking=False` for straightforward generation.
- Based on failure analysis, the model may struggle with:
  - **API Knowledge**: Linker errors from undefined HAL/CMSIS functions in embedded C.
  - **Functional Semantics**: Producing compilable but functionally incorrect RTL under complex logic scenarios.
  - **Optimization**: Correct but sub-optimal GPU kernel performance.

Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.

---

## Citation

```bibtex
@article{yang2026incoder,
  title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
  author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
          and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
          and others},
  journal={arXiv preprint arXiv:2603.16790},
  year={2026}
}
```