CodeRM-GRPO-Selection-8B — NOESIS AWQ-INT4

AWQ-INT4 (GEMM kernel) quantization of CodeRM-GRPO-Selection-8B, a code-domain reward model trained with GRPO (Group Relative Policy Optimization) on top of Qwen/Qwen3-8B. This bundle is the deployment variant for the NOESIS-VC-ONE platform — it fits a 6 GB GPU (RTX 3060 Laptop) and serves as the M5-CODE branch best-of-N selector at inference time.

| Field | Value |
| --- | --- |
| Architecture | `Qwen3ForCausalLM` (scoring backbone) |
| Hidden size | 4 096 |
| Layers | 36 |
| Attention heads | 32 |
| KV heads | 8 (GQA) |
| Head dim | 128 |
| Vocab | 151 936 (Qwen3 standard) |
| Context length | 32 768 (positional 40 960) |
| Base model | `Qwen/Qwen3-8B` (Apache 2.0) |
| Fine-tune method | GRPO (Shao et al., DeepSeekMath / DeepSeek-R1 lineage) |
| Quantization | AWQ INT4, GEMM kernel, group_size=128, zero_point=true |
| Bundle size | ~6.1 GB on disk (down from ~16 GB BF16) |
| Runtime VRAM | ~5.5 GB peak (fits RTX 3060 6 GB) |
| Required runtime | `transformers >= 5.8.1` with native `AwqConfig` |
| License | Apache 2.0 (inherited from Qwen3-8B + GRPO fine-tune) |
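
The attention geometry in the table determines how much memory the KV cache adds per token on top of the quantized weights. A rough sketch of that arithmetic, assuming an fp16 cache (2 bytes per element) and no cache quantization; the function name is illustrative and not part of the bundle:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Values taken from the table: 36 layers, 8 KV heads (GQA), head dim 128.
per_token = kv_cache_bytes_per_token(layers=36, kv_heads=8, head_dim=128)
print(per_token)                 # 147456 bytes, i.e. 144 KiB per token
print(per_token * 2048 / 2**20)  # 288.0 MiB for a 2,048-token sequence
```

With only 8 KV heads instead of 32, the cache is a quarter of what full multi-head attention would need, which matters on a 6 GB card.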

What's in this bundle

| File | Purpose |
| --- | --- |
| `model-00001-of-00002.safetensors` (4.0 GB) | AWQ-quantized weight shard 1/2 |
| `model-00002-of-00002.safetensors` (2.1 GB) | AWQ-quantized weight shard 2/2 |
| `model.safetensors.index.json` | shard map (qweight / qzeros / scales per Linear) |
| `config.json` | `quantization_config.quant_method="awq"` + AWQ params |
| `tokenizer.json` / `tokenizer_config.json` | Qwen3 BPE tokenizer (vocab 151 936) |
| `chat_template.jinja` | Qwen3 standard chat template |
| `generation_config.json` | inherited defaults |
| `noesis_provenance.json` | full NOESIS provenance (see below) |
| `LICENSE` | Apache 2.0 |
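
The shard index maps every tensor name to the shard file that stores it. A small sketch of grouping those names by Linear layer to confirm each one carries the three AWQ tensors; it assumes the usual safetensors index layout with a top-level `weight_map` key, and the sample tensor names below are made up for illustration:

```python
from collections import defaultdict

AWQ_SUFFIXES = ("qweight", "qzeros", "scales")


def group_awq_tensors(weight_map: dict[str, str]) -> dict[str, set[str]]:
    """Map each quantized Linear module to the AWQ tensor kinds found for it."""
    modules: dict[str, set[str]] = defaultdict(set)
    for tensor_name in weight_map:
        # Split "a.b.q_proj.qweight" into module "a.b.q_proj" and kind "qweight".
        module, _, kind = tensor_name.rpartition(".")
        if kind in AWQ_SUFFIXES:
            modules[module].add(kind)
    return dict(modules)


# Illustrative entries only; a real map would come from
# json.load(open("model.safetensors.index.json"))["weight_map"].
sample_map = {
    "model.layers.0.self_attn.q_proj.qweight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.qzeros": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.scales": "model-00001-of-00002.safetensors",
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
}
grouped = group_awq_tensors(sample_map)
incomplete = [m for m, kinds in grouped.items() if kinds != set(AWQ_SUFFIXES)]
print(sorted(grouped), incomplete)
```

Unquantized tensors such as the embedding are skipped because their names do not end in one of the three AWQ suffixes.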

Quantization details (sealed in `noesis_provenance.json`)

| Parameter | Value |
| --- | --- |
| Method | AWQ via `autoawq` |
| Kernel | GEMM |
| `w_bit` | 4 |
| `q_group_size` | 128 |
| `zero_point` | `true` |
| Calibration samples | 64 |
| Calibration `max_seq_len` | 384 |
| Calibration source | `noesis_router_dataset_50k_curated.jsonl` |
| RNG seed | `1729` |
| Wall-clock | 57.13 minutes |
| `force_arch_override` | `null` (auto-detected `Qwen3ForCausalLM`) |
| NOESIS framework | DHCF-FNO v15.7 |
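
To make `w_bit`, `q_group_size`, and `zero_point` concrete, here is a minimal pure-Python sketch of group-wise asymmetric quantization. It shows only the affine rounding step applied per group of 128 weights. It does not reproduce the activation-aware scale search that AWQ performs before this step, and the function names are illustrative rather than `autoawq` APIs:

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one weight group to w_bit integers."""
    qmax = (1 << w_bit) - 1                       # 15 for INT4
    lo, hi = min(weights), max(weights)
    scale = max((hi - lo) / qmax, 1e-5)           # guard against a constant group
    zero = min(qmax, max(0, round(-lo / scale)))  # integer zero point in [0, qmax]
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Recover approximate float weights from integers, scale, and zero point."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, group_size=128, w_bit=4):
    """Split a weight row into groups; each group gets its own scale and zero point."""
    return [quantize_group(row[i:i + group_size], w_bit)
            for i in range(0, len(row), group_size)]


# Toy weights in [-1, 1): two groups of 128 values each.
row = [((i % 128) - 64) / 64 for i in range(256)]
groups = quantize_row(row)
q, scale, zero = groups[0]
restored = dequantize_group(q, scale, zero)
max_err = max(abs(a - b) for a, b in zip(row[:128], restored))
print(len(groups), zero, max_err <= scale / 2 + 1e-9)
```

Each group stores 128 four-bit integers plus one scale and one zero point, which is why the shard index lists `qweight`, `qzeros`, and `scales` for every quantized Linear layer.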

What GRPO-Selection means

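Group Relative Policy Optimization (GRPO) evaluates each candidate relative to the other candidates sampled for the same prompt: a candidate's reward is compared against the group's mean (and typically scaled by the group's standard deviation) instead of against a learned value baseline. For code-reward selection that translates to: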

1. Sample N candidates from a code-generation expert (e.g. M5-CODE).
2. Run each candidate through this reward model → group of N scalar scores.
3. Pick argmax (best-of-N) or rank-based selection (top-k).

Group-relative scoring is intended to give sharper preference signals than classic absolute-reward PPO on coding tasks where many candidates are "almost correct" but only a few actually pass tests.
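
The group-relative idea and the two selection modes can be sketched in a few lines. This is an illustration under stated assumptions, not the training code: it uses the population standard deviation with a small `eps` for stability, and the raw scores are hypothetical values rather than real model outputs:

```python
from statistics import fmean, pstdev


def group_relative(scores, eps=1e-6):
    """Normalize raw scores against their own group: (s - mean) / (std + eps)."""
    mu = fmean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def top_k(scores, k=1):
    """Indices of the k highest-scoring candidates, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


raw = [0.20, 0.90, 0.40, 0.85]      # hypothetical reward-model outputs
relative = group_relative(raw)
best = top_k(relative, k=1)[0]      # best-of-N (argmax)
shortlist = top_k(relative, k=2)    # rank-based selection
print(best, shortlist)
```

Because the normalization is monotonic, it leaves the ranking within a group unchanged. Its effect is on the training signal, where it centres the rewards so that only better-than-average candidates receive a positive advantage.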

Quick start

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

bundle = "AMAImedia/CodeRM-GRPO-Selection-8B-NOESIS-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(bundle)
model = AutoModelForCausalLM.from_pretrained(
    bundle,
    device_map={"": 0},          # AWQ kernels prefer single-device load
    dtype=torch.float16,         # AWQ activations are fp16
).eval()


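def score(prompt: str, code: str) -> float:
    """Return a scalar score for `code` as an answer to `prompt`.

    The score is the highest next-token probability the model assigns
    after the assistant turn, so larger values mean higher model confidence.
    """
    # Format the pair as one Qwen3 chat exchange (user prompt, assistant code).
    text = (
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n{code}<|im_end|>"
    )
    ids = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Keep only the logits at the final position of the sequence.
        logits = model(**ids).logits[:, -1, :]
    return float(logits.softmax(-1).max())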


candidates = [
    "def solve(n):\n    return n + 1",
    "def solve(n):\n    return n * 2",
]
prompt = "Write a function that returns n+1."
scores = [score(prompt, c) for c in candidates]
print("Best:", candidates[scores.index(max(scores))])

AWQ runtime note. `transformers >= 5.8.1` reads the `quantization_config` block in `config.json` and instantiates the AWQ kernels automatically, so no `autoawq` import is needed at inference time. The `autoawq` library is only required if you want to re-quantize from the BF16 sources.
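
Before loading, it can be useful to confirm that a local copy of `config.json` carries the quantization settings documented here. A hedged sketch: the key names `bits`, `group_size`, `zero_point`, and `version` are an assumption based on how AWQ configs are commonly serialized, so check them against the actual file, and `check_awq_config` is a hypothetical helper:

```python
def check_awq_config(config: dict) -> list[str]:
    """Return a list of mismatches against the settings this bundle documents."""
    # Assumed key names; verify against the real quantization_config block.
    expected = {"quant_method": "awq", "bits": 4, "group_size": 128, "zero_point": True}
    qcfg = config.get("quantization_config") or {}
    return [
        f"{key}: expected {want!r}, found {qcfg.get(key)!r}"
        for key, want in expected.items()
        if qcfg.get(key) != want
    ]


# Illustrative fragment; load the real one with json.load(open("config.json")).
sample = {"quantization_config": {"quant_method": "awq", "bits": 4,
                                  "group_size": 128, "zero_point": True,
                                  "version": "gemm"}}
print(check_awq_config(sample))   # empty list when everything matches
print(check_awq_config({}))       # one message per missing setting
```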

NOESIS integration

In the NOESIS-VC-ONE platform CodeRM serves as the code-domain selector inside the M5-CODE branch:

M5-CODE generates N candidates
        │
        ▼
CodeRM-GRPO-Selection-8B-NOESIS-AWQ-INT4   ← (this bundle)
        │   group-relative scores
        ▼
Orchestrator picks argmax
        │
        ▼
QC-4B verifies executability

Apache 2.0 lineage end-to-end keeps this branch commercial-clean.
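
The flow in the diagram can be sketched as a small selection function. Everything here is a stand-in: `select_candidate` is a hypothetical name, the toy scorer replaces the reward model, and `compiles` only checks that the code parses, which is far weaker than whatever QC-4B actually verifies:

```python
from typing import Callable, Optional, Sequence


def select_candidate(
    prompt: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
    verify: Callable[[str], bool],
) -> Optional[str]:
    """Score every candidate, pick the argmax, and return it only if it verifies."""
    if not candidates:
        return None
    scores = [score(prompt, code) for code in candidates]
    best = candidates[max(range(len(scores)), key=scores.__getitem__)]
    return best if verify(best) else None


def compiles(code: str) -> bool:
    """Stand-in verifier: accepts code that parses as Python."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def toy_score(prompt: str, code: str) -> float:
    """Toy scorer for illustration only; in practice this is the reward model."""
    return float("n + 1" in code)


picked = select_candidate(
    "Write a function that returns n+1.",
    ["def solve(n):\n    return n * 2", "def solve(n):\n    return n + 1"],
    toy_score,
    compiles,
)
print(picked)
```

Passing the scorer and verifier as callables keeps the orchestration logic independent of which models sit behind them.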

Sealed NOESIS rules

R-APACHE-CLEAN — Apache 2.0 preserved (Qwen3 base + GRPO fine-tune + AWQ quant).
R-REWARD-MODEL-FROZEN — reward model is frozen during inference; no gradient feedback into M5-CODE at production runtime.
R-BEST-OF-N-CAP — production best-of-N selection is capped at N=8 to bound VRAM / latency on RTX 3060.
R-AWQ-DEVICE-MAP-SINGLE — AWQ kernels require `device_map={"": 0}`, never `"auto"`. This mirrors the NF4 rule and applies for the same reason: the kernel expects the full computation graph on one device.
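
Two of these rules lend themselves to simple runtime guards. A minimal sketch, assuming the caller passes the requested N and the intended `device_map` before loading; the helper names are hypothetical and not part of NOESIS:

```python
BEST_OF_N_CAP = 8  # from R-BEST-OF-N-CAP


def clamp_best_of_n(requested: int, cap: int = BEST_OF_N_CAP) -> int:
    """Bound the number of sampled candidates to the production cap."""
    if requested < 1:
        raise ValueError("best-of-N needs at least one candidate")
    return min(requested, cap)


def require_single_device(device_map) -> dict:
    """Reject "auto" or multi-device maps, per R-AWQ-DEVICE-MAP-SINGLE."""
    if not isinstance(device_map, dict) or len(set(device_map.values())) != 1:
        raise ValueError(
            f"expected a single-device map such as {{'': 0}}, got {device_map!r}"
        )
    return device_map


print(clamp_best_of_n(16))              # 8
print(require_single_device({"": 0}))   # {'': 0}
```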

Provenance

| Step | Source / output |
| --- | --- |
| Base weights | `Qwen/Qwen3-8B` (© Alibaba / Qwen Team, Apache 2.0) |
| Fine-tune | GRPO on code-reward dataset (author's pipeline) |
| Source format | BF16 native safetensors × 4 shards (~16 GB) |
| Quantization | AWQ INT4 GEMM via `autoawq`, group_size=128, w_bit=4 |
| Bundle | this repo — `AMAImedia/CodeRM-GRPO-Selection-8B-NOESIS-AWQ-INT4` |
| NOESIS slot | M5-CODE branch reward-selection head |

License

Apache License 2.0. Qwen3-8B base © Alibaba Cloud / Qwen Team (2025-2026). GRPO fine-tune © CodeRM-GRPO-Selection author(s). NOESIS AWQ-INT4 quantization layer © AMAImedia 2026 (NOESIS DHCF-FNO project). See LICENSE.

NOESIS DHCF-FNO framework — AMAImedia.com. BF16 source bundle: AMAImedia/CodeRM-GRPO-Selection-8B (if/when published).
