CodeRM-GRPO-Selection-8B — NOESIS AWQ-INT4

AWQ-INT4 (GEMM kernel) quantization of CodeRM-GRPO-Selection-8B, a code-domain reward model trained with GRPO (Group Relative Policy Optimization) on top of Qwen/Qwen3-8B. This bundle is the deployment variant for the NOESIS-VC-ONE platform — it fits a 6 GB GPU (RTX 3060 Laptop) and serves as the M5-CODE branch best-of-N selector at inference time.

| Field | Value |
| --- | --- |
| Architecture | Qwen3ForCausalLM (scoring backbone) |
| Hidden size | 4 096 |
| Layers | 36 |
| Attention heads | 32 |
| KV heads | 8 (GQA) |
| Head dim | 128 |
| Vocab | 151 936 (Qwen3 standard) |
| Context length | 32 768 (positional 40 960) |
| Base model | Qwen/Qwen3-8B (Apache 2.0) |
| Fine-tune method | GRPO (Shao et al., DeepSeekMath / DeepSeek-R1 lineage) |
| Quantization | AWQ INT4, GEMM kernel, group_size=128, zero_point=true |
| Bundle size | ~6.1 GB on disk (down from ~16 GB BF16) |
| Runtime VRAM | ~5.5 GB peak (fits RTX 3060 6 GB) |
| Required runtime | transformers >= 5.8.1 with native AwqConfig |
| License | Apache 2.0 (inherited from Qwen3-8B + GRPO fine-tune) |
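
The attention figures in the table determine how much memory the key/value cache needs per token, which matters on a 6 GB card. The following is a rough back-of-the-envelope sketch, assuming the cache is stored in fp16 (2 bytes per value); the function name `kv_cache_bytes` is illustrative and not part of the bundle.

```python
# Values taken from the architecture table.
LAYERS = 36
KV_HEADS = 8       # grouped-query attention: 8 KV heads serve 32 query heads
HEAD_DIM = 128
BYTES_FP16 = 2     # assumption: the KV cache is kept in fp16


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for num_tokens positions."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # 2 = key + value
    return per_token * num_tokens


per_token_kib = kv_cache_bytes(1) / 1024
full_context_gib = kv_cache_bytes(32_768) / 1024 ** 3
print(f"{per_token_kib:.0f} KiB per token, {full_context_gib:.2f} GiB at 32768 tokens")
```

Under that assumption the cache costs 144 KiB per token, so a full 32 768-token context would need about 4.5 GiB on top of the weights. On a 6 GB GPU the practical sequence length is therefore much shorter than the nominal context length.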

What's in this bundle

| File | Purpose |
| --- | --- |
| model-00001-of-00002.safetensors (4.0 GB) | AWQ-quantized weight shard 1/2 |
| model-00002-of-00002.safetensors (2.1 GB) | AWQ-quantized weight shard 2/2 |
| model.safetensors.index.json | shard map (qweight / qzeros / scales per Linear) |
| config.json | quantization_config.quant_method="awq" + AWQ params |
| tokenizer.json / tokenizer_config.json | Qwen3 BPE tokenizer (vocab 151 936) |
| chat_template.jinja | Qwen3 standard chat template |
| generation_config.json | inherited defaults |
| noesis_provenance.json | full NOESIS provenance (see below) |
| LICENSE | Apache 2.0 |

Quantization details (sealed in noesis_provenance.json)

| Parameter | Value |
| --- | --- |
| Method | AWQ via autoawq |
| Kernel | GEMM |
| w_bit | 4 |
| q_group_size | 128 |
| zero_point | true |
| Calibration samples | 64 |
| Calibration max_seq_len | 384 |
| Calibration source | noesis_router_dataset_50k_curated.jsonl |
| RNG seed | 1729 |
| Wall-clock | 57.13 minutes |
| force_arch_override | null (auto-detected Qwen3ForCausalLM) |
| NOESIS framework | DHCF-FNO v15.7 |
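
To make `w_bit`, `q_group_size` and `zero_point` concrete, here is a minimal pure-Python sketch of group-wise asymmetric 4-bit quantization. It is plain round-to-nearest on toy data and only illustrates what the three parameters control. It is not the AWQ algorithm itself, which additionally searches activation-aware scaling factors using the calibration samples. All function names are illustrative.

```python
import math


def quantize_group(weights, w_bit=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    q_max = 2 ** w_bit - 1                        # 15 for w_bit=4
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max or 1.0        # fall back to 1.0 for a constant group
    zero = max(0, min(q_max, round(-w_min / scale)))   # integer zero point
    codes = [max(0, min(q_max, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Quantize a row in consecutive groups, one (scale, zero) pair per group."""
    return [
        quantize_group(row[i:i + q_group_size], w_bit)
        for i in range(0, len(row), q_group_size)
    ]


row = [0.1 * math.sin(i) + 0.03 for i in range(256)]   # toy weights, not model data
groups = quantize_row(row)
restored = [w for codes, scale, zero in groups
            for w in dequantize_group(codes, scale, zero)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), "groups, max abs error", max_error)
```

With `q_group_size=128`, each run of 128 weights shares one scale and one zero point, which is why the shard map lists `qweight`, `qzeros` and `scales` tensors for every Linear layer.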

What GRPO-Selection means

Group Relative Policy Optimization (GRPO) samples a group of candidate outputs for the same prompt and scores each one relative to the others in that group, rather than against an absolute return estimated by a separate value model. For code-reward selection that translates to:

  1. Sample N candidates from a code-generation expert (e.g. M5-CODE).
  2. Run each candidate through this reward model → group of N scalar scores.
  3. Pick argmax (best-of-N) or rank-based selection (top-k).

Group-relative scoring produces sharper preference signals than classic absolute-reward PPO on coding tasks where many candidates are "almost correct" but only a few actually pass tests.
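
The selection steps above can be sketched in a few lines. The group-relative term uses the mean and standard deviation normalization from the GRPO formulation; the helper names and the toy scores are illustrative assumptions, not part of this bundle.

```python
from statistics import mean, pstdev


def group_relative(scores, eps=1e-8):
    """Normalize scores within one group: (score - group mean) / group std."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def best_of_n(candidates, scores):
    """Return the candidate with the highest score (argmax)."""
    return candidates[max(range(len(scores)), key=scores.__getitem__)]


def top_k(candidates, scores, k):
    """Return the k highest-scoring candidates, best first."""
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return [candidates[i] for i in order[:k]]


candidates = ["a", "b", "c", "d"]      # placeholders for generated code
scores = [0.2, 0.9, 0.5, 0.4]          # toy reward-model outputs
advantages = group_relative(scores)

print(best_of_n(candidates, scores))   # highest raw score
print(top_k(candidates, scores, 2))    # two best, in rank order
```

Because the normalization is a monotonic rescaling within the group, argmax and top-k give the same result on raw scores and on group-relative scores. The normalization matters during training, where it sets the size and sign of each candidate's update.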

Quick start

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

bundle = "AMAImedia/CodeRM-GRPO-Selection-8B-NOESIS-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(bundle)
model = AutoModelForCausalLM.from_pretrained(
    bundle,
    device_map={"": 0},          # AWQ kernels prefer single-device load
    dtype=torch.float16,         # AWQ activations are fp16
).eval()


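def score(prompt: str, code: str) -> float:
    """Return a scalar score for one candidate solution to `prompt`."""
    # Format the pair as a single user/assistant turn with Qwen3 chat markers.
    text = (
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n{code}<|im_end|>"
    )
    ids = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():  # inference only, no gradients needed
        # Next-token logits at the final position of the sequence.
        logits = model(**ids).logits[:, -1, :]
    # The score is the highest next-token probability at that position.
    return float(logits.softmax(-1).max())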


candidates = [
    "def solve(n):\n    return n + 1",
    "def solve(n):\n    return n * 2",
]
prompt = "Write a function that returns n+1."
scores = [score(prompt, c) for c in candidates]
print("Best:", candidates[scores.index(max(scores))])

AWQ runtime note. transformers >= 5.8.1 reads the quantization_config block in config.json and instantiates the AWQ kernels automatically — no autoawq import is needed at inference time. The autoawq library is only required if you want to re-quantize from BF16 sources.
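
Before loading several gigabytes of weights, it can be useful to confirm that a local `config.json` really declares AWQ. The sketch below checks only the `quantization_config.quant_method` field named in the file list; the helper name `uses_awq` and the trimmed sample config are illustrative assumptions.

```python
import json


def uses_awq(config: dict) -> bool:
    """True if the config declares quantization_config.quant_method == "awq"."""
    quant = config.get("quantization_config") or {}
    return quant.get("quant_method") == "awq"


# Trimmed, hypothetical stand-in for the bundle's config.json.
sample = json.loads(
    '{"architectures": ["Qwen3ForCausalLM"],'
    ' "quantization_config": {"quant_method": "awq"}}'
)
print(uses_awq(sample))
```

With a downloaded copy of the bundle, the same check would run on the parsed contents of its `config.json` instead of the sample string.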

NOESIS integration

In the NOESIS-VC-ONE platform CodeRM serves as the code-domain selector inside the M5-CODE branch:

M5-CODE generates N candidates
        │
        ▼
CodeRM-GRPO-Selection-8B-NOESIS-AWQ-INT4   ← (this bundle)
        │   group-relative scores
        ▼
Orchestrator picks argmax
        │
        ▼
QC-4B verifies executability

The Apache 2.0 lineage (Qwen3 base, GRPO fine-tune, AWQ quantization) is unbroken end to end, which keeps this branch clear for commercial use.
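
The flow in the diagram can be sketched as a small function that takes the three stages as callables. Everything here is a stand-in: `fake_generate` and `fake_score` are hypothetical replacements for M5-CODE and the reward model, and `compiles` is only a syntax check, far weaker than whatever QC-4B actually verifies. Returning `None` when verification fails is also an assumption; the source does not say what the orchestrator does in that case.

```python
from typing import Callable, List, Optional


def select_and_verify(
    prompt: str,
    generate: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    verify: Callable[[str], bool],
) -> Optional[str]:
    """Generate candidates, pick the top-scoring one, and keep it only if verified."""
    candidates = generate(prompt)
    scores = [score(prompt, c) for c in candidates]
    best = candidates[scores.index(max(scores))]
    return best if verify(best) else None


def compiles(code: str) -> bool:
    """Stand-in verifier: checks that the code parses, not that it is correct."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def fake_generate(prompt: str) -> List[str]:      # hypothetical M5-CODE stand-in
    return ["def solve(n):\n    return n * 2", "def solve(n):\n    return n + 1"]


def fake_score(prompt: str, code: str) -> float:  # hypothetical reward-model stand-in
    return 1.0 if "n + 1" in code else 0.0


chosen = select_and_verify(
    "Write a function that returns n+1.", fake_generate, fake_score, compiles
)
print(chosen)
```

Passing the stages as callables keeps the selector independent of how candidates are produced or checked, so the stubs can be swapped for real model calls without changing the selection logic.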

Sealed NOESIS rules

  • R-APACHE-CLEAN — Apache 2.0 preserved (Qwen3 base + GRPO fine-tune + AWQ quant).
  • R-REWARD-MODEL-FROZEN — reward model is frozen during inference; no gradient feedback into M5-CODE at production runtime.
  • R-BEST-OF-N-CAP — production best-of-N selection is capped at N=8 to bound VRAM / latency on RTX 3060.
  • R-AWQ-DEVICE-MAP-SINGLE — AWQ kernels require device_map={"":0}, never "auto" (mirrors the NF4 rule and applies for the same reason: kernel expects the full computation graph on one device).
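
Two of these rules are easy to check mechanically before a request is served. The following is a minimal sketch of such a pre-flight check; `rule_violations` is a hypothetical helper, not part of the NOESIS codebase, and it treats any dict that maps every entry to one device as satisfying the single-device rule.

```python
MAX_BEST_OF_N = 8   # cap from R-BEST-OF-N-CAP


def rule_violations(n: int, device_map) -> list:
    """Return the names of the sealed rules that the given settings break."""
    broken = []
    if not 1 <= n <= MAX_BEST_OF_N:
        broken.append("R-BEST-OF-N-CAP")
    single_device = (
        isinstance(device_map, dict) and len(set(device_map.values())) == 1
    )
    if not single_device:                 # also rejects the string "auto"
        broken.append("R-AWQ-DEVICE-MAP-SINGLE")
    return broken


print(rule_violations(8, {"": 0}))     # compliant settings
print(rule_violations(16, "auto"))     # breaks both rules
```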

Provenance

| Step | Source / output |
| --- | --- |
| Base weights | Qwen/Qwen3-8B (© Alibaba / Qwen Team, Apache 2.0) |
| Fine-tune | GRPO on code-reward dataset (author's pipeline) |
| Source format | BF16 native safetensors × 4 shards (~16 GB) |
| Quantization | AWQ INT4 GEMM via autoawq, group_size=128, w_bit=4 |
| Bundle | this repo: AMAImedia/CodeRM-GRPO-Selection-8B-NOESIS-AWQ-INT4 |
| NOESIS slot | M5-CODE branch reward-selection head |

License

Apache License 2.0. Qwen3-8B base © Alibaba Cloud / Qwen Team (2025-2026). GRPO fine-tune © CodeRM-GRPO-Selection author(s). NOESIS AWQ-INT4 quantization layer © AMAImedia 2026 (NOESIS DHCF-FNO project). See LICENSE.


NOESIS DHCF-FNO framework — AMAImedia.com. BF16 source bundle: AMAImedia/CodeRM-GRPO-Selection-8B (if/when published).
