
Model Summary

CodeLens-Reviewer is a model fine-tuned to act as a senior code reviewer. Built on Qwen2.5-Coder (7B), it was instruction-tuned by Arsh Verma to detect bugs, identify security vulnerabilities, and critique architectural flaws in Python pull requests. It interacts directly with the CodeLens Evaluation Environment, emitting structured JSON actions that drive an automated code review loop.

Usage

This model is intended to be used either standalone for code inference or plugged into the live CodeLens Evaluation Environment.

Inference with Unsloth

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ArshVerma/CodeLens-Reviewer", 
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

prompt = """You are an expert code reviewer. Review the following code and output a JSON action indicating any bugs or issues you find.

Code:
def process(data):
    for i in data:
        if i == "remove": data.remove(i)
    return data
"""

messages = [
    {"role": "system", "content": "You are an expert code reviewer. Output only valid JSON."},
    {"role": "user", "content": prompt}
]

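inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=256)
# Decode only the newly generated tokens so the prompt and special tokens are not echoed back.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))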

Shape of Inputs/Outputs:

  • Input: Natural language instructions and a Python code snippet.
  • Output: A strict JSON object containing action, issue_description, filename, line_number, severity, and category.
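Because downstream code depends on this output shape, it can help to validate each response before using it. The following is a minimal sketch, not the CodeLens backend's own parser: the six field names come from the list above, while the type check on line_number and the sample values (such as "flag_issue") are assumptions made for illustration.

```python
import json

# Field names are taken from the model card; everything else is an assumption.
REQUIRED_FIELDS = (
    "action", "issue_description", "filename",
    "line_number", "severity", "category",
)

def parse_review_action(raw: str) -> dict:
    """Parse model output and check that the documented fields are present."""
    action = json.loads(raw)  # raises json.JSONDecodeError on non-JSON text
    if not isinstance(action, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in action]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    line_number = action["line_number"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(line_number, int) or isinstance(line_number, bool):
        raise ValueError("line_number should be an integer")
    return action

# Hypothetical output; the values are illustrative, not real model output.
example = json.dumps({
    "action": "flag_issue",
    "issue_description": "List is modified while being iterated.",
    "filename": "process.py",
    "line_number": 3,
    "severity": "high",
    "category": "bug",
})
parsed = parse_review_action(example)
```

Using json.loads rather than any form of evaluation keeps the model's text from ever being executed.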

Known Failures: The model may hallucinate specific line numbers if the provided code diff is extremely long or poorly formatted.
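One inexpensive guard against this failure mode is to reject any reported line number that falls outside the reviewed snippet. The sketch below assumes line numbers are 1-indexed relative to the code passed in the prompt, which the card does not state; the helper name is hypothetical.

```python
def line_number_in_range(code: str, line_number: int) -> bool:
    """Return True if line_number points at an existing line of code.

    Assumes 1-indexed line numbers relative to the snippet itself.
    """
    return 1 <= line_number <= len(code.splitlines())

# The snippet from the usage example above (4 lines).
snippet = (
    "def process(data):\n"
    "    for i in data:\n"
    '        if i == "remove": data.remove(i)\n'
    "    return data\n"
)

plausible = line_number_in_range(snippet, 3)   # inside the snippet
suspect = line_number_in_range(snippet, 40)    # beyond the last line
```

This only catches out-of-range values; a wrong line number that still lies inside the snippet needs human review.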

System

This model serves as the core "Agent" in the CodeLens Evaluation System. It receives system prompts, noise budgets, and task definitions from the CodeLens Python backend. Its downstream consumer is the CodeLens WebSocket dashboard (hosted on Hugging Face Spaces), which parses the JSON outputs to assign positive or negative rewards and update the live leaderboard.
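The reward step can be illustrated with a small sketch. This is not the CodeLens scoring code: the matching rule (same filename and line number as the planted issue) and the -1 penalty are assumptions chosen for illustration, and score_action is a hypothetical name.

```python
def score_action(action: dict, planted_issue: dict) -> int:
    """Binary reward: +1 if the action points at the planted issue, else -1.

    The matching rule and the -1 penalty are assumptions; the real
    environment may compare other fields or use different values.
    """
    hit = (
        action.get("filename") == planted_issue["filename"]
        and action.get("line_number") == planted_issue["line_number"]
    )
    return 1 if hit else -1

# Hypothetical planted issue and two hypothetical agent actions.
planted = {"filename": "process.py", "line_number": 3}
good = {"action": "flag_issue", "filename": "process.py", "line_number": 3}
bad = {"action": "flag_issue", "filename": "process.py", "line_number": 1}

rewards = [score_action(good, planted), score_action(bad, planted)]
```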

Implementation requirements

  • Hardware: Trained on Kaggle using the GPU T4 x2 accelerator (two NVIDIA Tesla T4 GPUs).
  • Software: Unsloth, PyTorch, Hugging Face Transformers, TRL.
  • Compute: Fine-tuning took less than 15 minutes using Unsloth's 4-bit LoRA training pipeline. Inference requires < 6GB of VRAM, so the model fits on common consumer and free-tier GPUs.
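The VRAM figure is consistent with a back-of-the-envelope estimate of weight storage alone. The sketch below assumes a round 7 billion parameters and ignores quantization metadata, the KV cache, and activations, so it is a lower bound rather than a measurement.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

NUM_PARAMS = 7e9  # rounded parameter count; an approximation

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # half-precision weights
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
```

The 4-bit weights come to a little over 3 GiB, which leaves headroom under 6 GB for the cache and activations at short sequence lengths.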

Model Characteristics

Model initialization

The model was fine-tuned from the pre-trained unsloth/Qwen2.5-Coder-7B-Instruct base model.

Model stats

  • Size: 7 Billion Parameters.
  • Weights: LoRA adapters applied to Q, K, V, O, gate, up, and down projections.
  • Latency: Uses Unsloth's inference mode, which Unsloth advertises as roughly 2x faster than native Hugging Face generation.
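To give a sense of how small the adapter is relative to the base model, the sketch below counts LoRA parameters for the seven projections listed above. The layer dimensions are those published in the Qwen2.5-7B configuration and are assumed to apply to the Coder-Instruct variant; the rank of 16 is an assumption, since the card does not state the rank used.

```python
# Dimensions from the published Qwen2.5-7B config (assumed to match this model).
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 4 * 128  # 4 key/value heads x head dimension 128

LORA_RANK = 16  # assumption: the card does not state the rank

# (in_features, out_features) for each adapted projection.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank: int) -> int:
    """Each adapter adds A (rank x in) and B (out x rank): rank * (in + out)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS

trainable = lora_param_count(LORA_RANK)
print(f"{trainable:,} trainable LoRA parameters at rank {LORA_RANK}")
```

Under these assumptions the adapters amount to well under 1% of the base model's parameters, which helps explain the short fine-tuning time.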

Other details

  • Quantization: The model was trained and quantized in 4-bit (bitsandbytes) to substantially reduce VRAM requirements.
  • Gradient Checkpointing: Unsloth's exact gradient checkpointing was used to save memory.

Data Overview

The model was trained on a highly specific, synthetically generated instructional dataset targeting code review tasks.

Training data

The training data (ArshVerma/CodeLens-Dataset) contains custom prompt-completion pairs. Each row simulates a CodeLens environment state containing:

  • A pr_title and pr_description.
  • A code diff containing planted bugs, security flaws, or architectural issues.
  • A golden JSON completion representing the ideal code review action.
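A single training row might be assembled as in the sketch below. Only pr_title and pr_description are field names from the card; the prompt layout, the "prompt"/"completion" keys, the helper name, and all sample values are hypothetical and may differ from ArshVerma/CodeLens-Dataset.

```python
import json

def build_example(pr_title: str, pr_description: str, diff: str,
                  golden_action: dict) -> dict:
    """Assemble one prompt-completion pair (layout is an assumption)."""
    prompt = (
        f"PR title: {pr_title}\n"
        f"PR description: {pr_description}\n\n"
        f"Diff:\n{diff}\n"
    )
    # The completion is the golden review action serialized as JSON.
    return {"prompt": prompt, "completion": json.dumps(golden_action)}

# Illustrative values only; not taken from the real dataset.
row = build_example(
    pr_title="Add cleanup helper",
    pr_description="Removes sentinel entries from the input list.",
    diff='+    for i in data:\n+        if i == "remove": data.remove(i)',
    golden_action={
        "action": "flag_issue",
        "issue_description": "List is mutated while being iterated.",
        "filename": "process.py",
        "line_number": 3,
        "severity": "high",
        "category": "bug",
    },
)
```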

Demographic groups

N/A. This model processes programming logic and code structure.

Evaluation data

The model was evaluated iteratively against the live CodeLens engine by observing its performance on three task types: bug_detection, security_audit, and architectural_review.


Evaluation Results

Summary

The model learned the JSON formatting constraints and correctly identifies logic bugs (e.g., modifying a list while iterating over it) without wrapping its output in markdown code fences.

Subgroup evaluation results

  • Bug Detection: Strongest of the three tasks. Identifies off-by-one errors and loop logic errors.
  • Security Audit: Capable of identifying basic input sanitization failures.
  • Architecture: Demonstrates baseline capability in flagging Single Responsibility Principle (SRP) violations.

Fairness

N/A. Evaluation was strictly based on binary reward functions (detecting the objective bug = +1 reward).

Usage limitations

This model is intended strictly for evaluating code within isolated sandboxes or assisting developers. It should not be used to automatically reject or approve production Pull Requests without human oversight, as false positives are possible.

Ethics

No sensitive data or PII was used to train this model. Adversarial exploitation remains a risk: malicious code or comments embedded in a diff could steer the model toward arbitrary outputs. Because the model emits only JSON that is parsed rather than executed, the risk of code execution is low.