---
license: mit
tags:
- ppo
- qlora
- reinforcement-learning
- llama-3
- mmlu
pipeline_tag: text-generation
---

# PPO-QLoRA Trained Model (spark-model-QLoRA)

This repository contains an agent (actor and critic models) trained using Proximal Policy Optimization (PPO) with QLoRA.
The training was performed using the scripts and models available in the `spark_rl` directory of the `explore-rl` project.

**Base Model:** `meta-llama/Llama-3-8B-Instruct` (or specify if different, based on your `train.py` arguments)

## Model Components

The `model_final` directory (uploaded here as the root of these components) contains:

*   **`actor/`**: LoRA adapters for the actor (policy) model.
*   **`critic/`**: LoRA adapters for the critic (value) model's base LLM, and a `value_head.pt` file for its custom value prediction head.
*   **`tokenizer/`**: The Hugging Face tokenizer used during training.
*   **`hyperparams.txt`**: Key hyperparameters used for the PPO training.
*   **`models.py`**: Contains the `LLMActorLora` and `LLMCriticLora` class definitions required to load and use these models.

## How to Use

To use these models, you will need the `LLMActorLora` and `LLMCriticLora` classes from the included `models.py` file.

```python
import torch
from transformers import AutoTokenizer
from models import LLMActorLora, LLMCriticLora # models.py is in this repository

# --- Configuration ---
BASE_MODEL_ID = "meta-llama/Llama-3-8B-Instruct" # IMPORTANT: Ensure this matches the model used for training!
MODEL_REPO_PATH = "gabrielbo/spark-model-QLoRA" # Or local path if downloaded
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --- Load Tokenizer ---
try:
    tokenizer = AutoTokenizer.from_pretrained(f"{MODEL_REPO_PATH}/tokenizer")
except Exception: # Fallback if tokenizer is in the root
    tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO_PATH)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left" # Ensure consistency if PPO agent used left padding

# --- Load Actor ---
actor = LLMActorLora(
    device=DEVICE,
    model_id=BASE_MODEL_ID,
    # lora_r and disable_quantization can be defaults or from hyperparams.txt
)
# Path to actor adapters within the model repo
actor_adapters_path = f"{MODEL_REPO_PATH}/actor" 
actor.load_pretrained(actor_adapters_path)
actor.model.eval()
print("Actor loaded successfully.")

# --- Load Critic ---
critic = LLMCriticLora(
    device=DEVICE,
    model_id=BASE_MODEL_ID,
    # lora_r and disable_quantization can be defaults or from hyperparams.txt
)
# Path to critic components within the model repo
critic_components_path = f"{MODEL_REPO_PATH}/critic"
critic.load_pretrained(critic_components_path)
critic.model.eval()
critic.value_head.eval()
print("Critic loaded successfully.")

# --- Example: Generating an action (conceptual) ---
# This part is highly dependent on how your PPOAgent prepares inputs.
# The following is a generic example. You'll need to adapt it.

# Example input construction (refer to PPOAgent.prepare_batch)
question = "What is the capital of France?"
state_text = "The current context is a geography quiz."
input_text = f"Question: {question}\n\nState: {state_text}\n\nAction:"

inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(DEVICE)

print(f"\nGenerating action for: {input_text}")
with torch.no_grad():
    # Actor generates token IDs
    # Note: Generation kwargs might be needed (e.g., temperature, top_p from hyperparams.txt or evaluate.py)
    generated_ids = actor.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=50, # Adjust as needed
        # temperature=0.7, # Example
        # top_p=0.9,       # Example
        do_sample=True   # Example, if sampling was used
    )
    
    # Decode the generated action
    # The generated output includes the input_text, so we need to slice it off.
    # This depends on tokenizer.padding_side; if "left", then slicing logic changes.
    # Assuming tokenizer.padding_side = "right" (default for many models) or handled by generate
    
    # If tokenizer.padding_side was "left" for generation, the input is at the end.
    # For simplicity, let's assume the output only contains new tokens after input.
    # This might need adjustment based on specific generation config.
    
    # A common way to get only the generated part:
    response_ids = generated_ids[0][inputs.input_ids.shape[-1]:]
    action_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    
    print(f"Generated Action: {action_text.strip()}")

    # --- Example: Getting a value estimate (conceptual) ---
    value_prediction = critic.forward(inputs.input_ids, attention_mask=inputs.attention_mask)
    print(f"Value prediction for the state: {value_prediction.item()}")

```

## Training Details

The model was trained using the PPO algorithm with the following key settings (see `hyperparams.txt` for more details):

*   **Learning Rate (Actor)**: (Refer to `lr` in `hyperparams.txt`)
*   **Learning Rate (Critic)**: (Refer to `critic_lr` in `hyperparams.txt`)
*   **PPO Clip Ratio**: (Refer to `clip_ratio` in `hyperparams.txt`)
*   **KL Coefficient**: (Refer to `kl_coef` in `hyperparams.txt`)
*   **Target KL**: (Refer to `target_kl` in `hyperparams.txt`)
*   **Batch Size**: (As per your training script, e.g., `args.batch`)
*   **PPO Epochs**: (As per your training script, e.g., `args.ppo_epochs`)
*   **Total PPO Iterations**: (As per your training script, e.g., `args.steps`)

The specific dataset used for training was MMLU trajectories.

## Intended Use

This model is intended for tasks requiring sequential decision-making and reasoning, similar to the MMLU benchmark. It can be used as a starting point for further fine-tuning or for direct application in relevant domains.

## Limitations

*   The model's performance is tied to the quality and characteristics of the offline trajectory data it was trained on.
*   As a LoRA-adapted model, it relies on the capabilities of the base `meta-llama/Llama-3-8B-Instruct` model.
*   The generation behavior may require careful prompt engineering.

## Citation

If you use this model or the `spark_rl` codebase, please consider citing the original `explore-rl` repository:
[Link to your explore-rl GitHub repository, if public]