rs545837/PIPer-Stage2-RL-Final

Safetensors

qwen3


rs545837 committed on Feb 9

Commit

f631d6d

verified ·

1 Parent(s): a65a2ce

Upload README.md with huggingface_hub

Files changed (1)

README.md +125 -0

README.md ADDED

	@@ -0,0 +1,125 @@

+# PIPer Stage 2 RL - Final Checkpoint
+**100% pass@5 on a 20-problem subset of the EnvBench test set.**
+This model is the final checkpoint from a 2-stage training pipeline for Python environment setup tasks.
+## Model Description
+- **Base Model**: Qwen3-8B-am
+- **Training Pipeline**:
+  - **Stage 1**: Supervised Fine-Tuning on 2,250 ShareGPT conversations
+  - **Stage 2**: Reinforcement Learning with PPO on 228 EnvBench samples (40 epochs)
+- **Hardware**: 8x NVIDIA H200 GPUs
+- **Training Time**: ~3 hours total
+## Performance
+| Metric | Value |
+|--------|-------|
+| **pass@5** (20-problem eval, 5 samples each) | **100%** (20/20 problems) |
+| Baseline (paper) | 19.4% |
+| Baseline (reproduction) | 30% |
+| **Improvement over reproduced baseline** | **+70 percentage points** |
+## Training Data
+- **Stage 1**: [PIPer-SFT-ShareGPT-Data](https://huggingface.co/datasets/PIPer-SFT-ShareGPT-Data)
+  - 2,250 training conversations
+  - 250 validation conversations
+- **Stage 2**: [PIPer-EnvBench-Data](https://huggingface.co/datasets/PIPer-EnvBench-Data)
+  - 228 environment setup problems (training)
+  - 96 environment setup problems (test)
+## Usage
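+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Full Hub repo id (namespace/name); a bare name only resolves to a local directory
+model_id = "rs545837/PIPer-Stage2-RL-Final"
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    trust_remote_code=True,
+    torch_dtype="bfloat16",
+    device_map="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# Format the prompt with the chat template and open the assistant turn
+messages = [{
+    "role": "user",
+    "content": "Your task is to generate a bash script that will set up a Python development environment..."
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt",
+).to(model.device)
+# do_sample=True is required for temperature and top_p to take effect
+outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.8, top_p=0.95)
+# Decode only the newly generated tokens, not the echoed prompt
+prompt_len = inputs["input_ids"].shape[-1]
+response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
+print(response)
+```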
+## Training Details
+### Stage 1 Configuration
+- **Dataset**: ShareGPT conversations
+- **Batch Size**: 256 (8 GPUs × 32 samples per GPU)
+- **Learning Rate**: 2e-5
+- **Epochs**: 3
+- **Sequence Length**: 4096
+- **Training Steps**: 24
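The step counts reported in the training configuration can be cross-checked from the dataset size, global batch size, and epoch count. The sketch below assumes that incomplete final batches are dropped each epoch; the README does not state this, and the helper name `optimizer_steps` is hypothetical. Under that assumption the Stage 1 figures (2,250 conversations, batch size 256, 3 epochs) reproduce the reported 24 steps.

```python
def optimizer_steps(num_examples: int, global_batch_size: int, epochs: int,
                    drop_last: bool = True) -> int:
    """Count optimizer steps for a fixed-size dataset.

    Assumption (not stated in the README): with drop_last=True the
    incomplete final batch of every epoch is discarded.
    """
    if global_batch_size <= 0:
        raise ValueError("global_batch_size must be positive")
    steps_per_epoch = num_examples // global_batch_size
    if not drop_last and num_examples % global_batch_size:
        steps_per_epoch += 1  # keep the partial batch as an extra step
    return steps_per_epoch * epochs


# Stage 1: 2,250 conversations, 8 GPUs x 32 samples per GPU, 3 epochs
stage1_steps = optimizer_steps(2250, 8 * 32, 3)
# Stage 2: 228 problems, batch size 128, 40 epochs
stage2_steps = optimizer_steps(228, 128, 40)
print(stage1_steps, stage2_steps)
```

Keeping the partial batch (`drop_last=False`) would give 27 steps for Stage 1 instead, so the reported 24 is consistent with the drop-last assumption.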
+### Stage 2 Configuration
+- **Algorithm**: PPO (Proximal Policy Optimization)
+- **Dataset**: EnvBench environment setup problems
+- **Batch Size**: 128
+- **Reward Function**: Strict shellcheck validation
+- **Epochs**: 40
+- **Sequence Length**: 8192
+- **Training Steps**: 40
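A minimal sketch of what a strict shellcheck-based reward could look like, assuming a binary scheme where a script that shellcheck accepts without any finding scores +1.0 and anything else scores -1.0. The function names and the exact scoring rule are assumptions for illustration, not the PIPer implementation. It requires the `shellcheck` binary on `PATH`.

```python
import subprocess


def reward_from_exit_code(exit_code: int) -> float:
    """Map a shellcheck exit status to a binary reward.

    shellcheck exits with 0 when it reports no findings; any other
    status is treated as a failure (assumed 'strict' scoring).
    """
    return 1.0 if exit_code == 0 else -1.0


def shellcheck_reward(script: str) -> float:
    """Lint a bash script passed on stdin and score the result."""
    try:
        result = subprocess.run(
            ["shellcheck", "--shell=bash", "-"],  # "-" reads the script from stdin
            input=script,
            text=True,
            capture_output=True,
            check=False,
        )
    except FileNotFoundError as exc:
        raise RuntimeError("shellcheck is not installed or not on PATH") from exc
    return reward_from_exit_code(result.returncode)
```

Calling `shellcheck_reward(generated_script)` on each sampled completion would yield per-sample rewards in the same +1.00 / -1.00 form as the distributions listed under Evaluation Results.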
+## Evaluation Results
+Evaluated on 20 problems from the EnvBench test set using the pass@5 metric (5 samples per problem; a problem passes if at least one sample is correct):
+- **20/20 problems passed** (100% success rate)
+- Most problems achieved 5/5 correct samples
+- Strong consistency across samples
+### Sample Reward Distributions
+- Problem 1: [1.00, 1.00, 1.00, 1.00, 1.00] ✓
+- Problem 2: [1.00, 1.00, 1.00, 1.00, 1.00] ✓
+- Problem 3: [1.00, 1.00, 1.00, 1.00, -1.00] ✓
+- ...
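The per-problem reward lists above map to pass@5 as follows: a problem counts as passed when at least one of its five samples earns a positive reward, which is why Problem 3 still passes with one -1.00 sample. The sketch below assumes a reward threshold of 0 and uses a hypothetical helper name; it covers only the three problems listed, not the full 20-problem evaluation.

```python
def pass_at_k_from_rewards(rewards_per_problem, threshold=0.0):
    """Fraction of problems with at least one sample scoring above threshold."""
    if not rewards_per_problem:
        raise ValueError("need at least one problem")
    solved = sum(
        any(reward > threshold for reward in rewards)
        for rewards in rewards_per_problem
    )
    return solved / len(rewards_per_problem)


# The three distributions shown in the README
listed_rewards = [
    [1.00, 1.00, 1.00, 1.00, 1.00],
    [1.00, 1.00, 1.00, 1.00, 1.00],
    [1.00, 1.00, 1.00, 1.00, -1.00],
]
listed_score = pass_at_k_from_rewards(listed_rewards)
print(listed_score)  # 1.0
```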
+## Architecture
+- **Framework**: veRL (Volcano Engine Reinforcement Learning for LLMs)
+- **Distribution**: FSDP (Fully Sharded Data Parallel)
+- **Inference**: vLLM 0.8.4 (hybrid FSDP+vLLM mode)
+- **Attention**: Flash Attention 2
+## Checkpoints
+- **Stage 1 SFT**: [PIPer-Stage1-SFT-ShareGPT](https://huggingface.co/PIPer-Stage1-SFT-ShareGPT)
+- **Stage 2 RL** (this model): [PIPer-Stage2-RL-Final](https://huggingface.co/rs545837/PIPer-Stage2-RL-Final)
+## Citation
+Based on the PIPer paper:
+```bibtex
+@article{piper2025,
+  title={PIPer: On-Device Environment Setup via Online Reinforcement Learning},
+  author={...},
+  journal={arXiv preprint},
+  year={2025}
+}
+```
+## License
+Same as base model (Qwen3-8B-am)
+## Acknowledgments
+- JetBrains Research for the PIPer codebase and EnvBench dataset
+- Qwen team for the base model
+- veRL team for the training framework