rs545837 commited on
Commit
f631d6d
·
verified ·
1 Parent(s): a65a2ce

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +125 -0
README.md ADDED
@@ -0,0 +1,125 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # PIPer Stage 2 RL - Final Checkpoint
2
+
3
+ **100% pass@5 on EnvBench evaluation set!**
4
+
5
+ This model is the final checkpoint from a 2-stage training pipeline for Python environment setup tasks.
6
+
7
+ ## Model Description
8
+
9
+ - **Base Model**: Qwen3-8B-am
10
+ - **Training Pipeline**:
11
+ - **Stage 1**: Supervised Fine-Tuning on 2,250 ShareGPT conversations
12
+ - **Stage 2**: Reinforcement Learning with PPO on 228 EnvBench samples (40 epochs)
13
+ - **Hardware**: 8x NVIDIA H200 GPUs
14
+ - **Training Time**: ~3 hours total
15
+
16
+ ## Performance
17
+
18
+ | Metric | Value |
19
+ |--------|-------|
20
+ | **pass@5** (20-sample eval) | **100%** (20/20 problems) |
21
+ | Baseline (paper) | 19.4% |
22
+ | Baseline (reproduction) | 30% |
23
+ | **Improvement** | **+70 percentage points** |
24
+
25
+ ## Training Data
26
+
27
+ - **Stage 1**: [PIPer-SFT-ShareGPT-Data](https://huggingface.co/datasets/PIPer-SFT-ShareGPT-Data)
28
+ - 2,250 training conversations
29
+ - 250 validation conversations
30
+
31
+ - **Stage 2**: [PIPer-EnvBench-Data](https://huggingface.co/datasets/PIPer-EnvBench-Data)
32
+ - 228 environment setup problems (training)
33
+ - 96 environment setup problems (test)
34
+
35
+ ## Usage
36
+
37
+ ```python
38
+ from transformers import AutoModelForCausalLM, AutoTokenizer
39
+
40
+ model = AutoModelForCausalLM.from_pretrained(
41
+ "PIPer-Stage2-RL-Final",
42
+ trust_remote_code=True,
43
+ torch_dtype="bfloat16",
44
+ device_map="auto"
45
+ )
46
+ tokenizer = AutoTokenizer.from_pretrained("PIPer-Stage2-RL-Final")
47
+
48
+ # Format prompt
49
+ messages = [{
50
+ "role": "user",
51
+ "content": "Your task is to generate a bash script that will set up a Python development environment..."
52
+ }]
53
+
54
+ inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt").to(model.device)
55
+ outputs = model.generate(inputs, max_new_tokens=4096, temperature=0.8, top_p=0.95)
56
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
57
+ print(response)
58
+ ```
59
+
60
+ ## Training Details
61
+
62
+ ### Stage 1 Configuration
63
+ - **Dataset**: ShareGPT conversations
64
+ - **Batch Size**: 256 (8 GPUs × 32 samples per GPU)
65
+ - **Learning Rate**: 2e-5
66
+ - **Epochs**: 3
67
+ - **Sequence Length**: 4096
68
+ - **Training Steps**: 24
69
+
70
+ ### Stage 2 Configuration
71
+ - **Algorithm**: PPO (Proximal Policy Optimization)
72
+ - **Dataset**: EnvBench environment setup problems
73
+ - **Batch Size**: 128
74
+ - **Reward Function**: Strict shellcheck validation
75
+ - **Epochs**: 40
76
+ - **Sequence Length**: 8192
77
+ - **Training Steps**: 40
78
+
79
+ ## Evaluation Results
80
+
81
+ Evaluated on 20 problems from EnvBench test set with pass@5 metric (5 samples per problem):
82
+
83
+ - **20/20 problems passed** (100% success rate)
84
+ - Most problems achieved 5/5 correct samples
85
+ - Strong consistency across samples
86
+
87
+ ### Sample Reward Distributions
88
+ - Problem 1: [1.00, 1.00, 1.00, 1.00, 1.00] ✓
89
+ - Problem 2: [1.00, 1.00, 1.00, 1.00, 1.00] ✓
90
+ - Problem 3: [1.00, 1.00, 1.00, 1.00, -1.00] ✓
91
+ - ...
92
+
93
+ ## Architecture
94
+
95
+ - **Framework**: veRL (Versatile Reinforcement Learning)
96
+ - **Distribution**: FSDP (Fully Sharded Data Parallel)
97
+ - **Inference**: vLLM 0.8.4 (hybrid FSDP+vLLM mode)
98
+ - **Attention**: Flash Attention 2
99
+
100
+ ## Checkpoints
101
+
102
+ - **Stage 1 SFT**: [PIPer-Stage1-SFT-ShareGPT](https://huggingface.co/PIPer-Stage1-SFT-ShareGPT)
103
+ - **Stage 2 RL** (this model): [PIPer-Stage2-RL-Final](https://huggingface.co/PIPer-Stage2-RL-Final)
104
+
105
+ ## Citation
106
+
107
+ Based on the PIPer paper:
108
+ ```bibtex
109
+ @article{piper2025,
110
+ title={PIPer: Automated Python Environment Setup with Reinforcement Learning},
111
+ author={...},
112
+ journal={arXiv preprint},
113
+ year={2025}
114
+ }
115
+ ```
116
+
117
+ ## License
118
+
119
+ Same as base model (Qwen3-8B-am)
120
+
121
+ ## Acknowledgments
122
+
123
+ - JetBrains Research for the PIPer codebase and EnvBench dataset
124
+ - Qwen team for the base model
125
+ - veRL team for the training framework