evals
README.md
CHANGED
@@ -139,6 +139,26 @@ response = client.chat.completions.create(
Proxy Lite scored 72.4% on the [WebVoyager](https://huggingface.co/datasets/convergence-ai/WebVoyager2025Valid) benchmark, placing it first among all available open-weights models.

A breakdown of the results by website is shown below:

| Website         | Success Rate (%) | Finish Rate (%) | Avg. Steps |
|-----------------|------------------|-----------------|------------|
| Allrecipes      | 87.8             | 95.1            | 10.3       |
| Amazon          | 70.0             | 90.0            | 7.1        |
| Apple           | 82.1             | 89.7            | 10.7       |
| ArXiv           | 60.5             | 79.1            | 16.0       |
| BBC News        | 69.4             | 77.8            | 15.9       |
| Booking         | 70.0             | 85.0            | 24.8       |
| Cambridge Dict. | 86.0             | 97.7            | 5.7        |
| Coursera        | 82.5             | 97.5            | 4.7        |
| ESPN            | 53.8             | 87.2            | 14.9       |
| GitHub          | 85.0             | 92.5            | 10.0       |
| Google Flights  | 38.5             | 51.3            | 34.8       |
| Google Map      | 78.9             | 94.7            | 9.6        |
| Google Search   | 71.4             | 92.9            | 6.0        |
| Huggingface     | 68.6             | 74.3            | 18.4       |
| Wolfram Alpha   | 78.3             | 93.5            | 6.1        |
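
As a rough cross-check of the per-site figures, the sketch below takes the success rates from the table and computes their unweighted mean, then lists the three lowest-scoring sites. The variable names are illustrative only. The unweighted mean comes out slightly below the 72.4% headline; one plausible explanation, which is an assumption and not stated in this README, is that the headline score is computed over all tasks and the sites do not all have the same number of tasks.

```python
# Per-site success rates (%) copied from the results table.
success_rate = {
    "Allrecipes": 87.8,
    "Amazon": 70.0,
    "Apple": 82.1,
    "ArXiv": 60.5,
    "BBC News": 69.4,
    "Booking": 70.0,
    "Cambridge Dict.": 86.0,
    "Coursera": 82.5,
    "ESPN": 53.8,
    "GitHub": 85.0,
    "Google Flights": 38.5,
    "Google Map": 78.9,
    "Google Search": 71.4,
    "Huggingface": 68.6,
    "Wolfram Alpha": 78.3,
}

# Unweighted (macro) mean: every site counts equally, regardless of how
# many tasks it contributes. This is NOT necessarily how the 72.4%
# headline figure was computed (assumption: that one is per-task).
macro_avg = sum(success_rate.values()) / len(success_rate)

# Three sites with the lowest success rate, worst first.
hardest = sorted(success_rate, key=success_rate.get)[:3]

print(f"Unweighted mean success rate: {macro_avg:.1f}%")
print("Lowest-scoring sites:", ", ".join(hardest))
```

Google Flights stands out in both directions: it has the lowest success rate and, per the table, the highest average step count.
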
### Out-of-Scope Use