arturo-fredes committed · Commit efb7e53 · verified · 1 Parent(s): 01ee45f

Update README.md

Files changed (1): README.md (+5 -10)
@@ -225,13 +225,9 @@ Benchmark scores were obtained with the following setups. Methodology varies by
  #### Metrics reported
 
  - **System Output Throughput (higher is better)**: Mean output tokens per second across all concurrent requests over the benchmarking phase.
- - **End-to-End Latency per Query (lower is better):** Median end-to-end response time for each query from the time the query is sent.
- - **Output Speed per Query (higher is better):** Median output tokens per second after the first token is received for each query.
  - **Time to first token (TTFT) (lower is better):** Median time to first token.
- - **Estimated total memory — (lower is better):** Median from each GuideLLM phase (estimated total footprint: weights plus KV contribution from monitored usage).
  - **Model weights (lower is better):**
 
- On the same hardware and harness, **HyperNova 60B 2605** is compared to **gpt-oss-120b** using GuideLLM. Each table lists **median** values for that model at each **concurrency phase** (1 → 256 concurrent requests).
 
  | Metric | GPT-OSS-120B | Hypernova 60B 2605 |
  |--------|-------------:|-------------------:|
@@ -244,18 +240,17 @@ On the same hardware and harness, **HyperNova 60B 2605** is compared to **gpt-os
  | Model weights (GB) | 121.54 | 31.81 |
 
 
-
  #### Performance evaluation conditions
 
  Our performance evaluation follows the spirit of [Artificial Analysis](https://artificialanalysis.ai/methodology/system-load-test).
-
- - **Inference library**: vLLM 0.13.0
+ - **Inference library**: vLLM 0.18.0
  - **Monitoring libraries**: GuideLLM, nvidia-ml-py
  - **Hardware**: 1× NVIDIA H200 Tensor Core GPU
- - **Conditions**: **concurrency phases** 1, 2, 4, 8, 16, 32, 64, 128, 192, and 256 concurrent requests (one GuideLLM phase each)
+ - **Conditions**: **concurrency phases** 128
  - **Phase duration**: Each phase lasts 3 minutes (excluding ramp-up and cool-down periods).
- - **Workload shape:** input length is ~1000 tokens per query (median); median output length varies by phase and model.
- - **Streaming**: Benchmarking is conducted with streaming enabled.
+ - **Workload shape:** 1k input / 1k output
+ - **Decode:** temperature: 0.0, top_p: 1.0
+
 
  The figure below is a **side-by-side comparison at concurrency = 128 only**
 
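Note on the metrics that remain after this change: the sketch below shows one plausible way to compute System Output Throughput and median TTFT from per-request timings, plus the weights ratio from the table. It is an illustration under stated assumptions, not the harness used for the README. `RequestRecord` and its fields are hypothetical names (they are not GuideLLM types), the throughput formula (total output tokens divided by phase duration) is one reading of the README's definition, and the sample records are made up.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    """Hypothetical per-request timing record (not a GuideLLM type)."""
    sent_at: float         # seconds, when the query was sent
    first_token_at: float  # seconds, when the first streamed token arrived
    output_tokens: int     # tokens generated for this request


def system_output_throughput(records, phase_seconds):
    """Assumed reading: total output tokens across requests / phase duration."""
    if phase_seconds <= 0:
        raise ValueError("phase_seconds must be positive")
    return sum(r.output_tokens for r in records) / phase_seconds


def median_ttft(records):
    """Median time to first token, in seconds."""
    return median(r.first_token_at - r.sent_at for r in records)


def size_ratio(baseline_gb, candidate_gb):
    """How many times larger the baseline weights are than the candidate's."""
    return baseline_gb / candidate_gb


# Made-up records for illustration only; these are not benchmark results.
records = [
    RequestRecord(sent_at=0.0, first_token_at=0.5, output_tokens=1000),
    RequestRecord(sent_at=0.0, first_token_at=1.0, output_tokens=1000),
    RequestRecord(sent_at=1.0, first_token_at=3.0, output_tokens=1000),
]
phase_seconds = 180.0  # a 3-minute phase, as listed in the conditions

throughput = system_output_throughput(records, phase_seconds)
ttft = median_ttft(records)

# Model weights (GB) taken from the table in the diff: 121.54 vs 31.81.
weights_ratio = size_ratio(121.54, 31.81)

print(f"throughput: {throughput:.2f} tok/s")
print(f"median TTFT: {ttft:.2f} s")
print(f"weights ratio: {weights_ratio:.2f}x")
```

With the table's values, the weights ratio comes out at roughly 3.8, meaning the Hypernova 60B 2605 weights are about a quarter the size of GPT-OSS-120B's. The throughput and TTFT figures printed here only reflect the made-up records.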