Getting 110 tokens/sec on my RTX 3090 (24 GB VRAM)

#19
by bikkikumarsha - opened

I built llama.cpp on WSL and ran the GLM-4.7-Flash-UD-Q4_K_XL.gguf model. With a 120K context size, which fits comfortably in my 24 GB of VRAM, I get 110 tokens/sec.
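For reference, the run command looked roughly like this. This is a sketch, not the exact invocation from my setup: the binary name and flag spellings can vary across llama.cpp versions, and the model path is a placeholder.

```shell
# Sketch of a llama.cpp run with a large context fully offloaded to GPU.
# -m: model path   -c: context size in tokens (~120K)   -ngl: layers to offload
./llama-cli -m ./GLM-4.7-Flash-UD-Q4_K_XL.gguf -c 122880 -ngl 99 -p "Hello"
```

With all layers offloaded (`-ngl 99`) the KV cache for the full context also lives in VRAM, which is what makes the 24 GB budget the binding constraint.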

I also tried the GLM-4.7-Flash-UD-Q5_K_XL.gguf model, which performed similarly at a 60K context. But as the context fills up, memory usage grows and speed declines gradually (about 2–3 tokens/sec per turn).

Here is a rough benchmark comparing the two models.

Benchmark Question

Context

You are acting as a senior applied AI engineer asked to help a product team diagnose churn and data quality issues in a subscription service.

Below is a messy event log extract and a short product requirements note. The data is intentionally imperfect.


Product Notes (read carefully)

  • Users can have multiple subscriptions, but only one can be active at a time.

  • cancelled_at may appear before subscription_start due to backfilled data.

  • Timestamps are in mixed time zones, but all end with Z or an offset.

  • The business defines 30‑day retention as:

    A user who still has an active subscription 30 days after their first subscription start.

  • The analytics team wants:

    1. A retention rate

    2. A list of data quality issues

    3. A clean, reusable function that could be productionized

  • Assume no external internet access, but standard Python libraries are allowed.


Event Log (CSV)

```
user_id,subscription_id,event_type,event_time
u1,s1,started,2023-01-01T10:00:00Z
u1,s1,cancelled,2023-01-20T09:00:00Z
u2,s2,started,2023-01-05T08:00:00-05:00
u2,s3,started,2023-02-10T12:00:00Z
u3,s4,started,2023-01-15T14:00:00Z
u3,s4,cancelled,2023-01-10T10:00:00Z
u4,s5,started,2023-01-01T00:00:00Z
u4,s5,cancelled,2023-02-15T00:00:00Z
u5,s6,started,2023-01-31T23:59:59Z
```


Your Tasks

1. Clarification & Assumptions

  • List up to 3 clarifying questions you would ask the product or data team.

  • Then proceed anyway, clearly stating the assumptions you choose if those questions remain unanswered.

2. Data Reasoning

  • Identify at least 5 non‑trivial data quality or modeling issues in the dataset.

  • Explain why each issue matters for retention analysis.

3. Coding

  • Write a clean, readable Python function that:

    • Parses the data

    • Normalizes timestamps

    • Computes the 30‑day retention rate

  • Include brief inline comments and describe the time complexity.

4. Validation & Testing

  • Propose 2–3 test cases (edge cases preferred).

  • Explain how you would validate correctness in production.

5. Agentic / Tool Reasoning

  • If you had access to tools (e.g., Python execution, SQL, dashboards), describe:

    • Which tools you would use

    • In what order

    • What signals would tell you to change course

6. Executive Summary

  • Write a 5–7 sentence summary for a non‑technical stakeholder explaining:

    • The retention result

    • Confidence level

    • Key risks and next steps


Constraints

  • Do not skip steps.

  • Do not answer with bullet points only—mix narrative, structure, and code.

  • Optimize for clarity, correctness, and judgment, not brevity.

| Metric | Score: GLM-4.7-Flash-UD-Q5_K_XL.gguf (Response 1) | Q5_K_XL remarks | Score: GLM-4.7-Flash-UD-Q4_K_XL.gguf (Response 2) | Q4_K_XL remarks |
|---|---|---|---|---|
| Coding ability | 30 | Major failure: provided function crashes (`KeyError: 'active'`) because it never records per-user active status but later sums `cohorts['active']`. Also misstates time complexity (loop filters the DataFrame per user ⇒ closer to O(U·N)). Uses pandas despite the "standard libraries" constraint ambiguity. | 70 | Code is runnable and uses the stdlib (`csv`, `datetime`), good portability. Timestamp parsing via `fromisoformat` (+ Z fix) is solid. Weaknesses: claims UTC normalization but doesn't explicitly convert; ignores `subscription_id` and does not model "restart within 30 days" correctly; complexity statement mentions sorting but the code doesn't sort. |
| Multi-turn reasoning (agentic & tool use) | 78 | Strong: good clarifying questions, explicit assumptions, decent tool stack (SQL, Great Expectations, orchestration, dashboards) + signals. | 72 | Good structure and sensible tool ordering + signals. Slightly less concrete than Response 1; some assumptions are stated but then inconsistently applied later. |
| Raw intelligence | 70 | Catches several real issues (timezones, backfilled cancels, user vs. sub aggregation). Some "issues" are more about parsing/library choice than modeling. Minor date reasoning glitch (mentions Feb 29 in 2023), but the overall conclusion aligns with the intended logic. | 60 | Good identification of common pitfalls, but makes a clear retention reasoning error (calls u5 churned despite no cancellation). Some points are fluff (DST mention without relevance here). |
| Coherence over long context | 55 | Narrative is broadly consistent, but coherence breaks because the executive summary depends on a metric the code cannot compute (code crashes). Also vacillates between stdlib and pandas. | 45 | Biggest issue: internal contradiction — code returns 80%, but the response repeatedly asserts 60% (comments + executive summary). Also says "sort events" while not actually sorting. |
| Production readiness & robustness (new) | 50 | Good instincts (monitoring, validation tools, pipeline audit), but not production-ready due to non-running code + dependency mismatch + weak outputs for "data quality issues" (just a string). | 65 | More production-friendly: stdlib-only, clear function signature, test ideas + SQL baseline validation plan. Still lacks subscription-state modeling and returns only a float (no issue list / audit output), but closer to shippable. |
| Overall (average of above) | 57 | Strong analysis/agent framing, but coding execution quality is a blocker. | 62 | Better executable code + portability, but a major self-contradiction on the key result hurts trust. |
