🚀 Starting Agent Loop Tool Efficiency Test

by bukit - opened Mar 2

Mar 2

•

📊 Configuration:
Base URL: http://localhost:9099/v1
Model: Qwen3.5-4B-UD-Q5_K_XL
Test Cases: 17
Output: results/agent_test_results_Qwen3.5-4B-UD-Q5_K_XL_20260302_230444.json
Log File: logs/agent_test_logs_Qwen3.5-4B-UD-Q5_K_XL_20260302_230444.log

🔄 Running agent tests...
Starting agent test suite with 17 test cases
Running agent test: zero_greeting
Running agent test: simple_add_iphone
Running agent test: zero_thank_you
Running agent test: zero_capabilities
Running agent test: simple_search_electronics
Running agent test: zero_weather_question
Running agent test: medium_search_category_and_add
Running agent test: simple_view_cart
Running agent test: simple_remove_product
Running agent test: simple_checkout
Running agent test: medium_search_and_add
Running agent test: complex_shopping_workflow
Running agent test: medium_remove_and_add
Running agent test: medium_view_and_add
Running agent test: complex_cart_management
Running agent test: complex_gift_shopping
Running agent test: zero_general_question
✅ Tests completed in 50.3882221s

📈 Agent Test Results

Total Tests: 17
✅ Passed: 14
❌ Failed: 3
⏱️ Total LLM Time: 7m43.9050266s
⏱️ Average Time per Request: 10.78848899s

📋 Test Case Results:

Test Case: zero_thank_you
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 3.0060852s
Tool Calls: 0

Test Case: zero_greeting
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 4.1913922s
Tool Calls: 0

Test Case: zero_weather_question
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 7.9975621s
Tool Calls: 0

Test Case: zero_general_question
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 10.5423246s
Tool Calls: 0

Test Case: simple_checkout
Status: ✅ PASSED
Matched Path: direct_checkout
Response Time: 13.3653614s
Tool Calls: 1
Tools Used: checkout

Test Case: zero_capabilities
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 16.552381s
Tool Calls: 0

Test Case: simple_search_electronics
Status: ✅ PASSED
Matched Path: search_by_category
Response Time: 17.2668578s
Tool Calls: 1
Tools Used: search_products

Test Case: simple_view_cart
Status: ✅ PASSED
Matched Path: view_cart
Response Time: 21.2515924s
Tool Calls: 1
Tools Used: view_cart

Test Case: simple_add_iphone
Status: ✅ PASSED
Matched Path: search_then_add
Response Time: 29.4939025s
Tool Calls: 2
Tools Used: search_products, add_to_cart

Test Case: simple_remove_product
Status: ✅ PASSED
Matched Path: direct_remove
Response Time: 32.7854818s
Tool Calls: 1
Tools Used: remove_from_cart

Test Case: medium_search_category_and_add
Status: ✅ PASSED
Matched Path: search_then_add
Response Time: 33.9196s
Tool Calls: 2
Tools Used: search_products, add_to_cart

Test Case: medium_view_and_add
Status: ✅ PASSED
Matched Path: view_then_add
Response Time: 39.1388373s
Tool Calls: 2
Tools Used: view_cart, add_to_cart

Test Case: complex_shopping_workflow
Status: ❌ FAILED
Response Time: 44.6991283s
Tool Calls: 5
Tools Used: search_products, add_to_cart, add_to_cart, view_cart, checkout

Test Case: medium_search_and_add
Status: ✅ PASSED
Matched Path: search_by_query
Response Time: 46.0124159s
Tool Calls: 2
Tools Used: search_products, add_to_cart

Test Case: complex_cart_management
Status: ❌ FAILED
Response Time: 46.7918539s
Tool Calls: 3
Tools Used: view_cart, remove_from_cart, add_to_cart

Test Case: medium_remove_and_add
Status: ❌ FAILED
Response Time: 47.0538152s
Tool Calls: 3
Tools Used: view_cart, search_products, add_to_cart

Test Case: complex_gift_shopping
Status: ✅ PASSED
Matched Path: gift_shopping_workflow
Response Time: 50.3849348s
Tool Calls: 5
Tools Used: search_products, add_to_cart, search_products, add_to_cart, view_cart

❌ Failed Tests Details:

Test Case: complex_shopping_workflow
Expected Tool Variants: 4
Variant 1 (full_workflow_with_iphone): 4 tools
Variant 2 (full_workflow_with_headphones): 4 tools
Variant 3 (full_workflow_with_headphones_and_iphone): 5 tools
Variant 4 (full_workflow_with_iphone_and_headphones): 5 tools
Actual Tool Calls: 5
1. search_products
2. add_to_cart
3. add_to_cart
4. view_cart
5. checkout
Response Time: 44.6991283s

Test Case: complex_cart_management
Expected Tool Variants: 1
Variant 1 (cart_organization): 3 tools
Actual Tool Calls: 3
1. view_cart
2. remove_from_cart
3. add_to_cart
Response Time: 46.7918539s

Test Case: medium_remove_and_add
Expected Tool Variants: 1
Variant 1 (remove_then_add): 2 tools
Actual Tool Calls: 3
1. view_cart
2. search_products
3. add_to_cart
Response Time: 47.0538152s

📊 Overall Success Rate: 82.35%

llama-server --port 9099 -ngl 99 -fa on -c 16000 --temp 0 -m X:\path\to\Qwen3.5-4B-UD-Q5_K_XL.gguf --mmproj X:\path\to\mmproj-Qwen3.5-4b-F32.gguf

52 tokens/s on RTX 3060 12GB // https://github.com/docker/model-test/

Interesting, Qwen3.5-4B-UD-Q8_K_XL success rate 70.59%, 9b Q2_K_XL and Q5_K_XL never above 80% 🤔

My conclusion Qwen3.5-4B-UD-Q5_K_XL is the most optimized tool calling model for <9b Qwen3.5 series.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

🚀 Starting Agent Loop Tool Efficiency Test

📈 Agent Test Results

📋 Test Case Results:

Test Case: zero_thank_you Status: ✅ PASSED Matched Path: no_tools Response Time: 3.0060852s Tool Calls: 0

Test Case: zero_greeting Status: ✅ PASSED Matched Path: no_tools Response Time: 4.1913922s Tool Calls: 0

Test Case: zero_weather_question Status: ✅ PASSED Matched Path: no_tools Response Time: 7.9975621s Tool Calls: 0

Test Case: zero_general_question Status: ✅ PASSED Matched Path: no_tools Response Time: 10.5423246s Tool Calls: 0

Test Case: simple_checkout Status: ✅ PASSED Matched Path: direct_checkout Response Time: 13.3653614s Tool Calls: 1 Tools Used: checkout

Test Case: zero_capabilities Status: ✅ PASSED Matched Path: no_tools Response Time: 16.552381s Tool Calls: 0

Test Case: simple_search_electronics Status: ✅ PASSED Matched Path: search_by_category Response Time: 17.2668578s Tool Calls: 1 Tools Used: search_products

Test Case: simple_view_cart Status: ✅ PASSED Matched Path: view_cart Response Time: 21.2515924s Tool Calls: 1 Tools Used: view_cart

Test Case: simple_add_iphone Status: ✅ PASSED Matched Path: search_then_add Response Time: 29.4939025s Tool Calls: 2 Tools Used: search_products, add_to_cart

Test Case: simple_remove_product Status: ✅ PASSED Matched Path: direct_remove Response Time: 32.7854818s Tool Calls: 1 Tools Used: remove_from_cart

Test Case: medium_search_category_and_add Status: ✅ PASSED Matched Path: search_then_add Response Time: 33.9196s Tool Calls: 2 Tools Used: search_products, add_to_cart

Test Case: medium_view_and_add Status: ✅ PASSED Matched Path: view_then_add Response Time: 39.1388373s Tool Calls: 2 Tools Used: view_cart, add_to_cart

Test Case: complex_shopping_workflow Status: ❌ FAILED Response Time: 44.6991283s Tool Calls: 5 Tools Used: search_products, add_to_cart, add_to_cart, view_cart, checkout

Test Case: medium_search_and_add Status: ✅ PASSED Matched Path: search_by_query Response Time: 46.0124159s Tool Calls: 2 Tools Used: search_products, add_to_cart

Test Case: complex_cart_management Status: ❌ FAILED Response Time: 46.7918539s Tool Calls: 3 Tools Used: view_cart, remove_from_cart, add_to_cart

Test Case: medium_remove_and_add Status: ❌ FAILED Response Time: 47.0538152s Tool Calls: 3 Tools Used: view_cart, search_products, add_to_cart

Test Case: complex_gift_shopping Status: ✅ PASSED Matched Path: gift_shopping_workflow Response Time: 50.3849348s Tool Calls: 5 Tools Used: search_products, add_to_cart, search_products, add_to_cart, view_cart