FluxEM-Qwen3-4B Tool-Calling
Summary
FluxEM-Qwen3-4B is a tool-orchestration wrapper around Qwen3-4B-Instruct that routes queries to deterministic FluxEM tools across 11 algebraic domains (arithmetic, physics, chemistry, biology, math, music, geometry, graphs, sets, logic, number theory). It is not a fine-tuned model; it is a tool-calling pipeline and benchmark suite for reproducible evaluation.
Research framing
Inspired by NVIDIA ToolOrchestra (https://research.nvidia.com/labs/lpr/ToolOrchestra/) and prior embedding work, this project asks what counts as a "tool" when the knowledge is algebraic and not naturally tokenized. FluxEM encoders are algebraically structured and deterministic; treating them as tools can yield reliable computation with less training compute than trying to learn those operators from text.
How it works
- Domain routing and tool selection (LLM-first, with pattern fallback)
- Query extraction into tool inputs
- Tool execution in FluxEM
- Response returns tool output directly (no LLM regeneration)
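The four stages above can be sketched in plain Python. This is an illustrative stand-in, not FluxEM's actual API: the tool functions, routing patterns, and extraction logic below are hypothetical, and the real wrapper tries LLM-based routing first before falling back to patterns like these.

```python
# Minimal sketch of the four-stage pipeline: route -> extract -> execute ->
# return the tool output directly (no LLM regeneration).
import re

# Stage 3 stand-ins: deterministic tools (placeholders for FluxEM encoders).
TOOLS = {
    "number_theory": lambda base, exp, mod: pow(base, exp, mod),
    "sets": lambda a, b: sorted(a | b),
}

# Stage 1 fallback: pattern-based domain routing (the real wrapper asks the
# LLM first and only falls back to patterns).
PATTERNS = {
    "number_theory": re.compile(r"(\d+)\^(\d+) mod (\d+)"),
    "sets": re.compile(r"\bunion\b"),
}

def route_and_run(query: str):
    for domain, pattern in PATTERNS.items():
        m = pattern.search(query)
        if not m:
            continue
        # Stage 2: extract the query into tool inputs.
        if domain == "number_theory":
            args = tuple(int(g) for g in m.groups())
        else:
            nums = [int(n) for n in re.findall(r"-?\d+", query)]
            half = len(nums) // 2
            args = (set(nums[:half]), set(nums[half:]))
        # Stage 4: the tool output is the response, verbatim.
        return TOOLS[domain](*args)
    return None  # no tool matched; fall back to plain generation

print(route_and_run("What is 13^123 mod 997?"))
```

The key design point the sketch preserves is stage 4: because the tool output is returned without LLM regeneration, a correct route plus a correct extraction guarantees a correct answer.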
Quickstart
- Install dependencies (choose one):
  - Apple Silicon / MLX:
    pip install fluxem[full-mlx]
  - Transformers fallback:
    pip install fluxem[huggingface] torch
- Download a local Qwen3-4B model:
  - MLX: https://huggingface.co/Qwen/Qwen3-4B (MLX quantized)
- Enable MLX (only if using MLX):
  export FLUXEM_ENABLE_MLX=1
- Run:
from experiments.qwen3_toolcalling.qwen3_wrapper import create_wrapper
wrapper = create_wrapper(
model_path="~/.mlx/models/Qwen/Qwen3-4B-Instruct-MLX",
tool_selection="llm",
llm_query_extraction=True,
)
wrapper.load_model()
result = wrapper.generate_with_tools(
"A box has 12 apples. You eat 5. How many are left?"
)
print(result)
Benchmarks (internal)
We include two internal synthetic datasets to validate the tool-routing pipeline, not general language understanding:
- Hard: 58 prompts across 11 domains (experiments/qwen3_toolcalling/benchmark_data_hard.py)
- Very hard: 33 prompts across 11 domains (experiments/qwen3_toolcalling/benchmark_data_very_hard.py)
Example questions by domain (from benchmark_data_very_hard.py):
- arithmetic: "Compute ((987654321 - 123456789) * (54321 + 98765) + 7^9) / 13"
- physics: "Convert 88 ft/s to m/s"
- chemistry: "What is the molecular weight of Ca3(PO4)2?"
- biology: "What's the GC content of ATGCGTACGATCGGATCCGATCGTAGCTAGC?"
- math: "Calculate the determinant of [[4, 2, 0, 1], [3, 5, 1, 2], [2, 0, 3, 4], [1, 2, 4, 0]]"
- music: "Transpose [1, 4, 8] by -7 semitones"
- geometry: "Rotate [7, -2] by 225 degrees"
- graphs: "What's the shortest path from node 0 to node 6 in Graph(nodes={0,1,2,3,4,5,6}, edges=[(0,1),(1,2),(2,6),(0,3),(3,4),(4,5),(5,6),(1,4)])?"
- sets: "What is the union of {-10, -5, 0, 5, 10} and {-5, -3, 2, 5, 7}?"
- logic: "Is 'p or not p' a tautology?"
- number_theory: "What is 13^123 mod 997?"
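Because the target answers are deterministic, several of these questions can be spot-checked with the Python standard library alone, independently of FluxEM:

```python
# Independent checks of a few benchmark answers using only pure Python.

# number_theory: 13^123 mod 997 via three-argument pow (modular exponentiation)
nt = pow(13, 123, 997)

# sets: union of the two example sets
union = {-10, -5, 0, 5, 10} | {-5, -3, 2, 5, 7}

# logic: 'p or not p' holds under every truth assignment, so it is a tautology
tautology = all(p or not p for p in (False, True))

# biology: GC content = fraction of G and C bases in the sequence
seq = "ATGCGTACGATCGGATCCGATCGTAGCTAGC"
gc = (seq.count("G") + seq.count("C")) / len(seq)

print(nt, sorted(union), tautology, round(gc, 3))
```

Checks like these are how the benchmark can score tool outputs exactly rather than by fuzzy string matching.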
Results (internal)
Very hard benchmark:
- Overall accuracy: tool-calling 100.0%, baseline 39.4%
- Average improvement: 2.0x
- Raw results: experiments/qwen3_toolcalling/results/benchmark_results_20260107_004119.json
Hard benchmark:
- Overall accuracy: tool-calling 100.0%, baseline 43.1%
- Average improvement: 2.3x
- Raw results: experiments/qwen3_toolcalling/results/benchmark_results_20260106_232757.json
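The reported percentages are consistent with simple correct/total counts. The counts below (13/33 and 25/58 baseline-correct) are inferred from the figures above, not read from the raw JSON, so treat them as an assumption:

```python
# Reproduce the reported accuracy figures from assumed correct/total counts.
def accuracy(correct: int, total: int) -> float:
    """Percentage accuracy rounded to one decimal place."""
    return round(100 * correct / total, 1)

print(accuracy(13, 33))  # assumed baseline on very hard (33 prompts)
print(accuracy(25, 58))  # assumed baseline on hard (58 prompts)
print(accuracy(33, 33))  # tool-calling, very hard
```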
Related work
| Approach | Method | Difference |
|---|---|---|
| NALU | Learned log/exp gates | FluxEM uses deterministic algebraic tools with no learned parameters. |
| xVal | Learned scaling for numerical reasoning | FluxEM provides multi-domain encoders with exact operators. |
| Abacus | Positional digits for arithmetic | FluxEM encodes algebraic structure rather than positional digits. |
| ToolOrchestra | LLM tool orchestration | FluxEM applies orchestration to algebraic tools with deterministic outputs. |
Usage
See experiments/qwen3_toolcalling/README.md for setup and full benchmark instructions.
Limitations
- Benchmarks are internal and synthetic; they validate routing and tool execution, not general knowledge.
- Tool coverage is limited to FluxEM's 11 domains.
- Natural language query extraction can still fail for unusual phrasing.
- No multi-turn tool calling yet.
License
MIT (wrapper code). The base model is subject to its original license terms.
Model tree for hunterbown/fluxem-qwen3-4b
Base model
Qwen/Qwen3-4B-Instruct-2507