FluxEM-Qwen3-4B Tool-Calling
Summary
FluxEM-Qwen3-4B is a tool-orchestration wrapper around Qwen3-4B-Instruct that routes queries to deterministic FluxEM tools across 11 algebraic domains (arithmetic, physics, chemistry, biology, math, music, geometry, graphs, sets, logic, number theory). It is not a fine-tuned model; it is a tool-calling pipeline and benchmark suite for reproducible evaluation.
Research framing
Inspired by NVIDIA ToolOrchestra (https://research.nvidia.com/labs/lpr/ToolOrchestra/) and prior embedding work, this project asks what counts as a "tool" when the knowledge is algebraic and not naturally tokenized. FluxEM encoders are algebraically structured and deterministic; treating them as tools can yield reliable computation with less training compute than trying to learn those operators from text.
How it works
- Domain routing and tool selection (LLM-first, with pattern fallback)
- Query extraction into tool inputs
- Tool execution in FluxEM
- Response returns tool output directly (no LLM regeneration)
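The four stages above can be sketched in plain Python. This is an illustrative stand-in, not FluxEM's actual API: the tool functions, routing patterns, and extraction logic below are hypothetical, and the real wrapper tries LLM-based routing first before falling back to patterns like these.

```python
# Minimal sketch of the four-stage pipeline: route -> extract -> execute ->
# return the tool output directly (no LLM regeneration).
import re

# Stage 3 stand-ins: deterministic tools (placeholders for FluxEM encoders).
TOOLS = {
    "number_theory": lambda base, exp, mod: pow(base, exp, mod),
    "sets": lambda a, b: sorted(a | b),
}

# Stage 1 fallback: pattern-based domain routing (the real wrapper asks the
# LLM first and only falls back to patterns).
PATTERNS = {
    "number_theory": re.compile(r"(\d+)\^(\d+) mod (\d+)"),
    "sets": re.compile(r"\bunion\b"),
}

def route_and_run(query: str):
    for domain, pattern in PATTERNS.items():
        m = pattern.search(query)
        if not m:
            continue
        # Stage 2: extract the query into tool inputs.
        if domain == "number_theory":
            args = tuple(int(g) for g in m.groups())
        else:
            nums = [int(n) for n in re.findall(r"-?\d+", query)]
            half = len(nums) // 2
            args = (set(nums[:half]), set(nums[half:]))
        # Stage 4: the tool output is the response, verbatim.
        return TOOLS[domain](*args)
    return None  # no tool matched; fall back to plain generation

print(route_and_run("What is 13^123 mod 997?"))
```

The key design point the sketch preserves is stage 4: because the tool output is returned without LLM regeneration, a correct route plus a correct extraction guarantees a correct answer.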
Quickstart
- Install dependencies (choose one):
  - Apple Silicon / MLX:
    pip install fluxem[full-mlx]
  - Transformers fallback:
    pip install fluxem[huggingface] torch
- Download a local Qwen3-4B model:
  - MLX: https://huggingface.co/Qwen/Qwen3-4B (MLX quantized)
- Enable MLX (only if using MLX):
  export FLUXEM_ENABLE_MLX=1
- Run:
from experiments.qwen3_toolcalling.qwen3_wrapper import create_wrapper
wrapper = create_wrapper(
model_path="~/.mlx/models/Qwen/Qwen3-4B-Instruct-MLX",
tool_selection="llm",
llm_query_extraction=True,
)
wrapper.load_model()
result = wrapper.generate_with_tools(
"A box has 12 apples. You eat 5. How many are left?"
)
print(result)
Benchmarks (internal)
We include two internal synthetic datasets to validate the tool-routing pipeline, not general language understanding:
- Hard: 58 prompts across 11 domains (experiments/qwen3_toolcalling/benchmark_data_hard.py)
- Very hard: 33 prompts across 11 domains (experiments/qwen3_toolcalling/benchmark_data_very_hard.py)
Example questions by domain (from benchmark_data_very_hard.py):
- arithmetic: "Compute ((987654321 - 123456789) * (54321 + 98765) + 7^9) / 13"
- physics: "Convert 88 ft/s to m/s"
- chemistry: "What is the molecular weight of Ca3(PO4)2?"
- biology: "What's the GC content of ATGCGTACGATCGGATCCGATCGTAGCTAGC?"
- math: "Calculate the determinant of [[4, 2, 0, 1], [3, 5, 1, 2], [2, 0, 3, 4], [1, 2, 4, 0]]"
- music: "Transpose [1, 4, 8] by -7 semitones"
- geometry: "Rotate [7, -2] by 225 degrees"
- graphs: "What's the shortest path from node 0 to node 6 in Graph(nodes={0,1,2,3,4,5,6}, edges=[(0,1),(1,2),(2,6),(0,3),(3,4),(4,5),(5,6),(1,4)])?"
- sets: "What is the union of {-10, -5, 0, 5, 10} and {-5, -3, 2, 5, 7}?"
- logic: "Is 'p or not p' a tautology?"
- number_theory: "What is 13^123 mod 997?"
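Because the target answers are deterministic, several of these questions can be spot-checked with the Python standard library alone, independently of FluxEM:

```python
# Independent checks of a few benchmark answers using only pure Python.

# number_theory: 13^123 mod 997 via three-argument pow (modular exponentiation)
nt = pow(13, 123, 997)

# sets: union of the two example sets
union = {-10, -5, 0, 5, 10} | {-5, -3, 2, 5, 7}

# logic: 'p or not p' holds under every truth assignment, so it is a tautology
tautology = all(p or not p for p in (False, True))

# biology: GC content = fraction of G and C bases in the sequence
seq = "ATGCGTACGATCGGATCCGATCGTAGCTAGC"
gc = (seq.count("G") + seq.count("C")) / len(seq)

print(nt, sorted(union), tautology, round(gc, 3))
```

Checks like these are how the benchmark can score tool outputs exactly rather than by fuzzy string matching.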
Results (internal)
Very hard benchmark:
- Overall accuracy: tool-calling 100.0%, baseline 39.4%
- Average improvement: 2.0x
- Raw results: experiments/qwen3_toolcalling/results/benchmark_results_20260107_004119.json
Hard benchmark:
- Overall accuracy: tool-calling 100.0%, baseline 43.1%
- Average improvement: 2.3x
- Raw results: experiments/qwen3_toolcalling/results/benchmark_results_20260106_232757.json
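The reported percentages are consistent with simple correct/total counts. The counts below (13/33 and 25/58 baseline-correct) are inferred from the figures above, not read from the raw JSON, so treat them as an assumption:

```python
# Reproduce the reported accuracy figures from assumed correct/total counts.
def accuracy(correct: int, total: int) -> float:
    """Percentage accuracy rounded to one decimal place."""
    return round(100 * correct / total, 1)

print(accuracy(13, 33))  # assumed baseline on very hard (33 prompts)
print(accuracy(25, 58))  # assumed baseline on hard (58 prompts)
print(accuracy(33, 33))  # tool-calling, very hard
```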
Related work
| Approach | Method | Difference |
|---|---|---|
| NALU | Learned log/exp gates | FluxEM uses deterministic algebraic tools with no learned parameters. |
| xVal | Learned scaling for numerical reasoning | FluxEM provides multi-domain encoders with exact operators. |
| Abacus | Positional digits for arithmetic | FluxEM encodes algebraic structure rather than positional digits. |
| ToolOrchestra | LLM tool orchestration | FluxEM applies orchestration to algebraic tools with deterministic outputs. |
Usage
See experiments/qwen3_toolcalling/README.md for setup and full benchmark instructions.
Limitations
- Benchmarks are internal and synthetic; they validate routing and tool execution, not general knowledge.
- Tool coverage is limited to FluxEM's 11 domains.
- Natural language query extraction can still fail for unusual phrasing.
- No multi-turn tool calling yet.
License
MIT (wrapper code). The base model is subject to its original license terms.
Model tree for hunterbown/fluxem-qwen3-4b
Base model
Qwen/Qwen3-4B-Instruct-2507