FluxEM-Qwen3-4B Tool-Calling

Summary

FluxEM-Qwen3-4B is a tool-orchestration wrapper around Qwen3-4B-Instruct that routes queries to deterministic FluxEM tools across 11 algebraic domains (arithmetic, physics, chemistry, biology, math, music, geometry, graphs, sets, logic, number theory). It is not a fine-tuned model; it is a tool-calling pipeline and benchmark suite for reproducible evaluation.

Research framing

Inspired by NVIDIA ToolOrchestra (https://research.nvidia.com/labs/lpr/ToolOrchestra/) and prior embedding work, this project asks what counts as a "tool" when the knowledge is algebraic and not naturally tokenized. FluxEM encoders are algebraically structured and deterministic; treating them as tools can yield reliable computation with less training compute than trying to learn those operators from text.

How it works

  • Domain routing and tool selection (LLM-first, with pattern fallback)
  • Query extraction into tool inputs
  • Tool execution in FluxEM
  • Response returns the tool output directly (no LLM regeneration); a minimal sketch of this flow follows below
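
A minimal, self-contained sketch of that control flow, using toy regex routing/extraction and stand-in tools (all names and the extraction logic here are illustrative assumptions; the actual wrapper asks the LLM first and dispatches to FluxEM's own tools):

import re

# Stand-in deterministic "tools"; FluxEM's real encoders cover 11 domains.
TOOLS = {
    "number_theory": lambda b, e, m: pow(b, e, m),                # modular exponentiation
    "arithmetic": lambda expr: eval(expr, {"__builtins__": {}}),  # demo only
}

def route(query):
    # Pattern fallback only; the real pipeline asks the LLM to pick a domain first.
    return "number_theory" if re.search(r"\bmod\b", query) else "arithmetic"

def answer(query):
    domain = route(query)                                   # 1. domain routing
    if domain == "number_theory":                            # 2. query -> tool inputs
        b, e, m = map(int, re.findall(r"\d+", query))
        result = TOOLS[domain](b, e, m)                      # 3. tool execution (FluxEM in the real pipeline)
    else:
        expr = re.search(r"\d[\d\s+\-*/().]*", query).group()
        result = TOOLS[domain](expr)
    return str(result)                                       # 4. tool output returned directly

print(answer("What is 13^123 mod 997?"))
print(answer("Compute 12 - 5"))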

Quickstart

  1. Install dependencies (choose one):
    • Apple Silicon / MLX: pip install fluxem[full-mlx]
    • Transformers fallback: pip install fluxem[huggingface] torch
  2. Download a local Qwen3-4B model (e.g., to ~/.mlx/models/Qwen/Qwen3-4B-Instruct-MLX, the path used in the snippet below)
  3. Enable MLX (only if using MLX): export FLUXEM_ENABLE_MLX=1
  4. Run:
from experiments.qwen3_toolcalling.qwen3_wrapper import create_wrapper

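# model_path points at the local Qwen3-4B download from step 2 above.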
wrapper = create_wrapper(
    model_path="~/.mlx/models/Qwen/Qwen3-4B-Instruct-MLX",
    tool_selection="llm",
    llm_query_extraction=True,
)
wrapper.load_model()

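# The prompt is routed to a FluxEM tool and the tool's output is returned directly.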
result = wrapper.generate_with_tools(
    "A box has 12 apples. You eat 5. How many are left?"
)
print(result)
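
For this prompt the correct answer is 7; because the response is the tool output rather than regenerated text (see How it works), the printed result comes straight from the deterministic arithmetic tool. Exact output formatting is the wrapper's choice.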

Benchmarks (internal)

We include two internal synthetic datasets to validate the tool-routing pipeline, not general language understanding:

  • Hard: 58 prompts across 11 domains (experiments/qwen3_toolcalling/benchmark_data_hard.py)
  • Very hard: 33 prompts across 11 domains (experiments/qwen3_toolcalling/benchmark_data_very_hard.py)

Example questions by domain (from benchmark_data_very_hard.py):

  • arithmetic: "Compute ((987654321 - 123456789) * (54321 + 98765) + 7^9) / 13"
  • physics: "Convert 88 ft/s to m/s"
  • chemistry: "What is the molecular weight of Ca3(PO4)2?"
  • biology: "What's the GC content of ATGCGTACGATCGGATCCGATCGTAGCTAGC?"
  • math: "Calculate the determinant of [[4, 2, 0, 1], [3, 5, 1, 2], [2, 0, 3, 4], [1, 2, 4, 0]]"
  • music: "Transpose [1, 4, 8] by -7 semitones"
  • geometry: "Rotate [7, -2] by 225 degrees"
  • graphs: "What's the shortest path from node 0 to node 6 in Graph(nodes={0,1,2,3,4,5,6}, edges=[(0,1),(1,2),(2,6),(0,3),(3,4),(4,5),(5,6),(1,4)])?"
  • sets: "What is the union of {-10, -5, 0, 5, 10} and {-5, -3, 2, 5, 7}?"
  • logic: "Is 'p or not p' a tautology?"
  • number_theory: "What is 13^123 mod 997?"
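
Several of these items have closed-form answers that can be sanity-checked with Python's standard library alone (an illustration of the ground truth, not FluxEM's internal code):

# Number theory: 13^123 mod 997 via built-in modular exponentiation
print(pow(13, 123, 997))

# Sets: union of the two example sets
print(sorted({-10, -5, 0, 5, 10} | {-5, -3, 2, 5, 7}))

# Biology: GC content of the example DNA sequence
seq = "ATGCGTACGATCGGATCCGATCGTAGCTAGC"
print((seq.count("G") + seq.count("C")) / len(seq))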

Results (internal)

Very hard benchmark:

  • Overall accuracy: tool-calling 100.0%, baseline 39.4%
  • Average improvement: 2.0x
  • Raw results: experiments/qwen3_toolcalling/results/benchmark_results_20260107_004119.json

Hard benchmark:

  • Overall accuracy: tool-calling 100.0%, baseline 43.1%
  • Average improvement: 2.3x
  • Raw results: experiments/qwen3_toolcalling/results/benchmark_results_20260106_232757.json

Related work

  • NALU (learned log/exp gates): FluxEM uses deterministic algebraic tools with no learned parameters.
  • xVal (learned scaling for numerical reasoning): FluxEM provides multi-domain encoders with exact operators.
  • Abacus (positional digits for arithmetic): FluxEM encodes algebraic structure rather than positional digits.
  • ToolOrchestra (LLM tool orchestration): FluxEM applies orchestration to algebraic tools with deterministic outputs.

Usage

See experiments/qwen3_toolcalling/README.md for setup and full benchmark instructions.

Limitations

  • Benchmarks are internal and synthetic; they validate routing and tool execution, not general knowledge.
  • Tool coverage is limited to FluxEM's 11 domains.
  • Natural language query extraction can still fail for unusual phrasing.
  • No multi-turn tool calling yet.

License

MIT (wrapper code). The base model is subject to its original license terms.
