---
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
library_name: transformers
model_type: causal-lm
base_model: Qwen/Qwen3-4B
tags:
  - reasoning
  - tree-of-thoughts
  - gnn
  - self-improving
  - autonomous-training
  - multi-agent
  - variance-curriculum
  - reinforcement-learning
datasets:
  - gsm8k
  - mmlu
  - gpqa
  - arc-challenge
  - truthfulqa
metrics:
  - accuracy
inference: true
training: true
---

# TRIDENT

TRIDENT is a reasoning-focused 4B-parameter language model that improves its own reasoning capability through algorithmic self-improvement, rather than parameter scaling.

The model is built on Qwen3-4B and enhanced using the TRIDENT framework: a combination of GNN-guided Tree-of-Thoughts search, multi-agent reasoning policies, and variance-based self-training.


## Overview

Traditional large language model training depends on:

  • Human-written reasoning traces
  • Manually curated preference datasets
  • Static fine-tuning pipelines

TRIDENT removes these dependencies.

Instead, the model:

  1. Explores multiple reasoning paths
  2. Evaluates them using a learned GNN policy
  3. Selects high-uncertainty problems automatically
  4. Generates its own training supervision
  5. Distills improvements back into the model using LoRA
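The loop above can be sketched in miniature. All helper names here (`explore_paths`, `gnn_promise`, `reward`) are illustrative stubs, not the released API; the real components are a sampled Tree-of-Thoughts search, a learned GNN policy, and verifiable task rewards.

```python
import random
import statistics

# Illustrative stand-ins for the real components (names are hypothetical).
def explore_paths(problem, n=4):
    """Step 1: sample several candidate reasoning paths for a problem."""
    return [f"{problem}::path{i}" for i in range(n)]

def gnn_promise(path):
    """Step 2: a learned GNN policy would score each branch; stubbed here."""
    return random.random()

def reward(path):
    """Verifiable task reward (e.g. exact match on the final answer); stubbed."""
    return random.choice([0.0, 1.0])

def self_improvement_round(problems, k=2):
    """One round: explore, score, pick high-variance problems, emit supervision."""
    stats = {}
    for p in problems:
        paths = explore_paths(p)
        rewards = [reward(path) for path in paths]
        best = max(paths, key=gnn_promise)            # Step 2: GNN-guided selection
        stats[p] = (statistics.pvariance(rewards), best)
    # Step 3: keep the k problems where reward variance (inconsistency) is highest
    curriculum = sorted(stats, key=lambda p: stats[p][0], reverse=True)[:k]
    # Step 4: the best path on each selected problem becomes training supervision
    return [(p, stats[p][1]) for p in curriculum]
    # Step 5 (not shown): distill these (problem, path) pairs back via LoRA

random.seed(0)
batch = self_improvement_round(["q1", "q2", "q3"])
```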

Evaluation results (`model-index`):

```yaml
model-index:
  - name: TRIDENT
    results:
      - task:
          type: text-generation
        dataset:
          name: GSM8K
          type: gsm8k
          split: test
        metrics:
          - type: accuracy
            value: 86.58
      - task:
          type: text-generation
        dataset:
          name: MMLU
          type: mmlu
          split: test
        metrics:
          - type: accuracy
            value: 72.61
      - task:
          type: text-generation
        dataset:
          name: GPQA
          type: gpqa
          split: test
        metrics:
          - type: accuracy
            value: 42.42
      - task:
          type: text-generation
        dataset:
          name: ARC-Challenge
          type: arc-challenge
          split: test
        metrics:
          - type: accuracy
            value: 59.0
```

## Core Capabilities

### GNN-Guided Tree-of-Thoughts

Reasoning is represented as a directed graph of intermediate states.
A 3-layer Graph Convolutional Network predicts a promise score for each branch, guiding exploration and pruning.
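As a rough illustration (not the released architecture or weights), a 3-layer GCN promise scorer over a small reasoning graph can be sketched in plain NumPy, using the standard symmetric normalization of the adjacency matrix:

```python
import numpy as np

def normalize_adj(A):
    """Symmetric GCN normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_promise_scores(A, X, weights):
    """3-layer GCN: H <- relu(A_norm @ H @ W); last layer maps to one logit per node."""
    A_norm = normalize_adj(A)
    H = X
    for i, W in enumerate(weights):
        H = A_norm @ H @ W
        if i < len(weights) - 1:
            H = np.maximum(H, 0.0)        # ReLU on the two hidden layers
    # Sigmoid squashes each node's logit into a (0, 1) promise score
    return 1.0 / (1.0 + np.exp(-H[:, 0]))

rng = np.random.default_rng(0)
# A tiny reasoning tree: node 0 is the root, nodes 2 and 3 are leaves
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = rng.normal(size=(4, 8))               # per-node intermediate-state embeddings
weights = [rng.normal(size=(8, 16)),
           rng.normal(size=(16, 16)),
           rng.normal(size=(16, 1))]      # untrained weights, for shape only
scores = gcn_promise_scores(A, X, weights)  # one promise score per branch
```

In the actual framework these scores guide which branches of the thought tree are expanded and which are pruned.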

### Multi-Agent Reasoning

Four internal agents (Conservative, Exploratory, Balanced, Reflective) vote on reasoning actions to balance exploration and correctness.
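A minimal sketch of such a vote, assuming plurality voting over candidate actions. The agent preference functions and the action attributes (`risk`, `novelty`, `consistency`) are hypothetical stand-ins for the learned policies:

```python
from collections import Counter

# Illustrative agent policies (the real agents are learned; these are stubs).
AGENT_PREFERENCES = {
    "Conservative": lambda a: -a["risk"],             # prefers low-risk steps
    "Exploratory":  lambda a: a["novelty"],           # prefers novel branches
    "Balanced":     lambda a: a["novelty"] - a["risk"],
    "Reflective":   lambda a: a["consistency"],       # prefers self-consistent steps
}

def vote(actions):
    """Each agent votes for its top-ranked action; the plurality winner is taken."""
    ballots = Counter()
    for name, prefer in AGENT_PREFERENCES.items():
        choice = max(actions, key=prefer)
        ballots[choice["id"]] += 1
    winner, _ = ballots.most_common(1)[0]
    return winner

actions = [
    {"id": "expand-left",  "risk": 0.2, "novelty": 0.4, "consistency": 0.9},
    {"id": "expand-right", "risk": 0.9, "novelty": 0.9, "consistency": 0.3},
]
winner = vote(actions)  # three of four agents favor the safer, consistent branch
```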

### Variance-Based Curriculum

Problems are selected for training based on reward variance, targeting examples where the model is inconsistent and learning signal is highest.
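The selection rule can be sketched directly (the reward numbers below are hypothetical):

```python
import statistics

def select_curriculum(reward_history, k=2):
    """Pick the k problems with the highest reward variance across attempts.

    High variance means the model sometimes succeeds and sometimes fails --
    exactly where the learning signal is most informative.
    """
    variances = {pid: statistics.pvariance(rs) for pid, rs in reward_history.items()}
    return sorted(variances, key=variances.get, reverse=True)[:k]

# Rewards from repeated attempts per problem (illustrative numbers).
history = {
    "easy":   [1.0, 1.0, 1.0, 1.0],   # always solved: variance 0, nothing to learn
    "hard":   [0.0, 0.0, 0.0, 0.0],   # never solved: variance 0, no signal either
    "edge-1": [1.0, 0.0, 1.0, 0.0],   # inconsistent: maximal variance
    "edge-2": [1.0, 0.0, 0.0, 0.0],   # occasionally solved: high variance
}
picked = select_curriculum(history)   # the two inconsistent problems
```

Note that both the always-solved and the never-solved problems are skipped: zero variance means zero signal either way.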

### Self-Generative Reasoning Loop

No human-annotated reasoning traces are used.
The model autonomously generates, evaluates, and curates its own reasoning data.

### Stable Training

A multi-layer reward stabilization mechanism prevents:

  • Reward collapse
  • Loss explosions
  • Infinite failure loops

The architecture is compatible with future GRPO-style reinforcement learning.
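One way such guards can be layered is sketched below. The specific mechanisms (reward clipping, an EMA baseline, a consecutive-failure cap) and all thresholds are assumptions chosen to illustrate the three failure modes above, not the released implementation:

```python
class RewardStabilizer:
    """Illustrative guards against reward collapse, loss spikes, and failure loops.

    The clipping bound, EMA baseline, and failure cap are assumptions,
    not the released stabilization mechanism.
    """

    def __init__(self, clip=5.0, ema_decay=0.9, max_failures=3):
        self.clip = clip
        self.ema_decay = ema_decay
        self.baseline = 0.0
        self.failures = 0
        self.max_failures = max_failures

    def step(self, raw_reward):
        # Guard 1: clip extreme rewards so one outlier cannot explode the loss
        r = max(-self.clip, min(self.clip, raw_reward))
        # Guard 2: subtract an EMA baseline so advantages stay centered near zero
        advantage = r - self.baseline
        self.baseline = self.ema_decay * self.baseline + (1 - self.ema_decay) * r
        # Guard 3: cap consecutive failures to break infinite failure loops
        if r <= 0:
            self.failures += 1
            if self.failures >= self.max_failures:
                return None  # signal the trainer to skip/resample this problem
        else:
            self.failures = 0
        return advantage

stab = RewardStabilizer()
first = stab.step(100.0)   # clipped to 5.0 before the (zero) baseline is applied
```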




## Benchmark Results

Accuracy comparison against the base model:

| Benchmark | Qwen3-4B | TRIDENT |
|---|---:|---:|
| GSM8K (5-shot) | 74.14 | 86.58 |
| MMLU (5-shot) | 47.70 | 72.61 |
| ARC-C (25-shot) | 54.0 | 59.0 |
| GPQA (0-shot) | 28.28 | 42.42 |
| Winogrande (0-shot) | 59.6 | 67.08 |
| TruthfulQA (0-shot) | 54.9 | 54.7 |

**Highlight:** +14.14 percentage-point improvement on GPQA (0-shot).


## Intended Use

TRIDENT is suitable for:

  • Multi-step mathematical reasoning
  • Scientific and logical inference
  • Hard QA benchmarks
  • Planning and hypothesis exploration
  • Research on reasoning systems

## Limitations

  • Higher inference-time compute than single-pass models
  • Not optimized for low-latency chat
  • Best used where reasoning depth matters more than speed

## Ethical Considerations

  • No human-written reasoning traces used
  • No preference data collection
  • Training relies on verifiable task rewards
  • Like all LLMs, may hallucinate outside structured reasoning workflows

## Paper

https://www.shivik.in/shivik-labs/trident

## Citation

```bibtex
@article{puri2025trident,
  title={TRIDENT: Thought-based Reasoning and Improvement through Deep Exploration of Neuronal Trees},
  author={Puri, Shivansh and Khandelwal, Abhisek and Joshi, Vedant and Yadav, Akash},
  year={2025}
}
```