---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
model_type: causal-lm
base_model: Qwen/Qwen3-4B
tags:
- reasoning
- tree-of-thoughts
- gnn
- self-improving
- autonomous-training
- multi-agent
- variance-curriculum
- reinforcement-learning
datasets:
- gsm8k
- mmlu
- gpqa
- arc-challenge
- truthfulqa
metrics:
- accuracy
inference: true
training: true
model-index:
- name: TRIDENT
  results:
  - task:
      type: text-generation
    dataset:
      name: GSM8K
      type: gsm8k
      split: test
    metrics:
    - type: accuracy
      value: 86.58
  - task:
      type: text-generation
    dataset:
      name: MMLU
      type: mmlu
      split: test
    metrics:
    - type: accuracy
      value: 72.61
  - task:
      type: text-generation
    dataset:
      name: GPQA
      type: gpqa
      split: test
    metrics:
    - type: accuracy
      value: 42.42
  - task:
      type: text-generation
    dataset:
      name: ARC-Challenge
      type: arc-challenge
      split: test
    metrics:
    - type: accuracy
      value: 59.0
---

# TRIDENT

**TRIDENT** is a reasoning-focused 4B-parameter language model that improves its own reasoning capability through **algorithmic self-improvement**, rather than parameter scaling.

The model is built on **Qwen3-4B** and enhanced using the **TRIDENT framework**: a combination of GNN-guided Tree-of-Thoughts search, multi-agent reasoning policies, and variance-based self-training.

---

## Overview

Traditional large language model training depends on:

- Human-written reasoning traces
- Manually curated preference datasets
- Static fine-tuning pipelines

**TRIDENT removes these dependencies.**

Instead, the model (see the sketch after this list):

1. Explores multiple reasoning paths
2. Evaluates them using a learned GNN policy
3. Selects high-uncertainty problems automatically
4. Generates its own training supervision
5. Distills improvements back into the model using LoRA
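
The five steps compose into a single training round. Below is a minimal Python sketch of that loop; every callable (`explore_paths`, `promise`, `select_by_variance`, `distill_lora`) is a hypothetical placeholder standing in for the corresponding TRIDENT component, not a released API.

```python
# Hypothetical sketch of one TRIDENT self-improvement round.
# All callables are placeholders for the framework's actual components.
from typing import Callable, List, Sequence, Tuple

def self_improvement_round(
    problems: Sequence[str],
    explore_paths: Callable[[str], List[str]],                 # step 1: sample reasoning paths
    promise: Callable[[str], float],                           # step 2: learned GNN scorer
    select_by_variance: Callable[[Sequence[str]], List[str]],  # step 3: curriculum
    distill_lora: Callable[[List[Tuple[str, str]]], None],     # step 5: LoRA update
) -> None:
    targets = select_by_variance(problems)    # keep the high-uncertainty problems
    supervision = []
    for prompt in targets:
        paths = explore_paths(prompt)         # explore multiple reasoning paths
        best = max(paths, key=promise)        # keep the most promising path
        supervision.append((prompt, best))    # step 4: self-generated supervision
    distill_lora(supervision)                 # fold the improvement back into the model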

---

## Core Capabilities

### GNN-Guided Tree-of-Thoughts

Reasoning is represented as a directed graph of intermediate states. A 3-layer Graph Convolutional Network predicts a **promise score** for each branch, guiding exploration and pruning.
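
A minimal sketch of such a scorer in plain PyTorch, assuming node features for each reasoning state and a dense adjacency matrix over the thought graph; the layer widths, activation, and symmetric normalization are illustrative choices, not the published architecture.

```python
import torch
import torch.nn as nn

class PromiseGCN(nn.Module):
    """3-layer GCN mapping each reasoning-state node to a scalar promise score."""

    def __init__(self, in_dim: int, hidden: int = 128):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(in_dim, hidden),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, hidden),
        ])
        self.head = nn.Linear(hidden, 1)  # scalar promise score per node

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: [N, in_dim] node features; adj: [N, N] adjacency of the thought graph
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
        d_inv_sqrt = a_hat.sum(-1).clamp(min=1e-6).rsqrt()
        norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]  # D^-1/2 (A+I) D^-1/2
        for layer in self.layers:
            x = torch.relu(layer(norm @ x))   # aggregate neighbors, then transform
        return torch.sigmoid(self.head(x)).squeeze(-1)  # promise in (0, 1)
```

Scores near 1 mark branches worth expanding first; low-scoring branches are candidates for pruning.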

### Multi-Agent Reasoning

Four internal agents (Conservative, Exploratory, Balanced, Reflective) vote on reasoning actions to balance exploration and correctness.
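
One plausible reading of that vote is sketched below; the four agent roles come from this card, while the one-agent-one-vote scheme and per-agent scoring functions are assumptions.

```python
# Illustrative majority vote over candidate reasoning actions.
# Only the four agent names come from the model card; the rest is assumed.
from collections import Counter
from typing import Callable, Dict, Sequence

AGENTS = ("Conservative", "Exploratory", "Balanced", "Reflective")

def vote(actions: Sequence[str],
         policies: Dict[str, Callable[[str], float]]) -> str:
    """Each agent scores every candidate action and votes for its favorite;
    the action with the most votes wins (ties broken by vote order)."""
    ballots = Counter()
    for agent in AGENTS:
        ballots[max(actions, key=policies[agent])] += 1
    return ballots.most_common(1)[0][0]
```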

### Variance-Based Curriculum

Problems are selected for training based on **reward variance**, targeting examples where the model is inconsistent and the learning signal is highest.
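
A hedged sketch of that selection rule; `sample_reward` is a hypothetical callable that runs the model once on a problem and returns a scalar reward.

```python
# Variance-based curriculum selection (sketch). Problems where repeated
# attempts give very different rewards carry the strongest learning signal.
import statistics
from typing import Callable, List, Sequence

def select_curriculum(problems: Sequence[str],
                      sample_reward: Callable[[str], float],
                      n_samples: int = 8,
                      top_k: int = 256) -> List[str]:
    scored = []
    for p in problems:
        rewards = [sample_reward(p) for _ in range(n_samples)]
        scored.append((statistics.variance(rewards), p))  # sample variance
    scored.sort(key=lambda t: t[0], reverse=True)         # most inconsistent first
    return [p for _, p in scored[:top_k]]
```

Problems the model always solves, or always fails, have near-zero variance and contribute little, so they are skipped in favor of the inconsistent middle ground.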

### Self-Generative Reasoning Loop

No human-annotated reasoning traces are used. The model autonomously generates, evaluates, and curates its own reasoning data.

### Stable Training

A multi-layer reward stabilization mechanism (sketched below) prevents:

- Reward collapse
- Loss explosions
- Infinite failure loops
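
The card does not spell the mechanism out, so the following is a loudly hedged illustration of one common stabilization recipe (running reward normalization, clipping, and a failure-loop breaker), not TRIDENT's actual implementation.

```python
# Assumed illustration of reward stabilization; not the published mechanism.
class RewardStabilizer:
    """Running-statistics normalization + clipping + failure-loop breaker."""

    def __init__(self, clip: float = 3.0, momentum: float = 0.99,
                 max_failures: int = 10):
        self.mean, self.var = 0.0, 1.0
        self.clip, self.momentum = clip, momentum
        self.failures, self.max_failures = 0, max_failures

    def stabilized(self, reward: float) -> float:
        # Exponential moving statistics keep rewards on a steady scale,
        # guarding against reward collapse.
        self.mean = self.momentum * self.mean + (1 - self.momentum) * reward
        self.var = self.momentum * self.var + (1 - self.momentum) * (reward - self.mean) ** 2
        z = (reward - self.mean) / (self.var ** 0.5 + 1e-8)
        # Clipping bounds the update magnitude, guarding against loss explosions.
        return max(-self.clip, min(self.clip, z))

    def stuck(self, reward: float) -> bool:
        # A long run of consecutive non-positive rewards signals an infinite
        # failure loop; the caller should resample problems when this fires.
        self.failures = self.failures + 1 if reward <= 0 else 0
        return self.failures >= self.max_failures
```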

The architecture is compatible with future GRPO-style reinforcement learning.

---

## Benchmark Results

Accuracy comparison against the base model:

| Benchmark | Qwen3-4B | TRIDENT |
|-----------|----------|---------|
| GSM8K (5-shot) | 74.14 | **86.58** |
| MMLU (5-shot) | 47.70 | **72.61** |
| ARC-C (25-shot) | 54.0 | **59.0** |
| GPQA (0-shot) | 28.28 | **42.42** |
| Winogrande (0-shot) | 59.6 | **67.08** |
| TruthfulQA (0-shot) | 54.9 | 54.7 |

**Highlight:** +14.14 percentage point improvement on **GPQA (0-shot)**. TruthfulQA is essentially flat (−0.2 points); every other benchmark improves.

---

## Intended Use

TRIDENT is suitable for:

- Multi-step mathematical reasoning
- Scientific and logical inference
- Hard QA benchmarks
- Planning and hypothesis exploration
- Research on reasoning systems
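
For these uses the model loads through the standard `transformers` API. The repo id below is hypothetical; substitute the actual one for this checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shivik/TRIDENT"  # hypothetical repo id; replace with the real one
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```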

---

## Limitations

- Higher inference-time compute than single-pass models
- Not optimized for low-latency chat
- Best used where reasoning depth matters more than speed

---

## Ethical Considerations

- No human-written reasoning traces used
- No preference data collection
- Training relies on verifiable task rewards
- Like all LLMs, may hallucinate outside structured reasoning workflows

---

## Paper

https://www.shivik.in/shivik-labs/trident

## Citation

```bibtex
@article{puri2025trident,
  title={TRIDENT: Thought-based Reasoning and Improvement through Deep Exploration of Neuronal Trees},
  author={Puri, Shivansh and Khandelwal, Abhisek and Joshi, Vedant and Yadav, Akash},
  year={2025}
}
```