---
license: other
license_name: research-only
license_link: LICENSE
language:
  - en
  - zh
tags:
  - reflexive-intelligence
  - multi-reward-grpo
  - cognitive-architecture
  - financial-reasoning
  - observer-depth
  - phase-transition
  - ouroboros
  - mixture-of-experts
pipeline_tag: text-generation
library_name: transformers
---

# Ouroboros V24: Cognitive Architecture for Reflexive Financial Reasoning

**Ouroboros V24** is the latest iteration of a cognitive architecture designed for autonomous financial decision-making. It is built on a 35B-parameter Mixture-of-Experts (MoE) base model with ~3B active parameters per token and was trained through **24 iterative rounds** of multi-reward GRPO with a **54-dimensional cognitive reward topology**.

> ⚠️ **Weights are not publicly released.** This model card documents the architecture and training methodology. For research collaboration inquiries, contact the author.

## Architecture

### Base Model
- **Type**: Mixture-of-Experts (MoE)
- **Total Parameters**: ~35B
- **Active Parameters**: ~3B per token
- **Context Window**: 32K tokens

### Training Methodology
- **Algorithm**: R-GRPO (Reflexive Group Relative Policy Optimization)
- **Training Rounds**: 24 iterative cycles (V1 → V24)
- **Adapter Strategy**: 20-layer sequential LoRA merge chain
- **Reward Architecture**: SCRGNDWMT (9-tier, 54 sub-dimensions)
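
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage step used by standard GRPO: each completion's reward is normalized against the other completions sampled for the same prompt. It is an illustration only. The reflexive extensions of R-GRPO are not described in this card, and the function name and reward values are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (standard GRPO)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical scalar rewards for one prompt's group of 12 completions.
group_rewards = [0.2, 0.5, 0.9, 0.4, 0.4, 0.7, 0.1, 0.6, 0.3, 0.8, 0.5, 0.6]
advantages = group_relative_advantages(group_rewards)
```

A group whose completions all receive the same reward yields all-zero advantages, so it contributes no policy-gradient signal for that prompt.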

### 9-Tier Reward Topology (SCRGNDWMT)

| Tier | Name | Sub-dimensions | Description |
|------|------|----------------|-------------|
| **S** | Structure | 6 | XML formatting, JSON decision blocks |
| **C** | Content | 7 | Domain expertise, data fidelity, causal depth |
| **R** | Reasoning | 5 | Temporal-causal chains, counterfactual depth |
| **G** | Game Theory | 5 | K-level thinking, deception detection, coalition |
| **N** | Narrative | 4 | Scenario construction, debate, arc coherence |
| **D** | Data Fidelity | 3 | Numerical accuracy, source attribution |
| **W** | World Model | 6 | Regime detection, cross-market transmission, macro |
| **M** | Metacognition | 7 | Self-awareness, Bayesian confidence, falsification |
| **T** | Temporal-Causal | 5 | Causal chains, temporal depth, granularity |
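
As a rough illustration of how tiered sub-dimension scores could be collapsed into one scalar reward, the sketch below averages scores within each tier and then takes a weighted mean across tiers. The aggregation rule, the equal default weights, and the function name are assumptions, not the scheme used in training. The per-tier counts are copied from the table as listed (they sum to 48).

```python
# Sub-dimension counts per tier, copied from the table above.
TIER_SUBDIMS = {"S": 6, "C": 7, "R": 5, "G": 5, "N": 4, "D": 3, "W": 6, "M": 7, "T": 5}

def aggregate_reward(sub_scores, tier_weights=None):
    """Average sub-dimension scores per tier, then take a weighted mean across tiers."""
    # Assumption: equal tier weights unless the caller supplies others.
    weights = tier_weights or {tier: 1.0 for tier in TIER_SUBDIMS}
    total, weight_sum = 0.0, 0.0
    for tier, n_dims in TIER_SUBDIMS.items():
        scores = sub_scores[tier]
        if len(scores) != n_dims:
            raise ValueError(f"tier {tier} expects {n_dims} scores, got {len(scores)}")
        total += weights[tier] * (sum(scores) / n_dims)
        weight_sum += weights[tier]
    return total / weight_sum

# A completion that maxes out every sub-dimension.
perfect = {tier: [1.0] * n for tier, n in TIER_SUBDIMS.items()}
print(aggregate_reward(perfect))
```

Averaging within a tier first keeps tiers with many sub-dimensions (such as C and M) from dominating the total simply because they have more entries.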

### V24 Upgrades (from V22)
- **C7 (CausalChainDepthV2)**: Multi-step causal chains with time-lag annotations
- **M7 (BayesianConfidence)**: Calibrated confidence field in JSON decisions
- **W3 (CrossMarketPath)**: Structural contagion paths (Market A → Mechanism → Market B)
- **M5 (FalsificationV2)**: Quantitative, price-based invalidation conditions
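
To make the M7 and M5 items concrete, here is a minimal sketch of a JSON decision block carrying a confidence field and a price-based invalidation condition, plus a validator for it. The field names (`action`, `confidence`, `invalidation`, `direction`, `price`) are hypothetical; the card does not publish the actual schema.

```python
import json

def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def validate_decision(raw):
    """Parse a JSON decision block and check the fields this sketch assumes."""
    decision = json.loads(raw)
    confidence = decision["confidence"]
    if not _is_number(confidence) or not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a number in [0, 1]")
    invalidation = decision["invalidation"]
    if invalidation["direction"] not in ("above", "below"):
        raise ValueError("invalidation direction must be 'above' or 'below'")
    if not _is_number(invalidation["price"]):
        raise ValueError("invalidation price must be numeric")
    return decision

# Hypothetical decision: the thesis is invalidated if price trades above 4200.
example = json.dumps({
    "action": "reduce_exposure",
    "confidence": 0.7,
    "invalidation": {"direction": "above", "price": 4200.0},
})
decision = validate_decision(example)
```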

### Key Training Parameters

| Parameter | Value |
|-----------|-------|
| Learning rate | 5 × 10⁻⁷ |
| Group size | 12 |
| Max completion tokens | 1000 |
| Temperature | 1.15 |
| β-annealing | Stable (β=0.05) ↔ Break-up (β=0.03) |
| LoRA rank | ≥ 10 |
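
The table above can be captured as a small configuration object, sketched below. The class and field names are hypothetical, and the sketch assumes β is the usual GRPO KL-penalty coefficient. The card does not state what triggers a switch between the stable and break-up phases, so the phase is passed in by the caller.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float = 5e-7
    group_size: int = 12
    max_completion_tokens: int = 1000
    temperature: float = 1.15
    beta_stable: float = 0.05   # β in the "Stable" phase
    beta_breakup: float = 0.03  # β in the "Break-up" phase
    lora_rank_min: int = 10     # the table gives only a lower bound

    def beta(self, phase):
        """Return β for a named phase; the switching criterion is left to the caller."""
        if phase == "stable":
            return self.beta_stable
        if phase == "breakup":
            return self.beta_breakup
        raise ValueError(f"unknown phase: {phase!r}")

cfg = TrainingConfig()
print(cfg.beta("stable"), cfg.beta("breakup"))
```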

## Key Results

### Reflexive Intelligence Emergence
During V17 training, reflexive reasoning emerged through a **discontinuous phase transition** at Step 153: after 150+ steps of zero reflexivity scores, the capability appeared spontaneously and persisted. This is documented in Papers P1-P3 of the research program.
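
One simple way to locate such an onset in a per-step score log is to find the first step at which the score rises above zero and stays there for several consecutive steps. The sketch below does this on a synthetic trace built purely for illustration; it is not the V17 training log, and the function name and the `min_run` window are assumptions.

```python
def first_sustained_onset(scores, min_run=5, threshold=0.0):
    """Return the 1-based step where scores first exceed `threshold` and stay
    above it for at least `min_run` consecutive steps, or None if none does."""
    run_start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_run:
                return run_start + 1  # convert 0-based index to 1-based step
        else:
            run_start = None  # an isolated blip does not count as emergence
    return None

# Synthetic trace (illustration only): 152 zero-score steps, then nonzero scores.
trace = [0.0] * 152 + [0.3, 0.4, 0.5, 0.5, 0.6, 0.6]
print(first_sustained_onset(trace))  # 153
```

Requiring a sustained run distinguishes a lasting transition from a single noisy nonzero score.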

### V24 Training (ongoing)
- **54-dimensional reward** actively guiding cognitive development
- **Bayesian confidence calibration** observed from Step 18
- **Cross-market causal reasoning** emerging by Step 25
- **Zero gradient failures** through 55+ steps
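
The card does not specify how confidence calibration is measured. One common choice is the Brier score, the mean squared gap between stated confidence and the realized binary outcome (lower is better). The sketch below is a generic implementation under that assumption, with hypothetical inputs.

```python
def brier_score(confidences, outcomes):
    """Mean squared error between stated confidences in [0, 1] and 0/1 outcomes."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# Hypothetical decisions: stated confidence vs. whether the call was correct.
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25, the score of an uninformative 50% forecast
```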

## Research Program

This model is part of a six-paper research program:

| Paper | Title | DOI |
|-------|-------|-----|
| P1 | Reflexive Intelligence in LLMs | [10.5281/zenodo.19557261](https://doi.org/10.5281/zenodo.19557261) |
| P2 | Observer Depth (ReflexBench) | [10.5281/zenodo.19627242](https://doi.org/10.5281/zenodo.19627242) |
| P3 | When Rewards Collide (Multi-Reward GRPO) | [10.5281/zenodo.19665969](https://doi.org/10.5281/zenodo.19665969) |
| P4 | Ouroboros V22 Architecture | [10.5281/zenodo.19666786](https://doi.org/10.5281/zenodo.19666786) |
| P5 | The Cognitive Lifecycle | [10.5281/zenodo.19666806](https://doi.org/10.5281/zenodo.19666806) |
| P6 | Cognitive Reward Topology | [10.5281/zenodo.19666829](https://doi.org/10.5281/zenodo.19666829) |

## Related Resources

| Resource | Link |
|----------|------|
| **ReflexBench Dataset** | [MMJBDS/reflexbench](https://huggingface.co/datasets/MMJBDS/reflexbench) |
| **ReflexBench Eval Results** | [MMJBDS/reflexbench-eval](https://huggingface.co/datasets/MMJBDS/reflexbench-eval) |
| **Papers Repository** | [github.com/mmjbds/ouroboros-papers](https://github.com/mmjbds/ouroboros-papers) |
| **Evaluation Code** | [github.com/mmjbds/reflexbench](https://github.com/mmjbds/reflexbench) |

## Citation

```bibtex
@article{zhang2026ouroborosv22,
  title={Ouroboros V22: Bayesian Scenario Simulation and Recurrent Depth Cognition},
  author={Zhang, Mian},
  year={2026},
  doi={10.5281/zenodo.19666786}
}

@article{zhang2026topology,
  title={Cognitive Reward Topology: A Nine-Tier Architecture for Multi-Reward GRPO},
  author={Zhang, Mian},
  year={2026},
  doi={10.5281/zenodo.19666829}
}
```

## Author

- **Mian Zhang** — Independent AI Researcher
- **ORCID**: [0009-0001-9556-3839](https://orcid.org/0009-0001-9556-3839)
- **Email**: 373743743@qq.com
- **GitHub**: [@mmjbds](https://github.com/mmjbds)
- **Twitter/X**: [@Henry_Avery666](https://x.com/Henry_Avery666)
- **LinkedIn**: [henryavery-mianzhang](https://linkedin.com/in/henryavery-mianzhang)

## License

This model card is released under CC BY 4.0. Model weights are not publicly available.