---
license: other
license_name: research-only
license_link: LICENSE
language:
- en
- zh
tags:
- reflexive-intelligence
- multi-reward-grpo
- cognitive-architecture
- financial-reasoning
- observer-depth
- phase-transition
- ouroboros
- mixture-of-experts
pipeline_tag: text-generation
library_name: transformers
---

# Ouroboros V24: Cognitive Architecture for Reflexive Financial Reasoning

**Ouroboros V24** is the latest iteration of a cognitive architecture designed for autonomous financial decision-making. It is built on a 35B-parameter Mixture-of-Experts (MoE) base model with ~3B active parameters per token and was trained through **24 iterative rounds** of multi-reward GRPO with a **54-dimensional cognitive reward topology**.

> ⚠️ **Weights are not publicly released.** This model card documents the architecture and training methodology. For research collaboration inquiries, contact the author.

## Architecture

### Base Model

- **Type**: Mixture-of-Experts (MoE)
- **Total Parameters**: ~35B
- **Active Parameters**: ~3B per token
- **Context Window**: 32K tokens

### Training Methodology

- **Algorithm**: R-GRPO (Reflexive Group Relative Policy Optimization)
- **Training Rounds**: 24 iterative cycles (V1 → V24)
- **Adapter Strategy**: 20-layer sequential LoRA merge chain
- **Reward Architecture**: SCRGNDWMT (9-tier, 54 sub-dimensions)

### 9-Tier Reward Topology (SCRGNDWMT)

| Tier | Name | Sub-dimensions | Description |
|------|------|----------------|-------------|
| **S** | Structure | 6 | XML formatting, JSON decision blocks |
| **C** | Content | 7 | Domain expertise, data fidelity, causal depth |
| **R** | Reasoning | 5 | Temporal-causal chains, counterfactual depth |
| **G** | Game Theory | 5 | K-level thinking, deception detection, coalition |
| **N** | Narrative | 4 | Scenario construction, debate, arc coherence |
| **D** | Data Fidelity | 3 | Numerical accuracy, source attribution |
| **W** | World Model | 6 | Regime detection, cross-market transmission, macro |
| **M** | Metacognition | 7 | Self-awareness, Bayesian confidence, falsification |
| **T** | Temporal-Causal | 5 | Causal chains, temporal depth, granularity |

### V24 Upgrades (from V22)

- **C7 (CausalChainDepthV2)**: Multi-step causal chains with time-lag annotations
- **M7 (BayesianConfidence)**: Calibrated confidence field in JSON decisions
- **W3 (CrossMarketPath)**: Structural contagion paths (Market A → Mechanism → Market B)
- **M5 (FalsificationV2)**: Quantitative, price-based invalidation conditions

### Key Training Parameters

| Parameter | Value |
|-----------|-------|
| Learning rate | 5 × 10⁻⁷ |
| Group size | 12 |
| Max completion tokens | 1000 |
| Temperature | 1.15 |
| β-annealing | Stable (β=0.05) ↔ Break-up (β=0.03) |
| LoRA rank | ≥ 10 |

## Key Results

### Reflexive Intelligence Emergence

During V17 training, reflexive reasoning emerged through a **discontinuous phase transition** at Step 153: after 150+ steps of zero reflexivity scores, the capability appeared spontaneously and was sustained. This is documented in Papers 1-3 of the research program.
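
The criterion used to locate the Step 153 transition is not stated in this card. As a minimal sketch, assuming the onset is defined as the first step whose reflexivity score rises above zero and stays there for several consecutive steps, it could be located as follows. The function name `first_sustained_onset`, the `min_run` window, and the score series are all hypothetical and purely illustrative; the series is synthetic and is not the V17 training log.

```python
from typing import Optional, Sequence


def first_sustained_onset(
    scores: Sequence[float],
    threshold: float = 0.0,
    min_run: int = 5,
) -> Optional[int]:
    """Return the 1-indexed step at which scores first exceed `threshold`
    and remain above it for at least `min_run` consecutive steps.

    Returns None if no such run exists.
    """
    run_start = None  # 0-indexed start of the current above-threshold run
    for i, score in enumerate(scores):
        if score > threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_run:
                return run_start + 1  # convert to a 1-indexed step number
        else:
            run_start = None  # run broken: a transient spike does not count
    return None


# Synthetic, illustrative series (an assumption, NOT real training data):
# 152 zero-score steps followed by sustained non-zero scores.
synthetic_scores = [0.0] * 152 + [0.4, 0.5, 0.45, 0.6, 0.55, 0.6]
print(first_sustained_onset(synthetic_scores))  # -> 153
```

Requiring a run of `min_run` steps rather than a single non-zero value is a design choice that separates a sustained transition from an isolated spike; the window length here is arbitrary.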

### V24 Training (ongoing)

- **54-dimensional reward** actively guiding cognitive development
- **Bayesian confidence calibration** observed from Step 18
- **Cross-market causal reasoning** emerging by Step 25
- **Zero gradient failures** through 55+ steps

## Research Program

This model is part of a six-paper research program:

| Paper | Title | DOI |
|-------|-------|-----|
| P1 | Reflexive Intelligence in LLMs | [10.5281/zenodo.19557261](https://doi.org/10.5281/zenodo.19557261) |
| P2 | Observer Depth (ReflexBench) | [10.5281/zenodo.19627242](https://doi.org/10.5281/zenodo.19627242) |
| P3 | When Rewards Collide (Multi-Reward GRPO) | [10.5281/zenodo.19665969](https://doi.org/10.5281/zenodo.19665969) |
| P4 | Ouroboros V22 Architecture | [10.5281/zenodo.19666786](https://doi.org/10.5281/zenodo.19666786) |
| P5 | The Cognitive Lifecycle | [10.5281/zenodo.19666806](https://doi.org/10.5281/zenodo.19666806) |
| P6 | Cognitive Reward Topology | [10.5281/zenodo.19666829](https://doi.org/10.5281/zenodo.19666829) |

## Related Resources

| Resource | Link |
|----------|------|
| **ReflexBench Dataset** | [MMJBDS/reflexbench](https://huggingface.co/datasets/MMJBDS/reflexbench) |
| **ReflexBench Eval Results** | [MMJBDS/reflexbench-eval](https://huggingface.co/datasets/MMJBDS/reflexbench-eval) |
| **Papers Repository** | [github.com/mmjbds/ouroboros-papers](https://github.com/mmjbds/ouroboros-papers) |
| **Evaluation Code** | [github.com/mmjbds/reflexbench](https://github.com/mmjbds/reflexbench) |

## Citation

```bibtex
@article{zhang2026ouroborosv22,
  title={Ouroboros V22: Bayesian Scenario Simulation and Recurrent Depth Cognition},
  author={Zhang, Mian},
  year={2026},
  doi={10.5281/zenodo.19666786}
}

@article{zhang2026topology,
  title={Cognitive Reward Topology: A Nine-Tier Architecture for Multi-Reward GRPO},
  author={Zhang, Mian},
  year={2026},
  doi={10.5281/zenodo.19666829}
}
```

## Author

- **Mian Zhang**, Independent AI Researcher
- **ORCID**: [0009-0001-9556-3839](https://orcid.org/0009-0001-9556-3839)
- **Email**: 373743743@qq.com
- **GitHub**: [@mmjbds](https://github.com/mmjbds)
- **Twitter/X**: [@Henry_Avery666](https://x.com/Henry_Avery666)
- **LinkedIn**: [henryavery-mianzhang](https://linkedin.com/in/henryavery-mianzhang)

## License

This model card is released under CC BY 4.0. Model weights are not publicly available.