	---
	license: other
	license_name: research-only
	license_link: LICENSE
	language:
	- en
	- zh
	tags:
	- reflexive-intelligence
	- multi-reward-grpo
	- cognitive-architecture
	- financial-reasoning
	- observer-depth
	- phase-transition
	- ouroboros
	- mixture-of-experts
	pipeline_tag: text-generation
	library_name: transformers
	---

	# Ouroboros V24: Cognitive Architecture for Reflexive Financial Reasoning

	Ouroboros V24 is the latest iteration of a cognitive architecture designed for autonomous financial decision-making. It is built on a 35B-parameter Mixture-of-Experts (MoE) base model with ~3B active parameters per token and trained through 24 iterative rounds of multi-reward GRPO with a 54-dimensional cognitive reward topology.

	> ⚠️ Weights are not publicly released. This model card documents the architecture and training methodology. For research collaboration inquiries, contact the author.

	## Architecture

	### Base Model
	- Type: Mixture-of-Experts (MoE)
	- Total Parameters: ~35B
	- Active Parameters: ~3B per token
	- Context Window: 32K tokens

	### Training Methodology
	- Algorithm: R-GRPO (Reflexive Group Relative Policy Optimization)
	- Training Rounds: 24 iterative cycles (V1 → V24)
	- Adapter Strategy: 20-layer sequential LoRA merge chain
	- Reward Architecture: SCRGNDWMT (9-tier, 54 sub-dimensions)
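
This card does not spell out what the sequential merge chain involves. As a rough sketch, assuming each round's adapter is folded into the weights with the standard LoRA update `W + (alpha / r) * B @ A` before the next round starts from the merged result, the chain could look like this. The function names, the `alpha` scaling, and the toy matrices are illustrative assumptions, not the author's code.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product (rows of a times columns of b)."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float) -> Matrix:
    """Fold one adapter into the weight: W + (alpha / r) * B @ A."""
    rank = len(lora_a)  # A has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def merge_chain(weight: Matrix, adapters: List[Tuple[Matrix, Matrix, float]]) -> Matrix:
    """Apply adapters in order; each merged result is the base for the next."""
    for lora_b, lora_a, alpha in adapters:
        weight = merge_lora(weight, lora_b, lora_a, alpha)
    return weight


# Toy example: a 2x2 weight and two rank-1 adapters (hypothetical values).
base = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0], [0.0]], [[0.0, 2.0]], 1.0),  # adds 2.0 at position (0, 1)
    ([[0.0], [1.0]], [[3.0, 0.0]], 1.0),  # adds 3.0 at position (1, 0)
]
merged = merge_chain(base, adapters)
```

Because each merge is a plain addition to the weights, the final matrix is the same regardless of merge order; the ordering matters during training, since each new adapter is fitted on top of the weights produced so far.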

	### 9-Tier Reward Topology (SCRGNDWMT)

	| Tier | Name | Sub-dimensions | Description |
	|------|------|----------------|-------------|
	| S | Structure | 6 | XML formatting, JSON decision blocks |
	| C | Content | 7 | Domain expertise, data fidelity, causal depth |
	| R | Reasoning | 5 | Temporal-causal chains, counterfactual depth |
	| G | Game Theory | 5 | K-level thinking, deception detection, coalition |
	| N | Narrative | 4 | Scenario construction, debate, arc coherence |
	| D | Data Fidelity | 3 | Numerical accuracy, source attribution |
	| W | World Model | 6 | Regime detection, cross-market transmission, macro |
	| M | Metacognition | 7 | Self-awareness, Bayesian confidence, falsification |
	| T | Temporal-Causal | 5 | Causal chains, temporal depth, granularity |
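
How the tier scores are combined into a single scalar reward is not stated in this card. The sketch below assumes the simplest option: each tier's sub-dimension scores are averaged, then the tier means are combined with a weighted sum. The equal weights, the [0, 1] score range, and the function names are assumptions for illustration; only the tier letters and sub-dimension counts come from the table above. The counts as listed add up to 48, so the sketch works from the table rather than from the headline figure of 54.

```python
from typing import Dict, List

# Tier letters and sub-dimension counts as listed in the table.
TIER_SIZES: Dict[str, int] = {
    "S": 6, "C": 7, "R": 5, "G": 5, "N": 4, "D": 3, "W": 6, "M": 7, "T": 5,
}


def aggregate_reward(
    scores: Dict[str, List[float]],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-tier mean scores (an assumed aggregation rule)."""
    total = 0.0
    for tier, size in TIER_SIZES.items():
        tier_scores = scores[tier]
        if len(tier_scores) != size:
            raise ValueError(
                f"tier {tier} expects {size} scores, got {len(tier_scores)}"
            )
        total += weights[tier] * (sum(tier_scores) / size)
    return total


# Hypothetical example: equal tier weights, every sub-dimension scored 0.5.
equal_weights = {tier: 1.0 / len(TIER_SIZES) for tier in TIER_SIZES}
uniform_scores = {tier: [0.5] * size for tier, size in TIER_SIZES.items()}
reward = aggregate_reward(uniform_scores, equal_weights)
```

Averaging within a tier before weighting keeps a tier with seven sub-dimensions from outweighing one with three purely because of its size.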

	### V24 Upgrades (from V22)
	- C7 (CausalChainDepthV2): Multi-step causal chains with time-lag annotations
	- M7 (BayesianConfidence): Calibrated confidence field in JSON decisions
	- W3 (CrossMarketPath): Structural contagion paths (Market A → Mechanism → Market B)
	- M5 (FalsificationV2): Quantitative, price-based invalidation conditions
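
The schema of the JSON decision block is not given in this card. As an illustration of what M7 and M5 ask for, the sketch below assumes a block with a `confidence` probability and an `invalidation` object holding a price level and a direction, then checks both. Every field name here (`action`, `confidence`, `invalidation`, `price_level`, `direction`) is hypothetical.

```python
import json
from typing import Any, Dict


def validate_decision(raw: str) -> Dict[str, Any]:
    """Parse a decision block and check the assumed M7 and M5 fields."""
    decision = json.loads(raw)

    confidence = decision["confidence"]
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a probability in [0, 1]")

    invalidation = decision["invalidation"]
    if not isinstance(invalidation["price_level"], (int, float)):
        raise ValueError("invalidation needs a numeric price level")
    if invalidation["direction"] not in ("above", "below"):
        raise ValueError("direction must be 'above' or 'below'")

    return decision


def is_invalidated(decision: Dict[str, Any], last_price: float) -> bool:
    """True once the price crosses the stated invalidation level."""
    level = decision["invalidation"]["price_level"]
    if decision["invalidation"]["direction"] == "below":
        return last_price < level
    return last_price > level


# Hypothetical decision: long position, invalidated if price falls below 98.5.
example = validate_decision(
    '{"action": "long", "confidence": 0.62,'
    ' "invalidation": {"price_level": 98.5, "direction": "below"}}'
)
```

A numeric level plus a direction makes the invalidation condition mechanically checkable against later prices, which is the property a quantitative, price-based condition is meant to provide.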

	### Key Training Parameters

	| Parameter | Value |
	|-----------|-------|
	| Learning rate | 5 × 10⁻⁷ |
	| Group size | 12 |
	| Max completion tokens | 1000 |
	| Temperature | 1.15 |
	| β-annealing | Stable (β=0.05) ↔ Break-up (β=0.03) |
	| LoRA rank | ≥ 10 |
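
The exact R-GRPO objective is left to the papers. For orientation, standard GRPO scores each sampled completion against the mean and standard deviation of its own group. The sketch below shows that normalization for a group of 12, plus a lookup for the two β values in the table. The phase keys, the epsilon, the use of the population standard deviation, and the function names are assumptions. How training switches between the two phases is not specified here, so no switching rule is shown.

```python
from statistics import mean, pstdev
from typing import Dict, List

GROUP_SIZE = 12  # completions sampled per prompt, from the table

# KL coefficient per phase, from the table; the keys are illustrative labels.
BETA_BY_PHASE: Dict[str, float] = {"stable": 0.05, "break_up": 0.03}


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO normalization: (r - group mean) / (group std + eps)."""
    if len(rewards) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} rewards, got {len(rewards)}")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for one prompt's 12 completions.
rewards = [0.2, 0.4, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.9, 1.0]
advantages = group_relative_advantages(rewards)
beta = BETA_BY_PHASE["stable"]
```

If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason a relatively high sampling temperature can help keep groups diverse.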

	## Key Results

	### Reflexive Intelligence Emergence
	During V17 training, reflexive reasoning emerged through a discontinuous phase transition at Step 153: after 150+ steps of zero reflexivity scores, the capability appeared spontaneously and persisted. This is documented in papers P1 to P3 of the research program.
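
One way to locate such an onset in a logged score series is to look for the first step where the score is non-zero and stays non-zero for a minimum number of consecutive steps. The sketch below does that on synthetic data; the function name, the window length, and the toy series are assumptions, not the analysis used in the papers.

```python
from typing import List, Optional


def first_sustained_onset(scores: List[float], window: int = 5) -> Optional[int]:
    """Return the first index where `window` consecutive scores are all > 0."""
    for start in range(len(scores) - window + 1):
        if all(s > 0.0 for s in scores[start:start + window]):
            return start
    return None


# Synthetic series: zeros, one isolated blip at index 3, sustained from index 8.
series = [0.0, 0.0, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0, 0.3, 0.5, 0.6, 0.6, 0.7]
onset = first_sustained_onset(series, window=5)
```

Requiring a run of non-zero scores, rather than a single non-zero value, separates a sustained transition from an isolated blip such as the one at index 3 in the toy series.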

	### V24 Training (ongoing)
	- 54-dimensional reward actively guiding cognitive development
	- Bayesian confidence calibration observed from Step 18
	- Cross-market causal reasoning emerging by Step 25
	- Zero gradient failures through 55+ steps
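
How calibration is measured during training is not described in this card. A common check, shown here purely as an assumed example, is the expected calibration error (ECE): predictions are binned by stated confidence, and the gap between each bin's mean confidence and its observed accuracy is averaged, weighted by bin size. The bin count and the sample data are hypothetical.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], outcomes: List[int], n_bins: int = 5
) -> float:
    """Size-weighted mean of |accuracy - mean confidence| over confidence bins."""
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for idx, conf in enumerate(confidences):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        bins[min(int(conf * n_bins), n_bins - 1)].append(idx)

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(outcomes[i] for i in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical data: stated confidences and whether each call turned out right.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.6, 0.6]
outcomes = [1, 1, 1, 0, 1, 1, 1, 0, 0]
ece = expected_calibration_error(confidences, outcomes)
```

In the sample data the 0.6-confidence calls are right three times out of five, so they contribute almost nothing to the error; the 0.9-confidence calls are right three times out of four, and that overconfidence accounts for essentially all of the score.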

	## Research Program

	This model is part of a six-paper research program:

	| Paper | Title | DOI |
	|-------|-------|-----|
	| P1 | Reflexive Intelligence in LLMs | [10.5281/zenodo.19557261](https://doi.org/10.5281/zenodo.19557261) |
	| P2 | Observer Depth (ReflexBench) | [10.5281/zenodo.19627242](https://doi.org/10.5281/zenodo.19627242) |
	| P3 | When Rewards Collide (Multi-Reward GRPO) | [10.5281/zenodo.19665969](https://doi.org/10.5281/zenodo.19665969) |
	| P4 | Ouroboros V22 Architecture | [10.5281/zenodo.19666786](https://doi.org/10.5281/zenodo.19666786) |
	| P5 | The Cognitive Lifecycle | [10.5281/zenodo.19666806](https://doi.org/10.5281/zenodo.19666806) |
	| P6 | Cognitive Reward Topology | [10.5281/zenodo.19666829](https://doi.org/10.5281/zenodo.19666829) |

	## Related Resources

	| Resource | Link |
	|----------|------|
	| ReflexBench Dataset | [MMJBDS/reflexbench](https://huggingface.co/datasets/MMJBDS/reflexbench) |
	| ReflexBench Eval Results | [MMJBDS/reflexbench-eval](https://huggingface.co/datasets/MMJBDS/reflexbench-eval) |
	| Papers Repository | [github.com/mmjbds/ouroboros-papers](https://github.com/mmjbds/ouroboros-papers) |
	| Evaluation Code | [github.com/mmjbds/reflexbench](https://github.com/mmjbds/reflexbench) |

	## Citation

	```bibtex
	@article{zhang2026ouroborosv22,
	title={Ouroboros V22: Bayesian Scenario Simulation and Recurrent Depth Cognition},
	author={Zhang, Mian},
	year={2026},
	doi={10.5281/zenodo.19666786}
	}

	@article{zhang2026topology,
	title={Cognitive Reward Topology: A Nine-Tier Architecture for Multi-Reward GRPO},
	author={Zhang, Mian},
	year={2026},
	doi={10.5281/zenodo.19666829}
	}
	```

	## Author

	- Mian Zhang — Independent AI Researcher
	- ORCID: [0009-0001-9556-3839](https://orcid.org/0009-0001-9556-3839)
	- Email: 373743743@qq.com
	- GitHub: [@mmjbds](https://github.com/mmjbds)
	- Twitter/X: [@Henry_Avery666](https://x.com/Henry_Avery666)
	- LinkedIn: [henryavery-mianzhang](https://linkedin.com/in/henryavery-mianzhang)

	## License

	This model card is released under CC BY 4.0. Model weights are not publicly available.
