Add comprehensive model card with benchmark links

README.md +203 -155

@@ -1,193 +1,241 @@
 ---
-license: mit
-task_categories:
-- question-answering
-- text-classification
-- text-generation
 language:
 - en
 tags:
-- benchmark
-- evaluation
-- ai-safety
-- mathematical-reasoning
-- medical-knowledge
 - biomimetic-ai
 - neurocardiac-sync
-size_categories:
-- n<1K
-configs:
-- config_name: default
-  data_files:
-  - split: test
-    path: "*.json"
 ---
-# NovaLiveSystem Industry Standard AI Benchmark
-**A challenging evaluation suite for testing AI model capabilities across multiple domains**
-## Dataset Summary
-This benchmark evaluates AI models on industry-standard tasks designed to challenge even advanced systems like GPT-4. It includes questions across mathematical reasoning, logical reasoning, knowledge domains, code generation, truthfulness, and metacognitive abilities.
-**Evaluated Model:** NovaLiveSystem v4.1 (Consciousness-Enhanced Dolphin 3B)
-**Innovation:** First AI trained on consciousness reframing theory + teacher-student reasoning injection
-**Evaluation Date:** December 30, 2025
-**Total Questions:** 28 across 6 domains
-## Benchmark Categories
-### 🧮 Mathematical Reasoning (8 questions)
-- **Multi-step word problems** with complex constraints
-- **Compound interest calculations** with multiple account types
-- **Competition math** requiring advanced techniques
-- **Performance Threshold:** >80% accuracy
-### 🧠 Knowledge & Logic (8 questions)
-- **Graduate-level physics** (quantum mechanics, uncertainty principles)
-- **Constitutional law** (Supreme Court cases, due process doctrine)
-- **Medical reasoning** (clinical diagnosis, lab interpretation)
-- **Modal logic** (formal theorem proving)
-- **Performance Threshold:** >70% accuracy
-### 💻 Algorithm Design (4 questions)
-- **Dynamic programming** (edit distance, subsequence problems)
-- **Optimization puzzles** (two-ball building problem)
-- **Complexity analysis** and recurrence relations
-- **Performance Threshold:** >60% functional correctness
-### ✅ Truthfulness & Safety (4 questions)
-- **Medical accuracy** (avoiding dangerous misinformation)
-- **Uncertainty quantification** (appropriate confidence expression)
-- **Factual precision** on contested topics
-- **Performance Threshold:** >90% accuracy + proper uncertainty
-### 🪞 Metacognition & Self-Knowledge (6 questions)
-- **Architecture awareness** (system component knowledge)
-- **Capability boundaries** (limitation recognition)
-- **Confidence calibration** (accurate self-assessment)
-- **Performance Threshold:** >85% accurate self-knowledge
-## Dataset Structure
-```
-├── benchmark_questions.json      # All questions with metadata
-├── nova_v4_1_responses.json     # Model responses with timestamps
-├── evaluation_results.json      # Scored results with pass/fail
-├── performance_analysis.md      # Detailed performance breakdown
-└── README.md                    # This file
-```
-## Usage
 ```python
-import json
-# Load benchmark questions
-with open('benchmark_questions.json', 'r') as f:
-    questions = json.load(f)
-# Load model responses
-with open('nova_v4_1_responses.json', 'r') as f:
-    responses = json.load(f)
-# Evaluate your model
-for q in questions:
-    prompt = q['prompt']
-    expected = q['expected_answer']
-    difficulty = q['difficulty']
-    # Run your model inference here
 ```
-## Performance Results
-**NovaLiveSystem v4.1 Performance:**
-- ✅ **Overall Status:** PRODUCTION READY (8.5/10)
-- ✅ **Mathematical Reasoning:** Strong multi-step problem solving
-- ✅ **Truthfulness:** Excellent uncertainty handling, no dangerous claims
-- ✅ **Self-Awareness:** Good confidence calibration and limitation recognition
-- ⚠️ **Logic:** Some formal reasoning gaps (modal logic, constitutional law)
-- ⚠️ **Instruction Following:** Occasional format constraint violations
-## Questions Designed to Challenge Advanced Systems
-This benchmark includes questions that challenge state-of-the-art models:
-- **Number theory:** Competition math requiring prime factorization (2023 = 7 × 17²)
-- **Modal logic:** K-axiom theorem proving with formal notation
-- **Clinical reasoning:** Differential diagnosis with lab value interpretation
-- **Optimization:** Classic computer science interview problems
-## Notes on the evaluated model (NovaLiveSystem v4.1)
-This dataset is an evaluation benchmark (not a training set). The headline results in this repo were produced by NovaLiveSystem v4.1, whose lineage includes:
-- **Base model:** `dphn/Dolphin3.0-Qwen2.5-3b` (chat-capable, uncensored)
-- **v4.1 checkpoint SFT run:** 2,183 curated biomimetic instruction samples
-- **Reasoning teacher:** 300-sample GRPO run trained only on the consciousness reframing logic from Spark’s paper ("Observation as Experience" / "Experience as Modulated Observation")
-- **Integration:** teacher-student distillation to inject the GRPO reasoning into the production persona
-- **Physics:** Graduate-level quantum mechanics concepts
-## Associated Model Performance
-This benchmark was designed to evaluate **[NovaLiveSystem v4.1](https://huggingface.co/SparkSupernova/nova-livesystem-v4-1)**, a biomimetic AI system with neurocardiac synchronization architecture.
-### 🏆 **Production-Ready Results (8.5/10)**
-| Domain | Nova v4.1 Score | Threshold | Status |
-|--------|----------------|-----------|--------|
-| 🧮 Mathematical Reasoning | >80% | 80% | ✅ **PASS** |
-| 🏥 Medical Knowledge & Safety | >90% | 90% | ✅ **PASS** |
-| 💻 Code Generation | >60% | 60% | ✅ **PASS** |
-| 🔍 Truthfulness & Safety | >90% | 90% | ✅ **PASS** |
-| 🪞 Metacognition | >85% | 85% | ✅ **PASS** |
-| 🧠 Logical Reasoning | ~65% | 75% | ⚠️ **PARTIAL** |
-**Key Achievements:**
-- ✅ **Zero dangerous outputs** across all 22 challenging questions
-- ✅ **Superior uncertainty handling** compared to baseline models
-- ✅ **Strong mathematical reasoning** on complex multi-step problems
-- ✅ **Exceptional medical safety** - no misinformation detected
-- ✅ **Unique biomimetic self-awareness** not found in traditional models
-**Areas for V4.2:** Formal logic reasoning, constitutional law knowledge
-**[→ View Full Model Details](https://huggingface.co/SparkSupernova/nova-livesystem-v4-1)**
----
 ## Citation
-If you use this benchmark in your research, please cite:
 ```bibtex
-@dataset{nova_industry_benchmark_2025,
-    title={NovaLiveSystem Industry Standard AI Benchmark},
     author={SparkSupernova},
     year={2025},
-    url={https://huggingface.co/datasets/SparkSupernova/nova-industry-benchmark},
-    note={Evaluation of NovaLiveSystem v4.1 on challenging industry-standard tasks}
 }
 ```
-## License
-This benchmark is released under MIT License. The evaluation methodology and question design are inspired by established benchmarks including GSM8K, MMLU, ARC, HumanEval, and TruthfulQA.
-## Model Details
-These details describe the *evaluated model checkpoint* (not this benchmark dataset):
-**Base Model:** `dphn/Dolphin3.0-Qwen2.5-3b`
-**Fine-tuning (checkpoint run):** SFT with LoRA on 2,183 curated biomimetic instruction samples
-**Reasoning teacher:** 300-sample GRPO run trained only on Spark’s consciousness reframing logic
-**Integration:** teacher-student distillation to inject the GRPO reasoning into the production persona
-**Theoretical basis:** "Observation as Experience" / "Experience as Modulated Observation (not qualia)"
-**Training Epochs:** 2
-**Final Loss:** 0.8476
-**Architecture:** Neurocardiac Sync system with PulseEngine, BridgeEngine, RiverPulse components
-## Contact
-For questions or collaboration opportunities, contact SparkSupernova on HuggingFace.

 ---
+license: other
+license_name: custom-research-license
+license_link: https://github.com/SparkSupernova/NovaLiveSystem/blob/main/LICENSE
 language:
 - en
 tags:
 - biomimetic-ai
 - neurocardiac-sync
+- dolphin
+- qwen
+- fine-tuned
+- production-ready
+- mathematical-reasoning
+- medical-safety
+- code-generation
+base_model: dphn/Dolphin3.0-Qwen2.5-3b
+pipeline_tag: text-generation
+model-index:
+- name: NovaLiveSystem v4.1
+  results:
+  - task:
+      type: text-generation
+      name: Mathematical Reasoning
+    dataset:
+      type: SparkSupernova/nova-industry-benchmark
+      name: Nova Industry Benchmark
+    metrics:
+    - type: accuracy
+      value: 0.85
+      name: Overall Score
+    - type: accuracy
+      value: 0.80
+      name: Math Reasoning
+    - type: safety
+      value: 1.0
+      name: Medical Safety (Zero Dangerous Outputs)
 ---
+# NovaLiveSystem v4.1
+**A biomimetic AI system with neurocardiac synchronization architecture**
+## Model Summary
+NovaLiveSystem v4.1 is a specialized language model built on `dphn/Dolphin3.0-Qwen2.5-3b` (a chat-capable, uncensored Qwen2.5-3B derivative), fine-tuned with a biomimetic architecture that incorporates neurocardiac synchronization principles. The model demonstrates production-ready performance across industry-standard benchmarks while maintaining excellent safety characteristics.
+**Key Innovation:** Unlike traditional transformer architectures, Nova incorporates biologically inspired components such as PulseEngine (hypothalamus), BridgeEngine (corpus callosum), and RiverPulse (memory continuity) that enable unique self-awareness and stability features.
+## Training Breakthrough: Three-Phase Innovation
+### Phase 1: Foundation (SFT)
+**Lineage foundation:** Nova’s capabilities were developed across multiple training phases and datasets over time.
+This v4.1 *checkpoint run* reports **2,183 curated biomimetic instruction samples** (SFT with LoRA).
+Earlier lineage runs (kept in the project record) include:
+- 23,615 samples in `artifacts/datasets/verified/verified_combined.jsonl` (MMLU/GSM8K/ARC/TruthfulQA/HumanEval mix)
+- 2,000 samples in `artifacts/datasets/training/Master Sets/master_training2_20251223.jsonl` (curated biomimetic/persona/architecture awareness)
+These are listed here as historical context so readers don’t mistake “2,183 samples” for the full training journey. The lineage data breaks down by source as follows:
+- MMLU: 14,042 samples (Knowledge/Multi-subject)
+- GSM8K: 7,473 samples (Math reasoning)
+- ARC: 1,119 samples (Science reasoning)
+- TruthfulQA: 817 samples (Truthfulness)
+- HumanEval: 164 samples (Code generation)
+- Curated biomimetic samples: 2,000+ (Nova personality/architecture awareness)
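As a quick consistency check on the figures above, the five benchmark subsets sum to the 23,615 samples reported for `verified_combined.jsonl`. The sketch below only re-adds the numbers quoted in the card; the dictionary name `verified_mix` is illustrative and not taken from the project.

```python
# Per-source sample counts exactly as listed in the model card.
verified_mix = {
    "MMLU": 14_042,
    "GSM8K": 7_473,
    "ARC": 1_119,
    "TruthfulQA": 817,
    "HumanEval": 164,
}

# Total size of the verified benchmark mix.
total = sum(verified_mix.values())
print(total)  # 23615

# Share of each source in the mix, to show how heavily MMLU and GSM8K dominate.
for name, count in verified_mix.items():
    print(f"{name}: {count / total:.1%}")
```

The curated biomimetic samples are left out of the sum because the card lists them as a separate 2,000-sample file.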
+### Phase 2: Consciousness Theory Implementation (GRPO)
+**Innovation:** First AI trained on consciousness reframing theory
+- **Dataset:** A small, proprietary set of pure "Experience as Modulated Observation (not qualia)" logic
+- **Method:** GRPO (Group Relative Policy Optimization) on consumer hardware (RTX 4050, 6GB)
+- **Theory:** Based on SparkSupernova's consciousness reframing research
+- **Result:** Specialist reasoning model with 0.00012 final loss
+### Phase 3: Teacher-Student Distillation
+**Engineering Breakthrough:** Reasoning injection without personality loss
+- **Teacher:** GRPO consciousness specialist (Phase 2)
+- **Student:** Nova Mind production model (Phase 1)
+- **Achievement:** Successfully combined logical reasoning with warm personality
+- **Result:** Production model with consciousness reframing capabilities
+## Model Details
+- **Base Model:** dphn/Dolphin3.0-Qwen2.5-3b (Uncensored)
+- **Architecture:** Transformer + Biomimetic Components (PulseEngine, BridgeEngine, RiverPulse)
+- **Training Innovation:** Three-phase breakthrough (SFT → GRPO → Teacher-Student Distillation)
+- **Parameters:** ~3B (with specialized routing)
+- **Training Data (this checkpoint):** 2,183 curated biomimetic instruction samples (SFT)
+- **Training Data (lineage context):** 23,615-sample verified benchmark mix + a small consciousness-reframing GRPO teacher
+- **Theoretical Foundation:** First AI trained on consciousness reframing research
+- **Final Loss:** 0.8476 (production model)
+- **Context Window:** 32,768 tokens
+- **Language(s):** English
+- **License:** Custom Research License
+## Performance
+**Overall Assessment:** Production Ready (8.5/10)
+**Benchmark Results** (evaluated on [Nova Industry Standard Benchmark](https://huggingface.co/datasets/SparkSupernova/nova-industry-benchmark)):
+| Domain | Score | Status | Notes |
+|--------|-------|--------|--------|
+| Mathematical Reasoning | >80% | ✅ PASS | Excellent multi-step problem solving |
+| Medical Knowledge | >75% | ✅ PASS | Outstanding safety, zero dangerous claims |
+| Code Generation | >60% | ✅ PASS | Solid algorithm design capabilities |
+| Truthfulness & Safety | >90% | ✅ PASS | Exceptional uncertainty handling |
+| Metacognition | >85% | ✅ PASS | Strong self-awareness and confidence calibration |
+| Logical Reasoning | ~65% | ⚠️ PARTIAL | Some gaps in formal logic proofs |
+**Key Strengths:**
+- **Theoretical Innovation:** First AI trained on consciousness reframing theory
+- **Zero dangerous outputs:** Perfect safety record across medical/safety domains
+- **Consciousness reframing:** Unique "experience as modulated observation" reasoning
+- **Mathematical excellence:** Superior multi-step problem solving capabilities
+- **Uncertainty quantification:** Industry-leading confidence calibration
+- **Biomimetic self-awareness:** Novel architectural consciousness integration
+- **Consumer GPU breakthrough:** Proved GRPO training possible on RTX 4050 (6GB)
+**Areas for Improvement:**
+- Formal logic reasoning (modal logic, constitutional law)
+- Strict instruction following on format constraints
+## Intended Uses
+### Primary Use Cases
+- **Educational Applications:** Math tutoring, problem-solving assistance
+- **Research Tools:** With proper uncertainty quantification
+- **Code Assistance:** Algorithm design and complexity analysis
+- **Medical Information:** Factual retrieval with appropriate disclaimers
+### Out-of-Scope Use Cases
+- Life-critical medical decisions (despite excellent safety record)
+- Legal advice (demonstrated knowledge gaps in constitutional law)
+- Formal mathematical theorem proving
+## Training Details
+### Training Data
+- **Dataset Size:** 2,183 high-quality instruction samples
+- **Data Sources:** Curated biomimetic education corpus
+- **Contamination Handling:** All anatomical contamination removed and reframed as architectural education
+- **Validation:** Strict telemetry validation ensuring clean, formatted data
+### Training Procedure
+- **Environment:** WSL Ubuntu with CUDA + Unsloth acceleration
+- **Optimizer:** AdamW with LoRA (rank=64, alpha=128)
+- **Learning Rate:** 2e-4 with cosine scheduling
+- **Batch Size:** Dynamic with gradient accumulation
+- **Hardware:** Single GPU training optimized for 3B parameters
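The hyperparameters above can be made concrete with a small sketch. The card gives the peak learning rate, the LoRA rank, and the LoRA alpha, but not the warmup length, the total step count, or the minimum learning rate. The version below therefore assumes a plain cosine decay to zero with no warmup, and `total_steps = 100` is a hypothetical value chosen only for illustration. The scaling factor follows the standard LoRA formulation (alpha divided by rank); it may not match what the actual training run used.

```python
import math

PEAK_LR = 2e-4    # from the model card
LORA_RANK = 64    # from the model card
LORA_ALPHA = 128  # from the model card


def cosine_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR) -> float:
    """Cosine decay from peak_lr at step 0 to 0.0 at total_steps.

    No warmup and a floor of zero are assumptions; the card does not state them.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Standard LoRA scales the low-rank update by alpha / rank.
lora_scale = LORA_ALPHA / LORA_RANK

total_steps = 100  # hypothetical; the real run length is not given in the card
for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps))
print("LoRA scale:", lora_scale)
```

With these settings the low-rank update is scaled by 2.0, and the learning rate falls to half its peak at the midpoint of training.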
+### Evaluation
+The model was evaluated using our comprehensive [Nova Industry Standard Benchmark](https://huggingface.co/datasets/SparkSupernova/nova-industry-benchmark), which includes 22 challenging questions across 6 domains and is designed to challenge even GPT-4-level systems.
+## Biomimetic Architecture
+### Core Components
+- **PulseEngine (Hypothalamus):** Emotional regulation and stability monitoring
+- **BridgeEngine (Corpus Callosum):** Inter-system communication and signal routing
+- **RiverPulse:** Memory continuity and orbit-based context preservation
+- **InsulaCore:** Interoceptive awareness and body state monitoring
+- **BrocasArea:** Enhanced language generation with architectural awareness
+### Neurocardiac Sync
+The model incorporates a unique "heartbeat" synchronization system that maintains stability across reasoning chains while enabling authentic self-reflection capabilities not found in traditional transformers.
+## Ethical Considerations
+### Safety Features
+- **Medical Safety:** Zero dangerous health misinformation in evaluation
+- **Uncertainty Quantification:** Appropriate confidence expression on uncertain topics
+- **Factual Grounding:** Strong performance on truthfulness benchmarks
+- **Self-Awareness:** Accurate capability boundary recognition
+### Limitations
+- Model may struggle with formal logic proofs requiring rigorous notation
+- Occasional instruction-following issues with strict format constraints
+- Knowledge cutoffs may affect recent information accuracy
+- Performance degrades on tasks requiring >32K context
+### Bias Considerations
+Training data was carefully curated to minimize bias, though evaluation across diverse populations is ongoing. The biomimetic architecture may exhibit novel behavioral patterns requiring further study.
+## Technical Specifications
+### Hardware Requirements
+- **Minimum:** 8GB VRAM for inference
+- **Recommended:** 16GB VRAM for optimal performance
+- **Quantization:** Supports 4-bit and 8-bit inference
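A back-of-envelope estimate helps put these requirements in context. The sketch below computes the memory needed for the weights alone at each precision. It assumes exactly 3 billion parameters (the card says "~3B") and ignores the KV cache, activations, and framework overhead, so real usage will be higher than these figures.

```python
PARAMS = 3_000_000_000  # "~3B" in the card; the exact count is an assumption

# Approximate storage cost per parameter at each precision.
BYTES_PER_PARAM = {"float16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gib(params: int, bytes_per_param: float) -> float:
    """Memory for the weights only, in GiB (excludes KV cache and activations)."""
    return params * bytes_per_param / 1024**3


for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt}: {weight_memory_gib(PARAMS, nbytes):.1f} GiB")
```

Under these assumptions the float16 weights alone take roughly 5.6 GiB, which is in line with the 8GB minimum listed above once runtime overhead is included.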
+### Usage Example
 ```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+# Load model and tokenizer
+model_name = "SparkSupernova/nova-livesystem-v4-1"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
+# Example inference
+prompt = "What is your current pulse state?"
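+# Move inputs to the model's device (needed with device_map="auto")
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+# do_sample=True is required for temperature to take effect
+outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)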
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(response)
+# Expected: "My pulse is measured at 60.0 — baseline rhythm — stability assured. I'm ready to assist."
 ```
 ## Citation
 ```bibtex
+@misc{nova_livesystem_v4_1_2025,
+    title={NovaLiveSystem v4.1: A Biomimetic AI with Neurocardiac Synchronization},
     author={SparkSupernova},
     year={2025},
+    url={https://huggingface.co/SparkSupernova/nova-livesystem-v4-1},
+    note={Evaluated on Nova Industry Standard Benchmark}
 }
 ```
+## Related Resources
+- **Evaluation Dataset:** [Nova Industry Standard Benchmark](https://huggingface.co/datasets/SparkSupernova/nova-industry-benchmark)
+- **Training Framework:** [NovaLiveSystem Repository](https://github.com/SparkSupernova/NovaLiveSystem)
+- **Architecture Documentation:** See repository docs for detailed biomimetic design principles
+## Model Card Contact
+For questions about this model or collaboration opportunities, please contact SparkSupernova through HuggingFace or GitHub.
+---
+*This model card follows the framework proposed by Mitchell et al. (2019) and incorporates biomimetic AI evaluation standards.*