	---
	title: Warbler CDA FractalStat RAG
	emoji: 🦜
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 6.0.2
	app_file: app.py
	pinned: false
	license: mit
	short_description: RAG system with 8D FractalStat and 100k documents
	tags:
	- rag
	- semantic-search
	- retrieval
	- fastapi
	- fractalstat
	thumbnail: >-
	  https://cdn-uploads.huggingface.co/production/uploads/68c705b6fc90bcc7a4f56721/8G2TJJT8enAFaBLJGTXka.png
	---

	# Warbler CDA - Cognitive Development Architecture RAG System

	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
	[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
	[![FastAPI](https://img.shields.io/badge/FastAPI-0.100+-green.svg)](https://fastapi.tiangolo.com/)
	[![Docker](https://img.shields.io/badge/Docker-ready-blue.svg)](https://docker.com)

	A production-ready RAG (Retrieval-Augmented Generation) system with FractalStat multi-dimensional addressing for intelligent document retrieval, semantic memory, and automatic data ingestion.

	## 🌟 Features

	### Core RAG System

	- Semantic Anchors: Persistent memory with provenance tracking
	- Hierarchical Summarization: Micro/macro distillation for efficient compression
	- Conflict Detection: Automatic detection and resolution of contradictory information
	- Memory Pooling: Performance-optimized object pooling for high-throughput scenarios

	### FractalStat Multi-Dimensional Addressing

	- 8-Dimensional Coordinates: Realm, Lineage, Adjacency, Horizon, Luminosity, Polarity, Dimensionality, Alignment
	- Hybrid Scoring: Combines semantic similarity with FractalStat resonance for superior retrieval
	- Entanglement Detection: Identifies relationships across dimensional space
	- Validated System: Comprehensive experiments (EXP-01 through EXP-10) validate uniqueness, efficiency, and narrative preservation

	### Production-Ready API

	- FastAPI Service: High-performance async API with concurrent query support
	- CLI Tools: Command-line interface for queries, ingestion, and management
	- HuggingFace Integration: Direct ingestion from HF datasets
	- Docker Support: Containerized deployment ready

	## 📚 Data Sources

	The Warbler system ingests curated datasets from HuggingFace, all provided under MIT or compatible licenses:

	### Original Warbler Packs

	- `warbler-pack-core` - Core narrative and reasoning patterns
	- `warbler-pack-wisdom-scrolls` - Philosophical and wisdom-based content
	- `warbler-pack-faction-politics` - Political and faction dynamics

	### HuggingFace Datasets

	- arXiv Papers (`nick007x/arxiv-papers`) - 2.5M+ scholarly papers covering scientific domains
	  - Due to space limits, only 100k of these documents are ingested for use on HuggingFace Spaces.
	- Prompt Engineering Report (`PromptSystematicReview/ThePromptReport`) - 83 comprehensive prompt documentation entries
	  - Currently unavailable for the same reason.
	- Generated Novels (`GOAT-AI/generated-novels`) - 20 narrative-rich novels for storytelling patterns
	  - Currently unavailable for the same reason.
	- Technical Manuals (`nlasso/anac-manuals-23`) - 52 procedural and operational documents
	  - Currently unavailable for the same reason.
	- ChatEnv Enterprise (`SustcZhangYX/ChatEnv`) - 112K+ software development conversations
	  - Currently unavailable for the same reason.
	- Portuguese Education (`Solshine/Portuguese_Language_Education_Texts`) - 21 multilingual educational texts
	  - Currently unavailable for the same reason.
	- Educational Stories (`MU-NLPC/Edustories-en`) - 1.5K+ case studies and learning narratives

	All datasets are provided under MIT or compatible licenses. For complete attribution, see the HuggingFace Hub pages listed above.
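
The 100k cap on arXiv papers noted above can be applied while streaming, so the full dataset never has to sit in memory. The following is a minimal sketch under that assumption; `cap_documents` and `fake_stream` are hypothetical names used for illustration and are not part of the Warbler package.

```python
from itertools import islice
from typing import Iterable, Iterator


def cap_documents(docs: Iterable[dict], limit: int = 100_000) -> Iterator[dict]:
    """Yield at most `limit` documents from a (possibly very large) stream."""
    return islice(docs, limit)


def fake_stream(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for a streamed dataset of n documents."""
    for i in range(n):
        yield {"id": f"doc-{i}", "text": f"abstract {i}"}


# Only the first 100 of 250 documents are materialized.
capped = list(cap_documents(fake_stream(250), limit=100))
print(len(capped), capped[-1]["id"])
```

With the `datasets` library, an iterable returned by `load_dataset(..., streaming=True)` could be passed in place of `fake_stream`, though how Warbler itself applies the cap is not shown here.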

	## 📦 Installation

	### From Source (Current Method)

	```bash
	git clone https://github.com/tiny-walnut-games/the-seed.git
	cd the-seed/warbler-cda-package
	pip install -e .
	```

	### Optional Dependencies

	```bash
	# OpenAI embeddings integration
	pip install openai

	# Development tools
	pip install pytest pytest-cov
	```

	## 🚀 Quick Start

	### Option 1: Direct Python (Easiest)

	```bash
	cd warbler-cda-package

	# Windows (PowerShell): start the API with automatic pack loading
	./run_api.ps1

	# Linux/macOS:
	python start_server.py
	```

	The API automatically loads all Warbler packs on startup and serves them at http://localhost:8000.

	### Option 2: Docker Compose

	```bash
	cd warbler-cda-package
	docker-compose up --build
	```

	### Option 3: Kubernetes

	```bash
	cd warbler-cda-package/k8s
	./demo-docker-k8s.sh # Full auto-deploy
	```

	## 📡 API Usage Examples

	### Using the REST API

	```bash
	# Start the API first: ./run_api.ps1
	# Then test with:

	# Health check
	curl http://localhost:8000/health

	# Semantic search (plain English queries)
	curl -X POST http://localhost:8000/query \
	  -H "Content-Type: application/json" \
	  -d '{
	    "query_id": "semantic1",
	    "semantic_query": "dancing under the moon",
	    "max_results": 5
	  }'

	# FractalStat hybrid search (technical/science with dimensional awareness)
	curl -X POST http://localhost:8000/query \
	  -H "Content-Type: application/json" \
	  -d '{
	    "query_id": "hybrid1",
	    "semantic_query": "interplanetary approach maneuvers",
	    "fractalstat_hybrid": true,
	    "max_results": 5
	  }'

	# Get metrics
	curl http://localhost:8000/metrics
	```

	### Understanding Search Modes

	The system provides two search approaches with intelligent fallback:

	#### Semantic Search (Default)
	- Use for: Plain English queries, casual search, general questions
	- Behavior: Pure semantic similarity matching
	- Examples: "How does gravity work?", "tell me about dancing", "operating a spaceship"
	- Results: Always returns matches when available, best for natural language

	#### FractalStat Hybrid Search
	- Use for: Technical/scientific queries, specific terminology, multi-dimensional search
	- Behavior: Combines semantic similarity with 8D FractalStat resonance
	- Examples: "rotation dynamics of Saturn's moons", "quantum chromodynamics", "interplanetary approach maneuvers"
	- Results: Superior for technical content, may filter out general results
	- Fallback: Automatically switches to semantic search if hybrid returns no results

	Pro Tip: When hybrid search returns no results (no candidate reaches the 0.3 score threshold), the system automatically falls back to semantic search, so the query still returns the closest semantic matches available.
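
The fallback behavior can be sketched as follows. This is an illustration only, assuming the 0.3 threshold is compared against each result's `relevance_score`; `search_with_fallback`, `fake_hybrid`, and `fake_semantic` are hypothetical names, and the real service may decide differently.

```python
from typing import Callable

Result = dict  # each result is assumed to carry a "relevance_score" key

HYBRID_THRESHOLD = 0.3  # value quoted in the tip above


def search_with_fallback(
    query: str,
    hybrid_search: Callable[[str], list[Result]],
    semantic_search: Callable[[str], list[Result]],
    threshold: float = HYBRID_THRESHOLD,
) -> tuple[str, list[Result]]:
    """Try hybrid search first; use semantic search if nothing clears the threshold."""
    hits = [r for r in hybrid_search(query) if r["relevance_score"] >= threshold]
    if hits:
        return "hybrid", hits
    return "semantic", semantic_search(query)


# Hypothetical stand-ins for the two search back ends.
def fake_hybrid(query: str) -> list[Result]:
    return [{"content": "weak match", "relevance_score": 0.12}]


def fake_semantic(query: str) -> list[Result]:
    return [{"content": "general match", "relevance_score": 0.55}]


mode, results = search_with_fallback("tell me about dancing", fake_hybrid, fake_semantic)
print(mode, len(results))
```

Returning the mode alongside the results lets a caller see which path produced the answer.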

	### Using Python Programmatically

	```python
	import requests

	# Health check
	response = requests.get("http://localhost:8000/health")
	print(f"API Status: {response.json()['status']}")

	# Query
	query_data = {
	    "query_id": "python_test",
	    "semantic_query": "rotation dynamics of Saturn's moons",
	    "max_results": 5,
	    "fractalstat_hybrid": True,
	}

	results = requests.post("http://localhost:8000/query", json=query_data).json()
	print(f"Found {len(results['results'])} results")

	# Show top result
	if results['results']:
	    top_result = results['results'][0]
	    print(f"Top score: {top_result['relevance_score']:.3f}")
	    print(f"Content: {top_result['content'][:100]}...")
	```

	### FractalStat Hybrid Scoring

	```python
	from warbler_cda import FractalStatRAGBridge, RetrievalQuery, RetrievalMode
	# Assumption: RetrievalAPI is exported at the package top level like the names above
	from warbler_cda import RetrievalAPI

	# semantic_anchors and embedding_provider must already be constructed
	# before this point; their setup is not shown here.

	# Enable FractalStat hybrid scoring
	fractalstat_bridge = FractalStatRAGBridge()
	api = RetrievalAPI(
	    semantic_anchors=semantic_anchors,
	    embedding_provider=embedding_provider,
	    fractalstat_bridge=fractalstat_bridge,
	    config={"enable_fractalstat_hybrid": True},
	)

	# Query with hybrid scoring
	query = RetrievalQuery(
	    query_id="hybrid_query_1",
	    mode=RetrievalMode.SEMANTIC_SIMILARITY,
	    semantic_query="Find wisdom about resilience",
	    fractalstat_hybrid=True,
	    weight_semantic=0.6,
	    weight_fractalstat=0.4,
	)

	assembly = api.retrieve_context(query)
	print(f"Found {len(assembly.results)} results with quality {assembly.assembly_quality:.3f}")
	```

	### Running the API Service

	```bash
	# Start the FastAPI service
	uvicorn warbler_cda.api.service:app --host 0.0.0.0 --port 8000

	# Or use the CLI
	warbler-api --port 8000
	```

	### Using the CLI

	```bash
	# Query the API
	warbler-cli query --query-id q1 --semantic "wisdom about courage" --max-results 10

	# Enable hybrid scoring
	warbler-cli query --query-id q2 --semantic "narrative patterns" --hybrid

	# Bulk concurrent queries
	warbler-cli bulk --num-queries 10 --concurrency 5 --hybrid

	# Check metrics
	warbler-cli metrics
	```
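
For readers who want the effect of `bulk --num-queries 10 --concurrency 5 --hybrid` from Python rather than the CLI, the sketch below fans out queries over a thread pool. It is not how `warbler-cli` is implemented; `build_payload`, `run_bulk`, and `fake_send` are hypothetical helpers, and the payload fields are copied from the REST examples above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def build_payload(i: int, hybrid: bool) -> dict:
    """Build a /query body using the field names from the REST examples."""
    return {
        "query_id": f"bulk_{i}",
        "semantic_query": "narrative patterns",
        "max_results": 5,
        "fractalstat_hybrid": hybrid,
    }


def run_bulk(
    send: Callable[[dict], dict],
    num_queries: int = 10,
    concurrency: int = 5,
    hybrid: bool = True,
) -> list[dict]:
    """Send num_queries payloads with at most `concurrency` in flight at once."""
    payloads = [build_payload(i, hybrid) for i in range(num_queries)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(send, payloads))


def fake_send(payload: dict) -> dict:
    """Hypothetical stand-in for an HTTP POST to /query."""
    return {"query_id": payload["query_id"], "results": []}


responses = run_bulk(fake_send)
print(len(responses), responses[0]["query_id"])
```

Against a running server, `fake_send` could be swapped for a function that calls `requests.post("http://localhost:8000/query", json=payload, timeout=30).json()`.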

	## 📊 FractalStat Experiments

	The system includes validated experiments demonstrating:

	- EXP-01: Address uniqueness (0% collision rate across 10K+ entities)
	- EXP-02: Retrieval efficiency (sub-millisecond at 100K scale)
	- EXP-03: Dimension necessity (all 7 dimensions required)
	- EXP-10: Narrative preservation under concurrent load

	```python
	from warbler_cda import run_all_experiments

	# Run validation experiments
	results = run_all_experiments(
	    exp01_samples=1000,
	    exp01_iterations=10,
	    exp02_queries=1000,
	    exp03_samples=1000,
	)

	print(f"EXP-01 Success: {results['EXP-01']['success']}")
	print(f"EXP-02 Success: {results['EXP-02']['success']}")
	print(f"EXP-03 Success: {results['EXP-03']['success']}")
	```
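
To make the EXP-01 metric concrete, the sketch below measures a collision rate over generated entities. The addressing scheme is an assumption made for illustration (SHA-256 over canonical JSON); the README does not say how FractalStat derives addresses, and `address_of` and `collision_rate` are hypothetical names.

```python
import hashlib
import json


def address_of(coords: dict) -> str:
    """Hypothetical addressing: SHA-256 over a canonical JSON encoding."""
    canonical = json.dumps(coords, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def collision_rate(entities: list[dict]) -> float:
    """Fraction of entities whose address duplicates another entity's address."""
    if not entities:
        return 0.0
    unique = len({address_of(e) for e in entities})
    return (len(entities) - unique) / len(entities)


# 10,000 entities that differ in at least one coordinate (lineage).
entities = [
    {"realm": "knowledge", "lineage": i, "adjacency": (i % 10) / 10}
    for i in range(10_000)
]
print(collision_rate(entities))
```

Sorting keys before hashing means two entities with the same coordinates always map to the same address regardless of dictionary ordering, so any reported collision reflects identical coordinates rather than encoding noise.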

	## 🎯 Use Cases

	### 1. Intelligent Document Retrieval

	```python
	# Add documents from various sources
	for doc in documents:
	    api.add_document(
	        doc_id=doc["id"],
	        content=doc["text"],
	        metadata={
	            "realm_type": "knowledge",
	            "realm_label": "technical_docs",
	            "lifecycle_stage": "emergence",
	        },
	    )

	# Retrieve with context awareness
	results = api.query_semantic_anchors("How to optimize performance?")
	```

	### 2. Narrative Coherence Analysis

	```python
	from warbler_cda import ConflictDetector

	conflict_detector = ConflictDetector(embedding_provider=embedding_provider)

	# Process statements
	statements = [
	    {"id": "s1", "text": "The system is fast"},
	    {"id": "s2", "text": "The system is slow"},
	]

	report = conflict_detector.process_statements(statements)
	print(f"Conflicts detected: {report['conflict_summary']}")
	```

	### 3. HuggingFace Dataset Ingestion

	```python
	from warbler_cda.utils import HFWarblerIngestor

	ingestor = HFWarblerIngestor()

	# Transform HF dataset to Warbler format
	docs = ingestor.transform_npc_dialogue("amaydle/npc-dialogue")

	# Create pack
	pack_path = ingestor.create_warbler_pack(docs, "warbler-pack-npc-dialogue")
	```

	## 🏗️ Architecture

	```none
	warbler_cda/
	├── retrieval_api.py              # Main RAG API
	├── semantic_anchors.py           # Semantic memory system
	├── anchor_data_classes.py        # Core data structures
	├── anchor_memory_pool.py         # Performance optimization
	├── summarization_ladder.py       # Hierarchical compression
	├── conflict_detector.py          # Conflict detection
	├── castle_graph.py               # Concept extraction
	├── melt_layer.py                 # Memory consolidation
	├── evaporation.py                # Content distillation
	├── fractalstat_rag_bridge.py     # FractalStat hybrid scoring
	├── fractalstat_entity.py         # FractalStat entity system
	├── fractalstat_experiments.py    # Validation experiments
	├── embeddings/                   # Embedding providers
	│   ├── base_provider.py
	│   ├── local_provider.py
	│   ├── openai_provider.py
	│   └── factory.py
	├── api/                          # Production API
	│   ├── service.py                # FastAPI service
	│   └── cli.py                    # CLI interface
	└── utils/                        # Utilities
	    ├── load_warbler_packs.py
	    └── hf_warbler_ingest.py
	```

	## 🔬 Technical Details

	### FractalStat Dimensions

	1. Realm: Domain classification (type + label)
	2. Lineage: Generation/version number
	3. Adjacency: Graph connectivity (0.0-1.0)
	4. Horizon: Lifecycle stage (logline, outline, scene, panel)
	5. Luminosity: Clarity/activity level (0.0-1.0)
	6. Polarity: Resonance/tension (0.0-1.0)
	7. Dimensionality: Complexity/thread count (1-7)
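
The documented ranges can be captured in a small validated record. This is a sketch, not the package's own entity class: `FractalStatCoordinates` and its field names are hypothetical, and Alignment (named in the feature list) is left out because its type and range are not stated in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FractalStatCoordinates:
    """Hypothetical container mirroring the dimension list above."""

    realm_type: str       # Realm: domain classification (type)
    realm_label: str      # Realm: domain classification (label)
    lineage: int          # Generation/version number
    adjacency: float      # Graph connectivity, 0.0-1.0
    horizon: str          # Lifecycle stage
    luminosity: float     # Clarity/activity level, 0.0-1.0
    polarity: float       # Resonance/tension, 0.0-1.0
    dimensionality: int   # Complexity/thread count, 1-7

    def __post_init__(self) -> None:
        # Enforce only the ranges that the list above states explicitly.
        for name in ("adjacency", "luminosity", "polarity"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
        if not 1 <= self.dimensionality <= 7:
            raise ValueError(
                f"dimensionality must be in 1-7, got {self.dimensionality}"
            )


coords = FractalStatCoordinates(
    realm_type="knowledge",
    realm_label="technical_docs",
    lineage=1,
    adjacency=0.4,
    horizon="scene",
    luminosity=0.8,
    polarity=0.5,
    dimensionality=3,
)
print(coords.realm_label, coords.dimensionality)
```

Making the record frozen keeps a coordinate set hashable and immutable once validated, which suits its use as an address.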

	### Hybrid Scoring Formula

	```math
	hybrid_score = (weight_semantic × semantic_similarity) + (weight_fractalstat × fractalstat_resonance)
	```

	Where:

	- `semantic_similarity`: Cosine similarity of embeddings
	- `fractalstat_resonance`: Multi-dimensional alignment score
	- Default weights: 60% semantic, 40% FractalStat
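
As a worked instance of the formula with the default weights, the sketch below computes cosine similarity for two toy vectors and blends it with a resonance value. The resonance of 0.5 is a placeholder, since how `fractalstat_resonance` is computed is not described here; the function names are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_score(
    semantic_similarity: float,
    fractalstat_resonance: float,
    weight_semantic: float = 0.6,
    weight_fractalstat: float = 0.4,
) -> float:
    """Weighted sum from the formula above, using the default 60/40 split."""
    return (
        weight_semantic * semantic_similarity
        + weight_fractalstat * fractalstat_resonance
    )


query_vec = [1.0, 0.0]
doc_vec = [1.0, 1.0]
sim = cosine_similarity(query_vec, doc_vec)  # 1/sqrt(2), about 0.707
score = hybrid_score(sim, fractalstat_resonance=0.5)  # 0.5 is a placeholder value
print(f"{score:.3f}")  # prints 0.624
```

Here 0.6 × 0.707 + 0.4 × 0.5 ≈ 0.624, so a document with only moderate semantic similarity can still rank well when its resonance is high, and vice versa.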

	## 📚 Documentation

	- [API Reference](docs/api.md)
	- [FractalStat Guide](docs/fractalstat.md)
	- [Experiments](docs/experiments.md)
	- [Deployment](docs/deployment.md)

	## 🤝 Contributing

	Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

	## 📄 License

	MIT License - see [LICENSE](LICENSE) for details.

	## 🙏 Acknowledgments

	- Built on research from The Seed project
	- FractalStat addressing system inspired by multi-dimensional data structures
	- Semantic anchoring based on cognitive architecture principles

	## 📞 Contact

	- Project: [The Seed](https://github.com/tiny-walnut-games/the-seed)
	- Issues: [GitHub Issues](https://github.com/tiny-walnut-games/the-seed/issues)
	- Discussions: [GitHub Discussions](https://github.com/tiny-walnut-games/the-seed/discussions)

	---

	### Made with ❤️ by Tiny Walnut Games