garywelz committed on Jan 17

Commit 6d2ebfc (verified) · 1 Parent(s): 91952b4

Upload 2 files

Files changed (1)

README.md +102 -1

@@ -65,6 +65,29 @@ Just as a microscope enables observation of the microscopic world, CopernicusAI
 - Automatic citation extraction and formatting
 - Source validation and authenticity verification
 ### 🤖 Advanced LLM Integration
 **Multi-Model Architecture:**
@@ -254,19 +277,46 @@ A centralized **metadata repository** (not a file archive) that provides:
 ## 📈 Platform Capabilities
-### Research Coverage
 - **250+ million research papers** accessible through integrated APIs
 - **8+ academic databases** integrated with parallel search
 - **Minimum 3 sources** required per episode for quality assurance
 - **Multi-paper analysis** for comprehensive coverage
 ### Platform Features
 - **Subscriber-driven content generation** - Users prompt and create podcasts
 - **RSS feed distribution** to major podcast platforms
 - **Public and private podcast options** - Share discoveries or keep them private
 ---
 ## 🔗 Live Platform & Resources
 ### Production Deployment
@@ -537,9 +587,60 @@ This platform is designed to support grant applications to:
 ## How to Cite This Work
 Welz, G. (2024–2025). *CopernicusAI: AI-Generated Audio Briefings as a Research Interface*.
 Hugging Face Spaces. https://huggingface.co/spaces/garywelz/copernicusai
 ---
 ## 📄 License & Attribution

 - Automatic citation extraction and formatting
 - Source validation and authenticity verification
+## 🔬 Methodology & Quality Assurance
+### Multi-Source Validation Process
+1. **Source Discovery:** Parallel search across 8+ academic databases (PubMed, arXiv, NASA ADS, Zenodo, bioRxiv, CORE, Google Scholar, News API)
+2. **Quality Scoring:** Relevance ranking using citation counts, journal impact factors, recency, and peer-review status
+3. **Minimum Requirements:** At least 3 research sources required per episode for quality assurance
+4. **Citation Extraction:** Automated extraction with manual verification and formatting
+5. **Content Generation:** LLM synthesis (Google Gemini 3, GPT-4, Claude 3) with source attribution at each claim
+6. **Validation:** Manual review of sample episodes by domain experts (ongoing)
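Steps 2 and 3 of the process above can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: the `Source` fields, the weights, the normalization caps, and the `score_source` / `select_sources` helpers are all hypothetical. Only the four ranking signals and the three-source minimum come from the list above.

```python
from dataclasses import dataclass

MIN_SOURCES = 3  # stated in the README: at least 3 sources per episode


@dataclass
class Source:
    title: str
    citations: int
    impact_factor: float
    year: int
    peer_reviewed: bool


def score_source(s: Source, current_year: int = 2025) -> float:
    """Blend the four signals into a score in [0, 1]. Weights and caps are assumed."""
    citation_part = min(s.citations / 100.0, 1.0)      # capped at 100 citations
    impact_part = min(s.impact_factor / 10.0, 1.0)     # capped at impact factor 10
    age = max(current_year - s.year, 0)
    recency_part = 1.0 / (1.0 + age)                   # 1.0 for this year, decaying with age
    review_part = 1.0 if s.peer_reviewed else 0.0
    return (0.4 * citation_part + 0.2 * impact_part
            + 0.2 * recency_part + 0.2 * review_part)


def select_sources(candidates: list[Source], top_k: int = 5) -> list[Source]:
    """Enforce the minimum-source rule, then return the best-scoring sources."""
    if len(candidates) < MIN_SOURCES:
        raise ValueError(
            f"need at least {MIN_SOURCES} sources, got {len(candidates)}"
        )
    return sorted(candidates, key=score_source, reverse=True)[:top_k]
```

A fixed linear blend is the simplest choice here; it also makes the citation-count bias visible, since the citation weight is an explicit number that can be lowered.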
+### Paradigm Shift Detection
+- **Citation Network Analysis:** Identifies highly cited recent papers that may represent paradigm shifts
+- **Interdisciplinary Connection Analysis:** Detects connections across domains that may indicate emerging fields
+- **Expert Review:** Validation of identified paradigm shifts against domain expert knowledge
+- **Temporal Analysis:** Tracks citation patterns over time to identify emerging trends
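The temporal analysis described above could, under simple assumptions, compare a paper's recent citation rate with its earlier rate. The sketch below is illustrative only: the `citation_growth` and `flag_emerging` helpers, the two-year window, and the 2.0 threshold are hypothetical and are not taken from the platform.

```python
def citation_growth(yearly: dict[int, int], recent_years: int = 2) -> float:
    """Ratio of mean citations in the last `recent_years` to the mean before that."""
    years = sorted(yearly)
    if len(years) <= recent_years:
        return 0.0  # not enough history to compare two periods
    recent = [yearly[y] for y in years[-recent_years:]]
    earlier = [yearly[y] for y in years[:-recent_years]]
    recent_mean = sum(recent) / len(recent)
    earlier_mean = sum(earlier) / len(earlier)
    if earlier_mean == 0:
        return float("inf") if recent_mean > 0 else 0.0
    return recent_mean / earlier_mean


def flag_emerging(papers: dict[str, dict[int, int]], threshold: float = 2.0) -> list[str]:
    """Return IDs of papers whose citation growth meets the (assumed) threshold."""
    return sorted(
        paper_id
        for paper_id, yearly in papers.items()
        if citation_growth(yearly) >= threshold
    )


# Hypothetical per-year citation counts for two papers.
papers = {
    "paper-a": {2021: 5, 2022: 5, 2023: 20, 2024: 30},
    "paper-b": {2021: 10, 2022: 10, 2023: 10, 2024: 10},
}
print(flag_emerging(papers))
```

A flag from a heuristic like this would only be a candidate; the expert-review step listed above remains the actual validation.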
+### Quality Metrics
+- **Source Quality:** Average citation count, journal impact factors, peer-review status
+- **Coverage:** Number of sources per topic, cross-database coverage, temporal distribution
+- **Accuracy:** Manual validation of sample episodes by domain experts (ongoing process)
+- **Reproducibility:** Full citation tracking enables verification of all claims
+- **Transparency:** All source papers accessible via Research Tools Dashboard and database tables
 ### 🤖 Advanced LLM Integration
 **Multi-Model Architecture:**
 ## 📈 Platform Capabilities
+### Research Coverage (As of January 2025)
 - **250+ million research papers** accessible through integrated APIs
 - **8+ academic databases** integrated with parallel search
+- **23,246+ papers indexed** with full metadata and vector embeddings (dynamically growing - see [Public Project Interface](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/copernicusai-public-reviewer.html) for current count)
 - **Minimum 3 sources** required per episode for quality assurance
 - **Multi-paper analysis** for comprehensive coverage
 ### Platform Features
 - **Subscriber-driven content generation** - Users prompt and create podcasts
+- **64+ podcast episodes** generated across 5 disciplines (as of January 2025)
 - **RSS feed distribution** to major podcast platforms
 - **Public and private podcast options** - Share discoveries or keep them private
+- **Knowledge Engine Dashboard:** Operational since December 2025
 ---
+## ⚠️ Limitations & Future Directions
+### Current Limitations
+- **Discipline Coverage:** Currently strongest in mathematics (23,246+ papers indexed); expansion to other disciplines in progress (see Ramp-Up Plan)
+- **Process Validation:** Flowcharts are LLM-generated and still need expert validation for domain-specific accuracy (validation is ongoing)
+- **Source Linking:** Not all processes are yet linked to specific research papers (work in progress per Quality Standards)
+- **Scale:** The current process database (~313 processes) is a proof of concept; the target is 1,000+ processes
+- **Podcast Generation:** Requires manual review for accuracy; fully automated quality assurance in development
+- **Video Production:** Advanced video features (Phase 2+) are planned but not yet implemented
+### Future Work
+- **Expansion:** Scale to 200,000+ papers across all disciplines (see RAMP_UP_PLAN.md)
+- **Validation:** Implement systematic peer review process for process flowcharts
+- **Integration:** Enhanced cross-linking between processes, papers, and podcasts
+- **Automation:** Automated source paper suggestion and linking using vector search
+- **Quality Assurance:** Systematic validation framework for flowchart accuracy and podcast content
+- **Video Features:** Implement advanced video production capabilities (Phase 2+)
+- **Multi-modal Integration:** Enhanced integration of visual content, animations, and interactive elements
+### Known Areas for Improvement
+- **Bias in Source Selection:** The current system may favor highly cited papers; work is under way to balance this with recent, emerging research
+- **Domain Expertise:** Some domains are better represented than others; coverage is being actively expanded
+- **Validation Coverage:** Not all content has been validated by domain experts; systematic validation is in progress
 ## 🔗 Live Platform & Resources
 ### Production Deployment
 ## How to Cite This Work
+### BibTeX Format
+```bibtex
+@article{welz2025copernicusai,
+  title={CopernicusAI: AI-Generated Audio Briefings as a Research Interface},
+  author={Welz, Gary},
+  journal={Nature Communications},
+  year={2025},
+  note={Submitted; preprint available upon publication},
+  url={https://huggingface.co/spaces/garywelz/copernicusai}
+}
+```
+### Standard Citation Format
 Welz, G. (2024–2025). *CopernicusAI: AI-Generated Audio Briefings as a Research Interface*.
 Hugging Face Spaces. https://huggingface.co/spaces/garywelz/copernicusai
+**Note:** When published, this citation will be updated with DOI and publication details from Nature Communications.
+---
+## 📊 Data Availability
+**Research Data:**
+- **Research Paper Metadata:** Publicly accessible via the [Research Tools Dashboard](https://copernicus-frontend-phzp4ie2sq-uc.a.run.app/knowledge-engine) and [Research Paper Database Table](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/papers-database-table.html). Current statistics are dynamically updated at the [Public Project Interface](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/copernicusai-public-reviewer.html).
+- **Podcast Episodes:** All generated podcast episodes, transcripts, and metadata are accessible via the [Podcast Database](https://www.copernicusai.fyi/episodes) and [RSS Feed](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/feeds/copernicus-mvp-rss-feed.xml).
+- **Process Flowcharts:** Process flowcharts across all disciplines are publicly available in Google Cloud Storage:
+  - [Biology Processes Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/biology-processes-database/biology-database-table.html)
+  - [Chemistry Processes Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/chemistry-processes-database/chemistry-database-table.html)
+  - [Physics Processes Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/physics-processes-database/physics-database-table.html)
+  - [Mathematics Processes Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/mathematics-processes-database/mathematics-database-table.html)
+  - [Computer Science Processes Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/computer-science-processes-database/computer-science-database-table.html)
+  - [GLMP Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/glmp-database-table.html)
+- **Science Videos:** Video database accessible via [Science Video Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/videos-database-table.html) and [Live Demo](https://scienceviddb-web-204731194849.us-central1.run.app/).
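As one way to consume the RSS feed listed above, the sketch below parses a feed with Python's standard library. It assumes the feed follows the usual RSS 2.0 layout (`channel/item` elements with `title` and an optional `enclosure`); the real feed's fields have not been verified here, and `SAMPLE_FEED` and `list_episodes` are hypothetical stand-ins.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for a feed document, so the example runs offline.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>Episode One</title>
      <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Episode Two</title>
    </item>
  </channel>
</rss>"""


def list_episodes(feed_xml: str) -> list[dict]:
    """Return title and audio URL (None if absent) for each item in the feed."""
    root = ET.fromstring(feed_xml)
    episodes = []
    for item in root.findall("./channel/item"):
        enclosure = item.find("enclosure")
        episodes.append({
            "title": item.findtext("title", default=""),
            "audio_url": enclosure.get("url") if enclosure is not None else None,
        })
    return episodes


for episode in list_episodes(SAMPLE_FEED):
    print(episode["title"], episode["audio_url"])
```

To read the live feed, the XML text could be fetched first (for example with `urllib.request.urlopen`) and passed to `list_episodes` in place of `SAMPLE_FEED`.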
+**Source Code & Methodology:**
+- **Methodology:** Fully documented in the Programming Framework paper (see [Programming Framework Space](https://huggingface.co/spaces/garywelz/programming_framework) for methodology details).
+- **Process Generation:** LLM-powered extraction using Google Gemini 2.0 Flash, documented in the Programming Framework.
+- **Database Schemas:** Documented in project documentation files (SCHEMA_EXTENSIBILITY_GUIDE.md, UNIFIED_METADATA_SCHEMA_MASTER.md).
+- **API Documentation:** RESTful API endpoints documented in the API Documentation section above.
+**Access:**
+- **Public Access:** All process databases, database tables, public interfaces, and podcast episodes are publicly accessible (no authentication required).
+- **Research Tools Dashboard:** [https://copernicus-frontend-phzp4ie2sq-uc.a.run.app/knowledge-engine](https://copernicus-frontend-phzp4ie2sq-uc.a.run.app/knowledge-engine) - Interactive knowledge graph, vector search, and RAG queries (public access).
+- **Public Project Interface:** [https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/copernicusai-public-reviewer.html](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/copernicusai-public-reviewer.html) - Comprehensive access to all public components with dynamically updated statistics.
+- **API Endpoints:** [https://copernicus-podcast-api-phzp4ie2sq-uc.a.run.app](https://copernicus-podcast-api-phzp4ie2sq-uc.a.run.app) - RESTful API with full documentation (see API Documentation section above).
+**Reproducibility:**
+- Process flowcharts include source citations linking to research papers where available (source linking is still in progress; see Current Limitations).
+- Podcast generation methodology is fully documented and reproducible.
+- Database structures are standardized and documented.
+- Research synthesis workflow is transparent and can be replicated.
+- All components are publicly accessible for verification and reuse.
 ---
 ## 📄 License & Attribution