garywelz committed on
Commit 6d2ebfc · verified · 1 Parent(s): 91952b4

Upload 2 files

Files changed (1)
  1. README.md +102 -1
README.md CHANGED
@@ -65,6 +65,29 @@ Just as a microscope enables observation of the microscopic world, CopernicusAI
65
  - Automatic citation extraction and formatting
66
  - Source validation and authenticity verification
67
 
68
  ### 🤖 Advanced LLM Integration
69
 
70
  **Multi-Model Architecture:**
@@ -254,19 +277,46 @@ A centralized **metadata repository** (not a file archive) that provides:
254
 
255
  ## 📈 Platform Capabilities
256
 
257
- ### Research Coverage
258
  - **250+ million research papers** accessible through integrated APIs
259
  - **8+ academic databases** integrated with parallel search
260
  - **Minimum 3 sources** required per episode for quality assurance
261
  - **Multi-paper analysis** for comprehensive coverage
262
 
263
  ### Platform Features
264
  - **Subscriber-driven content generation** - Users prompt and create podcasts
265
  - **RSS feed distribution** to major podcast platforms
266
  - **Public and private podcast options** - Share discoveries or keep them private
267
 
268
  ---
269
 
270
  ## 🔗 Live Platform & Resources
271
 
272
  ### Production Deployment
@@ -537,9 +587,60 @@ This platform is designed to support grant applications to:
537
 
538
  ## How to Cite This Work
539
 
540
  Welz, G. (2024–2025). *CopernicusAI: AI-Generated Audio Briefings as a Research Interface*.
541
  Hugging Face Spaces. https://huggingface.co/spaces/garywelz/copernicusai
542
 
543
  ---
544
 
545
  ## 📄 License & Attribution
 
65
  - Automatic citation extraction and formatting
66
  - Source validation and authenticity verification
67
 
68
+ ## 🔬 Methodology & Quality Assurance
69
+
70
+ ### Multi-Source Validation Process
71
+ 1. **Source Discovery:** Parallel search across 8+ academic databases (PubMed, arXiv, NASA ADS, Zenodo, bioRxiv, CORE, Google Scholar, News API)
72
+ 2. **Quality Scoring:** Relevance ranking using citation counts, journal impact factors, recency, and peer-review status
73
+ 3. **Minimum Requirements:** At least 3 research sources required per episode for quality assurance
74
+ 4. **Citation Extraction:** Automated extraction with manual verification and formatting
75
+ 5. **Content Generation:** LLM synthesis (Google Gemini 3, GPT-4, Claude 3) with source attribution at each claim
76
+ 6. **Validation:** Manual review of sample episodes by domain experts (ongoing)
77
+
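The quality-scoring and minimum-source steps listed above can be sketched roughly as follows. The README names the ranking factors (citation counts, journal impact factors, recency, peer-review status) but not how they are combined, so the weights, the normalisation constants, and the names `Source`, `quality_score`, and `select_sources` are all assumptions made for illustration, not the platform's actual implementation.

```python
import math
from dataclasses import dataclass

MIN_SOURCES = 3  # from the README: at least 3 research sources per episode

# Hypothetical weights; the README lists the factors but not their weighting.
WEIGHTS = {"citations": 0.4, "impact": 0.2, "recency": 0.2, "peer_review": 0.2}


@dataclass
class Source:
    title: str
    citations: int
    impact_factor: float
    year: int
    peer_reviewed: bool


def quality_score(src: Source, current_year: int = 2025) -> float:
    """Combine the four factors into a single score in [0, 1]."""
    # Log-scale citations so a few very highly cited papers do not dominate.
    citations = min(math.log1p(src.citations) / math.log1p(1000), 1.0)
    impact = min(src.impact_factor / 10.0, 1.0)  # assumed cap at impact factor 10
    age = max(current_year - src.year, 0)
    recency = 1.0 / (1.0 + age)
    peer = 1.0 if src.peer_reviewed else 0.0
    return (
        WEIGHTS["citations"] * citations
        + WEIGHTS["impact"] * impact
        + WEIGHTS["recency"] * recency
        + WEIGHTS["peer_review"] * peer
    )


def select_sources(candidates, top_k: int = 5, current_year: int = 2025):
    """Rank candidates by score and enforce the minimum-source rule."""
    if len(candidates) < MIN_SOURCES:
        raise ValueError(f"need at least {MIN_SOURCES} sources, got {len(candidates)}")
    ranked = sorted(
        candidates, key=lambda s: quality_score(s, current_year), reverse=True
    )
    return ranked[:top_k]
```

Raising an error when fewer than three candidates are found, rather than silently proceeding, mirrors the README's statement that three sources are a hard requirement per episode.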
78
+ ### Paradigm Shift Detection
79
+ - **Citation Network Analysis:** Identifies highly cited recent papers that may represent paradigm shifts
80
+ - **Interdisciplinary Connection Analysis:** Detects connections across domains that may indicate emerging fields
81
+ - **Expert Review:** Validation of identified paradigm shifts against domain expert knowledge
82
+ - **Temporal Analysis:** Tracks citation patterns over time to identify emerging trends
83
+
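One simple way to read the "highly cited recent papers" and temporal-analysis bullets above is as a citation-velocity filter. The sketch below is only an illustration under that assumption: the function names, the five-year recency window, and the threshold of 50 citations per year are hypothetical and do not come from the README.

```python
def citation_velocity(citations: int, pub_year: int, current_year: int = 2025) -> float:
    """Average citations per year since publication (minimum span of one year)."""
    years = max(current_year - pub_year, 1)
    return citations / years


def flag_candidates(papers, current_year: int = 2025,
                    max_age: int = 5, min_velocity: float = 50.0):
    """Return titles of recent papers whose citation velocity exceeds a threshold.

    `papers` is an iterable of (title, citations, pub_year) tuples.
    """
    flagged = []
    for title, citations, pub_year in papers:
        age = current_year - pub_year
        is_recent = 0 <= age <= max_age
        if is_recent and citation_velocity(citations, pub_year, current_year) >= min_velocity:
            flagged.append(title)
    return flagged
```

A filter like this only surfaces candidates; the expert-review bullet above would still decide whether a flagged paper actually represents a paradigm shift.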
84
+ ### Quality Metrics
85
+ - **Source Quality:** Average citation count, journal impact factors, peer-review status
86
+ - **Coverage:** Number of sources per topic, cross-database coverage, temporal distribution
87
+ - **Accuracy:** Manual validation of sample episodes by domain experts (ongoing process)
88
+ - **Reproducibility:** Full citation tracking enables verification of all claims
89
+ - **Transparency:** All source papers accessible via Research Tools Dashboard and database tables
90
+
91
  ### 🤖 Advanced LLM Integration
92
 
93
  **Multi-Model Architecture:**
 
277
 
278
  ## 📈 Platform Capabilities
279
 
280
+ ### Research Coverage (As of January 2025)
281
  - **250+ million research papers** accessible through integrated APIs
282
  - **8+ academic databases** integrated with parallel search
283
+ - **23,246+ papers indexed** with full metadata and vector embeddings (dynamically growing - see [Public Project Interface](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/copernicusai-public-reviewer.html) for current count)
284
  - **Minimum 3 sources** required per episode for quality assurance
285
  - **Multi-paper analysis** for comprehensive coverage
286
 
287
  ### Platform Features
288
  - **Subscriber-driven content generation** - Users prompt and create podcasts
289
+ - **64+ podcast episodes** generated across 5 disciplines (as of January 2025)
290
  - **RSS feed distribution** to major podcast platforms
291
  - **Public and private podcast options** - Share discoveries or keep them private
292
+ - **Knowledge Engine Dashboard:** Operational since December 2025
293
 
294
  ---
295
 
296
+ ## ⚠️ Limitations & Future Directions
297
+
298
+ ### Current Limitations
299
+ - **Discipline Coverage:** Currently strongest in mathematics (23,246+ papers indexed); expansion to other disciplines in progress (see Ramp-Up Plan)
300
+ - **Process Validation:** Flowcharts are LLM-generated and benefit from expert validation for domain-specific accuracy (validation process ongoing)
301
+ - **Source Linking:** Not all processes yet linked to specific research papers (work in progress per Quality Standards)
302
+ - **Scale:** Current process database (~313 processes) represents proof-of-concept; target is 1,000+ processes
303
+ - **Podcast Generation:** Requires manual review for accuracy; fully automated quality assurance in development
304
+ - **Video Production:** Advanced video features (Phase 2+) are planned but not yet implemented
305
+
306
+ ### Future Work
307
+ - **Expansion:** Scale to 200,000+ papers across all disciplines (see RAMP_UP_PLAN.md)
308
+ - **Validation:** Implement systematic peer review process for process flowcharts
309
+ - **Integration:** Enhanced cross-linking between processes, papers, and podcasts
310
+ - **Automation:** Automated source paper suggestion and linking using vector search
311
+ - **Quality Assurance:** Systematic validation framework for flowchart accuracy and podcast content
312
+ - **Video Features:** Implement advanced video production capabilities (Phase 2+)
313
+ - **Multi-modal Integration:** Enhanced integration of visual content, animations, and interactive elements
314
+
315
+ ### Known Areas for Improvement
316
+ - **Bias in Source Selection:** Current system may favor highly cited papers; working to balance with recent, emerging research
317
+ - **Domain Expertise:** Some domains better represented than others; actively expanding coverage
318
+ - **Validation Coverage:** Not all content yet validated by domain experts; systematic validation in progress
319
+
320
  ## 🔗 Live Platform & Resources
321
 
322
  ### Production Deployment
 
587
 
588
  ## How to Cite This Work
589
 
590
+ ### BibTeX Format
591
+ ```bibtex
592
+ @article{welz2025copernicusai,
593
+ title={CopernicusAI: AI-Generated Audio Briefings as a Research Interface},
594
+ author={Welz, Gary},
595
+ journal={Nature Communications},
596
+ year={2025},
597
+ note={Submitted},
598
+ url={https://huggingface.co/spaces/garywelz/copernicusai},
599
+ note={Preprint available upon publication}
600
+ }
601
+ ```
602
+
603
+ ### Standard Citation Format
604
  Welz, G. (2024–2025). *CopernicusAI: AI-Generated Audio Briefings as a Research Interface*.
605
  Hugging Face Spaces. https://huggingface.co/spaces/garywelz/copernicusai
606
 
607
+ **Note:** When published, this citation will be updated with DOI and publication details from Nature Communications.
608
+
609
+ ---
610
+
611
+ ## 📊 Data Availability
612
+
613
+ **Research Data:**
614
+ - **Research Paper Metadata:** Research paper metadata is publicly accessible via the [Research Tools Dashboard](https://copernicus-frontend-phzp4ie2sq-uc.a.run.app/knowledge-engine) and [Research Paper Database Table](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/papers-database-table.html). Current statistics are dynamically updated at the [Public Project Interface](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/copernicusai-public-reviewer.html).
615
+ - **Podcast Episodes:** All generated podcast episodes, transcripts, and metadata are accessible via the [Podcast Database](https://www.copernicusai.fyi/episodes) and [RSS Feed](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/feeds/copernicus-mvp-rss-feed.xml).
616
+ - **Process Flowcharts:** Process flowcharts across all disciplines are publicly available in Google Cloud Storage:
617
+ - [Biology Processes Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/biology-processes-database/biology-database-table.html)
618
+ - [Chemistry Processes Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/chemistry-processes-database/chemistry-database-table.html)
619
+ - [Physics Processes Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/physics-processes-database/physics-database-table.html)
620
+ - [Mathematics Processes Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/mathematics-processes-database/mathematics-database-table.html)
621
+ - [Computer Science Processes Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/computer-science-processes-database/computer-science-database-table.html)
622
+ - [GLMP Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/glmp-database-table.html)
623
+ - **Science Videos:** Video database accessible via [Science Video Database](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/videos-database-table.html) and [Live Demo](https://scienceviddb-web-204731194849.us-central1.run.app/).
624
+
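Since the podcast episodes are published through an RSS feed, a reader could list episode titles with a few lines of standard-library Python. This is a minimal sketch that assumes the feed follows the usual RSS 2.0 layout (`<item>` elements each containing a `<title>`); the sample XML and its episode titles are invented placeholders, and the function name `episode_titles` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for the real feed; these episode titles are invented.
SAMPLE_FEED = (
    "<rss version='2.0'><channel>"
    "<title>Sample feed</title>"
    "<item><title>Episode one</title></item>"
    "<item><title>Episode two</title></item>"
    "</channel></rss>"
)


def episode_titles(rss_xml: str) -> list[str]:
    """Return the <title> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    # findtext returns None when an item has no <title>; fall back to "".
    return [(item.findtext("title") or "").strip() for item in root.iter("item")]


print(episode_titles(SAMPLE_FEED))
```

To try it against the live feed, the XML could be fetched first (for example with `urllib.request.urlopen`) from the RSS Feed URL listed above and passed to the same function, assuming the feed is reachable and well-formed.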
625
+ **Source Code & Methodology:**
626
+ - **Methodology:** Fully documented in the Programming Framework paper (see [Programming Framework Space](https://huggingface.co/spaces/garywelz/programming_framework) for methodology details).
627
+ - **Process Generation:** LLM-powered extraction using Google Gemini 2.0 Flash, documented in Programming Framework.
628
+ - **Database Schemas:** Documented in project documentation files (SCHEMA_EXTENSIBILITY_GUIDE.md, UNIFIED_METADATA_SCHEMA_MASTER.md).
629
+ - **API Documentation:** RESTful API endpoints documented in the API Documentation section above.
630
+
631
+ **Access:**
632
+ - **Public Access:** All process databases, database tables, public interfaces, and podcast episodes are publicly accessible (no authentication required).
633
+ - **Research Tools Dashboard:** [https://copernicus-frontend-phzp4ie2sq-uc.a.run.app/knowledge-engine](https://copernicus-frontend-phzp4ie2sq-uc.a.run.app/knowledge-engine) - Interactive knowledge graph, vector search, and RAG queries (public access).
634
+ - **Public Project Interface:** [https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/copernicusai-public-reviewer.html](https://storage.googleapis.com/regal-scholar-453620-r7-podcast-storage/copernicusai-public-reviewer.html) - Comprehensive access to all public components with dynamically updated statistics.
635
+ - **API Endpoints:** [https://copernicus-podcast-api-phzp4ie2sq-uc.a.run.app](https://copernicus-podcast-api-phzp4ie2sq-uc.a.run.app) - RESTful API with full documentation (see API Documentation section above).
636
+
637
+ **Reproducibility:**
638
+ - All process flowcharts include source citations linking to research papers.
639
+ - Podcast generation methodology is fully documented and reproducible.
640
+ - Database structures are standardized and documented.
641
+ - Research synthesis workflow is transparent and can be replicated.
642
+ - All components are publicly accessible for verification and reuse.
643
+
644
  ---
645
 
646
  ## 📄 License & Attribution