🔬

CopernicusAI

Knowledge Engine for Scientific Discovery

A collaborative research platform that transforms cutting-edge scientific research into accessible, multi-format tools for collective knowledge exploration. These are research instruments, like microscopes for observing the collective knowledge of humanity, that enable hypothesis formation, testing, and discovery across scientific disciplines.

📋 Summary

CopernicusAI is an operational research platform that synthesizes scientific literature into AI-generated podcasts, drawing on 250+ million papers accessible through academic database APIs. It integrates with a knowledge graph of 12,000+ indexed papers and provides collaborative tools for research discovery. The system demonstrates production-ready multi-source research synthesis with full citation tracking, and evidence-based content generation requires a minimum of three research sources per episode.

The platform includes a fully operational Knowledge Engine Dashboard (deployed December 2025) with interactive knowledge graph visualization, vector search, and RAG capabilities, enabling researchers to explore, query, and synthesize scientific knowledge across disciplines.

Prior Work & Current Status

Prior Work (2024-2025)

CopernicusAI is an active research prototype exploring AI-generated audio briefings as an interface for assisted scientific research.

The system allows any user to generate, refine, and share AI-generated science podcasts based on structured prompts, enabling rapid orientation to a topic, iterative deepening, and personalized research briefings.

Rather than functioning as a static content platform, CopernicusAI supports collectively generated and shared research artifacts, analogous to community-driven knowledge platforms (e.g., discussion forums), but grounded in scientific sources and metadata-aware workflows.

This work demonstrates technical feasibility for:

  • AI-assisted research briefing and orientation
  • Iterative question refinement via conversational interfaces
  • Integration of text, audio, and metadata in research workflows

Current Implementation (December 2025)

The Knowledge Engine Dashboard is fully operational and deployed to Google Cloud Run, providing unified access to all components with interactive knowledge graph visualization, vector search, RAG queries, and content browsing.

See the "Knowledge Engine Ecosystem" section below for details.

🎯 Mission & Vision

Inspired by Nicolaus Copernicus, who challenged accepted knowledge with evidence and rigorous analysis, CopernicusAI creates collaborative research tools that enable collective participation in scientific discovery. These platforms are instruments for exploring humanity's collective knowledge: tools for hypothesis formation, testing, and collaborative research, not just educational content.

Just as a microscope enables observation of the microscopic world, CopernicusAI tools enable observation and exploration of humanity's collective knowledge. Subscribers collaborate to prompt, generate, and refine research content, sharing discoveries publicly or keeping them private. As large language models (LLMs) and AI systems gain unprecedented knowledge, CopernicusAI provides the infrastructure for human-AI collaborative knowledge exploration, with evidence-based truth-seeking as its guiding principle.

🧩 CopernicusAI Knowledge Engine

An integrated ecosystem of research and collaboration tools designed to assist scientists in their workflow, from research discovery through knowledge synthesis to multi-format content generation.

✅

Knowledge Engine Dashboard - Fully Operational (December 2025)

The Knowledge Engine is now fully implemented and deployed with a working web dashboard providing unified access to all components.

Key Features:

  • ✓ Interactive Knowledge Graph (12,000+ papers)
  • ✓ Vector Search (semantic similarity)
  • ✓ RAG System (with citations)
  • ✓ Content Browsing (papers, podcasts, processes)
  • ✓ Statistics Dashboard

Live System:

https://copernicus-frontend-phzp4ie2sq-uc.a.run.app/knowledge-engine

Fully deployed to Google Cloud Run, accessible 24/7

🔬

CopernicusAI

Core synthesis & distribution platform for AI-powered research and podcast generation

This platform

🗺️

Knowledge Engine Dashboard

✅ Fully operational web interface with knowledge graph, vector search, RAG queries, and content browsing

Live System →
🛠️

Programming Framework

Foundational meta-tool for universal process analysis across disciplines

Explore →
🧬

GLMP

Biological process visualization - 50+ processes mapped

Explore →
📚

Metadata Database

Core data infrastructure for research paper metadata and citation networks

Explore →
🎬

Video Database

Multi-modal content with transcript-based search for scientific videos

Explore →
➕

Future Components

Additional tools, databases, and collaboration features will be added as the project develops

  • 250+ Million Papers: Accessible via APIs (as of January 2025)
  • 12,000+ Indexed Papers: In the Knowledge Engine, Mathematics (as of December 2025)
  • 64+ Podcast Episodes: Generated across 5 disciplines (as of January 2025)
  • 8+ Academic Databases: Integrated research sources (as of January 2025)

🌟 Core Platform Capabilities

🎙️

AI-Powered Podcast Generation

Collaborative research platform where subscribers prompt and generate multi-voice AI podcasts (5-10 minutes) that synthesize research from multiple academic sources. Subscribers can share their podcasts publicly or keep them private. Evidence-based content generation requires a minimum of three research sources per episode.

Key Features:

  • ✓ Comprehensive research integration (8+ databases)
  • ✓ Professional multi-speaker dialogue
  • ✓ AI-generated scientific visualizations
  • ✓ RSS feed distribution
  • ✓ Quality scoring & relevance ranking
  • ✓ Paradigm shift identification

Research Integration:

  • ✓ Real-time discovery from 8+ APIs
  • ✓ Parallel search across databases
  • ✓ Automatic citation extraction
  • ✓ Source validation & verification
  • ✓ Interdisciplinary connection analysis
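
The parallel search across databases can be pictured with a small asyncio sketch. This is an illustration only, not the platform's code: the three search functions are hypothetical stand-ins that return canned strings, and the choice to treat a failing source as an empty result is an assumption.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stand-ins for real database clients; each returns a list of titles.
async def search_pubmed(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulate network latency
    return [f"PubMed result for {query}"]

async def search_arxiv(query: str) -> list[str]:
    await asyncio.sleep(0.01)
    return [f"arXiv result for {query}"]

async def search_failing(query: str) -> list[str]:
    raise RuntimeError("simulated API outage")

Searcher = Callable[[str], Awaitable[list[str]]]

async def parallel_search(query: str, searchers: dict[str, Searcher]) -> dict[str, list[str]]:
    """Query every source concurrently; a failing source yields an empty list."""
    names = list(searchers)
    outcomes = await asyncio.gather(
        *(searchers[name](query) for name in names),
        return_exceptions=True,  # collect errors instead of cancelling the batch
    )
    return {
        name: [] if isinstance(outcome, Exception) else outcome
        for name, outcome in zip(names, outcomes)
    }

results = asyncio.run(parallel_search(
    "prime gaps",
    {"pubmed": search_pubmed, "arxiv": search_arxiv, "broken": search_failing},
))
```

Gathering with `return_exceptions=True` means one unavailable database does not block results from the others.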
🤖

Advanced LLM Integration

Multi-model architecture with intelligent model selection:

Primary Models:

  • Google Gemini 3 - Latest research analysis and content generation
  • OpenAI GPT-4/GPT-3.5 - Content synthesis and quality validation
  • Anthropic Claude 3 (Sonnet, Haiku) - Alternative reasoning paths
  • ElevenLabs TTS - Multi-voice text-to-speech synthesis

Capabilities:

  • Multi-paper analysis & synthesis
  • Paradigm shift detection
  • Entity extraction (genes, proteins, compounds)
  • Citation tracking & cross-references
  • Content quality scoring
📊

Research Resource Access

Comprehensive academic database coverage with 250+ million research papers accessible through integrated APIs.

Academic Databases:

  • PubMed/NCBI (30+ million papers)
  • arXiv (2+ million preprints)
  • NASA ADS (15+ million papers)
  • Zenodo (100K+ datasets)
  • bioRxiv/medRxiv (preprints)
  • CORE (200+ million papers)
  • Google Scholar (comprehensive)
  • News API (current events)
  • YouTube Data API (academic videos)
🎙️

Audio and Video Podcast Production

Operating Audio Podcast System: Full production and distribution platform for subscriber-generated podcasts. Users can prompt, generate, publish, and distribute audio podcasts with RSS feed support for Spotify, Apple Podcasts, and Google Podcasts.

Current Audio Capabilities (Operational):

  • ✓ Multi-voice AI podcast generation
  • ✓ Research-driven content creation
  • ✓ RSS feed distribution
  • ✓ Public and private podcast options
  • ✓ Professional audio quality
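
For readers unfamiliar with RSS distribution, the following sketch builds a minimal RSS 2.0 feed with one audio enclosure per episode using Python's standard library. It is not the platform's feed generator: the function name, the episode dictionary keys, and the example.org URLs are all placeholders, and directory-specific tags that podcast apps may require are omitted.

```python
import xml.etree.ElementTree as ET

def build_rss_feed(title, link, description, episodes):
    """Return a minimal RSS 2.0 document as a string (hypothetical helper)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link, and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for ep in episodes:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = ep["title"]
        ET.SubElement(item, "guid").text = ep["guid"]
        # The enclosure element points podcast clients at the audio file.
        ET.SubElement(
            item, "enclosure",
            url=ep["audio_url"], length=str(ep["bytes"]), type="audio/mpeg",
        )
    return ET.tostring(rss, encoding="unicode")

feed_xml = build_rss_feed(
    title="Example research podcast",
    link="https://example.org/podcast",
    description="Placeholder feed for illustration",
    episodes=[{
        "title": "Episode 1",
        "guid": "episode-1",
        "audio_url": "https://example.org/audio/episode-1.mp3",
        "bytes": 1234567,
    }],
)
```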

Video Production (Future - Phase 2+):

Advanced video features planned for future development:

  • Visual Content Integration: Automated extraction from papers, web scraping, JSON database integration
  • Dynamic Visualizations: Scientific animations, real-time charts, LaTeX rendering
  • External Video Quoting: YouTube segment extraction with attribution & fair use compliance
  • Advanced Composition: Multi-layer video, auto subtitles, text overlays, professional transitions

See: Science Video Database - Companion project for research video content management.

📚

Research Papers Metadata Database (Phase 2)

A centralized metadata repository (not a file archive) providing structured JSON objects with AI-powered preprocessing.

Structured JSON Objects:

  • DOI, arXiv ID, publication info
  • Abstracts & key findings
  • Extracted entities (genes, proteins, compounds, equations)
  • Citation networks & cross-references
  • Paradigm shift indicators
  • Quality scores & relevance metrics
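
One way to picture such a structured object is the dataclass sketch below. The field names, types, and sample values are assumptions chosen to mirror the list above; the actual schema is not documented here, and the DOI shown is a placeholder rather than a real paper.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class PaperRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None
    abstract: str = ""
    key_findings: list[str] = field(default_factory=list)
    entities: dict[str, list[str]] = field(default_factory=dict)
    cited_dois: list[str] = field(default_factory=list)
    paradigm_shift: bool = False
    quality_score: float = 0.0  # assumed 0.0 to 1.0 scale

record = PaperRecord(
    title="Example paper title",
    doi="10.1234/example.5678",  # placeholder DOI
    abstract="Placeholder abstract.",
    key_findings=["Placeholder finding"],
    entities={"genes": ["TP53"], "compounds": []},
    quality_score=0.8,
)

# Serialize to the JSON form a metadata store or API might exchange.
record_json = json.dumps(asdict(record), indent=2)
```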

AI-Powered Preprocessing:

  • LLM-based entity extraction
  • Automatic categorization
  • Keyword extraction & semantic tagging
  • Citation tracking & mapping
  • Quality assessment
  • RESTful API access

🔬 Methodology & System Design

Multi-Source Validation Process

The system requires a minimum of 3 research sources per podcast episode. Each source is:

  • Retrieved from authoritative academic databases (PubMed, arXiv, NASA ADS, etc.)
  • Validated for authenticity and publication status
  • Scored for quality and relevance to the research topic
  • Cross-referenced to verify consistency and eliminate conflicting information
  • Processed through parallel API queries for comprehensive coverage
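
The minimum-source rule can be sketched as a simple gate over candidate sources. This is an illustration under stated assumptions: only the threshold of 3 comes from the text, while the `Source` fields, the 0.0 to 1.0 relevance scale, and the 0.5 cutoff are hypothetical.

```python
from dataclasses import dataclass

MIN_SOURCES = 3  # the per-episode minimum stated above

@dataclass(frozen=True)
class Source:
    identifier: str   # DOI or arXiv ID
    database: str
    verified: bool    # passed authenticity and publication checks
    relevance: float  # assumed 0.0 to 1.0 scale

def select_sources(candidates, min_relevance=0.5):
    """Keep verified, relevant sources, best first; fail if fewer than the minimum remain."""
    kept = [s for s in candidates if s.verified and s.relevance >= min_relevance]
    kept.sort(key=lambda s: s.relevance, reverse=True)
    if len(kept) < MIN_SOURCES:
        raise ValueError(
            f"need at least {MIN_SOURCES} validated sources, got {len(kept)}"
        )
    return kept

# Placeholder identifiers, not real papers.
candidates = [
    Source("10.1234/a", "PubMed", True, 0.9),
    Source("arXiv:2401.00001", "arXiv", True, 0.7),
    Source("10.1234/b", "NASA ADS", True, 0.6),
    Source("10.1234/c", "CORE", False, 0.95),  # unverified, so dropped
]
selected = select_sources(candidates)
```

Raising an error rather than generating from too few sources keeps the evidence requirement a hard precondition instead of a soft preference.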

Quality Assurance Mechanisms

  • Source Verification: Automated checking of DOI, arXiv IDs, and publication metadata
  • Relevance Scoring: LLM-based assessment of paper relevance to query
  • Paradigm Shift Detection: Identification of revolutionary vs. incremental research
  • Citation Extraction: Automatic extraction and formatting of citations
  • Content Validation: Multi-model verification (Gemini, GPT-4, Claude) for accuracy

Citation Extraction & Verification

The system automatically extracts and formats citations from research papers:

  • DOI resolution and metadata enrichment
  • arXiv ID parsing and preprint identification
  • Author, title, and publication information extraction
  • Cross-reference linking between related papers
  • Citation network analysis for relationship mapping
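
As a rough illustration of the identifier-parsing step, the sketch below pulls DOIs and new-style arXiv IDs out of free text with regular expressions. It is not the system's extractor: the patterns are common approximations (old-style arXiv IDs such as `math/0601001` are not handled), the function name is hypothetical, and the sample identifiers are placeholders.

```python
import re

# Approximate DOI pattern: "10." + registrant code + "/" + suffix characters.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")
# New-style arXiv IDs (YYMM.NNNNN) with an optional version suffix.
ARXIV_NEW_RE = re.compile(r"\barXiv:(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)

def extract_identifiers(text):
    """Return DOIs and version-less arXiv IDs found in text."""
    # Trailing punctuation is stripped as a simplification; a DOI that
    # legitimately ends in ")" would be truncated by this.
    dois = [m.group(0).rstrip(".,;)") for m in DOI_RE.finditer(text)]
    arxiv_ids = [m.group(1) for m in ARXIV_NEW_RE.finditer(text)]
    return {"doi": dois, "arxiv": arxiv_ids}

sample = "See doi:10.1234/abc.2024.001, and the preprint arXiv:2401.01234v2."
ids = extract_identifiers(sample)
```

Resolving each extracted DOI against a registry for metadata enrichment would be a separate network step and is left out here.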

Paradigm Shift Detection Implementation

The system uses LLM analysis to identify paradigm-shifting research by:

  • Analyzing citation patterns and impact metrics
  • Detecting novel methodologies or breakthrough discoveries
  • Comparing against established knowledge frameworks
  • Identifying interdisciplinary connections and cross-domain insights
  • Flagging research that challenges existing paradigms

⚙️ Technology Stack

AI & Machine Learning

  • Google Gemini 3
  • Google Vertex AI (model orchestration)
  • OpenAI GPT-4/GPT-3.5
  • Anthropic Claude 3
  • ElevenLabs TTS
  • DALL-E 3
  • Cloud Vision API
  • Video Intelligence API

Backend Infrastructure

  • FastAPI (Python)
  • Google Cloud Run
  • Firestore (NoSQL)
  • Cloud Storage
  • Cloud Functions
  • Cloud Tasks
  • Secret Manager

Frontend

  • Next.js 15.5.7
  • Alpine.js
  • Tailwind CSS
  • Vercel

πŸ” Limitations & Future Directions

Current Limitations

  • Discipline Coverage: Mathematics currently has the most complete indexing (12,000+ papers); other disciplines are being expanded
  • Source Bias: Coverage depends on database API availability and open access policies
  • LLM Accuracy: Content generation relies on LLM accuracy; multi-source validation mitigates but doesn't eliminate errors
  • Real-Time Updates: Knowledge graph updates require manual or scheduled processing cycles
  • Language: Currently optimized for English-language research papers

Future Development

  • Multi-Discipline Expansion: Expanding knowledge graph to Biology, Chemistry, Physics, Computer Science
  • Process Databases: Creating comprehensive flowchart databases for all 5 disciplines (~50 processes each)
  • Advanced Video Features: Dynamic visualizations, animations, and multi-layer composition
  • Multi-Language Support: Extending to non-English research papers
  • Enhanced Validation: Peer review mechanisms and user feedback integration
  • Real-Time Updates: Automated continuous knowledge graph updates

🔬 Collaborative Research Tools

These platforms enable collective participation and collaboration across diverse user communities:

  • Researchers - Tools for hypothesis formation and testing, cross-disciplinary synthesis
  • Collaborators - Collective knowledge exploration and refinement
  • Subscribers - Prompt, generate, and share podcasts (public or private)
  • Community - User suggestions, comments, and collaborative flowchart improvement (GLMP)

Just as a microscope enables observation of the microscopic world, these tools enable observation and exploration of humanity's collective knowledge.

Key Innovations

  • Multi-source validation (min 3 sources)
  • Evidence-based generation
  • Paradigm shift detection
  • Interdisciplinary connections
  • Multiple expertise levels
  • Full citation tracking

📚 Prior Work & Research Contributions

Overview

This platform represents prior work that demonstrates foundational research and development achievements in AI-powered scientific knowledge synthesis, collaborative research tools, and multi-modal content generation. These contributions establish the technical foundation and proof-of-concept for the broader CopernicusAI Knowledge Engine initiative.

🔬 Research Contributions

  • AI-Powered Research Synthesis: Production system for multi-source research synthesis using LLMs
  • Multi-Model Architecture: Intelligent model selection with Gemini 3, GPT-4, Claude 3
  • Collaborative Platform: Subscriber-driven content generation with public/private sharing
  • Knowledge Engine Integration: Architecture for Research Papers DB, Video DB, GLMP, Framework

⚙️ Technical Achievements

  • 250+ Million Papers: Accessible via 8+ integrated academic databases
  • 64+ Episodes: Generated across 5 scientific disciplines
  • Production Deployment: Live platform with operational API and RSS distribution
  • Scalable Architecture: Serverless microservices on Google Cloud

🎯 Position Within CopernicusAI Knowledge Engine

This platform serves as the core synthesis and distribution component of the CopernicusAI Knowledge Engine. The Knowledge Engine is an integrated ecosystem of research and collaboration tools that work together to assist scientists in their workflow, from research discovery through knowledge synthesis to multi-format content generation.

Current Components:

  • 1. CopernicusAI (This platform) - Core synthesis & distribution
  • 2. Programming Framework - Foundational meta-tool
  • 3. GLMP - Biological process visualization
  • 4. Research Paper Metadata Database - Data infrastructure
  • 5. Science Video Database - Multi-modal content

Future Development:

The Knowledge Engine is designed to grow and evolve. Additional tools, databases, and collaboration components will be added as the project develops, expanding capabilities for AI-assisted scientific research and knowledge discovery.

📖 Citation Information

For Grant Proposals (NSF/DOE):

Welz, G. (2025). CopernicusAI: Knowledge Engine for Scientific Discovery.

Hugging Face Space. https://huggingface.co/spaces/garywelz/copernicusai

Live Platform: https://www.copernicusai.fyi

BibTeX Format:

@misc{welz2025copernicusai,
  title={CopernicusAI: Knowledge Engine for Scientific Discovery},
  author={Welz, Gary},
  year={2025},
  url={https://huggingface.co/spaces/garywelz/copernicusai},
  note={Hugging Face Space, Live Platform: https://www.copernicusai.fyi}
}

📊 Data Availability Statement

Platform Access

Data & Code Availability

  • Hugging Face Spaces: All components accessible at https://huggingface.co/garywelz
  • Process Flowcharts (GLMP): JSON files stored in Google Cloud Storage, accessible via GLMP Database Table
  • Research Paper Metadata: 12,000+ indexed papers with metadata accessible through Knowledge Engine Dashboard
  • API Documentation: RESTful API endpoints available for programmatic access (see API Documentation section)

Reproducibility Information

  • Technology Stack: All technologies and versions documented in Technology Stack section
  • LLM Models: Google Gemini 3, OpenAI GPT-4/GPT-3.5, Anthropic Claude 3 (versions specified in documentation)
  • Source Citations: All podcast episodes include full citations to source papers
  • Metadata: Complete metadata for all generated content available through API
  • License: MIT License - see license information in space metadata

How to Cite This Work

Welz, G. (2024–2025). CopernicusAI: AI-Generated Audio Briefings as a Research Interface.
Hugging Face Spaces. https://huggingface.co/spaces/garywelz/copernicusai

BibTeX Format:

@misc{welz2025copernicusai,
  title={CopernicusAI: AI-Generated Audio Briefings as a Research Interface},
  author={Welz, Gary},
  year={2024--2025},
  url={https://huggingface.co/spaces/garywelz/copernicusai},
  note={Hugging Face Space}
}

🌐 Grant Support & Collaboration

Grant Applications Supported

This platform is designed to support grant applications to:

NSF

National Science Foundation - Science education and research infrastructure

DOE

Department of Energy - Scientific computing and data science

SAIR Foundation

AI research and development initiatives

Collaboration Opportunities

  • Integration with academic institutions
  • Partnership with research organizations
  • Open data initiatives
  • Educational program development

🔗 Live Platform & Resources

🧩 Knowledge Engine Components

The CopernicusAI Knowledge Engine is an integrated ecosystem of research and collaboration tools. The Knowledge Engine Dashboard is now fully operational (December 2025) with a working web interface providing unified access to all components.

✅ Knowledge Engine Dashboard (Implemented)

Fully operational web interface with knowledge graph visualization (12,000+ papers), vector search, RAG queries, and content browsing.

Live System: https://copernicus-frontend-phzp4ie2sq-uc.a.run.app/knowledge-engine

🔌 API Documentation

Base URL: https://copernicus-podcast-api-phzp4ie2sq-uc.a.run.app

Podcast Generation

  • POST /generate-podcast-with-subscriber
  • GET /api/subscribers/podcasts/{id}
  • POST /api/subscribers/podcasts/submit-to-rss

Research Endpoints

  • POST /api/papers/upload
  • GET /api/papers/{paper_id}
  • POST /api/papers/query
  • POST /api/papers/{id}/link-podcast/{id}

Admin Endpoints

  • GET /api/admin/subscribers
  • POST /api/admin/podcasts/fix-missing-titles
  • GET /api/admin/podcasts/catalog
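
As a hedged sketch of programmatic access, the snippet below constructs (but does not send) a request to the `POST /api/papers/query` endpoint listed above. The base URL and path come from this page; the JSON body fields `query` and `limit` are assumptions, since the request schema is not documented here, and any authentication the API may require is not shown.

```python
import json
import urllib.request

BASE_URL = "https://copernicus-podcast-api-phzp4ie2sq-uc.a.run.app"

def build_paper_query(query_text, limit=5):
    """Build a POST request for /api/papers/query (body schema is hypothetical)."""
    payload = {"query": query_text, "limit": limit}  # assumed field names
    return urllib.request.Request(
        url=f"{BASE_URL}/api/papers/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_paper_query("Riemann zeta zeros")
# To send it: urllib.request.urlopen(request, timeout=30)
```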