Spaces:
Sleeping
Sleeping
| license: mit | |
| title: Generate Knowledge Graphs | |
| sdk: streamlit | |
| emoji: π | |
| colorFrom: indigo | |
| colorTo: pink | |
| short_description: Use LLM to generate a knowledge graph from your input data. | |
| # πΈοΈ Knowledge Graph Extraction App | |
| A complete knowledge graph extraction application using LLMs via OpenRouter, available in both Gradio and Streamlit versions. | |
| ## π Features | |
| - **Multi-format Document Support**: PDF, TXT, DOCX, JSON files up to 10MB | |
| - **LLM-powered Extraction**: Uses OpenRouter API with free models (Gemma-2-9B, Llama-3.1-8B) | |
| - **Smart Entity Detection**: Automatically identifies people, organizations, locations, concepts, events, and objects | |
| - **Importance Scoring**: LLM evaluates entity importance from 0.0 to 1.0 | |
| - **Interactive Visualization**: Multiple graph layout algorithms with filtering options | |
| - **Batch Processing**: Optional processing of multiple documents together | |
| - **Export Capabilities**: JSON, GraphML, and GEXF formats | |
| - **Real-time Statistics**: Graph metrics and centrality analysis | |
| ## π Project Structure | |
| ``` | |
| knowledge-graphs/ | |
| βββ app.py # Main Gradio application (legacy) | |
| βββ app_streamlit.py # Main Streamlit application (recommended) | |
| βββ run_streamlit.py # Simple launcher script | |
| βββ requirements.txt # Python dependencies | |
| βββ README.md # Project documentation | |
| βββ .env.example # Environment variables template | |
| βββ config/ | |
| β βββ settings.py # Configuration management | |
| βββ src/ | |
| βββ document_processor.py # Document loading and chunking | |
| βββ llm_extractor.py # LLM-based entity extraction | |
| βββ graph_builder.py # NetworkX graph construction | |
| βββ visualizer.py # Graph visualization and export | |
| ``` | |
| ## π§ Installation & Setup | |
| ### Option 1: Streamlit Version (Recommended) | |
| The Streamlit version is more stable and has better file handling. | |
| **Quick Start:** | |
| ```bash | |
| python run_streamlit.py | |
| ``` | |
| **Manual Setup:** | |
| 1. **Install dependencies**: | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| 2. **Run the Streamlit app**: | |
| ```bash | |
| streamlit run app_streamlit.py --server.address 0.0.0.0 --server.port 8501 | |
| ``` | |
| The app will be available at `http://localhost:8501` | |
| ### Option 2: Gradio Version (Legacy) | |
| The Gradio version may have some file caching issues but is provided for compatibility. | |
| 1. **Install dependencies**: | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| 2. **Set up environment variables** (optional): | |
| ```bash | |
| cp .env.example .env | |
| # Edit .env and add your OpenRouter API key | |
| ``` | |
| 3. **Run the application**: | |
| ```bash | |
| python app.py | |
| ``` | |
| The app will be available at `http://localhost:7860` | |
| ### HuggingFace Spaces Deployment | |
| For **Streamlit deployment**: | |
| 1. Create a new Space on [HuggingFace Spaces](https://huggingface.co/spaces) | |
| 2. Choose "Streamlit" as the SDK | |
| 3. Upload `app_streamlit.py` as `app.py` (HF Spaces expects this name) | |
| 4. Upload all other project files maintaining directory structure | |
| For **Gradio deployment**: | |
| 1. Create a new Space with "Gradio" as the SDK | |
| 2. Upload `app.py` and all other files | |
| 3. Note: May experience file handling issues | |
| ## π API Configuration | |
| ### Getting OpenRouter API Key | |
| 1. Visit [OpenRouter.ai](https://openrouter.ai) | |
| 2. Sign up for a free account | |
| 3. Navigate to API Keys section | |
| 4. Generate a new API key | |
| 5. Copy the key and use it in the application | |
| ### Free Models Used | |
| - **Primary**: `google/gemma-2-9b-it:free` | |
| - **Backup**: `meta-llama/llama-3.1-8b-instruct:free` | |
| These models are specifically chosen to minimize API costs while maintaining quality. | |
| ## π Usage Guide | |
| ### Basic Workflow | |
| 1. **Upload Documents**: | |
| - Select one or more files (PDF, TXT, DOCX, JSON) | |
| - Toggle batch mode for multiple document processing | |
| 2. **Configure API**: | |
| - Enter your OpenRouter API key | |
| - Key is stored temporarily for the session | |
| 3. **Customize Settings**: | |
| - Choose graph layout algorithm | |
| - Toggle label visibility options | |
| - Set minimum importance threshold | |
| - Select entity types to include | |
| 4. **Extract Knowledge Graph**: | |
| - Click "Extract Knowledge Graph" button | |
| - Monitor progress through the status updates | |
| - View results in multiple tabs | |
| 5. **Explore Results**: | |
| - **Graph Visualization**: Interactive graph with colored nodes by entity type | |
| - **Statistics**: Detailed metrics about the graph structure | |
| - **Entities**: Complete list of extracted entities with details | |
| - **Central Nodes**: Most important entities based on centrality measures | |
| 6. **Export Data**: | |
| - Choose export format (JSON, GraphML, GEXF) | |
| - Download structured graph data | |
| ### Advanced Features | |
| #### Entity Types | |
| - **PERSON**: Individuals mentioned in the text | |
| - **ORGANIZATION**: Companies, institutions, groups | |
| - **LOCATION**: Places, addresses, geographical entities | |
| - **CONCEPT**: Abstract ideas, theories, methodologies | |
| - **EVENT**: Specific occurrences, meetings, incidents | |
| - **OBJECT**: Physical items, products, artifacts | |
| #### Relationship Types | |
| - **works_at**: Employment relationships | |
| - **located_in**: Geographical associations | |
| - **part_of**: Hierarchical relationships | |
| - **causes**: Causal relationships | |
| - **related_to**: General associations | |
| #### Filtering Options | |
| - **Importance Threshold**: Show only entities above specified importance score | |
| - **Entity Types**: Filter by specific entity categories | |
| - **Layout Algorithms**: Spring, circular, shell, Kamada-Kawai, random | |
| ## π οΈ Technical Details | |
| ### Architecture Components | |
| 1. **Document Processing**: | |
| - Multi-format file parsing | |
| - Intelligent text chunking with overlap | |
| - File size validation | |
| 2. **LLM Integration**: | |
| - OpenRouter API integration | |
| - Structured prompt engineering | |
| - Error handling and fallback models | |
| 3. **Graph Processing**: | |
| - NetworkX-based graph construction | |
| - Entity deduplication and standardization | |
| - Relationship validation | |
| 4. **Visualization**: | |
| - Matplotlib-based static graphs | |
| - Interactive HTML visualizations | |
| - Multiple export formats | |
| ### Configuration Options | |
| All settings can be modified in `config/settings.py`: | |
| - **Chunk Size**: Default 2000 characters | |
| - **Chunk Overlap**: Default 200 characters | |
| - **Max File Size**: Default 10MB | |
| - **Max Entities**: Default 100 per extraction | |
| - **Max Relationships**: Default 200 per extraction | |
| - **Importance Threshold**: Default 0.3 | |
| ### Differences Between Versions | |
| **Streamlit Version Advantages:** | |
| - More reliable file handling | |
| - Better progress indicators | |
| - Cleaner UI with sidebar configuration | |
| - More stable caching system | |
| - Built-in download functionality | |
| **Gradio Version Advantages:** | |
| - Simpler deployment to HF Spaces | |
| - More compact interface | |
| - Familiar for ML practitioners | |
| ## π Security & Privacy | |
| - API keys are not stored permanently | |
| - Files are processed temporarily and discarded | |
| - No data is retained between sessions | |
| - All processing happens server-side | |
| ## π Troubleshooting | |
| ### Common Issues | |
| 1. **"OpenRouter API key is required"**: | |
| - Ensure you've entered a valid API key | |
| - Check the key has sufficient credits | |
| 2. **"No entities extracted"**: | |
| - Document may be too short or unstructured | |
| - Try lowering the importance threshold | |
| - Check if the document contains meaningful text | |
| 3. **File upload issues (Gradio version)**: | |
| - Known issue with Gradio's file caching system | |
| - Try the Streamlit version instead | |
| - Ensure files are valid and not corrupted | |
| 4. **Segmentation fault (local development)**: | |
| - Usually related to matplotlib backend | |
| - Try setting `MPLBACKEND=Agg` environment variable | |
| - Install GUI toolkit if running locally with display | |
| 5. **Module import errors**: | |
| - Ensure all requirements are installed: `pip install -r requirements.txt` | |
| - Check Python version compatibility (3.8+) | |
| ### Performance Tips | |
| - Use batch mode for related documents | |
| - Adjust chunk size for very long documents | |
| - Lower importance threshold for sparse documents | |
| - Use simpler layout algorithms for large graphs | |
| ## π€ Contributing | |
| 1. Fork the repository | |
| 2. Create a feature branch | |
| 3. Make your changes | |
| 4. Test with both Streamlit and Gradio versions if applicable | |
| 5. Add tests if applicable | |
| 6. Submit a pull request | |
| ## π License | |
| This project is licensed under the MIT License - see the LICENSE file for details. | |
| ## π Acknowledgments | |
| - [OpenRouter](https://openrouter.ai) for LLM API access | |
| - [Streamlit](https://streamlit.io) for the modern web interface framework | |
| - [Gradio](https://gradio.app) for the ML-focused web interface | |
| - [NetworkX](https://networkx.org) for graph processing | |
| - [HuggingFace Spaces](https://huggingface.co/spaces) for hosting |