v0.2.8.5: Baseline 02 - upload markdown files/folder, accumulate_dir - sidebar (settings) - visualise KG - reset working folder files - updated README
Files changed:
- README.md +62 -192
- app.py +77 -37
- app_gradio_lightrag.py +72 -107
- utils/file_utils.py +37 -1
- utils/llm_login.py +1 -1
README.md
CHANGED
@@ -34,220 +34,90 @@ requires-python: ">=3.12"
 #---
 ---
 
-#
 
-A
 
-##
 
-
-The toolkit enables intelligent document processing, semantic search, and interactive knowledge graph visualisation with support for multiple LLM backends. It supports OpenAI and Ollama LLM backends.
 
-##
-
-### 🔍 Intelligent Document Processing and RAG Capabilities
-- **Dual-level KG-RAG**: Combines traditional RAG with knowledge graph reasoning (powered by LightRAG)
-- **Multi-modal LLM Support**: OpenAI, Ollama, and Google GenAI backends. Full GenAI support coming soon.
-- **Semantic Search**: Vector-based document retrieval with embedding models (powered by LightRAG)
-- **Multi-format Support**: Markdown ingestion with ParserPDF ([GitHub][3] | [HF Space][4]) integration for PDF, Word, and HTML conversion. Full integration coming soon.
-- **Markdown Ingestion**: Process and index markdown files from specified directories
-- **Knowledge Graph Construction**: Automatically builds entity-relationship graphs after indexing
-- **Interactive Visualisation**: Real-time KG exploration
-
-### Technical Excellence
-- **Modular Architecture**: Clean, maintainable code structure
-- **Async Operations**: Efficient handling of large document collections
-- **Robust Error Handling**: Comprehensive logging and exception management
-
-## Installation & Setup
-
-### Method 1: Using UV (Recommended)
 ```bash
 git clone https://github.com/semmyk-research/semmyKG
 cd semmyKG
 
-#
-
-
-# .venv\Scripts\activate on Windows
 
-
-uv pip sync
-```
-
-### Method 2: Traditional Python Setup
-```bash
-git clone https://github.com/semmyk-research/semmyKG
-cd semmyKG
-
-# Create virtual environment
 python -m venv .venv
-source .venv/bin/activate #
-# .venv\Scripts\activate on Windows
-
-# Install dependencies
 pip install -r requirements.txt
 ```
 
-##
-
-
-Copy `.env.example` to `.env` and configure your settings:
-
-```env
-# API Configuration
 OPENAI_API_KEY=your-openai-api-key
-
-#
-
-
-
-
-#
-
-# Embedding Configuration
-OPENAI_API_EMBED_BASE=your-embedding-provider-endpoint
-# Note: For local embedding services, do not include /embedding in the URL
-LLM_MODEL_EMBED=your-embedding-model
-
-# Ollama/Local hosting Configuration
 OLLAMA_HOST=http://localhost:11434
-OLLAMA_API_KEY=
-
-
-
-# If .env is not set, you can enter credentials directly in the web interface
-```
-
-## Quick Start
-
-### 1. Initialise the Application
-```bash
-python app.py
-```
-
-### 2. Web Interface Workflow
-1. **Select Data Folder**: Choose your markdown documents directory (default: `dataset/data/docs`)
-2. **Configure Settings**:
-   - **Choose LLM Backend**: Select between OpenAI, Ollama, or GenAI
-   - Select or input other configuration in the Settings pane
-3. **Activate**: Activate the LightRAG constructor
-4. **Process Documents**: Click 'Index Documents' to process your files
-5. **Query the System**: Enter your questions and select the query mode
-6. **Visualise Results**: Click 'Show Knowledge Graph' to finish building the knowledge graph and explore it interactively
-
-## 📁 Project Structure
-
-```
-semmyKG/
-├── app_gradio_lightrag.py   # Central Gradio coordinating processing
-├── app.py                   # Main Gradio app entry point
-├── requirements.txt         # Project dependencies
-├── .env.example             # Environment template
-├── dataset/
-│   └── data/
-│       └── docs/            # Default document directory
-├── utils/
-│   ├── utils.py             # Utility functions
-│   ├── file_utils.py        # File operations
-│   ├── logger.py            # Logging configuration
-└── logs/                    # Application logs
-```
 
-##
-
-### Local Development
 ```bash
-python
-```
-
-### HuggingFace Spaces
-- **Requirements**: Ensure all dependencies are in `requirements.txt`
-- **Environment**: Configure via the web UI or Space secrets
-
-### Google Colab
-- **Quick Setup**: Install requirements and configure tokens in 'Secrets'
-- **Run**: Copy to `Files`, following the folder structure, and run the app cells as appropriate
-
-### 📋 System Requirements
 
-- **Python**: 3.12+
-- **Memory**: 8GB+ vRAM recommended for large document sets
-- **Storage**: Sufficient space for document collections and vector databases
-
-### 🔌 Supported LLM Backends
-
-#### OpenAI-Compatible and Google GenAI
-- **Models**: Frontline providers (OpenAI, DeepSeek, ...) and custom models
-- **Gemini Models**: Access to Google's latest AI models
-- **Endpoints**: Local inference servers (LMStudio, Jan.ai, Ollama, ...)
-- **Embedding Models**: Multiple sentence-transformer models and inference providers
-
-#### Ollama Integration
-- **Local Models**: Access to Ollama's model ecosystem
-- **Self-hosted**: Complete data privacy and control
-
-### Document Ingestion
-- **Format Support**: Markdown files only (use ParserPDF for other formats)
 ```python
-#
-
 ```
 
-###
--
--
-
-##
--
--
-
-
--
--
-
-
-
--
--
-
-
-
-
-
-- **Module Import Errors**: Ensure all dependencies are installed
-- **API Connection Issues**: Verify endpoint URLs and API keys
-- **Memory Management**: Monitor resource usage during large-scale indexing
-
-### Notes
-- All user-facing text is in UK English
-- For advanced configuration, see the LightRAG documentation
-Pending full integration, use our ParserPDF tool ([GitHub][3] | [HF Space][4]) to generate markdown from documents (PDF, Word, HTML)
-
-## 🤝 Contributing
-
-We welcome contributions! Please see our contributing guidelines for more information.
-
-## 🛣️ Roadmap (no defined timeline)
-- Integrate HuggingFace login (in progress)
 - [ParserPDF][3] integration
-- Pre- and post-processing document viewer
-- Modal platform support
-- Connected UX refactoring
-
-## 📄 License
-
-This project is licensed under the [MIT License][2].
-
-## 🔗 References
-
-- [LightRAG Framework][1]
-- [ParserPDF Tool][3] for document conversion
-- [HuggingFace Space][4] for ParserPDF
 
-[1]: https://github.com/HKUDS/LightRAG "LightRAG GitHub"
 [2]: https://opensource.org/license/mit "MIT License"
-[3]: https://github.com/semmyk-research/parserPDF "ParserPDF GitHub"
-[4]: https://huggingface.co/spaces/semmyk/parserPDF "ParserPDF"
 #---
 ---
 
+# LightRAG Gradio App
 
+A modern, modular Gradio app for knowledge-graph-based Retrieval-Augmented Generation (RAG) using [LightRAG][1]. Supports OpenAI and Ollama LLM backends, markdown document ingestion, and interactive knowledge graph visualisation. Our ParserPDF ([GitHub][3] | [HF Space][4]) pipeline generates markdown from documents (PDF, Word, HTML).
 
+## Features
+- LightRAG for dual-level RAG and knowledge graph (KG)
+- Ingest markdown files from a folder (default: `dataset/data/docs`)
+- Query with OpenAI or Ollama backend (user-selectable)
+- Visualise the KG interactively in-browser
+- Deployable to a venv, Colab, or HuggingFace Spaces
+- Robust, Pythonic, modular code (UK English)
 
+## Setup
 
+### 1. Clone and create a venv
 ```bash
 git clone https://github.com/semmyk-research/semmyKG
 cd semmyKG
 
+uv venv .venv              # ensure you have the uv package
+source .venv/bin/activate  # or .venv\Scripts\activate on Windows
+uv pip sync                # or: uv pip sync requirements.txt
 
+# or
 python -m venv .venv
+source .venv/bin/activate  # or .venv\Scripts\activate on Windows
 pip install -r requirements.txt
 ```
 
+### 2. Configure environment
+Copy `.env.example` to `.env` and fill in your keys:
+```env
 OPENAI_API_KEY=your-openai-api-key
+LLM_MODEL=your-LLM-model-name          ## (in the format: provider/model-identifier)
+OPENAI_API_BASE=your-LLM-inference-provider-endpoint
+## (for a locally hosted LLM inference server such as LMStudio or Jan.ai, follow the Ollama host and append /v1, e.g. http://localhost:1234/v1)
+OPENAI_API_EMBED_BASE=your-embedding-provider-endpoint
+## (for locally hosted services, do not include /embedding)
+LLM_MODEL_EMBED=your-embedding-model   ## (in the format: provider/embedding-name)
 OLLAMA_HOST=http://localhost:11434
+OLLAMA_API_KEY=                        ## (include if required)
+```
+If `.env` is not set, you can enter the values directly in the web UI. <br>
+Likewise, values entered directly in the web UI override `.env`.
 
+### 3. Run the app
 ```bash
+python app_gradio_lightrag.py
+```
+For faster development, use Gradio's reload ('debug') mode:
 ```python
+##SMY: assist: https://www.gradio.app/guides/developing-faster-with-reload-mode
+gradio app_gradio_lightrag.py --demo-name=gradio_ui
 ```
 
+### 4. Colab/Spaces
+- For HuggingFace Spaces: ensure all dependencies are in `requirements.txt` and `.env` is set via the web UI or Space secrets.
+- For Colab: install requirements and run the app cell.
+
+## Usage
+- Select your data folder (default: `dataset/data/docs`)
+- Choose the LLM backend (OpenAI or Ollama). GenAI has a bug yielding the error: role 'assistant' instead of 'user' when updating history.
+- Activate the RAG constructor
+- Click 'Index Documents' to build the KG entities
+- Click 'Query' to get answers
+  - Enter your query and select the query mode
+- Click 'Show Knowledge Graph' to visualise the KG
+
+## Notes
+- Only markdown files are supported for ingestion (images in an `/images` subfolder are ignored for now). <br>NB: other formats will be enabled later: PDF, txt, HTML...
+- To generate markdown from documents (PDF, Word, HTML), use our ParserPDF tool: [GitHub][3] | [HF Space][4].
+- All user-facing text is in UK English
+- For advanced configuration, see the LightRAG documentation
+
+## Roadmap (no defined timeline)
+- HuggingFace login
 - [ParserPDF][3] integration
 
+## License
+[MIT][2]
 
+[1]: https://github.com/HKUDS/LightRAG "LightRAG GitHub"
 [2]: https://opensource.org/license/mit "MIT License"
+[3]: https://github.com/semmyk-research/parserPDF "ParserPDF (GitHub)"
+[4]: https://huggingface.co/spaces/semmyk/parserPDF "ParserPDF (HF Space)"
app.py
CHANGED
@@ -6,6 +6,7 @@ import gradio as gr
 #from watchfiles import run_process ##gradio reload watch
 from app_gradio_lightrag import LightRAGApp ##SMY lightrag logging
 from utils.llm_login import get_login_token
 
 import asyncio
 import nest_asyncio
@@ -17,6 +18,7 @@ from dotenv import load_dotenv
 # Load environment variables
 load_dotenv()
 
 # Pythonic error handling decorator
 def handle_errors(func):
     def wrapper(*args, **kwargs):
@@ -25,6 +27,7 @@ def handle_errors(func):
         except Exception as e:
             return gr.update(value=f"Error: {e}")
     return wrapper
 
 # Instantiate app logic
 #app_logic = LightRAGApp() ## See main()
@@ -67,39 +70,64 @@ def gradio_ui(app_logic: LightRAGApp):
     """)
 
     # Step 0: Section 1
     # Define openai_api textbox initial value
     openai_api_key_init = os.getenv("OPENAI_API_KEY", "jan-ai")
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
 
     # Step 1: Section 2
     with gr.Accordion("🤗 HuggingFace Client Control", open=True): #, open=False):
         # HuggingFace controls
         hf_login_logout_btn = gr.LoginButton(value="Sign in to HuggingFace 🤗", logout_value="Logout of HF: ({}) 🤗", variant="huggingface")
@@ -113,7 +141,7 @@ def gradio_ui(app_logic: LightRAGApp):
     gr.HTML("<hr>") #gr.Markdown("---")
 
     with gr.Row():
-        index_btn = gr.Button("Index Documents")
         stop_btn = gr.Button("Stop", variant="stop") ## Add cancel event button
         query_text_tb = gr.Textbox(label="Your Query")
         mode_dd = gr.Dropdown(["naive", "local", "global", "hybrid", "mix"], value="hybrid", label="Query Mode")
@@ -135,6 +163,7 @@ def gradio_ui(app_logic: LightRAGApp):
     st_openai_key = gr.State(value=openai_api_key_init) #gr.State("")
     st_password1 = gr.State(value="password")
     st_password2 = gr.State(value="password")
 
 
     ### Change handling
@@ -217,9 +246,9 @@ def gradio_ui(app_logic: LightRAGApp):
 
 
     # Button logic with async handling
-    async def setup_wrapper(df, wd, llm_back, embed_back, oai, base, base_embed, model, model_embed, host, embedkey):
-        return await app_logic.setup(df, wd, llm_back, embed_back, oai,
-                                     base, base_embed, model, model_embed, host, embedkey)
 
     async def index_wrapper(df):
         return await app_logic.index_documents(df)
@@ -241,6 +270,13 @@ def gradio_ui(app_logic: LightRAGApp):
     #hf_login_logout_btn.click(update_state_stored_value, inputs=openai_key_tb, outputs=st_openai_key)
     hf_login_logout_btn.click(fn=custom_do_logout, inputs=openai_key_tb, outputs=[hf_login_logout_btn, st_openai_key])
 
     toggle_btn_openai_key.click(
         fn=toggle_password,
         inputs=[st_password1],
@@ -253,10 +289,14 @@ def gradio_ui(app_logic: LightRAGApp):
         outputs=[openai_key_embed_tb, toggle_btn_openai_key_embed, st_password2],
         show_progress="hidden"
     )
-
-
                   openai_baseurl_tb, openai_baseurl_embed_tb, llm_model_name_tb, llm_model_embed_tb,
-                  ollama_host_tb, openai_key_embed_tb]
 
     setup_btn.click(
         fn=setup_wrapper,
@@ -267,7 +307,7 @@ def gradio_ui(app_logic: LightRAGApp):
     )
     index_btn.click(
         fn=index_wrapper,
-        inputs=[data_folder_tb],
         outputs=[status_box, progress_tb],
         show_progress=True
     )
 #from watchfiles import run_process ##gradio reload watch
 from app_gradio_lightrag import LightRAGApp ##SMY lightrag logging
 from utils.llm_login import get_login_token
+from utils.file_utils import accumulate_dir
 
 import asyncio
 import nest_asyncio
 
 # Load environment variables
 load_dotenv()
 
+'''
 # Pythonic error handling decorator
 def handle_errors(func):
     def wrapper(*args, **kwargs):
         except Exception as e:
             return gr.update(value=f"Error: {e}")
     return wrapper
+'''
 
 # Instantiate app logic
 #app_logic = LightRAGApp() ## See main()
 
     """)
 
     # Step 0: Section 1
+
+    # Define ext type (in lieu of getting from global var)
+    #ext = (".md", "md") #SMY disused: 'tuple' object has no attribute '_id'
     # Define openai_api textbox initial value
     openai_api_key_init = os.getenv("OPENAI_API_KEY", "jan-ai")
+
+    with gr.Sidebar(position="right"):
+        system_prompt_tb = gr.Textbox(
+            value="You are a helpful assistant. You answer questions based on the provided context.", # If you don't know the answer, just say so. Don't make up information.",
+            label="System Prompt",
+            lines=3,
+            interactive=True,
+            show_copy_button=True,
+        )
+
+        with gr.Accordion(label="🛞 LLM settings", open=False):
+            with gr.Row():
+                llm_backend_cb = gr.Radio(["OpenAI", "Ollama", "GenAI"], value="OpenAI", label="LLM Backend: OpenAI, Local or GenAI")
+                llm_model_name_tb = gr.Textbox(value=os.getenv("LLM_MODEL", "openai/gpt-oss-120b"), label="LLM Model Name", show_copy_button=True) #.split('/')[1], label="LLM Model Name") "meta-llama/Llama-4-Maverick-17B-128E-Instruct")), #image-Text-to-Text #"openai/gpt-oss-120b",
+            with gr.Row():
+                with gr.Row(): #elem_classes="password-box"):
+                    #openai_key_tb = gr.Textbox(value=os.getenv("OPENAI_API_KEY", "jan-ai"), label="OpenAI API Key",
+                    #                           type="password", elem_classes="password-box", container=False, interactive=True, info="OpenAI API Key") #, show_copy_button=True)
+                    openai_key_tb = gr.Textbox(value=openai_api_key_init, label="OpenAI API Key",
+                                               type="password", elem_classes="password-box", container=False, interactive=True, info="OpenAI API Key") #, show_copy_button=True)
+                    toggle_btn_openai_key = gr.Button(
+                        value="👁️", # Initial eye icon
+                        elem_classes="icon-button", size="sm") #, min_width=50)
+            with gr.Row():
+                openai_baseurl_tb = gr.Textbox(value=os.getenv("OPENAI_API_BASE", "https://router.huggingface.co/v1"), label="OpenAI baseurl", show_copy_button=True)
+                ollama_host_tb = gr.Textbox(value=os.getenv("OLLAMA_HOST", "http://localhost:1234/v1"), label="Ollama Host", show_copy_button=True)
+                #ollama_host_tb = gr.Textbox(value=os.getenv("OPENAI_API_EMBED_BASE", ""), label="Ollama Host")
+            with gr.Row():
+                openai_baseurl_embed_tb = gr.Textbox(value=os.getenv("OPENAI_API_EMBED_BASE", "http://localhost:1234/v1"), label="LLM Embed baseurl", show_copy_button=True)
+                llm_model_embed_tb = gr.Textbox(value=os.getenv("LLM_MODEL_EMBED", "text-embedding-bge-m3"), label="LLM Embedding Model", show_copy_button=True) #.split('/')[1], label="Embedding Model")
+            with gr.Row():
+                embed_backend_dd = gr.Dropdown(choices=["Transformer", "Provider"], value="Provider", label="Embedding Type")
+                with gr.Row(): #elem_classes="password-box"):
+                    openai_key_embed_tb = gr.Textbox(value=os.getenv("OPENAI_API_KEY_EMBED", "jan-ai"), label="LLM API Key Embed", #lm-studio
+                                                     type="password", elem_classes="password-box", container=False, interactive=True, info="LLM API Key Embed") #, show_copy_button=True)
+                    toggle_btn_openai_key_embed = gr.Button(
+                        value="👁️", # Initial eye icon
+                        elem_classes="icon-button", size="sm") #, min_width=50)
+            #openai_key_embed_tb = gr.Textbox(value=os.getenv("OPENAI_API_KEY_EMBED", "jan-ai"), label="OpenAI API Key Embed", type="password", show_copy_button=True) #("OLLAMA_API_KEY", ""), label="OpenAI API Key Embed", type="password")
 
     # Step 1: Section 2
+    with gr.Row():
+        with gr.Column():
+            #data_folder_tb = gr.Textbox(value="dataset/data/docs2", label="Data Folder (markdown only)", show_copy_button=True)
+            dir_btn = gr.UploadButton(
+                #value='dataset/data/', #docs2 #[Errno 13] Permission denied
+                label="📁 Upload Folder",
+                #file_types=ext, #["file"],
+                file_count="directory",
+            )
+            upload_count_md = gr.Markdown(visible=False)
+            working_dir_tb = gr.Textbox(value="./working_folder1", label="lightRAG working folder", show_copy_button=True)
+            working_dir_reset_cb = gr.Checkbox(value=False, label="Reset working files?")
     with gr.Accordion("🤗 HuggingFace Client Control", open=True): #, open=False):
         # HuggingFace controls
         hf_login_logout_btn = gr.LoginButton(value="Sign in to HuggingFace 🤗", logout_value="Logout of HF: ({}) 🤗", variant="huggingface")
 
     gr.HTML("<hr>") #gr.Markdown("---")
 
     with gr.Row():
+        index_btn = gr.Button("Index Documents", interactive=False)
         stop_btn = gr.Button("Stop", variant="stop") ## Add cancel event button
         query_text_tb = gr.Textbox(label="Your Query")
         mode_dd = gr.Dropdown(["naive", "local", "global", "hybrid", "mix"], value="hybrid", label="Query Mode")
 
     st_openai_key = gr.State(value=openai_api_key_init) #gr.State("")
     st_password1 = gr.State(value="password")
     st_password2 = gr.State(value="password")
+    state_uploaded_file_list = gr.State(value=[])
 
 
     ### Change handling
 
 
     # Button logic with async handling
+    async def setup_wrapper(df, wd, wd_reset, llm_back, embed_back, oai, base, base_embed, model, model_embed, host, embedkey, sys_prompt):
+        return await app_logic.setup(df, wd, wd_reset, llm_back, embed_back, oai,
+                                     base, base_embed, model, model_embed, host, embedkey, sys_prompt)
 
     async def index_wrapper(df):
         return await app_logic.index_documents(df)
 
     #hf_login_logout_btn.click(update_state_stored_value, inputs=openai_key_tb, outputs=st_openai_key)
     hf_login_logout_btn.click(fn=custom_do_logout, inputs=openai_key_tb, outputs=[hf_login_logout_btn, st_openai_key])
 
+    dir_btn.upload(
+        fn=accumulate_dir,
+        inputs=[dir_btn, state_uploaded_file_list],
+        outputs=[state_uploaded_file_list, index_btn, upload_count_md, status_box],
+        show_progress="hidden"
+    )
+
     toggle_btn_openai_key.click(
         fn=toggle_password,
         inputs=[st_password1],
 
         outputs=[openai_key_embed_tb, toggle_btn_openai_key_embed, st_password2],
         show_progress="hidden"
     )
+    '''
+    async def setup(self, data_folder: str, working_dir: str, wdir_reset: bool, llm_backend: str, embed_backend: str, openai_key: str,
+                    openai_baseurl: str, openai_baseurl_embed: str, llm_model_name: str, llm_model_embed: str,
+                    ollama_host: str, embed_key: str, system_prompt: str) -> str:
+    '''
+    inputs_arg = [state_uploaded_file_list, working_dir_tb, working_dir_reset_cb, llm_backend_cb, embed_backend_dd, st_openai_key, #openai_key_tb,
                   openai_baseurl_tb, openai_baseurl_embed_tb, llm_model_name_tb, llm_model_embed_tb,
+                  ollama_host_tb, openai_key_embed_tb, system_prompt_tb] #data_folder_tb,
 
     setup_btn.click(
         fn=setup_wrapper,
 
     )
     index_btn.click(
         fn=index_wrapper,
+        inputs=state_uploaded_file_list, #[data_folder_tb],
         outputs=[status_box, progress_tb],
         show_progress=True
    )
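The `dir_btn.upload(fn=accumulate_dir, ...)` wiring above accumulates uploaded folder contents into session state. The real `accumulate_dir` lives in `utils/file_utils.py` and is not shown in this diff; the sketch below is an assumption of its core logic (the actual handler also returns `gr.update(...)` values to enable the Index button and reveal the upload count).

```python
from pathlib import Path

# Hypothetical sketch of an upload accumulator like utils/file_utils.accumulate_dir.
# gr.UploadButton(file_count="directory") passes a list of temp file paths; only
# markdown files are kept, per the README's markdown-only ingestion note.
def accumulate_dir(new_files: list[str], file_list: list[str]) -> tuple[list[str], str]:
    markdown = [f for f in (new_files or []) if Path(f).suffix.lower() == ".md"]
    merged = list(file_list) + [f for f in markdown if f not in file_list]
    status = f"{len(merged)} markdown file(s) staged for indexing"
    return merged, status
```

Keeping the accumulated list in `gr.State` means repeated uploads extend, rather than replace, the files queued for `index_documents`.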
app_gradio_lightrag.py
CHANGED
|
@@ -10,7 +10,7 @@ import random
|
|
| 10 |
from functools import partial
|
| 11 |
from typing import Tuple, Optional, Any, List, Union
|
| 12 |
|
| 13 |
-
import
|
| 14 |
|
| 15 |
def install(package):
|
| 16 |
import subprocess
|
|
@@ -86,9 +86,7 @@ def configure_logging():
|
|
| 86 |
# Get log directory path from environment variable or use current directory
|
| 87 |
#log_dir = os.getenv("LOG_DIR", os.getcwd())
|
| 88 |
log_dir = os.getenv("LOG_DIR", "logs")
|
| 89 |
-
|
| 90 |
-
os.path.join(log_dir, "lightrag_compatible_demo.log")
|
| 91 |
-
)'''
|
| 92 |
if log_dir:
|
| 93 |
log_file_path = Path(log_dir) / "lightrag_logs.log"
|
| 94 |
else:
|
|
@@ -165,15 +163,25 @@ def visualise_graphml(graphml_path: str, working_dir: str) -> str:
|
|
| 165 |
## Load the GraphML file
|
| 166 |
G = nx.read_graphml(graphml_path)
|
| 167 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 168 |
## Create a Pyvis network
|
| 169 |
#net = Network(height="100vh", notebook=True)
|
| 170 |
-
net = Network(notebook=True, width="100%", height="
|
| 171 |
-
#
|
| 172 |
net.from_nx(G)
|
| 173 |
|
| 174 |
# Add colors and title to nodes
|
| 175 |
for node in net.nodes:
|
| 176 |
-
node["color"] = "#{:06x}".format(random.randint(0, 0xFFFFFF))
|
|
|
|
| 177 |
if "description" in node:
|
| 178 |
node["title"] = node["description"]
|
| 179 |
|
|
@@ -184,22 +192,32 @@ def visualise_graphml(graphml_path: str, working_dir: str) -> str:
|
|
| 184 |
|
| 185 |
## Set the 'physics' attribute to repulsion
|
| 186 |
net.repulsion(node_distance=120, spring_length=200)
|
| 187 |
-
net.show_buttons(filter_=['physics']) ##SMY: dynamically modify the network
|
| 188 |
#net.show_buttons()
|
| 189 |
|
| 190 |
## graph path
|
| 191 |
-
kg_viz_html_file = "
|
| 192 |
-
#html_path = os.path.join(working_dir, kg_viz_html_file)
|
| 193 |
html_path = Path(working_dir) / kg_viz_html_file
|
| 194 |
|
| 195 |
-
#net.save_graph(html_path)
|
| 196 |
## Save and display the generated KG network html
|
|
|
|
| 197 |
#net.show(html_path)
|
| 198 |
net.show(str(html_path), local=True, notebook=False)
|
| 199 |
|
| 200 |
-
#
|
| 201 |
-
|
| 202 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 203 |
|
| 204 |
# Utility: Get all markdown files in a folder
|
| 205 |
def get_markdown_files(folder: str) -> list[str]:
|
|
@@ -242,6 +260,7 @@ class LightRAGApp:
|
|
| 242 |
|
| 243 |
if custom_system_prompt:
|
| 244 |
self.system_prompt = custom_system_prompt
|
|
|
|
| 245 |
else:
|
| 246 |
self.system_prompt = """
|
| 247 |
You are a domain expert on Cybersecurity, the South Africa landscape and South African legislation.
|
|
@@ -269,8 +288,9 @@ class LightRAGApp:
|
|
| 269 |
- For instance, maintain a single node for Protection of Information Act, Protection of Information Act, 1982, Protection of Information Act No 84, 1982.
|
| 270 |
- However, have a separate node for Protection of Personal Information Act, 2013; as it it a separate legislation.
|
| 271 |
- Also take note that 'Republic of South Africa' is an offical geo entity while 'South Africa' is a referred to place, although also a geo entity:
|
| 272 |
-
- Always watch the context and
|
| 273 |
"""
|
|
|
|
| 274 |
|
| 275 |
return self.system_prompt
|
| 276 |
|
|
@@ -358,7 +378,7 @@ class LightRAGApp:
|
|
| 358 |
logger.debug(f"Sending messages to Gemini: Model: {self.llm_model_name.rpartition('/')[-1]} \n~ Message: {prompt}")
|
| 359 |
logger_kg.log(level=20, msg=f"Sending messages to Gemini: Model: {self.llm_model_name.rpartition('/')[-1]} \n~ Message: {prompt}")
|
| 360 |
|
| 361 |
-
# 2.
|
| 362 |
client = Client(api_key=self.llm_api_key) #api_key=gemini_api_key
|
| 363 |
#aclient = genai.Client(api_key=self.llm_api_key).aio # use AsyncClient
|
| 364 |
|
|
@@ -454,14 +474,30 @@ class LightRAGApp:
|
|
| 454 |
|
| 455 |
def _ensure_working_dir(self) -> str:
|
| 456 |
"""Ensure working directory exists and return status message"""
|
| 457 |
-
|
| 458 |
-
os.makedirs(self.working_dir, exist_ok=True)
|
| 459 |
-
return f"Created working directory: {self.working_dir}"'''
|
| 460 |
if not Path(self.working_dir).exists():
|
| 461 |
check_create_dir(self.working_dir)
|
| 462 |
return f"Created working directory: {self.working_dir}"
|
| 463 |
return f"Working directory exists: {self.working_dir}"
|
| 464 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 465 |
|
| 466 |
async def _initialise_storages(self) -> str:
|
| 467 |
#def _initialise_storages(self) -> str:
|
|
@@ -487,31 +523,11 @@ class LightRAGApp:

         #print(f"_embedding_func: llm_api_key_embed: {self.llm_api_key_embed}")
         #print(f"_embedding_func: llm_baseurl_embed: {self.llm_baseurl_embed}")

-        # Clear old data files
-        #wrap_async(self._clear_old_data_files)
-        #await self._clear_old_data_files()
-        """Clear old data files"""
-        files_to_delete = [
-            "graph_chunk_entity_relation.graphml",
-            "kv_store_doc_status.json",
-            "kv_store_full_docs.json",
-            "kv_store_text_chunks.json",
-            "vdb_chunks.json",
-            "vdb_entities.json",
-            "vdb_relationships.json",
-        ]
-
-        for file in files_to_delete:
-            '''file_path = os.path.join(self.working_dir, file)
-            if os.path.exists(file_path):
-                os.remove(file_path)
-                print(f"Deleting old file:: {file_path}")'''
-            file_path = Path(self.working_dir) / file
-            if file_path.exists():
-                file_path.unlink()
-                logger_kg.log(level=20, msg=f"LightRAG class: Deleting old files", extra={"filepath": file_path.name})
-
         # Get embedding
         if self.embed_backend == "Transformer" or self.embed_backend[0] == "Transformer":
             logger_kg.log(level=20, msg=f"Getting embeddings dynamically through _embedding_func: ",
@@ -549,7 +565,7 @@ class LightRAGApp:

         await self._initialise_storages()

         #await rag.initialize_storages()
-        #await initialize_pipeline_status() ##SMY: still relevant in updated lightRAG? - """Asynchronously

         self.status = f"Storages and pipeline initialised successfully" ##SMY: debug
         logger_kg.log(level=20, msg=f"Storages and pipeline initialised successfully")
@@ -561,9 +577,9 @@ class LightRAGApp:

     @handle_errors
     #def setup(self, data_folder: str, working_dir: str, llm_backend: str,
-    async def setup(self, data_folder: str, working_dir: str, llm_backend: str, embed_backend: str,
                 openai_key: str, openai_baseurl: str, openai_baseurl_embed: str, llm_model_name: str,
-                llm_model_embed: str, ollama_host: str, embed_key: str) -> str:
         """Set up LightRAG with specified configuration"""
         # Configure environment
         #os.environ["OPENAI_API_KEY"] = openai_key or os.getenv("OPENAI_API_KEY", "")
@@ -573,8 +589,9 @@ class LightRAGApp:

         #os.environ["OPENAI_API_EMBED_BASE"] = openai_baseurl_embed or os.getenv("OPENAI_API_EMBED_BASE") #, "http://localhost:1234/v1/embeddings")

         # Update instance state
-        self.data_folder = data_folder
         self.working_dir = working_dir
         self.llm_backend = llm_backend
         self.embed_backend = embed_backend if isinstance(embed_backend, str) else embed_backend[0],
         self.llm_model_name = llm_model_name
@@ -592,7 +609,7 @@ class LightRAGApp:

         except Exception as e:
             self.status = f"LightRAG initialisation.setup: working dir err | {str(e)}"

-        #
         try:
             #self.rag = wrap_async( self._initialise_rag)
             self.rag = await self._initialise_rag()
@@ -628,15 +645,18 @@ class LightRAGApp:

     '''

     @handle_errors
-    async def index_documents(self, data_folder: str) -> Tuple[str, str]:
     #def index_documents(self, data_folder: str) -> Tuple[str, str]:
         """Index markdown documents with progress tracking"""
         if not self._is_initialised or self.rag is None:
             return "Please initialise LightRAG first using the 'Initialise App' button.", "Not started"

-        md_files = get_markdown_files(data_folder)
         if not md_files:
-            return f"No markdown files found in {data_folder}:", "No files"

         try:
             total_files = len(md_files)
@@ -739,9 +759,7 @@ class LightRAGApp:

         """Display knowledge graph visualisation"""
         ## graphml_path: defaults to lightRAG's generated graph_chunk_entity_relation.graphml
         ## working_dir: lightRAG's working directory set by user
-
-        if not os.path.exists(graphml_path):
-            return "Knowledge graph file not found. Please index documents first to generate Knowledge Graph."'''
         graphml_path = Path(self.working_dir) / "graph_chunk_entity_relation.graphml"
         if not Path(graphml_path).exists():
             return "Knowledge graph file not found. Please index documents first to generate Knowledge Graph."
@@ -759,59 +777,6 @@ class LightRAGApp:


         ############
-        '''
-        ##SMY: //TODO: Gradio toggle button
-        def _clear_old_data_files(self):
-            """Clear old data files"""
-            files_to_delete = [
-                "graph_chunk_entity_relation.graphml",
-                "kv_store_doc_status.json",
-                "kv_store_full_docs.json",
-                "kv_store_text_chunks.json",
-                "vdb_chunks.json",
-                "vdb_entities.json",
-                "vdb_relationships.json",
-            ]
-
-            for file in files_to_delete:
-                file_path = Path(self.working_dir) / file
-                if file_path.exists():
-                    file_path.unlink()
-                    logger_kg.log(level=20, msg=f"LightRAG class: Deleting old files", extra={"filepath": file_path.name})'''
-        '''
-
-        async def _get_llm_functions(self) -> Tuple[callable, callable]:
-        #def _get_llm_functions(self) -> Tuple[callable, callable]:
-            """Get LLM and embedding functions based on backend"""
-            try:
-                # Get embedding dimension dynamically
-                try:
-                    embedding_dimension = await self._get_embedding_dim()
-                    self.status = f"Using embedding dimension: {embedding_dimension}"
-                    logger_kg.log(level=20, msg=f"Using embedding dimension: {embedding_dimension}")
-                except Exception as e:
-                    # feedback dimensions error
-                    self.status = f"_get_llm_function: embedding_dim error with fallback: {str(e)}"
-
-                # Create embedding function wrapper: # Wrap with EmbeddingFunc to provide required attributes
-                embed_func = EmbeddingFunc(
-                    embedding_dim=embedding_dimension,
-                    max_token_size=8192, #4096, #8192, # Conservative default | #ollama
-                    func=self._embedding_func
-                )
-
-                # Get LLM function
-                #llm_func = await self._llm_model_func ##SMY: not used
-
-                # return LLM and embed functions
-                #return llm_func, embed_func
-                return await self._llm_model_func(), embed_func
-
-            except Exception as e:
-                self.status = f"{self.status} \n| _get_llm_functions error: {str(e)}"
-                logger_kg.log(level=30, msg=f"{self.status} \n| _get_llm_functions error: {str(e)}")
-                raise # Re-raise to be caught by the setup method
-        '''

         '''
         ##SMY: record only. for deletion
 from functools import partial
 from typing import Tuple, Optional, Any, List, Union

+from utils.utils import get_time_now_str ##SMY lightrag_openai_compatible_demo.py

 def install(package):
     import subprocess
 # Get log directory path from environment variable or use current directory
 #log_dir = os.getenv("LOG_DIR", os.getcwd())
 log_dir = os.getenv("LOG_DIR", "logs")
+
 if log_dir:
     log_file_path = Path(log_dir) / "lightrag_logs.log"
 else:
     ## Load the GraphML file
     G = nx.read_graphml(graphml_path)

+    ## Dynamically size nodes
+    # Calculate node attributes for sizing
+    node_degrees = dict(G.degree())
+
+    # Scale node degrees for better visual differentiation
+    max_degree = max(node_degrees.values())
+    for node, degree in node_degrees.items():
+        G.nodes[node]['size'] = 10 + (degree / max_degree) * 80 #40 # scaling
+
     ## Create a Pyvis network
     #net = Network(height="100vh", notebook=True)
+    net = Network(notebook=True, width="100%", height="100vh") #, heading=f"Knowledge Graph Visualisation") #(noteboot=False) height="600px",
+    # Convert NetworkX graph to Pyvis network
     net.from_nx(G)

     # Add colors and title to nodes
     for node in net.nodes:
+        #node["color"] = "#{:06x}".format(random.randint(0, 0xFFFFFF))
+        node["color"] = "#{:01x}".format(random.randint(0, 0xFFFFFF))
         if "description" in node:
             node["title"] = node["description"]

     ## Set the 'physics' attribute to repulsion
     net.repulsion(node_distance=120, spring_length=200)
+    net.show_buttons(filter_=['physics', 'layout']) ##SMY: dynamically modify the network
     #net.show_buttons()

     ## graph path
+    kg_viz_html_file = f"kg_viz_{get_time_now_str(date_format='%Y-%m-%d')}.html"
     html_path = Path(working_dir) / kg_viz_html_file

     ## Save and display the generated KG network html
+    #net.save_graph(html_path)
     #net.show(html_path)
     net.show(str(html_path), local=True, notebook=False)

+    # get HTML content
+    html_iframe = net.generate_html(str(html_path), local=True, notebook=False)
+    ## need to remove ' from HTML ##assist: https://huggingface.co/spaces/simonduerr/pyvisdemo/blob/main/app.py
+    html_iframe = html_iframe.replace("'", "\"")
+
+    ##SMY display generated KG html
+    #'''
+    return gr.update(show_label=True, container=True, value=f"""<iframe style="width: 100%; height: 100vh;margin:0 auto" name="result" allow="midi; geolocation; microphone; camera;
+                        display-capture; encrypted-media;" sandbox="allow-modals allow-forms
+                        allow-scripts allow-same-origin allow-popups
+                        allow-top-navigation-by-user-activation allow-downloads" allowfullscreen=""
+                        allowpaymentrequest="" frameborder="0" srcdoc='{html_iframe}'></iframe>"""
+            )
+    #'''

 # Utility: Get all markdown files in a folder
 def get_markdown_files(folder: str) -> list[str]:
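The node-sizing block added above scales each Pyvis node by its graph degree, and the colour line formats a random integer as a hex colour. A self-contained sketch of both calculations (plain dicts instead of a `networkx` graph; function names are illustrative). Note that `'{:06x}'` zero-pads to a valid 6-digit CSS colour, which the `'{:01x}'` variant in the commit does not guarantee for small values:

```python
def scale_node_sizes(node_degrees: dict[str, int], base: int = 10, spread: int = 80) -> dict[str, float]:
    """Map each node's degree to a display size in [base, base + spread]."""
    max_degree = max(node_degrees.values())
    return {node: base + (degree / max_degree) * spread
            for node, degree in node_degrees.items()}

def hex_colour(value: int) -> str:
    """Format an integer in [0, 0xFFFFFF] as a zero-padded 6-digit CSS hex colour."""
    return "#{:06x}".format(value)
```

The highest-degree node always gets size `base + spread`, so hub entities stand out in the rendered graph.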
         if custom_system_prompt:
             self.system_prompt = custom_system_prompt
+        ''' ## system_prompt now in gradio ui
         else:
             self.system_prompt = """
             You are a domain expert on Cybersecurity, the South Africa landscape and South African legislation.

             - For instance, maintain a single node for Protection of Information Act, Protection of Information Act, 1982, Protection of Information Act No 84, 1982.
             - However, have a separate node for Protection of Personal Information Act, 2013; as it is a separate legislation.
             - Also take note that 'Republic of South Africa' is an official geo entity while 'South Africa' is a referred-to place, although also a geo entity:
+            - Always watch the context and be careful of lumping them together.
             """
+        '''

         return self.system_prompt
         logger.debug(f"Sending messages to Gemini: Model: {self.llm_model_name.rpartition('/')[-1]} \n~ Message: {prompt}")
         logger_kg.log(level=20, msg=f"Sending messages to Gemini: Model: {self.llm_model_name.rpartition('/')[-1]} \n~ Message: {prompt}")

+        # 2. Initialise the GenAI Client with Gemini API Key
         client = Client(api_key=self.llm_api_key) #api_key=gemini_api_key
         #aclient = genai.Client(api_key=self.llm_api_key).aio # use AsyncClient
     def _ensure_working_dir(self) -> str:
         """Ensure working directory exists and return status message"""
+
         if not Path(self.working_dir).exists():
             check_create_dir(self.working_dir)
             return f"Created working directory: {self.working_dir}"
         return f"Working directory exists: {self.working_dir}"

+    ##SMY: //TODO: Gradio toggle button
+    async def _clear_old_data_files(self):
+        """Clear old data files"""
+        files_to_delete = [
+            "graph_chunk_entity_relation.graphml",
+            "kv_store_doc_status.json",
+            "kv_store_full_docs.json",
+            "kv_store_text_chunks.json",
+            "vdb_chunks.json",
+            "vdb_entities.json",
+            "vdb_relationships.json",
+        ]
+
+        for file in files_to_delete:
+            file_path = Path(self.working_dir) / file
+            if file_path.exists():
+                file_path.unlink()
+                logger_kg.log(level=20, msg=f"LightRAG class: Deleting old files", extra={"filepath": file_path.name})

     async def _initialise_storages(self) -> str:
     #def _initialise_storages(self) -> str:
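The newly added `_clear_old_data_files` resets the working folder by deleting LightRAG's generated storage files with `Path.unlink`. A minimal synchronous sketch of the same reset logic (file names taken from the diff; logging omitted):

```python
from pathlib import Path

# Storage files LightRAG generates in its working directory (per the diff above)
LIGHTRAG_DATA_FILES = [
    "graph_chunk_entity_relation.graphml",
    "kv_store_doc_status.json",
    "kv_store_full_docs.json",
    "kv_store_text_chunks.json",
    "vdb_chunks.json",
    "vdb_entities.json",
    "vdb_relationships.json",
]

def clear_old_data_files(working_dir: str) -> list[str]:
    """Delete known LightRAG storage files; return the names actually removed."""
    removed = []
    for name in LIGHTRAG_DATA_FILES:
        file_path = Path(working_dir) / name
        if file_path.exists():
            file_path.unlink()
            removed.append(name)
    return removed
```

Returning the removed names (instead of only logging) makes the reset easy to surface in a UI status box.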
         #print(f"_embedding_func: llm_api_key_embed: {self.llm_api_key_embed}")
         #print(f"_embedding_func: llm_baseurl_embed: {self.llm_baseurl_embed}")

+        if self.working_dir_reset:
+            # Clear old data files
+            await self._clear_old_data_files()
+
         # Get embedding
         if self.embed_backend == "Transformer" or self.embed_backend[0] == "Transformer":
             logger_kg.log(level=20, msg=f"Getting embeddings dynamically through _embedding_func: ",
         await self._initialise_storages()

         #await rag.initialize_storages()
+        #await initialize_pipeline_status() ##SMY: still relevant in updated lightRAG? - """Asynchronously finalise the storages"""

         self.status = f"Storages and pipeline initialised successfully" ##SMY: debug
         logger_kg.log(level=20, msg=f"Storages and pipeline initialised successfully")
     @handle_errors
     #def setup(self, data_folder: str, working_dir: str, llm_backend: str,
+    async def setup(self, data_folder: str, working_dir: str, wdir_reset: bool, llm_backend: str, embed_backend: str,
                 openai_key: str, openai_baseurl: str, openai_baseurl_embed: str, llm_model_name: str,
+                llm_model_embed: str, ollama_host: str, embed_key: str, system_prompt: str) -> str:
         """Set up LightRAG with specified configuration"""
         # Configure environment
         #os.environ["OPENAI_API_KEY"] = openai_key or os.getenv("OPENAI_API_KEY", "")

         #os.environ["OPENAI_API_EMBED_BASE"] = openai_baseurl_embed or os.getenv("OPENAI_API_EMBED_BASE") #, "http://localhost:1234/v1/embeddings")

         # Update instance state
+        self.data_folder = data_folder ##SMY: redundant
         self.working_dir = working_dir
+        self.working_dir_reset = wdir_reset
         self.llm_backend = llm_backend
         self.embed_backend = embed_backend if isinstance(embed_backend, str) else embed_backend[0],
         self.llm_model_name = llm_model_name
         except Exception as e:
             self.status = f"LightRAG initialisation.setup: working dir err | {str(e)}"

+        # Initialise lightRAG with storages
         try:
             #self.rag = wrap_async( self._initialise_rag)
             self.rag = await self._initialise_rag()
     '''

     @handle_errors
+    async def index_documents(self, data_folder: Union[list[str], str]) -> Tuple[str, str]:
     #def index_documents(self, data_folder: str) -> Tuple[str, str]:
         """Index markdown documents with progress tracking"""
         if not self._is_initialised or self.rag is None:
             return "Please initialise LightRAG first using the 'Initialise App' button.", "Not started"

+        #md_files = get_markdown_files(data_folder) #data_folder is now a list of uploaded files
+        #if not md_files:
+        #    return f"No markdown files found in {data_folder}:", "No files"
+        md_files = data_folder
         if not md_files:
+            return f"No markdown files uploaded {data_folder}:", "No files"

         try:
             total_files = len(md_files)
         """Display knowledge graph visualisation"""
         ## graphml_path: defaults to lightRAG's generated graph_chunk_entity_relation.graphml
         ## working_dir: lightRAG's working directory set by user
+
         graphml_path = Path(self.working_dir) / "graph_chunk_entity_relation.graphml"
         if not Path(graphml_path).exists():
             return "Knowledge graph file not found. Please index documents first to generate Knowledge Graph."


         ############

         '''
         ##SMY: record only. for deletion
utils/file_utils.py
CHANGED

@@ -131,6 +131,40 @@ def create_temp_folder(tempfolder: Optional[str | Path] = '', program_name: str

     return output_dir

+def accumulate_dir(uploaded_files, current_state, ext: Union[str, tuple] = (".md", "md")):
+    """ accumulate uploaded files in dir based on ext with the existing state """
+
+    import gradio as gr
+
+    # Initialise state if it's the first run
+    if current_state is None:
+        current_state = []
+
+    # Check if files were uploaded in the current iteration, return the current state.
+    if not uploaded_files:
+        return current_state, gr.update(), gr.update(visible=True, value="No new files uploaded"), gr.update(value="No new files uploaded")
+
+    # call is_file_with_extension to check if pathlib.Path object is a file and has a non-empty extension
+    #new_file_paths = [f.name for f in uploaded_files if is_file_with_extension(Path(f.name))] #Path(f.name) and Path(f.name).is_file() and bool(Path(f.name).suffix)] #Path(f.name).suffix.lower() !=""]
+    new_file_paths = [f.name for f in uploaded_files if is_file_with_extension(Path(f.name)) and f.name.endswith(ext)]
+
+    # Concatenate the new files with the existing ones in the state
+    updated_files = current_state + new_file_paths
+    updated_filenames = [Path(f).name for f in updated_files] ##SMY: filenames only
+
+    updated_files_count = len(updated_files)
+
+    # Return the updated state and a message to the user
+    filename_info = "\n".join(updated_filenames) ##SMY: not used(updated_filenames)
+    #message = f"Accumulated {len(updated_files)} file(s) total: \n{filename_info}"
+    message_count = f"Accumulated {updated_files_count} file(s) total."
+    message = f"Accumulated {updated_files_count} file(s) total: \n{filename_info}"
+
+    #outputs=[state_uploaded_file_list, dir_btn, upload_count_md, status_box],
+    #return updated_files, updated_files_count, message, gr.update(interactive=True), gr.update(interactive=True)
+    return updated_files, gr.update(interactive=True,), gr.update(visible=True, value=message_count), gr.update(value=message)

 ##=========
 def find_file(file_name: str) -> Path: #configparser.ConfigParser:

@@ -194,6 +228,8 @@ def resolve_grandparent_object(gp_object:str):

 ###
 # Create a Path object based on current file's location, resolve it to an absolute path,
 # and then get its parent's parent using chained .parent calls or the parents[] attribute.
+
+#import sys

 # 1. Get the current script's path, its parent and its grandparent directory
 try:

@@ -339,7 +375,7 @@ def accumulate_files(uploaded_files, current_state):

     from globals_config import config_load
     import gradio as gr
-    #
+    # Initialise state if it's the first run
     if current_state is None:
         current_state = []
utils/llm_login.py
CHANGED

@@ -29,7 +29,7 @@ def get_login_token( api_token_arg, oauth_token):

 def login_huggingface(token: Optional[str] = None):
     """
-    Login to Hugging Face account.
+    Login to Hugging Face account. Prioritise CLI login for privacy and determinism.

     Attempts to log in to Hugging Face Hub.
     First, it tries to log in interactively via the Hugging Face CLI.
|