semmyk committed on
Commit
dc56f4d
·
1 Parent(s): 3ff6af7

v0.2.8.5: Baseline 02 - upload markdown files/folder, accumulate_dir - sidebar (settings) - visualise KG - reset working folder files - updated README

Files changed (5)
  1. README.md +62 -192
  2. app.py +77 -37
  3. app_gradio_lightrag.py +72 -107
  4. utils/file_utils.py +37 -1
  5. utils/llm_login.py +1 -1
README.md CHANGED
@@ -34,220 +34,90 @@ requires-python: ">=3.12"
34
  #---
35
  ---
36
 
37
- # semmyKG[lightrag] - LightRAG-based Knowledge Graph Toolkit
38
 
39
- A modular, sophisticated Gradio application for Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) using the [LightRAG][1] framework.
40
 
41
- ## Overview
 
42
 
43
- semmyKG gears towards a comprehensive solution that combines the power of LightRAG with modern web interfaces to create, query, and visualise knowledge graphs from markdown documents.
44
- The toolkit enables intelligent document processing, semantic search, and interactive knowledge graph visualisation with support for multiple LLM backends. It supports OpenAI and Ollama LLM backends.
45
 
46
- ## Key Features
47
-
48
- ### 🔍 Intelligent Document processing and RAG Capabilities
49
- - **Dual-level KG-RAG**: Combines traditional RAG with knowledge graph reasoning (powered by LightRAG)
50
- - **Multi-modal LLM Support**: OpenAI, Ollama, and Google GenAI backends. Full GenAI support coming soon.
51
- - **Semantic Search**: Vector-based document retrieval with embedding models (powered by LightRAG)
52
- - **Multi-format Support**: Markdown ingestion with ParserPDF ([GitHub][3] | [HF Space][4]) integration for PDF, Word, and HTML conversion. Full integration coming soon.
53
- - **Markdown Ingestion**: Process and index markdown files from specified directories
54
- - **Knowledge Graph Construction**: Automatically builds entity-relationship graphs after indexing
55
- - **Interactive Visualisation**: Real-time KG exploration
56
-
57
- ### Technical Excellence
58
- - **Modular Architecture**: Clean, maintainable code structure
59
- - **Async Operations**: Efficient handling of large document collections
60
- - **Robust Error Handling**: Comprehensive logging and exception management
61
-
62
- ## Installation & Setup
63
-
64
- ### Method 1: Using UV (Recommended)
65
  ```bash
66
  git clone https://github.com/semmyk-research/semmyKG
67
  cd semmyKG
68
 
69
- # Create virtual environment and install dependencies
70
- uv venv .venv
71
- source .venv/bin/activate # Linux/MacOS
72
- # .venv\Scripts\activate on Windows
73
 
74
- # Sync dependencies
75
- uv pip sync
76
- ```
77
-
78
- ### Method 2: Traditional Python Setup
79
- ```bash
80
- git clone https://github.com/semmyk-research/semmyKG
81
- cd semmyKG
82
-
83
- # Create virtual environment
84
  python -m venv .venv
85
- source .venv/bin/activate # Linux/MacOS
86
- # .venv\Scripts\activate on Windows
87
-
88
- # Install dependencies
89
  pip install -r requirements.txt
90
  ```
91
 
92
- ## 🔧 Configuration
93
-
94
- ### Environment Variables Setup
95
- Copy `.env.example` to `.env` and configure your settings:
96
-
97
- ```env
98
- # API Configuration
99
  OPENAI_API_KEY=your-openai-api-key
100
-
101
- # Model Selection (format: provider/model-identifier)
102
- LLM_MODEL=openai/gpt-oss-120b
103
-
104
- # LLM Inference Endpoints
105
- OPENAI_API_BASE=your-llm-provider-endpoint
106
- # For local inference servers: http://localhost:1234/v1
107
-
108
- # Embedding Configuration
109
- OPENAI_API_EMBED_BASE=your-embedding-provider-endpoint
110
- # Note: For local embedding services, do not include /embedding in URL
111
- LLM_MODEL_EMBED=your-embedding-model
112
-
113
- # Ollama/Local hosting Configuration
114
  OLLAMA_HOST=http://localhost:11434
115
- OLLAMA_API_KEY=your-ollama-api-key-if-required
116
- #[For LMStudio] OLLAMA_API_KEY=lmstudio
117
-
118
- ## Alternative: Direct Web UI Configuration
119
- # If .env is not set, you can enter credentials directly in the web interface
120
- ```
121
-
122
- ## Quick Start
123
-
124
- ### 1. Initialise the Application
125
- ```bash
126
- python app.py
127
- ```
128
-
129
- ### 2. Web Interface Workflow
130
- 1. **Select Data Folder**: Choose your markdown documents directory (default: `dataset/data/docs`)
131
- 2. **Configure Settings**:
132
- - **Choose LLM Backend**: Select between OpenAI, Ollama, or GenAI
133
- - Select or input other configuration in the Settings pane,
134
- 3. **Activate**: Activate the lightRAG constructor
135
- 4. **Process Documents**: Click 'Index Documents' to process your files
136
- 5. **Query the System**: Enter your questions and select query mode
137
- 6. **Visualise Results**: Click 'Show Knowledge Graph' to finalise building Knowledge Graph and for interactive exploration
138
-
139
- ## 📁 Project Structure
140
-
141
- ```
142
- semmyKG/
143
- ├── app_gradio_lightrag.py # Central Gradio coordinating processing
144
- ├── app.py # Main Gradio app entry point
145
- ├── requirements.txt # Project dependencies
146
- ├── .env.example # Environment template
147
- ├── dataset/
148
- │ └── data/
149
- │ └── docs/ # Default document directory
150
- ├── utils/
151
- │ ├── utils.py # Utility functions
152
- │ ├── file_utils.py # File operations
153
- │ ├── logger.py # Logging configuration
154
- └── logs/ # Application logs
155
- ```
156
 
157
- ## Deployment Options
158
-
159
- ### Local Development
160
  ```bash
161
- python app.py
162
- ```
163
-
164
- ### HuggingFace Spaces
165
- - **Requirements**: Ensure all dependencies in `requirements.txt`
166
- - **Environment**: Configure via web UI or Space secrets
167
-
168
- ### Google Colab
169
- - **Quick Setup**: Install requirements and configure tokens in 'Secret'
170
- - **Run**: Copy to `Files`, following the folder structure, and run app cells as appropriate
171
-
172
- ### 📋 System Requirements
173
 
174
- - **Python**: 3.12+
175
- - **Memory**: 8GB+ vRAM recommended for large document sets
176
- - **Storage**: Sufficient space for document collections and vector databases
177
-
178
- ### 🔌 Supported LLM Backends
179
-
180
- #### OpenAI Compatible and Google GenAI
181
- - **Models**: Frontline providers (Openai, Deepseek ...) and custom models
182
- - **Gemini Models**: Access to Google's latest AI models
183
- - **Endpoints**: Local inference servers (LMStudio, Jan.ai, ollama ...)
184
- - **Embedding Models**: Multiple sentence transformer models and inference providers
185
-
186
- #### Ollama Integration
187
- - **Local Models**: Access to Ollama's model ecosystem
188
- - **Self-hosted**: Complete data privacy and control
189
-
190
-
191
- ### Document Ingestion
192
- - **Format Support**: Markdown files only (use ParserPDF for other formats)
193
  ```python
194
- # The system automatically processes markdown files from:
195
- # - dataset/data/docs/ (default)
196
  ```
197
 
198
- ### Query Modes
199
- - **Semantic Search**: Vector-based similarity matching
200
- - **KG-enhanced RAG**: Combines traditional RAG with graph reasoning
201
-
202
- ### Interactive Visualisation
203
- - **Real-time Exploration**: Dynamic graph manipulation
204
- - **Entity Highlighting**: Focus on specific nodes and relationships
205
-
206
- ### 📈 Performance Optimisation: Batch Processing
207
- - **Parallel Insertion**: Configurable batch sizes
208
- - **Rate Limiting**: Built-in delays to prevent API throttling
209
-
210
- ### 📊 Custom System Prompts: Domain-Specific Expertise
211
- - **Domain Adaptation**: Modify prompts for specific use cases and customised NER (Named Entity Recognition) domain-specific entity rules
212
- - **Specialised Processing**: Tailored entity recognition for security domains
213
- - **Legislation Awareness**: Built-in understanding of legal frameworks
214
-
215
-
216
- ## 🔍 Troubleshooting
217
-
218
- ### Common Issues
219
- - **Module Import Errors**: Ensure all dependencies are installed
220
- - **API Connection Issues**: Verify endpoint URLs and API keys
221
- - **Memory Management**: Monitor resource usage during large-scale indexing
222
-
223
- ### Notes
224
- - All user-facing text are in UK English
225
- - For advanced configuration, see LightRAG documentation
226
- Pending full integration, use our ParserPDF tool ([GitHub][3] | [HF Space][4]) to generate markdown from documents (PDF, Word, html)
227
-
228
- ## 🤝 Contributing
229
-
230
- We welcome contributions! Please see our contributing guidelines for more information.
231
-
232
- ## 🛣️ Roadmap (no defined timeline)
233
- - Integrate Huggingface log in (in progress)
234
  - [ParserPDF][3] integration
235
- - Pre and post processing document viewer
236
- - Modal platform support
237
- - Connected UX refactoring
238
-
239
- ## 📄 License
240
-
241
- This project is licensed under the [MIT License][2].
242
-
243
- ## 🔗 References
244
-
245
- - [LightRAG Framework][1]
246
- - [ParserPDF Tool][3] for document conversion
247
- - [HuggingFace Space][4] for ParserPDF
248
 
 
 
249
 
250
- [1]: https://github.com/HKUDS/LightRAG "LightRAG GitHub Repository"
251
  [2]: https://opensource.org/license/mit "MIT License"
252
- [3]: https://github.com/semmyk-research/parserPDF "ParserPDF GitHub Repository"
253
- [4]: https://huggingface.co/spaces/semmyk/parserPDF "ParserPDF HuggingFace Space"
 
34
  #---
35
  ---
36
 
37
+ # LightRAG Gradio App
38
 
39
+ A modern, modular Gradio app for knowledge-graph-based Retrieval-Augmented Generation (RAG) using [LightRAG][1]. Supports OpenAI and Ollama LLM backends, markdown document ingestion, and interactive knowledge graph visualisation. Our ParserPDF ([GitHub][3] | [HF Space][4]) pipeline generates markdown from documents (PDF, Word, HTML).
40
 
41
+ ## Features
42
+ - LightRAG for dual-level RAG and knowledge graph (KG) construction
43
+ - Ingest markdown files from a folder (default: `dataset/data/docs`).
44
+ - Query with OpenAI or Ollama backend (user-selectable)
45
+ - Visualise KG interactively in-browser
46
+ - Deployable to venv, Colab, or HuggingFace Spaces
47
+ - Robust, pythonic, modular code (UK English)
48
 
49
+ ## Setup
 
50
 
51
+ ### 1. Clone and create venv
 
52
  ```bash
53
  git clone https://github.com/semmyk-research/semmyKG
54
  cd semmyKG
55
 
56
+ uv venv .venv # ensure you have the uv package
57
+ source .venv/bin/activate # or .venv\Scripts\activate on Windows
58
+ uv pip sync # or uv pip sync requirements.txt
 
59
 
60
+ # or, with plain venv:
61
  python -m venv .venv
62
+ source .venv/bin/activate # or .venv\Scripts\activate on Windows
 
63
  pip install -r requirements.txt
64
  ```
65
 
66
+ ### 2. Configure environment
67
+ Copy `.env.example` to `.env` and fill in your keys:
68
+ ```env
69
  OPENAI_API_KEY=your-openai-api-key
70
+ LLM_MODEL=your-LLM-model-Name
71
+ ##(in the format: provider/model-identifier)
72
+ OPENAI_API_BASE=your-LLM-inference-provider-endpoint
73
+ ##(for locally hosted llm inference server like LMStudio or Jan.ai, follow ollama host adding /v1: http://localhost:1234/v1)
74
+ OPENAI_API_EMBED_BASE=your-embedding-provider-endpoint
75
+ ##(for locally hosted, do not include /embedding)
76
+ LLM_MODEL_EMBED=your-embedding-model ##(in the format: provider/embedding-name)
77
  OLLAMA_HOST=http://localhost:11434
78
+ OLLAMA_API_KEY= ##(include if required)
79
+ ```
80
+ If `.env` is not set, you can enter credentials directly in the web UI. <br>
81
+ Likewise, values entered directly in the web UI override `.env`.
82
 
83
+ ### 3. Run the app
 
 
84
  ```bash
85
+ python app_gradio_lightrag.py
86
+ ```
87
+ For faster development, use Gradio's reload mode:
88
 
89
  ```bash
90
+ ##SMY: assist: https://www.gradio.app/guides/developing-faster-with-reload-mode
91
+ gradio app_gradio_lightrag.py --demo-name=gradio_ui
92
  ```
93
 
94
+ ### 4. Colab/Spaces
95
+ - For HuggingFace Spaces: ensure all dependencies are in `requirements.txt` and `.env` is set via the web UI or Space secrets.
96
+ - For Colab: install requirements and run the app cell.
97
+
98
+ ## Usage
99
+ - Select your data folder (default: `dataset/data/docs`)
100
+ - Choose an LLM backend (OpenAI or Ollama). GenAI currently has a bug yielding role: 'assistant' instead of 'user' when updating history.
101
+ - Activate the RAG constructor
102
+ - Click 'Index Documents' to build the KG entities
103
+ - Click 'Query' to get answers
104
+   - Enter your query and select a query mode
105
+ - Click 'Show Knowledge Graph' to visualise the KG
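
The query step accepts one of five modes, matching the `mode_dd` dropdown in `app.py`. A minimal sketch of validating the UI selection before it is handed to LightRAG (the helper name and return shape are illustrative, not part of the codebase):

```python
# Modes offered by the mode_dd dropdown in app.py
VALID_MODES = ("naive", "local", "global", "hybrid", "mix")

def validate_query(query: str, mode: str = "hybrid") -> dict:
    """Check the UI selections before they reach the LightRAG query call."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown query mode {mode!r}; expected one of {VALID_MODES}")
    if not query.strip():
        raise ValueError("query must not be empty")
    return {"query": query.strip(), "mode": mode}
```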
106
+
107
+ ## Notes
108
+ - Only markdown files are supported for ingestion (images in `/images` subfolders are ignored for now). <br>NB: other formats (PDF, TXT, HTML, ...) will be enabled later.
109
+ - To generate markdown from documents (PDF, Word, HTML), use our ParserPDF tool ([GitHub][3] | [HF Space][4]).
110
+ - All user-facing text is in UK English
111
+ - For advanced configuration, see LightRAG documentation
112
+
113
+ ## Roadmap (no defined timeline)
114
+ - HuggingFace log-in (in progress)
115
  - [ParserPDF][3] integration
116
 
117
+ ## License
118
+ [MIT][2]
119
 
120
+ [1]: https://github.com/HKUDS/LightRAG "LightRAG GitHub"
121
  [2]: https://opensource.org/license/mit "MIT License"
122
+ [3]: https://github.com/semmyk-research/parserPDF "ParserPDF (GitHub)"
123
+ [4]: https://huggingface.co/spaces/semmyk/parserPDF "ParserPDF (HF Space)"
app.py CHANGED
@@ -6,6 +6,7 @@ import gradio as gr
6
  #from watchfiles import run_process ##gradio reload watch
7
  from app_gradio_lightrag import LightRAGApp ##SMY lightrag logging
8
  from utils.llm_login import get_login_token
 
9
 
10
  import asyncio
11
  import nest_asyncio
@@ -17,6 +18,7 @@ from dotenv import load_dotenv
17
  # Load environment variables
18
  load_dotenv()
19
 
 
20
  # Pythonic error handling decorator
21
  def handle_errors(func):
22
  def wrapper(*args, **kwargs):
@@ -25,6 +27,7 @@ def handle_errors(func):
25
  except Exception as e:
26
  return gr.update(value=f"Error: {e}")
27
  return wrapper
 
28
 
29
  # Instantiate app logic
30
  #app_logic = LightRAGApp() ## See main()
@@ -67,39 +70,64 @@ def gradio_ui(app_logic: LightRAGApp):
67
  """)
68
 
69
  # Step 0: Section 1
 
 
 
70
  # Define openai_api textbox initial value
71
  openai_api_key_init = os.getenv("OPENAI_API_KEY", "jan-ai")
72
- with gr.Accordion(label="🛞 LLM settings", open=False):
73
- with gr.Row():
74
- data_folder_tb = gr.Textbox(value="dataset/data/docs2", label="Data Folder (markdown only)", show_copy_button=True)
75
- working_dir_tb = gr.Textbox(value="./working_folder1", label="lightRAG working folder", show_copy_button=True)
76
- llm_backend_cb = gr.Radio(["OpenAI", "Ollama", "GenAI"], value="OpenAI", label="LLM Backend: OpenAI, Local or GenAI")
77
- llm_model_name_tb = gr.Textbox(value=os.getenv("LLM_MODEL", "openai/gpt-oss-120b"), label="LLM Model Name", show_copy_button=True) #.split('/')[1], label="LLM Model Name") "meta-llama/Llama-4-Maverick-17B-128E-Instruct")), #image-Text-to-Text #"openai/gpt-oss-120b",
78
- with gr.Row():
79
- with gr.Row(): #elem_classes="password-box"):
80
- #openai_key_tb = gr.Textbox(value=os.getenv("OPENAI_API_KEY", "jan-ai"), label="OpenAI API Key",
81
- # type="password", elem_classes="password-box", container=False, interactive=True, info="OpenAI API Key") #, show_copy_button=True)
82
- openai_key_tb = gr.Textbox(value=openai_api_key_init, label="OpenAI API Key",
83
- type="password", elem_classes="password-box", container=False, interactive=True, info="OpenAI API Key") #, show_copy_button=True)
84
- toggle_btn_openai_key = gr.Button(
85
- value="👁️", # Initial eye icon
86
- elem_classes="icon-button", size="sm") #, min_width=50)
87
- openai_baseurl_tb = gr.Textbox(value=os.getenv("OPENAI_API_BASE", "https://router.huggingface.co/v1"), label="OpenAI baseurl", show_copy_button=True)
88
- ollama_host_tb = gr.Textbox(value=os.getenv("OLLAMA_HOST", "http://localhost:1234/v1"), label="Ollama Host", show_copy_button=True)
89
- #ollama_host_tb = gr.Textbox(value=os.getenv("OPENAI_API_EMBED_BASE", ""), label="Ollama Host")
90
- with gr.Row():
91
- embed_backend_dd = gr.Dropdown(choices=["Transformer", "Provider"], value="Provider", label="Embedding Type")
92
- openai_baseurl_embed_tb = gr.Textbox(value=os.getenv("OPENAI_API_EMBED_BASE", "http://localhost:1234/v1"), label="LLM Embed baseurl", show_copy_button=True)
93
- llm_model_embed_tb = gr.Textbox(value=os.getenv("LLM_MODEL_EMBED","text-embedding-bge-m3"), label="LLM Embedding Model", show_copy_button=True) #.split('/')[1], label="Embedding Model")
94
- with gr.Row(): #elem_classes="password-box"):
95
- openai_key_embed_tb = gr.Textbox(value=os.getenv("OPENAI_API_KEY_EMBED", "jan-ai"), label="LLM API Key Embed", #lm-studio
96
- type="password", elem_classes="password-box", container=False, interactive=True, info="LLM API Key Embed") #, show_copy_button=True)
97
- toggle_btn_openai_key_embed = gr.Button(
98
- value="👁️", # Initial eye icon
99
- elem_classes="icon-button", size="sm") #, min_width=50)
100
- #openai_key_embed_tb = gr.Textbox(value=os.getenv("OPENAI_API_KEY_EMBED", "jan-ai"), label="OpenAI API Key Embed", type="password", show_copy_button=True) #("OLLAMA_API_KEY", ""), label="OpenAI API Key Embed", type="password")
 
101
 
102
  # Step 1: Section 2
103
  with gr.Accordion("🤗 HuggingFace Client Control", open=True): #, open=False):
104
  # HuggingFace controls
105
  hf_login_logout_btn = gr.LoginButton(value="Sign in to HuggingFace 🤗", logout_value="Logout of HF: ({}) 🤗", variant="huggingface")
@@ -113,7 +141,7 @@ def gradio_ui(app_logic: LightRAGApp):
113
  gr.HTML("<hr>") #gr.Markdown("---")
114
 
115
  with gr.Row():
116
- index_btn = gr.Button("Index Documents")
117
  stop_btn = gr.Button("Stop", variant="stop") ## Add cancel event button
118
  query_text_tb = gr.Textbox(label="Your Query")
119
  mode_dd = gr.Dropdown(["naive", "local", "global", "hybrid", "mix"], value="hybrid", label="Query Mode")
@@ -135,6 +163,7 @@ def gradio_ui(app_logic: LightRAGApp):
135
  st_openai_key = gr.State(value=openai_api_key_init) #gr.State("")
136
  st_password1 = gr.State(value="password")
137
  st_password2 = gr.State(value="password")
 
138
 
139
 
140
  ### Change handling
@@ -217,9 +246,9 @@ def gradio_ui(app_logic: LightRAGApp):
217
 
218
 
219
  # Button logic with async handling
220
- async def setup_wrapper(df, wd, llm_back, embed_back, oai, base, base_embed, model, model_embed, host, embedkey):
221
- return await app_logic.setup(df, wd, llm_back, embed_back, oai,
222
- base, base_embed, model, model_embed, host, embedkey)
223
 
224
  async def index_wrapper(df):
225
  return await app_logic.index_documents(df)
@@ -241,6 +270,13 @@ def gradio_ui(app_logic: LightRAGApp):
241
  #hf_login_logout_btn.click(update_state_stored_value, inputs=openai_key_tb, outputs=st_openai_key)
242
  hf_login_logout_btn.click(fn=custom_do_logout, inputs=openai_key_tb, outputs=[hf_login_logout_btn, st_openai_key])
243
 
244
  toggle_btn_openai_key.click(
245
  fn=toggle_password,
246
  inputs=[st_password1],
@@ -253,10 +289,14 @@ def gradio_ui(app_logic: LightRAGApp):
253
  outputs=[openai_key_embed_tb, toggle_btn_openai_key_embed, st_password2],
254
  show_progress="hidden"
255
  )
256
-
257
- inputs_arg = [data_folder_tb, working_dir_tb, llm_backend_cb, embed_backend_dd, st_openai_key, #openai_key_tb,
258
  openai_baseurl_tb, openai_baseurl_embed_tb, llm_model_name_tb, llm_model_embed_tb,
259
- ollama_host_tb, openai_key_embed_tb]
260
 
261
  setup_btn.click(
262
  fn=setup_wrapper,
@@ -267,7 +307,7 @@ def gradio_ui(app_logic: LightRAGApp):
267
  )
268
  index_btn.click(
269
  fn=index_wrapper,
270
- inputs=[data_folder_tb],
271
  outputs=[status_box, progress_tb],
272
  show_progress=True
273
  )
 
6
  #from watchfiles import run_process ##gradio reload watch
7
  from app_gradio_lightrag import LightRAGApp ##SMY lightrag logging
8
  from utils.llm_login import get_login_token
9
+ from utils.file_utils import accumulate_dir
10
 
11
  import asyncio
12
  import nest_asyncio
 
18
  # Load environment variables
19
  load_dotenv()
20
 
21
+ '''
22
  # Pythonic error handling decorator
23
  def handle_errors(func):
24
  def wrapper(*args, **kwargs):
 
27
  except Exception as e:
28
  return gr.update(value=f"Error: {e}")
29
  return wrapper
30
+ '''
31
 
32
  # Instantiate app logic
33
  #app_logic = LightRAGApp() ## See main()
 
70
  """)
71
 
72
  # Step 0: Section 1
73
+
74
+ # Define ext type (in lieu of getting from global var)
75
+ #ext = (".md", "md") #SMY disused: 'tuple' object has no attribute '_id'
76
  # Define openai_api textbox initial value
77
  openai_api_key_init = os.getenv("OPENAI_API_KEY", "jan-ai")
78
+
79
+ with gr.Sidebar(position="right"):
80
+ system_prompt_tb = gr.Textbox(
81
+ value="You are a helpful assistant. You answer questions based on the provided context.", # If you don't know the answer, just say so. Don't make up information.",
82
+ label="System Prompt",
83
+ lines=3,
84
+ interactive=True,
85
+ show_copy_button=True,
86
+ )
87
+
88
+ with gr.Accordion(label="🛞 LLM settings", open=False):
89
+ with gr.Row():
90
+ llm_backend_cb = gr.Radio(["OpenAI", "Ollama", "GenAI"], value="OpenAI", label="LLM Backend: OpenAI, Local or GenAI")
91
+ llm_model_name_tb = gr.Textbox(value=os.getenv("LLM_MODEL", "openai/gpt-oss-120b"), label="LLM Model Name", show_copy_button=True) #.split('/')[1], label="LLM Model Name") "meta-llama/Llama-4-Maverick-17B-128E-Instruct")), #image-Text-to-Text #"openai/gpt-oss-120b",
92
+ with gr.Row():
93
+ with gr.Row(): #elem_classes="password-box"):
94
+ #openai_key_tb = gr.Textbox(value=os.getenv("OPENAI_API_KEY", "jan-ai"), label="OpenAI API Key",
95
+ # type="password", elem_classes="password-box", container=False, interactive=True, info="OpenAI API Key") #, show_copy_button=True)
96
+ openai_key_tb = gr.Textbox(value=openai_api_key_init, label="OpenAI API Key",
97
+ type="password", elem_classes="password-box", container=False, interactive=True, info="OpenAI API Key") #, show_copy_button=True)
98
+ toggle_btn_openai_key = gr.Button(
99
+ value="👁️", # Initial eye icon
100
+ elem_classes="icon-button", size="sm") #, min_width=50)
101
+ with gr.Row():
102
+ openai_baseurl_tb = gr.Textbox(value=os.getenv("OPENAI_API_BASE", "https://router.huggingface.co/v1"), label="OpenAI baseurl", show_copy_button=True)
103
+ ollama_host_tb = gr.Textbox(value=os.getenv("OLLAMA_HOST", "http://localhost:1234/v1"), label="Ollama Host", show_copy_button=True)
104
+ #ollama_host_tb = gr.Textbox(value=os.getenv("OPENAI_API_EMBED_BASE", ""), label="Ollama Host")
105
+ with gr.Row():
106
+ openai_baseurl_embed_tb = gr.Textbox(value=os.getenv("OPENAI_API_EMBED_BASE", "http://localhost:1234/v1"), label="LLM Embed baseurl", show_copy_button=True)
107
+ llm_model_embed_tb = gr.Textbox(value=os.getenv("LLM_MODEL_EMBED","text-embedding-bge-m3"), label="LLM Embedding Model", show_copy_button=True) #.split('/')[1], label="Embedding Model")
108
+ with gr.Row():
109
+ embed_backend_dd = gr.Dropdown(choices=["Transformer", "Provider"], value="Provider", label="Embedding Type")
110
+ with gr.Row(): #elem_classes="password-box"):
111
+ openai_key_embed_tb = gr.Textbox(value=os.getenv("OPENAI_API_KEY_EMBED", "jan-ai"), label="LLM API Key Embed", #lm-studio
112
+ type="password", elem_classes="password-box", container=False, interactive=True, info="LLM API Key Embed") #, show_copy_button=True)
113
+ toggle_btn_openai_key_embed = gr.Button(
114
+ value="👁️", # Initial eye icon
115
+ elem_classes="icon-button", size="sm") #, min_width=50)
116
+ #openai_key_embed_tb = gr.Textbox(value=os.getenv("OPENAI_API_KEY_EMBED", "jan-ai"), label="OpenAI API Key Embed", type="password", show_copy_button=True) #("OLLAMA_API_KEY", ""), label="OpenAI API Key Embed", type="password")
117
 
118
  # Step 1: Section 2
119
+ with gr.Row():
120
+ with gr.Column():
121
+ #data_folder_tb = gr.Textbox(value="dataset/data/docs2", label="Data Folder (markdown only)", show_copy_button=True)
122
+ dir_btn = gr.UploadButton(
123
+ #value='dataset/data/', #docs2 #[Errno 13] Permission denied
124
+ label="📁 Upload Folder",
125
+ #file_types=ext, #["file"],
126
+ file_count="directory",
127
+ )
128
+ upload_count_md = gr.Markdown(visible=False)
129
+ working_dir_tb = gr.Textbox(value="./working_folder1", label="lightRAG working folder", show_copy_button=True)
130
+ working_dir_reset_cb = gr.Checkbox(value=False, label="Reset working files?")
131
  with gr.Accordion("🤗 HuggingFace Client Control", open=True): #, open=False):
132
  # HuggingFace controls
133
  hf_login_logout_btn = gr.LoginButton(value="Sign in to HuggingFace 🤗", logout_value="Logout of HF: ({}) 🤗", variant="huggingface")
 
141
  gr.HTML("<hr>") #gr.Markdown("---")
142
 
143
  with gr.Row():
144
+ index_btn = gr.Button("Index Documents", interactive=False)
145
  stop_btn = gr.Button("Stop", variant="stop") ## Add cancel event button
146
  query_text_tb = gr.Textbox(label="Your Query")
147
  mode_dd = gr.Dropdown(["naive", "local", "global", "hybrid", "mix"], value="hybrid", label="Query Mode")
 
163
  st_openai_key = gr.State(value=openai_api_key_init) #gr.State("")
164
  st_password1 = gr.State(value="password")
165
  st_password2 = gr.State(value="password")
166
+ state_uploaded_file_list = gr.State(value=[])
167
 
168
 
169
  ### Change handling
 
246
 
247
 
248
  # Button logic with async handling
249
+ async def setup_wrapper(df, wd, wd_reset, llm_back, embed_back, oai, base, base_embed, model, model_embed, host, embedkey, sys_prompt):
250
+ return await app_logic.setup(df, wd, wd_reset, llm_back, embed_back, oai,
251
+ base, base_embed, model, model_embed, host, embedkey, sys_prompt)
252
 
253
  async def index_wrapper(df):
254
  return await app_logic.index_documents(df)
 
270
  #hf_login_logout_btn.click(update_state_stored_value, inputs=openai_key_tb, outputs=st_openai_key)
271
  hf_login_logout_btn.click(fn=custom_do_logout, inputs=openai_key_tb, outputs=[hf_login_logout_btn, st_openai_key])
272
 
273
+ dir_btn.upload(
274
+ fn=accumulate_dir,
275
+ inputs=[dir_btn, state_uploaded_file_list],
276
+ outputs=[state_uploaded_file_list, index_btn, upload_count_md, status_box],
277
+ show_progress="hidden"
278
+ )
279
+
280
  toggle_btn_openai_key.click(
281
  fn=toggle_password,
282
  inputs=[st_password1],
 
289
  outputs=[openai_key_embed_tb, toggle_btn_openai_key_embed, st_password2],
290
  show_progress="hidden"
291
  )
292
+ '''
293
+ async def setup(self, data_folder: str, working_dir: str, wdir_reset: bool, llm_backend: str, embed_backend: str, openai_key: str,
294
+ openai_baseurl: str, openai_baseurl_embed: str, llm_model_name: str, llm_model_embed: str,
295
+ ollama_host: str, embed_key: str, system_prompt: str) -> str:
296
+ '''
297
+ inputs_arg = [state_uploaded_file_list, working_dir_tb, working_dir_reset_cb, llm_backend_cb, embed_backend_dd, st_openai_key, #openai_key_tb,
298
  openai_baseurl_tb, openai_baseurl_embed_tb, llm_model_name_tb, llm_model_embed_tb,
299
+ ollama_host_tb, openai_key_embed_tb, system_prompt_tb] #data_folder_tb,
300
 
301
  setup_btn.click(
302
  fn=setup_wrapper,
 
307
  )
308
  index_btn.click(
309
  fn=index_wrapper,
310
+ inputs=state_uploaded_file_list, #[data_folder_tb],
311
  outputs=[status_box, progress_tb],
312
  show_progress=True
313
  )
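
The new `accumulate_dir` upload handler wired into `app.py` above collects uploaded paths into `state_uploaded_file_list` and enables the index button once files arrive. A minimal sketch of the accumulation logic — the signature and return shape here are assumptions; the real handler in `utils/file_utils.py` also returns Gradio component updates:

```python
from pathlib import Path

def accumulate_dir(new_files: list[str], file_list: list[str]) -> tuple[list[str], int]:
    """Append newly uploaded markdown paths to the session state, skipping
    non-markdown files and duplicates; returns the list and its size."""
    for f in new_files or []:
        if Path(f).suffix.lower() == ".md" and f not in file_list:
            file_list.append(f)
    return file_list, len(file_list)
```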
app_gradio_lightrag.py CHANGED
@@ -10,7 +10,7 @@ import random
10
  from functools import partial
11
  from typing import Tuple, Optional, Any, List, Union
12
 
13
- import inspect ##SMY lightrag_openai_compatible_demo.py
14
 
15
  def install(package):
16
  import subprocess
@@ -86,9 +86,7 @@ def configure_logging():
86
  # Get log directory path from environment variable or use current directory
87
  #log_dir = os.getenv("LOG_DIR", os.getcwd())
88
  log_dir = os.getenv("LOG_DIR", "logs")
89
- '''log_file_path = os.path.abspath(
90
- os.path.join(log_dir, "lightrag_compatible_demo.log")
91
- )'''
92
  if log_dir:
93
  log_file_path = Path(log_dir) / "lightrag_logs.log"
94
  else:
@@ -165,15 +163,25 @@ def visualise_graphml(graphml_path: str, working_dir: str) -> str:
165
  ## Load the GraphML file
166
  G = nx.read_graphml(graphml_path)
167
 
168
  ## Create a Pyvis network
169
  #net = Network(height="100vh", notebook=True)
170
- net = Network(notebook=True, width="100%", height="600px") #, heading=f"Knowledge Graph Visualisation") #(noteboot=False)
171
- ## Convert NetworkX graph to Pyvis network
172
  net.from_nx(G)
173
 
174
  # Add colors and title to nodes
175
  for node in net.nodes:
176
- node["color"] = "#{:06x}".format(random.randint(0, 0xFFFFFF))
 
177
  if "description" in node:
178
  node["title"] = node["description"]
179
 
@@ -184,22 +192,32 @@ def visualise_graphml(graphml_path: str, working_dir: str) -> str:
184
 
185
  ## Set the 'physics' attribute to repulsion
186
  net.repulsion(node_distance=120, spring_length=200)
187
- net.show_buttons(filter_=['physics']) ##SMY: dynamically modify the network
188
  #net.show_buttons()
189
 
190
  ## graph path
191
- kg_viz_html_file = "kg_viz.html"
192
- #html_path = os.path.join(working_dir, kg_viz_html_file)
193
  html_path = Path(working_dir) / kg_viz_html_file
194
 
195
- #net.save_graph(html_path)
196
  ## Save and display the generated KG network html
 
197
  #net.show(html_path)
198
  net.show(str(html_path), local=True, notebook=False)
199
 
200
- ##SMY read and display generated KG html
201
- #with open(html_path, "r", encoding="utf-8") as f:
202
- # return f.read() ## html
 
 
203
 
204
  # Utility: Get all markdown files in a folder
205
  def get_markdown_files(folder: str) -> list[str]:
@@ -242,6 +260,7 @@ class LightRAGApp:
242
 
243
  if custom_system_prompt:
244
  self.system_prompt = custom_system_prompt
 
245
  else:
246
  self.system_prompt = """
247
  You are a domain expert on Cybersecurity, the South Africa landscape and South African legislation.
@@ -269,8 +288,9 @@ class LightRAGApp:
269
  - For instance, maintain a single node for Protection of Information Act, Protection of Information Act, 1982, Protection of Information Act No 84, 1982.
270
- However, have a separate node for Protection of Personal Information Act, 2013; as it is a separate legislation.
271
- Also take note that 'Republic of South Africa' is an official geo entity while 'South Africa' is a referred-to place, although also a geo entity:
272
- - Always watch the context and becareful of lumping them together.
273
  """
 
274
 
275
  return self.system_prompt
276
 
@@ -358,7 +378,7 @@ class LightRAGApp:
358
  logger.debug(f"Sending messages to Gemini: Model: {self.llm_model_name.rpartition('/')[-1]} \n~ Message: {prompt}")
359
  logger_kg.log(level=20, msg=f"Sending messages to Gemini: Model: {self.llm_model_name.rpartition('/')[-1]} \n~ Message: {prompt}")
360
 
361
- # 2. Initialize the GenAI Client with Gemini API Key
362
  client = Client(api_key=self.llm_api_key) #api_key=gemini_api_key
363
  #aclient = genai.Client(api_key=self.llm_api_key).aio # use AsyncClient
364
 
@@ -454,14 +474,30 @@ class LightRAGApp:
454
 
455
  def _ensure_working_dir(self) -> str:
456
  """Ensure working directory exists and return status message"""
457
- '''if not os.path.exists(self.working_dir):
458
- os.makedirs(self.working_dir, exist_ok=True)
459
- return f"Created working directory: {self.working_dir}"'''
460
  if not Path(self.working_dir).exists():
461
  check_create_dir(self.working_dir)
462
  return f"Created working directory: {self.working_dir}"
463
  return f"Working directory exists: {self.working_dir}"
464
 
 
 
465
 
466
  async def _initialise_storages(self) -> str:
467
  #def _initialise_storages(self) -> str:
@@ -487,31 +523,11 @@ class LightRAGApp:
487
  #print(f"_embedding_func: llm_api_key_embed: {self.llm_api_key_embed}")
488
  #print(f"_embedding_func: llm_baseurl_embed: {self.llm_baseurl_embed}")
489
 
490
- # Clear old data files
491
- #wrap_async(self._clear_old_data_files)
492
- #await self._clear_old_data_files()
493
- """Clear old data files"""
494
- files_to_delete = [
495
- "graph_chunk_entity_relation.graphml",
496
- "kv_store_doc_status.json",
497
- "kv_store_full_docs.json",
498
- "kv_store_text_chunks.json",
499
- "vdb_chunks.json",
500
- "vdb_entities.json",
501
- "vdb_relationships.json",
502
- ]
503
-
504
- for file in files_to_delete:
505
- '''file_path = os.path.join(self.working_dir, file)
506
- if os.path.exists(file_path):
507
- os.remove(file_path)
508
- print(f"Deleting old file:: {file_path}")'''
509
- file_path = Path(self.working_dir) / file
510
- if file_path.exists():
511
- file_path.unlink()
512
- logger_kg.log(level=20, msg=f"LightRAG class: Deleting old files", extra={"filepath": file_path.name})
513
-
514
 
515
  # Get embedding
516
  if self.embed_backend == "Transformer" or self.embed_backend[0] == "Transformer":
517
  logger_kg.log(level=20, msg=f"Getting embeddings dynamically through _embedding_func: ",
@@ -549,7 +565,7 @@ class LightRAGApp:
549
  await self._initialise_storages()
550
 
551
  #await rag.initialize_storages()
552
- #await initialize_pipeline_status() ##SMY: still relevant in updated lightRAG? - """Asynchronously finalize the storages"""
553
 
554
  self.status = f"Storages and pipeline initialised successfully" ##SMY: debug
555
  logger_kg.log(level=20, msg=f"Storages and pipeline initialised successfully")
@@ -561,9 +577,9 @@ class LightRAGApp:
561
 
562
  @handle_errors
563
  #def setup(self, data_folder: str, working_dir: str, llm_backend: str,
564
- async def setup(self, data_folder: str, working_dir: str, llm_backend: str, embed_backend: str,
565
  openai_key: str, openai_baseurl: str, openai_baseurl_embed: str, llm_model_name: str,
566
- llm_model_embed: str, ollama_host: str, embed_key: str) -> str:
567
  """Set up LightRAG with specified configuration"""
568
  # Configure environment
569
  #os.environ["OPENAI_API_KEY"] = openai_key or os.getenv("OPENAI_API_KEY", "")
@@ -573,8 +589,9 @@ class LightRAGApp:
573
  #os.environ["OPENAI_API_EMBED_BASE"] = openai_baseurl_embed or os.getenv("OPENAI_API_EMBED_BASE") #, "http://localhost:1234/v1/embeddings")
574
 
575
  # Update instance state
576
- self.data_folder = data_folder
577
  self.working_dir = working_dir
 
578
  self.llm_backend = llm_backend
579
  self.embed_backend = embed_backend if isinstance(embed_backend, str) else embed_backend[0],
580
  self.llm_model_name = llm_model_name
@@ -592,7 +609,7 @@ class LightRAGApp:
592
  except Exception as e:
593
  self.status = f"LightRAG initialisation.setup: working dir err | {str(e)}"
594
 
595
- # Initialize lightRAG with storages
596
  try:
597
  #self.rag = wrap_async( self._initialise_rag)
598
  self.rag = await self._initialise_rag()
@@ -628,15 +645,18 @@ class LightRAGApp:
628
  '''
629
 
630
  @handle_errors
631
- async def index_documents(self, data_folder: str) -> Tuple[str, str]:
632
  #def index_documents(self, data_folder: str) -> Tuple[str, str]:
633
  """Index markdown documents with progress tracking"""
634
  if not self._is_initialised or self.rag is None:
635
  return "Please initialise LightRAG first using the 'Initialise App' button.", "Not started"
636
 
637
- md_files = get_markdown_files(data_folder)
638
  if not md_files:
639
- return f"No markdown files found in {data_folder}:", "No files"
640
 
641
  try:
642
  total_files = len(md_files)
@@ -739,9 +759,7 @@ class LightRAGApp:
739
  """Display knowledge graph visualisation"""
740
  ## graphml_path: defaults to lightRAG's generated graph_chunk_entity_relation.graphml
741
  ## working_dir: lightRAG's working directory set by user
742
- '''graphml_path = os.path.join(self.working_dir, "graph_chunk_entity_relation.graphml")
743
- if not os.path.exists(graphml_path):
744
- return "Knowledge graph file not found. Please index documents first to generate Knowledge Graph."'''
745
  graphml_path = Path(self.working_dir) / "graph_chunk_entity_relation.graphml"
746
  if not Path(graphml_path).exists():
747
  return "Knowledge graph file not found. Please index documents first to generate Knowledge Graph."
@@ -759,59 +777,6 @@ class LightRAGApp:
759
 
760
 
761
  ############
762
- '''
763
- ##SMY: //TODO: Gradio toggle button
764
- def _clear_old_data_files(self):
765
- """Clear old data files"""
766
- files_to_delete = [
767
- "graph_chunk_entity_relation.graphml",
768
- "kv_store_doc_status.json",
769
- "kv_store_full_docs.json",
770
- "kv_store_text_chunks.json",
771
- "vdb_chunks.json",
772
- "vdb_entities.json",
773
- "vdb_relationships.json",
774
- ]
775
-
776
- for file in files_to_delete:
777
- file_path = Path(self.working_dir) / file
778
- if file_path.exists():
779
- file_path.unlink()
780
- logger_kg.log(level=20, msg=f"LightRAG class: Deleting old files", extra={"filepath": file_path.name})'''
781
- '''
782
-
783
- async def _get_llm_functions(self) -> Tuple[callable, callable]:
784
- #def _get_llm_functions(self) -> Tuple[callable, callable]:
785
- """Get LLM and embedding functions based on backend"""
786
- try:
787
- # Get embedding dimension dynamically
788
- try:
789
- embedding_dimension = await self._get_embedding_dim()
790
- self.status = f"Using embedding dimension: {embedding_dimension}"
791
- logger_kg.log(level=20, msg=f"Using embedding dimension: {embedding_dimension}")
792
- except Exception as e:
793
- # feedback dimensions error
794
- self.status = f"_get_llm_function: embedding_dim error with fallback: {str(e)}"
795
-
796
- # Create embedding function wrapper: # Wrap with EmbeddingFunc to provide required attributes
797
- embed_func = EmbeddingFunc(
798
- embedding_dim=embedding_dimension,
799
- max_token_size=8192, #4096, #8192, # Conservative default | #ollama
800
- func=self._embedding_func
801
- )
802
-
803
- # Get LLM function
804
- #llm_func = await self._llm_model_func ##SMY: not used
805
-
806
- # return LLM and embed functions
807
- #return llm_func, embed_func
808
- return await self._llm_model_func(), embed_func
809
-
810
- except Exception as e:
811
- self.status = f"{self.status} \n| _get_llm_functions error: {str(e)}"
812
- logger_kg.log(level=30, msg=f"{self.status} \n| _get_llm_functions error: {str(e)}")
813
- raise # Re-raise to be caught by the setup method
814
- '''
815
 
816
  '''
817
  ##SMY: record only. for deletion
 
10
  from functools import partial
11
  from typing import Tuple, Optional, Any, List, Union
12
 
13
+ from utils.utils import get_time_now_str ##SMY lightrag_openai_compatible_demo.py
14
 
15
  def install(package):
16
  import subprocess
 
86
  # Get log directory path from environment variable or use current directory
87
  #log_dir = os.getenv("LOG_DIR", os.getcwd())
88
  log_dir = os.getenv("LOG_DIR", "logs")
89
+
 
 
90
  if log_dir:
91
  log_file_path = Path(log_dir) / "lightrag_logs.log"
92
  else:
 
163
  ## Load the GraphML file
164
  G = nx.read_graphml(graphml_path)
165
 
166
+ ## Dynamically size nodes
167
+ # Calculate node attributes for sizing
168
+ node_degrees = dict(G.degree())
169
+
170
+ # Scale node degrees for better visual differentiation
171
+ max_degree = max(node_degrees.values(), default=1) or 1  # guard against empty or edgeless graphs
172
+ for node, degree in node_degrees.items():
173
+ G.nodes[node]['size'] = 10 + (degree / max_degree) * 80 #40 # scaling
174
+
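The linear scaling above maps each node's degree into a size range of roughly 10–90. As a standalone sketch (the `size_by_degree` helper is illustrative, not part of the app, and it adds a guard for empty or edgeless graphs):

```python
def size_by_degree(degrees: dict, base: float = 10, span: float = 80) -> dict:
    """Scale node degrees linearly into visual sizes in [base, base + span]."""
    # `default=1` handles an empty graph; `or 1` handles all-zero degrees
    max_degree = max(degrees.values(), default=1) or 1
    return {node: base + (degree / max_degree) * span
            for node, degree in degrees.items()}
```

With pyvis, the resulting values would be written to `G.nodes[node]['size']` before `net.from_nx(G)` so the network picks them up.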
175
  ## Create a Pyvis network
176
  #net = Network(height="100vh", notebook=True)
177
+ net = Network(notebook=True, width="100%", height="100vh") #, heading=f"Knowledge Graph Visualisation") #(noteboot=False) height="600px",
178
+ # Convert NetworkX graph to Pyvis network
179
  net.from_nx(G)
180
 
181
  # Add colors and title to nodes
182
  for node in net.nodes:
183
+ #node["color"] = "#{:06x}".format(random.randint(0, 0xFFFFFF))
184
+ node["color"] = "#{:06x}".format(random.randint(0, 0xFFFFFF))  # zero-pad to six hex digits for a valid CSS colour
185
  if "description" in node:
186
  node["title"] = node["description"]
187
 
 
192
 
193
  ## Set the 'physics' attribute to repulsion
194
  net.repulsion(node_distance=120, spring_length=200)
195
+ net.show_buttons(filter_=['physics', 'layout']) ##SMY: dynamically modify the network
196
  #net.show_buttons()
197
 
198
  ## graph path
199
+ kg_viz_html_file = f"kg_viz_{get_time_now_str(date_format='%Y-%m-%d')}.html"
 
200
  html_path = Path(working_dir) / kg_viz_html_file
201
 
 
202
  ## Save and display the generated KG network html
203
+ #net.save_graph(html_path)
204
  #net.show(html_path)
205
  net.show(str(html_path), local=True, notebook=False)
206
 
207
+ # get HTML content
208
+ html_iframe = net.generate_html(str(html_path), local=True, notebook=False)
209
+ ## need to remove ' from HTML ##assist: https://huggingface.co/spaces/simonduerr/pyvisdemo/blob/main/app.py
210
+ html_iframe = html_iframe.replace("'", "\"")
211
+
212
+ ##SMY display generated KG html
213
+ #'''
214
+ return gr.update(show_label=True, container=True, value=f"""<iframe style="width: 100%; height: 100vh;margin:0 auto" name="result" allow="midi; geolocation; microphone; camera;
215
+ display-capture; encrypted-media;" sandbox="allow-modals allow-forms
216
+ allow-scripts allow-same-origin allow-popups
217
+ allow-top-navigation-by-user-activation allow-downloads" allowfullscreen=""
218
+ allowpaymentrequest="" frameborder="0" srcdoc='{html_iframe}'></iframe>"""
219
+ )
220
+ #'''
221
 
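The `replace("'", "\"")` step above exists because the generated page is injected through the iframe's `srcdoc` attribute, which is delimited with single quotes in the returned markup. A minimal sketch of that wrapping (the `to_srcdoc_iframe` name and the trimmed sandbox list are illustrative, not the app's exact markup):

```python
def to_srcdoc_iframe(html: str, height: str = "100vh") -> str:
    """Wrap generated HTML for embedding via an iframe's srcdoc attribute.

    srcdoc is delimited with single quotes below, so any single quotes
    inside the document are swapped for double quotes first, mirroring
    the replace() call in the app.
    """
    safe = html.replace("'", '"')
    return (
        f'<iframe style="width: 100%; height: {height}; margin: 0 auto" '
        'sandbox="allow-scripts allow-same-origin" frameborder="0" '
        f"srcdoc='{safe}'></iframe>"
    )
```

In the app, the wrapped string is handed to `gr.update(value=...)` on an HTML component.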
222
  # Utility: Get all markdown files in a folder
223
  def get_markdown_files(folder: str) -> list[str]:
 
260
 
261
  if custom_system_prompt:
262
  self.system_prompt = custom_system_prompt
263
+ ''' ## system_prompt now in gradio ui
264
  else:
265
  self.system_prompt = """
266
  You are a domain expert on Cybersecurity, the South Africa landscape and South African legislation.
 
288
  - For instance, maintain a single node for Protection of Information Act, Protection of Information Act, 1982, Protection of Information Act No 84, 1982.
289
  - However, have a separate node for Protection of Personal Information Act, 2013; as it is a separate legislation.
290
  - Also take note that 'Republic of South Africa' is an official geo entity while 'South Africa' is a referred-to place, although also a geo entity:
291
+ - Always watch the context and be careful of lumping them together.
292
  """
293
+ '''
294
 
295
  return self.system_prompt
296
 
 
378
  logger.debug(f"Sending messages to Gemini: Model: {self.llm_model_name.rpartition('/')[-1]} \n~ Message: {prompt}")
379
  logger_kg.log(level=20, msg=f"Sending messages to Gemini: Model: {self.llm_model_name.rpartition('/')[-1]} \n~ Message: {prompt}")
380
 
381
+ # 2. Initialise the GenAI Client with Gemini API Key
382
  client = Client(api_key=self.llm_api_key) #api_key=gemini_api_key
383
  #aclient = genai.Client(api_key=self.llm_api_key).aio # use AsyncClient
384
 
 
474
 
475
  def _ensure_working_dir(self) -> str:
476
  """Ensure working directory exists and return status message"""
477
+
 
 
478
  if not Path(self.working_dir).exists():
479
  check_create_dir(self.working_dir)
480
  return f"Created working directory: {self.working_dir}"
481
  return f"Working directory exists: {self.working_dir}"
482
 
483
+ ##SMY: //TODO: Gradio toggle button
484
+ async def _clear_old_data_files(self):
485
+ """Clear old data files"""
486
+ files_to_delete = [
487
+ "graph_chunk_entity_relation.graphml",
488
+ "kv_store_doc_status.json",
489
+ "kv_store_full_docs.json",
490
+ "kv_store_text_chunks.json",
491
+ "vdb_chunks.json",
492
+ "vdb_entities.json",
493
+ "vdb_relationships.json",
494
+ ]
495
+
496
+ for file in files_to_delete:
497
+ file_path = Path(self.working_dir) / file
498
+ if file_path.exists():
499
+ file_path.unlink()
500
+ logger_kg.log(level=20, msg=f"LightRAG class: Deleting old files", extra={"filepath": file_path.name})
501
 
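The reset added above can be exercised in isolation. This sketch mirrors the filenames listed in the diff but uses a plain synchronous function in place of the app's async method:

```python
from pathlib import Path

# Artefacts LightRAG writes into its working directory (as listed above)
LIGHTRAG_DATA_FILES = [
    "graph_chunk_entity_relation.graphml",
    "kv_store_doc_status.json",
    "kv_store_full_docs.json",
    "kv_store_text_chunks.json",
    "vdb_chunks.json",
    "vdb_entities.json",
    "vdb_relationships.json",
]

def clear_working_dir(working_dir: str) -> list:
    """Delete known LightRAG artefacts from working_dir; return the names removed."""
    removed = []
    for name in LIGHTRAG_DATA_FILES:
        path = Path(working_dir) / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed
```

Because only known filenames are touched, any user-generated HTML visualisations in the same directory survive the reset.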
502
  async def _initialise_storages(self) -> str:
503
  #def _initialise_storages(self) -> str:
 
523
  #print(f"_embedding_func: llm_api_key_embed: {self.llm_api_key_embed}")
524
  #print(f"_embedding_func: llm_baseurl_embed: {self.llm_baseurl_embed}")
525
 
526
 
527
+ if self.working_dir_reset:
528
+ # Clear old data files
529
+ await self._clear_old_data_files()
530
+
531
  # Get embedding
532
  if self.embed_backend == "Transformer" or self.embed_backend[0] == "Transformer":
533
  logger_kg.log(level=20, msg=f"Getting embeddings dynamically through _embedding_func: ",
 
565
  await self._initialise_storages()
566
 
567
  #await rag.initialize_storages()
568
+ #await initialize_pipeline_status() ##SMY: still relevant in updated lightRAG? - """Asynchronously finalise the storages"""
569
 
570
  self.status = f"Storages and pipeline initialised successfully" ##SMY: debug
571
  logger_kg.log(level=20, msg=f"Storages and pipeline initialised successfully")
 
577
 
578
  @handle_errors
579
  #def setup(self, data_folder: str, working_dir: str, llm_backend: str,
580
+ async def setup(self, data_folder: str, working_dir: str, wdir_reset: bool, llm_backend: str, embed_backend: str,
581
  openai_key: str, openai_baseurl: str, openai_baseurl_embed: str, llm_model_name: str,
582
+ llm_model_embed: str, ollama_host: str, embed_key: str, system_prompt: str) -> str:
583
  """Set up LightRAG with specified configuration"""
584
  # Configure environment
585
  #os.environ["OPENAI_API_KEY"] = openai_key or os.getenv("OPENAI_API_KEY", "")
 
589
  #os.environ["OPENAI_API_EMBED_BASE"] = openai_baseurl_embed or os.getenv("OPENAI_API_EMBED_BASE") #, "http://localhost:1234/v1/embeddings")
590
 
591
  # Update instance state
592
+ self.data_folder = data_folder ##SMY: redundant
593
  self.working_dir = working_dir
594
+ self.working_dir_reset = wdir_reset
595
  self.llm_backend = llm_backend
596
  self.embed_backend = embed_backend if isinstance(embed_backend, str) else embed_backend[0],
597
  self.llm_model_name = llm_model_name
 
609
  except Exception as e:
610
  self.status = f"LightRAG initialisation.setup: working dir err | {str(e)}"
611
 
612
+ # Initialise lightRAG with storages
613
  try:
614
  #self.rag = wrap_async( self._initialise_rag)
615
  self.rag = await self._initialise_rag()
 
645
  '''
646
 
647
  @handle_errors
648
+ async def index_documents(self, data_folder: Union[list[str], str]) -> Tuple[str, str]:
649
  #def index_documents(self, data_folder: str) -> Tuple[str, str]:
650
  """Index markdown documents with progress tracking"""
651
  if not self._is_initialised or self.rag is None:
652
  return "Please initialise LightRAG first using the 'Initialise App' button.", "Not started"
653
 
654
+ #md_files = get_markdown_files(data_folder) #data_folder is now a list of uploaded files
655
+ #if not md_files:
656
+ # return f"No markdown files found in {data_folder}:", "No files"
657
+ md_files = data_folder
658
  if not md_files:
659
+ return f"No markdown files uploaded: {data_folder}", "No files"
660
 
661
  try:
662
  total_files = len(md_files)
 
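Since `index_documents` now receives the uploaded paths directly, the per-file work reduces to iterating that list. A rough sketch of the read-with-progress step (`read_markdown_files` is a hypothetical helper, not the app's method):

```python
from pathlib import Path

def read_markdown_files(paths: list):
    """Yield (filename, text, progress) for each uploaded markdown path."""
    total = len(paths)
    for i, path in enumerate(paths, start=1):
        text = Path(path).read_text(encoding="utf-8")
        yield Path(path).name, text, f"{i}/{total}"
```

In the app, each `text` would feed the RAG insert call while the progress string updates the status display.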
759
  """Display knowledge graph visualisation"""
760
  ## graphml_path: defaults to lightRAG's generated graph_chunk_entity_relation.graphml
761
  ## working_dir: lightRAG's working directory set by user
762
+
 
 
763
  graphml_path = Path(self.working_dir) / "graph_chunk_entity_relation.graphml"
764
  if not Path(graphml_path).exists():
765
  return "Knowledge graph file not found. Please index documents first to generate Knowledge Graph."
 
777
 
778
 
779
  ############
 
780
 
781
  '''
782
  ##SMY: record only. for deletion
utils/file_utils.py CHANGED
@@ -131,6 +131,40 @@ def create_temp_folder(tempfolder: Optional[str | Path] = '', program_name: str
131
 
132
  return output_dir
133
 
134
 
135
  ##=========
136
  def find_file(file_name: str) -> Path: #configparser.ConfigParser:
@@ -194,6 +228,8 @@ def resolve_grandparent_object(gp_object:str):
194
  ###
195
  # Create a Path object based on current file's location, resolve it to an absolute path,
196
  # and then get its parent's parent using chained .parent calls or the parents[] attribute.
 
 
197
 
198
  # 1. Get the current script's path, its parent and its grandparent directory
199
  try:
@@ -339,7 +375,7 @@ def accumulate_files(uploaded_files, current_state):
339
 
340
  from globals_config import config_load
341
  import gradio as gr
342
- # Initialize state if it's the first run
343
  if current_state is None:
344
  current_state = []
345
 
 
131
 
132
  return output_dir
133
 
134
+ def accumulate_dir(uploaded_files, current_state, ext: Union[str, tuple] = (".md", "md")):
135
+ """Accumulate newly uploaded files matching ext into the existing state."""
136
+
137
+ import gradio as gr
138
+
139
+ # Initialise state if it's the first run
140
+ if current_state is None:
141
+ current_state = []
142
+
143
+ # Check if files were uploaded in the current iteration, return the current state.
144
+ if not uploaded_files:
145
+ return current_state, gr.update(), gr.update(visible=True, value="No new files uploaded"), gr.update(value="No new files uploaded")
146
+
147
+ # call is_file_with_extension to check if pathlib.Path object is a file and has a non-empty extension
148
+ #new_file_paths = [f.name for f in uploaded_files if is_file_with_extension(Path(f.name))] #Path(f.name) and Path(f.name).is_file() and bool(Path(f.name).suffix)] #Path(f.name).suffix.lower() !=""]
149
+ new_file_paths = [f.name for f in uploaded_files if is_file_with_extension(Path(f.name)) and f.name.endswith(ext)]
150
+
151
+ # Concatenate the new files with the existing ones in the state
152
+ updated_files = current_state + new_file_paths
153
+ updated_filenames = [Path(f).name for f in updated_files] ##SMY: filenames only
154
+
155
+ updated_files_count = len(updated_files)
156
+
157
+ # Return the updated state and a message to the user
158
+ filename_info = "\n".join(updated_filenames) ##SMY: newline-joined filenames for display
159
+ #message = f"Accumulated {len(updated_files)} file(s) total: \n{filename_info}"
160
+ message_count = f"Accumulated {updated_files_count} file(s) total."
161
+ message = f"Accumulated {updated_files_count} file(s) total: \n{filename_info}"
162
+
163
+
164
+ #outputs=[state_uploaded_file_list, dir_btn, upload_count_md, status_box],
165
+ #return updated_files, updated_files_count, message, gr.update(interactive=True), gr.update(interactive=True)
166
+ return updated_files, gr.update(interactive=True,), gr.update(visible=True, value=message_count), gr.update(value=message)
167
+
168
 
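Stripped of the `gr.update()` plumbing, the accumulation in `accumulate_dir` reduces to a pure merge keyed on file extension. A sketch under that assumption (`accumulate_paths` is illustrative, not the repo function):

```python
from pathlib import Path

def accumulate_paths(uploaded, state, ext: tuple = (".md",)):
    """Merge newly uploaded paths into the running state, keeping only
    files whose suffix is in ext; return the new state and a summary."""
    state = list(state or [])
    new_paths = [p for p in (uploaded or []) if Path(p).suffix.lower() in ext]
    merged = state + new_paths
    return merged, f"Accumulated {len(merged)} file(s) total."
```

The real function additionally returns `gr.update()` objects to enable the directory button and surface the running count and filename list in the UI.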
169
  ##=========
170
  def find_file(file_name: str) -> Path: #configparser.ConfigParser:
 
228
  ###
229
  # Create a Path object based on current file's location, resolve it to an absolute path,
230
  # and then get its parent's parent using chained .parent calls or the parents[] attribute.
231
+
232
+ #import sys
233
 
234
  # 1. Get the current script's path, its parent and its grandparent directory
235
  try:
 
375
 
376
  from globals_config import config_load
377
  import gradio as gr
378
+ # Initialise state if it's the first run
379
  if current_state is None:
380
  current_state = []
381
 
utils/llm_login.py CHANGED
@@ -29,7 +29,7 @@ def get_login_token( api_token_arg, oauth_token):
29
 
30
  def login_huggingface(token: Optional[str] = None):
31
  """
32
- Login to Hugging Face account. Prioritize CLI login for privacy and determinism.
33
 
34
  Attempts to log in to Hugging Face Hub.
35
  First, it tries to log in interactively via the Hugging Face CLI.
 
29
 
30
  def login_huggingface(token: Optional[str] = None):
31
  """
32
+ Login to Hugging Face account. Prioritise CLI login for privacy and determinism.
33
 
34
  Attempts to log in to Hugging Face Hub.
35
  First, it tries to log in interactively via the Hugging Face CLI.