lailaelkoussy committed on
Commit f766c60
2 Parent(s): 7823b94 f62ca2a

Merge branch 'main' of https://huggingface.co/spaces/lailaelkoussy/transformers-library-knowledge-graph

Files changed (1)
  1. README.md +274 -67
README.md CHANGED
@@ -6,45 +6,205 @@ colorTo: purple
  sdk: docker
  app_port: 7860
  pinned: false
- tags:
- - building-mcp-track-enterprise
  ---
 
- # Knowledge Graph MCP Explorer
 
- This is a Gradio-based interactive tool for exploring code repository knowledge graphs. It provides a web interface to search, navigate, and analyze code relationships using the Model Context Protocol (MCP).
 
- ## Features
 
- - **Search Nodes**: Search for code entities, functions, classes, and more using semantic search
- - **Graph Navigation**: Explore relationships between code elements
- - **Entity Tracking**: View declared and called entities within code chunks
- - **Path Finding**: Find paths between different nodes in the knowledge graph
- - **Subgraph Extraction**: Extract and visualize subgraphs around specific nodes
- - **File Structure**: View the hierarchical structure of the repository
 
- ## Usage
 
- The application loads a pre-built knowledge graph from the HuggingFace Transformers repository. You can:
 
- 1. **Search**: Use the search tab to find relevant code snippets and entities
- 2. **Explore**: Navigate through the graph using node IDs
- 3. **Analyze**: Get statistics about the code structure and relationships
 
- ## Technical Details
 
- - Built with Gradio for the web interface
- - Uses LanceDB for efficient code indexing and search
- - Supports hybrid search (keyword + semantic embeddings)
- - Pre-computed embeddings using Salesforce/SFR-Embedding-Code-400M_R model
 
- ## Data Sources
 
- The application supports loading knowledge graphs from:
 
- ### 1. HuggingFace Hub Dataset (Recommended)
 
- Load directly from a HuggingFace dataset:
  ```bash
  python gradio_mcp.py --host 0.0.0.0 --port 7860 --hf-dataset "username/dataset-name"
@@ -58,9 +218,17 @@ Use a local JSON file (e.g., `multihop_knowledge_graph_with_embeddings.json`):
  python gradio_mcp.py --host 0.0.0.0 --port 7860 --graph-file data/multihop_knowledge_graph_with_embeddings.json
  ```
 
- ### Creating and Publishing a Dataset
 
- You can save an existing knowledge graph to HuggingFace Hub:
 
  ```python
  from RepoKnowledgeGraphLib import RepoKnowledgeGraph
@@ -75,63 +243,102 @@ kg.to_hf_dataset("username/my-knowledge-graph", save_embeddings=False, private=F
  kg.to_hf_dataset("username/my-knowledge-graph-with-embeddings", save_embeddings=True)
  ```
 
- ## Docker Configuration
 
- The default Dockerfile uses a local JSON file. To use HuggingFace datasets instead, modify the CMD line in `Dockerfile`:
 
- ```dockerfile
- # Using HuggingFace dataset (recommended for smaller Docker image)
- CMD ["python", "-u", "gradio_mcp.py", "--host", "0.0.0.0", "--port", "7860", "--hf-dataset", "username/dataset-name"]
-
- # Using local file (requires large data file in image)
- CMD ["python", "-u", "gradio_mcp.py", "--host", "0.0.0.0", "--port", "7860", "--graph-file", "/app/data/multihop_knowledge_graph_with_embeddings.json"]
- ```
 
- ## Local Development
 
- To run locally:
 
- ```bash
- docker build -t gradio-mcp-space .
- docker run -p 7860:7860 gradio-mcp-space
- ```
 
- Or without Docker:
 
- ```bash
- pip install -r requirements.txt
- python gradio_mcp.py --host 0.0.0.0 --port 7860 --hf-dataset "username/dataset-name"
- ```
 
- ## Deployment to HuggingFace Spaces
 
- ### Option 1: Using HuggingFace Dataset (Recommended)
 
- 1. First, push your knowledge graph to a HuggingFace dataset
- 2. Update the Dockerfile CMD to use `--hf-dataset`
- 3. Push to the Space repository (no large files needed)
 
- ### Option 2: Using Local JSON File
 
- 1. Create a new Space on HuggingFace with Docker SDK
- 2. Enable Git LFS in your Space repository
- 3. Push this directory to the Space repository:
- ```bash
- git lfs install
- git lfs track "data/*.json"
- git add .
- git commit -m "Initial commit"
- git push
- ```
 
- ## 👥 Team
 
- **Team Name:** CEPIA Ionis Team
 
- **Team Members:**
- - **Laila ELKOUSSY** - [@lailaelkoussy](https://huggingface.co/lailaelkoussy) - Research Engineer, Data Scientist
- - **Julien PEREZ** - [@jnm38](https://huggingface.co/jnm38) -
  sdk: docker
  app_port: 7860
  pinned: false
+ tags:
+ - building-mcp-track-enterprise
+ short_description: MCP server for big code — explore Transformers
  ---
 
+ # 🎓 Code Knowledge Graph MCP Server
 
+ > **Helping LLM-based agents navigate and understand large codebases**
 
+ ## 📚 What is this project?
 
+ This project provides a [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) server that transforms code repositories into navigable **knowledge graphs**. It enables Large Language Model (LLM) based agents to efficiently explore, understand, and reason about complex codebases — a critical capability for modern software engineering education and practice.
 
+ ## 🔬 Use Case: EPITA Coding Courses
 
+ This project was developed with **educational applications** in mind, specifically to support **EPITA coding courses**:
 
+ ### 🔍 Enhanced Code Discovery for Agents
 
+ LLM-based coding agents can use this tool to **better discover and navigate large repositories**. Instead of blindly searching through files, agents can:
 
+ - Query the knowledge graph to understand the overall architecture
+ - Follow relationships between modules, classes, and functions
+ - Identify entry points and critical code paths
+ - Understand how different parts of the codebase interact
 
+ ### 📈 Detecting Areas for Code Improvement
 
+ For EPITA courses, this tool helps agents **identify areas where student code can be improved**:
 
+ - **Dead Code Detection**: Find unused functions, classes, or variables
+ - **Circular Dependencies**: Detect problematic import cycles between modules
+ - **Code Coupling Analysis**: Identify tightly coupled components that should be refactored
+ - **Missing Documentation**: Find undocumented public APIs and complex functions
+ - **Complexity Hotspots**: Locate chunks with many outgoing calls (high coupling)
+ - **Orphan Code**: Detect code that is declared but never called
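Several of these checks reduce to simple queries over the graph's call and import edges. A minimal pure-Python sketch of orphan detection and import-cycle detection (toy data and invented entity names, not the library's actual API):

```python
# Toy call/import graphs: entity -> entities it references.
# All names here are hypothetical illustration data.
calls = {
    "main": {"parse_args", "run"},
    "run": {"load_config"},
    "load_config": set(),
    "old_helper": set(),   # declared but never called anywhere
    "parse_args": set(),
}
imports = {
    "app": {"models"},
    "models": {"utils"},
    "utils": {"app"},      # closes an import cycle
}

def orphans(graph):
    """Entities that are declared but never referenced by another entity."""
    referenced = set().union(*graph.values())
    return sorted(set(graph) - referenced - {"main"})  # keep the entry point

def has_cycle(graph):
    """Detect a cycle with a depth-first search and three-color marking."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def visit(n):
        color[n] = GRAY
        for m in graph.get(n, ()):
            if color.get(m, WHITE) == GRAY:
                return True          # back edge found, so there is a cycle
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

print(orphans(calls))      # -> ['old_helper']
print(has_cycle(imports))  # -> True
```

A real implementation would read these edge sets from the knowledge graph's `calls` and `imports` relationships rather than from literal dicts.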
 
+ ### 🎓 EPITA Course Integration
+
+ - **Project Reviews**: Quickly understand student project architectures before grading
+ - **Automated Feedback**: Integrate with LLM tutors to provide targeted improvement suggestions
+ - **Code Quality Assessment**: Consistent evaluation criteria across student submissions
+ - **Learning Tool**: Help students navigate and understand unfamiliar codebases (e.g., open-source projects)
+ - **Research**: Study code organization patterns across student projects
+
+ The MCP interface makes it easy to integrate with any LLM-based tutoring or code review system used in EPITA courses.
+
+ ---
+
+ ### 🎯 The Problem We Solve
+
+ At **EPITA** (École pour l'informatique et les techniques avancées), students work on increasingly complex software projects throughout their curriculum. Understanding large codebases — whether their own, their teammates', or open-source libraries — is a fundamental skill for any computer science engineer.
+
+ However, LLM-based coding assistants face significant challenges when working with large repositories:
+
+ - **Context window limitations**: LLMs cannot process entire codebases at once
+ - **Lack of structural awareness**: Without understanding how code is organized, LLMs struggle to locate relevant files
+ - **Missing relationships**: Function calls, class inheritance, and module dependencies are not immediately visible
+ - **Inefficient search**: Simple keyword search fails to capture semantic meaning
+
+ ### 💡 Our Solution: Knowledge Graphs + MCP
+
+ This project addresses these challenges by:
+
+ 1. **Parsing repositories** into a structured knowledge graph (files → chunks → entities)
+ 2. **Extracting relationships** between code elements (calls, contains, declares, imports)
+ 3. **Indexing content** with hybrid search (semantic embeddings + keyword matching)
+ 4. **Exposing tools via MCP** that allow LLM agents to navigate the codebase intelligently
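As a rough illustration of steps 1 and 2, the files → chunks → entities layering can be modeled as a typed edge store. This sketch uses plain dicts and invented node names; the real library uses its own node and index classes:

```python
from collections import defaultdict

# Typed edge store: (source, relation) -> set of targets.
# Hypothetical mini-graph mirroring the files -> chunks -> entities layering.
edges = defaultdict(set)

def add_edge(src, rel, dst):
    edges[(src, rel)].add(dst)

# A file contains chunks; chunks declare and call entities; files import files.
add_edge("utils.py", "contains", "utils.py::chunk0")
add_edge("utils.py::chunk0", "declares", "slugify")
add_edge("app.py", "contains", "app.py::chunk0")
add_edge("app.py::chunk0", "calls", "slugify")
add_edge("app.py", "imports", "utils.py")

def neighbors(node, rel):
    """Outgoing neighbors of `node` along relation `rel`."""
    return sorted(edges[(node, rel)])

# An agent can now follow typed relationships instead of grepping files:
print(neighbors("app.py::chunk0", "calls"))  # -> ['slugify']
print(neighbors("app.py", "imports"))        # -> ['utils.py']
```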
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                         CODE REPOSITORY                         │
+ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ...         │
+ │ │  File A  │ │  File B  │ │  File C  │ │  File D  │             │
+ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘             │
+ └──────┼────────────┼────────────┼────────────┼───────────────────┘
+        ▼            ▼            ▼            ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                  KNOWLEDGE GRAPH CONSTRUCTION                   │
+ │ • AST Parsing (Python, C/C++, Java, JavaScript, Rust, HTML)     │
+ │ • Entity Extraction (classes, functions, variables, methods)    │
+ │ • Relationship Detection (calls, inheritance, imports)          │
+ │ • Code Chunking & Embedding (semantic vectors)                  │
+ └────────────────────────────────┬────────────────────────────────┘
+                                  ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                       MCP SERVER (Gradio)                       │
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
+ │ │search_nodes │ │go_to_def    │ │find_usages  │ │get_neighbors│ │
+ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
+ │ │get_file_    │ │get_related  │ │find_path    │ │print_tree   │ │
+ │ │structure    │ │_chunks      │ │             │ │             │ │
+ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
+ └────────────────────────────────┬────────────────────────────────┘
+                                  ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                         LLM-BASED AGENT                         │
+ │ • Can search for relevant code using natural language           │
+ │ • Navigate from function calls to their definitions             │
+ │ • Understand the structure of files and directories             │
+ │ • Trace dependencies and relationships across the codebase      │
+ └─────────────────────────────────────────────────────────────────┘
+ ```
+
+ ## 🛠️ MCP Tools Available
+
+ The MCP server exposes the following tools for LLM agents:
+
+ | Tool                      | Description                                               |
+ | ------------------------- | --------------------------------------------------------- |
+ | `search_nodes`            | Semantic + keyword search for code chunks                 |
+ | `get_node_info`           | Detailed information about any node (file, chunk, entity) |
+ | `get_node_edges`          | Incoming and outgoing relationships of a node             |
+ | `go_to_definition`        | Find where a function/class/variable is declared          |
+ | `find_usages`             | Find all places where an entity is called/used            |
+ | `get_neighbors`           | Get all directly connected nodes                          |
+ | `get_file_structure`      | Overview of a file's chunks and entities                  |
+ | `get_related_chunks`      | Find chunks related by a specific relationship type       |
+ | `list_all_entities`       | List all tracked entities in the codebase                 |
+ | `get_graph_stats`         | Statistics about the knowledge graph                      |
+ | `find_path`               | Find shortest path between two nodes                      |
+ | `get_subgraph`            | Extract a subgraph around a node                          |
+ | `print_tree`              | Display repository structure as a tree                    |
+ | `diff_chunks`             | Compare content between two code chunks                   |
+ | `search_by_type_and_name` | Search entities by type (class, function, etc.) and name  |
+ | `get_chunk_context`       | Get a chunk with its surrounding context                  |
+
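For intuition about `find_path`, a shortest path over such a graph is a textbook breadth-first search. A self-contained sketch on a toy adjacency list (hypothetical node IDs; the server's actual implementation may differ):

```python
from collections import deque

# Toy undirected adjacency list with made-up node IDs.
adj = {
    "app.py": ["app.py::chunk0"],
    "app.py::chunk0": ["app.py", "slugify"],
    "slugify": ["app.py::chunk0", "utils.py::chunk0"],
    "utils.py::chunk0": ["slugify", "utils.py"],
    "utils.py": ["utils.py::chunk0"],
}

def find_path(start, goal):
    """Shortest node sequence from start to goal via BFS, or None."""
    prev, seen, queue = {}, {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor chain back to the start.
            path = [node]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                prev[nxt] = node
                queue.append(nxt)
    return None

print(find_path("app.py", "utils.py"))
# -> ['app.py', 'app.py::chunk0', 'slugify', 'utils.py::chunk0', 'utils.py']
```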
+ ## 🌐 Supported Languages
+
+ The knowledge graph builder uses **AST-based entity extraction** for accurate parsing:
+
+ | Language              | Parser          | Entity Types                                    |
+ | --------------------- | --------------- | ----------------------------------------------- |
+ | Python                | `ast` module    | classes, functions, methods, variables, imports |
+ | C                     | `libclang`      | functions, structs, typedefs, variables         |
+ | C++                   | `libclang`      | classes, namespaces, methods, templates         |
+ | Java                  | `javalang`      | classes, interfaces, methods, fields            |
+ | JavaScript/TypeScript | `esprima`       | classes, functions, variables, imports          |
+ | Rust                  | `tree-sitter`   | structs, enums, traits, functions, modules      |
+ | HTML                  | `BeautifulSoup` | DOM elements, inline JS extraction              |
+
+ The system also detects **API endpoints** for web frameworks (FastAPI, Flask, Spring Boot, Actix-web, etc.).
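As the table notes, Python parsing relies on the stdlib `ast` module. A minimal sketch of the kind of extraction involved (not the library's actual `EntityExtractor`):

```python
import ast

# Parse a small Python source string and collect named entities,
# in the spirit of the table above (illustrative only).
source = '''
import os

class Greeter:
    def greet(self, name):
        return f"hello {name}"

def main():
    print(Greeter().greet("world"))
'''

entities = []
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.ClassDef):
        entities.append(("class", node.name))
    elif isinstance(node, ast.FunctionDef):
        entities.append(("function", node.name))  # includes methods
    elif isinstance(node, ast.Import):
        entities.extend(("import", alias.name) for alias in node.names)

print(sorted(entities))
# -> [('class', 'Greeter'), ('function', 'greet'), ('function', 'main'), ('import', 'os')]
```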
+
+ ## 🚀 Getting Started
+
+ ### Prerequisites
+
+ - Docker & Docker Compose
+ - Python 3.10+ (for local development)
+ - CUDA-capable GPU (optional, for faster embeddings)
+
+ ### Quick Start with Docker
+
+ ```bash
+ # Start the MCP server with a sample knowledge graph
+ docker-compose up
+ ```
+
+ ### Building a Knowledge Graph from Your Repository
+
+ ```python
+ from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph
+
+ # From a local path
+ kg = RepoKnowledgeGraph.from_path(
+     "/path/to/your/repo",
+     skip_dirs=["node_modules", ".git", "__pycache__"],
+     extract_entities=True,
+     index_nodes=True
+ )
+
+ # Save for later use
+ kg.save_graph_to_file("my_knowledge_graph.json")
+ ```
+
+ ### Running the MCP Server with Gradio
+
+ ```bash
+ python gradio_mcp.py --graph-file my_knowledge_graph.json --host 0.0.0.0 --port 7860
+ ```
+
+ ## 📊 Interactive Explorer (Gradio UI)
+
+ The project includes a Gradio-based web interface for exploring knowledge graphs interactively:
+
+ - **Search**: Use natural language or keywords to find relevant code
+ - **Navigate**: Click through nodes to explore relationships
+ - **Analyze**: Get statistics about code structure and dependencies
+ - **Visualize**: View the repository tree and entity relationships
+
+ ## 📁 Data Sources
+
+ The application supports loading knowledge graphs from multiple sources:
+
+ ### 1. HuggingFace Hub Dataset (Recommended for Sharing)
+
+ Load directly from a HuggingFace dataset created by the library (cf. Publishing to HuggingFace Hub):
 
  ```bash
  python gradio_mcp.py --host 0.0.0.0 --port 7860 --hf-dataset "username/dataset-name"
 
  python gradio_mcp.py --host 0.0.0.0 --port 7860 --graph-file data/multihop_knowledge_graph_with_embeddings.json
  ```
 
+ ### 3. Direct from Git Repository
+
+ Clone and analyze a repository on the fly:
+
+ ```bash
+ python gradio_mcp.py --host 0.0.0.0 --port 7860 --repo-url "https://github.com/user/repo.git"
+ ```
+
+ ### Publishing to HuggingFace Hub
 
+ You can save an existing knowledge graph to HuggingFace Hub for sharing:
 
  ```python
  from RepoKnowledgeGraphLib import RepoKnowledgeGraph
 
  kg.to_hf_dataset("username/my-knowledge-graph-with-embeddings", save_embeddings=True)
  ```
 
+ ## 🏗️ Architecture Overview
 
+ ```
+ root/
+ ├── Dockerfile                       # Docker configuration
+ ├── requirements.txt                 # Python dependencies
+ ├── RepoKnowledgeGraphLib/           # Knowledge graph implementation
+ │   ├── RepoKnowledgeGraph.py        # Main graph class
+ │   ├── KnowledgeGraphMCPServer.py   # MCP server implementation
+ │   ├── EntityExtractor.py           # AST-based entity extraction
+ │   ├── CodeParser.py                # Code chunking
+ │   ├── CodeIndex.py                 # Hybrid search (LanceDB/Weaviate)
+ │   ├── ModelService.py              # Embedding generation
+ │   └── Node.py                      # Graph node types
+ └── gradio_mcp_space.py              # Main Gradio web interface
  ```
 
+ ## 👥 Team
 
+ **Team Name:** CEPIA Ionis Team
 
+ **Team Members:**
+ - **Laila ELKOUSSY** - [@lailaelkoussy](https://huggingface.co/lailaelkoussy) - Research Engineer, Data Scientist
+ - **Julien PEREZ** - [@jnm38](https://huggingface.co/jnm38) - Research Director
 
+ ---
 
+ ## 📄 License
 
+ This project is developed as part of research at EPITA / Ionis Group.
 
+ ## 🔗 Related Resources
 
+ - [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) - The protocol standard
+ - [Gradio](https://gradio.app/) - Python web interface framework with MCP support
+ - [LanceDB](https://lancedb.github.io/lancedb/) - Vector database for code indexing
+ - [Salesforce SFR-Embedding-Code](https://huggingface.co/Salesforce/SFR-Embedding-Code-400M_R) - Code embedding model
 
+ ## 🆚 VS Code Integration
 
+ To use this MCP server with **GitHub Copilot** in VS Code, you need to configure an `mcp.json` file.
 
+ ### Configuration File Location
 
+ Create or edit the file at `.vscode/mcp.json` in your workspace root:
 
+ ```
+ your-workspace/
+ ├── .vscode/
+ │   └── mcp.json   ← Place the configuration here
+ ├── src/
+ └── ...
+ ```
+
+ ### Configuration Content
+
+ Add the following content to `.vscode/mcp.json`:
+
+ ```jsonc
+ {
+   "servers": {
+     "transformers-code-graph": {
+       "url": "https://lailaelkoussy-transformers-library-knowledge-graph.hf.space/gradio_api/mcp/",
+       "type": "http"
+     }
+   },
+   "inputs": []
+ }
+ ```
+
+ ### What This Does
+
+ - **`servers`**: Defines the MCP servers available to VS Code
+ - **`transformers-code-graph`**: A custom name for this server connection
+ - **`url`**: The endpoint of the hosted MCP server (here pointing to the HuggingFace Space)
+ - **`type`**: Set to `"http"` for remote HTTP-based MCP servers
+
+ ### Using with Your Own Server
+
+ If you're running your own MCP server locally, update the URL accordingly:
+
+ ```jsonc
+ {
+   "servers": {
+     "my-code-graph": {
+       "url": "http://localhost:7860/gradio_api/mcp/",
+       "type": "http"
+     }
+   },
+   "inputs": []
+ }
+ ```
+
+ Once configured, GitHub Copilot in VS Code will have access to all the knowledge graph tools (`search_nodes`, `go_to_definition`, `find_usages`, etc.) to help navigate and understand your codebase.