Commit b1ddffc (verified) by lailaelkoussy · 1 Parent(s): f1dcdb0

Update README.md

  1. README.md +332 -5
README.md CHANGED
@@ -1,10 +1,337 @@
  ---
- title: Code Knowledge Graph Explorer Transformers Library
- emoji: 🏢
- colorFrom: red
- colorTo: gray
  sdk: docker
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Code Knowledge Graph Explorer - 🤗 Transformers Library
+ emoji: 🔍
+ colorFrom: blue
+ colorTo: purple
  sdk: docker
+ app_port: 7860
  pinned: false
+ tags:
+ - building-mcp-track-enterprise
+ short_description: MCP server for big code - explore Transformers
  ---
+ # 🎓 Code Knowledge Graph MCP Server

+ > **Helping LLM-based agents navigate and understand large codebases**
+
+ ## 📚 What is this project?
+
+ This project provides a [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) server that transforms code repositories into navigable **knowledge graphs**. It enables Large Language Model (LLM) based agents to efficiently explore, understand, and reason about complex codebases, a critical capability for modern software engineering education and practice.
+
+ ## 🔬 Use Case: EPITA Coding Courses
+
+ This project was developed with **educational applications** in mind, specifically to support **EPITA coding courses**.
+
+ ### 🔍 Enhanced Code Discovery for Agents
+
+ LLM-based coding agents can use this tool to **better discover and navigate large repositories**. Instead of blindly searching through files, agents can:
+
+ - Query the knowledge graph to understand the overall architecture
+ - Follow relationships between modules, classes, and functions
+ - Identify entry points and critical code paths
+ - Understand how different parts of the codebase interact
+
+ ### 📈 Detecting Areas for Code Improvement
+
+ For EPITA courses, this tool helps agents **identify areas where student code can be improved**:
+
+ - **Dead Code Detection**: Find unused functions, classes, or variables
+ - **Circular Dependencies**: Detect problematic import cycles between modules
+ - **Code Coupling Analysis**: Identify tightly coupled components that should be refactored
+ - **Missing Documentation**: Find undocumented public APIs and complex functions
+ - **Complexity Hotspots**: Locate chunks with many outgoing calls (high coupling)
+ - **Orphan Code**: Detect code that is declared but never called
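
Two of the checks above, circular dependencies and orphan code, reduce to simple graph queries once imports and calls are available as edges. The following is a minimal sketch under stated assumptions: the input shapes (`imports` as an adjacency dict, `calls` as caller/callee pairs) and the function names are hypothetical illustrations, not the project's API.

```python
from collections import defaultdict


def find_import_cycles(imports: dict[str, list[str]]) -> list[list[str]]:
    """Return one cycle per back edge found by a depth-first search.

    `imports` maps a module name to the modules it imports (assumed shape).
    """
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current path / done
    color: dict[str, int] = defaultdict(int)
    stack: list[str] = []
    cycles: list[list[str]] = []

    def visit(module: str) -> None:
        color[module] = GRAY
        stack.append(module)
        for dep in imports.get(module, []):
            if color[dep] == GRAY:        # dep is on the current DFS path
                cycles.append(stack[stack.index(dep):] + [dep])
            elif color[dep] == WHITE:
                visit(dep)
        stack.pop()
        color[module] = BLACK

    for module in list(imports):
        if color[module] == WHITE:
            visit(module)
    return cycles


def find_orphans(declared: set[str],
                 calls: list[tuple[str, str]],
                 entry_points: set[str]) -> set[str]:
    """Entities that are declared but never appear as the target of a call."""
    called = {callee for _, callee in calls}
    return declared - called - entry_points


cycles = find_import_cycles({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]})
orphans = find_orphans({"main", "helper", "unused"}, [("main", "helper")], {"main"})
```

Entry points are excluded explicitly because a `main` function is legitimately never called from inside the repository.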
+
+ ### 🎓 EPITA Course Integration
+
+ - **Project Reviews**: Quickly understand student project architectures before grading
+ - **Automated Feedback**: Integrate with LLM tutors to provide targeted improvement suggestions
+ - **Code Quality Assessment**: Consistent evaluation criteria across student submissions
+ - **Learning Tool**: Help students navigate and understand unfamiliar codebases (e.g., open-source projects)
+ - **Research**: Study code organization patterns across student projects
+
+ The MCP interface makes it easy to integrate with any LLM-based tutoring or code review system used in EPITA courses.
+
+ ---
+
+ ### 🎯 The Problem We Solve
+
+ At **EPITA** (École pour l'informatique et les techniques avancées), students work on increasingly complex software projects throughout their curriculum. Understanding large codebases (whether their own, their teammates', or open-source libraries) is a fundamental skill for any computer science engineer.
+
+ However, LLM-based coding assistants face significant challenges when working with large repositories:
+
+ - **Context window limitations**: LLMs cannot process entire codebases at once
+ - **Lack of structural awareness**: Without understanding how code is organized, LLMs struggle to locate relevant files
+ - **Missing relationships**: Function calls, class inheritance, and module dependencies are not immediately visible
+ - **Inefficient search**: Simple keyword search fails to capture semantic meaning
+
+ ### 💡 Our Solution: Knowledge Graphs + MCP
+
+ This project addresses these challenges by:
+
+ 1. **Parsing repositories** into a structured knowledge graph (files → chunks → entities)
+ 2. **Extracting relationships** between code elements (calls, contains, declares, imports)
+ 3. **Indexing content** with hybrid search (semantic embeddings + keyword matching)
+ 4. **Exposing tools via MCP** that allow LLM agents to navigate the codebase intelligently
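
To make step 3 concrete, hybrid search can be pictured as a weighted blend of a vector-similarity score and a keyword-overlap score. The sketch below is only an illustration of that idea: it uses toy two-dimensional vectors in place of a real embedding model, and the `alpha` weight, the chunk dictionary layout, and the function names are assumptions rather than the project's actual index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query words that also occur in the chunk text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def hybrid_rank(query: str, query_vec: list[float],
                chunks: list[dict], alpha: float = 0.5) -> list[str]:
    """Rank chunk ids by alpha * semantic + (1 - alpha) * keyword score."""
    scored = []
    for chunk in chunks:
        score = (alpha * cosine(query_vec, chunk["vector"])
                 + (1 - alpha) * keyword_score(query, chunk["text"]))
        scored.append((score, chunk["id"]))
    return [chunk_id for _, chunk_id in sorted(scored, reverse=True)]


# Toy data: in practice the vectors would come from a code embedding model.
chunks = [
    {"id": "model.py#3", "text": "def forward pass", "vector": [0.0, 1.0]},
    {"id": "tokenizer.py#0", "text": "def load tokenizer from file", "vector": [1.0, 0.0]},
]
ranking = hybrid_rank("load tokenizer", [0.9, 0.1], chunks)
```

Setting `alpha` closer to 1 favors semantic similarity; closer to 0 favors exact keyword matches such as identifier names.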
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                         CODE REPOSITORY                         │
+ │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
+ │  │  File A  │  │  File B  │  │  File C  │  │  File D  │  ...    │
+ │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
+ └───────┼─────────────┼─────────────┼─────────────┼───────────────┘
+         ▼             ▼             ▼             ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                  KNOWLEDGE GRAPH CONSTRUCTION                   │
+ │  • AST Parsing (Python, C/C++, Java, JavaScript, Rust, HTML)    │
+ │  • Entity Extraction (classes, functions, variables, methods)   │
+ │  • Relationship Detection (calls, inheritance, imports)         │
+ │  • Code Chunking & Embedding (semantic vectors)                 │
+ └───────────────────────────────┬─────────────────────────────────┘
+                                 ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                       MCP SERVER (Gradio)                       │
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
+ │ │search_nodes │ │go_to_def    │ │find_usages  │ │get_neighbors│ │
+ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
+ │ │get_file_    │ │get_related  │ │find_path    │ │print_tree   │ │
+ │ │structure    │ │_chunks      │ │             │ │             │ │
+ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
+ └───────────────────────────────┬─────────────────────────────────┘
+                                 ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                         LLM-BASED AGENT                         │
+ │  • Can search for relevant code using natural language          │
+ │  • Navigate from function calls to their definitions            │
+ │  • Understand the structure of files and directories            │
+ │  • Trace dependencies and relationships across the codebase     │
+ └─────────────────────────────────────────────────────────────────┘
+ ```
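
The file → chunk → entity layering in the diagram can be modeled as typed nodes joined by typed edges. The class below is a deliberately small sketch of that data model; the class name, method names, and node id format are hypothetical and do not mirror `RepoKnowledgeGraph` itself.

```python
from dataclasses import dataclass, field


@dataclass
class TinyKnowledgeGraph:
    """Hypothetical in-memory graph: typed nodes plus (source, relation, target) edges."""
    nodes: dict[str, dict] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node_id: str, node_type: str, **attrs) -> None:
        self.nodes[node_id] = {"type": node_type, **attrs}

    def add_edge(self, source: str, relation: str, target: str) -> None:
        self.edges.append((source, relation, target))

    def neighbors(self, node_id: str) -> set[str]:
        """All nodes directly connected to node_id, in either direction."""
        outgoing = {t for s, _, t in self.edges if s == node_id}
        incoming = {s for s, _, t in self.edges if t == node_id}
        return outgoing | incoming


kg = TinyKnowledgeGraph()
kg.add_node("utils.py", "file")
kg.add_node("utils.py#0", "chunk", text="def slugify(s): ...")
kg.add_node("slugify", "entity", kind="function")
kg.add_edge("utils.py", "contains", "utils.py#0")     # file -> chunk
kg.add_edge("utils.py#0", "declares", "slugify")      # chunk -> entity
```

Keeping the relation name on every edge is what lets later queries distinguish, for example, where an entity is declared from where it is called.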
+
+ ## 🛠️ MCP Tools Available
+
+ The MCP server exposes the following tools for LLM agents:
+
+ | Tool                      | Description                                               |
+ | ------------------------- | --------------------------------------------------------- |
+ | `search_nodes`            | Semantic + keyword search for code chunks                 |
+ | `get_node_info`           | Detailed information about any node (file, chunk, entity) |
+ | `get_node_edges`          | Incoming and outgoing relationships of a node             |
+ | `go_to_definition`        | Find where a function/class/variable is declared          |
+ | `find_usages`             | Find all places where an entity is called/used            |
+ | `get_neighbors`           | Get all directly connected nodes                          |
+ | `get_file_structure`      | Overview of a file's chunks and entities                  |
+ | `get_related_chunks`      | Find chunks related by a specific relationship type       |
+ | `list_all_entities`       | List all tracked entities in the codebase                 |
+ | `get_graph_stats`         | Statistics about the knowledge graph                      |
+ | `find_path`               | Find shortest path between two nodes                      |
+ | `get_subgraph`            | Extract a subgraph around a node                          |
+ | `print_tree`              | Display repository structure as a tree                    |
+ | `diff_chunks`             | Compare content between two code chunks                   |
+ | `search_by_type_and_name` | Search entities by type (class, function, etc.) and name  |
+ | `get_chunk_context`       | Get a chunk with its surrounding context                  |
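
Several of these tools are, conceptually, small queries over typed edges. The sketch below shows what `go_to_definition`, `find_usages`, and `find_path` could look like on a toy edge list. It borrows the tool names for readability only: the edge data, the signatures, and the breadth-first search are assumptions for illustration, not the server's implementation.

```python
from collections import deque

# Toy edges as (source, relation, target); ids and relations are illustrative.
EDGES = [
    ("app.py#0", "declares", "main"),
    ("app.py#0", "calls", "slugify"),
    ("utils.py#0", "declares", "slugify"),
    ("cli.py#2", "calls", "slugify"),
]


def go_to_definition(entity: str, edges=EDGES) -> list[str]:
    """Chunks that declare the entity."""
    return [s for s, rel, t in edges if rel == "declares" and t == entity]


def find_usages(entity: str, edges=EDGES) -> list[str]:
    """Chunks that call the entity."""
    return [s for s, rel, t in edges if rel == "calls" and t == entity]


def find_path(start: str, goal: str, edges=EDGES):
    """Shortest path by breadth-first search, treating edges as undirected."""
    adjacency: dict[str, set[str]] = {}
    for s, _, t in edges:
        adjacency.setdefault(s, set()).add(t)
        adjacency.setdefault(t, set()).add(s)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(adjacency.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection between the two nodes
```

For example, `find_path("app.py#0", "utils.py#0")` walks from the calling chunk through the shared `slugify` entity to the chunk that declares it.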
+
+ ## 🌐 Supported Languages
+
+ The knowledge graph builder uses **AST-based entity extraction** for accurate parsing:
+
+ | Language              | Parser          | Entity Types                                    |
+ | --------------------- | --------------- | ----------------------------------------------- |
+ | Python                | `ast` module    | classes, functions, methods, variables, imports |
+ | C                     | `libclang`      | functions, structs, typedefs, variables         |
+ | C++                   | `libclang`      | classes, namespaces, methods, templates         |
+ | Java                  | `javalang`      | classes, interfaces, methods, fields            |
+ | JavaScript/TypeScript | `esprima`       | classes, functions, variables, imports          |
+ | Rust                  | `tree-sitter`   | structs, enums, traits, functions, modules      |
+ | HTML                  | `BeautifulSoup` | DOM elements, inline JS extraction              |
+
+ The system also detects **API endpoints** for web frameworks (FastAPI, Flask, Spring Boot, Actix-web, etc.).
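
For the Python row of the table, the standard-library `ast` module is enough to show the idea of AST-based extraction. This is a minimal sketch, not the project's `EntityExtractor`: the `extract_entities` function and its output layout are hypothetical, and it only inspects top-level statements and the direct members of top-level classes.

```python
import ast

SOURCE = '''
import os
from pathlib import Path

class Repo:
    def files(self):
        return os.listdir(".")

def main():
    Repo().files()
'''


def extract_entities(source: str) -> dict[str, list[str]]:
    """Collect top-level classes, functions, methods, and imported modules."""
    tree = ast.parse(source)
    found = {"classes": [], "functions": [], "methods": [], "imports": []}
    function_nodes = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            found["classes"].append(node.name)
            for item in node.body:
                if isinstance(item, function_nodes):
                    found["methods"].append(f"{node.name}.{item.name}")
        elif isinstance(node, function_nodes):
            found["functions"].append(node.name)
        elif isinstance(node, ast.Import):
            found["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            found["imports"].append(node.module or "")
    return found


entities = extract_entities(SOURCE)
```

Because the parser works on syntax rather than text patterns, names inside strings or comments are never mistaken for declarations.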
+
+ ## 🚀 Getting Started
+
+ ### Prerequisites
+
+ - Docker & Docker Compose
+ - Python 3.10+ (for local development)
+ - CUDA-capable GPU (optional, for faster embeddings)
+
+ ### Quick Start with Docker
+
+ ```bash
+ # Start the MCP server with a sample knowledge graph
+ docker-compose up
+ ```
+
+ ### Building a Knowledge Graph from Your Repository
+
+ ```python
+ from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph
+ # From a local path
+ kg = RepoKnowledgeGraph.from_path(
+     "/path/to/your/repo",
+     skip_dirs=["node_modules", ".git", "__pycache__"],
+     extract_entities=True,
+     index_nodes=True
+ )
+ # Save for later use
+ kg.save_graph_to_file("my_knowledge_graph.json")
+ ```
+
+ ### Running the MCP Server with Gradio
+
+ ```bash
+ python gradio_mcp.py --graph-file my_knowledge_graph.json --host 0.0.0.0 --port 7860
+ ```
+
+ ## 📊 Interactive Explorer (Gradio UI)
+
+ The project includes a Gradio-based web interface for exploring knowledge graphs interactively:
+
+ - **Search**: Use natural language or keywords to find relevant code
+ - **Navigate**: Click through nodes to explore relationships
+ - **Analyze**: Get statistics about code structure and dependencies
+ - **Visualize**: View the repository tree and entity relationships
+
+ ## 📁 Data Sources
+
+ The application supports loading knowledge graphs from multiple sources:
+
+ ### 1. HuggingFace Hub Dataset (Recommended for Sharing)
+
+ Load directly from a HuggingFace dataset created by the library (see "Publishing to HuggingFace Hub"):
+
+ ```bash
+ python gradio_mcp.py --host 0.0.0.0 --port 7860 --hf-dataset "username/dataset-name"
+ ```
+
+ ### 2. Local JSON File
+
+ Use a local JSON file (e.g., `multihop_knowledge_graph_with_embeddings.json`):
+
+ ```bash
+ python gradio_mcp.py --host 0.0.0.0 --port 7860 --graph-file data/multihop_knowledge_graph_with_embeddings.json
+ ```
+
+ ### 3. Direct from Git Repository
+
+ Clone and analyze a repository on the fly:
+
+ ```bash
+ python gradio_mcp.py --host 0.0.0.0 --port 7860 --repo-url "https://github.com/user/repo.git"
+ ```
+
+ ### Publishing to HuggingFace Hub
+
+ You can save an existing knowledge graph to HuggingFace Hub for sharing:
+
+ ```python
+ from RepoKnowledgeGraphLib import RepoKnowledgeGraph
+ # Load from local file
+ kg = RepoKnowledgeGraph.load("path/to/graph.json")
+ # Push to HuggingFace Hub (without embeddings to reduce size)
+ kg.to_hf_dataset("username/my-knowledge-graph", save_embeddings=False, private=False)
+ # Or with embeddings (larger dataset)
+ kg.to_hf_dataset("username/my-knowledge-graph-with-embeddings", save_embeddings=True)
+ ```
+
+
+ ## 🏗️ Architecture Overview
+
+ ```
+ root/
+ ├── Dockerfile                        # Docker configuration
+ ├── requirements.txt                  # Python dependencies
+ ├── RepoKnowledgeGraphLib/            # Knowledge graph implementation
+ │   ├── RepoKnowledgeGraph.py         # Main graph class
+ │   ├── KnowledgeGraphMCPServer.py    # MCP server implementation
+ │   ├── EntityExtractor.py            # AST-based entity extraction
+ │   ├── CodeParser.py                 # Code chunking
+ │   ├── CodeIndex.py                  # Hybrid search (LanceDB/Weaviate)
+ │   ├── ModelService.py               # Embedding generation
+ │   └── Node.py                       # Graph node types
+ └── gradio_mcp_space.py               # Main Gradio web interface
+ ```
+
+
+
+
+
+ ## 👥 Team
+
+ **Team Name:** CEPIA Ionis Team
+
+ **Team Members:**
+ - **Laila ELKOUSSY** - [@lailaelkoussy](https://huggingface.co/lailaelkoussy) - Research Engineer, Data Scientist
+ - **Julien PEREZ** - [@jnm38](https://huggingface.co/jnm38) - Research Director
+
+ ---
+
+ ## 📄 License
+
+ This project is developed as part of research at EPITA / Ionis Group.
+
+ ## 🔗 Related Resources
+
+ - [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) - The protocol standard
+ - [Gradio](https://gradio.app/) - Python web interface framework with MCP support
+ - [LanceDB](https://lancedb.github.io/lancedb/) - Vector database for code indexing
+ - [Salesforce SFR-Embedding-Code](https://huggingface.co/Salesforce/SFR-Embedding-Code-400M_R) - Code embedding model
+
+ ## 🆚 VS Code Integration
+
+ To use this MCP server with **GitHub Copilot** in VS Code, you need to configure an `mcp.json` file.
+
+ ### Configuration File Location
+
+ Create or edit the file at `.vscode/mcp.json` in your workspace root:
+
+ ```
+ your-workspace/
+ ├── .vscode/
+ │   └── mcp.json    ← Place the configuration here
+ ├── src/
+ └── ...
+ ```
+
+ ### Configuration Content
+
+ Add the following content to `.vscode/mcp.json`:
+
+ ```jsonc
+ {
+   "servers": {
+     "transformers-code-graph": {
+       "url": "https://lailaelkoussy-transformers-library-knowledge-graph.hf.space/gradio_api/mcp/",
+       "type": "http"
+     }
+   },
+   "inputs": []
+ }
+ ```
+
+ ### What This Does
+
+ - **`servers`**: Defines the MCP servers available to VS Code
+ - **`transformers-code-graph`**: A custom name for this server connection
+ - **`url`**: The endpoint of the hosted MCP server (here pointing to the HuggingFace Space)
+ - **`type`**: Set to `"http"` for remote HTTP-based MCP servers
+
+ ### Using with Your Own Server
+
+ If you're running your own MCP server locally, update the URL accordingly:
+
+ ```jsonc
+ {
+   "servers": {
+     "my-code-graph": {
+       "url": "http://localhost:7860/gradio_api/mcp/",
+       "type": "http"
+     }
+   },
+   "inputs": []
+ }
+ ```
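
If you prefer to generate this file rather than edit it by hand, a few lines of standard-library Python are enough. This is a convenience sketch only: the `write_mcp_config` helper is hypothetical, it reproduces exactly the JSON layout shown above, and it overwrites any existing `.vscode/mcp.json` in the target workspace.

```python
import json
from pathlib import Path


def write_mcp_config(workspace: Path, name: str, url: str) -> Path:
    """Write .vscode/mcp.json with a single HTTP MCP server entry."""
    config = {
        "servers": {name: {"url": url, "type": "http"}},
        "inputs": [],
    }
    target = workspace / ".vscode" / "mcp.json"
    target.parent.mkdir(parents=True, exist_ok=True)   # create .vscode/ if missing
    target.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return target
```

For a local server, this could be called as `write_mcp_config(Path("."), "my-code-graph", "http://localhost:7860/gradio_api/mcp/")`.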
+
+ Once configured, GitHub Copilot in VS Code will have access to all the knowledge graph tools (`search_nodes`, `go_to_definition`, `find_usages`, etc.) to help navigate and understand your codebase.