---
title: Code Knowledge Graph Explorer - 🤗 Transformers Library
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
tags:
- building-mcp-track-enterprise
short_description: MCP server for big code - explore Transformers
---
# 🎓 Code Knowledge Graph MCP Server

> **Helping LLM-based agents navigate and understand large codebases**

## 📚 What is this project?

This project provides a [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) server that transforms code repositories into navigable **knowledge graphs**. It enables Large Language Model (LLM) based agents to efficiently explore, understand, and reason about complex codebases, a critical capability for modern software engineering education and practice.

## 🔬 Use Case: EPITA Coding Courses

This project was developed with **educational applications** in mind, specifically to support **EPITA coding courses**.

### 🔍 Enhanced Code Discovery for Agents

LLM-based coding agents can use this tool to **better discover and navigate large repositories**. Instead of blindly searching through files, agents can:

- Query the knowledge graph to understand the overall architecture
- Follow relationships between modules, classes, and functions
- Identify entry points and critical code paths
- Understand how different parts of the codebase interact

### 📈 Detecting Areas for Code Improvement

For EPITA courses, this tool helps agents **identify areas where student code can be improved**:

- **Dead Code Detection**: Find unused functions, classes, or variables
- **Circular Dependencies**: Detect problematic import cycles between modules
- **Code Coupling Analysis**: Identify tightly coupled components that should be refactored
- **Missing Documentation**: Find undocumented public APIs and complex functions
- **Complexity Hotspots**: Locate chunks with many outgoing calls (high coupling)
- **Orphan Code**: Detect code that is declared but never called
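
Several of these checks reduce to simple graph queries. The sketch below illustrates two of them, circular imports and orphan code, on plain dictionaries rather than on the project's actual graph class. The module names, the `imports` and `calls` dictionaries, and both helper functions are hypothetical and only meant to show the idea.

```python
from typing import Dict, List, Set


def find_import_cycle(imports: Dict[str, List[str]]) -> List[str]:
    """Return one import cycle as a list of modules, or [] if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {module: WHITE for module in imports}
    path: List[str] = []

    def visit(module: str) -> List[str]:
        color[module] = GREY
        path.append(module)
        for dep in imports.get(module, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so the path closes a cycle
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[module] = BLACK
        return []

    for module in list(imports):
        if color[module] == WHITE:
            cycle = visit(module)
            if cycle:
                return cycle
    return []


def find_orphans(calls: Dict[str, List[str]], entry_points: Set[str]) -> Set[str]:
    """Return declared functions that never appear as a callee."""
    called = {callee for callees in calls.values() for callee in callees}
    return set(calls) - called - entry_points


# Hypothetical student project: module -> imported modules, function -> callees
imports = {"app": ["models", "utils"], "models": ["utils"], "utils": ["models"]}
calls = {"main": ["load"], "load": [], "legacy_export": []}

cycle = find_import_cycle(imports)
orphans = find_orphans(calls, entry_points={"main"})
print(cycle)    # ['models', 'utils', 'models']
print(orphans)  # {'legacy_export'}
```

In a real setup, the two dictionaries would be derived from the graph's import and call relationships instead of being written by hand.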

### 🎓 EPITA Course Integration

- **Project Reviews**: Quickly understand student project architectures before grading
- **Automated Feedback**: Integrate with LLM tutors to provide targeted improvement suggestions
- **Code Quality Assessment**: Consistent evaluation criteria across student submissions
- **Learning Tool**: Help students navigate and understand unfamiliar codebases (e.g., open-source projects)
- **Research**: Study code organization patterns across student projects

The MCP interface makes it easy to integrate with any LLM-based tutoring or code review system used in EPITA courses.

---

### 🎯 The Problem We Solve

At **EPITA** (École pour l'informatique et les techniques avancées), students work on increasingly complex software projects throughout their curriculum. Understanding large codebases, whether their own, their teammates', or open-source libraries, is a fundamental skill for any computer science engineer.

However, LLM-based coding assistants face significant challenges when working with large repositories:

- **Context window limitations**: LLMs cannot process entire codebases at once
- **Lack of structural awareness**: Without understanding how code is organized, LLMs struggle to locate relevant files
- **Missing relationships**: Function calls, class inheritance, and module dependencies are not immediately visible
- **Inefficient search**: Simple keyword search fails to capture semantic meaning

### 💡 Our Solution: Knowledge Graphs + MCP

This project addresses these challenges by:

1. **Parsing repositories** into a structured knowledge graph (files → chunks → entities)
2. **Extracting relationships** between code elements (calls, contains, declares, imports)
3. **Indexing content** with hybrid search (semantic embeddings + keyword matching)
4. **Exposing tools via MCP** that allow LLM agents to navigate the codebase intelligently

```
┌─────────────────────────────────────────────────────────────────┐
│                     CODE REPOSITORY                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │  File A  │  │  File B  │  │  File C  │  │  File D  │   ...   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
└───────┼─────────────┼─────────────┼─────────────┼───────────────┘
        ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────────┐
│               KNOWLEDGE GRAPH CONSTRUCTION                      │
│  • AST Parsing (Python, C/C++, Java, JavaScript, Rust, HTML)    │
│  • Entity Extraction (classes, functions, variables, methods)   │
│  • Relationship Detection (calls, inheritance, imports)         │
│  • Code Chunking & Embedding (semantic vectors)                 │
└───────────────────────────────┬─────────────────────────────────┘
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    MCP SERVER (Gradio)                          │
│ ┌─────────────┐ ┌────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │search_nodes │ │go_to_def   │ │find_usages   │ │get_neighbors│ │
│ └─────────────┘ └────────────┘ └──────────────┘ └─────────────┘ │
│ ┌─────────────┐ ┌────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │get_file_    │ │get_related │ │find_path     │ │print_tree   │ │
│ │structure    │ │_chunks     │ │              │ │             │ │
│ └─────────────┘ └────────────┘ └──────────────┘ └─────────────┘ │
└───────────────────────────────┬─────────────────────────────────┘
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    LLM-BASED AGENT                              │
│  • Can search for relevant code using natural language          │
│  • Navigate from function calls to their definitions            │
│  • Understand the structure of files and directories            │
│  • Trace dependencies and relationships across the codebase     │
└─────────────────────────────────────────────────────────────────┘
```
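
To make the files → chunks → entities layering concrete, here is a minimal sketch of a typed graph with labeled edges. It is not the library's data model: the `ToyGraph` class, the node identifiers, and the way the two lookups are written are assumptions chosen for illustration. Only the relationship names (`contains`, `declares`, `calls`) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyGraph:
    """Hypothetical stand-in for the real graph: typed nodes plus labeled edges."""

    nodes: Dict[str, str] = field(default_factory=dict)  # node id -> node type
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (source, relation, target)

    def add_node(self, node_id: str, node_type: str) -> None:
        self.nodes[node_id] = node_type

    def add_edge(self, source: str, relation: str, target: str) -> None:
        self.edges.append((source, relation, target))

    def go_to_definition(self, entity: str) -> List[str]:
        """Chunks that declare the entity."""
        return [s for s, r, t in self.edges if r == "declares" and t == entity]

    def find_usages(self, entity: str) -> List[str]:
        """Chunks that call the entity."""
        return [s for s, r, t in self.edges if r == "calls" and t == entity]


g = ToyGraph()
for node_id, node_type in [
    ("utils.py", "file"), ("utils.py#0", "chunk"),
    ("main.py", "file"), ("main.py#0", "chunk"),
    ("slugify", "entity"),
]:
    g.add_node(node_id, node_type)

g.add_edge("utils.py", "contains", "utils.py#0")  # file -> chunk
g.add_edge("main.py", "contains", "main.py#0")
g.add_edge("utils.py#0", "declares", "slugify")   # chunk -> entity it defines
g.add_edge("main.py#0", "calls", "slugify")       # chunk -> entity it uses

print(g.go_to_definition("slugify"))  # ['utils.py#0']
print(g.find_usages("slugify"))       # ['main.py#0']
```

Keeping the relation as an explicit label on each edge is what lets one structure answer both "where is this declared?" and "who uses this?" with a simple filter.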

## 🛠️ MCP Tools Available

The MCP server exposes the following tools for LLM agents:

| Tool                      | Description                                               |
| ------------------------- | --------------------------------------------------------- |
| `search_nodes`            | Semantic + keyword search for code chunks                 |
| `get_node_info`           | Detailed information about any node (file, chunk, entity) |
| `get_node_edges`          | Incoming and outgoing relationships of a node             |
| `go_to_definition`        | Find where a function/class/variable is declared          |
| `find_usages`             | Find all places where an entity is called/used            |
| `get_neighbors`           | Get all directly connected nodes                          |
| `get_file_structure`      | Overview of a file's chunks and entities                  |
| `get_related_chunks`      | Find chunks related by a specific relationship type       |
| `list_all_entities`       | List all tracked entities in the codebase                 |
| `get_graph_stats`         | Statistics about the knowledge graph                      |
| `find_path`               | Find shortest path between two nodes                      |
| `get_subgraph`            | Extract a subgraph around a node                          |
| `print_tree`              | Display repository structure as a tree                    |
| `diff_chunks`             | Compare content between two code chunks                   |
| `search_by_type_and_name` | Search entities by type (class, function, etc.) and name  |
| `get_chunk_context`       | Get a chunk with its surrounding context                  |
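
Tools such as `find_path` and `get_subgraph` correspond to standard graph traversals. The following sketch shows one plausible way to compute them with breadth-first search over an adjacency dictionary. The function names `shortest_path` and `subgraph_around`, the `depth` parameter, and the sample call graph are hypothetical; the server's actual implementation and parameters may differ.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def shortest_path(adj: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns the node sequence or None if unreachable."""
    if start == goal:
        return [start]
    parent: Dict[str, str] = {}
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            parent[nxt] = node
            if nxt == goal:
                # Walk the parent links back to the start, then reverse
                path = [goal]
                while path[-1] != start:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


def subgraph_around(adj: Dict[str, List[str]], center: str, depth: int) -> Set[str]:
    """Nodes reachable from center in at most `depth` hops."""
    reached = {center}
    frontier = {center}
    for _ in range(depth):
        frontier = {n for node in frontier for n in adj.get(node, [])} - reached
        reached |= frontier
    return reached


# Hypothetical call graph: function -> functions it calls
adj = {
    "handler": ["validate", "save"],
    "validate": ["serialize"],
    "save": ["serialize", "write_file"],
    "serialize": [],
    "write_file": [],
}

print(shortest_path(adj, "handler", "write_file"))  # ['handler', 'save', 'write_file']
print(shortest_path(adj, "serialize", "handler"))   # None
print(sorted(subgraph_around(adj, "handler", 1)))   # ['handler', 'save', 'validate']
```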

## 🌐 Supported Languages

The knowledge graph builder uses **AST-based entity extraction** for accurate parsing:

| Language              | Parser          | Entity Types                                    |
| --------------------- | --------------- | ----------------------------------------------- |
| Python                | `ast` module    | classes, functions, methods, variables, imports |
| C                     | `libclang`      | functions, structs, typedefs, variables         |
| C++                   | `libclang`      | classes, namespaces, methods, templates         |
| Java                  | `javalang`      | classes, interfaces, methods, fields            |
| JavaScript/TypeScript | `esprima`       | classes, functions, variables, imports          |
| Rust                  | `tree-sitter`   | structs, enums, traits, functions, modules      |
| HTML                  | `BeautifulSoup` | DOM elements, inline JS extraction              |

The system also detects **API endpoints** for web frameworks (FastAPI, Flask, Spring Boot, Actix-web, etc.).
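
For the Python row of the table, the kind of information an AST pass can recover is easy to demonstrate with the standard library alone. The sketch below is not the project's `EntityExtractor`; the `extract_entities` function and the sample source are hypothetical, and only `ast.parse` and `ast.walk` from the standard `ast` module are relied on.

```python
import ast
from typing import List, Tuple

SOURCE = '''
import os

class Greeter:
    def greet(self, name):
        return format_name(name)

def format_name(name):
    return name.title()
'''


def extract_entities(source: str) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (declared entities as (kind, name), names of called functions)."""
    tree = ast.parse(source)
    declared: List[Tuple[str, str]] = []
    called: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            declared.append(("class", node.name))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            declared.append(("function", node.name))
        elif isinstance(node, ast.Import):
            declared.extend(("import", alias.name) for alias in node.names)
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):         # plain call: foo(...)
                called.append(node.func.id)
            elif isinstance(node.func, ast.Attribute):  # attribute call: obj.foo(...)
                called.append(node.func.attr)
    # Sort so the result does not depend on traversal order
    return sorted(declared), sorted(called)


declared, called = extract_entities(SOURCE)
print(declared)
print(called)
```

The declared names would become entity nodes, and each called name a candidate `calls` edge from the enclosing chunk. Resolving a bare name such as `title` to the right definition is the harder part and is not attempted here.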

## 🚀 Getting Started

### Prerequisites

- Docker & Docker Compose
- Python 3.10+ (for local development)
- CUDA-capable GPU (optional, for faster embeddings)

### Quick Start with Docker

```bash
# Start the MCP server with a sample knowledge graph
docker-compose up
```

### Building a Knowledge Graph from Your Repository

```python
from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph
# From a local path
kg = RepoKnowledgeGraph.from_path(
    "/path/to/your/repo",
    skip_dirs=["node_modules", ".git", "__pycache__"],
    extract_entities=True,
    index_nodes=True
)
# Save for later use
kg.save_graph_to_file("my_knowledge_graph.json")
```

### Running the MCP Server with Gradio

```bash
python gradio_mcp.py --graph-file my_knowledge_graph.json --host 0.0.0.0 --port 7860
```

## 📊 Interactive Explorer (Gradio UI)

The project includes a Gradio-based web interface for exploring knowledge graphs interactively:

- **Search**: Use natural language or keywords to find relevant code
- **Navigate**: Click through nodes to explore relationships  
- **Analyze**: Get statistics about code structure and dependencies
- **Visualize**: View the repository tree and entity relationships
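
Search combines a semantic signal with keyword matching. The toy example below shows the general shape of such a hybrid score as a weighted sum. It is only a sketch: the bag-of-words cosine is a stand-in for real embedding similarity, and the `alpha` weight, the function names, and the sample chunks are all assumptions rather than the behavior of the project's index.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of distinct query tokens that appear verbatim in the text."""
    q = set(tokenize(query))
    return len(q & set(tokenize(text))) / len(q) if q else 0.0


def hybrid_search(query: str, chunks: Dict[str, str], alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Rank chunks by alpha * vector similarity + (1 - alpha) * keyword score."""
    q_vec = Counter(tokenize(query))
    scored = []
    for chunk_id, text in chunks.items():
        vector_part = cosine(q_vec, Counter(tokenize(text)))  # stand-in for embedding similarity
        keyword_part = keyword_score(query, text)
        scored.append((chunk_id, alpha * vector_part + (1 - alpha) * keyword_part))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical chunks: chunk id -> chunk text
chunks = {
    "auth.py#0": "def login(user, password): check the password hash",
    "io.py#2": "def read_file(path): open the file and return its text",
    "auth.py#1": "def logout(user): clear the session",
}

ranked = hybrid_search("check user password", chunks)
for chunk_id, score in ranked:
    print(f"{chunk_id}: {score:.3f}")
```

With a real embedding model in place of the bag-of-words vectors, the first term can match paraphrases that share no tokens with the query, while the keyword term keeps exact identifier matches near the top.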

## 📁 Data Sources

The application supports loading knowledge graphs from multiple sources:

### 1. HuggingFace Hub Dataset (Recommended for Sharing)

Load a knowledge graph directly from a HuggingFace dataset created by the library (see "Publishing to HuggingFace Hub" below):

```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --hf-dataset "username/dataset-name"
```

### 2. Local JSON File

Use a local JSON file (e.g., `multihop_knowledge_graph_with_embeddings.json`):

```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --graph-file data/multihop_knowledge_graph_with_embeddings.json
```

### 3. Direct from Git Repository

Clone and analyze a repository on the fly:

```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --repo-url "https://github.com/user/repo.git"
```

### Publishing to HuggingFace Hub

You can save an existing knowledge graph to HuggingFace Hub for sharing:

```python
from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph
# Load from local file
kg = RepoKnowledgeGraph.load("path/to/graph.json")
# Push to HuggingFace Hub (without embeddings to reduce size)
kg.to_hf_dataset("username/my-knowledge-graph", save_embeddings=False, private=False)
# Or with embeddings (larger dataset)
kg.to_hf_dataset("username/my-knowledge-graph-with-embeddings", save_embeddings=True)
```


## 🏗️ Architecture Overview

```
root/
├── Dockerfile                      # Docker configuration
├── requirements.txt                # Python dependencies
├── RepoKnowledgeGraphLib/          # Knowledge graph implementation
│   ├── RepoKnowledgeGraph.py       # Main graph class
│   ├── KnowledgeGraphMCPServer.py  # MCP server implementation
│   ├── EntityExtractor.py          # AST-based entity extraction
│   ├── CodeParser.py               # Code chunking
│   ├── CodeIndex.py                # Hybrid search (LanceDB/Weaviate)
│   ├── ModelService.py             # Embedding generation
│   └── Node.py                     # Graph node types
└── gradio_mcp_space.py             # Main Gradio web interface
```





## 👥 Team

**Team Name:** CEPIA Ionis Team

**Team Members:**
- **Laila ELKOUSSY** - [@lailaelkoussy](https://huggingface.co/lailaelkoussy) - Research Engineer, Data Scientist
- **Julien PEREZ** - [@jnm38](https://huggingface.co/jnm38) - Research Director

---

## 📄 License

This project is developed as part of research at EPITA / Ionis Group.

## 🔗 Related Resources

- [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) - The protocol standard
- [Gradio](https://gradio.app/) - Python web interface framework with MCP support
- [LanceDB](https://lancedb.github.io/lancedb/) - Vector database for code indexing
- [Salesforce SFR-Embedding-Code](https://huggingface.co/Salesforce/SFR-Embedding-Code-400M_R) - Code embedding model

## 🆚 VS Code Integration

To use this MCP server with **GitHub Copilot** in VS Code, you need to configure an `mcp.json` file.

### Configuration File Location

Create or edit the file at `.vscode/mcp.json` in your workspace root:

```
your-workspace/
├── .vscode/
│   └── mcp.json    ← Place the configuration here
├── src/
└── ...
```

### Configuration Content

Add the following content to `.vscode/mcp.json`:

```jsonc
{
    "servers": {
        "transformers-code-graph": {
            "url": "https://lailaelkoussy-transformers-library-knowledge-graph.hf.space/gradio_api/mcp/",
            "type": "http"
        }
    },
    "inputs": []
}
```

### What This Does

- **`servers`**: Defines the MCP servers available to VS Code
- **`transformers-code-graph`**: A custom name for this server connection
- **`url`**: The endpoint of the hosted MCP server (here pointing to the HuggingFace Space)
- **`type`**: Set to `"http"` for remote HTTP-based MCP servers

### Using with Your Own Server

If you're running your own MCP server locally, update the URL accordingly:

```jsonc
{
    "servers": {
        "my-code-graph": {
            "url": "http://localhost:7860/gradio_api/mcp/",
            "type": "http"
        }
    },
    "inputs": []
}
```
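
If you prefer to script this step, the same file can be written from Python. This is a small convenience sketch, not part of the project: the `add_mcp_server` helper is hypothetical, and it assumes any existing `mcp.json` is plain JSON without comments, since the standard `json` module cannot parse JSONC comments.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def add_mcp_server(workspace: Path, name: str, url: str) -> Dict:
    """Add or update one HTTP server entry in <workspace>/.vscode/mcp.json."""
    config_path = workspace / ".vscode" / "mcp.json"
    if config_path.exists():
        # Assumption: the existing file contains no comments
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {"servers": {}, "inputs": []}
    config.setdefault("servers", {})[name] = {"url": url, "type": "http"}
    config.setdefault("inputs", [])
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


# Demonstrated in a temporary directory; pass your workspace root instead
with tempfile.TemporaryDirectory() as tmp:
    config = add_mcp_server(Path(tmp), "my-code-graph", "http://localhost:7860/gradio_api/mcp/")
    print(json.dumps(config, indent=4))
```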

Once configured, GitHub Copilot in VS Code will have access to all the knowledge graph tools (`search_nodes`, `go_to_definition`, `find_usages`, etc.) to help navigate and understand your codebase.