---
title: Code Knowledge Graph Explorer - 🤗 Transformers Library
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
tags:
- building-mcp-track-enterprise
short_description: MCP server for big code - explore Transformers
---
# Code Knowledge Graph MCP Server
> **Helping LLM-based agents navigate and understand large codebases**
## What is this project?
This project provides a [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) server that transforms code repositories into navigable **knowledge graphs**. It enables Large Language Model (LLM) based agents to efficiently explore, understand, and reason about complex codebases, a critical capability for modern software engineering education and practice.
## Use Case: EPITA Coding Courses
This project was developed with **educational applications** in mind, specifically to support **EPITA coding courses**:
### Enhanced Code Discovery for Agents
LLM-based coding agents can use this tool to **better discover and navigate large repositories**. Instead of blindly searching through files, agents can:
- Query the knowledge graph to understand the overall architecture
- Follow relationships between modules, classes, and functions
- Identify entry points and critical code paths
- Understand how different parts of the codebase interact
### Detecting Areas for Code Improvement
For EPITA courses, this tool helps agents **identify areas where student code can be improved**:
- **Dead Code Detection**: Find unused functions, classes, or variables
- **Circular Dependencies**: Detect problematic import cycles between modules
- **Code Coupling Analysis**: Identify tightly coupled components that should be refactored
- **Missing Documentation**: Find undocumented public APIs and complex functions
- **Complexity Hotspots**: Locate chunks with many outgoing calls (high coupling)
- **Orphan Code**: Detect code that is declared but never called
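Two of the checks above reduce to simple graph queries. The sketch below is illustrative only and does not use this project's API: it assumes the import and call relationships have already been exported as plain dictionaries, and the function names `find_import_cycles` and `find_uncalled` are hypothetical.

```python
from collections import defaultdict


def find_import_cycles(imports):
    """Return import cycles found by depth-first search.

    `imports` maps a module name to the list of modules it imports
    (an assumed export format, not the library's own).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = defaultdict(int)
    cycles = []

    def visit(module, path):
        color[module] = GREY
        path.append(module)
        for dep in imports.get(module, ()):
            if color[dep] == GREY:
                # Back edge: dep is on the current path, so a cycle is closed.
                cycles.append(path[path.index(dep):] + [dep])
            elif color[dep] == WHITE:
                visit(dep, path)
        path.pop()
        color[module] = BLACK

    for module in list(imports):
        if color[module] == WHITE:
            visit(module, [])
    return cycles


def find_uncalled(declared, calls, entry_points=()):
    """Return declared functions that are never a call target."""
    called = {callee for callees in calls.values() for callee in callees}
    return sorted(set(declared) - called - set(entry_points))


imports = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
calls = {"main": ["load", "run"], "run": ["load"]}

print(find_import_cycles(imports))
print(find_uncalled(["main", "load", "run", "legacy_export"], calls,
                    entry_points={"main"}))
```

Entry points such as `main` are never called from inside the repository, so they are excluded explicitly; otherwise every program entry would be reported as dead code.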
### EPITA Course Integration
- **Project Reviews**: Quickly understand student project architectures before grading
- **Automated Feedback**: Integrate with LLM tutors to provide targeted improvement suggestions
- **Code Quality Assessment**: Consistent evaluation criteria across student submissions
- **Learning Tool**: Help students navigate and understand unfamiliar codebases (e.g., open-source projects)
- **Research**: Study code organization patterns across student projects
The MCP interface makes it easy to integrate with any LLM-based tutoring or code review system used in EPITA courses.
---
### The Problem We Solve
At **EPITA** (École pour l'informatique et les techniques avancées), students work on increasingly complex software projects throughout their curriculum. Understanding large codebases (whether their own, their teammates', or open-source libraries) is a fundamental skill for any computer science engineer.
However, LLM-based coding assistants face significant challenges when working with large repositories:
- **Context window limitations**: LLMs cannot process entire codebases at once
- **Lack of structural awareness**: Without understanding how code is organized, LLMs struggle to locate relevant files
- **Missing relationships**: Function calls, class inheritance, and module dependencies are not immediately visible
- **Inefficient search**: Simple keyword search fails to capture semantic meaning
### Our Solution: Knowledge Graphs + MCP
This project addresses these challenges by:
1. **Parsing repositories** into a structured knowledge graph (files → chunks → entities)
2. **Extracting relationships** between code elements (calls, contains, declares, imports)
3. **Indexing content** with hybrid search (semantic embeddings + keyword matching)
4. **Exposing tools via MCP** that allow LLM agents to navigate the codebase intelligently
```
┌─────────────────────────────────────────────────────────────────┐
│                         CODE REPOSITORY                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │  File A  │  │  File B  │  │  File C  │  │  File D  │  ...    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
└───────┼─────────────┼─────────────┼─────────────┼───────────────┘
        ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────────┐
│                  KNOWLEDGE GRAPH CONSTRUCTION                   │
│  • AST Parsing (Python, C/C++, Java, JavaScript, Rust, HTML)    │
│  • Entity Extraction (classes, functions, variables, methods)   │
│  • Relationship Detection (calls, inheritance, imports)         │
│  • Code Chunking & Embedding (semantic vectors)                 │
└────────────────────────────────┬────────────────────────────────┘
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       MCP SERVER (Gradio)                       │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │search_nodes │ │go_to_def    │ │find_usages  │ │get_neighbors│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │get_file_    │ │get_related  │ │find_path    │ │print_tree   │ │
│ │structure    │ │_chunks      │ │             │ │             │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└────────────────────────────────┬────────────────────────────────┘
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                         LLM-BASED AGENT                         │
│  • Can search for relevant code using natural language          │
│  • Navigate from function calls to their definitions            │
│  • Understand the structure of files and directories            │
│  • Trace dependencies and relationships across the codebase     │
└─────────────────────────────────────────────────────────────────┘
```
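To make the hybrid search step more concrete, here is a minimal, self-contained sketch of how a semantic score and a keyword score can be blended. It is not the project's `CodeIndex` implementation: `toy_embedding` is a stand-in (character trigram counts) for a real embedding model, and `hybrid_search`, `alpha`, and `top_k` are hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter


def toy_embedding(text):
    """Stand-in for a real embedding model: character trigram counts."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query words that occur verbatim in the text."""
    words = re.findall(r"[a-z0-9_]+", query.lower())
    tokens = set(re.findall(r"[a-z0-9_]+", text.lower()))
    return sum(w in tokens for w in words) / len(words) if words else 0.0


def hybrid_search(query, chunks, alpha=0.5, top_k=3):
    """Rank chunks by alpha * semantic score + (1 - alpha) * keyword score."""
    q_vec = toy_embedding(query)
    scored = []
    for chunk_id, text in chunks.items():
        semantic = cosine(q_vec, toy_embedding(text))
        score = alpha * semantic + (1 - alpha) * keyword_score(query, text)
        scored.append((score, chunk_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(chunk_id, round(score, 3)) for score, chunk_id in scored[:top_k]]


chunks = {
    "tokenizer.py#0": "def tokenize(text): return text.split()",
    "model.py#0": "class Model: def forward(self, x): return x",
}
print(hybrid_search("tokenize text", chunks))
```

Setting `alpha=0.0` falls back to pure keyword matching and `alpha=1.0` to pure vector similarity; a real deployment would replace `toy_embedding` with a code embedding model and the linear scan with a vector database.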
## MCP Tools Available
The MCP server exposes the following tools for LLM agents:
| Tool | Description |
| ------------------------- | --------------------------------------------------------- |
| `search_nodes` | Semantic + keyword search for code chunks |
| `get_node_info` | Detailed information about any node (file, chunk, entity) |
| `get_node_edges` | Incoming and outgoing relationships of a node |
| `go_to_definition` | Find where a function/class/variable is declared |
| `find_usages` | Find all places where an entity is called/used |
| `get_neighbors` | Get all directly connected nodes |
| `get_file_structure` | Overview of a file's chunks and entities |
| `get_related_chunks` | Find chunks related by a specific relationship type |
| `list_all_entities` | List all tracked entities in the codebase |
| `get_graph_stats` | Statistics about the knowledge graph |
| `find_path` | Find shortest path between two nodes |
| `get_subgraph` | Extract a subgraph around a node |
| `print_tree` | Display repository structure as a tree |
| `diff_chunks` | Compare content between two code chunks |
| `search_by_type_and_name` | Search entities by type (class, function, etc.) and name |
| `get_chunk_context` | Get a chunk with its surrounding context |
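As an illustration of what a tool like `find_path` computes, the following sketch runs a breadth-first search over a small edge list. It is an assumption-laden example rather than the server's implementation: the `(source, relation, target)` triple format and the node identifiers are hypothetical, and edges are treated as undirected.

```python
from collections import deque


def find_path(edges, source, target):
    """Return a shortest path between two nodes, or None if unreachable.

    `edges` is a list of (source, relation, target) triples; direction is
    ignored so that "calls" and "contains" edges can be walked both ways.
    """
    adjacency = {}
    for src, _relation, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    if source not in adjacency or target not in adjacency:
        return None

    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


edges = [
    ("utils.py", "contains", "utils.py#0"),
    ("utils.py#0", "declares", "load_config"),
    ("main.py", "contains", "main.py#0"),
    ("main.py#0", "calls", "load_config"),
]
print(find_path(edges, "main.py", "utils.py"))
```

In this toy graph the path runs from `main.py` through the chunk that calls `load_config` to the chunk and file that declare it, which is the kind of trace an agent can use to explain why two files are related.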
## Supported Languages
The knowledge graph builder uses **AST-based entity extraction** for accurate parsing:
| Language | Parser | Entity Types |
| --------------------- | --------------- | ----------------------------------------------- |
| Python | `ast` module | classes, functions, methods, variables, imports |
| C | `libclang` | functions, structs, typedefs, variables |
| C++ | `libclang` | classes, namespaces, methods, templates |
| Java | `javalang` | classes, interfaces, methods, fields |
| JavaScript/TypeScript | `esprima` | classes, functions, variables, imports |
| Rust | `tree-sitter` | structs, enums, traits, functions, modules |
| HTML | `BeautifulSoup` | DOM elements, inline JS extraction |
The system also detects **API endpoints** for web frameworks (FastAPI, Flask, Spring Boot, Actix-web, etc.).
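For the Python row of the table, AST-based extraction can be sketched with the standard-library `ast` module alone. This is a simplified illustration, not the project's `EntityExtractor`: the `extract_entities` function and the shape of its result dictionary are assumptions made for the example.

```python
import ast


def extract_entities(source):
    """Collect classes, functions, methods, imports, and call edges."""
    entities = {"classes": [], "functions": [], "methods": [],
                "imports": [], "calls": []}

    class Visitor(ast.NodeVisitor):
        def __init__(self):
            self.scope = []  # stack of enclosing (kind, qualified name)

        def visit_ClassDef(self, node):
            entities["classes"].append(node.name)
            self.scope.append(("class", node.name))
            self.generic_visit(node)
            self.scope.pop()

        def visit_FunctionDef(self, node):
            if self.scope and self.scope[-1][0] == "class":
                name = f"{self.scope[-1][1]}.{node.name}"
                entities["methods"].append(name)
            else:
                name = node.name
                entities["functions"].append(name)
            self.scope.append(("function", name))
            self.generic_visit(node)
            self.scope.pop()

        visit_AsyncFunctionDef = visit_FunctionDef

        def visit_Import(self, node):
            entities["imports"].extend(alias.name for alias in node.names)

        def visit_ImportFrom(self, node):
            module = node.module or "."  # relative import without a module
            entities["imports"].extend(
                f"{module}.{alias.name}" for alias in node.names)

        def visit_Call(self, node):
            func = node.func
            callee = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            caller = next((name for kind, name in reversed(self.scope)
                           if kind == "function"), "<module>")
            if callee:
                entities["calls"].append((caller, callee))
            self.generic_visit(node)

    Visitor().visit(ast.parse(source))
    return entities


SAMPLE = '''
import os
from pathlib import Path

class Loader:
    def load(self, name):
        return read(Path(name))

def read(path):
    return os.path.exists(path)
'''

for kind, names in extract_entities(SAMPLE).items():
    print(kind, names)
```

The `calls` list holds `(caller, callee)` pairs, which is the raw material for the "calls" relationships in the graph. Resolving a callee name such as `exists` to the entity that declares it is a separate step that this sketch does not attempt.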
## Getting Started
### Prerequisites
- Docker & Docker Compose
- Python 3.10+ (for local development)
- CUDA-capable GPU (optional, for faster embeddings)
### Quick Start with Docker
```bash
# Start the MCP server with a sample knowledge graph
docker-compose up
```
### Building a Knowledge Graph from Your Repository
```python
from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph
# From a local path
kg = RepoKnowledgeGraph.from_path(
    "/path/to/your/repo",
    skip_dirs=["node_modules", ".git", "__pycache__"],
    extract_entities=True,
    index_nodes=True,
)
# Save for later use
kg.save_graph_to_file("my_knowledge_graph.json")
```
### Running the MCP Server with Gradio
```bash
python gradio_mcp.py --graph-file my_knowledge_graph.json --host 0.0.0.0 --port 7860
```
## Interactive Explorer (Gradio UI)
The project includes a Gradio-based web interface for exploring knowledge graphs interactively:
- **Search**: Use natural language or keywords to find relevant code
- **Navigate**: Click through nodes to explore relationships
- **Analyze**: Get statistics about code structure and dependencies
- **Visualize**: View the repository tree and entity relationships
## Data Sources
The application supports loading knowledge graphs from multiple sources:
### 1. HuggingFace Hub Dataset (Recommended for Sharing)
Load directly from a HuggingFace dataset created by the library (see "Publishing to HuggingFace Hub" below):
```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --hf-dataset "username/dataset-name"
```
### 2. Local JSON File
Use a local JSON file (e.g., `multihop_knowledge_graph_with_embeddings.json`):
```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --graph-file data/multihop_knowledge_graph_with_embeddings.json
```
### 3. Direct from Git Repository
Clone and analyze a repository on-the-fly:
```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --repo-url "https://github.com/user/repo.git"
```
### Publishing to HuggingFace Hub
You can save an existing knowledge graph to HuggingFace Hub for sharing:
```python
from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph
# Load from local file
kg = RepoKnowledgeGraph.load("path/to/graph.json")
# Push to HuggingFace Hub (without embeddings to reduce size)
kg.to_hf_dataset("username/my-knowledge-graph", save_embeddings=False, private=False)
# Or with embeddings (larger dataset)
kg.to_hf_dataset("username/my-knowledge-graph-with-embeddings", save_embeddings=True)
```
## Architecture Overview
```
root/
├── Dockerfile                       # Docker configuration
├── requirements.txt                 # Python dependencies
├── RepoKnowledgeGraphLib/           # Knowledge graph implementation
│   ├── RepoKnowledgeGraph.py        # Main graph class
│   ├── KnowledgeGraphMCPServer.py   # MCP server implementation
│   ├── EntityExtractor.py           # AST-based entity extraction
│   ├── CodeParser.py                # Code chunking
│   ├── CodeIndex.py                 # Hybrid search (LanceDB/Weaviate)
│   ├── ModelService.py              # Embedding generation
│   └── Node.py                      # Graph node types
└── gradio_mcp_space.py              # Main Gradio web interface
```
## Team
**Team Name:** CEPIA Ionis Team
**Team Members:**
- **Laila ELKOUSSY** - [@lailaelkoussy](https://huggingface.co/lailaelkoussy) - Research Engineer, Data Scientist
- **Julien PEREZ** - [@jnm38](https://huggingface.co/jnm38) - Research Director
---
## License
This project is developed as part of research at EPITA / Ionis Group.
## Related Resources
- [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) - The protocol standard
- [Gradio](https://gradio.app/) - Python web interface framework with MCP support
- [LanceDB](https://lancedb.github.io/lancedb/) - Vector database for code indexing
- [Salesforce SFR-Embedding-Code](https://huggingface.co/Salesforce/SFR-Embedding-Code-400M_R) - Code embedding model
## VS Code Integration
To use this MCP server with **GitHub Copilot** in VS Code, you need to configure an `mcp.json` file.
### Configuration File Location
Create or edit the file at `.vscode/mcp.json` in your workspace root:
```
your-workspace/
├── .vscode/
│   └── mcp.json    ← Place the configuration here
├── src/
└── ...
```
### Configuration Content
Add the following content to `.vscode/mcp.json`:
```jsonc
{
  "servers": {
    "transformers-code-graph": {
      "url": "https://lailaelkoussy-transformers-library-knowledge-graph.hf.space/gradio_api/mcp/",
      "type": "http"
    }
  },
  "inputs": []
}
```
### What This Does
- **`servers`**: Defines the MCP servers available to VS Code
- **`transformers-code-graph`**: A custom name for this server connection
- **`url`**: The endpoint of the hosted MCP server (here pointing to the HuggingFace Space)
- **`type`**: Set to `"http"` for remote HTTP-based MCP servers
### Using with Your Own Server
If you're running your own MCP server locally, update the URL accordingly:
```jsonc
{
  "servers": {
    "my-code-graph": {
      "url": "http://localhost:7860/gradio_api/mcp/",
      "type": "http"
    }
  },
  "inputs": []
}
```
Once configured, GitHub Copilot in VS Code will have access to all the knowledge graph tools (`search_nodes`, `go_to_definition`, `find_usages`, etc.) to help navigate and understand your codebase.
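If you manage several workspaces, the configuration file can also be generated with a few lines of Python. The helper below is a hypothetical convenience script, not part of this project; it assumes any existing `mcp.json` is plain JSON without `//` comments, and it writes the same `servers` / `inputs` layout shown above.

```python
import json
import tempfile
from pathlib import Path


def write_mcp_config(workspace, name, url):
    """Add or update one HTTP server entry in <workspace>/.vscode/mcp.json."""
    config_path = Path(workspace) / ".vscode" / "mcp.json"
    config_path.parent.mkdir(parents=True, exist_ok=True)

    config = {"servers": {}, "inputs": []}
    if config_path.exists():
        # Assumption: the existing file is plain JSON (no // comments).
        config = json.loads(config_path.read_text(encoding="utf-8"))

    config.setdefault("servers", {})[name] = {"url": url, "type": "http"}
    config.setdefault("inputs", [])
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config_path


# Demonstration in a throwaway directory; point `workspace` at a real
# project root to create the file there instead.
with tempfile.TemporaryDirectory() as workspace:
    path = write_mcp_config(
        workspace, "my-code-graph", "http://localhost:7860/gradio_api/mcp/")
    print(path.read_text(encoding="utf-8"))
```

Running the helper twice with different names keeps both server entries, so the hosted Space and a local server can be registered side by side.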