---
title: Code Knowledge Graph Explorer - 🤗 Transformers Library
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
tags:
- building-mcp-track-enterprise
short_description: MCP server for big code - explore Transformers
---
## 👥 Team
**Team Name:** CEPIA Ionis Team
**Team Members:**
- **Laila ELKOUSSY** - [@lailaelkoussy](https://huggingface.co/lailaelkoussy) - Research Engineer, Data Scientist
- **Julien PEREZ** - [@jnm38](https://huggingface.co/jnm38) - Research Director
---
## 🎥 Demo Video
[Available in Repo](https://huggingface.co/spaces/MCP-1st-Birthday/code-knowledge-graph-explorer-transformers-library/blob/main/video-mcp-server.mp4)
---
## Social Media Post
[Available here](https://www.linkedin.com/posts/julien-perez-5492b883_mcp-aiagents-codeanalysis-activity-7400953387044990976-U8Vf/?utm_source=social_share_send&utm_medium=member_desktop_web&rcm=ACoAABGp_AkBa02nkJK1i19ORjznehQOMgsidm8)
---
# 🎓 Code Knowledge Graph MCP Server
> **Helping LLM-based agents navigate and understand large codebases**
## 📚 What is this project?
This project provides a [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) server that transforms code repositories into navigable **knowledge graphs**. It enables Large Language Model (LLM) based agents to efficiently explore, understand, and reason about complex codebases, a critical capability for modern software engineering education and practice.
## 🔬 Use Case: EPITA Coding Courses
This project was developed with **educational applications** in mind, specifically to support **EPITA coding courses**:
### 🔍 Enhanced Code Discovery for Agents
LLM-based coding agents can use this tool to **better discover and navigate large repositories**. Instead of blindly searching through files, agents can:
- Query the knowledge graph to understand the overall architecture
- Follow relationships between modules, classes, and functions
- Identify entry points and critical code paths
- Understand how different parts of the codebase interact
### 📈 Detecting Areas for Code Improvement
For EPITA courses, this tool helps agents **identify areas where student code can be improved**:
- **Dead Code Detection**: Find unused functions, classes, or variables
- **Circular Dependencies**: Detect problematic import cycles between modules
- **Code Coupling Analysis**: Identify tightly coupled components that should be refactored
- **Missing Documentation**: Find undocumented public APIs and complex functions
- **Complexity Hotspots**: Locate chunks with many outgoing calls (high coupling)
- **Orphan Code**: Detect code that is declared but never called
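
Two of the checks above, orphan code and circular dependencies, reduce to simple graph queries once relationships are stored as edges. The following is a minimal sketch on a toy edge list. The entity names, the `(source, relationship, target)` tuple layout, and both helper functions are illustrative assumptions, not the library's API.

```python
from collections import defaultdict

# Toy edges in the form (source, relationship, target).
# All names here are made up for illustration.
EDGES = [
    ("main", "calls", "load_data"),
    ("load_data", "calls", "parse_row"),
    ("module_a", "imports", "module_b"),
    ("module_b", "imports", "module_a"),
]
DECLARED = {"main", "load_data", "parse_row", "unused_helper", "module_a", "module_b"}
ENTRY_POINTS = {"main"}


def find_orphans(declared, edges, entry_points):
    """Return declared names that no edge points to, excluding entry points."""
    referenced = {target for _, _, target in edges}
    return sorted(declared - referenced - entry_points)


def find_cycles(edges, relationship):
    """Return one cycle per back edge found by a depth-first search."""
    graph = defaultdict(list)
    for source, rel, target in edges:
        if rel == relationship:
            graph[source].append(target)

    cycles, state, stack = [], {}, []

    def visit(node):
        state[node] = "active"
        stack.append(node)
        for neighbor in graph[node]:
            if state.get(neighbor) == "active":
                # The neighbor is still on the stack, so this edge closes a cycle.
                cycles.append(stack[stack.index(neighbor):] + [neighbor])
            elif neighbor not in state:
                visit(neighbor)
        stack.pop()
        state[node] = "done"

    for node in list(graph):
        if node not in state:
            visit(node)
    return cycles


print(find_orphans(DECLARED, EDGES, ENTRY_POINTS))
print(find_cycles(EDGES, "imports"))
```

On this toy data the first call reports `unused_helper` and the second reports the `module_a` / `module_b` import cycle. An agent would get the real edges from the MCP tools rather than from a hard-coded list.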
### 🎓 EPITA Course Integration
- **Project Reviews**: Quickly understand student project architectures before grading
- **Automated Feedback**: Integrate with LLM tutors to provide targeted improvement suggestions
- **Code Quality Assessment**: Consistent evaluation criteria across student submissions
- **Learning Tool**: Help students navigate and understand unfamiliar codebases (e.g., open-source projects)
- **Research**: Study code organization patterns across student projects
The MCP interface makes it easy to integrate with any LLM-based tutoring or code review system used in EPITA courses.
---
### 🎯 The Problem We Solve
At **EPITA** (Γ‰cole pour l'informatique et les techniques avancΓ©es), students work on increasingly complex software projects throughout their curriculum. Understanding large codebases β€” whether their own, their teammates', or open-source libraries β€” is a fundamental skill for any computer science engineer.
However, LLM-based coding assistants face significant challenges when working with large repositories:
- **Context window limitations**: LLMs cannot process entire codebases at once
- **Lack of structural awareness**: Without understanding how code is organized, LLMs struggle to locate relevant files
- **Missing relationships**: Function calls, class inheritance, and module dependencies are not immediately visible
- **Inefficient search**: Simple keyword search fails to capture semantic meaning
### 💡 Our Solution: Knowledge Graphs + MCP
This project addresses these challenges by:
1. **Parsing repositories** into a structured knowledge graph (files → chunks → entities)
2. **Extracting relationships** between code elements (calls, contains, declares, imports)
3. **Indexing content** with hybrid search (semantic embeddings + keyword matching)
4. **Exposing tools via MCP** that allow LLM agents to navigate the codebase intelligently
```
┌─────────────────────────────────────────────────────────────────┐
│                         CODE REPOSITORY                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │  File A  │  │  File B  │  │  File C  │  │  File D  │  ...    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
└───────┼─────────────┼─────────────┼─────────────┼───────────────┘
        ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────────┐
│                  KNOWLEDGE GRAPH CONSTRUCTION                   │
│  • AST Parsing (Python, C/C++, Java, JavaScript, Rust, HTML)    │
│  • Entity Extraction (classes, functions, variables, methods)   │
│  • Relationship Detection (calls, inheritance, imports)         │
│  • Code Chunking & Embedding (semantic vectors)                 │
└────────────────────────────────┬────────────────────────────────┘
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       MCP SERVER (Gradio)                       │
│ ┌─────────────┐ ┌────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │search_nodes │ │go_to_def   │ │find_usages   │ │get_neighbors│ │
│ └─────────────┘ └────────────┘ └──────────────┘ └─────────────┘ │
│ ┌─────────────┐ ┌────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │get_file_    │ │get_related │ │find_path     │ │print_tree   │ │
│ │structure    │ │_chunks     │ │              │ │             │ │
│ └─────────────┘ └────────────┘ └──────────────┘ └─────────────┘ │
└────────────────────────────────┬────────────────────────────────┘
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                         LLM-BASED AGENT                         │
│  • Can search for relevant code using natural language          │
│  • Navigate from function calls to their definitions            │
│  • Understand the structure of files and directories            │
│  • Trace dependencies and relationships across the codebase     │
└─────────────────────────────────────────────────────────────────┘
```
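
To make the files → chunks → entities layering more concrete, here is a minimal sketch of such a graph as plain Python data. `ToyNode`, `ToyGraph`, the node identifiers, and the helper methods are hypothetical names chosen for illustration. The actual classes in `RepoKnowledgeGraphLib` may be organized quite differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyNode:
    node_id: str
    kind: str  # "file", "chunk", or "entity"


@dataclass
class ToyGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source_id, relationship, target_id)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = ToyNode(node_id, kind)

    def add_edge(self, source_id, relationship, target_id):
        # Refuse edges to unknown nodes so the graph stays consistent.
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source_id, relationship, target_id))

    def sources(self, target_id, relationship):
        """Nodes that point at target_id through the given relationship."""
        return [s for s, r, t in self.edges if t == target_id and r == relationship]


graph = ToyGraph()
graph.add_node("utils.py", "file")
graph.add_node("utils.py#0", "chunk")
graph.add_node("utils.py#1", "chunk")
graph.add_node("slugify", "entity")

graph.add_edge("utils.py", "contains", "utils.py#0")
graph.add_edge("utils.py", "contains", "utils.py#1")
graph.add_edge("utils.py#0", "declares", "slugify")
graph.add_edge("utils.py#1", "calls", "slugify")

print(graph.sources("slugify", "declares"))  # the chunk that declares the entity
print(graph.sources("slugify", "calls"))     # the chunks that call it
```

With this layout, a "go to definition" query follows `declares` edges backwards from an entity, and a "find usages" query follows `calls` edges backwards.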
## 🛠️ MCP Tools Available
The MCP server exposes the following tools for LLM agents:
| Tool | Description |
| ------------------------- | --------------------------------------------------------- |
| `search_nodes` | Semantic + keyword search for code chunks |
| `get_node_info` | Detailed information about any node (file, chunk, entity) |
| `get_node_edges` | Incoming and outgoing relationships of a node |
| `go_to_definition` | Find where a function/class/variable is declared |
| `find_usages` | Find all places where an entity is called/used |
| `get_neighbors` | Get all directly connected nodes |
| `get_file_structure` | Overview of a file's chunks and entities |
| `get_related_chunks` | Find chunks related by a specific relationship type |
| `list_all_entities` | List all tracked entities in the codebase |
| `get_graph_stats` | Statistics about the knowledge graph |
| `find_path` | Find shortest path between two nodes |
| `get_subgraph` | Extract a subgraph around a node |
| `print_tree` | Display repository structure as a tree |
| `diff_chunks` | Compare content between two code chunks |
| `search_by_type_and_name` | Search entities by type (class, function, etc.) and name |
| `get_chunk_context` | Get a chunk with its surrounding context |
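
As an illustration of what a tool like `find_path` computes, the sketch below runs a breadth-first search between two nodes of a toy graph. It assumes edges can be traversed in both directions and uses made-up node identifiers. The server's actual traversal rules and output format may differ.

```python
from collections import deque

# Toy undirected adjacency mapping; node names are illustrative only.
ADJACENCY = {
    "train.py": ["train.py#0"],
    "train.py#0": ["train.py", "Trainer"],
    "Trainer": ["train.py#0", "trainer.py#3"],
    "trainer.py#3": ["Trainer", "trainer.py"],
    "trainer.py": ["trainer.py#3"],
}


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


print(shortest_path(ADJACENCY, "train.py", "trainer.py"))
```

Here the path runs from the file `train.py` through one of its chunks, the `Trainer` entity, and the chunk that declares it, ending at `trainer.py`.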
## 🌐 Supported Languages
The knowledge graph builder uses **AST-based entity extraction** for accurate parsing:
| Language | Parser | Entity Types |
| --------------------- | --------------- | ----------------------------------------------- |
| Python | `ast` module | classes, functions, methods, variables, imports |
| C | `libclang` | functions, structs, typedefs, variables |
| C++ | `libclang` | classes, namespaces, methods, templates |
| Java | `javalang` | classes, interfaces, methods, fields |
| JavaScript/TypeScript | `esprima` | classes, functions, variables, imports |
| Rust | `tree-sitter` | structs, enums, traits, functions, modules |
| HTML | `BeautifulSoup` | DOM elements, inline JS extraction |
The system also detects **API endpoints** for web frameworks (FastAPI, Flask, Spring Boot, Actix-web, etc.).
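
For Python, the table lists the standard `ast` module as the parser. The sketch below shows the general idea of AST-based extraction using only that module. The `extract_entities` and `extract_calls` helpers and the sample source are assumptions made for illustration and are not the code in `EntityExtractor.py`.

```python
import ast

SOURCE = '''
import os
from pathlib import Path

class Loader:
    def load(self, path):
        return Path(path).read_text()

def main():
    loader = Loader()
    print(loader.load(os.getcwd()))
'''


def extract_entities(source):
    """Collect top-level classes, functions, methods, and imports."""
    tree = ast.parse(source)
    entities = {"classes": [], "functions": [], "methods": [], "imports": []}
    for node in tree.body:  # top-level statements only
        if isinstance(node, ast.ClassDef):
            entities["classes"].append(node.name)
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    entities["methods"].append(f"{node.name}.{item.name}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entities["functions"].append(node.name)
        elif isinstance(node, ast.Import):
            entities["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x".
            entities["imports"].append(node.module or ".")
    return entities


def extract_calls(source):
    """Return names called directly, e.g. foo(), ignoring attribute calls."""
    return sorted({
        node.func.id
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    })


print(extract_entities(SOURCE))
print(extract_calls(SOURCE))
```

Each extracted name can become an entity node, and each call can become a `calls` edge from the chunk containing it. Attribute calls such as `loader.load(...)` need extra resolution and are skipped in this sketch.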
## 🚀 Getting Started
### Prerequisites
- Docker & Docker Compose
- Python 3.10+ (for local development)
- CUDA-capable GPU (optional, for faster embeddings)
### Quick Start with Docker
```bash
# Start the MCP server with a sample knowledge graph
docker-compose up
```
### Building a Knowledge Graph from Your Repository
```python
from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph

# From a local path
kg = RepoKnowledgeGraph.from_path(
    "/path/to/your/repo",
    skip_dirs=["node_modules", ".git", "__pycache__"],
    extract_entities=True,
    index_nodes=True,
)

# Save for later use
kg.save_graph_to_file("my_knowledge_graph.json")
```
### Running the MCP Server with Gradio
```bash
python gradio_mcp.py --graph-file my_knowledge_graph.json --host 0.0.0.0 --port 7860
```
## 📊 Interactive Explorer (Gradio UI)
The project includes a Gradio-based web interface for exploring knowledge graphs interactively:
- **Search**: Use natural language or keywords to find relevant code
- **Navigate**: Click through nodes to explore relationships
- **Analyze**: Get statistics about code structure and dependencies
- **Visualize**: View the repository tree and entity relationships
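
Searching by natural language or keywords relies on hybrid search, which blends a semantic similarity score with a keyword match score. The sketch below shows one simple way to combine the two. It uses character trigram counts as a stand-in for a real embedding model, and the weighting scheme, the `alpha` parameter, and all function names are assumptions, not the behaviour of `CodeIndex.py`.

```python
import math
from collections import Counter

# Toy chunks; identifiers and contents are illustrative only.
CHUNKS = {
    "tokenizer.py#2": "def tokenize text into tokens and return ids",
    "trainer.py#7": "def train model with optimizer and scheduler",
    "utils.py#1": "def slugify title for urls",
}


def toy_embedding(text):
    """Stand-in for a real embedding model: character trigram counts."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query words that appear verbatim in the text."""
    words = query.lower().split()
    text_words = set(text.lower().split())
    return sum(word in text_words for word in words) / len(words) if words else 0.0


def hybrid_search(query, chunks, alpha=0.5):
    """Rank chunk ids by alpha * semantic + (1 - alpha) * keyword score."""
    query_vector = toy_embedding(query)
    scored = [
        (alpha * cosine(query_vector, toy_embedding(text))
         + (1 - alpha) * keyword_score(query, text), chunk_id)
        for chunk_id, text in chunks.items()
    ]
    return [chunk_id for _, chunk_id in sorted(scored, reverse=True)]


print(hybrid_search("tokenize text", CHUNKS))
```

Setting `alpha` closer to 1 favours semantic similarity, while values closer to 0 favour exact keyword matches.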
## 📁 Data Sources
The application supports loading knowledge graphs from multiple sources:
### 1. HuggingFace Hub Dataset (Recommended for Sharing)
Load directly from a HuggingFace dataset created by the library (see "Publishing to HuggingFace Hub" below):
```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --hf-dataset "username/dataset-name"
```
### 2. Local JSON File
Use a local JSON file (e.g., `multihop_knowledge_graph_with_embeddings.json`):
```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --graph-file data/multihop_knowledge_graph_with_embeddings.json
```
### 3. Direct from Git Repository
Clone and analyze a repository on the fly:
```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --repo-url "https://github.com/user/repo.git"
```
### Publishing to HuggingFace Hub
You can save an existing knowledge graph to HuggingFace Hub for sharing:
```python
from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph
# Load from local file
kg = RepoKnowledgeGraph.load("path/to/graph.json")
# Push to HuggingFace Hub (without embeddings to reduce size)
kg.to_hf_dataset("username/my-knowledge-graph", save_embeddings=False, private=False)
# Or with embeddings (larger dataset)
kg.to_hf_dataset("username/my-knowledge-graph-with-embeddings", save_embeddings=True)
```
## 🏗️ Architecture Overview
```
root/
├── Dockerfile                        # Docker configuration
├── requirements.txt                  # Python dependencies
├── RepoKnowledgeGraphLib/            # Knowledge graph implementation
│   ├── RepoKnowledgeGraph.py         # Main graph class
│   ├── KnowledgeGraphMCPServer.py    # MCP server implementation
│   ├── EntityExtractor.py            # AST-based entity extraction
│   ├── CodeParser.py                 # Code chunking
│   ├── CodeIndex.py                  # Hybrid search (LanceDB/Weaviate)
│   ├── ModelService.py               # Embedding generation
│   └── Node.py                       # Graph node types
└── gradio_mcp_space.py               # Main Gradio web interface
```
## 📄 License
This project is developed as part of research at EPITA / Ionis Group.
## 🔗 Related Resources
- [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) - The protocol standard
- [Gradio](https://gradio.app/) - Python web interface framework with MCP support
- [LanceDB](https://lancedb.github.io/lancedb/) - Vector database for code indexing
- [Salesforce SFR-Embedding-Code](https://huggingface.co/Salesforce/SFR-Embedding-Code-400M_R) - Code embedding model
## 🆚 VS Code Integration
To use this MCP server with **GitHub Copilot** in VS Code, you need to configure an `mcp.json` file.
### Configuration File Location
Create or edit the file at `.vscode/mcp.json` in your workspace root:
```
your-workspace/
├── .vscode/
│   └── mcp.json          ← Place the configuration here
├── src/
└── ...
```
### Configuration Content
Add the following content to `.vscode/mcp.json`:
```jsonc
{
"servers": {
"transformers-code-graph": {
"url": "https://mcp-1st-birthday-code-knowledge-graph-explorer-t-327857c.hf.space/gradio_api/mcp/",
"type": "http"
}
},
"inputs": []
}
```
### What This Does
- **`servers`**: Defines the MCP servers available to VS Code
- **`transformers-code-graph`**: A custom name for this server connection
- **`url`**: The endpoint of the hosted MCP server (here pointing to the HuggingFace Space)
- **`type`**: Set to `"http"` for remote HTTP-based MCP servers
### Using with Your Own Server
If you're running your own MCP server locally, update the URL accordingly:
```jsonc
{
"servers": {
"my-code-graph": {
"url": "http://localhost:7860/gradio_api/mcp/",
"type": "http"
}
},
"inputs": []
}
```
Once configured, GitHub Copilot in VS Code will have access to all the knowledge graph tools (`search_nodes`, `go_to_definition`, `find_usages`, etc.) to help navigate and understand your codebase.