---
title: Code Knowledge Graph Explorer - Transformers Library
emoji: π |
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
tags:
  - building-mcp-track-enterprise
short_description: MCP server for big code - explore Transformers
---

## Team

**Team Name:** CEPIA Ionis Team

**Team Members:**

- **Laila ELKOUSSY** - [@lailaelkoussy](https://huggingface.co/lailaelkoussy) - Research Engineer, Data Scientist
- **Julien PEREZ** - [@jnm38](https://huggingface.co/jnm38) - Research Director

---

## Demo Video

[Available in Repo](https://huggingface.co/spaces/MCP-1st-Birthday/code-knowledge-graph-explorer-transformers-library/blob/main/video-mcp-server.mp4)

---

## Social Media Post

[Available here](https://www.linkedin.com/posts/julien-perez-5492b883_mcp-aiagents-codeanalysis-activity-7400953387044990976-U8Vf/?utm_source=social_share_send&utm_medium=member_desktop_web&rcm=ACoAABGp_AkBa02nkJK1i19ORjznehQOMgsidm8)

---

# Code Knowledge Graph MCP Server

> **Helping LLM-based agents navigate and understand large codebases**

## What is this project?

This project provides a [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) server that transforms code repositories into navigable **knowledge graphs**. It enables Large Language Model (LLM) based agents to efficiently explore, understand, and reason about complex codebases, a critical capability for modern software engineering education and practice.

## Use Case: EPITA Coding Courses

This project was developed with **educational applications** in mind, specifically to support **EPITA coding courses**:

### Enhanced Code Discovery for Agents

LLM-based coding agents can use this tool to **better discover and navigate large repositories**. Instead of blindly searching through files, agents can:

- Query the knowledge graph to understand the overall architecture
- Follow relationships between modules, classes, and functions
- Identify entry points and critical code paths
- Understand how different parts of the codebase interact

### Detecting Areas for Code Improvement

For EPITA courses, this tool helps agents **identify areas where student code can be improved**:

- **Dead Code Detection**: Find unused functions, classes, or variables
- **Circular Dependencies**: Detect problematic import cycles between modules
- **Code Coupling Analysis**: Identify tightly coupled components that should be refactored
- **Missing Documentation**: Find undocumented public APIs and complex functions
- **Complexity Hotspots**: Locate chunks with many outgoing calls (high coupling)
- **Orphan Code**: Detect code that is declared but never called

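
Several of the checks listed above reduce to simple graph queries once imports and calls are available as edges. The following is a minimal sketch under stated assumptions: the `IMPORTS`, `DECLARED`, `CALLS`, and `ENTRY_POINTS` structures and the three helper functions are hypothetical illustrations built on plain dictionaries, not the data model or API of this project.

```python
# Hypothetical toy data: import edges between modules and call edges between entities.
IMPORTS = {
    "app.py": ["models.py", "utils.py"],
    "models.py": ["db.py"],
    "db.py": ["models.py"],  # models.py and db.py import each other
    "utils.py": [],
}
DECLARED = {"main", "load_user", "save_user", "format_date", "legacy_export"}
CALLS = {
    "main": ["load_user", "format_date"],
    "load_user": ["save_user"],
}
ENTRY_POINTS = {"main"}


def find_import_cycles(imports):
    """Depth-first search that reports each set of cyclic modules once."""
    cycles, seen_keys = [], set()
    state = {}  # module -> "visiting" or "done"
    stack = []

    def visit(module):
        state[module] = "visiting"
        stack.append(module)
        for dep in imports.get(module, []):
            if state.get(dep) == "visiting":
                # dep is on the current DFS stack, so the stack tail is a cycle.
                cycle = stack[stack.index(dep):]
                key = frozenset(cycle)
                if key not in seen_keys:
                    seen_keys.add(key)
                    cycles.append(cycle)
            elif dep not in state:
                visit(dep)
        stack.pop()
        state[module] = "done"

    for module in sorted(imports):
        if module not in state:
            visit(module)
    return cycles


def find_orphans(declared, calls, entry_points):
    """Entities that are declared, never called, and not entry points."""
    called = {callee for callees in calls.values() for callee in callees}
    return sorted(declared - called - entry_points)


def rank_by_outgoing_calls(calls):
    """Callers ordered by number of outgoing calls (a rough coupling signal)."""
    return sorted(((name, len(callees)) for name, callees in calls.items()),
                  key=lambda pair: pair[1], reverse=True)


cycles = find_import_cycles(IMPORTS)
orphans = find_orphans(DECLARED, CALLS, ENTRY_POINTS)
hotspots = rank_by_outgoing_calls(CALLS)
```

On this toy data the cycle check reports the `models.py` / `db.py` pair and the orphan check reports `legacy_export`. Deduplicating cycles by their set of modules is a simplification that is adequate for feedback on student projects but would merge distinct cycles that visit the same modules in a different order.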
### EPITA Course Integration

- **Project Reviews**: Quickly understand student project architectures before grading
- **Automated Feedback**: Integrate with LLM tutors to provide targeted improvement suggestions
- **Code Quality Assessment**: Consistent evaluation criteria across student submissions
- **Learning Tool**: Help students navigate and understand unfamiliar codebases (e.g., open-source projects)
- **Research**: Study code organization patterns across student projects

The MCP interface makes it easy to integrate with any LLM-based tutoring or code review system used in EPITA courses.

---

### The Problem We Solve

At **EPITA** (École pour l'informatique et les techniques avancées), students work on increasingly complex software projects throughout their curriculum. Understanding large codebases, whether their own, their teammates', or open-source libraries, is a fundamental skill for any computer science engineer.

However, LLM-based coding assistants face significant challenges when working with large repositories:

- **Context window limitations**: LLMs cannot process entire codebases at once
- **Lack of structural awareness**: Without understanding how code is organized, LLMs struggle to locate relevant files
- **Missing relationships**: Function calls, class inheritance, and module dependencies are not immediately visible
- **Inefficient search**: Simple keyword search fails to capture semantic meaning

### Our Solution: Knowledge Graphs + MCP

This project addresses these challenges by:

1. **Parsing repositories** into a structured knowledge graph (files → chunks → entities)
2. **Extracting relationships** between code elements (calls, contains, declares, imports)
3. **Indexing content** with hybrid search (semantic embeddings + keyword matching)
4. **Exposing tools via MCP** that allow LLM agents to navigate the codebase intelligently

```
┌─────────────────────────────────────────────────────────────────┐
│                         CODE REPOSITORY                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │  File A  │  │  File B  │  │  File C  │  │  File D  │  ...    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
└───────┼─────────────┼─────────────┼─────────────┼───────────────┘
        ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────────┐
│                  KNOWLEDGE GRAPH CONSTRUCTION                   │
│  • AST Parsing (Python, C/C++, Java, JavaScript, Rust, HTML)    │
│  • Entity Extraction (classes, functions, variables, methods)   │
│  • Relationship Detection (calls, inheritance, imports)         │
│  • Code Chunking & Embedding (semantic vectors)                 │
└────────────────────────────────┬────────────────────────────────┘
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       MCP SERVER (Gradio)                       │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │search_nodes │ │go_to_def    │ │find_usages  │ │get_neighbors│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │get_file_    │ │get_related  │ │find_path    │ │print_tree   │ │
│ │structure    │ │_chunks      │ │             │ │             │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└────────────────────────────────┬────────────────────────────────┘
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                         LLM-BASED AGENT                         │
│  • Can search for relevant code using natural language          │
│  • Navigate from function calls to their definitions            │
│  • Understand the structure of files and directories            │
│  • Trace dependencies and relationships across the codebase     │
└─────────────────────────────────────────────────────────────────┘
```

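
To make the files → chunks → entities layering concrete, here is a minimal sketch of a typed graph with neighbor lookup and shortest-path search. It is an illustration only: the `CodeGraph` class, its method names, and the node identifiers such as `utils.py::chunk0` are assumptions made for this example and do not reflect the internals of `RepoKnowledgeGraph`.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CodeGraph:
    """Toy directed graph with typed nodes and typed edges (hypothetical)."""
    nodes: dict = field(default_factory=dict)  # node_id -> "file" | "chunk" | "entity"
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, source, relation, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source, relation, target))

    def neighbors(self, node_id):
        """Nodes directly connected to node_id, ignoring edge direction."""
        outgoing = {t for s, _, t in self.edges if s == node_id}
        incoming = {s for s, _, t in self.edges if t == node_id}
        return sorted(outgoing | incoming)

    def find_path(self, start, goal):
        """Shortest undirected path via breadth-first search, or None."""
        if start not in self.nodes or goal not in self.nodes:
            return None
        parents = {start: None}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if current == goal:
                path = []
                while current is not None:
                    path.append(current)
                    current = parents[current]
                return path[::-1]
            for nxt in self.neighbors(current):
                if nxt not in parents:
                    parents[nxt] = current
                    queue.append(nxt)
        return None


graph = CodeGraph()
graph.add_node("utils.py", "file")
graph.add_node("utils.py::chunk0", "chunk")
graph.add_node("format_date", "entity")
graph.add_node("app.py", "file")
graph.add_node("app.py::chunk0", "chunk")
graph.add_edge("utils.py", "contains", "utils.py::chunk0")
graph.add_edge("utils.py::chunk0", "declares", "format_date")
graph.add_edge("app.py", "contains", "app.py::chunk0")
graph.add_edge("app.py::chunk0", "calls", "format_date")

path = graph.find_path("app.py", "utils.py")
usages = graph.neighbors("format_date")
```

Here `path` links `app.py` to `utils.py` through the shared `format_date` entity, which is the kind of relationship an agent cannot see from file contents alone. Scanning the edge list on every lookup keeps the sketch short; a real graph would index edges by node.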
## MCP Tools Available

The MCP server exposes the following tools for LLM agents:

| Tool                      | Description                                               |
| ------------------------- | --------------------------------------------------------- |
| `search_nodes`            | Semantic + keyword search for code chunks                 |
| `get_node_info`           | Detailed information about any node (file, chunk, entity) |
| `get_node_edges`          | Incoming and outgoing relationships of a node             |
| `go_to_definition`        | Find where a function/class/variable is declared          |
| `find_usages`             | Find all places where an entity is called/used            |
| `get_neighbors`           | Get all directly connected nodes                          |
| `get_file_structure`      | Overview of a file's chunks and entities                  |
| `get_related_chunks`      | Find chunks related by a specific relationship type       |
| `list_all_entities`       | List all tracked entities in the codebase                 |
| `get_graph_stats`         | Statistics about the knowledge graph                      |
| `find_path`               | Find shortest path between two nodes                      |
| `get_subgraph`            | Extract a subgraph around a node                          |
| `print_tree`              | Display repository structure as a tree                    |
| `diff_chunks`             | Compare content between two code chunks                   |
| `search_by_type_and_name` | Search entities by type (class, function, etc.) and name  |
| `get_chunk_context`       | Get a chunk with its surrounding context                  |

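
The "semantic + keyword" behaviour described for `search_nodes` can be pictured as a weighted blend of two scores. The sketch below is an assumption-laden illustration, not the project's ranking code: the three-dimensional vectors are hand-written stand-ins for real embeddings, and `hybrid_search`, `keyword_score`, and the `alpha` weight are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query terms that appear in the text (case-insensitive)."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def hybrid_search(query, query_vec, chunks, alpha=0.5, top_k=2):
    """Rank chunks by alpha * semantic score + (1 - alpha) * keyword score."""
    scored = []
    for chunk in chunks:
        semantic = cosine(query_vec, chunk["vec"])
        lexical = keyword_score(query, chunk["text"])
        scored.append((chunk["id"], alpha * semantic + (1 - alpha) * lexical))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical chunks with toy embeddings.
CHUNKS = [
    {"id": "tokenizer.py::0", "text": "def tokenize text into subword tokens", "vec": [0.9, 0.1, 0.0]},
    {"id": "trainer.py::3", "text": "def train model with optimizer", "vec": [0.1, 0.9, 0.1]},
    {"id": "utils.py::1", "text": "def split text on whitespace", "vec": [0.7, 0.0, 0.3]},
]

hybrid_results = hybrid_search("split text", [0.8, 0.1, 0.1], CHUNKS)
semantic_only = hybrid_search("split text", [0.8, 0.1, 0.1], CHUNKS, alpha=1.0)
```

With these toy numbers, pure semantic ranking puts the tokenizer chunk first, while the blended score promotes the `utils.py` chunk because it matches both query terms exactly. That trade-off is the usual motivation for combining the two signals.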
## Supported Languages

The knowledge graph builder uses **AST-based entity extraction** for accurate parsing:

| Language              | Parser          | Entity Types                                    |
| --------------------- | --------------- | ----------------------------------------------- |
| Python                | `ast` module    | classes, functions, methods, variables, imports |
| C                     | `libclang`      | functions, structs, typedefs, variables         |
| C++                   | `libclang`      | classes, namespaces, methods, templates         |
| Java                  | `javalang`      | classes, interfaces, methods, fields            |
| JavaScript/TypeScript | `esprima`       | classes, functions, variables, imports          |
| Rust                  | `tree-sitter`   | structs, enums, traits, functions, modules      |
| HTML                  | `BeautifulSoup` | DOM elements, inline JS extraction              |

The system also detects **API endpoints** for web frameworks (FastAPI, Flask, Spring Boot, Actix-web, etc.).

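
As an illustration of the Python row in the table above, the standard-library `ast` module can list classes, functions, imports, and call targets in a few lines. This is a minimal sketch: the `extract_entities` helper and the `SOURCE` snippet are hypothetical and do not reproduce the logic of `EntityExtractor.py`.

```python
import ast

# Hypothetical source file to analyze.
SOURCE = '''
import os
from pathlib import Path

class Loader:
    def load(self, name):
        return read_file(Path(name))

def read_file(path):
    return os.path.basename(path)
'''


def extract_entities(source):
    """Collect class, function, import, and call names from Python source."""
    tree = ast.parse(source)
    entities = {"classes": [], "functions": [], "imports": [], "calls": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            entities["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entities["functions"].append(node.name)
        elif isinstance(node, ast.Import):
            entities["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; record "." in that case.
            entities["imports"].append(node.module or ".")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                entities["calls"].append(func.id)
            elif isinstance(func, ast.Attribute):
                # For os.path.basename(...) keep only the final attribute name.
                entities["calls"].append(func.attr)
    return {kind: sorted(set(names)) for kind, names in entities.items()}


entities = extract_entities(SOURCE)
```

Walking the syntax tree rather than matching text means names inside strings or comments are not mistaken for entities. Resolving each call name back to its declaration (the basis for `go_to_definition` and `find_usages`) needs scope and import resolution on top of this.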
## Getting Started

### Prerequisites

- Docker & Docker Compose
- Python 3.10+ (for local development)
- CUDA-capable GPU (optional, for faster embeddings)

### Quick Start with Docker

```bash
# Start the MCP server with a sample knowledge graph
docker-compose up
```

### Building a Knowledge Graph from Your Repository

```python
from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph

# From a local path
kg = RepoKnowledgeGraph.from_path(
    "/path/to/your/repo",
    skip_dirs=["node_modules", ".git", "__pycache__"],
    extract_entities=True,
    index_nodes=True,
)

# Save for later use
kg.save_graph_to_file("my_knowledge_graph.json")
```

### Running the MCP Server with Gradio

```bash
python gradio_mcp.py --graph-file my_knowledge_graph.json --host 0.0.0.0 --port 7860
```

## Interactive Explorer (Gradio UI)

The project includes a Gradio-based web interface for exploring knowledge graphs interactively:

- **Search**: Use natural language or keywords to find relevant code
- **Navigate**: Click through nodes to explore relationships
- **Analyze**: Get statistics about code structure and dependencies
- **Visualize**: View the repository tree and entity relationships

## Data Sources

The application supports loading knowledge graphs from multiple sources:

### 1. HuggingFace Hub Dataset (Recommended for Sharing)

Load directly from a HuggingFace dataset created by the library (see "Publishing to HuggingFace Hub"):

```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --hf-dataset "username/dataset-name"
```

### 2. Local JSON File

Use a local JSON file (e.g., `multihop_knowledge_graph_with_embeddings.json`):

```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --graph-file data/multihop_knowledge_graph_with_embeddings.json
```

### 3. Direct from Git Repository

Clone and analyze a repository on the fly:

```bash
python gradio_mcp.py --host 0.0.0.0 --port 7860 --repo-url "https://github.com/user/repo.git"
```

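
The three commands above differ only in which source flag they pass. One way such a command line could be parsed is sketched below. The flag names are taken from the commands in this README, but everything else is an assumption: the defaults, the mutual exclusivity of the sources, and the `build_parser` and `describe_source` helpers are hypothetical and may not match how `gradio_mcp.py` behaves.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Launch the knowledge graph MCP server (sketch)")
    parser.add_argument("--host", default="0.0.0.0", help="interface to bind (assumed default)")
    parser.add_argument("--port", type=int, default=7860, help="port to listen on (assumed default)")
    # Assumption: exactly one data source is given per launch.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--hf-dataset", help="HuggingFace dataset id, e.g. username/dataset-name")
    source.add_argument("--graph-file", help="path to a saved knowledge graph JSON file")
    source.add_argument("--repo-url", help="Git URL to clone and analyze")
    return parser


def describe_source(args):
    """Return (kind, value) for whichever source flag was provided."""
    if args.hf_dataset:
        return ("hf_dataset", args.hf_dataset)
    if args.graph_file:
        return ("graph_file", args.graph_file)
    return ("repo_url", args.repo_url)


demo_args = build_parser().parse_args(["--graph-file", "my_knowledge_graph.json"])
demo_source = describe_source(demo_args)
```

Declaring the sources as a mutually exclusive group makes `argparse` reject ambiguous launches with a usage error instead of silently preferring one source.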
### Publishing to HuggingFace Hub

You can save an existing knowledge graph to HuggingFace Hub for sharing:

```python
from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph

# Load from local file
kg = RepoKnowledgeGraph.load("path/to/graph.json")

# Push to HuggingFace Hub (without embeddings to reduce size)
kg.to_hf_dataset("username/my-knowledge-graph", save_embeddings=False, private=False)

# Or with embeddings (larger dataset)
kg.to_hf_dataset("username/my-knowledge-graph-with-embeddings", save_embeddings=True)
```

## Architecture Overview

```
root/
├── Dockerfile                      # Docker configuration
├── requirements.txt                # Python dependencies
├── RepoKnowledgeGraphLib/          # Knowledge graph implementation
│   ├── RepoKnowledgeGraph.py       # Main graph class
│   ├── KnowledgeGraphMCPServer.py  # MCP server implementation
│   ├── EntityExtractor.py          # AST-based entity extraction
│   ├── CodeParser.py               # Code chunking
│   ├── CodeIndex.py                # Hybrid search (LanceDB/Weaviate)
│   ├── ModelService.py             # Embedding generation
│   └── Node.py                     # Graph node types
└── gradio_mcp_space.py             # Main Gradio web interface
```

## License

This project is developed as part of research at EPITA / Ionis Group.

## Related Resources

- [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) - The protocol standard
- [Gradio](https://gradio.app/) - Python web interface framework with MCP support
- [LanceDB](https://lancedb.github.io/lancedb/) - Vector database for code indexing
- [Salesforce SFR-Embedding-Code](https://huggingface.co/Salesforce/SFR-Embedding-Code-400M_R) - Code embedding model

## VS Code Integration

To use this MCP server with **GitHub Copilot** in VS Code, you need to configure an `mcp.json` file.

### Configuration File Location

Create or edit the file at `.vscode/mcp.json` in your workspace root:

```
your-workspace/
├── .vscode/
│   └── mcp.json    ← Place the configuration here
├── src/
└── ...
```

### Configuration Content

Add the following content to `.vscode/mcp.json`:

```jsonc
{
  "servers": {
    "transformers-code-graph": {
      "url": "https://mcp-1st-birthday-code-knowledge-graph-explorer-t-327857c.hf.space/gradio_api/mcp/",
      "type": "http"
    }
  },
  "inputs": []
}
```

### What This Does

- **`servers`**: Defines the MCP servers available to VS Code
- **`transformers-code-graph`**: A custom name for this server connection
- **`url`**: The endpoint of the hosted MCP server (here pointing to the HuggingFace Space)
- **`type`**: Set to `"http"` for remote HTTP-based MCP servers

### Using with Your Own Server

If you're running your own MCP server locally, update the URL accordingly:

```jsonc
{
  "servers": {
    "my-code-graph": {
      "url": "http://localhost:7860/gradio_api/mcp/",
      "type": "http"
    }
  },
  "inputs": []
}
```

Once configured, GitHub Copilot in VS Code will have access to all the knowledge graph tools (`search_nodes`, `go_to_definition`, `find_usages`, etc.) to help navigate and understand your codebase.
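
If a workspace already lists other MCP servers, the entry can be merged into `.vscode/mcp.json` rather than overwriting the file. The script below is a minimal sketch: `add_mcp_server` is a hypothetical helper, the `other-server` entry and its URL are made up for the demonstration, and the code assumes the existing file is plain JSON, since Python's `json` module rejects the comments that `jsonc` permits.

```python
import json
import tempfile
from pathlib import Path


def add_mcp_server(workspace, name, url):
    """Add or update one HTTP server entry in <workspace>/.vscode/mcp.json."""
    config_path = Path(workspace) / ".vscode" / "mcp.json"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    if config_path.exists():
        # Assumption: the file contains no comments or trailing commas.
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    config.setdefault("servers", {})[name] = {"url": url, "type": "http"}
    config.setdefault("inputs", [])
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demonstration in a throwaway directory so no real workspace is modified.
with tempfile.TemporaryDirectory() as workspace:
    add_mcp_server(workspace, "other-server", "http://localhost:9000/mcp/")  # made-up entry
    config = add_mcp_server(workspace, "my-code-graph", "http://localhost:7860/gradio_api/mcp/")
    written = json.loads((Path(workspace) / ".vscode" / "mcp.json").read_text(encoding="utf-8"))
```

After the second call both servers are present in the file, so existing entries survive the update.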