title: Code Knowledge Graph Explorer - 🤗 Transformers Library
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
tags:
  - building-mcp-track-enterprise
short_description: MCP server for big code - explore Transformers

👥 Team

Team Name: CEPIA Ionis Team

Team Members:

  • Laila ELKOUSSY - @lailaelkoussy - Research Engineer, Data Scientist
  • Julien PEREZ - @jnm38 - Research Director

🎥 Demo Video

Available in Repo


Social Media Post

Available here


🎓 Code Knowledge Graph MCP Server

Helping LLM-based agents navigate and understand large codebases

📚 What is this project?

This project provides a Model Context Protocol (MCP) server that transforms code repositories into navigable knowledge graphs. It enables Large Language Model (LLM) based agents to efficiently explore, understand, and reason about complex codebases, a critical capability for modern software engineering education and practice.

🔬 Use Case: EPITA Coding Courses

This project was developed with educational applications in mind, specifically to support EPITA coding courses:

πŸ” Enhanced Code Discovery for Agents

LLM-based coding agents can use this tool to better discover and navigate large repositories. Instead of blindly searching through files, agents can:

  • Query the knowledge graph to understand the overall architecture
  • Follow relationships between modules, classes, and functions
  • Identify entry points and critical code paths
  • Understand how different parts of the codebase interact

📈 Detecting Areas for Code Improvement

For EPITA courses, this tool helps agents identify areas where student code can be improved:

  • Dead Code Detection: Find unused functions, classes, or variables
  • Circular Dependencies: Detect problematic import cycles between modules
  • Code Coupling Analysis: Identify tightly coupled components that should be refactored
  • Missing Documentation: Find undocumented public APIs and complex functions
  • Complexity Hotspots: Locate chunks with many outgoing calls (high coupling)
  • Orphan Code: Detect code that is declared but never called
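
Two of the checks above, circular dependencies and orphan code, can be sketched in a few lines. This is a minimal illustration, assuming the graph's import and call edges have already been reduced to plain dictionaries; the function names and the sample project are hypothetical and not part of the library's API.

```python
from typing import Dict, List, Optional, Set


def find_import_cycle(imports: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one import cycle as a list of modules, or None if there is none."""
    visiting: Set[str] = set()   # modules on the current DFS stack
    done: Set[str] = set()       # modules that are fully explored
    stack: List[str] = []

    def visit(module: str) -> Optional[List[str]]:
        visiting.add(module)
        stack.append(module)
        for dep in imports.get(module, []):
            if dep in visiting:
                # Back edge: the cycle is the stack slice starting at dep.
                return stack[stack.index(dep):] + [dep]
            if dep not in done:
                cycle = visit(dep)
                if cycle:
                    return cycle
        visiting.discard(module)
        done.add(module)
        stack.pop()
        return None

    for module in imports:
        if module not in done:
            cycle = visit(module)
            if cycle:
                return cycle
    return None


def find_uncalled(declared: Set[str], calls: Dict[str, List[str]]) -> Set[str]:
    """Return declared entities that never appear as the target of a call."""
    called = {callee for callees in calls.values() for callee in callees}
    return declared - called


# Hypothetical student project (module -> imported modules, caller -> callees)
imports = {"app": ["models", "utils"], "models": ["db"], "db": ["models"], "utils": []}
calls = {"main": ["load", "save"], "load": ["parse"]}

print(find_import_cycle(imports))
print(find_uncalled({"main", "load", "save", "parse", "legacy_export"}, calls))
```

Entry points such as `main` are never called from inside the project, so an orphan report like this one would need an allow-list before being shown to students.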

🎓 EPITA Course Integration

  • Project Reviews: Quickly understand student project architectures before grading
  • Automated Feedback: Integrate with LLM tutors to provide targeted improvement suggestions
  • Code Quality Assessment: Consistent evaluation criteria across student submissions
  • Learning Tool: Help students navigate and understand unfamiliar codebases (e.g., open-source projects)
  • Research: Study code organization patterns across student projects

The MCP interface makes it easy to integrate with any LLM-based tutoring or code review system used in EPITA courses.


🎯 The Problem We Solve

At EPITA (École pour l'informatique et les techniques avancées), students work on increasingly complex software projects throughout their curriculum. Understanding large codebases, whether their own, their teammates', or open-source libraries, is a fundamental skill for any computer science engineer.

However, LLM-based coding assistants face significant challenges when working with large repositories:

  • Context window limitations: LLMs cannot process entire codebases at once
  • Lack of structural awareness: Without understanding how code is organized, LLMs struggle to locate relevant files
  • Missing relationships: Function calls, class inheritance, and module dependencies are not immediately visible
  • Inefficient search: Simple keyword search fails to capture semantic meaning

💡 Our Solution: Knowledge Graphs + MCP

This project addresses these challenges by:

  1. Parsing repositories into a structured knowledge graph (files → chunks → entities)
  2. Extracting relationships between code elements (calls, contains, declares, imports)
  3. Indexing content with hybrid search (semantic embeddings + keyword matching)
  4. Exposing tools via MCP that allow LLM agents to navigate the codebase intelligently
┌──────────────────────────────────────────────────────────────────┐
│                     CODE REPOSITORY                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐          │
│  │  File A  │  │  File B  │  │  File C  │  │  File D  │   ...    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘          │
└───────┼─────────────┼─────────────┼─────────────┼────────────────┘
        ▼             ▼             ▼             ▼
┌──────────────────────────────────────────────────────────────────┐
│               KNOWLEDGE GRAPH CONSTRUCTION                       │
│  • AST Parsing (Python, C/C++, Java, JavaScript, Rust, HTML)     │
│  • Entity Extraction (classes, functions, variables, methods)    │
│  • Relationship Detection (calls, inheritance, imports)          │
│  • Code Chunking & Embedding (semantic vectors)                  │
└────────────────────────────────┬─────────────────────────────────┘
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                    MCP SERVER (Gradio)                           │
│  ┌─────────────┐ ┌────────────┐ ┌──────────────┐ ┌─────────────┐ │
│  │search_nodes │ │go_to_def   │ │find_usages   │ │get_neighbors│ │
│  └─────────────┘ └────────────┘ └──────────────┘ └─────────────┘ │
│  ┌─────────────┐ ┌────────────┐ ┌──────────────┐ ┌─────────────┐ │
│  │get_file_    │ │get_related │ │find_path     │ │print_tree   │ │
│  │structure    │ │_chunks     │ │              │ │             │ │
│  └─────────────┘ └────────────┘ └──────────────┘ └─────────────┘ │
└────────────────────────────────┬─────────────────────────────────┘
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                    LLM-BASED AGENT                               │
│  • Can search for relevant code using natural language           │
│  • Navigate from function calls to their definitions             │
│  • Understand the structure of files and directories             │
│  • Trace dependencies and relationships across the codebase      │
└──────────────────────────────────────────────────────────────────┘
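
To make the pipeline above more concrete, here is a minimal sketch of a file → chunk → entity graph with typed edges, together with toy versions of three of the navigation tools. The `TinyGraph` class, its method bodies, and the node IDs are illustrative assumptions; they are not the library's actual data model or implementation.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Tuple


class TinyGraph:
    """Toy knowledge graph: nodes are string IDs, edges carry a relation label."""

    def __init__(self) -> None:
        self.node_types: Dict[str, str] = {}
        self.out_edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
        self.in_edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_node(self, node_id: str, node_type: str) -> None:
        self.node_types[node_id] = node_type

    def add_edge(self, source: str, relation: str, target: str) -> None:
        self.out_edges[source].append((relation, target))
        self.in_edges[target].append((relation, source))

    def go_to_definition(self, entity: str) -> List[str]:
        """Chunks that declare the entity (incoming 'declares' edges)."""
        return [src for rel, src in self.in_edges[entity] if rel == "declares"]

    def find_usages(self, entity: str) -> List[str]:
        """Chunks that call the entity (incoming 'calls' edges)."""
        return [src for rel, src in self.in_edges[entity] if rel == "calls"]

    def neighbors(self, node: str) -> List[str]:
        """All directly connected nodes, ignoring edge direction."""
        linked = [t for _, t in self.out_edges[node]] + [s for _, s in self.in_edges[node]]
        return list(dict.fromkeys(linked))  # de-duplicate, keep order

    def find_path(self, start: str, goal: str) -> Optional[List[str]]:
        """Breadth-first search: a shortest path by edge count, or None."""
        parents: Dict[str, Optional[str]] = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path: List[str] = []
                current: Optional[str] = node
                while current is not None:
                    path.append(current)
                    current = parents[current]
                return path[::-1]
            for nxt in self.neighbors(node):
                if nxt not in parents:
                    parents[nxt] = node
                    queue.append(nxt)
        return None


graph = TinyGraph()
for node_id, node_type in [("utils.py", "file"), ("utils.py#0", "chunk"),
                           ("main.py", "file"), ("main.py#0", "chunk"),
                           ("slugify", "entity")]:
    graph.add_node(node_id, node_type)

graph.add_edge("utils.py", "contains", "utils.py#0")
graph.add_edge("main.py", "contains", "main.py#0")
graph.add_edge("utils.py#0", "declares", "slugify")
graph.add_edge("main.py#0", "calls", "slugify")

print(graph.go_to_definition("slugify"))
print(graph.find_usages("slugify"))
print(graph.find_path("main.py", "utils.py"))
```

Storing both outgoing and incoming edge lists is a design choice in this sketch: it keeps "who declares this?" and "who calls this?" lookups cheap without scanning every edge.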

πŸ› οΈ MCP Tools Available

The MCP server exposes the following tools for LLM agents:

| Tool | Description |
|------|-------------|
| search_nodes | Semantic + keyword search for code chunks |
| get_node_info | Detailed information about any node (file, chunk, entity) |
| get_node_edges | Incoming and outgoing relationships of a node |
| go_to_definition | Find where a function/class/variable is declared |
| find_usages | Find all places where an entity is called/used |
| get_neighbors | Get all directly connected nodes |
| get_file_structure | Overview of a file's chunks and entities |
| get_related_chunks | Find chunks related by a specific relationship type |
| list_all_entities | List all tracked entities in the codebase |
| get_graph_stats | Statistics about the knowledge graph |
| find_path | Find shortest path between two nodes |
| get_subgraph | Extract a subgraph around a node |
| print_tree | Display repository structure as a tree |
| diff_chunks | Compare content between two code chunks |
| search_by_type_and_name | Search entities by type (class, function, etc.) and name |
| get_chunk_context | Get a chunk with its surrounding context |
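
As a rough illustration of what "semantic + keyword" search can mean for a tool like search_nodes, the sketch below blends cosine similarity over embedding vectors with a keyword-overlap score. Everything here is assumed for the example: the three-dimensional vectors are made up rather than produced by a real embedding model, and the weighting parameter `alpha` and the function names are hypothetical, not the project's actual ranking logic.

```python
import math
import re
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that occur as whole tokens in the chunk text."""
    terms = set(re.findall(r"[a-z0-9_]+", query.lower()))
    words = set(re.findall(r"[a-z0-9_]+", text.lower()))
    return len(terms & words) / len(terms) if terms else 0.0


def hybrid_search(
    query: str,
    query_vec: List[float],
    chunks: Dict[str, Tuple[str, List[float]]],
    alpha: float = 0.5,
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Rank chunks by alpha * semantic score + (1 - alpha) * keyword score."""
    scored = []
    for chunk_id, (text, vec) in chunks.items():
        score = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text)
        scored.append((chunk_id, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Made-up chunks: (source text, toy embedding vector)
chunks = {
    "tokenizer.py#0": ("def encode(text): return vocab lookup", [0.9, 0.1, 0.0]),
    "trainer.py#3": ("def train(model, data): run the training loop", [0.1, 0.9, 0.2]),
    "io.py#1": ("def load_text(path): read text from disk", [0.6, 0.0, 0.7]),
}

results = hybrid_search("encode text", [1.0, 0.0, 0.0], chunks)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

With `alpha=1.0` the ranking is purely semantic and with `alpha=0.0` purely keyword-based, which makes the trade-off easy to experiment with.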

🌐 Supported Languages

The knowledge graph builder uses AST-based entity extraction for accurate parsing:

| Language | Parser | Entity Types |
|----------|--------|--------------|
| Python | ast module | classes, functions, methods, variables, imports |
| C | libclang | functions, structs, typedefs, variables |
| C++ | libclang | classes, namespaces, methods, templates |
| Java | javalang | classes, interfaces, methods, fields |
| JavaScript/TypeScript | esprima | classes, functions, variables, imports |
| Rust | tree-sitter | structs, enums, traits, functions, modules |
| HTML | BeautifulSoup | DOM elements, inline JS extraction |

The system also detects API endpoints for web frameworks (FastAPI, Flask, Spring Boot, Actix-web, etc.).
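
For the Python row of the table, AST-based extraction can be sketched with the standard library's `ast` module. This is a simplified illustration, not the project's EntityExtractor: the helper names, the sample source, and the decision to look only at top-level statements are assumptions made for the example.

```python
import ast
from typing import Dict, List

SOURCE = '''
import os
from pathlib import Path

MAX_RETRIES = 3

class Loader:
    def load(self, path):
        return Path(path).read_text()

def main():
    Loader().load(os.getcwd())
'''


def extract_entities(source: str) -> Dict[str, List[str]]:
    """Collect imports, classes, functions, methods and top-level variables."""
    found: Dict[str, List[str]] = {
        "imports": [], "classes": [], "functions": [], "methods": [], "variables": []
    }
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            found["imports"] += [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            prefix = node.module or ""  # module is None for "from . import x"
            found["imports"] += [
                f"{prefix}.{alias.name}" if prefix else alias.name for alias in node.names
            ]
        elif isinstance(node, ast.ClassDef):
            found["classes"].append(node.name)
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    found["methods"].append(f"{node.name}.{item.name}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found["functions"].append(node.name)
        elif isinstance(node, ast.Assign):
            found["variables"] += [t.id for t in node.targets if isinstance(t, ast.Name)]
    return found


def extract_calls(source: str) -> List[str]:
    """Names of everything that is called, as a rough 'calls' relationship."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                names.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
    return sorted(names)


entities = extract_entities(SOURCE)
called = extract_calls(SOURCE)
print(entities)
print(called)
```

Working on the syntax tree rather than on raw text is what lets the extractor tell a method from a free function, or a call from a string that merely contains the same name.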

🚀 Getting Started

Prerequisites

  • Docker & Docker Compose
  • Python 3.10+ (for local development)
  • CUDA-capable GPU (optional, for faster embeddings)

Quick Start with Docker

# Start the MCP server with a sample knowledge graph
docker-compose up

Building a Knowledge Graph from Your Repository

from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph

# From a local path
kg = RepoKnowledgeGraph.from_path(
    "/path/to/your/repo",
    skip_dirs=["node_modules", ".git", "__pycache__"],
    extract_entities=True,
    index_nodes=True
)

# Save for later use
kg.save_graph_to_file("my_knowledge_graph.json")

Running the MCP Server with Gradio

python gradio_mcp.py --graph-file my_knowledge_graph.json --host 0.0.0.0 --port 7860

📊 Interactive Explorer (Gradio UI)

The project includes a Gradio-based web interface for exploring knowledge graphs interactively:

  • Search: Use natural language or keywords to find relevant code
  • Navigate: Click through nodes to explore relationships
  • Analyze: Get statistics about code structure and dependencies
  • Visualize: View the repository tree and entity relationships

πŸ“ Data Sources

The application supports loading knowledge graphs from multiple sources:

1. HuggingFace Hub Dataset (Recommended for Sharing)

Load directly from a HuggingFace dataset created by the library (see Publishing to HuggingFace Hub below):

python gradio_mcp.py --host 0.0.0.0 --port 7860 --hf-dataset "username/dataset-name"

2. Local JSON File

Use a local JSON file (e.g., multihop_knowledge_graph_with_embeddings.json):

python gradio_mcp.py --host 0.0.0.0 --port 7860 --graph-file data/multihop_knowledge_graph_with_embeddings.json

3. Direct from Git Repository

Clone and analyze a repository on-the-fly:

python gradio_mcp.py --host 0.0.0.0 --port 7860 --repo-url "https://github.com/user/repo.git"
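
The three commands above share `--host` and `--port` and differ only in the data-source flag. One way such a command line could be modelled is sketched below with `argparse`. This is an assumption-laden illustration, not the actual contents of gradio_mcp.py: the flag names come from the commands above, but the default values, the choice to make the sources mutually exclusive, and the `describe_source` helper are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts exactly one knowledge graph source."""
    parser = argparse.ArgumentParser(description="Serve a code knowledge graph over MCP")
    parser.add_argument("--host", default="0.0.0.0")       # assumed default
    parser.add_argument("--port", type=int, default=7860)   # assumed default
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--hf-dataset", help="HuggingFace Hub dataset ID")
    source.add_argument("--graph-file", help="Path to a saved graph JSON file")
    source.add_argument("--repo-url", help="Git URL to clone and analyze")
    return parser


def describe_source(args: argparse.Namespace) -> str:
    """Return a short label for whichever source flag was given."""
    if args.hf_dataset:
        return f"hub:{args.hf_dataset}"
    if args.graph_file:
        return f"file:{args.graph_file}"
    return f"git:{args.repo_url}"


args = build_parser().parse_args(["--port", "7860", "--graph-file", "my_knowledge_graph.json"])
print(describe_source(args))
```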

Publishing to HuggingFace Hub

You can save an existing knowledge graph to HuggingFace Hub for sharing:

from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph

# Load from local file
kg = RepoKnowledgeGraph.load("path/to/graph.json")

# Push to HuggingFace Hub (without embeddings to reduce size)
kg.to_hf_dataset("username/my-knowledge-graph", save_embeddings=False, private=False)

# Or with embeddings (larger dataset)
kg.to_hf_dataset("username/my-knowledge-graph-with-embeddings", save_embeddings=True)

πŸ—οΈ Architecture Overview

root/
β”œβ”€β”€ Dockerfile                  # Docker configuration
β”œβ”€β”€ requirements.txt            # Python dependencies
β”œβ”€β”€ RepoKnowledgeGraphLib/  # Knowledge graph implementation
β”‚   β”œβ”€β”€ RepoKnowledgeGraph.py    # Main graph class
β”‚   β”œβ”€β”€ KnowledgeGraphMCPServer.py # MCP server implementation
β”‚   β”œβ”€β”€ EntityExtractor.py       # AST-based entity extraction
β”‚   β”œβ”€β”€ CodeParser.py            # Code chunking
β”‚   β”œβ”€β”€ CodeIndex.py             # Hybrid search (LanceDB/Weaviate)
β”‚   β”œβ”€β”€ ModelService.py          # Embedding generation
β”‚   └── Node.py                  # Graph node types
└── gradio_mcp_space.py              # Main Gradio web interface

📄 License

This project is developed as part of research at EPITA / Ionis Group.

🔗 Related Resources

🆚 VS Code Integration

To use this MCP server with GitHub Copilot in VS Code, you need to configure an mcp.json file.

Configuration File Location

Create or edit the file at .vscode/mcp.json in your workspace root:

your-workspace/
├── .vscode/
│   └── mcp.json    ← Place the configuration here
├── src/
└── ...

Configuration Content

Add the following content to .vscode/mcp.json:

{
    "servers": {
        "transformers-code-graph": {
            "url": "https://mcp-1st-birthday-code-knowledge-graph-explorer-t-327857c.hf.space/gradio_api/mcp/",
            "type": "http"
        }
    },
    "inputs": []
}

What This Does

  • servers: Defines the MCP servers available to VS Code
  • transformers-code-graph: A custom name for this server connection
  • url: The endpoint of the hosted MCP server (here pointing to the HuggingFace Space)
  • type: Set to "http" for remote HTTP-based MCP servers

Using with Your Own Server

If you're running your own MCP server locally, update the URL accordingly:

{
    "servers": {
        "my-code-graph": {
            "url": "http://localhost:7860/gradio_api/mcp/",
            "type": "http"
        }
    },
    "inputs": []
}

Once configured, GitHub Copilot in VS Code will have access to all the knowledge graph tools (search_nodes, go_to_definition, find_usages, etc.) to help navigate and understand your codebase.
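
If you prefer to script the setup, the configuration shown above can be written with a few lines of Python. This is a minimal sketch: the `write_mcp_config` helper is hypothetical, and it assumes any existing mcp.json is plain JSON without comments. The example runs against a temporary directory so it does not touch a real workspace.

```python
import json
import tempfile
from pathlib import Path


def write_mcp_config(workspace: Path, name: str, url: str) -> Path:
    """Create or update .vscode/mcp.json with one HTTP server entry."""
    config_path = workspace / ".vscode" / "mcp.json"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {"servers": {}, "inputs": []}
    # Add or overwrite the named server, leaving other entries untouched.
    config.setdefault("servers", {})[name] = {"url": url, "type": "http"}
    config.setdefault("inputs", [])
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config_path


with tempfile.TemporaryDirectory() as tmp:
    path = write_mcp_config(Path(tmp), "my-code-graph", "http://localhost:7860/gradio_api/mcp/")
    written = json.loads(path.read_text(encoding="utf-8"))
    print(json.dumps(written, indent=4))
```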