title: Code Knowledge Graph Explorer — 🤗 Transformers Library
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
tags:
  - building-mcp-track-enterprise
short_description: MCP server for big code — explore Transformers

🎓 Code Knowledge Graph MCP Server

Helping LLM-based agents navigate and understand large codebases

📚 What is this project?

This project provides a Model Context Protocol (MCP) server that transforms code repositories into navigable knowledge graphs. It enables Large Language Model (LLM) based agents to efficiently explore, understand, and reason about complex codebases — a critical capability for modern software engineering education and practice.

🔬 Use Case: EPITA Coding Courses

This project was developed with educational applications in mind, specifically to support EPITA coding courses:

🔍 Enhanced Code Discovery for Agents

LLM-based coding agents can use this tool to better discover and navigate large repositories. Instead of blindly searching through files, agents can:

Query the knowledge graph to understand the overall architecture
Follow relationships between modules, classes, and functions
Identify entry points and critical code paths
Understand how different parts of the codebase interact

📈 Detecting Areas for Code Improvement

For EPITA courses, this tool helps agents identify areas where student code can be improved:

Dead Code Detection: Find unused functions, classes, or variables
Circular Dependencies: Detect problematic import cycles between modules
Code Coupling Analysis: Identify tightly coupled components that should be refactored
Missing Documentation: Find undocumented public APIs and complex functions
Complexity Hotspots: Locate chunks with many outgoing calls (high coupling)
Orphan Code: Detect code that is declared but never called
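
Two of the checks above, circular dependencies and orphan code, reduce to simple graph queries once import and call relationships are available. The following is a minimal sketch under stated assumptions: the `imports` and `calls` dictionaries are hand-written stand-ins for edges that would come from the knowledge graph, and `find_import_cycles` and `find_orphans` are hypothetical helpers, not functions of this library.

```python
from collections import defaultdict


def find_import_cycles(imports):
    """Return one cycle per back edge found by a depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = defaultdict(int)
    stack, cycles = [], []

    def visit(module):
        color[module] = GREY
        stack.append(module)
        for dep in imports.get(module, ()):
            if color[dep] == GREY:
                # dep is already on the current path, so the path closes a cycle
                cycles.append(stack[stack.index(dep):] + [dep])
            elif color[dep] == WHITE:
                visit(dep)
        stack.pop()
        color[module] = BLACK

    for module in list(imports):
        if color[module] == WHITE:
            visit(module)
    return cycles


def find_orphans(declared, calls):
    """Return entities that are declared but never appear as a call target."""
    called = {callee for callees in calls.values() for callee in callees}
    return sorted(set(declared) - called)


# Hypothetical edges: module -> imported modules, caller -> callees
imports = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
calls = {"main": ["parse", "render"], "parse": ["tokenize"]}
declared = ["main", "parse", "render", "tokenize", "legacy_export"]

cycles = find_import_cycles(imports)
orphans = find_orphans(declared, calls)
print(cycles)   # [['a', 'b', 'c', 'a']]
print(orphans)  # ['legacy_export', 'main']
```

Note that `main` is reported as an orphan because nothing calls it. A real check would need an allow-list of entry points before flagging code as unused.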

🎓 EPITA Course Integration

Project Reviews: Quickly understand student project architectures before grading
Automated Feedback: Integrate with LLM tutors to provide targeted improvement suggestions
Code Quality Assessment: Consistent evaluation criteria across student submissions
Learning Tool: Help students navigate and understand unfamiliar codebases (e.g., open-source projects)
Research: Study code organization patterns across student projects

The MCP interface makes it easy to integrate with any LLM-based tutoring or code review system used in EPITA courses.

🎯 The Problem We Solve

At EPITA (École pour l'informatique et les techniques avancées), students work on increasingly complex software projects throughout their curriculum. Understanding large codebases — whether their own, their teammates', or open-source libraries — is a fundamental skill for any computer science engineer.

However, LLM-based coding assistants face significant challenges when working with large repositories:

Context window limitations: LLMs cannot process entire codebases at once
Lack of structural awareness: Without understanding how code is organized, LLMs struggle to locate relevant files
Missing relationships: Function calls, class inheritance, and module dependencies are not immediately visible
Inefficient search: Simple keyword search fails to capture semantic meaning

💡 Our Solution: Knowledge Graphs + MCP

This project addresses these challenges by:

Parsing repositories into a structured knowledge graph (files → chunks → entities)
Extracting relationships between code elements (calls, contains, declares, imports)
Indexing content with hybrid search (semantic embeddings + keyword matching)
Exposing tools via MCP that allow LLM agents to navigate the codebase intelligently
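
To illustrate the hybrid search step, here is a minimal sketch that blends a semantic score (cosine similarity between vectors) with a keyword score (fraction of query terms found in the chunk). Everything in it is an assumption made for illustration: the three-dimensional vectors are hand-written rather than produced by an embedding model, the `alpha` weight and the scoring formula are arbitrary, and `hybrid_search` is a hypothetical helper, not the project's index implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query terms that occur in the text (case-insensitive)."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def hybrid_search(query, query_vec, chunks, alpha=0.5):
    """Rank chunk ids by alpha * semantic score + (1 - alpha) * keyword score."""
    scored = []
    for chunk in chunks:
        semantic = cosine(query_vec, chunk["vec"])
        lexical = keyword_score(query, chunk["text"])
        scored.append((alpha * semantic + (1 - alpha) * lexical, chunk["id"]))
    return [chunk_id for _, chunk_id in sorted(scored, reverse=True)]


# Toy chunks with made-up embeddings
chunks = [
    {"id": "model.py#3", "text": "def forward pass of attention", "vec": [0.1, 0.9, 0.0]},
    {"id": "io.py#1", "text": "def load weights from checkpoint", "vec": [0.6, 0.2, 0.2]},
    {"id": "tok.py#0", "text": "def load tokenizer from disk", "vec": [0.9, 0.1, 0.0]},
]

ranking = hybrid_search("load tokenizer", [1.0, 0.0, 0.0], chunks)
print(ranking)  # ['tok.py#0', 'io.py#1', 'model.py#3']
```

The chunk that matches both the meaning and the exact terms ranks first, while a chunk that shares only one keyword still outranks an unrelated one.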

┌─────────────────────────────────────────────────────────────────┐
│                     CODE REPOSITORY                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │  File A  │  │  File B  │  │  File C  │  │  File D  │   ...   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
└───────┼─────────────┼─────────────┼─────────────┼───────────────┘
        ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────────┐
│               KNOWLEDGE GRAPH CONSTRUCTION                       │
│  • AST Parsing (Python, C/C++, Java, JavaScript, Rust, HTML)    │
│  • Entity Extraction (classes, functions, variables, methods)   │
│  • Relationship Detection (calls, inheritance, imports)         │
│  • Code Chunking & Embedding (semantic vectors)                 │
└───────────────────────────────┬─────────────────────────────────┘
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    MCP SERVER (Gradio)                           │
│  ┌─────────────┐ ┌────────────┐ ┌─────────────┐ ┌─────────────┐ │
│  │search_nodes │ │go_to_def   │ │find_usages  │ │get_neighbors│ │
│  └─────────────┘ └────────────┘ └─────────────┘ └─────────────┘ │
│  ┌─────────────┐ ┌────────────┐ ┌─────────────┐ ┌─────────────┐ │
│  │get_file_    │ │get_related │ │find_path    │ │print_tree   │ │
│  │structure    │ │_chunks     │ │             │ │             │ │
│  └─────────────┘ └────────────┘ └─────────────┘ └─────────────┘ │
└───────────────────────────────┬─────────────────────────────────┘
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    LLM-BASED AGENT                               │
│  • Can search for relevant code using natural language          │
│  • Navigate from function calls to their definitions            │
│  • Understand the structure of files and directories            │
│  • Trace dependencies and relationships across the codebase     │
└─────────────────────────────────────────────────────────────────┘
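
The data model behind the diagram can be pictured as typed nodes (files, chunks, entities) connected by labeled edges (contains, declares, calls, imports). The sketch below is a toy stand-in written for this explanation: `ToyGraph` and its methods are hypothetical and are not the `RepoKnowledgeGraph` API. It shows how "go to definition" and "find usages" become edge lookups.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGraph:
    """Toy graph: node id -> node type, plus (source, relation, target) edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, source, relation, target):
        self.edges.append((source, relation, target))

    def outgoing(self, node_id, relation):
        """Targets reached from node_id through edges with the given relation."""
        return [t for s, r, t in self.edges if s == node_id and r == relation]

    def incoming(self, node_id, relation):
        """Sources that point at node_id through edges with the given relation."""
        return [s for s, r, t in self.edges if t == node_id and r == relation]


kg = ToyGraph()
for node_id, node_type in [
    ("utils.py", "file"), ("utils.py#0", "chunk"), ("slugify", "entity"),
    ("app.py", "file"), ("app.py#0", "chunk"),
]:
    kg.add_node(node_id, node_type)

kg.add_edge("utils.py", "contains", "utils.py#0")   # file -> chunk
kg.add_edge("utils.py#0", "declares", "slugify")    # chunk -> entity
kg.add_edge("app.py", "contains", "app.py#0")
kg.add_edge("app.py#0", "calls", "slugify")         # chunk -> entity it uses

definition = kg.incoming("slugify", "declares")  # where is slugify declared?
usages = kg.incoming("slugify", "calls")         # which chunks call it?
print(definition, usages)  # ['utils.py#0'] ['app.py#0']
```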

🛠️ MCP Tools Available

The MCP server exposes the following tools for LLM agents:

| Tool | Description |
|------|-------------|
| `search_nodes` | Semantic + keyword search for code chunks |
| `get_node_info` | Detailed information about any node (file, chunk, entity) |
| `get_node_edges` | Incoming and outgoing relationships of a node |
| `go_to_definition` | Find where a function/class/variable is declared |
| `find_usages` | Find all places where an entity is called/used |
| `get_neighbors` | Get all directly connected nodes |
| `get_file_structure` | Overview of a file's chunks and entities |
| `get_related_chunks` | Find chunks related by a specific relationship type |
| `list_all_entities` | List all tracked entities in the codebase |
| `get_graph_stats` | Statistics about the knowledge graph |
| `find_path` | Find shortest path between two nodes |
| `get_subgraph` | Extract a subgraph around a node |
| `print_tree` | Display repository structure as a tree |
| `diff_chunks` | Compare content between two code chunks |
| `search_by_type_and_name` | Search entities by type (class, function, etc.) and name |
| `get_chunk_context` | Get a chunk with its surrounding context |
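
As an example of what a tool such as `find_path` has to compute, the sketch below runs a breadth-first search over an adjacency map, which yields a shortest path in an unweighted graph. It is an illustration only: the adjacency data is invented, edges are treated as directed (whether the real tool does so is not stated here), and this function is not the server's implementation.

```python
from collections import deque


def find_path(adjacency, start, goal):
    """Return a shortest path from start to goal as a list of nodes, or None."""
    if start == goal:
        return [start]
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt in parents:
                continue
            parents[nxt] = node
            if nxt == goal:
                # Walk the parent links back to the start, then reverse
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


# Hypothetical edges: file -> chunk -> entity -> declaring chunk -> file
adjacency = {
    "app.py": ["app.py#0"],
    "app.py#0": ["slugify", "render"],
    "slugify": ["utils.py#0"],
    "render": ["views.py#2"],
    "utils.py#0": ["utils.py"],
}

path = find_path(adjacency, "app.py", "utils.py")
print(path)  # ['app.py', 'app.py#0', 'slugify', 'utils.py#0', 'utils.py']
```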

🌐 Supported Languages

The knowledge graph builder uses AST-based entity extraction for accurate parsing:

| Language | Parser | Entity Types |
|----------|--------|--------------|
| Python | `ast` module | classes, functions, methods, variables, imports |
| C | `libclang` | functions, structs, typedefs, variables |
| C++ | `libclang` | classes, namespaces, methods, templates |
| Java | `javalang` | classes, interfaces, methods, fields |
| JavaScript/TypeScript | `esprima` | classes, functions, variables, imports |
| Rust | `tree-sitter` | structs, enums, traits, functions, modules |
| HTML | `BeautifulSoup` | DOM elements, inline JS extraction |

The system also detects API endpoints for web frameworks (FastAPI, Flask, Spring Boot, Actix-web, etc.).
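
For Python, AST-based extraction can be sketched with the standard library `ast` module alone. The example below is a simplified illustration, not the project's `EntityExtractor`: the `extract_entities` helper and its output format are assumptions, and it only looks at top-level definitions plus call names.

```python
import ast

SAMPLE = '''
import os
from pathlib import Path

class Loader:
    def load(self, path):
        return Path(path).read_text()

def main():
    return Loader().load(os.getcwd())
'''


def extract_entities(source):
    """Collect top-level classes, methods, functions, imports and call names."""
    tree = ast.parse(source)
    found = {"classes": [], "methods": [], "functions": [], "imports": []}
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)

    # Top-level statements only; nested definitions are ignored in this sketch
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            found["classes"].append(node.name)
            for item in node.body:
                if isinstance(item, function_types):
                    found["methods"].append(f"{node.name}.{item.name}")
        elif isinstance(node, function_types):
            found["functions"].append(node.name)
        elif isinstance(node, ast.Import):
            found["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            found["imports"].append(node.module or ".")

    # Call targets anywhere in the file: f(...) gives "f", obj.m(...) gives "m"
    calls = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                calls.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                calls.add(node.func.attr)
    found["calls"] = sorted(calls)
    return found


entities = extract_entities(SAMPLE)
print(entities)
```

The call names collected at the end are the raw material for "calls" relationships; resolving a name such as `load` to the entity that declares it is a separate step that this sketch does not attempt.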

🚀 Getting Started

Prerequisites

Docker & Docker Compose
Python 3.10+ (for local development)
CUDA-capable GPU (optional, for faster embeddings)

Quick Start with Docker

# Start the MCP server with a sample knowledge graph
docker-compose up

Building a Knowledge Graph from Your Repository

from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph

# From a local path
kg = RepoKnowledgeGraph.from_path(
    "/path/to/your/repo",
    skip_dirs=["node_modules", ".git", "__pycache__"],
    extract_entities=True,
    index_nodes=True,
)

# Save for later use
kg.save_graph_to_file("my_knowledge_graph.json")

Running the MCP Server with Gradio

python gradio_mcp.py --graph-file my_knowledge_graph.json --host 0.0.0.0 --port 7860

📊 Interactive Explorer (Gradio UI)

The project includes a Gradio-based web interface for exploring knowledge graphs interactively:

Search: Use natural language or keywords to find relevant code
Navigate: Click through nodes to explore relationships
Analyze: Get statistics about code structure and dependencies
Visualize: View the repository tree and entity relationships

📁 Data Sources

The application supports loading knowledge graphs from multiple sources:

1. HuggingFace Hub Dataset (Recommended for Sharing)

Load directly from a HuggingFace dataset created by the library (see Publishing to HuggingFace Hub below):

python gradio_mcp.py --host 0.0.0.0 --port 7860 --hf-dataset "username/dataset-name"

2. Local JSON File

Use a local JSON file (e.g., multihop_knowledge_graph_with_embeddings.json):

python gradio_mcp.py --host 0.0.0.0 --port 7860 --graph-file data/multihop_knowledge_graph_with_embeddings.json

3. Direct from Git Repository

Clone and analyze a repository on the fly:

python gradio_mcp.py --host 0.0.0.0 --port 7860 --repo-url "https://github.com/user/repo.git"

Publishing to HuggingFace Hub

You can save an existing knowledge graph to HuggingFace Hub for sharing:

from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph

# Load from local file
kg = RepoKnowledgeGraph.load("path/to/graph.json")

# Push to HuggingFace Hub (without embeddings to reduce size)
kg.to_hf_dataset("username/my-knowledge-graph", save_embeddings=False, private=False)

# Or with embeddings (larger dataset)
kg.to_hf_dataset("username/my-knowledge-graph-with-embeddings", save_embeddings=True)

🏗️ Architecture Overview

root/
├── Dockerfile                       # Docker configuration
├── requirements.txt                 # Python dependencies
├── RepoKnowledgeGraphLib/           # Knowledge graph implementation
│   ├── RepoKnowledgeGraph.py        # Main graph class
│   ├── KnowledgeGraphMCPServer.py   # MCP server implementation
│   ├── EntityExtractor.py           # AST-based entity extraction
│   ├── CodeParser.py                # Code chunking
│   ├── CodeIndex.py                 # Hybrid search (LanceDB/Weaviate)
│   ├── ModelService.py              # Embedding generation
│   └── Node.py                      # Graph node types
└── gradio_mcp_space.py              # Main Gradio web interface

👥 Team

Team Name: CEPIA Ionis Team

Team Members:

Laila ELKOUSSY - @lailaelkoussy - Research Engineer, Data Scientist
Julien PEREZ - @jnm38 - Research Director

📄 License

This project is developed as part of research at EPITA / Ionis Group.

🔗 Related Resources

Model Context Protocol (MCP) - The protocol standard
Gradio - Python web interface framework with MCP support
LanceDB - Vector database for code indexing
Salesforce SFR-Embedding-Code - Code embedding model

🆚 VS Code Integration

To use this MCP server with GitHub Copilot in VS Code, you need to configure an mcp.json file.

Configuration File Location

Create or edit the file at .vscode/mcp.json in your workspace root:

your-workspace/
├── .vscode/
│   └── mcp.json    ← Place the configuration here
├── src/
└── ...

Configuration Content

Add the following content to .vscode/mcp.json:

{
    "servers": {
        "transformers-code-graph": {
            "url": "https://lailaelkoussy-transformers-library-knowledge-graph.hf.space/gradio_api/mcp/",
            "type": "http"
        }
    },
    "inputs": []
}

What This Does

servers: Defines the MCP servers available to VS Code
transformers-code-graph: A custom name for this server connection
url: The endpoint of the hosted MCP server (here pointing to the HuggingFace Space)
type: Set to "http" for remote HTTP-based MCP servers

Using with Your Own Server

If you're running your own MCP server locally, update the URL accordingly:

{
    "servers": {
        "my-code-graph": {
            "url": "http://localhost:7860/gradio_api/mcp/",
            "type": "http"
        }
    },
    "inputs": []
}

Once configured, GitHub Copilot in VS Code will have access to all the knowledge graph tools (search_nodes, go_to_definition, find_usages, etc.) to help navigate and understand your codebase.
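
If the same entry has to be added to several workspaces, the file can also be written by a small script. The sketch below is one possible approach, with the helper name `add_mcp_server` being hypothetical: it merges one server entry into `.vscode/mcp.json` while keeping any servers already listed. It assumes the existing file, if present, is plain JSON without comments.

```python
import json
import tempfile
from pathlib import Path


def add_mcp_server(workspace, name, url):
    """Add or update one HTTP server entry in <workspace>/.vscode/mcp.json."""
    config_path = Path(workspace) / ".vscode" / "mcp.json"
    config_path.parent.mkdir(parents=True, exist_ok=True)

    # Start from the existing configuration so other servers are preserved
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}

    config.setdefault("servers", {})[name] = {"url": url, "type": "http"}
    config.setdefault("inputs", [])
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config_path


# Demonstration in a throwaway directory; point it at a real workspace to use it
with tempfile.TemporaryDirectory() as workspace:
    path = add_mcp_server(
        workspace, "my-code-graph", "http://localhost:7860/gradio_api/mcp/"
    )
    print(path.read_text(encoding="utf-8"))
```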