Spaces:

lailaelkoussy
/

code-knowledge-graph-explorer-transformers-library

lailaelkoussy committed on Nov 30, 2025

Commit

b1ddffc

verified ·

1 Parent(s): f1dcdb0

Update README.md

README.md +332 -5

@@ -1,10 +1,337 @@
 ---
-title: Code Knowledge Graph Explorer Transformers Library
-emoji: 🏢
-colorFrom: red
-colorTo: gray
 sdk: docker
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Code Knowledge Graph Explorer — 🤗 Transformers Library
+emoji: 🔍
+colorFrom: blue
+colorTo: purple
 sdk: docker
+app_port: 7860
 pinned: false
+tags:
+- building-mcp-track-enterprise
+short_description: MCP server for big code — explore Transformers
 ---
+# 🎓 Code Knowledge Graph MCP Server
+> **Helping LLM-based agents navigate and understand large codebases**
+## 📚 What is this project?
+This project provides a [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) server that turns code repositories into navigable **knowledge graphs**. It lets agents based on Large Language Models (LLMs) explore, understand, and reason about complex codebases efficiently, a critical capability for modern software engineering education and practice.
+## 🔬 Use Case: EPITA Coding Courses
+This project was developed with **educational applications** in mind, specifically to support **EPITA coding courses**.
+### 🔍 Enhanced Code Discovery for Agents
+LLM-based coding agents can use this tool to **better discover and navigate large repositories**. Instead of blindly searching through files, agents can:
+- Query the knowledge graph to understand the overall architecture
+- Follow relationships between modules, classes, and functions
+- Identify entry points and critical code paths
+- Understand how different parts of the codebase interact
+### 📈 Detecting Areas for Code Improvement
+For EPITA courses, this tool helps agents **identify areas where student code can be improved**:
+- **Dead Code Detection**: Find unused functions, classes, or variables
+- **Circular Dependencies**: Detect problematic import cycles between modules
+- **Code Coupling Analysis**: Identify tightly coupled components that should be refactored
+- **Missing Documentation**: Find undocumented public APIs and complex functions
+- **Complexity Hotspots**: Locate chunks with many outgoing calls (high coupling)
+- **Orphan Code**: Detect code that is declared but never called
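As a rough illustration of two of the checks listed above (circular dependencies and code that is declared but never called), here is a minimal sketch over plain dictionaries. The `IMPORTS` and `CALLS` data and both helper functions are hypothetical and written for this example only; they are not the project's implementation, which works on the knowledge graph itself.

```python
from collections import defaultdict

# Hypothetical import edges: module -> modules it imports.
IMPORTS = {
    "app": ["models", "utils"],
    "models": ["db"],
    "db": ["models"],  # db and models import each other
    "utils": [],
}

# Hypothetical call edges: function -> functions it calls.
CALLS = {
    "main": ["load", "render"],
    "load": ["parse"],
    "parse": [],
    "render": [],
    "legacy_export": [],  # declared but never called
}


def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every node starts WHITE (unvisited)
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in graph.get(node, []):
            if color[nxt] == GREY:
                # Back edge: the cycle is the stack suffix starting at nxt.
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for start in graph:
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


def never_called(calls, entry_points):
    """Return declared functions that nothing calls, excluding entry points."""
    called = {callee for callees in calls.values() for callee in callees}
    return sorted(set(calls) - called - set(entry_points))


cycle = find_cycle(IMPORTS)
orphans = never_called(CALLS, entry_points={"main"})
print("Import cycle:", cycle)
print("Never called:", orphans)
```

An agent would obtain the same kind of edges from the graph tools rather than from hard-coded dictionaries; the point here is only that both checks reduce to simple graph traversals.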
+### 🎓 EPITA Course Integration
+- **Project Reviews**: Quickly understand student project architectures before grading
+- **Automated Feedback**: Integrate with LLM tutors to provide targeted improvement suggestions
+- **Code Quality Assessment**: Consistent evaluation criteria across student submissions
+- **Learning Tool**: Help students navigate and understand unfamiliar codebases (e.g., open-source projects)
+- **Research**: Study code organization patterns across student projects
+The MCP interface makes it easy to integrate with any LLM-based tutoring or code review system used in EPITA courses.
+---
+### 🎯 The Problem We Solve
+At **EPITA** (École pour l'informatique et les techniques avancées), students work on increasingly complex software projects throughout their curriculum. Understanding large codebases — whether their own, their teammates', or open-source libraries — is a fundamental skill for any computer science engineer.
+However, LLM-based coding assistants face significant challenges when working with large repositories:
+- **Context window limitations**: LLMs cannot process entire codebases at once
+- **Lack of structural awareness**: Without understanding how code is organized, LLMs struggle to locate relevant files
+- **Missing relationships**: Function calls, class inheritance, and module dependencies are not immediately visible
+- **Inefficient search**: Simple keyword search fails to capture semantic meaning
+### 💡 Our Solution: Knowledge Graphs + MCP
+This project addresses these challenges by:
+1. **Parsing repositories** into a structured knowledge graph (files → chunks → entities)
+2. **Extracting relationships** between code elements (calls, contains, declares, imports)
+3. **Indexing content** with hybrid search (semantic embeddings + keyword matching)
+4. **Exposing tools via MCP** that allow LLM agents to navigate the codebase intelligently
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                     CODE REPOSITORY                              │
+│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
+│  │  File A  │  │  File B  │  │  File C  │  │  File D  │   ...   │
+│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
+└───────┼─────────────┼─────────────┼─────────────┼───────────────┘
+        ▼             ▼             ▼             ▼
+┌─────────────────────────────────────────────────────────────────┐
+│               KNOWLEDGE GRAPH CONSTRUCTION                       │
+│  • AST Parsing (Python, C/C++, Java, JavaScript, Rust, HTML)    │
+│  • Entity Extraction (classes, functions, variables, methods)   │
+│  • Relationship Detection (calls, inheritance, imports)         │
+│  • Code Chunking & Embedding (semantic vectors)                 │
+└───────────────────────────────┬─────────────────────────────────┘
+                                ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                    MCP SERVER (Gradio)                           │
+│  ┌─────────────┐ ┌────────────┐ ┌─────────────┐ ┌─────────────┐ │
+│  │search_nodes │ │go_to_def   │ │find_usages  │ │get_neighbors│ │
+│  └─────────────┘ └────────────┘ └─────────────┘ └─────────────┘ │
+│  ┌─────────────┐ ┌────────────┐ ┌─────────────┐ ┌─────────────┐ │
+│  │get_file_    │ │get_related │ │find_path    │ │print_tree   │ │
+│  │structure    │ │_chunks     │ │             │ │             │ │
+│  └─────────────┘ └────────────┘ └─────────────┘ └─────────────┘ │
+└───────────────────────────────┬─────────────────────────────────┘
+                                ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                    LLM-BASED AGENT                               │
+│  • Can search for relevant code using natural language          │
+│  • Navigate from function calls to their definitions            │
+│  • Understand the structure of files and directories            │
+│  • Trace dependencies and relationships across the codebase     │
+└─────────────────────────────────────────────────────────────────┘
+```
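To make the idea of hybrid search (semantic embeddings + keyword matching) more concrete, the sketch below blends a cosine-similarity score with a keyword-overlap score. Everything in it is an assumption made for illustration: the toy 3-dimensional vectors, the `alpha` weighting, and the function names. The project itself delegates indexing to a vector database and a code embedding model, whose scoring may differ.

```python
import math
import re

# Toy corpus: chunk id -> (text, hypothetical 3-d embedding).
CHUNKS = {
    "tokenizer.py#0": ("def encode(text): split text into token ids", [0.9, 0.1, 0.0]),
    "model.py#0": ("class Attention: scaled dot product attention", [0.1, 0.9, 0.2]),
    "utils.py#0": ("def load_config(path): read json config", [0.0, 0.2, 0.9]),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query words that also appear in the chunk text."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def hybrid_search(query, query_vec, alpha=0.5, top_k=2):
    """Rank chunks by alpha * semantic score + (1 - alpha) * keyword score."""
    scored = []
    for chunk_id, (text, vec) in CHUNKS.items():
        score = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text)
        scored.append((chunk_id, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# The query vector would normally come from the embedding model.
results = hybrid_search("attention class", query_vec=[0.2, 0.8, 0.1])
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

Blending the two signals lets an exact identifier match rescue a query whose embedding is only loosely related, and vice versa.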
+## 🛠️ MCP Tools Available
+The MCP server exposes the following tools for LLM agents:
+| Tool                      | Description                                               |
+| ------------------------- | --------------------------------------------------------- |
+| `search_nodes`            | Semantic + keyword search for code chunks                 |
+| `get_node_info`           | Detailed information about any node (file, chunk, entity) |
+| `get_node_edges`          | Incoming and outgoing relationships of a node             |
+| `go_to_definition`        | Find where a function/class/variable is declared          |
+| `find_usages`             | Find all places where an entity is called/used            |
+| `get_neighbors`           | Get all directly connected nodes                          |
+| `get_file_structure`      | Overview of a file's chunks and entities                  |
+| `get_related_chunks`      | Find chunks related by a specific relationship type       |
+| `list_all_entities`       | List all tracked entities in the codebase                 |
+| `get_graph_stats`         | Statistics about the knowledge graph                      |
+| `find_path`               | Find shortest path between two nodes                      |
+| `get_subgraph`            | Extract a subgraph around a node                          |
+| `print_tree`              | Display repository structure as a tree                    |
+| `diff_chunks`             | Compare content between two code chunks                   |
+| `search_by_type_and_name` | Search entities by type (class, function, etc.) and name  |
+| `get_chunk_context`       | Get a chunk with its surrounding context                  |
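The navigation tools in the table can be pictured as queries over typed edges. The following sketch mimics `go_to_definition`, `find_usages`, and `find_path` on a tiny hand-written edge list. The node identifiers, the `(source, relation, target)` representation, and the choice to treat edges as undirected when searching for a path are assumptions for this example, not a description of the server's internals.

```python
from collections import deque

# Hypothetical edges as (source, relation, target) triples.
EDGES = [
    ("file:app.py", "contains", "chunk:app.py#0"),
    ("chunk:app.py#0", "calls", "entity:load_model"),
    ("chunk:model.py#0", "declares", "entity:load_model"),
    ("file:model.py", "contains", "chunk:model.py#0"),
]


def go_to_definition(entity):
    """Chunks that declare the entity."""
    return [src for src, rel, dst in EDGES if rel == "declares" and dst == entity]


def find_usages(entity):
    """Chunks that call the entity."""
    return [src for src, rel, dst in EDGES if rel == "calls" and dst == entity]


def find_path(start, goal):
    """Breadth-first shortest path, treating edges as undirected."""
    neighbors = {}
    for src, _, dst in EDGES:
        neighbors.setdefault(src, []).append(dst)
        neighbors.setdefault(dst, []).append(src)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


definitions = go_to_definition("entity:load_model")
usages = find_usages("entity:load_model")
path = find_path("file:app.py", "file:model.py")
print(definitions, usages, path, sep="\n")
```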
+## 🌐 Supported Languages
+The knowledge graph builder uses **AST-based entity extraction** for accurate parsing:
+| Language              | Parser          | Entity Types                                    |
+| --------------------- | --------------- | ----------------------------------------------- |
+| Python                | `ast` module    | classes, functions, methods, variables, imports |
+| C                     | `libclang`      | functions, structs, typedefs, variables         |
+| C++                   | `libclang`      | classes, namespaces, methods, templates         |
+| Java                  | `javalang`      | classes, interfaces, methods, fields            |
+| JavaScript/TypeScript | `esprima`       | classes, functions, variables, imports          |
+| Rust                  | `tree-sitter`   | structs, enums, traits, functions, modules      |
+| HTML                  | `BeautifulSoup` | DOM elements, inline JS extraction              |
+The system also detects **API endpoints** for web frameworks (FastAPI, Flask, Spring Boot, Actix-web, etc.).
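For the Python row of the table, AST-based extraction can be sketched with the standard-library `ast` module alone. The `extract_entities` helper and its output format below are hypothetical and much simpler than a real extractor (for instance, methods are not distinguished from functions); they only show the kind of information a syntax tree exposes.

```python
import ast

SOURCE = '''
import os

class Greeter:
    def greet(self, name):
        return format_name(name)

def format_name(name):
    return name.title()
'''


def extract_entities(source):
    """Collect declared classes and functions, imports, and called names."""
    tree = ast.parse(source)
    found = {"classes": [], "functions": [], "imports": [], "calls": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            found["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found["functions"].append(node.name)
        elif isinstance(node, ast.Import):
            found["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            found["imports"].append(node.module or "")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                found["calls"].append(func.id)      # plain call: f(x)
            elif isinstance(func, ast.Attribute):
                found["calls"].append(func.attr)    # attribute call: obj.f(x)
    return found


entities = extract_entities(SOURCE)
for kind, names in entities.items():
    print(kind, sorted(names))
```

Declarations become "declares" edges and call sites become "calls" edges once the names are resolved against entities found in other files.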
+## 🚀 Getting Started
+### Prerequisites
+- Docker & Docker Compose
+- Python 3.10+ (for local development)
+- CUDA-capable GPU (optional, for faster embeddings)
+### Quick Start with Docker
+```bash
+# Start the MCP server with a sample knowledge graph
+docker-compose up
+```
+### Building a Knowledge Graph from Your Repository
+```python
+from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph
+# From a local path
+kg = RepoKnowledgeGraph.from_path(
+    "/path/to/your/repo",
+    skip_dirs=["node_modules", ".git", "__pycache__"],
+    extract_entities=True,
+    index_nodes=True
+)
+# Save for later use
+kg.save_graph_to_file("my_knowledge_graph.json")
+```
+### Running the MCP Server with Gradio
+```bash
+python gradio_mcp.py --graph-file my_knowledge_graph.json --host 0.0.0.0 --port 7860
+```
+## 📊 Interactive Explorer (Gradio UI)
+The project includes a Gradio-based web interface for exploring knowledge graphs interactively:
+- **Search**: Use natural language or keywords to find relevant code
+- **Navigate**: Click through nodes to explore relationships
+- **Analyze**: Get statistics about code structure and dependencies
+- **Visualize**: View the repository tree and entity relationships
+## 📁 Data Sources
+The application supports loading knowledge graphs from multiple sources:
+### 1. HuggingFace Hub Dataset (Recommended for Sharing)
+Load directly from a HuggingFace dataset created by the library (see "Publishing to HuggingFace Hub" below):
+```bash
+python gradio_mcp.py --host 0.0.0.0 --port 7860 --hf-dataset "username/dataset-name"
+```
+### 2. Local JSON File
+Use a local JSON file (e.g., `multihop_knowledge_graph_with_embeddings.json`):
+```bash
+python gradio_mcp.py --host 0.0.0.0 --port 7860 --graph-file data/multihop_knowledge_graph_with_embeddings.json
+```
+### 3. Direct from Git Repository
+Clone and analyze a repository on the fly:
+```bash
+python gradio_mcp.py --host 0.0.0.0 --port 7860 --repo-url "https://github.com/user/repo.git"
+```
+### Publishing to HuggingFace Hub
+You can save an existing knowledge graph to HuggingFace Hub for sharing:
+```python
+from RepoKnowledgeGraphLib.RepoKnowledgeGraph import RepoKnowledgeGraph
+# Load from local file
+kg = RepoKnowledgeGraph.load("path/to/graph.json")
+# Push to HuggingFace Hub (without embeddings to reduce size)
+kg.to_hf_dataset("username/my-knowledge-graph", save_embeddings=False, private=False)
+# Or with embeddings (larger dataset)
+kg.to_hf_dataset("username/my-knowledge-graph-with-embeddings", save_embeddings=True)
+```
+## 🏗️ Architecture Overview
+```
+root/
+├── Dockerfile                     # Docker configuration
+├── requirements.txt               # Python dependencies
+├── RepoKnowledgeGraphLib/         # Knowledge graph implementation
+│   ├── RepoKnowledgeGraph.py      # Main graph class
+│   ├── KnowledgeGraphMCPServer.py # MCP server implementation
+│   ├── EntityExtractor.py         # AST-based entity extraction
+│   ├── CodeParser.py              # Code chunking
+│   ├── CodeIndex.py               # Hybrid search (LanceDB/Weaviate)
+│   ├── ModelService.py            # Embedding generation
+│   └── Node.py                    # Graph node types
+└── gradio_mcp_space.py            # Main Gradio web interface
+```
+## 👥 Team
+**Team Name:** CEPIA Ionis Team
+**Team Members:**
+- **Laila ELKOUSSY** - [@lailaelkoussy](https://huggingface.co/lailaelkoussy) - Research Engineer, Data Scientist
+- **Julien PEREZ** - [@jnm38](https://huggingface.co/jnm38) - Research Director
+---
+## 📄 License
+This project is developed as part of research at EPITA / Ionis Group.
+## 🔗 Related Resources
+- [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) - The protocol standard
+- [Gradio](https://gradio.app/) - Python web interface framework with MCP support
+- [LanceDB](https://lancedb.github.io/lancedb/) - Vector database for code indexing
+- [Salesforce SFR-Embedding-Code](https://huggingface.co/Salesforce/SFR-Embedding-Code-400M_R) - Code embedding model
+## 🆚 VS Code Integration
+To use this MCP server with **GitHub Copilot** in VS Code, you need to configure an `mcp.json` file.
+### Configuration File Location
+Create or edit the file at `.vscode/mcp.json` in your workspace root:
+```
+your-workspace/
+├── .vscode/
+│   └── mcp.json    ← Place the configuration here
+├── src/
+└── ...
+```
+### Configuration Content
+Add the following content to `.vscode/mcp.json`:
+```jsonc
+{
+    "servers": {
+        "transformers-code-graph": {
+            "url": "https://lailaelkoussy-transformers-library-knowledge-graph.hf.space/gradio_api/mcp/",
+            "type": "http"
+        }
+    },
+    "inputs": []
+}
+```
+### What This Does
+- **`servers`**: Defines the MCP servers available to VS Code
+- **`transformers-code-graph`**: A custom name for this server connection
+- **`url`**: The endpoint of the hosted MCP server (here pointing to the HuggingFace Space)
+- **`type`**: Set to `"http"` for remote HTTP-based MCP servers
+### Using with Your Own Server
+If you're running your own MCP server locally, update the URL accordingly:
+```jsonc
+{
+    "servers": {
+        "my-code-graph": {
+            "url": "http://localhost:7860/gradio_api/mcp/",
+            "type": "http"
+        }
+    },
+    "inputs": []
+}
+```
+Once configured, GitHub Copilot in VS Code will have access to all the knowledge graph tools (`search_nodes`, `go_to_definition`, `find_usages`, etc.) to help navigate and understand your codebase.