---
title: EPITA CodeVoyager on 🤗 Transformer Library
emoji: 🚀
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
tags:
  - mcp-in-action-track-enterprise
---

πŸš€ EPITA CodeVoyager

A conversational AI agent that helps you navigate and understand large codebases through natural language

πŸ“š What is EPITA CodeVoyager?

EPITA CodeVoyager is an interactive chat agent powered by Smolagents that connects to the EPITA Codebase Knowledge Graph MCP Server. It enables users to ask natural language questions about a codebase and receive accurate, grounded answers based on the actual code β€” not hallucinations.

How It Works

Traditional LLMs generate answers from their training data, which can lead to outdated or fabricated information about specific codebases. EPITA CodeVoyager takes a different approach:

1. **Tool-Augmented Reasoning**: Instead of guessing, the agent uses MCP (Model Context Protocol) tools to actively query the knowledge graph — searching for code, navigating relationships, and retrieving actual implementations.
2. **Grounded Responses**: Every answer is backed by real code snippets, file paths, and structural information extracted directly from the repository.
3. **Multi-Step Exploration**: Complex questions trigger chains of tool calls. For example, understanding how a class works might require searching for its definition, examining its methods, tracing its inheritance hierarchy, and finding usage examples.
4. **Streaming Transparency**: Users see the agent's reasoning process in real time — which tools are called, what information is retrieved, and how the final answer is synthesized.
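The four-step loop above can be sketched in a few lines of plain Python. This is an illustrative mock, not the actual agent code: `search_nodes` and `get_node_info` are stand-ins for the MCP tools, and `ask` stands in for the smolagents loop that lets the LLM pick the next tool call.

```python
# Illustrative mock of the tool-augmented loop (not the real agent code).
# These functions are hypothetical stand-ins for the MCP server's tools.

def search_nodes(query: str) -> list[str]:
    """Stand-in for the MCP search tool: return matching node ids."""
    return ["src/models/bert.py::BertModel"] if "BertModel" in query else []

def get_node_info(node_id: str) -> dict:
    """Stand-in for the MCP info tool: return a code snippet for a node."""
    return {"id": node_id, "snippet": "class BertModel(PreTrainedModel): ..."}

def ask(question: str, max_steps: int = 5) -> tuple[str, list]:
    """Chain tool calls (up to max_steps), then synthesize a grounded answer."""
    trace = []  # streamed to the UI so users can follow the reasoning
    for node_id in search_nodes(question)[:max_steps]:
        info = get_node_info(node_id)
        trace.append(("get_node_info", node_id))
        # A real agent would let the LLM decide the next tool call here.
        return f"Grounded in {info['id']}: {info['snippet']}", trace
    return "No matching code found.", trace

answer, trace = ask("How does BertModel work?")
```

The point of the `trace` list is step 4 above: every tool call is surfaced to the user rather than hidden inside the model.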

πŸ€— Showcase: Hugging Face Transformers Library

We demonstrate EPITA CodeVoyager on the Hugging Face Transformers library β€” one of the most popular open-source ML libraries with:

  • 4,000+ files across multiple modules
  • 400,000+ lines of code
  • Hundreds of model implementations (BERT, GPT, LLaMA, etc.)
  • Complex inheritance hierarchies and cross-file dependencies

This showcase demonstrates how the agent can help users understand and navigate even the most complex codebases through simple conversational queries.

## 🎯 Why This Matters for Education

Understanding large codebases is a fundamental skill for software engineers. At EPITA (École pour l'informatique et les techniques avancées), students work on increasingly complex projects and need to understand codebases — whether their own, their teammates', or open-source libraries.

LLM-based coding assistants face significant challenges with large repositories: context window limitations, lack of structural awareness, missing relationships, and inefficient search. EPITA CodeVoyager solves these problems by using MCP tools to search, navigate, and understand code repositories intelligently, making it an ideal assistant for developers, students, and educators exploring complex codebases.

## 🔬 Use Case: EPITA Coding Courses

EPITA CodeVoyager was developed with educational applications in mind, specifically to support EPITA coding courses.

### 🎯 The Educational Challenge

As EPITA students progress through the curriculum, their projects grow in complexity, and so does the code they must read: their own, their teammates', and open-source libraries like Transformers.

However, navigating a library with thousands of files is overwhelming. Students often:

- Struggle to find where specific functionality is implemented
- Don't understand how different components connect
- Spend hours reading code without grasping the big picture
- Miss important design patterns and architectural decisions

πŸ’‘ How EPITA CodeVoyager Helps

EPITA CodeVoyager addresses these challenges by enabling students to ask questions in natural language:

πŸ” Intelligent Code Q&A

Instead of manually searching through thousands of files, users can simply ask:

  • "How does the AutoModel class work?"
  • "What classes inherit from PreTrainedModel?"
  • "How is tokenization implemented in the library?"
  • "What files are involved in the BERT implementation?"

The agent uses MCP tools from the EPITA Codebase Knowledge Graph MCP Server to explore the codebase, gather relevant information, and provide accurate, well-structured answers grounded in the actual code.

πŸ“ˆ Learning Through Exploration

For EPITA courses and code learning in general, EPITA CodeVoyager helps users:

  • Understand Architecture: Ask about how components are organized and connected
  • Trace Code Flow: Follow function calls and understand execution paths
  • Learn Design Patterns: Discover implementation patterns used in real-world libraries
  • Prepare for Code Reviews: Understand unfamiliar code before reviewing or contributing

πŸŽ“ EPITA Course Integration

  • Interactive Learning: Students can explore open-source libraries conversationally
  • Office Hours Support: Integrate with tutoring systems to answer code-related questions
  • Project Onboarding: Help students understand project codebases quickly
  • Self-Paced Study: Enable students to learn complex libraries at their own pace

πŸŽ“ Broader Applications

Beyond the Transformers library showcase, EPITA CodeVoyager (backed by the EPITA Codebase Knowledge Graph MCP Server) can be applied to any codebase:

  • Student Projects: Help students understand their teammates' code during group projects
  • Open Source Onboarding: Quickly learn how popular libraries are structured
  • Code Reviews: Understand unfamiliar code before reviewing or contributing
  • Research: Analyze code patterns across different repositories
  • Industry: Onboard new developers to large enterprise codebases

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         USER (Gradio UI)                         β”‚
β”‚                                                                   β”‚
β”‚   "How does BertModel's forward method work?"                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    EPITA CODEVOYAGER                             β”‚
β”‚                                                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚                    ToolCallingAgent                          β”‚ β”‚
β”‚  β”‚  β€’ Receives natural language question                       β”‚ β”‚
β”‚  β”‚  β€’ Decides which MCP tools to call                          β”‚ β”‚
β”‚  β”‚  β€’ Chains multiple tool calls if needed                     β”‚ β”‚
β”‚  β”‚  β€’ Synthesizes final answer from tool results               β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                                                   β”‚
β”‚  LLM Backend: Any OpenAI-compatible API                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚  Supports any HTTP REST service with OpenAI-style interface β”‚ β”‚
β”‚  β”‚  (OpenAI, Azure, HuggingFace Inference, vLLM, Ollama, etc.) β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚ MCP Protocol (HTTP)
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         EPITA CODEBASE KNOWLEDGE GRAPH MCP SERVER                β”‚
β”‚                                                                   β”‚
β”‚  Tools Used by Agent:                                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”‚
β”‚  β”‚search_nodes β”‚ β”‚go_to_def   β”‚ β”‚find_usages   β”‚                β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”‚
β”‚  β”‚get_node_infoβ”‚ β”‚get_file_   β”‚ β”‚get_neighbors β”‚                β”‚
β”‚  β”‚             β”‚ β”‚structure   β”‚ β”‚              β”‚                β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    KNOWLEDGE GRAPH                               β”‚
β”‚           (Hugging Face Transformers Library)                    β”‚
β”‚                                                                   β”‚
β”‚  β€’ 4,000+ files indexed                                          β”‚
β”‚  β€’ 400k+ lines of code                                           β”‚
β”‚  β€’ Functions, classes, relationships extracted                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ› οΈ Features

Multi-Provider LLM Support

The agent supports multiple LLM backends:

Provider Models Configuration
OpenAI gpt-5, gpt-5o, gpt-4o-mini, etc. API key + base URL
Azure OpenAI gpt-5, gpt-4, gpt-4o (deployed) API key + endpoint + version
HuggingFace Inference Qwen2.5-Coder-32B, Llama-3.1, etc. HF token + optional provider

### Streaming Responses

The agent streams responses in real time, showing:

- 🧠 **Model Thinking**: Internal reasoning displayed in collapsible sections
- 🔧 **Tool Calls**: Which MCP tools are being invoked
- 💬 **Final Answer**: Synthesized response based on code exploration

### Configurable Reasoning Steps

Control how deeply the agent explores:

- **Max Steps**: Limits the number of tool calls per query (default: 5)
- Lower values give faster responses; higher values allow more thorough exploration

πŸš€ Getting Started

Configure the LLM in the Web UI

Once launched, open the Gradio interface and configure your LLM provider:

**For OpenAI:**

- Model Type: `openai`
- Model Name: `gpt-4o-mini` (or `gpt-4o`, `gpt-4-turbo`)
- API Key: Your OpenAI API key
- Base URL: `https://api.openai.com/v1`

**For Azure OpenAI:**

- Model Type: `azure`
- Model Name: Your deployment name
- Azure API Key: Your Azure API key
- Azure Endpoint: `https://your-resource.openai.azure.com`
- API Version: `2024-02-15-preview`

**For HuggingFace Inference:**

- Model Type: `hf_inference`
- Model Name: `Qwen/Qwen2.5-Coder-32B-Instruct`
- HuggingFace Token: Your HF API token
- Provider (optional): `together`, `fireworks-ai`, `cerebras`
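The three configurations differ only in which fields they require. A minimal sketch of that validation, using a hypothetical `build_model_config` helper (the real app passes these fields on to its LLM backend classes):

```python
def build_model_config(model_type: str, **kwargs) -> dict:
    """Map a UI model_type to the fields its backend requires (illustrative)."""
    required = {
        "openai": ["model_name", "api_key", "base_url"],
        "azure": ["model_name", "api_key", "endpoint", "api_version"],
        "hf_inference": ["model_name", "token"],  # "provider" stays optional
    }
    missing = [field for field in required[model_type] if field not in kwargs]
    if missing:
        raise ValueError(f"{model_type} config missing: {missing}")
    return {"model_type": model_type, **kwargs}

cfg = build_model_config(
    "openai",
    model_name="gpt-4o-mini",
    api_key="sk-...",  # placeholder, not a real key
    base_url="https://api.openai.com/v1",
)
```

Validating the fields up front gives a clear error in the UI instead of a failed API call later.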

πŸ’‘ Example Interactions

Understanding a Class

User: "How does the AutoModel class work?"

Agent:

  1. Calls search_nodes("AutoModel")
  2. Calls get_node_info("src/transformers/models/auto/auto_factory.py_3")
  3. Calls get_file_structure("src/transformers/models/auto/auto_factory.py")
  4. Synthesizes response explaining the auto-class factory pattern

### Tracing Dependencies

**User:** "What classes inherit from PreTrainedModel?"

**Agent:**

1. Calls `go_to_definition("PreTrainedModel")`
2. Calls `find_usages("PreTrainedModel")`
3. Returns a list of model classes with inheritance relationships

### Exploring Implementation

**User:** "How does tokenization work in the library?"

**Agent:**

1. Calls `search_nodes("tokenization")`
2. Calls `get_neighbors("src/transformers/tokenization_utils_base.py")`
3. Calls `get_file_structure("src/transformers/tokenization_utils.py")`
4. Explains the tokenizer hierarchy and key methods
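All three interactions walk the same underlying graph. A toy example shows the kind of traversal that a chain of `find_usages` calls performs; the graph contents and the `find_subclasses` helper are illustrative, not the server's real data model:

```python
# Toy knowledge graph: nodes are classes, edges are inheritance links.
# Illustrative only; the real graph is built from the Transformers repo.
graph = {
    "PreTrainedModel": ["BertModel", "GPT2Model", "LlamaModel"],
    "BertModel": ["BertForSequenceClassification"],
}

def find_subclasses(cls: str, depth: int = 2) -> list[str]:
    """Collect transitive subclasses, like chaining find_usages calls."""
    found = []
    frontier = [cls]
    for _ in range(depth):
        # Expand one inheritance level per step.
        frontier = [child for node in frontier for child in graph.get(node, [])]
        found.extend(frontier)
    return found

subclasses = find_subclasses("PreTrainedModel")
```

Each loop iteration corresponds to one tool call in the agent's chain, which is why deeper questions need a higher max-steps budget.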

πŸ”§ Agent Internals

KnowledgeGraphChatAgent Class

The main agent class (powering EPITA CodeVoyager) handles:

```python
class KnowledgeGraphChatAgent:
    def __init__(self, mcp_server_url: str):
        # Connect to the MCP server and load its tools
        self._initialize_mcp_tools()

    def _initialize_model(self, model_type, api_key, **provider_kwargs):
        # Configure the OpenAI, Azure, or HF Inference backend
        ...

    def _initialize_agent(self, max_steps):
        # Create a ToolCallingAgent with the MCP tools
        ...

    def chat(self, message, history):
        # Stream responses using stream_to_gradio
        ...
```

### Custom Instructions

EPITA CodeVoyager is configured with domain-specific instructions for the Transformers library:

```python
CUSTOM_INSTRUCTIONS = """You are an expert assistant for understanding the Hugging Face Transformers library.

Your role is to help users understand the Transformers codebase by exploring the repository using the available tools. You can:
- Search for functions, classes, and methods in the codebase
- Navigate the file structure and understand code organization
- Find relationships between different components
- Trace how code flows through the library
- Explain implementation details and design patterns

When answering questions:
1. Use the available tools to explore the repository and gather accurate information
2. Provide clear, well-structured explanations based on the actual code
3. Reference specific files, functions, or classes when relevant
4. If you're unsure about something, search the codebase to verify before answering

Always base your answers on the actual code in the repository, not assumptions."""
```

πŸ‘₯ Team

Team Name: CEPIA Ionis Team

Team Members:

  • Laila ELKOUSSY - @lailaelkoussy - Research Engineer, Data Scientist
  • Julien PEREZ - @jnm38 - Research Director

πŸ“„ License

This project is developed as part of research at EPITA / Ionis Group.

πŸ”— Related Resources