lailaelkoussy commited on
Commit
a0430cd
Β·
verified Β·
1 Parent(s): 91bf311

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +307 -5
README.md CHANGED
@@ -1,10 +1,312 @@
1
  ---
2
- title: EPITA Codevoyager Transformers Library
3
- emoji: πŸŒ–
4
- colorFrom: red
5
- colorTo: indigo
6
  sdk: docker
7
  pinned: false
 
 
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ title: EPITA CodeVoyager on πŸ€— Transformer Library
3
+ emoji: πŸš€
4
+ colorFrom: pink
5
+ colorTo: red
6
  sdk: docker
7
  pinned: false
8
+ tags:
9
+ - mcp-in-action-track-enterprise
10
  ---
11
 
12
+ # πŸš€ EPITA CodeVoyager
13
+
14
+ > **A conversational AI agent that helps you navigate and understand large codebases through natural language**
15
+
16
+ ## πŸ“š What is EPITA CodeVoyager?
17
+
18
+ **EPITA CodeVoyager** is an interactive **chat agent** powered by [Smolagents](https://github.com/huggingface/smolagents) that connects to the [**EPITA Codebase Knowledge Graph MCP Server**](https://huggingface.co/spaces/lailaelkoussy/transformers-library-knowledge-graph). It enables users to ask natural language questions about a codebase and receive accurate, grounded answers based on the actual code β€” not hallucinations.
19
+
20
+ ### How It Works
21
+
22
+ Traditional LLMs generate answers from their training data, which can lead to outdated or fabricated information about specific codebases. **EPITA CodeVoyager** takes a different approach:
23
+
24
+ 1. **Tool-Augmented Reasoning**: Instead of guessing, the agent uses MCP (Model Context Protocol) tools to actively query the knowledge graph β€” searching for code, navigating relationships, and retrieving actual implementations.
25
+
26
+ 2. **Grounded Responses**: Every answer is backed by real code snippets, file paths, and structural information extracted directly from the repository.
27
+
28
+ 3. **Multi-Step Exploration**: Complex questions trigger chains of tool calls. For example, understanding how a class works might require: searching for its definition β†’ examining its methods β†’ tracing its inheritance hierarchy β†’ finding usage examples.
29
+
30
+ 4. **Streaming Transparency**: Users see the agent's reasoning process in real-time β€” which tools are called, what information is retrieved, and how the final answer is synthesized.
31
+
32
+ ### πŸ€— Showcase: Hugging Face Transformers Library
33
+
34
+ We demonstrate EPITA CodeVoyager on the [**Hugging Face Transformers**](https://github.com/huggingface/transformers) library β€” one of the most popular open-source ML libraries with:
35
+ - **4,000+ files** across multiple modules
36
+ - **400,000+ lines of code**
37
+ - **Hundreds of model implementations** (BERT, GPT, LLaMA, etc.)
38
+ - **Complex inheritance hierarchies** and cross-file dependencies
39
+
40
+ This showcase demonstrates how the agent can help users understand and navigate even the most complex codebases through simple conversational queries.
41
+
42
+ ### 🎯 Why This Matters for Education
43
+
44
+ Understanding large codebases is a **fundamental skill** for software engineers. At **EPITA** (Γ‰cole pour l'informatique et les techniques avancΓ©es), students work on increasingly complex projects and need to understand codebases β€” whether their own, their teammates', or open-source libraries.
45
+
46
+ LLM-based coding assistants face significant challenges with large repositories: context window limitations, lack of structural awareness, missing relationships, and inefficient search. **EPITA CodeVoyager** solves these problems by using MCP tools to **search**, **navigate**, and **understand** code repositories intelligently, making it an ideal assistant for developers, students, and educators exploring complex codebases.
47
+
48
+ ## πŸ”¬ Use Case: EPITA Coding Courses
49
+
50
+ **EPITA CodeVoyager** was developed with **educational applications** in mind, specifically to support **EPITA coding courses**.
51
+
52
+ ### 🎯 The Educational Challenge
53
+
54
+ At **EPITA**, students work on increasingly complex software projects throughout their curriculum. Understanding large codebases β€” whether their own, their teammates', or open-source libraries like Transformers β€” is a fundamental skill for any computer science engineer.
55
+
56
+ However, navigating a library with **thousands of files** is overwhelming. Students often:
57
+ - Struggle to find where specific functionality is implemented
58
+ - Don't understand how different components connect
59
+ - Spend hours reading code without grasping the big picture
60
+ - Miss important design patterns and architectural decisions
61
+
62
+ ### πŸ’‘ How EPITA CodeVoyager Helps
63
+
64
+ **EPITA CodeVoyager** addresses these challenges by enabling students to **ask questions in natural language**:
65
+
66
+ ### πŸ” Intelligent Code Q&A
67
+
68
+ Instead of manually searching through thousands of files, users can simply ask:
69
+
70
+ - *"How does the `AutoModel` class work?"*
71
+ - *"What classes inherit from `PreTrainedModel`?"*
72
+ - *"How is tokenization implemented in the library?"*
73
+ - *"What files are involved in the BERT implementation?"*
74
+
75
+ The agent uses MCP tools from the **EPITA Codebase Knowledge Graph MCP Server** to explore the codebase, gather relevant information, and provide accurate, well-structured answers grounded in the actual code.
76
+
77
+ ### πŸ“ˆ Learning Through Exploration
78
+
79
+ For EPITA courses and code learning in general, EPITA CodeVoyager helps users:
80
+
81
+ - **Understand Architecture**: Ask about how components are organized and connected
82
+ - **Trace Code Flow**: Follow function calls and understand execution paths
83
+ - **Learn Design Patterns**: Discover implementation patterns used in real-world libraries
84
+ - **Prepare for Code Reviews**: Understand unfamiliar code before reviewing or contributing
85
+
86
+ ### πŸŽ“ EPITA Course Integration
87
+
88
+ - **Interactive Learning**: Students can explore open-source libraries conversationally
89
+ - **Office Hours Support**: Integrate with tutoring systems to answer code-related questions
90
+ - **Project Onboarding**: Help students understand project codebases quickly
91
+ - **Self-Paced Study**: Enable students to learn complex libraries at their own pace
92
+
93
+ ### πŸŽ“ Broader Applications
94
+
95
+ Beyond the Transformers library showcase, **EPITA CodeVoyager** (backed by the EPITA Codebase Knowledge Graph MCP Server) can be applied to any codebase:
96
+
97
+ - **Student Projects**: Help students understand their teammates' code during group projects
98
+ - **Open Source Onboarding**: Quickly learn how popular libraries are structured
99
+ - **Code Reviews**: Understand unfamiliar code before reviewing or contributing
100
+ - **Research**: Analyze code patterns across different repositories
101
+ - **Industry**: Onboard new developers to large enterprise codebases
102
+
103
+ ---
104
+
105
+ ## πŸ—οΈ Architecture
106
+
107
+ ```
108
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
109
+ β”‚ USER (Gradio UI) β”‚
110
+ β”‚ β”‚
111
+ β”‚ "How does BertModel's forward method work?" β”‚
112
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
113
+ β”‚
114
+ β–Ό
115
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
116
+ β”‚ EPITA CODEVOYAGER β”‚
117
+ β”‚ β”‚
118
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
119
+ β”‚ β”‚ ToolCallingAgent β”‚ β”‚
120
+ β”‚ β”‚ β€’ Receives natural language question β”‚ β”‚
121
+ β”‚ β”‚ β€’ Decides which MCP tools to call β”‚ β”‚
122
+ β”‚ β”‚ β€’ Chains multiple tool calls if needed β”‚ β”‚
123
+ β”‚ β”‚ β€’ Synthesizes final answer from tool results β”‚ β”‚
124
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
125
+ β”‚ β”‚
126
+ β”‚ LLM Backend: Any OpenAI-compatible API β”‚
127
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
128
+ β”‚ β”‚ Supports any HTTP REST service with OpenAI-style interface β”‚ β”‚
129
+ β”‚ β”‚ (OpenAI, Azure, HuggingFace Inference, vLLM, Ollama, etc.) β”‚ β”‚
130
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
131
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
132
+ β”‚ MCP Protocol (HTTP)
133
+ β–Ό
134
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
135
+ β”‚ EPITA CODEBASE KNOWLEDGE GRAPH MCP SERVER β”‚
136
+ β”‚ β”‚
137
+ β”‚ Tools Used by Agent: β”‚
138
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
139
+ β”‚ β”‚search_nodes β”‚ β”‚go_to_def β”‚ β”‚find_usages β”‚ β”‚
140
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
141
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” ��──────────────┐ β”‚
142
+ β”‚ β”‚get_node_infoβ”‚ β”‚get_file_ β”‚ β”‚get_neighbors β”‚ β”‚
143
+ β”‚ β”‚ β”‚ β”‚structure β”‚ β”‚ β”‚ β”‚
144
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
145
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
146
+ β”‚
147
+ β–Ό
148
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
149
+ β”‚ KNOWLEDGE GRAPH β”‚
150
+ β”‚ (Hugging Face Transformers Library) β”‚
151
+ β”‚ β”‚
152
+ β”‚ β€’ 4,000+ files indexed β”‚
153
+ β”‚ β€’ 400k+ lines of code β”‚
154
+ β”‚ β€’ Functions, classes, relationships extracted β”‚
155
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
156
+ ```
157
+
158
+ ## πŸ› οΈ Features
159
+
160
+ ### Multi-Provider LLM Support
161
+
162
+ The agent supports multiple LLM backends:
163
+
164
+ | Provider | Models | Configuration |
165
+ |----------|--------|---------------|
166
+ | **OpenAI** | gpt-5, gpt-5o, gpt-4o-mini, etc. | API key + base URL |
167
+ | **Azure OpenAI** | gpt-5, gpt-4, gpt-4o (deployed) | API key + endpoint + version |
168
+ | **HuggingFace Inference** | Qwen2.5-Coder-32B, Llama-3.1, etc. | HF token + optional provider |
169
+
170
+ ### Streaming Responses
171
+
172
+ The agent streams responses in real-time, showing:
173
+ - 🧠 **Model Thinking**: Internal reasoning displayed in collapsible sections
174
+ - πŸ”§ **Tool Calls**: Which MCP tools are being invoked
175
+ - πŸ’¬ **Final Answer**: Synthesized response based on code exploration
176
+
177
+ ### Configurable Reasoning Steps
178
+
179
+ Control how deeply the agent explores:
180
+ - **Max Steps**: Limit the number of tool calls per query (default: 5)
181
+ - Lower values = faster responses, higher values = more thorough exploration
182
+
183
+ ---
184
+
185
+ ## πŸš€ Getting Started
186
+
187
+
188
+ #### Configure the LLM in the Web UI
189
+
190
+ Once launched, open the Gradio interface and configure your LLM provider:
191
+
192
+ **For OpenAI:**
193
+ - Model Type: `openai`
194
+ - Model Name: `gpt-4o-mini` (or `gpt-4o`, `gpt-4-turbo`)
195
+ - API Key: Your OpenAI API key
196
+ - Base URL: `https://api.openai.com/v1`
197
+
198
+ **For Azure OpenAI:**
199
+ - Model Type: `azure`
200
+ - Model Name: Your deployment name
201
+ - Azure API Key: Your Azure API key
202
+ - Azure Endpoint: `https://your-resource.openai.azure.com`
203
+ - API Version: `2024-02-15-preview`
204
+
205
+ **For HuggingFace Inference:**
206
+ - Model Type: `hf_inference`
207
+ - Model Name: `Qwen/Qwen2.5-Coder-32B-Instruct`
208
+ - HuggingFace Token: Your HF API token
209
+ - Provider (optional): `together`, `fireworks-ai`, `cerebras`
210
+
211
+ ---
212
+
213
+
214
+ ## πŸ’‘ Example Interactions
215
+
216
+ ### Understanding a Class
217
+
218
+ **User:** *"How does the AutoModel class work?"*
219
+
220
+ **Agent:**
221
+ 1. Calls `search_nodes("AutoModel")`
222
+ 2. Calls `get_node_info("src/transformers/models/auto/auto_factory.py_3")`
223
+ 3. Calls `get_file_structure("src/transformers/models/auto/auto_factory.py")`
224
+ 4. Synthesizes response explaining the auto-class factory pattern
225
+
226
+ ### Tracing Dependencies
227
+
228
+ **User:** *"What classes inherit from PreTrainedModel?"*
229
+
230
+ **Agent:**
231
+ 1. Calls `go_to_definition("PreTrainedModel")`
232
+ 2. Calls `find_usages("PreTrainedModel")`
233
+ 3. Returns list of model classes with inheritance relationships
234
+
235
+ ### Exploring Implementation
236
+
237
+ **User:** *"How does tokenization work in the library?"*
238
+
239
+ **Agent:**
240
+ 1. Calls `search_nodes("tokenization")`
241
+ 2. Calls `get_neighbors("src/transformers/tokenization_utils_base.py")`
242
+ 3. Calls `get_file_structure("src/transformers/tokenization_utils.py")`
243
+ 4. Explains the tokenizer hierarchy and key methods
244
+
245
+ ---
246
+
247
+ ## πŸ”§ Agent Internals
248
+
249
+ ### KnowledgeGraphChatAgent Class
250
+
251
+ The main agent class (powering EPITA CodeVoyager) handles:
252
+
253
+ ```python
254
+ class KnowledgeGraphChatAgent:
255
+ def __init__(self, mcp_server_url: str):
256
+ # Connect to MCP server and load tools
257
+ self._initialize_mcp_tools()
258
+
259
+ def _initialize_model(self, model_type, api_key, ...):
260
+ # Configure OpenAI, Azure, or HF Inference backend
261
+
262
+ def _initialize_agent(self, max_steps):
263
+ # Create ToolCallingAgent with MCP tools
264
+
265
+ def chat(self, message, history):
266
+ # Stream responses using stream_to_gradio
267
+ ```
268
+
269
+ ### Custom Instructions
270
+
271
+ EPITA CodeVoyager is configured with domain-specific instructions for the Transformers library:
272
+
273
+ ```python
274
+ CUSTOM_INSTRUCTIONS = """You are an expert assistant for understanding the Hugging Face Transformers library.
275
+
276
+ Your role is to help users understand the Transformers codebase by exploring the repository using the available tools. You can:
277
+ - Search for functions, classes, and methods in the codebase
278
+ - Navigate the file structure and understand code organization
279
+ - Find relationships between different components
280
+ - Trace how code flows through the library
281
+ - Explain implementation details and design patterns
282
+
283
+ When answering questions:
284
+ 1. Use the available tools to explore the repository and gather accurate information
285
+ 2. Provide clear, well-structured explanations based on the actual code
286
+ 3. Reference specific files, functions, or classes when relevant
287
+ 4. If you're unsure about something, search the codebase to verify before answering
288
+
289
+ Always base your answers on the actual code in the repository, not assumptions."""
290
+ ```
291
+
292
+ ---
293
+
294
+ ## πŸ‘₯ Team
295
+
296
+ **Team Name:** CEPIA Ionis Team
297
+
298
+ **Team Members:**
299
+ - **Laila ELKOUSSY** - [@lailaelkoussy](https://huggingface.co/lailaelkoussy) - Research Engineer, Data Scientist
300
+ - **Julien PEREZ** - [@jnm38](https://huggingface.co/jnm38) - Research Director
301
+
302
+ ---
303
+
304
+ ## πŸ“„ License
305
+
306
+ This project is developed as part of research at EPITA / Ionis Group.
307
+
308
+ ## πŸ”— Related Resources
309
+
310
+ - [Smolagents](https://github.com/huggingface/smolagents) - The agent framework used
311
+ - [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) - The protocol standard
312
+ - [Gradio](https://gradio.app/) - Web interface framework