---
library_name: transformers
tags:
- code-generation
- concept-embedding
- jepa
- pytorch
- gguf
license: apache-2.0
---

# Concept-First Code Generation

**Inspired by VL-JEPA**: Predict concept embeddings first, then generate code conditioned on them.

## The Idea

Traditional autoregressive models predict tokens one at a time, which can cause the output to lose coherence or hallucinate APIs. The **Concept-First** approach addresses this by:

1. **Concept Encoder**: Encoding code snippets into semantic embeddings.
2. **Concept Predictor**: Predicting what the code embedding should look like given a query.
3. **Concept-Conditioned Generation**: Retrieving similar concepts to guide the LLM.

```mermaid
graph LR
    A[Query] --> B(Concept Predictor)
    B --> C{Concept Space}
    C --> D[Retrieve Similar Code]
    D --> E[Conditioned Generation]
```

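
The retrieval and conditioning steps in the flow above can be sketched without any model at all. The example below is a minimal illustration, assuming the concept bank is a list of `(snippet, embedding)` pairs and that similarity is measured with cosine similarity. The function names `retrieve_similar` and `build_prompt`, the toy three-dimensional vectors, and the prompt layout are all hypothetical and are not taken from this repo's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(predicted, concept_bank, top_k=2):
    """Return the top_k (snippet, score) pairs closest to the predicted concept."""
    scored = [(snippet, cosine_similarity(predicted, emb)) for snippet, emb in concept_bank]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def build_prompt(query, retrieved):
    """Assemble a prompt that shows the retrieved snippets before the task (assumed layout)."""
    examples = "\n\n".join(snippet for snippet, _ in retrieved)
    return f"Related code:\n{examples}\n\nTask: {query}\n"


# Toy concept bank; real embeddings would come from the concept encoder.
toy_bank = [
    ("def add(a, b): return a + b", [1.0, 0.0, 0.0]),
    ("def read_file(path): return open(path).read()", [0.0, 1.0, 0.0]),
    ("def mul(a, b): return a * b", [0.9, 0.1, 0.0]),
]

# Stand-in for the concept predictor's output for the query below.
predicted = [1.0, 0.05, 0.0]

hits = retrieve_similar(predicted, toy_bank, top_k=2)
prompt = build_prompt("Write a function that sums two numbers", hits)
print(prompt)
```

In a real pipeline, the resulting prompt would be passed to the code LLM, and the brute-force scan would likely be replaced by a batched matrix product or a vector index.
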
## Models Used (January 2026)

| Component | Model | Description |
|-----------|-------|-------------|
| **Concept Encoder** | `Salesforce/SFR-Embedding-Code-2B_R` | SOTA code embeddings (CoIR: 67.4) |
| **Text Encoder** | `Alibaba-NLP/gte-Qwen2-1.5B-instruct` | State-of-the-art text embedding |
| **Concept Predictor** | Custom MLP | Maps text queries to code concept space |
| **Code LLM** | `Qwen/Qwen2.5-Coder-32B-Instruct` | High-performance code generation |

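
The table describes the Concept Predictor only as a custom MLP that maps text queries into the code concept space. The sketch below shows what a network of that kind might look like in PyTorch. It is not the architecture stored in `concept_predictor.pt`: the class name `ConceptPredictor`, the two-layer layout, the GELU activation, the cosine loss, and the toy dimensions are all assumptions chosen for illustration, and random tensors stand in for real text and code embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptPredictor(nn.Module):
    """Hypothetical MLP mapping a text embedding to a unit-length code-concept embedding."""

    def __init__(self, text_dim, code_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, code_dim),
        )

    def forward(self, text_emb):
        # L2-normalize so that dot products in concept space equal cosine similarity.
        return F.normalize(self.net(text_emb), dim=-1)


def cosine_loss(pred, target):
    """Mean (1 - cosine similarity) between predicted and target embeddings."""
    return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()


torch.manual_seed(0)

# Toy sizes; real values would match the text and code encoders' output widths.
model = ConceptPredictor(text_dim=16, code_dim=12, hidden_dim=32)
text_emb = torch.randn(4, 16)                       # stand-in for text-encoder output
code_emb = F.normalize(torch.randn(4, 12), dim=-1)  # stand-in for concept-encoder output

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_before = cosine_loss(model(text_emb), code_emb).item()

# A few optimization steps that pull predictions toward the target concepts.
for _ in range(50):
    optimizer.zero_grad()
    loss = cosine_loss(model(text_emb), code_emb)
    loss.backward()
    optimizer.step()

loss_after = cosine_loss(model(text_emb), code_emb).item()
out = model(text_emb)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Normalizing the output is a design choice that keeps the predictor's output directly comparable to a bank of normalized code embeddings. Whether the released predictor does this is not stated in this card.
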
## Files in this Repo

- `concept_predictor.pt`: PyTorch weights for the concept predictor MLP.
- `concept_predictor.gguf`: GGUF format for edge deployment (llama.cpp/LM Studio).
- `concept_bank.pt`: Pre-computed embeddings for the concept retrieval bank.

## Usage

```python
import torch

# Load the concept predictor checkpoint.
# map_location="cpu" lets the file load on machines without a GPU.
checkpoint = torch.load("concept_predictor.pt", map_location="cpu")
# ... (See Colab notebook for full implementation)
```

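
The internal layout of `concept_predictor.pt` is not documented in this card, so it can help to list what a checkpoint contains before wiring it into a model. The helper below, `describe_checkpoint`, is a hypothetical utility written for this example. It is demonstrated on a small stand-in file created on the fly, whose keys (`state_dict`, `epoch`) are invented and should not be read as the real checkpoint's structure.

```python
import os
import tempfile

import torch


def describe_checkpoint(obj, prefix=""):
    """Return one summary line per leaf found in a (possibly nested) loaded checkpoint."""
    lines = []
    if isinstance(obj, torch.Tensor):
        lines.append(f"{prefix}: tensor{tuple(obj.shape)} {obj.dtype}")
    elif isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            lines.extend(describe_checkpoint(value, child))
    else:
        lines.append(f"{prefix}: {type(obj).__name__}")
    return lines


# Stand-in checkpoint with made-up keys, saved to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_predictor.pt")
    torch.save({"state_dict": {"net.0.weight": torch.zeros(3, 2)}, "epoch": 1}, path)

    # weights_only=True restricts unpickling to tensors and plain containers.
    loaded = torch.load(path, map_location="cpu", weights_only=True)
    summary = describe_checkpoint(loaded)

for line in summary:
    print(line)
```

The same helper could be pointed at `concept_predictor.pt` or `concept_bank.pt` after loading them. If a file holds objects beyond tensors and plain containers, loading with `weights_only=True` may fail, and the file should only be loaded without that flag when its source is trusted.
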
## Datasets

Constructed from high-quality subsets of:

- **MBPP**
- **Evol-Instruct-Code**
- **Magicoder-OSS-Instruct**

## Credits

Created by **Core Subagent** (Colab Composer) for **Riley Seaburg**.