---
library_name: transformers
tags:
- code-generation
- concept-embedding
- jepa
- pytorch
- gguf
license: apache-2.0
---
# Concept-First Code Generation
**Inspired by VL-JEPA**: Predict concept embeddings first, then generate code conditioned on them.
## The Idea
Autoregressive models predict one token at a time, so they can lose coherence over long outputs or hallucinate APIs that do not exist. The **Concept-First** approach aims to reduce these failures in three stages:
1. **Concept Encoder**: Encoding code snippets into semantic embeddings.
2. **Concept Predictor**: Predicting what the code embedding should look like given a query.
3. **Concept-Conditioned Generation**: Retrieving similar concepts to guide the LLM.
```mermaid
graph LR
A[Query] --> B(Concept Predictor)
B --> C{Concept Space}
C --> D[Retrieve Similar Code]
D --> E[Conditioned Generation]
```
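The retrieval and conditioning steps in the diagram can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the repo's implementation: the toy three-dimensional embeddings, the `TOY_BANK` contents, and the helper names `cosine_similarity`, `retrieve_similar`, and `build_prompt` are all assumptions made for the example, and cosine similarity is assumed as the retrieval metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(predicted, concept_bank, top_k=2):
    """Return the top_k snippets whose embeddings are closest to the predicted concept."""
    scored = [(cosine_similarity(predicted, emb), snippet) for snippet, emb in concept_bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:top_k]]


def build_prompt(query, retrieved):
    """Prepend the retrieved snippets to the query so the code LLM can condition on them."""
    context = "\n\n".join(retrieved)
    return f"# Reference snippets:\n{context}\n\n# Task: {query}\n"


# Hypothetical concept bank: (code snippet, toy embedding) pairs.
TOY_BANK = [
    ("def add(a, b): return a + b", [1.0, 0.0, 0.0]),
    ("def read_file(path): return open(path).read()", [0.0, 1.0, 0.0]),
    ("def mul(a, b): return a * b", [0.9, 0.1, 0.0]),
]

# Stand-in for the Concept Predictor's output for the query below.
predicted_embedding = [1.0, 0.05, 0.0]
neighbours = retrieve_similar(predicted_embedding, TOY_BANK, top_k=2)
prompt = build_prompt("sum two numbers", neighbours)
print(prompt)
```

In a real run, the embeddings would come from the encoders listed below and the prompt would be passed to the code LLM; the prompt layout shown here is only one plausible choice.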
## Models Used (January 2026)
| Component | Model | Description |
|-----------|-------|-------------|
| **Concept Encoder** | `Salesforce/SFR-Embedding-Code-2B_R` | SOTA code embeddings (CoIR: 67.4) |
| **Text Encoder** | `Alibaba-NLP/gte-Qwen2-1.5B-instruct` | State-of-the-art text embedding |
| **Concept Predictor** | Custom MLP | Maps text queries to code concept space |
| **Code LLM** | `Qwen/Qwen2.5-Coder-32B-Instruct` | High-performance code generation |
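For readers wondering what a "Custom MLP" predictor might look like, here is a rough PyTorch sketch. The README does not specify the architecture, so the class name `ConceptPredictor`, the two-layer layout, the GELU activation, the hidden size, the toy dimensions, and the cosine-distance loss are all assumptions for illustration; the real input and output widths depend on the text and code encoders above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptPredictor(nn.Module):
    """Hypothetical MLP mapping a text-query embedding to the code concept space."""

    def __init__(self, text_dim, concept_dim, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, concept_dim),
        )

    def forward(self, text_embedding):
        # L2-normalise so predictions can be compared to bank entries by cosine similarity.
        return F.normalize(self.net(text_embedding), dim=-1)


def cosine_loss(predicted, target):
    """Mean cosine distance between predicted and target concept embeddings (assumed objective)."""
    return (1.0 - F.cosine_similarity(predicted, target, dim=-1)).mean()


torch.manual_seed(0)
# Toy dimensions chosen for the example only.
predictor = ConceptPredictor(text_dim=16, concept_dim=8, hidden_dim=32)
text_batch = torch.randn(4, 16)                         # stand-in for text-encoder outputs
target_batch = F.normalize(torch.randn(4, 8), dim=-1)   # stand-in for code-encoder outputs
predicted = predictor(text_batch)
loss = cosine_loss(predicted, target_batch)
print(predicted.shape, loss.item())
```

Predicting in embedding space and scoring with a distance in that space, rather than a token-level loss, is the JEPA-style part of the idea; whether the released weights were trained with this exact objective is not stated here.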
## Files in this Repo
- `concept_predictor.pt`: PyTorch weights for the concept predictor MLP.
- `concept_predictor.gguf`: GGUF format for edge deployment (llama.cpp/LM Studio).
- `concept_bank.pt`: Pre-computed embeddings for the concept retrieval bank.
## Usage
```python
import torch

# Load the concept predictor checkpoint onto the CPU so this also works without a GPU
checkpoint = torch.load("concept_predictor.pt", map_location="cpu")
# ... (See Colab notebook for full implementation)
```
## Datasets
Constructed from high-quality subsets of:
- **MBPP**
- **Evol-Instruct-Code**
- **Magicoder-OSS-Instruct**
## Credits
Created by **Core Subagent** (Colab Composer) for **Riley Seaburg**.