rileyseaburg
/

concept-first-codegen

+---
+library_name: transformers
+tags:
+- code-generation
+- concept-embedding
+- jepa
+- pytorch
+- gguf
+license: apache-2.0
+---
+# Concept-First Code Generation
+**Inspired by VL-JEPA**: Predict concept embeddings first, then generate code conditioned on them.
+## The Idea
+Traditional autoregressive models predict one token at a time, which can cause them to lose coherence over long outputs or hallucinate APIs. The **Concept-First** approach aims to mitigate this in three stages:
+1. **Concept Encoder**: Encoding code snippets into semantic embeddings.
+2. **Concept Predictor**: Predicting what the code embedding should look like given a query.
+3. **Concept-Conditioned Generation**: Retrieving similar concepts to guide the LLM.
+```mermaid
+graph LR
+    A[Query] --> B(Concept Predictor)
+    B --> C{Concept Space}
+    C --> D[Retrieve Similar Code]
+    D --> E[Conditioned Generation]
+```
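The three stages above can be sketched end to end with toy data. This is a minimal illustration under stated assumptions, not the repository's implementation: the `cosine`, `retrieve`, and `build_prompt` helpers, the 3-dimensional vectors, and the bank entries are all hypothetical, and the predicted concept vector is hard-coded where the real pipeline would call the concept predictor.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(predicted, bank, k=2):
    """Return the k bank entries whose embeddings are closest to `predicted`."""
    ranked = sorted(
        bank,
        key=lambda item: cosine(predicted, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, neighbours):
    """Prepend retrieved snippets to the query as context for the code LLM."""
    context = "\n\n".join(f"# Related example\n{n['code']}" for n in neighbours)
    return f"{context}\n\n# Task: {query}\n"

# Toy concept bank (hypothetical); real embeddings would come from the concept encoder.
bank = [
    {"code": "def add(a, b):\n    return a + b", "embedding": [1.0, 0.0, 0.0]},
    {"code": "def read_file(path):\n    return open(path).read()", "embedding": [0.0, 1.0, 0.0]},
    {"code": "def mul(a, b):\n    return a * b", "embedding": [0.7, 0.3, 0.0]},
]

query = "Write a function that sums two numbers"
predicted = [0.95, 0.05, 0.0]  # stand-in for the concept predictor's output

neighbours = retrieve(predicted, bank, k=2)
prompt = build_prompt(query, neighbours)
print(prompt)
```

The prompt ends with the task line, so the retrieved snippets act as context the code LLM reads before the request. How the actual pipeline injects retrieved concepts into generation is not specified here; prompt prefixing is only one plausible choice.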
+## Models Used (January 2026)
+| Component | Model | Description |
+|-----------|-------|-------------|
+| **Concept Encoder** | `Salesforce/SFR-Embedding-Code-2B_R` | SOTA code embeddings (CoIR: 67.4) |
+| **Text Encoder** | `Alibaba-NLP/gte-Qwen2-1.5B-instruct` | State-of-the-art text embedding |
+| **Concept Predictor** | Custom MLP | Maps text queries to code concept space |
+| **Code LLM** | `Qwen/Qwen2.5-Coder-32B-Instruct` | High-performance code generation |
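For illustration, a "Custom MLP" concept predictor of the kind listed in the table might look like the sketch below. Everything in it is an assumption rather than the repository's actual architecture: the class name `ConceptPredictor`, the two-layer layout and hidden width, the cosine training objective, and the input and output dimensions (1536 and 2304 are placeholders; check each encoder's config for the real sizes). Random tensors stand in for real embedding pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptPredictor(nn.Module):
    """Hypothetical MLP mapping text-query embeddings into the code-embedding space."""

    def __init__(self, text_dim=1536, code_dim=2304, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, code_dim),
        )

    def forward(self, text_emb):
        # L2-normalise so outputs can be compared by cosine similarity.
        return F.normalize(self.net(text_emb), dim=-1)

torch.manual_seed(0)
model = ConceptPredictor()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Random stand-ins for a batch of (text embedding, code embedding) pairs.
text_emb = torch.randn(8, 1536)
code_emb = F.normalize(torch.randn(8, 2304), dim=-1)

# One training step: pull each prediction towards its paired code embedding.
pred = model(text_emb)
loss = (1.0 - F.cosine_similarity(pred, code_emb, dim=-1)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Normalising the output keeps predictions on the unit sphere, which suits nearest-neighbour lookup by cosine similarity against a bank of pre-computed code embeddings.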
+## Files in this Repo
+- `concept_predictor.pt`: PyTorch weights for the concept predictor MLP.
+- `concept_predictor.gguf`: GGUF format for edge deployment (llama.cpp/LM Studio).
+- `concept_bank.pt`: Pre-computed embeddings for the concept retrieval bank.
+## Usage
+```python
+import torch
+
+# Load the concept predictor checkpoint; map_location="cpu" lets this run without a GPU
+checkpoint = torch.load("concept_predictor.pt", map_location="cpu")
+# ... (See Colab notebook for full implementation)
+```
+## Datasets
+Constructed from high-quality subsets of:
+- **MBPP**
+- **Evol-Instruct-Code**
+- **Magicoder-OSS-Instruct**
+## Credits
+Created by **Core Subagent** (Colab Composer) for **Riley Seaburg**.