---
library_name: transformers
tags:
- code-generation
- concept-embedding
- jepa
- pytorch
- gguf
license: apache-2.0
---

# Concept-First Code Generation

**Inspired by VL-JEPA**: predict concept embeddings first, then generate code conditioned on them.

## The Idea

Traditional autoregressive models predict tokens one at a time, which can cause them to lose high-level coherence or hallucinate APIs. The **Concept-First** approach addresses this with a three-stage pipeline:

1. **Concept Encoder**: encode code snippets into semantic embeddings.
2. **Concept Predictor**: predict what the code embedding should look like given a query.
3. **Concept-Conditioned Generation**: retrieve similar concepts to guide the LLM.

```mermaid
graph LR
    A[Query] --> B(Concept Predictor)
    B --> C{Concept Space}
    C --> D[Retrieve Similar Code]
    D --> E[Conditioned Generation]
```

## Models Used (January 2026)

| Component | Model | Description |
|-----------|-------|-------------|
| **Concept Encoder** | `Salesforce/SFR-Embedding-Code-2B_R` | SOTA code embeddings (CoIR: 67.4) |
| **Text Encoder** | `Alibaba-NLP/gte-Qwen2-1.5B-instruct` | State-of-the-art text embeddings |
| **Concept Predictor** | Custom MLP | Maps text queries into the code concept space |
| **Code LLM** | `Qwen/Qwen2.5-Coder-32B-Instruct` | High-performance code generation |

## Files in this Repo

- `concept_predictor.pt`: PyTorch weights for the concept predictor MLP.
- `concept_predictor.gguf`: GGUF format for edge deployment (llama.cpp/LM Studio).
- `concept_bank.pt`: Pre-computed embeddings for the concept retrieval bank.

## Usage

```python
# Load the concept predictor weights on CPU
import torch

checkpoint = torch.load("concept_predictor.pt", map_location="cpu")
# ... (See Colab notebook for full implementation)
```

Illustrative sketches of the full pipeline are collected under **Implementation Sketches** at the end of this card.

## Datasets

Constructed from high-quality subsets of:

- **MBPP**
- **Evol-Instruct-Code**
- **Magicoder-OSS-Instruct**

## Credits

Created by **Core Subagent** (Colab Composer) for **Riley Seaburg**.
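
## Implementation Sketches (Illustrative)

The checkpoint is described above only as a "custom MLP" that maps text-query embeddings into the code concept space. Below is a minimal sketch of what such a module might look like. The layer sizes are assumptions for illustration (1536 for the gte-Qwen2-1.5B text-embedding width, 2304 for the SFR-Embedding-Code-2B output), as is the assumption that `concept_predictor.pt` stores a plain `state_dict`; the real architecture may differ.

```python
# Illustrative concept-predictor MLP; the dimensions and architecture are
# assumptions, not the documented contents of concept_predictor.pt.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_DIM = 1536  # assumed gte-Qwen2-1.5B-instruct embedding width
CODE_DIM = 2304  # assumed SFR-Embedding-Code-2B_R embedding width


class ConceptPredictor(nn.Module):
    """Maps a text-query embedding into the code concept space."""

    def __init__(self, text_dim=TEXT_DIM, code_dim=CODE_DIM, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, code_dim),
        )

    def forward(self, text_emb):
        # L2-normalize so predictions live on the same unit sphere
        # as the pre-computed code embeddings.
        return F.normalize(self.net(text_emb), dim=-1)


predictor = ConceptPredictor()
# Assumes the checkpoint is a plain state_dict for this module.
predictor.load_state_dict(torch.load("concept_predictor.pt", map_location="cpu"))
predictor.eval()
```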
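
The card says the predictor learns "what the code embedding should look like given a query," which suggests a JEPA-style regression objective in embedding space. A hedged sketch of such a loss follows; the actual training objective is not documented here.

```python
# Illustrative JEPA-style objective: regress the predicted concept onto
# the target code embedding. The real loss used to train
# concept_predictor.pt is not documented in this card.
import torch.nn.functional as F

def concept_loss(pred_concept, target_code_emb):
    # Cosine distance between predicted and true (normalized) code embeddings.
    target = F.normalize(target_code_emb, dim=-1)
    return 1.0 - F.cosine_similarity(pred_concept, target, dim=-1).mean()
```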
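
Finally, a sketch of the retrieval and generation stages from the diagram, using the `predictor` defined above: embed the query, predict its code concept, retrieve nearest neighbors from `concept_bank.pt` by cosine similarity, and splice them into the prompt for the code LLM. The bank layout assumed here (an `embeddings` tensor with a parallel `snippets` list) is an illustration, not a documented format.

```python
# Illustrative retrieval + concept-conditioned prompting. The layout of
# concept_bank.pt ("embeddings" tensor + "snippets" list) is assumed.
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

bank = torch.load("concept_bank.pt", map_location="cpu")
bank_emb = F.normalize(bank["embeddings"], dim=-1)

text_encoder = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True
)

def retrieve(query: str, k: int = 3) -> list[str]:
    # gte-Qwen2 instruct models expect a query prompt for retrieval.
    q = torch.tensor(text_encoder.encode(query, prompt_name="query"))
    with torch.no_grad():
        concept = predictor(q)       # predicted code concept
    scores = bank_emb @ concept      # cosine similarity (both normalized)
    top = scores.topk(min(k, len(scores))).indices.tolist()
    return [bank["snippets"][i] for i in top]

query = "merge two sorted linked lists"
snippets = "\n\n".join(retrieve(query))
prompt = (
    f"Reference snippets:\n{snippets}\n\n"
    f"Task: {query}\nWrite the solution:"
)
# `prompt` then goes to Qwen/Qwen2.5-Coder-32B-Instruct, e.g. via
# transformers' chat interface or an inference endpoint.
```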