Commit 027f974 (verified) by rileyseaburg · 1 parent: 4d5e682

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +66 -0
README.md ADDED
@@ -0,0 +1,66 @@
+
+ ---
+ library_name: transformers
+ tags:
+ - code-generation
+ - concept-embedding
+ - jepa
+ - pytorch
+ - gguf
+ license: apache-2.0
+ ---
+
+ # Concept-First Code Generation
+
+ **Inspired by VL-JEPA**: Predict concept embeddings first, then generate code conditioned on them.
+
+ ## The Idea
+
+ Traditional autoregressive models predict tokens one at a time, which can cause them to lose coherence or hallucinate APIs. The **Concept-First** approach aims to address this with three components:
+
+ 1. **Concept Encoder**: Encodes code snippets into semantic embeddings.
+ 2. **Concept Predictor**: Predicts, from a query, what the embedding of the target code should look like.
+ 3. **Concept-Conditioned Generation**: Retrieves code with similar concept embeddings to guide the LLM.
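
The retrieval step in item 3 can be illustrated with a small, dependency-free sketch. It assumes that "similar" means cosine similarity and that the concept bank is a list of `(snippet, embedding)` pairs; neither detail is stated in this README. The names `cosine_similarity`, `retrieve_similar`, and `toy_bank` are hypothetical, and the three-dimensional vectors are toy values rather than real encoder outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(predicted, concept_bank, k=2):
    """Return the k (snippet, score) pairs closest to the predicted embedding."""
    scored = [
        (snippet, cosine_similarity(predicted, embedding))
        for snippet, embedding in concept_bank
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy bank: (code snippet, concept embedding) pairs.
toy_bank = [
    ("def add(a, b): return a + b", [1.0, 0.0, 0.0]),
    ("def read_file(path): return open(path).read()", [0.0, 1.0, 0.0]),
    ("def mul(a, b): return a * b", [0.9, 0.1, 0.0]),
]

# Stand-in for the output of the concept predictor.
predicted_embedding = [1.0, 0.05, 0.0]
top_matches = retrieve_similar(predicted_embedding, toy_bank, k=2)
```

In a real pipeline the predicted embedding would come from the concept predictor and the bank from the pre-computed embeddings; a vectorized or approximate nearest-neighbour search would likely replace the Python loop for large banks.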
+
+ ```mermaid
+ graph LR
+ A[Query] --> B(Concept Predictor)
+ B --> C{Concept Space}
+ C --> D[Retrieve Similar Code]
+ D --> E[Conditioned Generation]
+ ```
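
The last arrow in the diagram, from retrieved code to conditioned generation, could be realized by placing the retrieved snippets in the prompt sent to the code LLM. The sketch below is one possible way to do that and is an assumption: this README does not specify how retrieved concepts are passed to the model. The function name `build_conditioned_prompt` and the prompt layout are hypothetical.

```python
def build_conditioned_prompt(query, retrieved_snippets):
    """Format a query plus retrieved reference snippets into one prompt string."""
    sections = []
    for index, snippet in enumerate(retrieved_snippets, start=1):
        sections.append(f"### Reference {index}\n{snippet.strip()}")
    # Fall back to an explicit marker when retrieval returned nothing.
    references = "\n\n".join(sections) if sections else "(no references retrieved)"
    return (
        "Use the reference snippets below as guidance.\n\n"
        f"{references}\n\n"
        f"### Task\n{query.strip()}\n"
    )


# Hypothetical usage with a single retrieved snippet.
example_prompt = build_conditioned_prompt(
    "Write a function that multiplies two numbers.",
    ["def add(a, b):\n    return a + b"],
)
```

The resulting string would then be passed to the code LLM as an ordinary instruction prompt.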
+
+ ## Models Used (January 2026)
+
+ | Component | Model | Description |
+ |-----------|-------|-------------|
+ | **Concept Encoder** | `Salesforce/SFR-Embedding-Code-2B_R` | SOTA code embeddings (CoIR: 67.4) |
+ | **Text Encoder** | `Alibaba-NLP/gte-Qwen2-1.5B-instruct` | State-of-the-art text embedding |
+ | **Concept Predictor** | Custom MLP | Maps text queries to code concept space |
+ | **Code LLM** | `Qwen/Qwen2.5-Coder-32B-Instruct` | High-performance code generation |
+
+ ## Files in this Repo
+
+ - `concept_predictor.pt`: PyTorch weights for the concept predictor MLP.
+ - `concept_predictor.gguf`: GGUF format for edge deployment (llama.cpp/LM Studio).
+ - `concept_bank.pt`: Pre-computed embeddings for the concept retrieval bank.
+
+ ## Usage
+
+ ```python
+ import torch
+
+ # Load the concept predictor checkpoint.
+ # map_location="cpu" lets it load on machines without a GPU.
+ checkpoint = torch.load("concept_predictor.pt", map_location="cpu")
+ # ... (See Colab notebook for full implementation)
+ ```
+
+ ## Datasets
+
+ Constructed from high-quality subsets of:
+ - **MBPP**
+ - **Evol-Instruct-Code**
+ - **Magicoder-OSS-Instruct**
+
+ ## Credits
+
+ Created by **Core Subagent** (Colab Composer) for **Riley Seaburg**.