File size: 4,150 Bytes
4b445f6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
"""
Code Embedding Pipeline
========================

Converts source code into vector embeddings using sentence-transformers.
These embeddings are stored in ChromaDB for semantic search.

How it works:
1. Source code is split into chunks (functions, classes, or fixed-size blocks)
2. Each chunk is embedded into a 384-dimensional vector
3. Vectors capture semantic meaning — similar code has similar vectors
4. When reviewing a PR, we query ChromaDB with the diff to find related code

Why embeddings for code?
Consider this diff:
    + user_id = request.args.get("id")
    + data = db.query(f"SELECT * FROM users WHERE id = {user_id}")

To evaluate this, the agent needs to know:
- Does `db.query()` parameterize inputs? → Need the DB wrapper's source code
- Is there middleware that validates `user_id`? → Need the middleware source
- Are there other similar patterns in the codebase? → Need semantic search

Embeddings let us find this related code WITHOUT knowing the exact file paths.
The query "SQL query with user input" returns relevant code chunks ranked by
semantic similarity — not keyword matching, but meaning matching.

Model: all-MiniLM-L6-v2
- 384 dimensions, 22M parameters
- Runs locally on CPU in ~10ms per chunk (GPU: ~1ms)
- Optimized for semantic similarity tasks
- Good enough for code — not perfect, but fast and free
"""

from __future__ import annotations

import structlog

from app.config import settings

logger = structlog.get_logger()

# Lazy-loaded model to avoid slow import at startup
_model = None


def get_embedding_model():
    """
    Lazy-load the sentence-transformers model.

    We load on first use (not at import time) because:
    1. The model takes ~2 seconds to load
    2. Not every request needs embeddings (cached reviews skip this)
    3. Tests shouldn't load a real ML model
    """
    global _model
    if _model is None:
        try:
            from sentence_transformers import SentenceTransformer
            _model = SentenceTransformer(settings.embedding_model)
            logger.info("Loaded embedding model", model=settings.embedding_model)
        except ImportError:
            logger.warning("sentence-transformers not installed — RAG context disabled")
            return None
    return _model


def embed_texts(texts: list[str]) -> list[list[float]]:
    """
    Embed a list of text strings into vectors.

    Args:
        texts: List of code chunks or queries to embed

    Returns:
        List of embedding vectors (each is a list of floats)
    """
    model = get_embedding_model()
    if model is None:
        return []

    embeddings = model.encode(texts, show_progress_bar=False)
    return embeddings.tolist()


def chunk_code(content: str, filepath: str, chunk_size: int = 60) -> list[dict]:
    """
    Split source code into overlapping chunks for embedding.

    Strategy: We chunk by lines with overlap. Each chunk is ~60 lines
    with 10 lines of overlap to preserve context across boundaries.

    Why 60 lines? It's roughly one function/class — the natural unit of
    code that a developer would reason about. Too small (10 lines) loses
    context. Too large (200 lines) dilutes the embedding signal.

    Args:
        content: Full file source code
        filepath: The file path (included as metadata)
        chunk_size: Lines per chunk (default: 60)

    Returns:
        List of dicts with 'text', 'filepath', 'start_line', 'end_line'
    """
    lines = content.split("\n")
    chunks = []
    overlap = 10
    start = 0

    while start < len(lines):
        end = min(start + chunk_size, len(lines))
        chunk_text = "\n".join(lines[start:end])

        # Skip very small chunks (less than 5 non-empty lines)
        non_empty = sum(1 for line in lines[start:end] if line.strip())
        if non_empty >= 5:
            chunks.append({
                "text": f"# File: {filepath}\n{chunk_text}",
                "filepath": filepath,
                "start_line": start + 1,
                "end_line": end,
            })

        start += max(chunk_size - overlap, 1)  # Overlap for context continuity

    return chunks