---
title: Code Search API
colorFrom: yellow
colorTo: red
sdk: docker
pinned: false
license: mit
app_port: 7860
---

# Code Search API

A FastAPI REST API for semantic code search powered by
[`jinaai/jina-embeddings-v2-base-code`](https://huggingface.co/jinaai/jina-embeddings-v2-base-code)
and FAISS approximate nearest-neighbour search.

## What's new (v2)

| Area | Before | After |
|---|---|---|
| Model | pplx-embed-v1-0.6B × 2 | jina-embeddings-v2-base-code × 1 |
| Embedding speed | ~2 s / batch | ~500 ms / batch |
| Search (100 K chunks) | ~2 000 ms | ~5 ms |
| Chunking | Sentence windows | AST (Python) / regex (other langs) |
| Persistence | Lost on restart | Saved to `/data` volume |
| Batch indexing | ❌ | ✅ `/index/batch` |

## Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET`    | `/`                      | Health check |
| `GET`    | `/health`                | Model status |
| `POST`   | `/index`                 | Upload & index a single source file |
| `POST`   | `/index/batch`           | Index an entire codebase in one call |
| `POST`   | `/search`                | Search an indexed document / codebase |
| `POST`   | `/embed`                 | Embed arbitrary texts (raw vectors) |
| `GET`    | `/documents`             | List indexed doc IDs |
| `DELETE` | `/documents/{doc_id}`    | Remove a document |

Interactive docs are available at `/docs` (Swagger UI).
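
The two document-management endpoints in the table take no request body, so they are easy to script. The following is a minimal sketch using only the Python standard library. The helper names (`list_documents_request`, `delete_document_request`, `send`) are hypothetical, and the JSON response format is an assumption; check `/docs` for the actual schema.

```python
import json
import urllib.parse
import urllib.request


def list_documents_request(base_url: str) -> urllib.request.Request:
    """Build the GET /documents request (hypothetical helper)."""
    return urllib.request.Request(f"{base_url.rstrip('/')}/documents", method="GET")


def delete_document_request(base_url: str, doc_id: str) -> urllib.request.Request:
    """Build the DELETE /documents/{doc_id} request (hypothetical helper)."""
    # Percent-encode the ID so spaces or slashes stay inside one path segment.
    safe_id = urllib.parse.quote(doc_id, safe="")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/documents/{safe_id}", method="DELETE"
    )


def send(request: urllib.request.Request, timeout: float = 30.0):
    """Send a request and decode the body, assuming the API answers with JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

To use it, pass either request to `send`, for example `send(list_documents_request("https://YOUR-SPACE.hf.space"))`. Building the request separately from sending it makes the URL and HTTP method easy to inspect before anything goes over the network.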

## Quick start

### Index a single file

```bash
curl -X POST https://YOUR-SPACE.hf.space/index \
  -F "file=@src/utils.py" \
  -F "doc_id=utils"
```

### Index a whole project (IDE integration)

```python
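import os

import requests

def index_project(base_url: str, project_path: str, doc_id: str) -> dict:
    """Walk project_path and send every supported file to /index/batch."""
    SUPPORTED = {".py", ".js", ".ts", ".tsx", ".go", ".rs", ".java", ".md"}
    files = []
    for root, _, filenames in os.walk(project_path):
        for fname in filenames:
            # Compare extensions case-insensitively so "Main.PY" is also picked up.
            if os.path.splitext(fname)[1].lower() in SUPPORTED:
                full_path = os.path.join(root, fname)
                rel_path  = os.path.relpath(full_path, project_path)
                # Read as UTF-8; undecodable bytes are replaced instead of raising.
                with open(full_path, "r", encoding="utf-8", errors="replace") as f:
                    files.append({"filename": rel_path, "content": f.read()})

    resp = requests.post(f"{base_url}/index/batch", json={
        "doc_id":  doc_id,
        "files":   files,
        "replace": True,
    }, timeout=300)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
    return resp.json()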

result = index_project("https://YOUR-SPACE.hf.space", "./my_project", "my_project")
print(result)
# {"doc_id": "my_project", "files_indexed": 42, "chunks_indexed": 318}
```

### Search

```bash
curl -X POST https://YOUR-SPACE.hf.space/search \
  -H "Content-Type: application/json" \
  -d '{"doc_id": "my_project", "query": "fetch user from database", "top_k": 5}'
```
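
The same search can be issued from Python. This is a minimal sketch using only the standard library, with the request body copied from the `curl` call above. The function names `build_search_request` and `search` are hypothetical, and the shape of the response is not documented here, so the code returns the decoded JSON as-is.

```python
import json
import urllib.request


def build_search_request(
    base_url: str, doc_id: str, query: str, top_k: int = 5
) -> urllib.request.Request:
    """Build the POST /search request with the same JSON body as the curl example."""
    payload = {"doc_id": doc_id, "query": query, "top_k": top_k}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url: str, doc_id: str, query: str, top_k: int = 5, timeout: float = 30.0):
    """Send the search request and return the decoded JSON (assumed response format)."""
    request = build_search_request(base_url, doc_id, query, top_k)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as `search("https://YOUR-SPACE.hf.space", "my_project", "fetch user from database")` mirrors the `curl` example.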

## Supported languages

| Languages | Chunking |
|---|---|
| Python | AST |
| JavaScript, TypeScript, Go, Rust, Java, C, C++, C#, Ruby, PHP | Regex |
| Markdown, plain text | Sentence |
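
To illustrate what AST chunking means for Python files, here is a minimal sketch built on the standard-library `ast` module. It is not the service's actual chunker: the function name `chunk_python_source`, the chunk dictionary layout, and the choice to split only at top-level functions and classes are all assumptions made for illustration.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # populated by ast.parse on Python 3.8+
            chunks.append({
                "name": node.name,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks
```

Splitting at syntactic boundaries keeps each function or class intact inside a single chunk, so a search hit can point at a complete definition rather than an arbitrary window of lines. Module-level statements such as imports are skipped in this sketch.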

## Persistence

Indexes are saved to the `/data` persistent volume after every `/index` or `/index/batch` call and restored automatically when the Space restarts, so no re-indexing is needed.