---
license: apache-2.0
tags:
  - code
  - multi-language
  - pretraining-data
---

# code-graph-v4

Packaged git clones for the graphjepa / code-transformer project,
with full git history.

## Contents

- `clones_csharp_full.tar.gz`
- `clones_java_full.tar.gz`
- `clones_javascript_full.tar.gz`
- `clones_python_full.tar.gz`
- `clones_typescript_full.tar.gz`

Each tarball contains `{language}/{repo_id}/...`. Extract it anywhere,
then point the parser at the extracted directory.
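
To check an extraction before running the parser, the repos can be enumerated from the directory layout. This is a minimal sketch that assumes only the `{language}/{repo_id}/...` layout described above; `list_repos` is a hypothetical helper and not part of the project.

```python
from pathlib import Path


def list_repos(root):
    """Return sorted (language, repo_id) pairs found under an extracted root.

    Assumes the {language}/{repo_id}/... layout; stray files at either
    level are ignored.
    """
    root = Path(root)
    pairs = []
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for repo_dir in sorted(p for p in lang_dir.iterdir() if p.is_dir()):
            pairs.append((lang_dir.name, repo_dir.name))
    return pairs
```

For example, `list_repos("./data_multilang")` would list every repo available to the parser after extraction.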

## On the receiving (big) machine

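```python
import os
import tarfile

from huggingface_hub import hf_hub_download

# Download one language tarball from the Hub into the current directory.
path = hf_hub_download(
    repo_id="IDMedicine/code-graph-v4",
    filename="clones_python_full.tar.gz",
    repo_type="model",
    local_dir=".",
)

# Equivalent to: tar -xzf "$path" -C ./data_multilang/
os.makedirs("./data_multilang", exist_ok=True)
with tarfile.open(path, "r:gz") as tar:
    tar.extractall("./data_multilang")

# Then process each repo with build_bundle.py (needs include_git=True for
# temporal processing; or single-snapshot parsing if code-only).
```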
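
To fetch every language rather than just Python, the tarball names can be generated from the naming pattern in the Contents list. This is a sketch under assumptions: `tarball_name` and `download_all` are hypothetical helpers, the `clones_{language}_{variant}.tar.gz` pattern is inferred from the file names above, and the downloader is passed in as a callable so the loop does not depend on any one client.

```python
LANGUAGES = ["csharp", "java", "javascript", "python", "typescript"]


def tarball_name(language, variant="full"):
    """Build a tarball file name, e.g. clones_python_full.tar.gz."""
    return f"clones_{language}_{variant}.tar.gz"


def download_all(download, languages=LANGUAGES, variant="full"):
    """Call download(filename) for each language and collect the results."""
    return [download(tarball_name(lang, variant)) for lang in languages]


# Possible usage with huggingface_hub (requires network access):
# from huggingface_hub import hf_hub_download
# paths = download_all(lambda name: hf_hub_download(
#     repo_id="IDMedicine/code-graph-v4", filename=name,
#     repo_type="model", local_dir="."))
```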

## Limitations

- If packaged without `.git` (the `_code` variants), **no temporal
  processing is possible** downstream; only single-snapshot SSL remains.
- If packaged with `.git` (the `_full` variants), tarballs are larger
  but the full commit history is preserved for `build_bundle.py`.
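
Which case applies to an extracted repo can be checked by looking for a `.git` entry. The sketch below assumes that the presence of `.git` is a sufficient signal; `has_git_history` and `processing_mode` are hypothetical helpers, and the returned labels are illustrative and not arguments of `build_bundle.py`.

```python
from pathlib import Path


def has_git_history(repo_dir):
    """True if repo_dir contains a .git entry (directory or file)."""
    return (Path(repo_dir) / ".git").exists()


def processing_mode(repo_dir):
    """Pick an illustrative label for how the repo can be processed."""
    return "temporal" if has_git_history(repo_dir) else "single-snapshot"
```

A repo extracted from a `_full` tarball should report `"temporal"`; one from a `_code` tarball should report `"single-snapshot"`.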