---
license: apache-2.0
tags:
- code
- multi-language
- pretraining-data
---
# code-graph-v4
Packaged git clones, with full git history, for the graphjepa / code-transformer project.
## Contents
- clones_csharp_full.tar.gz
- clones_java_full.tar.gz
- clones_javascript_full.tar.gz
- clones_python_full.tar.gz
- clones_typescript_full.tar.gz
Each tarball contains `{language}/{repo_id}/...`. Extract it anywhere, then
point the parser at the extracted directory.
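
As a rough sketch of what "pointing the parser at the extracted directory" might involve, the helper below walks the `{language}/{repo_id}/` layout and yields one entry per repository. The function name `iter_repos` is hypothetical and not part of this project; it only assumes the two-level directory layout described above.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_repos(root: str) -> Iterator[Tuple[str, str, Path]]:
    """Yield (language, repo_id, path) for each repo directory under root.

    Hypothetical helper: assumes the {language}/{repo_id}/ layout only.
    """
    for lang_dir in sorted(Path(root).iterdir()):
        if not lang_dir.is_dir():
            continue  # skip stray files next to the language directories
        for repo_dir in sorted(lang_dir.iterdir()):
            if repo_dir.is_dir():
                yield lang_dir.name, repo_dir.name, repo_dir
```

For example, `for language, repo_id, path in iter_repos("./data_multilang"): ...` would visit every extracted repository in a stable, sorted order.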
## On the receiving (big) machine
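```python
import os
import tarfile

from huggingface_hub import hf_hub_download

# Download one tarball from the Hub into the current directory.
path = hf_hub_download(
    repo_id="IDMedicine/code-graph-v4",
    filename="clones_python_full.tar.gz",
    repo_type="model",
    local_dir=".",
)

# Equivalent of: mkdir -p ./data_multilang && tar -xzf "$path" -C ./data_multilang/
os.makedirs("data_multilang", exist_ok=True)
with tarfile.open(path, "r:gz") as tar:
    tar.extractall("data_multilang")

# Then process each repo with build_bundle.py (needs include_git=True for
# temporal processing; or single-snapshot parsing if code-only).
```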
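
To fetch every language rather than one, the same download could be looped over the tarballs listed under Contents. This is a minimal sketch, not a script shipped with the project: `LANGUAGES`, `tarball_name`, and `fetch_all` are hypothetical names, and the file name pattern is inferred from the five listed files.

```python
import tarfile
from pathlib import Path

# Languages taken from the tarball names listed under Contents.
LANGUAGES = ["csharp", "java", "javascript", "python", "typescript"]


def tarball_name(language: str) -> str:
    """File name of the `_full` tarball for one language (inferred pattern)."""
    return f"clones_{language}_full.tar.gz"


def fetch_all(dest: str = "./data_multilang") -> None:
    """Download and extract every listed tarball into dest."""
    # Imported here so tarball_name() stays usable without huggingface_hub.
    from huggingface_hub import hf_hub_download

    Path(dest).mkdir(parents=True, exist_ok=True)
    for language in LANGUAGES:
        path = hf_hub_download(
            repo_id="IDMedicine/code-graph-v4",
            filename=tarball_name(language),
            repo_type="model",
            local_dir=".",
        )
        with tarfile.open(path, "r:gz") as tar:
            tar.extractall(dest)
```

Calling `fetch_all()` downloads all five archives, so it needs enough disk space for both the tarballs and their extracted contents.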
## Limitations
- Tarballs packaged without `.git` (the `_code` variants) allow **no temporal
processing** downstream, only single-snapshot SSL.
- Tarballs packaged with `.git` (the `_full` variants) are larger, but they
preserve the full commit history for `build_bundle.py`.
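
One way to tell the two cases apart after extraction is to check each repository for a `.git` entry before choosing a processing mode. The sketch below is illustrative only: `has_git_history` and `processing_mode` are hypothetical helpers, and how the result is passed to `build_bundle.py` is left open, since its interface is not documented here.

```python
from pathlib import Path


def has_git_history(repo_dir: str) -> bool:
    """Return True if the extracted repo still carries a .git entry."""
    # .git is normally a directory, but can be a file (e.g. for worktrees).
    return (Path(repo_dir) / ".git").exists()


def processing_mode(repo_dir: str) -> str:
    """Pick 'temporal' when history is present, else 'snapshot'."""
    return "temporal" if has_git_history(repo_dir) else "snapshot"
```

A repository from a `_full` tarball would be expected to report `"temporal"`, and one from a `_code` tarball `"snapshot"`.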