---

license: apache-2.0
tags:
- text-generation
- transformers
- safetensors
- conversational
pipeline_tag: text-generation
library_name: transformers
---

# Mysterious Coding Model

This repository contains a specialised AI model for agentic code generation and text generation tasks. The model is inspired by the GPT‑OSS series (gpt-oss-20b and gpt-oss-120b) described in [the corresponding paper](https://arxiv.org/abs/2508.10925). It is built on open‑source Llama architecture and fine‑tuned for programming assistance, conversation and multi‑language support.

## Key Features

- **Open source**: released under the Apache‑2.0 license.
- **Text and code generation**: supports code completion, bug fixing, refactoring and documentation generation.
- **Efficient storage**: all weights are stored in the secure and fast `safetensors` format.
- **Multiple precisions**: includes the base FP16 model plus 4‑bit, 8‑bit and AWQ quantised variants.
- **vLLM compatibility**: works with the vLLM engine for high‑throughput inference.
- **Conversational**: instruction-tuned for interactive coding assistance.

## Repository Structure

```
coding-model-repository/
├── README.md
├── .gitattributes             # Updated for safetensors
├── .gitignore
├── requirements.txt
├── model_index.json           # Safetensors model index
├── config.json                # Coding model configuration
├── model_card.md              # Coding model documentation
│
├── models/
│   ├── library=safetensors/   # Main safetensors models directory
│   │   ├── base/
│   │   │   ├── model-00001-of-00003.safetensors
│   │   │   ├── model-00002-of-00003.safetensors
│   │   │   ├── model-00003-of-00003.safetensors
│   │   │   ├── model.safetensors.index.json
│   │   │   ├── config.json
│   │   │   ├── generation_config.json
│   │   │   └── tokenizer/
│   │   │       ├── tokenizer.json
│   │   │       ├── tokenizer_config.json
│   │   │       ├── vocab.json
│   │   │       ├── merges.txt
│   │   │       └── special_tokens_map.json
│   │   │
│   │   ├── quantized/
│   │   │   ├── 4bit/
│   │   │   │   ├── model.safetensors
│   │   │   │   └── quantization_config.json
│   │   │   ├── 8bit/
│   │   │   │   ├── model.safetensors
│   │   │   │   └── quantization_config.json
│   │   │   └── awq/
│   │   │       ├── model.safetensors
│   │   │       └── quant_config.json
│   │   │
│   │   ├── instruct/
│   │   │   ├── model.safetensors
│   │   │   ├── config.json
│   │   │   └── training_config.json
│   │   │
│   │   └── specialized/
│   │       ├── python-focused/
│   │       │   └── model.safetensors
│   │       ├── web-dev/
│   │       │   └── model.safetensors
│   │       ├── systems-programming/
│   │       │   └── model.safetensors
│   │       └── data-science/
│   │           └── model.safetensors
│   │
│   ├── adapters/              # Safetensors adapters
│   │   ├── lora/
│   │   │   ├── adapter_model.safetensors
│   │   │   └── adapter_config.json
│   │   ├── coding-specific/
│   │   │   ├── debugging-adapter.safetensors
│   │   │   ├── refactoring-adapter.safetensors
│   │   │   └── documentation-adapter.safetensors
│   │   └── language-specific/
│   │       ├── python-adapter.safetensors
│   │       ├── javascript-adapter.safetensors
│   │       ├── rust-adapter.safetensors
│   │       └── cpp-adapter.safetensors
│   │
│   └── merged/                # Merged coding models
│       ├── code-instruct-merge/
│       │   └── model.safetensors
│       ├── multilang-merge/
│       │   └── model.safetensors
│       └── merge_recipes/
│           ├── coding_merge_v1.json
│           └── instruct_coding_merge.json
│
├── datasets/                  # Coding datasets
│   ├── training/
│   │   ├── code_samples/
│   │   ├── documentation/
│   │   └── problem_solutions/
│   ├── evaluation/
│   │   ├── humaneval/
│   │   ├── mbpp/
│   │   ├── codecontests/
│   │   └── custom_benchmarks/
│   └── instruction_tuning/
│       ├── code_alpaca/
│       ├── evol_instruct_code/
│       └── magicoder_data/
│
├── scripts/
│   ├── convert_to_safetensors.py    # Convert models to safetensors
│   ├── validate_safetensors.py      # Validate safetensors integrity
│   ├── quantize_coding_model.py     # Code-optimized quantization
│   ├── merge_coding_models.py       # Merge coding-specific models
│   ├── train_coding_adapter.py      # Train coding adapters
│   ├── evaluate_coding.py           # Code generation evaluation
│   └── benchmark_performance.py     # Performance benchmarks
│
├── evaluation/
│   ├── code_generation/
│   │   ├── python_eval.py
│   │   ├── javascript_eval.py
│   │   └── multilang_eval.py
│   ├── code_completion/
│   │   ├── completion_benchmark.py
│   │   └── context_accuracy.py
│   ├── code_understanding/
│   │   ├── bug_detection.py
│   │   ├── code_explanation.py
│   │   └── refactoring_suggestions.py
│   └── benchmarks/
│       ├── humaneval_results/
│       ├── mbpp_results/
│       └── custom_results/
│
├── tools/
│   ├── code_formatter.py
│   ├── syntax_validator.py
│   ├── dependency_analyzer.py
│   └── performance_profiler.py
│
└── docs/
    ├── coding_model_guide.md
    ├── safetensors_usage.md
    ├── evaluation_metrics.md
    └── api_reference.md
```

## Usage

To load the model and generate code using `transformers` and `safetensors`, run:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the safetensors model
model = AutoModelForCausalLM.from_pretrained(
    "likhonhfai/mysterious-coding-model",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("likhonhfai/mysterious-coding-model")

prompt = "def fibonacci(n):\n    \"\"\"Calculate the nth Fibonacci number\"\"\"\n"
# Move inputs to the model's device (needed when device_map places it on GPU)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    temperature=0.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For vLLM-based inference or to use quantized models (4‑bit, 8‑bit or AWQ), explore the subdirectories under `models/quantized/` and see the scripts for quantisation and evaluation.
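As a sketch of the vLLM path (assuming the model is published under the repository id used above and that vLLM is installed on a CUDA-capable machine), the OpenAI-compatible server can be started and queried like this:

```shell
# Install vLLM (most models require a CUDA-capable GPU)
pip install vllm

# Serve the model with an OpenAI-compatible API on port 8000
vllm serve likhonhfai/mysterious-coding-model --dtype float16

# From another terminal, request a completion
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "likhonhfai/mysterious-coding-model", "prompt": "def fibonacci(n):", "max_tokens": 64}'
```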

## Safetensors Format

All model weights are stored in `.safetensors` format. This binary format provides:

1. **Security** – loading the model doesn’t execute arbitrary code.
2. **Speed** – faster loading compared to pickle-based formats.
3. **Memory efficiency** – supports lazy loading.
4. **Cross-platform compatibility** – works across operating systems.
5. **Rich metadata** – makes it easier to inspect and validate model shards.

Refer to `scripts/convert_to_safetensors.py` to convert PyTorch checkpoints into safetensors.

## Quantisation

The `models/quantized/` directory contains 4‑bit, 8‑bit and AWQ quantised versions of the model. These variants reduce memory requirements and accelerate inference with minimal impact on accuracy. See `scripts/quantize_coding_model.py` for details.

## Evaluation

Benchmark scripts are available under `evaluation/` and `scripts/evaluate_coding.py`. Use them to run HumanEval, MBPP and other coding benchmarks. Example:

```bash
python scripts/evaluate_coding.py --benchmark humaneval
```

## ArXiv Reference

This model draws on techniques described in the paper ["gpt-oss-120b & gpt-oss-20b"](https://arxiv.org/abs/2508.10925), which details the training and capabilities of the open‑source GPT‑OSS models.

## Contribution

Contributions are welcome! Feel free to open issues or pull requests to improve the code or documentation, or to add new adapters and datasets.