---
license: apache-2.0
tags:
- text-generation
- transformers
- safetensors
- conversational
pipeline_tag: text-generation
library_name: transformers
---
# Mysterious Coding Model

This repository contains a specialised AI model for agentic code generation and text-generation tasks. The model is inspired by the GPT-OSS series (gpt-oss-20b and gpt-oss-120b) described in [the corresponding paper](https://arxiv.org/abs/2508.10925). It is built on the open-source Llama architecture and fine-tuned for programming assistance, conversation and multi-language support.

## Key Features

- **Open source**: released under the Apache-2.0 license.
- **Text and code generation**: supports code completion, bug fixing, refactoring and documentation generation.
- **Efficient storage**: weights are stored in the secure, fast-loading `safetensors` format.
- **Multiple precisions**: includes base FP16 models plus 4-bit, 8-bit and AWQ quantised variants.
- **vLLM compatibility**: works with the vLLM engine for high-throughput inference.
- **Conversational**: instruction-tuned for interactive coding assistance.
## Repository Structure

```
coding-model-repository/
├── README.md
├── .gitattributes                  # Updated for safetensors
├── .gitignore
├── requirements.txt
├── model_index.json                # Safetensors model index
├── config.json                     # Coding model configuration
├── model_card.md                   # Coding model documentation
│
├── models/
│   ├── library=safetensors/        # Main safetensors models directory
│   │   ├── base/
│   │   │   ├── model-00001-of-00003.safetensors
│   │   │   ├── model-00002-of-00003.safetensors
│   │   │   ├── model-00003-of-00003.safetensors
│   │   │   ├── model.safetensors.index.json
│   │   │   ├── config.json
│   │   │   ├── generation_config.json
│   │   │   └── tokenizer/
│   │   │       ├── tokenizer.json
│   │   │       ├── tokenizer_config.json
│   │   │       ├── vocab.json
│   │   │       ├── merges.txt
│   │   │       └── special_tokens_map.json
│   │   │
│   │   ├── quantized/
│   │   │   ├── 4bit/
│   │   │   │   ├── model.safetensors
│   │   │   │   └── quantization_config.json
│   │   │   ├── 8bit/
│   │   │   │   ├── model.safetensors
│   │   │   │   └── quantization_config.json
│   │   │   └── awq/
│   │   │       ├── model.safetensors
│   │   │       └── quant_config.json
│   │   │
│   │   ├── instruct/
│   │   │   ├── model.safetensors
│   │   │   ├── config.json
│   │   │   └── training_config.json
│   │   │
│   │   └── specialized/
│   │       ├── python-focused/
│   │       │   └── model.safetensors
│   │       ├── web-dev/
│   │       │   └── model.safetensors
│   │       ├── systems-programming/
│   │       │   └── model.safetensors
│   │       └── data-science/
│   │           └── model.safetensors
│   │
│   ├── adapters/                   # Safetensors adapters
│   │   ├── lora/
│   │   │   ├── adapter_model.safetensors
│   │   │   └── adapter_config.json
│   │   ├── coding-specific/
│   │   │   ├── debugging-adapter.safetensors
│   │   │   ├── refactoring-adapter.safetensors
│   │   │   └── documentation-adapter.safetensors
│   │   └── language-specific/
│   │       ├── python-adapter.safetensors
│   │       ├── javascript-adapter.safetensors
│   │       ├── rust-adapter.safetensors
│   │       └── cpp-adapter.safetensors
│   │
│   └── merged/                     # Merged coding models
│       ├── code-instruct-merge/
│       │   └── model.safetensors
│       ├── multilang-merge/
│       │   └── model.safetensors
│       └── merge_recipes/
│           ├── coding_merge_v1.json
│           └── instruct_coding_merge.json
│
├── datasets/                       # Coding datasets
│   ├── training/
│   │   ├── code_samples/
│   │   ├── documentation/
│   │   └── problem_solutions/
│   ├── evaluation/
│   │   ├── humaneval/
│   │   ├── mbpp/
│   │   ├── codecontests/
│   │   └── custom_benchmarks/
│   └── instruction_tuning/
│       ├── code_alpaca/
│       ├── evol_instruct_code/
│       └── magicoder_data/
│
├── scripts/
│   ├── convert_to_safetensors.py   # Convert models to safetensors
│   ├── validate_safetensors.py     # Validate safetensors integrity
│   ├── quantize_coding_model.py    # Code-optimized quantization
│   ├── merge_coding_models.py      # Merge coding-specific models
│   ├── train_coding_adapter.py     # Train coding adapters
│   ├── evaluate_coding.py          # Code generation evaluation
│   └── benchmark_performance.py    # Performance benchmarks
│
├── evaluation/
│   ├── code_generation/
│   │   ├── python_eval.py
│   │   ├── javascript_eval.py
│   │   └── multilang_eval.py
│   ├── code_completion/
│   │   ├── completion_benchmark.py
│   │   └── context_accuracy.py
│   ├── code_understanding/
│   │   ├── bug_detection.py
│   │   ├── code_explanation.py
│   │   └── refactoring_suggestions.py
│   └── benchmarks/
│       ├── humaneval_results/
│       ├── mbpp_results/
│       └── custom_results/
│
├── tools/
│   ├── code_formatter.py
│   ├── syntax_validator.py
│   ├── dependency_analyzer.py
│   └── performance_profiler.py
│
└── docs/
    ├── coding_model_guide.md
    ├── safetensors_usage.md
    ├── evaluation_metrics.md
    └── api_reference.md
```
## Usage

To load the model and generate code with `transformers` and `safetensors`, run:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the safetensors model
model = AutoModelForCausalLM.from_pretrained(
    "likhonhfai/mysterious-coding-model",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("likhonhfai/mysterious-coding-model")

prompt = 'def fibonacci(n):\n    """Calculate the nth Fibonacci number"""\n'
# Move the inputs to the same device the model was dispatched to.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    temperature=0.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For vLLM-based inference, or to use the quantised models (4-bit, 8-bit or AWQ), explore the subdirectories under `models/library=safetensors/quantized/` and see the scripts for quantisation and evaluation.
## Safetensors Format

All model weights are stored in the `.safetensors` format. This binary format provides:

1. **Security**: loading a model doesn't execute arbitrary code.
2. **Speed**: faster loading than pickle-based formats.
3. **Memory efficiency**: supports lazy loading.
4. **Cross-platform compatibility**: works across operating systems.
5. **Rich metadata**: makes it easier to inspect and validate model shards.

Refer to `scripts/convert_to_safetensors.py` to convert PyTorch checkpoints into safetensors.
## Quantisation

The `models/library=safetensors/quantized/` directory contains 4-bit, 8-bit and AWQ quantised versions of the model. These variants reduce memory requirements and accelerate inference with minimal impact on accuracy. See `scripts/quantize_coding_model.py` for details.
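The exact contents of each `quantization_config.json` depend on the quantisation backend. As a purely illustrative sketch (field names follow the bitsandbytes-style config serialised by `transformers`; the AWQ variant's `quant_config.json` uses its own schema), an 8-bit config contains entries like:

```json
{
  "quant_method": "bitsandbytes",
  "load_in_8bit": true,
  "load_in_4bit": false,
  "llm_int8_threshold": 6.0
}
```

Check the backend's documentation before editing these files by hand, since loaders validate the fields against the chosen quantisation method.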
## Evaluation

Benchmark scripts are available under `evaluation/` and `scripts/evaluate_coding.py`. Use them to run HumanEval, MBPP and other coding benchmarks. Example:

```bash
python scripts/evaluate_coding.py --benchmark humaneval
```
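HumanEval and MBPP results are conventionally reported as pass@k. A minimal sketch of the unbiased estimator from the HumanEval paper, for `n` generated samples per problem of which `c` pass the tests (the function name is illustrative, not part of this repo's scripts):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn
    (without replacement) from n, of which c are correct, passes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per problem, 3 correct: pass@1 is the plain success rate.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

Averaging this quantity over all benchmark problems gives the headline pass@k score.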
## ArXiv Reference

This model draws on techniques described in the paper ["gpt-oss-120b & gpt-oss-20b"](https://arxiv.org/abs/2508.10925), which details the training and capabilities of the open-source GPT-OSS models.

## Contribution

Contributions are welcome! Feel free to open issues or pull requests to improve the code or documentation, or to add new adapters and datasets.