---
license: apache-2.0
tags:
- text-generation
- transformers
- safetensors
- conversational
pipeline_tag: text-generation
library_name: transformers
---
# Mysterious Coding Model
This repository contains a specialised AI model for agentic code generation and general text generation tasks. The model is inspired by the GPT‑OSS series (gpt-oss-20b and gpt-oss-120b) described in [the corresponding paper](https://arxiv.org/abs/2508.10925). It is built on the open‑source Llama architecture and fine‑tuned for programming assistance, conversation and multi‑language support.
## Key Features
- **Open source**: released under the Apache‑2.0 license.
- **Text and code generation**: supports code completion, bug fixing, refactoring and documentation generation.
- **Efficient storage**: models are stored in the secure and fast `safetensors` format.
- **Multiple precisions**: includes the base FP16 model as well as 4‑bit, 8‑bit and AWQ quantised variants (see `models/quantized/`).
- **vLLM compatibility**: compatible with the vLLM engine for high‑throughput inference.
- **Conversational**: instruction tuned for interactive coding assistance.
## Repository Structure
```
coding-model-repository/
├── README.md
├── .gitattributes                  # Updated for safetensors
├── .gitignore
├── requirements.txt
├── model_index.json                # Safetensors model index
├── config.json                     # Coding model configuration
├── model_card.md                   # Coding model documentation
│
├── models/
│   ├── library=safetensors/        # Main safetensors models directory
│   │   ├── base/
│   │   │   ├── model-00001-of-00003.safetensors
│   │   │   ├── model-00002-of-00003.safetensors
│   │   │   ├── model-00003-of-00003.safetensors
│   │   │   ├── model.safetensors.index.json
│   │   │   ├── config.json
│   │   │   ├── generation_config.json
│   │   │   └── tokenizer/
│   │   │       ├── tokenizer.json
│   │   │       ├── tokenizer_config.json
│   │   │       ├── vocab.json
│   │   │       ├── merges.txt
│   │   │       └── special_tokens_map.json
│   │   │
│   │   ├── quantized/
│   │   │   ├── 4bit/
│   │   │   │   ├── model.safetensors
│   │   │   │   └── quantization_config.json
│   │   │   ├── 8bit/
│   │   │   │   ├── model.safetensors
│   │   │   │   └── quantization_config.json
│   │   │   └── awq/
│   │   │       ├── model.safetensors
│   │   │       └── quant_config.json
│   │   │
│   │   ├── instruct/
│   │   │   ├── model.safetensors
│   │   │   ├── config.json
│   │   │   └── training_config.json
│   │   │
│   │   └── specialized/
│   │       ├── python-focused/
│   │       │   └── model.safetensors
│   │       ├── web-dev/
│   │       │   └── model.safetensors
│   │       ├── systems-programming/
│   │       │   └── model.safetensors
│   │       └── data-science/
│   │           └── model.safetensors
│   │
│   ├── adapters/                   # Safetensors adapters
│   │   ├── lora/
│   │   │   ├── adapter_model.safetensors
│   │   │   └── adapter_config.json
│   │   ├── coding-specific/
│   │   │   ├── debugging-adapter.safetensors
│   │   │   ├── refactoring-adapter.safetensors
│   │   │   └── documentation-adapter.safetensors
│   │   └── language-specific/
│   │       ├── python-adapter.safetensors
│   │       ├── javascript-adapter.safetensors
│   │       ├── rust-adapter.safetensors
│   │       └── cpp-adapter.safetensors
│   │
│   └── merged/                     # Merged coding models
│       ├── code-instruct-merge/
│       │   └── model.safetensors
│       ├── multilang-merge/
│       │   └── model.safetensors
│       └── merge_recipes/
│           ├── coding_merge_v1.json
│           └── instruct_coding_merge.json
│
├── datasets/                       # Coding datasets
│   ├── training/
│   │   ├── code_samples/
│   │   ├── documentation/
│   │   └── problem_solutions/
│   ├── evaluation/
│   │   ├── humaneval/
│   │   ├── mbpp/
│   │   ├── codecontests/
│   │   └── custom_benchmarks/
│   └── instruction_tuning/
│       ├── code_alpaca/
│       ├── evol_instruct_code/
│       └── magicoder_data/
│
├── scripts/
│   ├── convert_to_safetensors.py   # Convert models to safetensors
│   ├── validate_safetensors.py     # Validate safetensors integrity
│   ├── quantize_coding_model.py    # Code-optimized quantization
│   ├── merge_coding_models.py      # Merge coding-specific models
│   ├── train_coding_adapter.py     # Train coding adapters
│   ├── evaluate_coding.py          # Code generation evaluation
│   └── benchmark_performance.py    # Performance benchmarks
│
├── evaluation/
│   ├── code_generation/
│   │   ├── python_eval.py
│   │   ├── javascript_eval.py
│   │   └── multilang_eval.py
│   ├── code_completion/
│   │   ├── completion_benchmark.py
│   │   └── context_accuracy.py
│   ├── code_understanding/
│   │   ├── bug_detection.py
│   │   ├── code_explanation.py
│   │   └── refactoring_suggestions.py
│   └── benchmarks/
│       ├── humaneval_results/
│       ├── mbpp_results/
│       └── custom_results/
│
├── tools/
│   ├── code_formatter.py
│   ├── syntax_validator.py
│   ├── dependency_analyzer.py
│   └── performance_profiler.py
│
└── docs/
    ├── coding_model_guide.md
    ├── safetensors_usage.md
    ├── evaluation_metrics.md
    └── api_reference.md
```
## Usage
To load the model and generate code using `transformers` and `safetensors`, run:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the safetensors model
model = AutoModelForCausalLM.from_pretrained(
    "likhonhfai/mysterious-coding-model",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("likhonhfai/mysterious-coding-model")

prompt = "def fibonacci(n):\n    \"\"\"Calculate the nth Fibonacci number\"\"\"\n"
# Move the inputs to the same device the model was placed on
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    temperature=0.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For vLLM-based inference or to use quantized models (4‑bit, 8‑bit or AWQ), explore the subdirectories under `models/quantized/` and see the scripts for quantisation and evaluation.
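As a sketch only: offline batch inference with vLLM typically looks like the following, assuming the model ID above, a CUDA-capable GPU and the `vllm` package installed.

```python
# Sketch of vLLM offline inference (assumes a GPU and `pip install vllm`).
from vllm import LLM, SamplingParams

llm = LLM(model="likhonhfai/mysterious-coding-model", dtype="float16")

params = SamplingParams(temperature=0.1, top_p=0.95, max_tokens=64)
prompts = ["def fibonacci(n):\n    \"\"\"Calculate the nth Fibonacci number\"\"\"\n"]

# Each result carries the prompt plus one or more generated completions
for result in llm.generate(prompts, params):
    print(result.outputs[0].text)
```

The same sampling settings as the `transformers` example above are used, so the two paths should produce comparable completions at much higher throughput under load.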
## Safetensors Format
All model weights are stored in `.safetensors` format. This binary format provides:
1. **Security** – loading the model doesn’t execute arbitrary code.
2. **Speed** – faster loading compared to pickle-based formats.
3. **Memory efficiency** – supports lazy loading.
4. **Cross-platform compatibility** – works across operating systems.
5. **Rich metadata** – makes it easier to inspect and validate model shards.
Refer to `scripts/convert_to_safetensors.py` to convert PyTorch checkpoints into safetensors.
## Quantisation
The `models/quantized/` directory contains 4‑bit, 8‑bit and AWQ quantised versions of the model. These variants reduce memory requirements and accelerate inference with minimal impact on accuracy. See `scripts/quantize_coding_model.py` for details.
## Evaluation
Benchmark scripts are available under `evaluation/` and `scripts/evaluate_coding.py`. Use them to run HumanEval, MBPP and other coding benchmarks. Example:
```bash
python scripts/evaluate_coding.py --benchmark humaneval
```
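Conceptually, benchmarks like HumanEval and MBPP measure functional correctness: each generated candidate is executed against the task's unit tests and the pass rate is reported. The following stdlib-only sketch shows that core loop with hypothetical candidates; real harnesses (including, presumably, `scripts/evaluate_coding.py`) sandbox the execution, which this sketch does not.

```python
# Sketch of the core loop behind functional-correctness benchmarks.
# WARNING: exec() of untrusted model output must be sandboxed in practice.

def passes(candidate_src: str, test_src: str) -> bool:
    """Run a candidate and the task's assertions in a fresh namespace."""
    ns = {}
    try:
        exec(candidate_src, ns)  # define the candidate function
        exec(test_src, ns)       # run the benchmark's assertions against it
        return True
    except Exception:
        return False

# Two hypothetical model completions for the same task
candidates = [
    "def add(a, b):\n    return a + b",   # correct
    "def add(a, b):\n    return a - b",   # buggy
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

pass_rate = sum(passes(c, tests) for c in candidates) / len(candidates)
print(f"pass rate: {pass_rate:.2f}")  # → pass rate: 0.50
```

Metrics like pass@k generalise this by sampling k candidates per task and counting a task as solved if any of them passes.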
## ArXiv Reference
This model draws on techniques described in the paper ["gpt-oss-120b & gpt-oss-20b Model Card"](https://arxiv.org/abs/2508.10925), which details the training and capabilities of the open‑source GPT‑OSS models.
## Contribution
Contributions are welcome! Feel free to open issues or pull requests to improve the code, documentation, or add new adapters and datasets.