File size: 4,115 Bytes
9a84809 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 |
---
pipeline_tag: text-generation
inference: true
widget:
- text: 'def print_hello_world():'
example_title: Hello world
group: Python
license: bigscience-openrail-m
datasets:
- books
- arxiv
- c4
- falcon-refinedweb
- wiki
- github-issues
- stack_markdown
library_name: transformers
tags:
- code
language:
- en
---
[](https://hf.co/QuantFactory)
# QuantFactory/Refact-1_6-base-GGUF
This is quantized version of [smallcloudai/Refact-1_6-base](https://huggingface.co/smallcloudai/Refact-1_6-base) created using llama.cpp
# Original Model Card

# Refact-1.6B-base
Finally, the model we started training with our [blog post](https://refact.ai/blog/2023/applying-recent-innovations-to-train-model/) is ready 🎉
The model might contain some problems, especially with the FIM format
# It Works As a Chat
The primary application of this model is code completion (infill) in multiple programming languages.
But it works as a chat quite well.
# Example
Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output:
```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "smallcloudai/Refact-1_6B-fim"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)
prompt = '<fim_prefix>def print_hello_world():\n """<fim_suffix>\n print("Hello world!")<fim_middle>'
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=100, temperature=0.2)
print("-"*80)
print(tokenizer.decode(outputs[0]))
```
# Chat Format
The same model works as chat (experimental).
```python
prompt_template = "<empty_output>SYSTEM {system}\n" \
"<empty_output>USER {query}\n" \
"<empty_output>ASSISTANT"
prompt = prompt_template.format(system="You are a programming assistant",
query="How do I sort a list in Python?")
```
# Architecture
As described in more detail in the blog post, we used:
- [ALiBi](https://arxiv.org/abs/2108.12409) based attention
- [LayerNorm](https://arxiv.org/abs/1607.06450v1) instead of [RMSNorm](https://arxiv.org/pdf/1910.07467.pdf)
- [Multi Query Attention](https://arxiv.org/abs/1911.02150)
We also used LiON, flash attention, early dropout. It's not that innovative that you can't run it, in fact you can -- see an example below.
# Training
For the base model, we used our own dataset that contains code with permissive licenses only, and open text datasets.
Filtering is the key to success of this model:
- We only used text in English
- Only topics related to computer science
- Applied heavy deduplication
The text to code proportion was 50:50, model trained for 1.2T tokens.
We don't release the base model, because its Fill-in-the-Middle (FIM) capability likes to repeat itself too much, so
its practical use is limited. But if you still want it, write us a message on Discord.
# Limitations and Bias
The Refact-1.6B model was trained on text in English. But it has seen a lot more languages in
code comments. Its performance on non-English languages is lower, for sure.
# Model Stats
- **Architecture:** LLAMA-like model with multi-query attention
- **Objectives** Fill-in-the-Middle, Chat
- **Tokens context:** 4096
- **Pretraining tokens:** 1.2T
- **Finetuning tokens:** 40B
- **Precision:** bfloat16
- **GPUs** 64 NVidia A5000
- **Training time** 28 days
# License
The model is licensed under the BigScience OpenRAIL-M v1 license agreement
# Citation
If you are using this model, please give a link to this page.
|