Upload 12 files
- DataSet/Json/gradio.json +0 -0
- DataSet/Json/huggyfacefintuning.json +1 -0
- DataSet/Json/openchatkitreadme.json +1 -0
- DataSet/Json/train.json +0 -0
- DataSet/train.json +0 -0
- README.md +192 -0
- pytorch_model.bin.index.json +671 -0
- special_tokens_map.json +5 -0
- tokenizer.json +0 -0
- tokenizer_config.json +9 -0
- train.json +0 -0
DataSet/Json/gradio.json
ADDED
The diff for this file is too large to render. See raw diff.
DataSet/Json/huggyfacefintuning.json
ADDED
@@ -0,0 +1 @@
[{"finetuning":"Fine-tune a pretrained model\n\nThere are significant benefits to using a pretrained model. It reduces computation costs, your carbon footprint, and allows you to use state-of-the-art models without having to train one from scratch. 🤗 Transformers provides access to thousands of pretrained models for a wide range of tasks. When you use a pretrained model, you train it on a dataset specific to your task. This is known as fine-tuning, an incredibly powerful training technique. In this tutorial, you will fine-tune a pretrained model with a deep learning framework of your choice:\n\nFine-tune a pretrained model with 🤗 Transformers Trainer.\nFine-tune a pretrained model in TensorFlow with Keras.\nFine-tune a pretrained model in native PyTorch.\nPrepare a dataset\n\nBefore you can fine-tune a pretrained model, download a dataset and prepare it for training. The previous tutorial showed you how to process data for training, and now you get an opportunity to put those skills to the test!\n\nBegin by loading the Yelp Reviews dataset:\n\nCopied\n>>> from datasets import load_dataset\n\n>>> dataset = load_dataset(\"yelp_review_full\")\n>>> dataset[\"train\"][100]\n{'label': 0,\n 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\\\nThe cashier took my friends\\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\\\\"serving off their orders\\\\\" when they didn\\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\\\nThe manager was rude when giving me my order. She didn\\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\\\nI\\'ve eaten at various McDonalds restaurants for over 30 years. I\\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}\n\nAs you now know, you need a tokenizer to process the text and include a padding and truncation strategy to handle any variable sequence lengths. To process your dataset in one step, use 🤗 Datasets map method to apply a preprocessing function over the entire dataset:\n\nCopied\n>>> from transformers import AutoTokenizer\n\n>>> tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n\n\n>>> def tokenize_function(examples):\n... 
return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True)\n\n\n>>> tokenized_datasets = dataset.map(tokenize_function, batched=True)\n\nIf you like, you can create a smaller subset of the full dataset to fine-tune on to reduce the time it takes:\n\nCopied\n>>> small_train_dataset = tokenized_datasets[\"train\"].shuffle(seed=42).select(range(1000))\n>>> small_eval_dataset = tokenized_datasets[\"test\"].shuffle(seed=42).select(range(1000))\nTrain\n\nAt this point, you should follow the section corresponding to the framework you want to use. You can use the links in the right sidebar to jump to the one you want - and if you want to hide all of the content for a given framework, just use the button at the top-right of that framework’s block!\n\nPytorch\nHide Pytorch content\nTrain with PyTorch Trainer\n\n🤗 Transformers provides a Trainer class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The Trainer API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.\n\nStart by loading your model and specify the number of expected labels. From the Yelp Review dataset card, you know there are five labels:\n\nCopied\n>>> from transformers import AutoModelForSequenceClassification\n\n>>> model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased\", num_labels=5)\n\nYou will see a warning about some of the pretrained weights not being used and some weights being randomly initialized. Don’t worry, this is completely normal! The pretrained head of the BERT model is discarded, and replaced with a randomly initialized classification head. You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it.\n\nTraining hyperparameters\n\nNext, create a TrainingArguments class which contains all the hyperparameters you can tune as well as flags for activating different training options. For this tutorial you can start with the default training hyperparameters, but feel free to experiment with these to find your optimal settings.\n\nSpecify where to save the checkpoints from your training:\n\nCopied\n>>> from transformers import TrainingArguments\n\n>>> training_args = TrainingArguments(output_dir=\"test_trainer\")\nEvaluate\n\nTrainer does not automatically evaluate model performance during training. You’ll need to pass Trainer a function to compute and report metrics. The 🤗 Evaluate library provides a simple accuracy function you can load with the evaluate.load (see this quicktour for more information) function:\n\nCopied\n>>> import numpy as np\n>>> import evaluate\n\n>>> metric = evaluate.load(\"accuracy\")\n\nCall compute on metric to calculate the accuracy of your predictions. Before passing your predictions to compute, you need to convert the predictions to logits (remember all 🤗 Transformers models return logits):\n\nCopied\n>>> def compute_metrics(eval_pred):\n... logits, labels = eval_pred\n... predictions = np.argmax(logits, axis=-1)\n... 
return metric.compute(predictions=predictions, references=labels)\n\nIf you’d like to monitor your evaluation metrics during fine-tuning, specify the evaluation_strategy parameter in your training arguments to report the evaluation metric at the end of each epoch:\n\nCopied\n>>> from transformers import TrainingArguments, Trainer\n\n>>> training_args = TrainingArguments(output_dir=\"test_trainer\", evaluation_strategy=\"epoch\")\nTrainer\n\nCreate a Trainer object with your model, training arguments, training and test datasets, and evaluation function:\n\nCopied\n>>> trainer = Trainer(\n... model=model,\n... args=training_args,\n... train_dataset=small_train_dataset,\n... eval_dataset=small_eval_dataset,\n... compute_metrics=compute_metrics,\n... )\n\nThen fine-tune your model by calling train():\n\nCopied\n>>> trainer.train()\nTensorFlow\nHide TensorFlow content\nTrain a TensorFlow model with Keras\n\nYou can also train 🤗 Transformers models in TensorFlow with the Keras API!\n\nLoading data for Keras\n\nWhen you want to train a 🤗 Transformers model with the Keras API, you need to convert your dataset to a format that Keras understands. If your dataset is small, you can just convert the whole thing to NumPy arrays and pass it to Keras. Let’s try that first before we do anything more complicated.\n\nFirst, load a dataset. We’ll use the CoLA dataset from the GLUE benchmark, since it’s a simple binary text classification task, and just take the training split for now.\n\nCopied\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"glue\", \"cola\")\ndataset = dataset[\"train\"] # Just take the training split for now\n\nNext, load a tokenizer and tokenize the data as NumPy arrays. Note that the labels are already a list of 0 and 1s, so we can just convert that directly to a NumPy array without tokenization!\n\nCopied\nfrom transformers import AutoTokenizer\n\ntokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\ntokenized_data = tokenizer(dataset[\"sentence\"], return_tensors=\"np\", padding=True)\n# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras\ntokenized_data = dict(tokenized_data)\n\nlabels = np.array(dataset[\"label\"]) # Label is already an array of 0 and 1\n\nFinally, load, compile, and fit the model:\n\nCopied\nfrom transformers import TFAutoModelForSequenceClassification\nfrom tensorflow.keras.optimizers import Adam\n\n# Load and compile our model\nmodel = TFAutoModelForSequenceClassification.from_pretrained(\"bert-base-cased\")\n# Lower learning rates are often better for fine-tuning transformers\nmodel.compile(optimizer=Adam(3e-5))\n\nmodel.fit(tokenized_data, labels)\n\nYou don’t have to pass a loss argument to your models when you compile() them! Hugging Face models automatically choose a loss that is appropriate for their task and model architecture if this argument is left blank. You can always override this by specifying a loss yourself if you want to!\n\nThis approach works great for smaller datasets, but for larger datasets, you might find it starts to become a problem. Why? Because the tokenized array and labels would have to be fully loaded into memory, and because NumPy doesn’t handle “jagged” arrays, so every tokenized sample would have to be padded to the length of the longest sample in the whole dataset. 
That’s going to make your array even bigger, and all those padding tokens will slow down training too!\n\nLoading data as a tf.data.Dataset\n\nIf you want to avoid slowing down training, you can load your data as a tf.data.Dataset instead. Although you can write your own tf.data pipeline if you want, we have two convenience methods for doing this:\n\nprepare_tf_dataset(): This is the method we recommend in most cases. Because it is a method on your model, it can inspect the model to automatically figure out which columns are usable as model inputs, and discard the others to make a simpler, more performant dataset.\nto_tf_dataset: This method is more low-level, and is useful when you want to exactly control how your dataset is created, by specifying exactly which columns and label_cols to include.\n\nBefore you can use prepare_tf_dataset(), you will need to add the tokenizer outputs to your dataset as columns, as shown in the following code sample:\n\nCopied\ndef tokenize_dataset(data):\n # Keys of the returned dictionary will be added to the dataset as columns\n return tokenizer(data[\"text\"])\n\n\ndataset = dataset.map(tokenize_dataset)\n\nRemember that Hugging Face datasets are stored on disk by default, so this will not inflate your memory usage! Once the columns have been added, you can stream batches from the dataset and add padding to each batch, which greatly reduces the number of padding tokens compared to padding the entire dataset.\n\nCopied\n>>> tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer)\n\nNote that in the code sample above, you need to pass the tokenizer to prepare_tf_dataset so it can correctly pad batches as they’re loaded. If all the samples in your dataset are the same length and no padding is necessary, you can skip this argument. If you need to do something more complex than just padding samples (e.g. corrupting tokens for masked language modelling), you can use the collate_fn argument instead to pass a function that will be called to transform the list of samples into a batch and apply any preprocessing you want. See our examples or notebooks to see this approach in action.\n\nOnce you’ve created a tf.data.Dataset, you can compile and fit the model as before:\n\nCopied\nmodel.compile(optimizer=Adam(3e-5))\n\nmodel.fit(tf_dataset)\nTrain in native PyTorch\nPytorch\nHide Pytorch content\n\nTrainer takes care of the training loop and allows you to fine-tune a model in a single line of code. 
For users who prefer to write their own training loop, you can also fine-tune a 🤗 Transformers model in native PyTorch.\n\nAt this point, you may need to restart your notebook or execute the following code to free some memory:\n\nCopied\ndel model\ndel trainer\ntorch.cuda.empty_cache()\n\nNext, manually postprocess tokenized_dataset to prepare it for training.\n\nRemove the text column because the model does not accept raw text as an input:\n\nCopied\n>>> tokenized_datasets = tokenized_datasets.remove_columns([\"text\"])\n\nRename the label column to labels because the model expects the argument to be named labels:\n\nCopied\n>>> tokenized_datasets = tokenized_datasets.rename_column(\"label\", \"labels\")\n\nSet the format of the dataset to return PyTorch tensors instead of lists:\n\nCopied\n>>> tokenized_datasets.set_format(\"torch\")\n\nThen create a smaller subset of the dataset as previously shown to speed up the fine-tuning:\n\nCopied\n>>> small_train_dataset = tokenized_datasets[\"train\"].shuffle(seed=42).select(range(1000))\n>>> small_eval_dataset = tokenized_datasets[\"test\"].shuffle(seed=42).select(range(1000))\nDataLoader\n\nCreate a DataLoader for your training and test datasets so you can iterate over batches of data:\n\nCopied\n>>> from torch.utils.data import DataLoader\n\n>>> train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)\n>>> eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)\n\nLoad your model with the number of expected labels:\n\nCopied\n>>> from transformers import AutoModelForSequenceClassification\n\n>>> model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased\", num_labels=5)\nOptimizer and learning rate scheduler\n\nCreate an optimizer and learning rate scheduler to fine-tune the model. Let’s use the AdamW optimizer from PyTorch:\n\nCopied\n>>> from torch.optim import AdamW\n\n>>> optimizer = AdamW(model.parameters(), lr=5e-5)\n\nCreate the default learning rate scheduler from Trainer:\n\nCopied\n>>> from transformers import get_scheduler\n\n>>> num_epochs = 3\n>>> num_training_steps = num_epochs * len(train_dataloader)\n>>> lr_scheduler = get_scheduler(\n... name=\"linear\", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps\n... )\n\nLastly, specify device to use a GPU if you have access to one. Otherwise, training on a CPU may take several hours instead of a couple of minutes.\n\nCopied\n>>> import torch\n\n>>> device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")\n>>> model.to(device)\n\nGet free access to a cloud GPU if you don’t have one with a hosted notebook like Colaboratory or SageMaker StudioLab.\n\nGreat, now you are ready to train! 🥳\n\nTraining loop\n\nTo keep track of your training progress, use the tqdm library to add a progress bar over the number of training steps:\n\nCopied\n>>> from tqdm.auto import tqdm\n\n>>> progress_bar = tqdm(range(num_training_steps))\n\n>>> model.train()\n>>> for epoch in range(num_epochs):\n... for batch in train_dataloader:\n... batch = {k: v.to(device) for k, v in batch.items()}\n... outputs = model(**batch)\n... loss = outputs.loss\n... loss.backward()\n\n... optimizer.step()\n... lr_scheduler.step()\n... optimizer.zero_grad()\n... progress_bar.update(1)\nEvaluate\n\nJust like how you added an evaluation function to Trainer, you need to do the same when you write your own training loop. 
But instead of calculating and reporting the metric at the end of each epoch, this time you’ll accumulate all the batches with add_batch and calculate the metric at the very end.\n\nCopied\n>>> import evaluate\n\n>>> metric = evaluate.load(\"accuracy\")\n>>> model.eval()\n>>> for batch in eval_dataloader:\n... batch = {k: v.to(device) for k, v in batch.items()}\n... with torch.no_grad():\n... outputs = model(**batch)\n\n... logits = outputs.logits\n... predictions = torch.argmax(logits, dim=-1)\n... metric.add_batch(predictions=predictions, references=batch[\"labels\"])\n\n>>> metric.compute()\nAdditional resources\n\nFor more fine-tuning examples, refer to:\n\n🤗 Transformers Examples includes scripts to train common NLP tasks in PyTorch and TensorFlow.\n\n🤗 Transformers Notebooks contains various notebooks on how to fine-tune a model for specific tasks in PyTorch and TensorFlow."}]
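The DataSet JSON files are flat arrays of records, each carrying one long text field (here `finetuning`). As a minimal sketch, assuming it is run from the repository root and that the field name matches the record above (other files such as train.json may use different keys), such a file can be loaded with the 🤗 Datasets JSON loader:

```python
from datasets import load_dataset

# The "finetuning" key matches the record shown above; other DataSet
# files may use different keys (e.g. "propertyName1" in openchatkitreadme.json).
ds = load_dataset("json", data_files="DataSet/Json/huggyfacefintuning.json", split="train")
print(ds[0]["finetuning"][:200])
```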
DataSet/Json/openchatkitreadme.json
ADDED
@@ -0,0 +1 @@
[{"propertyName1":"OpenChatKit\n\nOpenChatKit provides a powerful, open-source base to create both specialized and general purpose chatbots for various applications. The kit includes an instruction-tuned 20 billion parameter language model, a 6 billion parameter moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. It was trained on the OIG-43M training dataset, which was a collaboration between Together, LAION, and Ontocord.ai. Much more than a model release, this is the beginning of an open source project. We are releasing a set of tools and processes for ongoing improvement with community contributions.\n\nIn this repo, you'll find code for:\n\nTraining an OpenChatKit model\nTesting inference using the model\nAugmenting the model with additional context from a retrieval index\nContents\nRequirements\nPre-trained Weights\nDatasets\nData Contributions\nPretrained Base Model\nTraining and Finetuning\n(Optional) 8bit Adam\nTrain GPT-NeoX-Chat-Base-20B\nConverting Weights to Huggingface Format\nInference\nMonitoring\nLoguru\nWeights & Biases\nExperimental: Retrieval-Augmented Models\nLicense\nCiting OpenChatKit\nAcknowledgements\nRequirements\n\nBefore you begin, you need to install PyTorch and other dependencies.\n\nInstall Miniconda from their website.\nCreate an environment called OpenChatKit using the environment.yml file at the root of this repo.\nconda env create -f environment.yml\n\nThis repo also uses Git LFS to manage some files. Install it using the instructions on their site then run:\n\ngit lfs install\nPre-trained Weights\n\nGPT-NeoXT-Chat-Base-20B is a 20B-parameter variant of GPT-NeoX, fine-tuned on conversational datasets. We are releasing pre-trained weights for this model as togethercomputer/GPT-NeoXT-Chat-Base-20B on Huggingface.\n\nMore details can be found on the model card for GPT-NeoXT-Chat-Base-20B on Huggingface.\n\nDatasets\n\nThe chat model was trained on the OIG dataset built by LAION, Together, and Ontocord.ai. To download the dataset from Huggingface run the command below from the root of the repo.\n\npython data/OIG/prepare.py\n\nOnce the command completes, the data will be in the data/OIG/files directory.\n\nData Contributions\n\nYou can help make this chat model better by contributing data! See the OpenDataHub repo for more details.\n\nPretrained Base Model\n\nAs mentioned above, the chat model is a fine-tuned variant of GPT-NeoX-20B from Eleuther AI. To download GPT-NeoX-20B and prepare it for fine tuning, run this command from the root of the repo.\n\npython pretrained/GPT-NeoX-20B/prepare.py\n\nThe weights for this model will be in the pretrained/GPT-NeoX-20B/EleutherAI_gpt-neox-20b.\n\nTraining and Finetuning\n(Optional) 8bit Adam\n\nTo use 8bit-adam during training, install the bitsandbytes package.\n\npip install bitsandbytes # optional, to use 8bit-adam\nTrain GPT-NeoX-Chat-Base-20B\n\nThe training/finetune_GPT-NeoXT-Chat-Base-20B.sh script configures and runs the training loop. 
After downloading the dataset and the base model, run:\n\nbash training/finetune_GPT-NeoXT-Chat-Base-20B.sh\n\nThe script launches 8 processes with a pipeline-parallel degree of 8 and a data-parallel degree of 1.\n\nAs the training loop runs, checkpoints are saved to the model_ckpts directory at the root of the repo.\n\nPlease see the training README for more details about customizing the training run.\n\nConverting Weights to Huggingface Format\n\nBefore you can use this model to perform inference, it must be converted to the Hugginface format.\n\nmkdir huggingface_models \\\n&& python tools/convert_to_hf_gptneox.py \\\n --ckpt-path model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_5 \n --save-path /huggingface_models/GPT-NeoXT-Chat-Base-20B \n --n-stages 8 \n --n-layer-per-stage 6\nInference\n\nTo help you test the model, we provide a simple test command line test harness to interact with the bot.\n\npython inference/bot.py\n\nBy default the script will load the model named GPT-NeoXT-Chat-Base-20B model under the huggingface_models directory, but you can override that behavior by specifying --model.\n\nFor example, if you want to load the base model from our Huggingface, repo, you can run the following command which downloads the weights from HuggingFace.\n\npython inference/bot.py --model togethercomputer/GPT-NeoXT-Chat-Base-20B\n\nOnce the model has loaded, enter text at the prompt and the model will reply.\n\n$ python inference/bot.py \nLoading /home/csris/src/github.com/togethercomputer/OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:1...\nWelcome to OpenChatKit shell. Type /help or /? to list commands.\n\n>>> Hello.\nSetting `pad_token_id` to `eos_token_id`:0 for open-end generation.\nHello human.\n\n>>> \n\nCommands are prefixed with a /, and the /quit command exits.\n\nMonitoring\n\nBy default, the training script simply prints the loss as training proceeds, but it can also output metrics to a file using loguru or report them to Weights & Biases.\n\nLoguru\n\nAdd the flag --train-log-backend loguru to your training script to log to ./logs/file_{time}.log\n\nWeights & Biases\n\nTo use Weights & Biases, first login with your Weights & Biases token.\n\nwandb login\n\nAnd set --train-log-backend wandb in the training script to enable logging to Weights & Biases.\n\nExperimental: Retrieval-Augmented Models\n\nNote: Retrieval is still experimental.\n\nThe code in /retrieval implements a python package for querying a Faiss index of Wikipedia. The following steps explain how to use this index to augment queries in the test harness with context from the retriever.\n\nDownload the Wikipedia index.\npython data/wikipedia-3sentence-level-retrieval-index/prepare.py\nRun the bot with the --retrieval flag.\npython inference/bot.py --retrieval\n\nAfter starting, the bot will load both the chat model and the retrieval index, which takes a long time. Once the model and the index are loaded, all queries will be augmented with extra context.\n\n$ python inference/bot.py --retrieval\nLoading /OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:0...\nLoading retrieval index...\nWelcome to OpenChatKit shell. Type /help or /? to list commands.\n\n>>> Where is Zurich?\nSetting `pad_token_id` to `eos_token_id`:0 for open-end generation.\nWhere is Zurich?\nZurich is located in Switzerland.\n\n>>>\nLicense\n\nAll code in this repository was developed by Together Computer except where otherwise noted. Copyright (c) 2023, Together Computer. All rights reserved. 
The code is licensed under the Apache 2.0 license.\n\nCopyright 2023 Together Computer\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n\n\nThis repository also contains code written by a number of other authors. Such contributions are marked and the relevant licensing is included where appropriate.\n\nFor full terms, see the LICENSE file. If you have any questions, comments, or concerns about licensing please contact us.\n\nCiting OpenChatKit\n@software{openchatkit,\n title = {{OpenChatKit: An Open Toolkit and Base Model for Dialogue-style Applications}},\n author = {Together Computer},\n url = {https://github.com/togethercomputer/OpenChatKit}\n month = {3},\n year = {2023},\n version = {0.15},\n}\nAcknowledgements\n\nOur model is a fine-tuned version of gpt-neox-20b, a large language model trained by Eleuther AI. We evaluated our model on HELM provided by the Center for Research on Foundation Models. And we collaborated with both CRFM and HazyResearch at Stanford to build this model.\n\nWe collaborated with LAION and Ontocord.ai to build the training data used to fine tune this model."}]
DataSet/Json/train.json
ADDED
The diff for this file is too large to render. See raw diff.
DataSet/train.json
ADDED
The diff for this file is too large to render. See raw diff.
README.md
ADDED
@@ -0,0 +1,192 @@
---
license: apache-2.0
language:
- en
---

***<p style="font-size: 24px">Feel free to try out our [OpenChatKit feedback app](https://huggingface.co/spaces/togethercomputer/OpenChatKit)!</p>***

# GPT-NeoXT-Chat-Base-20B

> TLDR: As part of OpenChatKit (codebase available [here](https://github.com/togethercomputer/OpenChatKit)),
> GPT-NeoXT-Chat-Base-20B is a 20B parameter language model, fine-tuned from EleutherAI’s GPT-NeoX with over 40 million instructions on 100% carbon negative compute.

GPT-NeoXT-Chat-Base-20B is based on EleutherAI’s GPT-NeoX model, and is fine-tuned with data focusing on dialog-style interactions.
We focused the tuning on several tasks such as question answering, classification, extraction, and summarization.
We’ve fine-tuned the model with a collection of 43 million high-quality instructions.
Together partnered with LAION and Ontocord.ai, who both helped curate the dataset the model is based on.
You can read more about this process and the availability of this dataset in LAION’s blog post [here](https://laion.ai/blog/oig-dataset/).

## Model Details
- **Developed by**: Together Computer.
- **Model type**: Language Model
- **Language(s)**: English
- **License**: Apache 2.0
- **Model Description**: A 20B parameter open source chat model, fine-tuned from EleutherAI’s NeoX with over 40M instructions on 100% carbon negative compute
- **Resources for more information**: [GitHub Repository](https://github.com/togethercomputer/OpenChatKit).

# Quick Start

```python
from transformers import pipeline
pipe = pipeline(model='togethercomputer/GPT-NeoXT-Chat-Base-20B')
pipe('''<human>: Hello!\n<bot>:''')
```
or
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-NeoXT-Chat-Base-20B")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-NeoXT-Chat-Base-20B")
```
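Once the tokenizer and model are loaded, you can generate a reply with the same `<human>:`/`<bot>:` turn format the pipeline example uses. Below is a minimal sketch, assuming a GPU with enough memory for the 20B weights; the sampling settings are illustrative, not the values used by OpenChatKit’s own inference script.

```python
import torch

# Illustrative generation with the <human>/<bot> prompt format.
# pad_token_id mirrors the "Setting `pad_token_id` to `eos_token_id`"
# notice shown in the OpenChatKit sample transcripts.
prompt = "<human>: Where is Zurich?\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

In an interactive loop you would also truncate the completion at the next `<human>:` marker so the model does not continue the conversation on its own.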

## Strengths of the model

There are several tasks that OpenChatKit excels at out of the box. These include:

- Example 1: Summarization and question answering within context.

```markdown
**Summarize a long document into a single sentence and conduct question answering related to the document, with multiple rounds**

<human>: Last year, the travel industry saw a big rebound in demand — and that demand is showing no signs of slowing down this spring break travel season. Planes and hotels will be full, travelers will likely face long queues, cancellations, massive crowds and plenty of other travel nightmares. But perhaps the most frustrating thing you’ll have to prepare for this spring break is if your luggage goes missing. You’ve got to somehow track down your belongings that were in the airline’s care and then wait to be reunited — all while trying to enjoy the start of your long-awaited vacation. As far as travel nightmares go, lost luggage is up there as one of the worst.

To be honest up front, I am great at strategically losing things — or rather, losing stuff with style. (I hope you appreciate the “Toy Story” reference.) So when a product comes along that’s a perfect way to keep track of my belongings — especially when traveling — I’m adding it to my travel bag.

I’ve been a big fan of Apple’s $29 item tracker, dubbed AirTag. The long-rumored AirTag product finally launched in 2021, and while it’s not the first item tracker to exist, it has two key features that set it apart from the rest to make it the ultimate travel companion: access to the Find My network and Precision Finding technology.

The Federal Aviation Administration has even ruled that having an AirTag in your luggage is permitted and, in fact, not a safety hazard. The devices only contain 0.1 grams of lithium, which falls below the FAA’s limit.

“Luggage tracking devices powered by lithium metal cells that have 0.3 grams or less of lithium can be used on checked baggage,” the FAA said in a statement. “Apple AirTags meet this threshold; other luggage tracking devices may not.”

But, you may be wondering what makes them so special and why you should carry one in your checked luggage this spring break? Let’s explain why.

Summarize the above into a single paragraph.

<bot>: If you're traveling this spring break, you might want to consider investing in an item tracker to help you find your lost luggage. Apple's AirTag is the best option, according to a new report.

<human>: But isn't it unsafe to put AirTag in a luggage?

<bot>: The Federal Aviation Administration has ruled that having an AirTag in your luggage is permitted and, in fact, not a safety hazard.

<human>: Ah, why is that?

<bot>: The devices only contain 0.1 grams of lithium, which falls below the FAA’s limit.
```

- Example 2: Extraction.

```markdown
**Extract structured information from unstructured documents, as illustrated in the following example.**

<human>: Generate a table summarizing the options outlined in this email.

Team, we have to make a decision on the product plan for Goocrux v2.

The first option will take us 3 weeks of development, and will achieve performance of 80ms, but means that the UI has to change.

The second option will take us 4 weeks of development, and will achieve performance of 60ms, without any UI change.

The third option will take the longest -- 6 weeks of development, will achieve performance of 30ms, without any UI change.

A markdown table with four columns: (1) Option ID, (2) Development time, (3) Performance, (4) UI Change:

<bot>: #.
| ID | Option ID | Development time | Performance | UI Change |
| --------- | --------- | --------- | --------- | --------- |
| 1 | 1 | 3 weeks | 80ms | Yes |
| 2 | 2 | 4 weeks | 60ms | No |
| 3 | 3 | 6 weeks | 30ms | No |
```

- Example 3: Classification.

```markdown
**Classifying a sentence or paragraph into different categories.**

<human>: Classify the sentiment of the following sentence into Positive, Neutral, or Negative: Super excited about teaching Stanford’s first course on Large Language Models! Check the syllabus out here
<bot>: Positive

<human>: Classify the sentiment of the following sentence into Positive, Neutral, or Negative: How about the following sentence: It is raining outside and I feel so blue
<bot>: Negative
```

In addition, the model does well on few-shot prompts. For both classification and extraction, the model performs even better with few shots, as in most HELM tasks. [Contact us](https://www.together.xyz/contact) if you’re interested in trying few-shot prompts with the model.

## Weaknesses of the model

That said, there are several areas where we have more work to do, and we need your help! Some of these include:

- Knowledge-based closed question answering: The chatbot may hallucinate and give incorrect results. Be sure to fact check, and if possible provide feedback with the corrected information.
- Coding tasks: The chatbot was not trained on a large enough corpus of source code to excel at writing code. We welcome contributions of additional datasets to improve this!
- Repetition: Sometimes the chatbot will repeat its response. We’re working to improve this, but in the meantime you can click the refresh button to start a new conversation.
- Context switching: If you change the topic in the middle of a conversation the chatbot often cannot make the switch automatically and will continue to give answers related to the prior topic.
- Creative writing and longer answers: The chatbot does not generate long, creative text such as an essay or story.

We are excited to work with you to address these weaknesses by getting your feedback, bolstering data sets, and improving accuracy.

# Uses

## Direct Use

The model is intended for research purposes. Possible research areas and tasks include:

- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of dialogue models or language models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on dialogue models or language models.

Excluded uses are described below.

### Misuse, Malicious Use, and Out-of-Scope Use

The OpenChatKit community provides GPT-NeoXT-Chat-Base-20B as an open source tool for building chatbots.
The community is not responsible for any misuse, malicious use, or out-of-scope use of the model.
It is the responsibility of the end user to ensure that the model is used in a responsible and ethical manner.

#### Out-of-Scope Use

GPT-NeoXT-Chat-Base-20B is designed for use in chatbot applications and may not perform well for other use cases outside of its intended scope.
For example, it may not be suitable for use in safety-critical applications or for making decisions that have a significant impact on individuals or society.
It is important to consider the limitations of the model and to only use it for its intended purpose.

#### Misuse and Malicious Use

GPT-NeoXT-Chat-Base-20B is designed for use in chatbot applications and should not be used for any other purpose.
Misuse of the model, such as using it to engage in illegal or unethical activities, is strictly prohibited and goes against the principles of the OpenChatKit community project.

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:

- Generating fake news, misinformation, or propaganda
- Promoting hate speech, discrimination, or violence against individuals or groups
- Impersonating individuals or organizations without their consent
- Engaging in cyberbullying or harassment
- Defamatory content
- Spamming or scamming
- Sharing confidential or sensitive information without proper authorization
- Violating the terms of use of the model or the data used to train it
- Creating automated bots for malicious purposes such as spreading malware, phishing scams, or spamming

## Limitations

GPT-NeoXT-Chat-Base-20B, like other language model-based chatbots, has limitations that should be taken into consideration.
For example, the model may not always provide accurate or relevant answers, particularly for questions that are complex, ambiguous, or outside of its training data.
We therefore welcome contributions from individuals and organizations, and encourage collaboration towards creating a more robust and inclusive chatbot.

## Training

**Training Data**

Please refer to [togethercomputer/OpenDataHub](https://github.com/togethercomputer/OpenDataHub)

**Training Procedure**

- **Hardware:** 2 x 8 x A100 GPUs
- **Optimizer:** [8bit-AdamW](https://github.com/TimDettmers/bitsandbytes) (see the sketch after this list)
- **Gradient Accumulations**: 2
- **Batch:** 2 x 2 x 64 x 2048 = 524288 tokens
- **Learning rate:** warmup to 1e-6 for 100 steps and then kept constant
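The optimizer row above refers to the 8-bit AdamW implementation from the linked bitsandbytes repository. As a minimal sketch, assuming `model` is the model loaded in the Quick Start (the actual training loop lives in the OpenChatKit codebase):

```python
import bitsandbytes as bnb

# 8bit-AdamW as named in the table; lr is the constant rate reached
# after the 100-step warmup. Illustrative, not the project's script.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-6)
```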

## Community

Join us on [Together Discord](https://discord.gg/6ZVDU8tTD4)
pytorch_model.bin.index.json
ADDED
@@ -0,0 +1,671 @@
{
  "metadata": {
    "total_size": 41293685880
  },
  "weight_map": {
    "embed_out.weight": "pytorch_model-00005-of-00005.bin",
    "gpt_neox.embed_in.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.final_layer_norm.bias": "pytorch_model-00005-of-00005.bin",
    "gpt_neox.final_layer_norm.weight": "pytorch_model-00005-of-00005.bin",
    "gpt_neox.layers.0.attention.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.attention.dense.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.attention.dense.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.attention.masked_bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.attention.query_key_value.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.attention.query_key_value.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.attention.rotary_emb.inv_freq": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.input_layernorm.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.input_layernorm.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.mlp.dense_4h_to_h.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.mlp.dense_4h_to_h.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.mlp.dense_h_to_4h.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.mlp.dense_h_to_4h.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.post_attention_layernorm.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.0.post_attention_layernorm.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.attention.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.attention.dense.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.attention.dense.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.attention.masked_bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.attention.query_key_value.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.attention.query_key_value.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.attention.rotary_emb.inv_freq": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.input_layernorm.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.input_layernorm.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.mlp.dense_4h_to_h.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.mlp.dense_4h_to_h.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.mlp.dense_h_to_4h.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.mlp.dense_h_to_4h.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.post_attention_layernorm.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.1.post_attention_layernorm.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.10.attention.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.10.attention.dense.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.10.attention.dense.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.10.attention.masked_bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.10.attention.query_key_value.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.10.attention.query_key_value.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.10.attention.rotary_emb.inv_freq": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.10.input_layernorm.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.10.input_layernorm.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.10.mlp.dense_4h_to_h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.10.mlp.dense_4h_to_h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.10.mlp.dense_h_to_4h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.10.mlp.dense_h_to_4h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.10.post_attention_layernorm.bias": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.10.post_attention_layernorm.weight": "pytorch_model-00001-of-00005.bin",
    "gpt_neox.layers.11.attention.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.attention.dense.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.attention.dense.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.attention.masked_bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.attention.query_key_value.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.attention.query_key_value.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.attention.rotary_emb.inv_freq": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.input_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.input_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.mlp.dense_4h_to_h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.mlp.dense_4h_to_h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.mlp.dense_h_to_4h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.mlp.dense_h_to_4h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.post_attention_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.11.post_attention_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.attention.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.attention.dense.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.attention.dense.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.attention.masked_bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.attention.query_key_value.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.attention.query_key_value.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.attention.rotary_emb.inv_freq": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.input_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.input_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.mlp.dense_4h_to_h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.mlp.dense_4h_to_h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.mlp.dense_h_to_4h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.mlp.dense_h_to_4h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.post_attention_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.12.post_attention_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.attention.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.attention.dense.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.attention.dense.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.attention.masked_bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.attention.query_key_value.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.attention.query_key_value.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.attention.rotary_emb.inv_freq": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.input_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.input_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.mlp.dense_4h_to_h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.mlp.dense_4h_to_h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.mlp.dense_h_to_4h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.mlp.dense_h_to_4h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.post_attention_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.13.post_attention_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.attention.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.attention.dense.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.attention.dense.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.attention.masked_bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.attention.query_key_value.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.attention.query_key_value.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.attention.rotary_emb.inv_freq": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.input_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.input_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.mlp.dense_4h_to_h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.mlp.dense_4h_to_h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.mlp.dense_h_to_4h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.mlp.dense_h_to_4h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.post_attention_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.14.post_attention_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.attention.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.attention.dense.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.attention.dense.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.attention.masked_bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.attention.query_key_value.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.attention.query_key_value.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.attention.rotary_emb.inv_freq": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.input_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.input_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.mlp.dense_4h_to_h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.mlp.dense_4h_to_h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.mlp.dense_h_to_4h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.mlp.dense_h_to_4h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.post_attention_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.15.post_attention_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.attention.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.attention.dense.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.attention.dense.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.attention.masked_bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.attention.query_key_value.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.attention.query_key_value.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.attention.rotary_emb.inv_freq": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.input_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.input_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.mlp.dense_4h_to_h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.mlp.dense_4h_to_h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.mlp.dense_h_to_4h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.mlp.dense_h_to_4h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.post_attention_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.16.post_attention_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.attention.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.attention.dense.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.attention.dense.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.attention.masked_bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.attention.query_key_value.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.attention.query_key_value.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.attention.rotary_emb.inv_freq": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.input_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.input_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.mlp.dense_4h_to_h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.mlp.dense_4h_to_h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.mlp.dense_h_to_4h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.mlp.dense_h_to_4h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.post_attention_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.17.post_attention_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.attention.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.attention.dense.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.attention.dense.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.attention.masked_bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.attention.query_key_value.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.attention.query_key_value.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.attention.rotary_emb.inv_freq": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.input_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.input_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.mlp.dense_4h_to_h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.mlp.dense_4h_to_h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.mlp.dense_h_to_4h.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.mlp.dense_h_to_4h.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.post_attention_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.18.post_attention_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.19.attention.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.19.attention.dense.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.19.attention.dense.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.19.attention.masked_bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.19.attention.query_key_value.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.19.attention.query_key_value.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.19.attention.rotary_emb.inv_freq": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.19.input_layernorm.bias": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.19.input_layernorm.weight": "pytorch_model-00002-of-00005.bin",
    "gpt_neox.layers.19.mlp.dense_4h_to_h.bias": "pytorch_model-00002-of-00005.bin",
|
| 185 |
+
"gpt_neox.layers.19.mlp.dense_4h_to_h.weight": "pytorch_model-00002-of-00005.bin",
|
| 186 |
+
"gpt_neox.layers.19.mlp.dense_h_to_4h.bias": "pytorch_model-00002-of-00005.bin",
|
| 187 |
+
"gpt_neox.layers.19.mlp.dense_h_to_4h.weight": "pytorch_model-00002-of-00005.bin",
|
| 188 |
+
"gpt_neox.layers.19.post_attention_layernorm.bias": "pytorch_model-00002-of-00005.bin",
|
| 189 |
+
"gpt_neox.layers.19.post_attention_layernorm.weight": "pytorch_model-00002-of-00005.bin",
|
| 190 |
+
"gpt_neox.layers.2.attention.bias": "pytorch_model-00001-of-00005.bin",
|
| 191 |
+
"gpt_neox.layers.2.attention.dense.bias": "pytorch_model-00001-of-00005.bin",
|
| 192 |
+
"gpt_neox.layers.2.attention.dense.weight": "pytorch_model-00001-of-00005.bin",
|
| 193 |
+
"gpt_neox.layers.2.attention.masked_bias": "pytorch_model-00001-of-00005.bin",
|
| 194 |
+
"gpt_neox.layers.2.attention.query_key_value.bias": "pytorch_model-00001-of-00005.bin",
|
| 195 |
+
"gpt_neox.layers.2.attention.query_key_value.weight": "pytorch_model-00001-of-00005.bin",
|
| 196 |
+
"gpt_neox.layers.2.attention.rotary_emb.inv_freq": "pytorch_model-00001-of-00005.bin",
|
| 197 |
+
"gpt_neox.layers.2.input_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 198 |
+
"gpt_neox.layers.2.input_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 199 |
+
"gpt_neox.layers.2.mlp.dense_4h_to_h.bias": "pytorch_model-00001-of-00005.bin",
|
| 200 |
+
"gpt_neox.layers.2.mlp.dense_4h_to_h.weight": "pytorch_model-00001-of-00005.bin",
|
| 201 |
+
"gpt_neox.layers.2.mlp.dense_h_to_4h.bias": "pytorch_model-00001-of-00005.bin",
|
| 202 |
+
"gpt_neox.layers.2.mlp.dense_h_to_4h.weight": "pytorch_model-00001-of-00005.bin",
|
| 203 |
+
"gpt_neox.layers.2.post_attention_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 204 |
+
"gpt_neox.layers.2.post_attention_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 205 |
+
"gpt_neox.layers.20.attention.bias": "pytorch_model-00002-of-00005.bin",
|
| 206 |
+
"gpt_neox.layers.20.attention.dense.bias": "pytorch_model-00002-of-00005.bin",
|
| 207 |
+
"gpt_neox.layers.20.attention.dense.weight": "pytorch_model-00002-of-00005.bin",
|
| 208 |
+
"gpt_neox.layers.20.attention.masked_bias": "pytorch_model-00002-of-00005.bin",
|
| 209 |
+
"gpt_neox.layers.20.attention.query_key_value.bias": "pytorch_model-00002-of-00005.bin",
|
| 210 |
+
"gpt_neox.layers.20.attention.query_key_value.weight": "pytorch_model-00002-of-00005.bin",
|
| 211 |
+
"gpt_neox.layers.20.attention.rotary_emb.inv_freq": "pytorch_model-00002-of-00005.bin",
|
| 212 |
+
"gpt_neox.layers.20.input_layernorm.bias": "pytorch_model-00002-of-00005.bin",
|
| 213 |
+
"gpt_neox.layers.20.input_layernorm.weight": "pytorch_model-00002-of-00005.bin",
|
| 214 |
+
"gpt_neox.layers.20.mlp.dense_4h_to_h.bias": "pytorch_model-00002-of-00005.bin",
|
| 215 |
+
"gpt_neox.layers.20.mlp.dense_4h_to_h.weight": "pytorch_model-00002-of-00005.bin",
|
| 216 |
+
"gpt_neox.layers.20.mlp.dense_h_to_4h.bias": "pytorch_model-00002-of-00005.bin",
|
| 217 |
+
"gpt_neox.layers.20.mlp.dense_h_to_4h.weight": "pytorch_model-00002-of-00005.bin",
|
| 218 |
+
"gpt_neox.layers.20.post_attention_layernorm.bias": "pytorch_model-00002-of-00005.bin",
|
| 219 |
+
"gpt_neox.layers.20.post_attention_layernorm.weight": "pytorch_model-00002-of-00005.bin",
|
| 220 |
+
"gpt_neox.layers.21.attention.bias": "pytorch_model-00002-of-00005.bin",
|
| 221 |
+
"gpt_neox.layers.21.attention.dense.bias": "pytorch_model-00003-of-00005.bin",
|
| 222 |
+
"gpt_neox.layers.21.attention.dense.weight": "pytorch_model-00003-of-00005.bin",
|
| 223 |
+
"gpt_neox.layers.21.attention.masked_bias": "pytorch_model-00002-of-00005.bin",
|
| 224 |
+
"gpt_neox.layers.21.attention.query_key_value.bias": "pytorch_model-00003-of-00005.bin",
|
| 225 |
+
"gpt_neox.layers.21.attention.query_key_value.weight": "pytorch_model-00003-of-00005.bin",
|
| 226 |
+
"gpt_neox.layers.21.attention.rotary_emb.inv_freq": "pytorch_model-00002-of-00005.bin",
|
| 227 |
+
"gpt_neox.layers.21.input_layernorm.bias": "pytorch_model-00002-of-00005.bin",
|
| 228 |
+
"gpt_neox.layers.21.input_layernorm.weight": "pytorch_model-00002-of-00005.bin",
|
| 229 |
+
"gpt_neox.layers.21.mlp.dense_4h_to_h.bias": "pytorch_model-00003-of-00005.bin",
|
| 230 |
+
"gpt_neox.layers.21.mlp.dense_4h_to_h.weight": "pytorch_model-00003-of-00005.bin",
|
| 231 |
+
"gpt_neox.layers.21.mlp.dense_h_to_4h.bias": "pytorch_model-00003-of-00005.bin",
|
| 232 |
+
"gpt_neox.layers.21.mlp.dense_h_to_4h.weight": "pytorch_model-00003-of-00005.bin",
|
| 233 |
+
"gpt_neox.layers.21.post_attention_layernorm.bias": "pytorch_model-00002-of-00005.bin",
|
| 234 |
+
"gpt_neox.layers.21.post_attention_layernorm.weight": "pytorch_model-00002-of-00005.bin",
|
| 235 |
+
"gpt_neox.layers.22.attention.bias": "pytorch_model-00003-of-00005.bin",
|
| 236 |
+
"gpt_neox.layers.22.attention.dense.bias": "pytorch_model-00003-of-00005.bin",
|
| 237 |
+
"gpt_neox.layers.22.attention.dense.weight": "pytorch_model-00003-of-00005.bin",
|
| 238 |
+
"gpt_neox.layers.22.attention.masked_bias": "pytorch_model-00003-of-00005.bin",
|
| 239 |
+
"gpt_neox.layers.22.attention.query_key_value.bias": "pytorch_model-00003-of-00005.bin",
|
| 240 |
+
"gpt_neox.layers.22.attention.query_key_value.weight": "pytorch_model-00003-of-00005.bin",
|
| 241 |
+
"gpt_neox.layers.22.attention.rotary_emb.inv_freq": "pytorch_model-00003-of-00005.bin",
|
| 242 |
+
"gpt_neox.layers.22.input_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 243 |
+
"gpt_neox.layers.22.input_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 244 |
+
"gpt_neox.layers.22.mlp.dense_4h_to_h.bias": "pytorch_model-00003-of-00005.bin",
|
| 245 |
+
"gpt_neox.layers.22.mlp.dense_4h_to_h.weight": "pytorch_model-00003-of-00005.bin",
|
| 246 |
+
"gpt_neox.layers.22.mlp.dense_h_to_4h.bias": "pytorch_model-00003-of-00005.bin",
|
| 247 |
+
"gpt_neox.layers.22.mlp.dense_h_to_4h.weight": "pytorch_model-00003-of-00005.bin",
|
| 248 |
+
"gpt_neox.layers.22.post_attention_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 249 |
+
"gpt_neox.layers.22.post_attention_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 250 |
+
"gpt_neox.layers.23.attention.bias": "pytorch_model-00003-of-00005.bin",
|
| 251 |
+
"gpt_neox.layers.23.attention.dense.bias": "pytorch_model-00003-of-00005.bin",
|
| 252 |
+
"gpt_neox.layers.23.attention.dense.weight": "pytorch_model-00003-of-00005.bin",
|
| 253 |
+
"gpt_neox.layers.23.attention.masked_bias": "pytorch_model-00003-of-00005.bin",
|
| 254 |
+
"gpt_neox.layers.23.attention.query_key_value.bias": "pytorch_model-00003-of-00005.bin",
|
| 255 |
+
"gpt_neox.layers.23.attention.query_key_value.weight": "pytorch_model-00003-of-00005.bin",
|
| 256 |
+
"gpt_neox.layers.23.attention.rotary_emb.inv_freq": "pytorch_model-00003-of-00005.bin",
|
| 257 |
+
"gpt_neox.layers.23.input_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 258 |
+
"gpt_neox.layers.23.input_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 259 |
+
"gpt_neox.layers.23.mlp.dense_4h_to_h.bias": "pytorch_model-00003-of-00005.bin",
|
| 260 |
+
"gpt_neox.layers.23.mlp.dense_4h_to_h.weight": "pytorch_model-00003-of-00005.bin",
|
| 261 |
+
"gpt_neox.layers.23.mlp.dense_h_to_4h.bias": "pytorch_model-00003-of-00005.bin",
|
| 262 |
+
"gpt_neox.layers.23.mlp.dense_h_to_4h.weight": "pytorch_model-00003-of-00005.bin",
|
| 263 |
+
"gpt_neox.layers.23.post_attention_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 264 |
+
"gpt_neox.layers.23.post_attention_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 265 |
+
"gpt_neox.layers.24.attention.bias": "pytorch_model-00003-of-00005.bin",
|
| 266 |
+
"gpt_neox.layers.24.attention.dense.bias": "pytorch_model-00003-of-00005.bin",
|
| 267 |
+
"gpt_neox.layers.24.attention.dense.weight": "pytorch_model-00003-of-00005.bin",
|
| 268 |
+
"gpt_neox.layers.24.attention.masked_bias": "pytorch_model-00003-of-00005.bin",
|
| 269 |
+
"gpt_neox.layers.24.attention.query_key_value.bias": "pytorch_model-00003-of-00005.bin",
|
| 270 |
+
"gpt_neox.layers.24.attention.query_key_value.weight": "pytorch_model-00003-of-00005.bin",
|
| 271 |
+
"gpt_neox.layers.24.attention.rotary_emb.inv_freq": "pytorch_model-00003-of-00005.bin",
|
| 272 |
+
"gpt_neox.layers.24.input_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 273 |
+
"gpt_neox.layers.24.input_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 274 |
+
"gpt_neox.layers.24.mlp.dense_4h_to_h.bias": "pytorch_model-00003-of-00005.bin",
|
| 275 |
+
"gpt_neox.layers.24.mlp.dense_4h_to_h.weight": "pytorch_model-00003-of-00005.bin",
|
| 276 |
+
"gpt_neox.layers.24.mlp.dense_h_to_4h.bias": "pytorch_model-00003-of-00005.bin",
|
| 277 |
+
"gpt_neox.layers.24.mlp.dense_h_to_4h.weight": "pytorch_model-00003-of-00005.bin",
|
| 278 |
+
"gpt_neox.layers.24.post_attention_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 279 |
+
"gpt_neox.layers.24.post_attention_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 280 |
+
"gpt_neox.layers.25.attention.bias": "pytorch_model-00003-of-00005.bin",
|
| 281 |
+
"gpt_neox.layers.25.attention.dense.bias": "pytorch_model-00003-of-00005.bin",
|
| 282 |
+
"gpt_neox.layers.25.attention.dense.weight": "pytorch_model-00003-of-00005.bin",
|
| 283 |
+
"gpt_neox.layers.25.attention.masked_bias": "pytorch_model-00003-of-00005.bin",
|
| 284 |
+
"gpt_neox.layers.25.attention.query_key_value.bias": "pytorch_model-00003-of-00005.bin",
|
| 285 |
+
"gpt_neox.layers.25.attention.query_key_value.weight": "pytorch_model-00003-of-00005.bin",
|
| 286 |
+
"gpt_neox.layers.25.attention.rotary_emb.inv_freq": "pytorch_model-00003-of-00005.bin",
|
| 287 |
+
"gpt_neox.layers.25.input_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 288 |
+
"gpt_neox.layers.25.input_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 289 |
+
"gpt_neox.layers.25.mlp.dense_4h_to_h.bias": "pytorch_model-00003-of-00005.bin",
|
| 290 |
+
"gpt_neox.layers.25.mlp.dense_4h_to_h.weight": "pytorch_model-00003-of-00005.bin",
|
| 291 |
+
"gpt_neox.layers.25.mlp.dense_h_to_4h.bias": "pytorch_model-00003-of-00005.bin",
|
| 292 |
+
"gpt_neox.layers.25.mlp.dense_h_to_4h.weight": "pytorch_model-00003-of-00005.bin",
|
| 293 |
+
"gpt_neox.layers.25.post_attention_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 294 |
+
"gpt_neox.layers.25.post_attention_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 295 |
+
"gpt_neox.layers.26.attention.bias": "pytorch_model-00003-of-00005.bin",
|
| 296 |
+
"gpt_neox.layers.26.attention.dense.bias": "pytorch_model-00003-of-00005.bin",
|
| 297 |
+
"gpt_neox.layers.26.attention.dense.weight": "pytorch_model-00003-of-00005.bin",
|
| 298 |
+
"gpt_neox.layers.26.attention.masked_bias": "pytorch_model-00003-of-00005.bin",
|
| 299 |
+
"gpt_neox.layers.26.attention.query_key_value.bias": "pytorch_model-00003-of-00005.bin",
|
| 300 |
+
"gpt_neox.layers.26.attention.query_key_value.weight": "pytorch_model-00003-of-00005.bin",
|
| 301 |
+
"gpt_neox.layers.26.attention.rotary_emb.inv_freq": "pytorch_model-00003-of-00005.bin",
|
| 302 |
+
"gpt_neox.layers.26.input_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 303 |
+
"gpt_neox.layers.26.input_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 304 |
+
"gpt_neox.layers.26.mlp.dense_4h_to_h.bias": "pytorch_model-00003-of-00005.bin",
|
| 305 |
+
"gpt_neox.layers.26.mlp.dense_4h_to_h.weight": "pytorch_model-00003-of-00005.bin",
|
| 306 |
+
"gpt_neox.layers.26.mlp.dense_h_to_4h.bias": "pytorch_model-00003-of-00005.bin",
|
| 307 |
+
"gpt_neox.layers.26.mlp.dense_h_to_4h.weight": "pytorch_model-00003-of-00005.bin",
|
| 308 |
+
"gpt_neox.layers.26.post_attention_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 309 |
+
"gpt_neox.layers.26.post_attention_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 310 |
+
"gpt_neox.layers.27.attention.bias": "pytorch_model-00003-of-00005.bin",
|
| 311 |
+
"gpt_neox.layers.27.attention.dense.bias": "pytorch_model-00003-of-00005.bin",
|
| 312 |
+
"gpt_neox.layers.27.attention.dense.weight": "pytorch_model-00003-of-00005.bin",
|
| 313 |
+
"gpt_neox.layers.27.attention.masked_bias": "pytorch_model-00003-of-00005.bin",
|
| 314 |
+
"gpt_neox.layers.27.attention.query_key_value.bias": "pytorch_model-00003-of-00005.bin",
|
| 315 |
+
"gpt_neox.layers.27.attention.query_key_value.weight": "pytorch_model-00003-of-00005.bin",
|
| 316 |
+
"gpt_neox.layers.27.attention.rotary_emb.inv_freq": "pytorch_model-00003-of-00005.bin",
|
| 317 |
+
"gpt_neox.layers.27.input_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 318 |
+
"gpt_neox.layers.27.input_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 319 |
+
"gpt_neox.layers.27.mlp.dense_4h_to_h.bias": "pytorch_model-00003-of-00005.bin",
|
| 320 |
+
"gpt_neox.layers.27.mlp.dense_4h_to_h.weight": "pytorch_model-00003-of-00005.bin",
|
| 321 |
+
"gpt_neox.layers.27.mlp.dense_h_to_4h.bias": "pytorch_model-00003-of-00005.bin",
|
| 322 |
+
"gpt_neox.layers.27.mlp.dense_h_to_4h.weight": "pytorch_model-00003-of-00005.bin",
|
| 323 |
+
"gpt_neox.layers.27.post_attention_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 324 |
+
"gpt_neox.layers.27.post_attention_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 325 |
+
"gpt_neox.layers.28.attention.bias": "pytorch_model-00003-of-00005.bin",
|
| 326 |
+
"gpt_neox.layers.28.attention.dense.bias": "pytorch_model-00003-of-00005.bin",
|
| 327 |
+
"gpt_neox.layers.28.attention.dense.weight": "pytorch_model-00003-of-00005.bin",
|
| 328 |
+
"gpt_neox.layers.28.attention.masked_bias": "pytorch_model-00003-of-00005.bin",
|
| 329 |
+
"gpt_neox.layers.28.attention.query_key_value.bias": "pytorch_model-00003-of-00005.bin",
|
| 330 |
+
"gpt_neox.layers.28.attention.query_key_value.weight": "pytorch_model-00003-of-00005.bin",
|
| 331 |
+
"gpt_neox.layers.28.attention.rotary_emb.inv_freq": "pytorch_model-00003-of-00005.bin",
|
| 332 |
+
"gpt_neox.layers.28.input_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 333 |
+
"gpt_neox.layers.28.input_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 334 |
+
"gpt_neox.layers.28.mlp.dense_4h_to_h.bias": "pytorch_model-00003-of-00005.bin",
|
| 335 |
+
"gpt_neox.layers.28.mlp.dense_4h_to_h.weight": "pytorch_model-00003-of-00005.bin",
|
| 336 |
+
"gpt_neox.layers.28.mlp.dense_h_to_4h.bias": "pytorch_model-00003-of-00005.bin",
|
| 337 |
+
"gpt_neox.layers.28.mlp.dense_h_to_4h.weight": "pytorch_model-00003-of-00005.bin",
|
| 338 |
+
"gpt_neox.layers.28.post_attention_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 339 |
+
"gpt_neox.layers.28.post_attention_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 340 |
+
"gpt_neox.layers.29.attention.bias": "pytorch_model-00003-of-00005.bin",
|
| 341 |
+
"gpt_neox.layers.29.attention.dense.bias": "pytorch_model-00003-of-00005.bin",
|
| 342 |
+
"gpt_neox.layers.29.attention.dense.weight": "pytorch_model-00003-of-00005.bin",
|
| 343 |
+
"gpt_neox.layers.29.attention.masked_bias": "pytorch_model-00003-of-00005.bin",
|
| 344 |
+
"gpt_neox.layers.29.attention.query_key_value.bias": "pytorch_model-00003-of-00005.bin",
|
| 345 |
+
"gpt_neox.layers.29.attention.query_key_value.weight": "pytorch_model-00003-of-00005.bin",
|
| 346 |
+
"gpt_neox.layers.29.attention.rotary_emb.inv_freq": "pytorch_model-00003-of-00005.bin",
|
| 347 |
+
"gpt_neox.layers.29.input_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 348 |
+
"gpt_neox.layers.29.input_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 349 |
+
"gpt_neox.layers.29.mlp.dense_4h_to_h.bias": "pytorch_model-00003-of-00005.bin",
|
| 350 |
+
"gpt_neox.layers.29.mlp.dense_4h_to_h.weight": "pytorch_model-00003-of-00005.bin",
|
| 351 |
+
"gpt_neox.layers.29.mlp.dense_h_to_4h.bias": "pytorch_model-00003-of-00005.bin",
|
| 352 |
+
"gpt_neox.layers.29.mlp.dense_h_to_4h.weight": "pytorch_model-00003-of-00005.bin",
|
| 353 |
+
"gpt_neox.layers.29.post_attention_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 354 |
+
"gpt_neox.layers.29.post_attention_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 355 |
+
"gpt_neox.layers.3.attention.bias": "pytorch_model-00001-of-00005.bin",
|
| 356 |
+
"gpt_neox.layers.3.attention.dense.bias": "pytorch_model-00001-of-00005.bin",
|
| 357 |
+
"gpt_neox.layers.3.attention.dense.weight": "pytorch_model-00001-of-00005.bin",
|
| 358 |
+
"gpt_neox.layers.3.attention.masked_bias": "pytorch_model-00001-of-00005.bin",
|
| 359 |
+
"gpt_neox.layers.3.attention.query_key_value.bias": "pytorch_model-00001-of-00005.bin",
|
| 360 |
+
"gpt_neox.layers.3.attention.query_key_value.weight": "pytorch_model-00001-of-00005.bin",
|
| 361 |
+
"gpt_neox.layers.3.attention.rotary_emb.inv_freq": "pytorch_model-00001-of-00005.bin",
|
| 362 |
+
"gpt_neox.layers.3.input_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 363 |
+
"gpt_neox.layers.3.input_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 364 |
+
"gpt_neox.layers.3.mlp.dense_4h_to_h.bias": "pytorch_model-00001-of-00005.bin",
|
| 365 |
+
"gpt_neox.layers.3.mlp.dense_4h_to_h.weight": "pytorch_model-00001-of-00005.bin",
|
| 366 |
+
"gpt_neox.layers.3.mlp.dense_h_to_4h.bias": "pytorch_model-00001-of-00005.bin",
|
| 367 |
+
"gpt_neox.layers.3.mlp.dense_h_to_4h.weight": "pytorch_model-00001-of-00005.bin",
|
| 368 |
+
"gpt_neox.layers.3.post_attention_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 369 |
+
"gpt_neox.layers.3.post_attention_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 370 |
+
"gpt_neox.layers.30.attention.bias": "pytorch_model-00003-of-00005.bin",
|
| 371 |
+
"gpt_neox.layers.30.attention.dense.bias": "pytorch_model-00003-of-00005.bin",
|
| 372 |
+
"gpt_neox.layers.30.attention.dense.weight": "pytorch_model-00003-of-00005.bin",
|
| 373 |
+
"gpt_neox.layers.30.attention.masked_bias": "pytorch_model-00003-of-00005.bin",
|
| 374 |
+
"gpt_neox.layers.30.attention.query_key_value.bias": "pytorch_model-00003-of-00005.bin",
|
| 375 |
+
"gpt_neox.layers.30.attention.query_key_value.weight": "pytorch_model-00003-of-00005.bin",
|
| 376 |
+
"gpt_neox.layers.30.attention.rotary_emb.inv_freq": "pytorch_model-00003-of-00005.bin",
|
| 377 |
+
"gpt_neox.layers.30.input_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 378 |
+
"gpt_neox.layers.30.input_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 379 |
+
"gpt_neox.layers.30.mlp.dense_4h_to_h.bias": "pytorch_model-00003-of-00005.bin",
|
| 380 |
+
"gpt_neox.layers.30.mlp.dense_4h_to_h.weight": "pytorch_model-00003-of-00005.bin",
|
| 381 |
+
"gpt_neox.layers.30.mlp.dense_h_to_4h.bias": "pytorch_model-00003-of-00005.bin",
|
| 382 |
+
"gpt_neox.layers.30.mlp.dense_h_to_4h.weight": "pytorch_model-00003-of-00005.bin",
|
| 383 |
+
"gpt_neox.layers.30.post_attention_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 384 |
+
"gpt_neox.layers.30.post_attention_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 385 |
+
"gpt_neox.layers.31.attention.bias": "pytorch_model-00003-of-00005.bin",
|
| 386 |
+
"gpt_neox.layers.31.attention.dense.bias": "pytorch_model-00003-of-00005.bin",
|
| 387 |
+
"gpt_neox.layers.31.attention.dense.weight": "pytorch_model-00003-of-00005.bin",
|
| 388 |
+
"gpt_neox.layers.31.attention.masked_bias": "pytorch_model-00003-of-00005.bin",
|
| 389 |
+
"gpt_neox.layers.31.attention.query_key_value.bias": "pytorch_model-00003-of-00005.bin",
|
| 390 |
+
"gpt_neox.layers.31.attention.query_key_value.weight": "pytorch_model-00003-of-00005.bin",
|
| 391 |
+
"gpt_neox.layers.31.attention.rotary_emb.inv_freq": "pytorch_model-00003-of-00005.bin",
|
| 392 |
+
"gpt_neox.layers.31.input_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 393 |
+
"gpt_neox.layers.31.input_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 394 |
+
"gpt_neox.layers.31.mlp.dense_4h_to_h.bias": "pytorch_model-00004-of-00005.bin",
|
| 395 |
+
"gpt_neox.layers.31.mlp.dense_4h_to_h.weight": "pytorch_model-00004-of-00005.bin",
|
| 396 |
+
"gpt_neox.layers.31.mlp.dense_h_to_4h.bias": "pytorch_model-00003-of-00005.bin",
|
| 397 |
+
"gpt_neox.layers.31.mlp.dense_h_to_4h.weight": "pytorch_model-00003-of-00005.bin",
|
| 398 |
+
"gpt_neox.layers.31.post_attention_layernorm.bias": "pytorch_model-00003-of-00005.bin",
|
| 399 |
+
"gpt_neox.layers.31.post_attention_layernorm.weight": "pytorch_model-00003-of-00005.bin",
|
| 400 |
+
"gpt_neox.layers.32.attention.bias": "pytorch_model-00004-of-00005.bin",
|
| 401 |
+
"gpt_neox.layers.32.attention.dense.bias": "pytorch_model-00004-of-00005.bin",
|
| 402 |
+
"gpt_neox.layers.32.attention.dense.weight": "pytorch_model-00004-of-00005.bin",
|
| 403 |
+
"gpt_neox.layers.32.attention.masked_bias": "pytorch_model-00004-of-00005.bin",
|
| 404 |
+
"gpt_neox.layers.32.attention.query_key_value.bias": "pytorch_model-00004-of-00005.bin",
|
| 405 |
+
"gpt_neox.layers.32.attention.query_key_value.weight": "pytorch_model-00004-of-00005.bin",
|
| 406 |
+
"gpt_neox.layers.32.attention.rotary_emb.inv_freq": "pytorch_model-00004-of-00005.bin",
|
| 407 |
+
"gpt_neox.layers.32.input_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 408 |
+
"gpt_neox.layers.32.input_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 409 |
+
"gpt_neox.layers.32.mlp.dense_4h_to_h.bias": "pytorch_model-00004-of-00005.bin",
|
| 410 |
+
"gpt_neox.layers.32.mlp.dense_4h_to_h.weight": "pytorch_model-00004-of-00005.bin",
|
| 411 |
+
"gpt_neox.layers.32.mlp.dense_h_to_4h.bias": "pytorch_model-00004-of-00005.bin",
|
| 412 |
+
"gpt_neox.layers.32.mlp.dense_h_to_4h.weight": "pytorch_model-00004-of-00005.bin",
|
| 413 |
+
"gpt_neox.layers.32.post_attention_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 414 |
+
"gpt_neox.layers.32.post_attention_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 415 |
+
"gpt_neox.layers.33.attention.bias": "pytorch_model-00004-of-00005.bin",
|
| 416 |
+
"gpt_neox.layers.33.attention.dense.bias": "pytorch_model-00004-of-00005.bin",
|
| 417 |
+
"gpt_neox.layers.33.attention.dense.weight": "pytorch_model-00004-of-00005.bin",
|
| 418 |
+
"gpt_neox.layers.33.attention.masked_bias": "pytorch_model-00004-of-00005.bin",
|
| 419 |
+
"gpt_neox.layers.33.attention.query_key_value.bias": "pytorch_model-00004-of-00005.bin",
|
| 420 |
+
"gpt_neox.layers.33.attention.query_key_value.weight": "pytorch_model-00004-of-00005.bin",
|
| 421 |
+
"gpt_neox.layers.33.attention.rotary_emb.inv_freq": "pytorch_model-00004-of-00005.bin",
|
| 422 |
+
"gpt_neox.layers.33.input_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 423 |
+
"gpt_neox.layers.33.input_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 424 |
+
"gpt_neox.layers.33.mlp.dense_4h_to_h.bias": "pytorch_model-00004-of-00005.bin",
|
| 425 |
+
"gpt_neox.layers.33.mlp.dense_4h_to_h.weight": "pytorch_model-00004-of-00005.bin",
|
| 426 |
+
"gpt_neox.layers.33.mlp.dense_h_to_4h.bias": "pytorch_model-00004-of-00005.bin",
|
| 427 |
+
"gpt_neox.layers.33.mlp.dense_h_to_4h.weight": "pytorch_model-00004-of-00005.bin",
|
| 428 |
+
"gpt_neox.layers.33.post_attention_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 429 |
+
"gpt_neox.layers.33.post_attention_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 430 |
+
"gpt_neox.layers.34.attention.bias": "pytorch_model-00004-of-00005.bin",
|
| 431 |
+
"gpt_neox.layers.34.attention.dense.bias": "pytorch_model-00004-of-00005.bin",
|
| 432 |
+
"gpt_neox.layers.34.attention.dense.weight": "pytorch_model-00004-of-00005.bin",
|
| 433 |
+
"gpt_neox.layers.34.attention.masked_bias": "pytorch_model-00004-of-00005.bin",
|
| 434 |
+
"gpt_neox.layers.34.attention.query_key_value.bias": "pytorch_model-00004-of-00005.bin",
|
| 435 |
+
"gpt_neox.layers.34.attention.query_key_value.weight": "pytorch_model-00004-of-00005.bin",
|
| 436 |
+
"gpt_neox.layers.34.attention.rotary_emb.inv_freq": "pytorch_model-00004-of-00005.bin",
|
| 437 |
+
"gpt_neox.layers.34.input_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 438 |
+
"gpt_neox.layers.34.input_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 439 |
+
"gpt_neox.layers.34.mlp.dense_4h_to_h.bias": "pytorch_model-00004-of-00005.bin",
|
| 440 |
+
"gpt_neox.layers.34.mlp.dense_4h_to_h.weight": "pytorch_model-00004-of-00005.bin",
|
| 441 |
+
"gpt_neox.layers.34.mlp.dense_h_to_4h.bias": "pytorch_model-00004-of-00005.bin",
|
| 442 |
+
"gpt_neox.layers.34.mlp.dense_h_to_4h.weight": "pytorch_model-00004-of-00005.bin",
|
| 443 |
+
"gpt_neox.layers.34.post_attention_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 444 |
+
"gpt_neox.layers.34.post_attention_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 445 |
+
"gpt_neox.layers.35.attention.bias": "pytorch_model-00004-of-00005.bin",
|
| 446 |
+
"gpt_neox.layers.35.attention.dense.bias": "pytorch_model-00004-of-00005.bin",
|
| 447 |
+
"gpt_neox.layers.35.attention.dense.weight": "pytorch_model-00004-of-00005.bin",
|
| 448 |
+
"gpt_neox.layers.35.attention.masked_bias": "pytorch_model-00004-of-00005.bin",
|
| 449 |
+
"gpt_neox.layers.35.attention.query_key_value.bias": "pytorch_model-00004-of-00005.bin",
|
| 450 |
+
"gpt_neox.layers.35.attention.query_key_value.weight": "pytorch_model-00004-of-00005.bin",
|
| 451 |
+
"gpt_neox.layers.35.attention.rotary_emb.inv_freq": "pytorch_model-00004-of-00005.bin",
|
| 452 |
+
"gpt_neox.layers.35.input_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 453 |
+
"gpt_neox.layers.35.input_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 454 |
+
"gpt_neox.layers.35.mlp.dense_4h_to_h.bias": "pytorch_model-00004-of-00005.bin",
|
| 455 |
+
"gpt_neox.layers.35.mlp.dense_4h_to_h.weight": "pytorch_model-00004-of-00005.bin",
|
| 456 |
+
"gpt_neox.layers.35.mlp.dense_h_to_4h.bias": "pytorch_model-00004-of-00005.bin",
|
| 457 |
+
"gpt_neox.layers.35.mlp.dense_h_to_4h.weight": "pytorch_model-00004-of-00005.bin",
|
| 458 |
+
"gpt_neox.layers.35.post_attention_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 459 |
+
"gpt_neox.layers.35.post_attention_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 460 |
+
"gpt_neox.layers.36.attention.bias": "pytorch_model-00004-of-00005.bin",
|
| 461 |
+
"gpt_neox.layers.36.attention.dense.bias": "pytorch_model-00004-of-00005.bin",
|
| 462 |
+
"gpt_neox.layers.36.attention.dense.weight": "pytorch_model-00004-of-00005.bin",
|
| 463 |
+
"gpt_neox.layers.36.attention.masked_bias": "pytorch_model-00004-of-00005.bin",
|
| 464 |
+
"gpt_neox.layers.36.attention.query_key_value.bias": "pytorch_model-00004-of-00005.bin",
|
| 465 |
+
"gpt_neox.layers.36.attention.query_key_value.weight": "pytorch_model-00004-of-00005.bin",
|
| 466 |
+
"gpt_neox.layers.36.attention.rotary_emb.inv_freq": "pytorch_model-00004-of-00005.bin",
|
| 467 |
+
"gpt_neox.layers.36.input_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 468 |
+
"gpt_neox.layers.36.input_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 469 |
+
"gpt_neox.layers.36.mlp.dense_4h_to_h.bias": "pytorch_model-00004-of-00005.bin",
|
| 470 |
+
"gpt_neox.layers.36.mlp.dense_4h_to_h.weight": "pytorch_model-00004-of-00005.bin",
|
| 471 |
+
"gpt_neox.layers.36.mlp.dense_h_to_4h.bias": "pytorch_model-00004-of-00005.bin",
|
| 472 |
+
"gpt_neox.layers.36.mlp.dense_h_to_4h.weight": "pytorch_model-00004-of-00005.bin",
|
| 473 |
+
"gpt_neox.layers.36.post_attention_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 474 |
+
"gpt_neox.layers.36.post_attention_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 475 |
+
"gpt_neox.layers.37.attention.bias": "pytorch_model-00004-of-00005.bin",
|
| 476 |
+
"gpt_neox.layers.37.attention.dense.bias": "pytorch_model-00004-of-00005.bin",
|
| 477 |
+
"gpt_neox.layers.37.attention.dense.weight": "pytorch_model-00004-of-00005.bin",
|
| 478 |
+
"gpt_neox.layers.37.attention.masked_bias": "pytorch_model-00004-of-00005.bin",
|
| 479 |
+
"gpt_neox.layers.37.attention.query_key_value.bias": "pytorch_model-00004-of-00005.bin",
|
| 480 |
+
"gpt_neox.layers.37.attention.query_key_value.weight": "pytorch_model-00004-of-00005.bin",
|
| 481 |
+
"gpt_neox.layers.37.attention.rotary_emb.inv_freq": "pytorch_model-00004-of-00005.bin",
|
| 482 |
+
"gpt_neox.layers.37.input_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 483 |
+
"gpt_neox.layers.37.input_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 484 |
+
"gpt_neox.layers.37.mlp.dense_4h_to_h.bias": "pytorch_model-00004-of-00005.bin",
|
| 485 |
+
"gpt_neox.layers.37.mlp.dense_4h_to_h.weight": "pytorch_model-00004-of-00005.bin",
|
| 486 |
+
"gpt_neox.layers.37.mlp.dense_h_to_4h.bias": "pytorch_model-00004-of-00005.bin",
|
| 487 |
+
"gpt_neox.layers.37.mlp.dense_h_to_4h.weight": "pytorch_model-00004-of-00005.bin",
|
| 488 |
+
"gpt_neox.layers.37.post_attention_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 489 |
+
"gpt_neox.layers.37.post_attention_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 490 |
+
"gpt_neox.layers.38.attention.bias": "pytorch_model-00004-of-00005.bin",
|
| 491 |
+
"gpt_neox.layers.38.attention.dense.bias": "pytorch_model-00004-of-00005.bin",
|
| 492 |
+
"gpt_neox.layers.38.attention.dense.weight": "pytorch_model-00004-of-00005.bin",
|
| 493 |
+
"gpt_neox.layers.38.attention.masked_bias": "pytorch_model-00004-of-00005.bin",
|
| 494 |
+
"gpt_neox.layers.38.attention.query_key_value.bias": "pytorch_model-00004-of-00005.bin",
|
| 495 |
+
"gpt_neox.layers.38.attention.query_key_value.weight": "pytorch_model-00004-of-00005.bin",
|
| 496 |
+
"gpt_neox.layers.38.attention.rotary_emb.inv_freq": "pytorch_model-00004-of-00005.bin",
|
| 497 |
+
"gpt_neox.layers.38.input_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 498 |
+
"gpt_neox.layers.38.input_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 499 |
+
"gpt_neox.layers.38.mlp.dense_4h_to_h.bias": "pytorch_model-00004-of-00005.bin",
|
| 500 |
+
"gpt_neox.layers.38.mlp.dense_4h_to_h.weight": "pytorch_model-00004-of-00005.bin",
|
| 501 |
+
"gpt_neox.layers.38.mlp.dense_h_to_4h.bias": "pytorch_model-00004-of-00005.bin",
|
| 502 |
+
"gpt_neox.layers.38.mlp.dense_h_to_4h.weight": "pytorch_model-00004-of-00005.bin",
|
| 503 |
+
"gpt_neox.layers.38.post_attention_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 504 |
+
"gpt_neox.layers.38.post_attention_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 505 |
+
"gpt_neox.layers.39.attention.bias": "pytorch_model-00004-of-00005.bin",
|
| 506 |
+
"gpt_neox.layers.39.attention.dense.bias": "pytorch_model-00004-of-00005.bin",
|
| 507 |
+
"gpt_neox.layers.39.attention.dense.weight": "pytorch_model-00004-of-00005.bin",
|
| 508 |
+
"gpt_neox.layers.39.attention.masked_bias": "pytorch_model-00004-of-00005.bin",
|
| 509 |
+
"gpt_neox.layers.39.attention.query_key_value.bias": "pytorch_model-00004-of-00005.bin",
|
| 510 |
+
"gpt_neox.layers.39.attention.query_key_value.weight": "pytorch_model-00004-of-00005.bin",
|
| 511 |
+
"gpt_neox.layers.39.attention.rotary_emb.inv_freq": "pytorch_model-00004-of-00005.bin",
|
| 512 |
+
"gpt_neox.layers.39.input_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 513 |
+
"gpt_neox.layers.39.input_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 514 |
+
"gpt_neox.layers.39.mlp.dense_4h_to_h.bias": "pytorch_model-00004-of-00005.bin",
|
| 515 |
+
"gpt_neox.layers.39.mlp.dense_4h_to_h.weight": "pytorch_model-00004-of-00005.bin",
|
| 516 |
+
"gpt_neox.layers.39.mlp.dense_h_to_4h.bias": "pytorch_model-00004-of-00005.bin",
|
| 517 |
+
"gpt_neox.layers.39.mlp.dense_h_to_4h.weight": "pytorch_model-00004-of-00005.bin",
|
| 518 |
+
"gpt_neox.layers.39.post_attention_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 519 |
+
"gpt_neox.layers.39.post_attention_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 520 |
+
"gpt_neox.layers.4.attention.bias": "pytorch_model-00001-of-00005.bin",
|
| 521 |
+
"gpt_neox.layers.4.attention.dense.bias": "pytorch_model-00001-of-00005.bin",
|
| 522 |
+
"gpt_neox.layers.4.attention.dense.weight": "pytorch_model-00001-of-00005.bin",
|
| 523 |
+
"gpt_neox.layers.4.attention.masked_bias": "pytorch_model-00001-of-00005.bin",
|
| 524 |
+
"gpt_neox.layers.4.attention.query_key_value.bias": "pytorch_model-00001-of-00005.bin",
|
| 525 |
+
"gpt_neox.layers.4.attention.query_key_value.weight": "pytorch_model-00001-of-00005.bin",
|
| 526 |
+
"gpt_neox.layers.4.attention.rotary_emb.inv_freq": "pytorch_model-00001-of-00005.bin",
|
| 527 |
+
"gpt_neox.layers.4.input_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 528 |
+
"gpt_neox.layers.4.input_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 529 |
+
"gpt_neox.layers.4.mlp.dense_4h_to_h.bias": "pytorch_model-00001-of-00005.bin",
|
| 530 |
+
"gpt_neox.layers.4.mlp.dense_4h_to_h.weight": "pytorch_model-00001-of-00005.bin",
|
| 531 |
+
"gpt_neox.layers.4.mlp.dense_h_to_4h.bias": "pytorch_model-00001-of-00005.bin",
|
| 532 |
+
"gpt_neox.layers.4.mlp.dense_h_to_4h.weight": "pytorch_model-00001-of-00005.bin",
|
| 533 |
+
"gpt_neox.layers.4.post_attention_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 534 |
+
"gpt_neox.layers.4.post_attention_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 535 |
+
"gpt_neox.layers.40.attention.bias": "pytorch_model-00004-of-00005.bin",
|
| 536 |
+
"gpt_neox.layers.40.attention.dense.bias": "pytorch_model-00004-of-00005.bin",
|
| 537 |
+
"gpt_neox.layers.40.attention.dense.weight": "pytorch_model-00004-of-00005.bin",
|
| 538 |
+
"gpt_neox.layers.40.attention.masked_bias": "pytorch_model-00004-of-00005.bin",
|
| 539 |
+
"gpt_neox.layers.40.attention.query_key_value.bias": "pytorch_model-00004-of-00005.bin",
|
| 540 |
+
"gpt_neox.layers.40.attention.query_key_value.weight": "pytorch_model-00004-of-00005.bin",
|
| 541 |
+
"gpt_neox.layers.40.attention.rotary_emb.inv_freq": "pytorch_model-00004-of-00005.bin",
|
| 542 |
+
"gpt_neox.layers.40.input_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 543 |
+
"gpt_neox.layers.40.input_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 544 |
+
"gpt_neox.layers.40.mlp.dense_4h_to_h.bias": "pytorch_model-00004-of-00005.bin",
|
| 545 |
+
"gpt_neox.layers.40.mlp.dense_4h_to_h.weight": "pytorch_model-00004-of-00005.bin",
|
| 546 |
+
"gpt_neox.layers.40.mlp.dense_h_to_4h.bias": "pytorch_model-00004-of-00005.bin",
|
| 547 |
+
"gpt_neox.layers.40.mlp.dense_h_to_4h.weight": "pytorch_model-00004-of-00005.bin",
|
| 548 |
+
"gpt_neox.layers.40.post_attention_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 549 |
+
"gpt_neox.layers.40.post_attention_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 550 |
+
"gpt_neox.layers.41.attention.bias": "pytorch_model-00004-of-00005.bin",
|
| 551 |
+
"gpt_neox.layers.41.attention.dense.bias": "pytorch_model-00004-of-00005.bin",
|
| 552 |
+
"gpt_neox.layers.41.attention.dense.weight": "pytorch_model-00004-of-00005.bin",
|
| 553 |
+
"gpt_neox.layers.41.attention.masked_bias": "pytorch_model-00004-of-00005.bin",
|
| 554 |
+
"gpt_neox.layers.41.attention.query_key_value.bias": "pytorch_model-00004-of-00005.bin",
|
| 555 |
+
"gpt_neox.layers.41.attention.query_key_value.weight": "pytorch_model-00004-of-00005.bin",
|
| 556 |
+
"gpt_neox.layers.41.attention.rotary_emb.inv_freq": "pytorch_model-00004-of-00005.bin",
|
| 557 |
+
"gpt_neox.layers.41.input_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 558 |
+
"gpt_neox.layers.41.input_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 559 |
+
"gpt_neox.layers.41.mlp.dense_4h_to_h.bias": "pytorch_model-00004-of-00005.bin",
|
| 560 |
+
"gpt_neox.layers.41.mlp.dense_4h_to_h.weight": "pytorch_model-00004-of-00005.bin",
|
| 561 |
+
"gpt_neox.layers.41.mlp.dense_h_to_4h.bias": "pytorch_model-00004-of-00005.bin",
|
| 562 |
+
"gpt_neox.layers.41.mlp.dense_h_to_4h.weight": "pytorch_model-00004-of-00005.bin",
|
| 563 |
+
"gpt_neox.layers.41.post_attention_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 564 |
+
"gpt_neox.layers.41.post_attention_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 565 |
+
"gpt_neox.layers.42.attention.bias": "pytorch_model-00004-of-00005.bin",
|
| 566 |
+
"gpt_neox.layers.42.attention.dense.bias": "pytorch_model-00004-of-00005.bin",
|
| 567 |
+
"gpt_neox.layers.42.attention.dense.weight": "pytorch_model-00004-of-00005.bin",
|
| 568 |
+
"gpt_neox.layers.42.attention.masked_bias": "pytorch_model-00004-of-00005.bin",
|
| 569 |
+
"gpt_neox.layers.42.attention.query_key_value.bias": "pytorch_model-00004-of-00005.bin",
|
| 570 |
+
"gpt_neox.layers.42.attention.query_key_value.weight": "pytorch_model-00004-of-00005.bin",
|
| 571 |
+
"gpt_neox.layers.42.attention.rotary_emb.inv_freq": "pytorch_model-00004-of-00005.bin",
|
| 572 |
+
"gpt_neox.layers.42.input_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 573 |
+
"gpt_neox.layers.42.input_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 574 |
+
"gpt_neox.layers.42.mlp.dense_4h_to_h.bias": "pytorch_model-00005-of-00005.bin",
|
| 575 |
+
"gpt_neox.layers.42.mlp.dense_4h_to_h.weight": "pytorch_model-00005-of-00005.bin",
|
| 576 |
+
"gpt_neox.layers.42.mlp.dense_h_to_4h.bias": "pytorch_model-00005-of-00005.bin",
|
| 577 |
+
"gpt_neox.layers.42.mlp.dense_h_to_4h.weight": "pytorch_model-00005-of-00005.bin",
|
| 578 |
+
"gpt_neox.layers.42.post_attention_layernorm.bias": "pytorch_model-00004-of-00005.bin",
|
| 579 |
+
"gpt_neox.layers.42.post_attention_layernorm.weight": "pytorch_model-00004-of-00005.bin",
|
| 580 |
+
"gpt_neox.layers.43.attention.bias": "pytorch_model-00005-of-00005.bin",
|
| 581 |
+
"gpt_neox.layers.43.attention.dense.bias": "pytorch_model-00005-of-00005.bin",
|
| 582 |
+
"gpt_neox.layers.43.attention.dense.weight": "pytorch_model-00005-of-00005.bin",
|
| 583 |
+
"gpt_neox.layers.43.attention.masked_bias": "pytorch_model-00005-of-00005.bin",
|
| 584 |
+
"gpt_neox.layers.43.attention.query_key_value.bias": "pytorch_model-00005-of-00005.bin",
|
| 585 |
+
"gpt_neox.layers.43.attention.query_key_value.weight": "pytorch_model-00005-of-00005.bin",
|
| 586 |
+
"gpt_neox.layers.43.attention.rotary_emb.inv_freq": "pytorch_model-00005-of-00005.bin",
|
| 587 |
+
"gpt_neox.layers.43.input_layernorm.bias": "pytorch_model-00005-of-00005.bin",
|
| 588 |
+
"gpt_neox.layers.43.input_layernorm.weight": "pytorch_model-00005-of-00005.bin",
|
| 589 |
+
"gpt_neox.layers.43.mlp.dense_4h_to_h.bias": "pytorch_model-00005-of-00005.bin",
|
| 590 |
+
"gpt_neox.layers.43.mlp.dense_4h_to_h.weight": "pytorch_model-00005-of-00005.bin",
|
| 591 |
+
"gpt_neox.layers.43.mlp.dense_h_to_4h.bias": "pytorch_model-00005-of-00005.bin",
|
| 592 |
+
"gpt_neox.layers.43.mlp.dense_h_to_4h.weight": "pytorch_model-00005-of-00005.bin",
|
| 593 |
+
"gpt_neox.layers.43.post_attention_layernorm.bias": "pytorch_model-00005-of-00005.bin",
|
| 594 |
+
"gpt_neox.layers.43.post_attention_layernorm.weight": "pytorch_model-00005-of-00005.bin",
|
| 595 |
+
"gpt_neox.layers.5.attention.bias": "pytorch_model-00001-of-00005.bin",
|
| 596 |
+
"gpt_neox.layers.5.attention.dense.bias": "pytorch_model-00001-of-00005.bin",
|
| 597 |
+
"gpt_neox.layers.5.attention.dense.weight": "pytorch_model-00001-of-00005.bin",
|
| 598 |
+
"gpt_neox.layers.5.attention.masked_bias": "pytorch_model-00001-of-00005.bin",
|
| 599 |
+
"gpt_neox.layers.5.attention.query_key_value.bias": "pytorch_model-00001-of-00005.bin",
|
| 600 |
+
"gpt_neox.layers.5.attention.query_key_value.weight": "pytorch_model-00001-of-00005.bin",
|
| 601 |
+
"gpt_neox.layers.5.attention.rotary_emb.inv_freq": "pytorch_model-00001-of-00005.bin",
|
| 602 |
+
"gpt_neox.layers.5.input_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 603 |
+
"gpt_neox.layers.5.input_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 604 |
+
"gpt_neox.layers.5.mlp.dense_4h_to_h.bias": "pytorch_model-00001-of-00005.bin",
|
| 605 |
+
"gpt_neox.layers.5.mlp.dense_4h_to_h.weight": "pytorch_model-00001-of-00005.bin",
|
| 606 |
+
"gpt_neox.layers.5.mlp.dense_h_to_4h.bias": "pytorch_model-00001-of-00005.bin",
|
| 607 |
+
"gpt_neox.layers.5.mlp.dense_h_to_4h.weight": "pytorch_model-00001-of-00005.bin",
|
| 608 |
+
"gpt_neox.layers.5.post_attention_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 609 |
+
"gpt_neox.layers.5.post_attention_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 610 |
+
"gpt_neox.layers.6.attention.bias": "pytorch_model-00001-of-00005.bin",
|
| 611 |
+
"gpt_neox.layers.6.attention.dense.bias": "pytorch_model-00001-of-00005.bin",
|
| 612 |
+
"gpt_neox.layers.6.attention.dense.weight": "pytorch_model-00001-of-00005.bin",
|
| 613 |
+
"gpt_neox.layers.6.attention.masked_bias": "pytorch_model-00001-of-00005.bin",
|
| 614 |
+
"gpt_neox.layers.6.attention.query_key_value.bias": "pytorch_model-00001-of-00005.bin",
|
| 615 |
+
"gpt_neox.layers.6.attention.query_key_value.weight": "pytorch_model-00001-of-00005.bin",
|
| 616 |
+
"gpt_neox.layers.6.attention.rotary_emb.inv_freq": "pytorch_model-00001-of-00005.bin",
|
| 617 |
+
"gpt_neox.layers.6.input_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 618 |
+
"gpt_neox.layers.6.input_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 619 |
+
"gpt_neox.layers.6.mlp.dense_4h_to_h.bias": "pytorch_model-00001-of-00005.bin",
|
| 620 |
+
"gpt_neox.layers.6.mlp.dense_4h_to_h.weight": "pytorch_model-00001-of-00005.bin",
|
| 621 |
+
"gpt_neox.layers.6.mlp.dense_h_to_4h.bias": "pytorch_model-00001-of-00005.bin",
|
| 622 |
+
"gpt_neox.layers.6.mlp.dense_h_to_4h.weight": "pytorch_model-00001-of-00005.bin",
|
| 623 |
+
"gpt_neox.layers.6.post_attention_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 624 |
+
"gpt_neox.layers.6.post_attention_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 625 |
+
"gpt_neox.layers.7.attention.bias": "pytorch_model-00001-of-00005.bin",
|
| 626 |
+
"gpt_neox.layers.7.attention.dense.bias": "pytorch_model-00001-of-00005.bin",
|
| 627 |
+
"gpt_neox.layers.7.attention.dense.weight": "pytorch_model-00001-of-00005.bin",
|
| 628 |
+
"gpt_neox.layers.7.attention.masked_bias": "pytorch_model-00001-of-00005.bin",
|
| 629 |
+
"gpt_neox.layers.7.attention.query_key_value.bias": "pytorch_model-00001-of-00005.bin",
|
| 630 |
+
"gpt_neox.layers.7.attention.query_key_value.weight": "pytorch_model-00001-of-00005.bin",
|
| 631 |
+
"gpt_neox.layers.7.attention.rotary_emb.inv_freq": "pytorch_model-00001-of-00005.bin",
|
| 632 |
+
"gpt_neox.layers.7.input_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 633 |
+
"gpt_neox.layers.7.input_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 634 |
+
"gpt_neox.layers.7.mlp.dense_4h_to_h.bias": "pytorch_model-00001-of-00005.bin",
|
| 635 |
+
"gpt_neox.layers.7.mlp.dense_4h_to_h.weight": "pytorch_model-00001-of-00005.bin",
|
| 636 |
+
"gpt_neox.layers.7.mlp.dense_h_to_4h.bias": "pytorch_model-00001-of-00005.bin",
|
| 637 |
+
"gpt_neox.layers.7.mlp.dense_h_to_4h.weight": "pytorch_model-00001-of-00005.bin",
|
| 638 |
+
"gpt_neox.layers.7.post_attention_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 639 |
+
"gpt_neox.layers.7.post_attention_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 640 |
+
"gpt_neox.layers.8.attention.bias": "pytorch_model-00001-of-00005.bin",
|
| 641 |
+
"gpt_neox.layers.8.attention.dense.bias": "pytorch_model-00001-of-00005.bin",
|
| 642 |
+
"gpt_neox.layers.8.attention.dense.weight": "pytorch_model-00001-of-00005.bin",
|
| 643 |
+
"gpt_neox.layers.8.attention.masked_bias": "pytorch_model-00001-of-00005.bin",
|
| 644 |
+
"gpt_neox.layers.8.attention.query_key_value.bias": "pytorch_model-00001-of-00005.bin",
|
| 645 |
+
"gpt_neox.layers.8.attention.query_key_value.weight": "pytorch_model-00001-of-00005.bin",
|
| 646 |
+
"gpt_neox.layers.8.attention.rotary_emb.inv_freq": "pytorch_model-00001-of-00005.bin",
|
| 647 |
+
"gpt_neox.layers.8.input_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 648 |
+
"gpt_neox.layers.8.input_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 649 |
+
"gpt_neox.layers.8.mlp.dense_4h_to_h.bias": "pytorch_model-00001-of-00005.bin",
|
| 650 |
+
"gpt_neox.layers.8.mlp.dense_4h_to_h.weight": "pytorch_model-00001-of-00005.bin",
|
| 651 |
+
"gpt_neox.layers.8.mlp.dense_h_to_4h.bias": "pytorch_model-00001-of-00005.bin",
|
| 652 |
+
"gpt_neox.layers.8.mlp.dense_h_to_4h.weight": "pytorch_model-00001-of-00005.bin",
|
| 653 |
+
"gpt_neox.layers.8.post_attention_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 654 |
+
"gpt_neox.layers.8.post_attention_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 655 |
+
"gpt_neox.layers.9.attention.bias": "pytorch_model-00001-of-00005.bin",
|
| 656 |
+
"gpt_neox.layers.9.attention.dense.bias": "pytorch_model-00001-of-00005.bin",
|
| 657 |
+
"gpt_neox.layers.9.attention.dense.weight": "pytorch_model-00001-of-00005.bin",
|
| 658 |
+
"gpt_neox.layers.9.attention.masked_bias": "pytorch_model-00001-of-00005.bin",
|
| 659 |
+
"gpt_neox.layers.9.attention.query_key_value.bias": "pytorch_model-00001-of-00005.bin",
|
| 660 |
+
"gpt_neox.layers.9.attention.query_key_value.weight": "pytorch_model-00001-of-00005.bin",
|
| 661 |
+
"gpt_neox.layers.9.attention.rotary_emb.inv_freq": "pytorch_model-00001-of-00005.bin",
|
| 662 |
+
"gpt_neox.layers.9.input_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 663 |
+
"gpt_neox.layers.9.input_layernorm.weight": "pytorch_model-00001-of-00005.bin",
|
| 664 |
+
"gpt_neox.layers.9.mlp.dense_4h_to_h.bias": "pytorch_model-00001-of-00005.bin",
|
| 665 |
+
"gpt_neox.layers.9.mlp.dense_4h_to_h.weight": "pytorch_model-00001-of-00005.bin",
|
| 666 |
+
"gpt_neox.layers.9.mlp.dense_h_to_4h.bias": "pytorch_model-00001-of-00005.bin",
|
| 667 |
+
"gpt_neox.layers.9.mlp.dense_h_to_4h.weight": "pytorch_model-00001-of-00005.bin",
|
| 668 |
+
"gpt_neox.layers.9.post_attention_layernorm.bias": "pytorch_model-00001-of-00005.bin",
|
| 669 |
+
"gpt_neox.layers.9.post_attention_layernorm.weight": "pytorch_model-00001-of-00005.bin"
|
| 670 |
+
}
|
| 671 |
+
}
|
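The weight_map above is how 🤗 Transformers knows which of the five shard files holds each tensor, so loading can open only the shards it actually needs. A minimal sketch of the lookup, assuming the index and shard files have been downloaded into a local directory named checkpoint/ (the directory name and the chosen tensor are illustrative, not part of this repo):

import json

import torch

# Resolve a single tensor from the sharded checkpoint via the index file.
# "checkpoint/" is an assumed local directory holding the index and shards.
with open("checkpoint/pytorch_model.bin.index.json") as f:
    index = json.load(f)

name = "gpt_neox.layers.14.mlp.dense_4h_to_h.weight"
shard_file = index["weight_map"][name]  # -> "pytorch_model-00002-of-00005.bin"
state_dict = torch.load(f"checkpoint/{shard_file}", map_location="cpu")
print(name, tuple(state_dict[name].shape))

In practice AutoModelForCausalLM.from_pretrained performs this resolution shard by shard, so only one shard at a time has to sit in memory alongside the model being populated.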
special_tokens_map.json
ADDED
@@ -0,0 +1,5 @@
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
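Since bos_token, eos_token, and unk_token all point at the same <|endoftext|> string, the three roles resolve to a single vocabulary id. A quick sketch of the check, using the repo id from name_or_path in the tokenizer config below:

from transformers import AutoTokenizer

# bos, eos, and unk are all "<|endoftext|>", so their ids should coincide.
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
assert tok.bos_token_id == tok.eos_token_id == tok.unk_token_id
print(tok.eos_token_id)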
tokenizer.json
ADDED
The diff for this file is too large to render.
See raw diff
tokenizer_config.json
ADDED
@@ -0,0 +1,9 @@
{
  "add_prefix_space": false,
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "name_or_path": "EleutherAI/gpt-neox-20b",
  "special_tokens_map_file": "/root/.cache/huggingface/transformers/d9026dc928c47ac2a72d46fea7db959acc4bacac2176bf32be5c331604b77d32.3ae9ae72462581d20e36bc528e9c47bb30cd671bb21add40ca0b24a0be9fac22",
  "tokenizer_class": "GPTNeoXTokenizer",
  "unk_token": "<|endoftext|>"
}
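AutoTokenizer reads this config and resolves "GPTNeoXTokenizer" to the fast GPT-NeoX tokenizer backed by tokenizer.json. Note that no pad token is defined here; a sketch of the common workaround for batched fine-tuning, reusing eos for padding (the input strings are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
# tokenizer_config.json defines no pad_token; reuse eos so that padded
# batches can be built during fine-tuning.
tokenizer.pad_token = tokenizer.eos_token
batch = tokenizer(["short text", "a somewhat longer text"], padding=True)
print(batch["input_ids"])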
train.json
ADDED
The diff for this file is too large to render.
See raw diff
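Although train.json is too large for the diff viewer, it can be opened directly with 🤗 Datasets' generic JSON loader. A sketch of the top-level load only, since the record schema is not rendered above:

from datasets import load_dataset

# Load the uploaded training file; "json" is the generic loader and the
# resulting DatasetDict gets a default "train" split.
dataset = load_dataset("json", data_files="train.json")
print(dataset)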