csabakecskemeti committed (verified)
Commit bcf1df6 · Parent(s): 14ef358

Update README.md

Files changed (1): README.md (+2 −1)
@@ -4,9 +4,10 @@ pipeline_tag: question-answering
  The goal of the model is to provide a fine-tuned Phi2 (https://huggingface.co/microsoft/phi-2) model that has knowledge of the vintage NEXTSTEP operating system
  and is able to answer questions on the topic.
+
  The model has been trained on 35,439 question–answer pairs automatically generated from the NEXTSTEP 3.3 System Administrator
  documentation. For training-data generation, a locally running Q8-quantized Orca2 13B (https://huggingface.co/TheBloke/Orca-2-13B-GGUF)
- model was used. Training-data generation was completely unsupervised, with minimal sanity checks (such as ignoring data chunks
+ model was used. Training-data generation was completely unsupervised, with only some sanity checks (such as ignoring data chunks
  that contain fewer than 100 tokens). The maximum context size for Orca2 is 4096 tokens, so a simple rule of splitting chunks over 3500 tokens
  (to leave room for prompt instructions) was used. Chunking did not consider context (text might be split mid-context).
  The evaluation set was generated by a similar method on 1% of the raw data with Llama2 chat (https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF).
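The chunking rule described in the README (split chunks over 3500 tokens to fit Orca2's 4096-token context with room for prompt instructions, and drop chunks under 100 tokens) can be sketched roughly as below. This is a minimal illustration, not the commit's actual pipeline: the function and constant names are hypothetical, and a whitespace tokenizer stands in for the model's real tokenizer.

```python
# Hypothetical sketch of the chunking/filtering rule from the README.
# Assumptions (not from the original repo): helper names are invented,
# and tokenize() is a whitespace stand-in for a real tokenizer.

MAX_TOKENS = 3500  # leaves headroom for prompt instructions in Orca2's 4096-token context
MIN_TOKENS = 100   # sanity check: ignore chunks with fewer than 100 tokens


def tokenize(text):
    # Stand-in tokenizer; the real pipeline would use the model's tokenizer.
    return text.split()


def chunk_document(text):
    """Split raw documentation text into token chunks for QA generation.

    Note: like the process described in the README, this splits purely by
    token count and does not respect context boundaries.
    """
    tokens = tokenize(text)
    chunks = []
    for start in range(0, len(tokens), MAX_TOKENS):
        chunk = tokens[start:start + MAX_TOKENS]
        if len(chunk) >= MIN_TOKENS:  # drop too-short chunks
            chunks.append(" ".join(chunk))
    return chunks
```

Each resulting chunk would then be fed to the locally running Orca2 model to generate question–answer pairs.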