Instructions to use TroyDoesAI/Tiny-RAG-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use TroyDoesAI/Tiny-RAG-gguf with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="TroyDoesAI/Tiny-RAG-gguf", filename="Tiny-RAG.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use TroyDoesAI/Tiny-RAG-gguf with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf TroyDoesAI/Tiny-RAG-gguf # Run inference directly in the terminal: llama-cli -hf TroyDoesAI/Tiny-RAG-gguf
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf TroyDoesAI/Tiny-RAG-gguf # Run inference directly in the terminal: llama-cli -hf TroyDoesAI/Tiny-RAG-gguf
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf TroyDoesAI/Tiny-RAG-gguf # Run inference directly in the terminal: ./llama-cli -hf TroyDoesAI/Tiny-RAG-gguf
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf TroyDoesAI/Tiny-RAG-gguf # Run inference directly in the terminal: ./build/bin/llama-cli -hf TroyDoesAI/Tiny-RAG-gguf
Use Docker
docker model run hf.co/TroyDoesAI/Tiny-RAG-gguf
- LM Studio
- Jan
- Ollama
How to use TroyDoesAI/Tiny-RAG-gguf with Ollama:
ollama run hf.co/TroyDoesAI/Tiny-RAG-gguf
- Unsloth Studio new
How to use TroyDoesAI/Tiny-RAG-gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for TroyDoesAI/Tiny-RAG-gguf to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for TroyDoesAI/Tiny-RAG-gguf to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for TroyDoesAI/Tiny-RAG-gguf to start chatting
- Docker Model Runner
How to use TroyDoesAI/Tiny-RAG-gguf with Docker Model Runner:
docker model run hf.co/TroyDoesAI/Tiny-RAG-gguf
- Lemonade
How to use TroyDoesAI/Tiny-RAG-gguf with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull TroyDoesAI/Tiny-RAG-gguf
Run and chat with the model
lemonade run user.Tiny-RAG-gguf-{{QUANT_TAG}}List all available models
lemonade list
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf TroyDoesAI/Tiny-RAG-gguf# Run inference directly in the terminal:
llama-cli -hf TroyDoesAI/Tiny-RAG-ggufUse pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf TroyDoesAI/Tiny-RAG-gguf# Run inference directly in the terminal:
./llama-cli -hf TroyDoesAI/Tiny-RAG-ggufBuild from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf TroyDoesAI/Tiny-RAG-gguf# Run inference directly in the terminal:
./build/bin/llama-cli -hf TroyDoesAI/Tiny-RAG-ggufUse Docker
docker model run hf.co/TroyDoesAI/Tiny-RAG-ggufExperimenting with Dataset Quality to improve generations, TinyLlama is faster to prototype datasets.
Base Model : TinyLlama
Overview This model is meant to enhance adherence to provided context (e.g., for RAG applications) and reduce hallucinations, inspired by airoboros context-obedient question answer format.
Overview
The format for a contextual prompt is as follows:
Contextual-Request:
BEGININPUT
BEGINCONTEXT
[key0: value0]
[key1: value1]
... other metdata ...
ENDCONTEXT
[insert your text blocks here]
ENDINPUT
[add as many other blocks, in the exact same format]
BEGININSTRUCTION
[insert your instruction(s). The model was tuned with single questions, paragraph format, lists, etc.]
ENDINSTRUCTION
I know it's a bit verbose and annoying, but after much trial and error, using these explicit delimiters helps the model understand where to find the responses and how to associate specific sources with it.
Contextual-Request:- denotes the type of request pattern the model is to follow for consistencyBEGININPUT- denotes a new input blockBEGINCONTEXT- denotes the block of context (metadata key/value pairs) to associate with the current input blockENDCONTEXT- denotes the end of the metadata block for the current input- [text] - Insert whatever text you want for the input block, as many paragraphs as can fit in the context.
ENDINPUT- denotes the end of the current input block- [repeat as many input blocks in this format as you want]
BEGININSTRUCTION- denotes the start of the list (or one) instruction(s) to respond to for all of the input blocks above.- [instruction(s)]
ENDINSTRUCTION- denotes the end of instruction set
Here's a trivial, but important example to prove the point:
Contextual-Request:
BEGININPUT
BEGINCONTEXT
date: 2021-01-01
url: https://web.site/123
ENDCONTEXT
In a shocking turn of events, blueberries are now green, but will be sticking with the same name.
ENDINPUT
BEGININSTRUCTION
What color are bluberries? Source?
ENDINSTRUCTION
And the expected response:
### Contextual Response:
Blueberries are now green.
Source:
date: 2021-01-01
url: https://web.site/123
References in response
As shown in the example, the dataset includes many examples of including source details in the response, when the question asks for source/citation/references.
Why do this? Well, the R in RAG seems to be the weakest link in the chain. Retrieval accuracy, depending on many factors including the overall dataset size, can be quite low. This accuracy increases when retrieving more documents, but then you have the issue of actually using the retrieved documents in prompts. If you use one prompt per document (or document chunk), you know exactly which document the answer came from, so there's no issue. If, however, you include multiple chunks in a single prompt, it's useful to include the specific reference chunk(s) used to generate the response, rather than naively including references to all of the chunks included in the prompt.
For example, suppose I have two documents:
url: http://foo.bar/1
Strawberries are tasty.
url: http://bar.foo/2
The cat is blue.
If the question being asked is What color is the cat?, I would only expect the 2nd document to be referenced in the response, as the other link is irrelevant.
- Downloads last month
- 3
We're not able to determine the quantization variants.
Install from brew
# Start a local OpenAI-compatible server with a web UI: llama-server -hf TroyDoesAI/Tiny-RAG-gguf# Run inference directly in the terminal: llama-cli -hf TroyDoesAI/Tiny-RAG-gguf