Updated the look of the Gradio UI, and doubled down on the system prompt to ensure that the system refers to the documents as its knowledge base. 09ea7c4 verified gyrmo commited on Apr 8
Updated the reranker so that it stops breaking the LLM section. bcd5f13 verified gyrmo commited on Apr 1
Reducing the batch size to 1 - it'll be slower, but at least this way the model output doesn't collapse. 04ce9a3 verified gyrmo commited on Apr 1
Updated the reranker function to push the rerank model to the CPU, so that it no longer eats up the very limited GPU memory that we are working with. Also reduced the batch size to 8. 4e77426 verified gyrmo commited on Apr 1
Added the reranker to improve the quality of the nodes passed on to the query engine. 491dcef verified gyrmo commited on Apr 1
Forgot to update the package imports to bring in the extractor for the log handler 6f3f6e2 verified gyrmo commited on Mar 26
Changed the temperature to 0.5, and added a function that will extract the condensed question for analysis. bd30577 verified gyrmo commited on Mar 26
Added a prompt helper to help manage the tokens, reduced the summary size to 800 abf50ce verified gyrmo commited on Feb 26
I now have more GPU, so I have reduced the GPU utilisation to 0.8. 9b002ee verified gyrmo commited on Feb 26
Increased the max model length, and corrected the quantisation to awq_marlin. d5cb9c0 verified gyrmo commited on Feb 25
Changed max model length to 3600 to improve the KV cache issue. 6e99e67 verified gyrmo commited on Feb 25
Reduced max model length, and increased GPU utilisation to 0.9. 647e94c verified gyrmo commited on Feb 25
Specified chat mode, and made sure that the message was streamed for a nice UI action. 47362ec verified gyrmo commited on Feb 25
Reduced the GPU utilization and specified the quantization method. 7c71431 verified gyrmo commited on Feb 25
Added a memory buffer, and moved the wait llm function to the main bit for gradio. 6c941da verified gyrmo commited on Feb 25
I have added some server specifics because the gradio bit isn't starting up. c54877c verified gyrmo commited on Feb 24
Moved the embedding model to the CPU. This will allow me to have more space on the GPU for the LLM. 824aa63 verified gyrmo commited on Feb 24
Switched the model from Llama 3.3-70B to Llama-3.3-70B-Instruct-FP4. 152d1ec verified gyrmo commited on Feb 24
Switching to a pre-quantised version of Llama 3.3-70B sourced from NVIDIA. 4a09bfe verified gyrmo commited on Feb 24
Updated vllm_server to include a wait-for-vLLM step that ensures that the model is up before the chat section loads. 10f7946 verified gyrmo commited on Feb 23
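
The earliest commit above describes waiting for the vLLM server to come up before the chat section loads. The repository's actual code is not shown in this log, so the following is only a minimal sketch of one way such a wait could work. The names `server_is_up` and `wait_for_server`, the timeout and interval values, and the `http://localhost:8000/health` URL are all hypothetical; the sketch assumes the server answers HTTP 200 on a health endpoint once the model is loaded.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200 (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: treat as "not ready yet".
        return False


def wait_for_server(url, max_wait=600.0, interval=5.0,
                    probe=server_is_up, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `max_wait` seconds have passed.

    Returns True once the server responds, False if the deadline passes first.
    `probe` and `sleep` are injectable so the loop can be tested without a server.
    """
    deadline = time.monotonic() + max_wait
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Assumed address; adjust to wherever the server is actually listening.
    if wait_for_server("http://localhost:8000/health"):
        print("Server is up; safe to launch the chat UI.")
    else:
        print("Server did not come up in time.")
```

Calling something like this before the Gradio app is launched keeps the UI from accepting messages while the model is still loading. Returning a boolean on timeout, rather than blocking forever, leaves the caller free to decide whether to retry or exit.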