---
license: llama3.1
inference: false
fine-tuning: false
tags:
  - llama3.1
base_model: meta-llama/Llama-3.1-70B-Instruct
pipeline_tag: text-generation
library_name: transformers
---

# NoxtuaCompliance

Noxtua-Compliance-70B-V1 is a specialized large language model for legal compliance applications. It is finetuned from the Llama-3.1-70B-Instruct model on a custom dataset of legal cases, enabling it to handle complex contexts and produce precise results when analyzing complex legal issues.

## Model details

- **Model Name:** Noxtua-Compliance-70B-V1
- **Base Model:** Llama-3.1-70B-Instruct
- **Parameter Count:** 70 billion

## Run with vLLM

```bash
docker run --runtime nvidia --gpus=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:v0.6.6.post1 \
  --model ACATECH/ncos \
  --tensor-parallel-size=2 \
  --disable-log-requests \
  --max-model-len 120000 \
  --gpu-memory-utilization 0.95
```
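The container exposes an OpenAI-compatible API on port 8000. As a minimal sketch of how to query it (assuming the `openai` Python package is installed and the server is reachable on localhost; the prompt is illustrative):

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
# Assumes `pip install openai` and the container listening on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ACATECH/ncos",
    messages=[
        {"role": "user", "content": "Carry out an entire authority check of the following text."},
    ],
    temperature=0,  # consistent outputs, as recommended below
    max_tokens=1024,
)
print(response.choices[0].message.content)
```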

## Use with transformers

See the snippet below for usage with Transformers:

```python
import torch
import transformers

model_id = "ACATECH/ncos"

# Load the tokenizer and reuse the EOS token for padding.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Build a text-generation pipeline; device_map="auto" shards the
# model across the available GPUs.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are an intelligent AI assistant in the legal domain called Noxtua NCOS from the company Xayn. You will assist the user with care, respect and professionalism. Always answer in the same language as the question. Freely use legal jargon."},
    {"role": "user", "content": "Carry out an entire authority check of the following text."},
]

print(pipeline(messages))
```

Please consider setting `temperature = 0` to get consistent outputs.
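With the transformers pipeline, one way to get this deterministic behavior is to disable sampling at call time; a minimal sketch, reusing the `pipeline` and `messages` objects from the snippet above:

```python
# Deterministic (greedy) decoding with the pipeline built above:
# do_sample=False disables sampling, which corresponds to temperature = 0.
outputs = pipeline(messages, do_sample=False)
print(outputs[0]["generated_text"])
```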

## Framework versions

- Transformers 4.47.1
- PyTorch 2.5.1+cu121

## Quantization

If your hardware is not sufficient to run the large model, you may want to reduce its size with a quantization method. This shrinks the model so that it can run on lower-spec machines, but some of the information stored in the weights is lost in the process. Caution: we advise against using the model in quantized form, as much of the finetuned information will be lost, and we cannot guarantee that the resulting model matches the performance of the un-quantized variant provided by the domain expert.

If you decide to give the quantized model a try, you will probably not be able to load the full model into VRAM or RAM, so we recommend GGUF quantization, which can quantize the model without loading it fully into RAM. To use GGUF quantization, the model first has to be converted to the GGUF format; for this step, the llama.cpp (release: b5233) tools should be sufficient. Our model has a specialized setup that is not automatically detected by the new llama.cpp convert function, so we used the legacy version with the `--vocab-type` flag:

```
python .\llama.cpp\examples\convert_legacy_llama.py .\ncos_model_directory\ --outfile ncos.gguf --vocab-type bpe
```

The resulting file can now be quantized with less than 5 GB of working memory. Please read up on the different kinds of quantization and the parameters of each option to choose the right quantization scheme for your use case. Here, as an example, we apply an ad-hoc 4-bit quantization (`q4_0`):

.\lama.cpp\llama-quantize.exe .\ncos.gguf .\ncos-q4_0.gguf q4_0

The 4-bit version of the model is roughly 40 GB in size, which reduces the GPU requirements quite a bit; when running via the CPU option, the model can now even run on high-end consumer setups. The easiest way to set up a local deployment is probably via Ollama (version: v0.6.7), but you could also stick to the instructions above. Be sure to read up on how to set up a Modelfile according to your use case; the preferred system prompt and some additional settings can be found in the config files of the model.

```
ollama create ncos-q40 -f .\ncos-gguf\Modelfile
```
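Once the model is created, you can chat with it through Ollama's API; a minimal sketch using the `ollama` Python package (an assumption on our part: you could equally use the REST API or `ollama run ncos-q40` from the command line):

```python
# Minimal sketch: query the locally created model via the `ollama`
# Python package (assumes `pip install ollama` and a running Ollama server).
import ollama

response = ollama.chat(
    model="ncos-q40",
    messages=[
        {"role": "user", "content": "Carry out an entire authority check of the following text."},
    ],
    options={"temperature": 0},  # consistent outputs, as recommended above
)
print(response["message"]["content"])
```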

## Recommended Hardware

Running this model requires two or more 80 GB GPUs (e.g., NVIDIA A100) and at least 150 GB of free disk space.

FYI: we know that this is a "sporty" requirement. We produced the model as a proof-of-concept (PoC) implementation. However, we are planning to follow up the refinement of our model with a distilled version, conducted by domain experts who can supervise the process. To make use of the model on more limited hardware, you can of course follow standard quantization procedures as sketched in the "Quantization" section above.