---
license: llama3.3
library_name: transformers
pipeline_tag: text-generation
base_model: meta-llama/Llama-3.3-70B-Instruct
tags:
- llama
- llama-3
- code
- instruct
- fine-tuned
language:
- en
---

# Phind-70B

Phind-70B is a fine-tuned version of [Llama 3.3 70B Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), optimized for code generation, technical reasoning, and general instruction following.

## Model Details

| Attribute | Details |
|-----------|---------|
| **Base Model** | meta-llama/Llama-3.3-70B-Instruct |
| **Model Type** | Causal language model |
| **Parameters** | 70 billion |
| **Context Length** | 128K tokens |
| **Language** | English |
| **License** | Llama 3.3 Community License |

## Intended Use

Phind-70B is designed for:

- **Code generation** across multiple programming languages
- **Technical problem-solving** and debugging
- **General instruction following** and reasoning tasks
- **Multi-turn conversations** requiring context retention

## How to Use

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Phind/Phind-70B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # spread the weights across all available GPUs
)

messages = [
    {"role": "system", "content": "You are Phind, an intelligent assistant that helps with programming and technical questions."},
    {"role": "user", "content": "Write a Python function to find the longest palindromic substring."},
]

# Render the conversation with the Llama 3 chat template (see below).
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
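
For quick experiments, the high-level `pipeline` API is a more compact alternative. A minimal sketch; the sampling settings below are illustrative, not tuned values:

```python
from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="Phind/Phind-70B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain the two-pointer technique with a short example."},
]

# For message-list inputs, the chat template is applied automatically.
result = pipe(messages, max_new_tokens=512, do_sample=True, temperature=0.7)
print(result[0]["generated_text"][-1]["content"])
```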

## Chat Template

This model uses the Llama 3 chat format:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_response}<|eot_id|>
```
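
You normally do not need to build this string by hand: `apply_chat_template` renders it from a message list. To inspect the rendered prompt (assuming the tokenizer ships the standard Llama 3 template):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Phind/Phind-70B")

messages = [
    {"role": "system", "content": "You are Phind, an intelligent assistant."},
    {"role": "user", "content": "What is a hash map?"},
]

# tokenize=False returns the formatted string instead of token IDs;
# add_generation_prompt=True appends the assistant header so the model responds next.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```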

## Hardware Requirements

| Precision | Approx. VRAM (weights only) |
|-----------|-----------------------------|
| FP16/BF16 | ~140 GB |
| INT8 | ~70 GB |
| INT4 | ~35 GB |

For inference, we recommend tensor parallelism across multiple GPUs at full precision, or a quantized version for consumer hardware; sketches of both approaches follow.
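
For the multi-GPU path, a serving engine such as vLLM handles tensor parallelism with a single argument. A minimal sketch, assuming a node with 4 GPUs (`tensor_parallel_size` should match your GPU count):

```python
from vllm import LLM, SamplingParams

# Shard the model across 4 GPUs with tensor parallelism.
llm = LLM(model="Phind/Phind-70B", tensor_parallel_size=4)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)
outputs = llm.chat(
    [{"role": "user", "content": "Write a binary search function in Python."}],
    params,
)
print(outputs[0].outputs[0].text)
```

For the quantized path, 4-bit weights can be loaded through bitsandbytes. A minimal sketch; the quantization settings below are common defaults, not values validated for this model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "Phind/Phind-70B"

# Illustrative 4-bit configuration; requires the bitsandbytes package.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```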

## Limitations

- May occasionally generate incorrect or misleading information
- Not suitable for production use without additional safety measures
- Performance may vary on tasks outside the training distribution
- Should not be used for generating harmful, illegal, or unethical content

## Acknowledgments

This model builds upon the excellent work by Meta on the Llama 3.3 model family. We are grateful for their contributions to open-source AI.