# Build and deploy your own chat application

This tutorial will guide you from end to end on how to deploy your own chat application using Hugging Face Inference Endpoints. We will use Gradio to create a chat interface and an OpenAI client to connect to the Inference Endpoint.

<Tip>

This tutorial uses Python, but your client can be any language that can make HTTP requests. The model and engine you deploy on Inference Endpoints use the **OpenAI Chat Completions format**, so you can use any [OpenAI client](https://platform.openai.com/docs/libraries) to connect to them, in languages like JavaScript, Java, and Go.

</Tip>

## Create your Inference Endpoint

First, we need to create an Inference Endpoint for a model that can chat.

Start by navigating to the Inference Endpoints UI. Once you have logged in, you should see a button for creating a new Inference Endpoint. Click the "New" button.

From there you'll be directed to the Model Catalog, which consists of popular models with tuned configurations that work as one-click deploys. You can filter by name, task, hardware price, and much more.

In this example let's deploy the [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) model. You can find it by searching for `qwen3 1.7b` in the search field and deploy it by clicking the card.

Next, we'll choose the hardware and deployment settings. Since this is a catalog model, all of the pre-selected options are good defaults, so in this case we don't need to change anything. If you want a deeper dive on what the different settings mean, you can check out the [configuration guide](./guides/configuration).

For this model the Nvidia L4 is the recommended choice: it's performant but still reasonably priced, which makes it perfect for our testing. Also note that by default the endpoint will scale down to zero, meaning it will become idle after 1 hour of inactivity.

Now all you need to do is click "Create Endpoint" 🚀

Our Inference Endpoint is now initializing, which usually takes about 3-5 minutes. If you want, you can allow browser notifications, which will give you a ping once the endpoint reaches a running state.

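If you'd rather wait for the endpoint from code instead of watching the browser, the `huggingface_hub` library can poll it for you. Below is a minimal sketch, assuming your endpoint is named `qwen3-1-7b-xll` (use the name shown on your endpoint's Overview page) and that you are authenticated, for example via the `HF_TOKEN` environment variable:

```python
from huggingface_hub import get_inference_endpoint

# Fetch the endpoint by name (replace with the name shown in your endpoint's Overview)
endpoint = get_inference_endpoint("qwen3-1-7b-xll")

# Block until the endpoint reaches the "running" state, or raise if it takes longer than 10 minutes
endpoint.wait(timeout=600)

print(endpoint.status)  # should print "running"
print(endpoint.url)     # the base URL we'll use in the next steps
```
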
## Test your Inference Endpoint in the browser

Now that we've created our Inference Endpoint, we can test it in the playground section.

You can use the model through a chat interface or copy code snippets to use it in your own application.

## Get your Inference Endpoint details

Next, we need to grab the details of our Inference Endpoint, which we can find in the Endpoint's [Overview](https://endpoints.huggingface.co/). We will need the following details:

- The base URL of the endpoint plus the version of the OpenAI API (e.g. `https://<id>.<region>.<cloud>.endpoints.huggingface.cloud/v1/`)
- The name of the endpoint to use (e.g. `qwen3-1-7b-xll`)
- The token to use for authentication (e.g. `hf_<token>`)

You can find your token in your [account settings](https://huggingface.co/settings/tokens), which are accessible from the top dropdown by clicking on your account name.

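Before writing any application code, you can verify these details with a quick request. Below is a minimal sketch, assuming the engine behind your endpoint exposes the OpenAI-compatible `/v1/models` route (vLLM does, for example) and that `HF_TOKEN` is set in your environment; the returned model id is the value to pass as `model` in the examples that follow:

```python
import os

import requests

# Replace with your endpoint URL + version from the Overview page
BASE_URL = "https://<id>.<region>.<cloud>.endpoints.huggingface.cloud/v1/"
headers = {"Authorization": f"Bearer {os.getenv('HF_TOKEN')}"}

# List the models served by the endpoint
response = requests.get(BASE_URL + "models", headers=headers)
response.raise_for_status()
print(response.json())
```
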
## Deploy in a few lines of code

The easiest way to deploy a chat application with [Gradio](https://gradio.app/) is to use the convenient `load_chat` method. This abstracts everything away so you can have a working chat application quickly.

```python
import os

import gradio as gr

gr.load_chat(
    base_url="<endpoint-url>/v1/",  # Replace with your endpoint URL + version
    model="endpoint-name",          # Replace with your endpoint name
    token=os.getenv("HF_TOKEN"),    # Replace with your token
).launch()
```

The `load_chat` method won't cover every production need, but it's a great way to get started and test your application.

## Build your own custom chat application

If you want more control over your chat application, you can build your own custom chat interface with Gradio. This gives you more flexibility to customize the behavior, add features, and handle errors.

Choose your preferred method for connecting to Inference Endpoints:

<hfoptions id="chat-implementation">
<hfoption id="hf-client">

**Using Hugging Face InferenceClient**

First, install the required dependencies:

```bash
pip install gradio huggingface-hub
```

The Hugging Face InferenceClient provides a clean interface that's compatible with the OpenAI API format:
```python
import os

import gradio as gr
from huggingface_hub import InferenceClient

# Initialize the Hugging Face InferenceClient
client = InferenceClient(
    base_url="<endpoint-url>/v1/",  # Replace with your endpoint URL
    token=os.getenv("HF_TOKEN")  # Use an environment variable for security
)


def chat_with_hf_client(message, history):
    # Convert Gradio history to messages format
    messages = [{"role": msg["role"], "content": msg["content"]} for msg in history]

    # Add the current message
    messages.append({"role": "user", "content": message})

    # Create chat completion
    chat_completion = client.chat.completions.create(
        model="endpoint-name",  # Use the name of your endpoint (e.g. qwen3-1-7b-xll)
        messages=messages,
        max_tokens=150,
        temperature=0.7,
    )

    # Return the response
    return chat_completion.choices[0].message.content


# Create the Gradio interface
demo = gr.ChatInterface(
    fn=chat_with_hf_client,
    type="messages",
    title="Custom Chat with Inference Endpoints",
    examples=["What is deep learning?", "Explain neural networks", "How does AI work?"]
)

if __name__ == "__main__":
    demo.launch()
```

</hfoption>
<hfoption id="openai-client">

**Using OpenAI Client**

First, install the required dependencies:

```bash
pip install gradio openai
```

Here's a basic chat function using the OpenAI client:
```python
import os

import gradio as gr
from openai import OpenAI

# Initialize the OpenAI client with your Inference Endpoint
client = OpenAI(
    base_url="<endpoint-url>/v1/",  # Replace with your endpoint URL
    api_key=os.getenv("HF_TOKEN")  # Use an environment variable for security
)


def chat_with_openai(message, history):
    # Convert Gradio history to OpenAI format
    messages = [{"role": msg["role"], "content": msg["content"]} for msg in history]

    # Add the current message
    messages.append({"role": "user", "content": message})

    # Create chat completion
    chat_completion = client.chat.completions.create(
        model="endpoint-name",  # Use the name of your endpoint (e.g. qwen3-1-7b-xll)
        messages=messages,
        max_tokens=150,
        temperature=0.7,
    )

    # Return the response
    return chat_completion.choices[0].message.content


# Create the Gradio interface
demo = gr.ChatInterface(
    fn=chat_with_openai,
    type="messages",
    title="Custom Chat with Inference Endpoints",
    examples=["What is deep learning?", "Explain neural networks", "How does AI work?"]
)

if __name__ == "__main__":
    demo.launch()
```

</hfoption>
<hfoption id="requests">

**Using Requests Library**

First, install the required dependencies:

```bash
pip install gradio requests
```

Here's a basic chat function using the requests library with the Messages API:
```python
import os

import gradio as gr
import requests

# Configure your Inference Endpoint
API_URL = "<endpoint-url>/v1/chat/completions"  # Use the chat completions endpoint
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('HF_TOKEN')}"  # Use an environment variable for security
}


def chat_with_requests(message, history):
    # Convert Gradio history to messages format
    messages = [{"role": msg["role"], "content": msg["content"]} for msg in history]

    # Add the current message
    messages.append({"role": "user", "content": message})

    # Prepare the payload using the Messages API format
    payload = {
        "model": "endpoint-name",  # Use the name of your endpoint (e.g. qwen3-1-7b-xll)
        "messages": messages,
        "max_tokens": 150,
        "temperature": 0.7,
        "stream": False
    }

    # Make the request
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()

    # Parse the response
    result = response.json()
    return result["choices"][0]["message"]["content"]


# Create the Gradio interface
demo = gr.ChatInterface(
    fn=chat_with_requests,
    type="messages",
    title="Custom Chat with Inference Endpoints",
    examples=["What is deep learning?", "Explain neural networks", "How does AI work?"]
)

if __name__ == "__main__":
    demo.launch()
```

</hfoption>
</hfoptions>

## Adding streaming support

For a better user experience, you can implement streaming responses. This requires us to handle the partial messages and `yield` them to the client as they arrive.

Here's how to add streaming to each client:

<hfoptions id="streaming-implementation">
<hfoption id="hf-client">

### Hugging Face InferenceClient Streaming

The Hugging Face InferenceClient supports streaming similar to the OpenAI client:
```python
import os

import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="<endpoint-url>/v1/",
    token=os.getenv("HF_TOKEN")
)


def chat_with_hf_streaming(message, history):
    # Convert history to messages format
    messages = [{"role": msg["role"], "content": msg["content"]} for msg in history]
    messages.append({"role": "user", "content": message})

    # Create streaming chat completion
    chat_completion = client.chat.completions.create(
        model="endpoint-name",
        messages=messages,
        max_tokens=150,
        temperature=0.7,
        stream=True  # Enable streaming
    )

    response = ""
    for chunk in chat_completion:
        if chunk.choices[0].delta.content:
            response += chunk.choices[0].delta.content
            yield response  # Yield partial response for streaming


# Create streaming interface
demo = gr.ChatInterface(
    fn=chat_with_hf_streaming,
    type="messages",
    title="Streaming Chat with Inference Endpoints"
)

demo.launch()
```

</hfoption>
<hfoption id="openai-client">

### OpenAI Client Streaming

To use streaming with the OpenAI client, we need to set `stream=True` and yield the response as it builds:
```python
import os

import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="<endpoint-url>/v1/", api_key=os.getenv("HF_TOKEN"))


def chat_with_streaming(message, history):
    # Convert history to OpenAI format
    messages = [{"role": msg["role"], "content": msg["content"]} for msg in history]
    messages.append({"role": "user", "content": message})

    chat_completion = client.chat.completions.create(
        model="endpoint-name",  # Use the name of your endpoint (e.g. qwen3-1-7b-xll)
        messages=messages,
        max_tokens=150,
        temperature=0.7,
        stream=True,  # Enable streaming
    )

    response = ""
    for chunk in chat_completion:
        if chunk.choices[0].delta.content:
            response += chunk.choices[0].delta.content
            yield response  # Yield partial response for streaming


# Create streaming interface
demo = gr.ChatInterface(
    fn=chat_with_streaming,
    type="messages",
    title="Streaming Chat with Inference Endpoints",
)

demo.launch()
```

</hfoption>
<hfoption id="requests">

### Requests Library Streaming

For requests, you can stream with the Messages API by setting `"stream": True` in the payload:
```python
import json
import os

import gradio as gr
import requests

API_URL = "https://<id>.<region>.<cloud>.endpoints.huggingface.cloud/v1/chat/completions"
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('HF_TOKEN')}",
}


def chat_with_requests_streaming(message, history):
    # Convert Gradio history to messages format
    messages = [{"role": msg["role"], "content": msg["content"]} for msg in history]
    messages.append({"role": "user", "content": message})

    # Prepare payload using the Messages API format
    payload = {
        "model": "endpoint-name",  # Use the name of your endpoint (e.g. qwen3-1-7b-xll)
        "messages": messages,
        "max_tokens": 150,
        "temperature": 0.7,
        "stream": True,  # Enable streaming
    }

    # Stream the response and accumulate the server-sent event chunks
    response = requests.post(API_URL, headers=headers, json=payload, stream=True)

    content = ""
    for line in response.iter_lines():
        line = line.decode("utf-8")
        if line.startswith("data: ") and not line.endswith("[DONE]"):
            data = json.loads(line[len("data: "):])
            chunk = data["choices"][0]["delta"].get("content", "")
            content += chunk
            yield content


# Create streaming interface
demo = gr.ChatInterface(
    fn=chat_with_requests_streaming,
    type="messages",
    title="Streaming Chat with Inference Endpoints",
)

demo.launch()
```

</hfoption>
</hfoptions>

## Deploy your chat application

Our app will run on port 7860 and look like this:

To deploy, we'll need to create a new Space and upload our files (if you'd rather script these steps, see the sketch after the note below).

1. **Create a new Space**: Go to [huggingface.co/new-space](https://huggingface.co/new-space)
2. **Choose Gradio SDK** and make it public
3. **Upload your files**: Upload `app.py`
4. **Add your token**: In the Space settings, add `HF_TOKEN` as a secret (get it from [your settings](https://huggingface.co/settings/tokens))
5. **Launch**: Your app will be live at `https://huggingface.co/spaces/your-username/your-space-name`

> **Note**: While we used an environment variable locally, Spaces requires the token to be added as a secret in the deployment environment.

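If you prefer to script these steps, the `huggingface_hub` library can create the Space, upload the app, and set the secret for you. Below is a minimal sketch, assuming a hypothetical Space id `your-username/chat-demo` and that you are authenticated locally (for example via a saved token or the `HF_TOKEN` environment variable):

```python
from huggingface_hub import HfApi

api = HfApi()
space_id = "your-username/chat-demo"  # hypothetical Space id; use your own username

# Create a public Gradio Space
api.create_repo(repo_id=space_id, repo_type="space", space_sdk="gradio")

# Upload the chat application
api.upload_file(
    path_or_fileobj="app.py",
    path_in_repo="app.py",
    repo_id=space_id,
    repo_type="space",
)

# Add the HF_TOKEN secret so the deployed app can reach the Inference Endpoint
api.add_space_secret(repo_id=space_id, key="HF_TOKEN", value="hf_<token>")
```

If your app imports extra packages such as `huggingface_hub` or `openai`, upload a `requirements.txt` listing them in the same way so the Space installs them on startup.
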
## Next steps

That's it! You now have a chat application running on Hugging Face Spaces, powered by Inference Endpoints.

Why not level up and try out the [next guide](./transcription) to build a transcription application?