| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Build and deploy your own chat application","local":"build-and-deploy-your-own-chat-application","sections":[{"title":"Create your Inference Endpoint","local":"create-your-inference-endpoint","sections":[],"depth":2},{"title":"Test your Inference Endpoint in the browser","local":"test-your-inference-endpoint-in-the-browser","sections":[],"depth":2},{"title":"Get your Inference Endpoint details","local":"get-your-inference-endpoint-details","sections":[],"depth":2},{"title":"Deploy in a few lines of code","local":"deploy-in-a-few-lines-of-code","sections":[],"depth":2},{"title":"Build your own custom chat application","local":"build-your-own-custom-chat-application","sections":[],"depth":2},{"title":"Adding Streaming Support","local":"adding-streaming-support","sections":[{"title":"Hugging Face InferenceClient Streaming","local":"hugging-face-inferenceclient-streaming","sections":[],"depth":3},{"title":"OpenAI Client Streaming","local":"openai-client-streaming","sections":[],"depth":3},{"title":"Requests Library Streaming","local":"requests-library-streaming","sections":[],"depth":3}],"depth":2},{"title":"Deploy your chat application","local":"deploy-your-chat-application","sections":[],"depth":2},{"title":"Next steps","local":"next-steps","sections":[],"depth":2}],"depth":1}"> | |
<p></p> <h1 class="relative group"><a id="build-and-deploy-your-own-chat-application" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#build-and-deploy-your-own-chat-application"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em"
preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Build and deploy your own chat application</span></h1> <p data-svelte-h="svelte-1k65zp7">This tutorial will guide you end to end through deploying your own chat application with Hugging Face Inference Endpoints. We will use Gradio to create a chat interface and an OpenAI client to connect to the Inference Endpoint.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1t7fc3k">This tutorial uses Python, but your client can be any language that can make HTTP requests.
The models and engines you deploy on Inference Endpoints use the <strong>OpenAI Chat Completions format</strong>, so you can use any <a href="https://platform.openai.com/docs/libraries" rel="nofollow">OpenAI client</a> to connect to them, in languages like JavaScript, Java, and Go.</p></div> <h2 class="relative group"><a id="create-your-inference-endpoint" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#create-your-inference-endpoint"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Create your Inference Endpoint</span></h2> <p data-svelte-h="svelte-15thrvo">First, we need to create an Inference Endpoint for a model that can chat.</p> <p data-svelte-h="svelte-147h3qd">Start by navigating to the Inference Endpoints UI. Once you have logged in, you should see a button for creating a new Inference
Endpoint. Click the “New” button.</p> <p data-svelte-h="svelte-dnyg4"><img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/quick_start/1-new-button.png" alt="new-button"></p> <p data-svelte-h="svelte-11utja6">From there you’ll be directed to the Model Catalog, a collection of popular models with tuned configurations that work as one-click deploys. You can filter by name, task, hardware price, and much more.</p> <p data-svelte-h="svelte-fxecmn"><img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/quick_start/2-catalog.png" alt="catalog"></p> <p data-svelte-h="svelte-1hfg5l1">In this example let’s deploy the <a href="https://huggingface.co/Qwen/Qwen3-1.7B" rel="nofollow">Qwen/Qwen3-1.7B</a> model. You can find it by searching for <code>qwen3 1.7b</code> in the search field and deploy it by clicking the card.</p> <p data-svelte-h="svelte-lar0em"><img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/tutorials/chatbot/qwen-search.png" alt="qwen"></p> <p data-svelte-h="svelte-1ekjdqo">Next, we’ll choose the hardware and deployment settings. Since this is a catalog model, the pre-selected options are good defaults, so in this case we don’t need to change anything. For a deeper dive into what the different settings mean, check out the <a href="./guides/configuration">configuration guide</a>.</p> <p data-svelte-h="svelte-1baa6m9">For this model, the Nvidia L4 is the recommended choice: performant enough for our testing, but still reasonably priced. Also note that by default the endpoint will scale down to zero, meaning it will become idle after 1h of inactivity.</p> <p data-svelte-h="svelte-uhpblt">Now all you need to do is click “Create Endpoint” 🚀</p> <p data-svelte-h="svelte-13bbzr1"><img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/tutorials/chatbot/config.png" alt="config"></p> <p data-svelte-h="svelte-1ujocpz">Now our Inference Endpoint is initializing, which usually takes about 3-5 minutes. If you like, you can allow browser notifications, which will give you a
| ping once the endpoint reaches a running state.</p> <p data-svelte-h="svelte-1vjlki5"><img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/tutorials/chatbot/init.png" alt="init"></p> <h2 class="relative group"><a id="test-your-inference-endpoint-in-the-browser" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#test-your-inference-endpoint-in-the-browser"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Test your Inference Endpoint in the browser</span></h2> <p data-svelte-h="svelte-19rlqt8">Now that we’ve created our Inference Endpoint, we can test it in the playground section.</p> <p data-svelte-h="svelte-1ka9eqz"><img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/tutorials/chatbot/playground.png" alt="playground"></p> <p data-svelte-h="svelte-1pdktab">You can use the model through a chat interface or copy code snippets to use it in your own application.</p> <h2 class="relative group"><a id="get-your-inference-endpoint-details" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 
with-hover:group-hover:opacity-100 with-hover:right-full" href="#get-your-inference-endpoint-details"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Get your Inference Endpoint details</span></h2> <p data-svelte-h="svelte-1lccfab">We need to grab details of our Inference Endpoint, which we can find in the Endpoint’s <a href="https://endpoints.huggingface.co/" rel="nofollow">Overview</a>. We will need the following details:</p> <ul data-svelte-h="svelte-zmd562"><li>The base URL of the endpoint plus the version of the OpenAI API (e.g. <code>https://<id>.<region>.<cloud>.endpoints.huggingface.cloud/v1/</code>)</li> <li>The name of the endpoint to use (e.g. <code>qwen3-1-7b-xll</code>)</li> <li>The token to use for authentication (e.g. 
<code>hf_<token></code>)</li></ul> <p data-svelte-h="svelte-2v5kr8"><img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/tutorials/chatbot/endpoint-page.png" alt="endpoint-details"></p> <p data-svelte-h="svelte-y2qnlj">You can find your token in your <a href="https://huggingface.co/settings/tokens" rel="nofollow">account settings</a>, which are accessible from the top dropdown by clicking on your account name.</p> <h2 class="relative group"><a id="deploy-in-a-few-lines-of-code" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#deploy-in-a-few-lines-of-code"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Deploy in a few lines of code</span></h2> <p data-svelte-h="svelte-bfky2a">The easiest way to deploy a chat application with <a href="https://gradio.app/" rel="nofollow">Gradio</a> is to use the convenient <code>load_chat</code> method.
This abstracts everything away, so you can have a working chat application up and running quickly.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr

gr.load_chat(
    base_url=<span class="hljs-string">"<endpoint-url>/v1/"</span>, <span class="hljs-comment"># Replace with your endpoint URL + version</span>
    model=<span class="hljs-string">"endpoint-name"</span>, <span class="hljs-comment"># Replace with your endpoint name</span>
    token=os.getenv(<span class="hljs-string">"HF_TOKEN"</span>), <span class="hljs-comment"># Replace with your token</span>
).launch()<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1gpt8m1">The <code>load_chat</code> method won’t cover every production need, but it’s a great way to get started and test your application.</p> <h2 class="relative group"><a id="build-your-own-custom-chat-application" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#build-your-own-custom-chat-application"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Build your own custom chat application</span></h2> <p data-svelte-h="svelte-1krh3b">If you want more control over your chat application, you can build your own custom chat interface with Gradio.
This gives you more flexibility to customize the behavior, add features, and handle errors.</p> <p data-svelte-h="svelte-4kkdzr">Choose your preferred method for connecting to Inference Endpoints:</p> <div class="flex space-x-2 items-center my-1.5 mr-8 h-7 !pl-0 -mx-3 md:mx-0"><div class="flex items-center border rounded-lg px-1.5 py-1 leading-none select-none text-smd border-gray-800 bg-black dark:bg-gray-700 text-white">hf-client </div><div class="flex items-center border rounded-lg px-1.5 py-1 leading-none select-none text-smd text-gray-500 cursor-pointer opacity-90 hover:text-gray-700 dark:hover:text-gray-200 hover:shadow-sm">openai-client </div><div class="flex items-center border rounded-lg px-1.5 py-1 leading-none select-none text-smd text-gray-500 cursor-pointer opacity-90 hover:text-gray-700 dark:hover:text-gray-200 hover:shadow-sm">requests </div></div> <div class="language-select"><p data-svelte-h="svelte-15fpcnt"><strong>Using Hugging Face InferenceClient</strong></p> <p data-svelte-h="svelte-1f3oki6">First, install the required dependencies:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 
translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip install gradio huggingface-hub<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-u7e6qd">The Hugging Face InferenceClient provides a clean interface that’s compatible with the OpenAI API format:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr
<span class="hljs-keyword">from</span> huggingface_hub <span class="hljs-keyword">import</span> InferenceClient

<span class="hljs-comment"># Initialize the Hugging Face InferenceClient</span>
client = InferenceClient(
    base_url=<span class="hljs-string">"<endpoint-url>/v1/"</span>, <span class="hljs-comment"># Replace with your endpoint URL</span>
    token=os.getenv(<span class="hljs-string">"HF_TOKEN"</span>) <span class="hljs-comment"># Use environment variable for security</span>
)

<span class="hljs-keyword">def</span> <span class="hljs-title function_">chat_with_hf_client</span>(<span class="hljs-params">message, history</span>):
    <span class="hljs-comment"># Convert Gradio history to messages format</span>
    messages = [{<span class="hljs-string">"role"</span>: msg[<span class="hljs-string">"role"</span>], <span class="hljs-string">"content"</span>: msg[<span class="hljs-string">"content"</span>]} <span class="hljs-keyword">for</span> msg <span class="hljs-keyword">in</span> history]

    <span class="hljs-comment"># Add the current message</span>
    messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: message})

    <span class="hljs-comment"># Create chat completion</span>
    chat_completion = client.chat.completions.create(
        model=<span class="hljs-string">"endpoint-name"</span>, <span class="hljs-comment"># Use the name of your endpoint (i.e. qwen3-1.7b-instruct-xxxx)</span>
        messages=messages,
        max_tokens=<span class="hljs-number">150</span>,
        temperature=<span class="hljs-number">0.7</span>,
    )

    <span class="hljs-comment"># Return the response</span>
    <span class="hljs-keyword">return</span> chat_completion.choices[<span class="hljs-number">0</span>].message.content

<span class="hljs-comment"># Create the Gradio interface</span>
demo = gr.ChatInterface(
    fn=chat_with_hf_client,
    <span class="hljs-built_in">type</span>=<span class="hljs-string">"messages"</span>,
    title=<span class="hljs-string">"Custom Chat with Inference Endpoints"</span>,
    examples=[<span class="hljs-string">"What is deep learning?"</span>, <span class="hljs-string">"Explain neural networks"</span>, <span class="hljs-string">"How does AI work?"</span>]
)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    demo.launch()<!-- HTML_TAG_END --></pre></div> </div> <h2 class="relative group"><a id="adding-streaming-support" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#adding-streaming-support"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Adding Streaming Support</span></h2> <p data-svelte-h="svelte-7z4ymg">For a better user experience, you can implement streaming responses.
This will require us to handle the messages and <code>yield</code> them to the client.</p> <p data-svelte-h="svelte-ttzscb">Here’s how to add streaming to each client:</p> <div class="flex space-x-2 items-center my-1.5 mr-8 h-7 !pl-0 -mx-3 md:mx-0"><div class="flex items-center border rounded-lg px-1.5 py-1 leading-none select-none text-smd border-gray-800 bg-black dark:bg-gray-700 text-white">hf-client </div><div class="flex items-center border rounded-lg px-1.5 py-1 leading-none select-none text-smd text-gray-500 cursor-pointer opacity-90 hover:text-gray-700 dark:hover:text-gray-200 hover:shadow-sm">openai-client </div><div class="flex items-center border rounded-lg px-1.5 py-1 leading-none select-none text-smd text-gray-500 cursor-pointer opacity-90 hover:text-gray-700 dark:hover:text-gray-200 hover:shadow-sm">requests </div></div> <div class="language-select"> <h3 class="relative group"><a id="hugging-face-inferenceclient-streaming" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#hugging-face-inferenceclient-streaming"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Hugging Face InferenceClient Streaming</span></h3> <p 
data-svelte-h="svelte-tuciwp">The Hugging Face InferenceClient supports streaming, similar to the OpenAI client:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr
<span class="hljs-keyword">from</span> huggingface_hub <span class="hljs-keyword">import</span> InferenceClient

client = InferenceClient(
    base_url=<span class="hljs-string">"<endpoint-url>/v1/"</span>,
    token=os.getenv(<span class="hljs-string">"HF_TOKEN"</span>)
)

<span class="hljs-keyword">def</span> <span class="hljs-title function_">chat_with_hf_streaming</span>(<span class="hljs-params">message, history</span>):
    <span class="hljs-comment"># Convert history to messages format</span>
    messages = [{<span class="hljs-string">"role"</span>: msg[<span class="hljs-string">"role"</span>], <span class="hljs-string">"content"</span>: msg[<span class="hljs-string">"content"</span>]} <span class="hljs-keyword">for</span> msg <span class="hljs-keyword">in</span> history]
    messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: message})

    <span class="hljs-comment"># Create streaming chat completion</span>
    chat_completion = client.chat.completions.create(
        model=<span class="hljs-string">"endpoint-name"</span>,
        messages=messages,
        max_tokens=<span class="hljs-number">150</span>,
        temperature=<span class="hljs-number">0.7</span>,
        stream=<span class="hljs-literal">True</span> <span class="hljs-comment"># Enable streaming</span>
    )

    response = <span class="hljs-string">""</span>
    <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> chat_completion:
        <span class="hljs-keyword">if</span> chunk.choices[<span class="hljs-number">0</span>].delta.content:
            response += chunk.choices[<span class="hljs-number">0</span>].delta.content
            <span class="hljs-keyword">yield</span> response <span class="hljs-comment"># Yield partial response for streaming</span>

<span class="hljs-comment"># Create streaming interface</span>
demo = gr.ChatInterface(
    fn=chat_with_hf_streaming,
    <span class="hljs-built_in">type</span>=<span class="hljs-string">"messages"</span>,
    title=<span class="hljs-string">"Streaming Chat with Inference Endpoints"</span>
)
demo.launch()<!-- HTML_TAG_END --></pre></div> </div> <h2 class="relative group"><a id="deploy-your-chat-application" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#deploy-your-chat-application"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Deploy your chat application</span></h2> <p data-svelte-h="svelte-1n5m4re">Our app will run on port 7860 and look like this:</p> <p data-svelte-h="svelte-rhi1bh"><img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/tutorials/chatbot/app.png" alt="Gradio app"></p> <p data-svelte-h="svelte-13uvzr6">To deploy, we’ll need to create a new Space and upload our files.</p> <ol data-svelte-h="svelte-1yvydwn"><li><strong>Create a new Space</strong>: Go to <a href="https://huggingface.co/new-space" rel="nofollow">huggingface.co/new-space</a></li> <li><strong>Choose Gradio SDK</strong> and make it public</li> <li><strong>Upload your files</strong>: Upload <code>app.py</code></li> <li><strong>Add your token</strong>: In Space settings, add <code>HF_TOKEN</code> as a secret (get it from <a href="https://huggingface.co/settings/tokens" rel="nofollow">your
settings</a>)</li> <li><strong>Launch</strong>: Your app will be live at <code>https://huggingface.co/spaces/your-username/your-space-name</code></li></ol> <blockquote data-svelte-h="svelte-1gqrdse"><p><strong>Note</strong>: While we used CLI authentication locally, Spaces requires the token as a secret for the deployment environment.</p></blockquote> <h2 class="relative group"><a id="next-steps" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#next-steps"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Next steps</span></h2> <p data-svelte-h="svelte-xl8t4e">That’s it! 
You now have a chat application running on Hugging Face Spaces powered by Inference Endpoints.</p> <p data-svelte-h="svelte-19b7xtr">Why not level up and try out the <a href="./tutorials/transcription">next guide</a> to build a Text-to-Speech application?</p> <p></p>
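If you prefer to script the five deployment steps above rather than use the web UI, they can be driven with the `huggingface_hub` Python client. This is a minimal sketch: the repo id and secret value are placeholders you would replace with your own, and the `api` object is expected to be an `HfApi` instance authenticated via `huggingface-cli login`.

```python
def deploy_space(api, repo_id: str) -> str:
    """Sketch of the Space deployment steps with an HfApi-like client.

    `repo_id` is a placeholder such as "your-username/your-space-name".
    """
    # Steps 1-2: create a public Space using the Gradio SDK
    api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio")
    # Step 3: upload the app file
    api.upload_file(
        path_or_fileobj="app.py",
        path_in_repo="app.py",
        repo_id=repo_id,
        repo_type="space",
    )
    # Step 4: store the token as a secret for the deployment environment
    # (the "hf_..." value is a placeholder for your own token)
    api.add_space_secret(repo_id=repo_id, key="HF_TOKEN", value="hf_...")
    # Step 5: once built, the Space is served at this URL
    return f"https://huggingface.co/spaces/{repo_id}"
```

With a real client this would be called as `deploy_space(HfApi(), "your-username/your-space-name")`; `create_repo`, `upload_file`, and `add_space_secret` are the relevant `HfApi` methods.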
<script>
	{
		__sveltekit_1q0n26o = {
			assets: "/docs/inference-endpoints/pr_136/en",
			base: "/docs/inference-endpoints/pr_136/en",
			env: {}
		};

		const element = document.currentScript.parentElement;

		const data = [null,null];

		Promise.all([
			import("/docs/inference-endpoints/pr_136/en/_app/immutable/entry/start.fb9ab4d6.js"),
			import("/docs/inference-endpoints/pr_136/en/_app/immutable/entry/app.6247727a.js")
		]).then(([kit, app]) => {
			kit.start(app, element, {
				node_ids: [0, 22],
				data,
				form: null,
				error: null
			});
		});
	}
</script>