Buckets:

hf-doc-build/doc-dev / cookbook /main /en /rag_with_unstructured_data.html
rtrm's picture
download
raw
49 kB
<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Building RAG with Custom Unstructured Data&quot;,&quot;local&quot;:&quot;building-rag-with-custom-unstructured-data&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Unstructured data preprocessing&quot;,&quot;local&quot;:&quot;unstructured-data-preprocessing&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Chunking&quot;,&quot;local&quot;:&quot;chunking&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Setting up the retriever&quot;,&quot;local&quot;:&quot;setting-up-the-retriever&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;RAG with LangChain&quot;,&quot;local&quot;:&quot;rag-with-langchain&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Results and next steps&quot;,&quot;local&quot;:&quot;results-and-next-steps&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}">
<link href="/docs/cookbook/main/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/entry/start.96b44205.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/scheduler.65852ee5.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/singletons.a64a46c3.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/paths.f88132ad.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/entry/app.e92a3d99.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/index.aa74147d.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/nodes/0.0809e592.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/each.e59479a4.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/nodes/40.915097f6.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/DocNotebookDropdown.479f4286.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/EditOnGithub.4eda6a96.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Building RAG with Custom Unstructured Data&quot;,&quot;local&quot;:&quot;building-rag-with-custom-unstructured-data&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Unstructured data preprocessing&quot;,&quot;local&quot;:&quot;unstructured-data-preprocessing&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Chunking&quot;,&quot;local&quot;:&quot;chunking&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Setting up the retriever&quot;,&quot;local&quot;:&quot;setting-up-the-retriever&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;RAG with LangChain&quot;,&quot;local&quot;:&quot;rag-with-langchain&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Results and next steps&quot;,&quot;local&quot;:&quot;results-and-next-steps&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <div class="flex space-x-1 absolute z-10 right-0 top-0"> <a href="https://colab.research.google.com/github/huggingface/cookbook/blob/multiagent_assist_improvements/notebooks/en/rag_with_unstructured_data.ipynb" target="_blank"><img alt="Open In Colab" class="!m-0" src="https://colab.research.google.com/assets/colab-badge.svg"></a> </div> <h1 class="relative group"><a id="building-rag-with-custom-unstructured-data" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#building-rag-with-custom-unstructured-data"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Building RAG with Custom Unstructured Data</span></h1> <p data-svelte-h="svelte-26mfp8"><em>Authored by: <a href="https://github.com/MKhalusova" rel="nofollow">Maria Khalusova</a></em></p> <p data-svelte-h="svelte-1634vfe">If you’re new to RAG, please explore the basics of RAG first in <a href="https://huggingface.co/learn/cookbook/rag_zephyr_langchain" rel="nofollow">this other notebook</a>, and then come back here to learn about building RAG with custom data.</p> <p data-svelte-h="svelte-1qci2tz">Whether you’re building your own RAG-based personal assistant, a pet project, or an enterprise RAG system, you will quickly discover that a lot of important knowledge is stored in various formats like PDFs, emails, Markdown files, PowerPoint presentations, HTML pages, Word documents, and so on.</p> <p data-svelte-h="svelte-3xwkhu">How do you preprocess all of this data in a way that you can use it for RAG?
In this quick tutorial, you’ll learn how to build a RAG system that will incorporate data from multiple data types. You’ll use <a href="https://github.com/Unstructured-IO/unstructured" rel="nofollow">Unstructured</a> for data preprocessing, open-source models from Hugging Face Hub for embeddings and text generation, ChromaDB as a vector store, and LangChain for bringing everything together.</p> <p data-svelte-h="svelte-1w15t57">Let’s go! We’ll begin by installing the required dependencies:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->!pip install -q torch transformers accelerate bitsandbytes sentence-transformers unstructured[<span class="hljs-built_in">all</span>-docs] langchain chromadb langchain_community<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-hffqba">Next, let’s get a mix of documents. Suppose, I want to build a RAG system that’ll help me manage pests in my garden. For this purpose, I’ll use diverse documents that cover the topic of IPM (integrated pest management):</p> <ul data-svelte-h="svelte-1g2z5vq"><li>PDF: <code>https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf</code></li> <li>Powerpoint: <code>https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx</code></li> <li>EPUB: <code>https://www.gutenberg.org/ebooks/45957</code></li> <li>HTML: <code>https://blog.fifthroom.com/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html</code></li></ul> <p data-svelte-h="svelte-1d3ly9c">Feel free to use your own documents for your topic of choice from the list of document types supported by Unstructured: <code>.eml</code>, <code>.html</code>, <code>.md</code>, <code>.msg</code>, <code>.rst</code>, <code>.rtf</code>, <code>.txt</code>, <code>.xml</code>, <code>.png</code>, <code>.jpg</code>, <code>.jpeg</code>, <code>.tiff</code>, <code>.bmp</code>, <code>.heic</code>, <code>.csv</code>, <code>.doc</code>, <code>.docx</code>, <code>.epub</code>, <code>.odt</code>, <code>.pdf</code>, <code>.ppt</code>, <code>.pptx</code>, <code>.tsv</code>, <code>.xlsx</code>.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->!mkdir -p <span class="hljs-string">&quot;./documents&quot;</span>
!wget https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf -O <span class="hljs-string">&quot;./documents/env-protection-pesticides-business-manuals-applic-chapter7.pdf&quot;</span>
!wget https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx -O <span class="hljs-string">&quot;./documents/Citrus_IPM_090913.pptx&quot;</span>
!wget https://www.gutenberg.org/ebooks/<span class="hljs-number">45957.</span>epub3.images -O <span class="hljs-string">&quot;./documents/45957.epub&quot;</span>
!wget https://blog.fifthroom.com/what-to-do-about-harmful-garden-<span class="hljs-keyword">and</span>-plant-insects-<span class="hljs-keyword">and</span>-pests.html -O <span class="hljs-string">&quot;./documents/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html&quot;</span><!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="unstructured-data-preprocessing" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#unstructured-data-preprocessing"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Unstructured data preprocessing</span></h2> <p data-svelte-h="svelte-3uyhm5">You can use the Unstructured library to preprocess documents one by one, and write your own script to walk through a directory, but it’s easier to use a Local source connector to ingest all documents in a given directory. Unstructured can ingest documents from local directories, S3 buckets, blob storage, SFTP, and many other places your documents might be stored in. The ingestion from those sources will be very similar differing mostly in authentication options.
Here you’ll use Local source connector, but feel free to explore other options in the <a href="https://docs.unstructured.io/open-source/ingest/source-connectors/overview" rel="nofollow">Unstructured documentation</a>.</p> <p data-svelte-h="svelte-14xbuq6">Optionally, you can also choose a <a href="https://docs.unstructured.io/open-source/ingest/destination-connectors/overview" rel="nofollow">destination</a> for the processed documents - this could be MongoDB, Pinecone, Weaviate, etc. In this notebook, we’ll keep everything local.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-comment"># Optional cell to reduce the amount of logs</span>
<span class="hljs-keyword">import</span> logging
logger = logging.getLogger(<span class="hljs-string">&quot;unstructured.ingest&quot;</span>)
logger.root.removeHandler(logger.root.handlers[<span class="hljs-number">0</span>])<!-- HTML_TAG_END --></pre></div> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-keyword">import</span> os
<span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-keyword">from</span> unstructured.ingest.connector.local <span class="hljs-keyword">import</span> SimpleLocalConfig
<span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-keyword">from</span> unstructured.ingest.interfaces <span class="hljs-keyword">import</span> PartitionConfig, ProcessorConfig, ReadConfig
<span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-keyword">from</span> unstructured.ingest.runner <span class="hljs-keyword">import</span> LocalRunner
<span class="hljs-meta">&gt;&gt;&gt; </span>output_path = <span class="hljs-string">&quot;./local-ingest-output&quot;</span>
<span class="hljs-meta">&gt;&gt;&gt; </span>runner = LocalRunner(
<span class="hljs-meta">... </span> processor_config=ProcessorConfig(
<span class="hljs-meta">... </span> <span class="hljs-comment"># logs verbosity</span>
<span class="hljs-meta">... </span> verbose=<span class="hljs-literal">True</span>,
<span class="hljs-meta">... </span> <span class="hljs-comment"># the local directory to store outputs</span>
<span class="hljs-meta">... </span> output_dir=output_path,
<span class="hljs-meta">... </span> num_processes=<span class="hljs-number">2</span>,
<span class="hljs-meta">... </span> ),
<span class="hljs-meta">... </span> read_config=ReadConfig(),
<span class="hljs-meta">... </span> partition_config=PartitionConfig(
<span class="hljs-meta">... </span> partition_by_api=<span class="hljs-literal">True</span>,
<span class="hljs-meta">... </span> api_key=<span class="hljs-string">&quot;YOUR_UNSTRUCTURED_API_KEY&quot;</span>,
<span class="hljs-meta">... </span> ),
<span class="hljs-meta">... </span> connector_config=SimpleLocalConfig(
<span class="hljs-meta">... </span> input_path=<span class="hljs-string">&quot;./documents&quot;</span>,
<span class="hljs-meta">... </span> <span class="hljs-comment"># whether to get the documents recursively from given directory</span>
<span class="hljs-meta">... </span> recursive=<span class="hljs-literal">False</span>,
<span class="hljs-meta">... </span> ),
<span class="hljs-meta">... </span>)
<span class="hljs-meta">&gt;&gt;&gt; </span>runner.run()<!-- HTML_TAG_END --></pre></div> <pre data-svelte-h="svelte-v64b1c">INFO: NumExpr defaulting to 2 threads.
</pre> <p data-svelte-h="svelte-1bdx5k2">Let’s take a closer look at the configs that we have here.</p> <p data-svelte-h="svelte-9g2j19"><code>ProcessorConfig</code> controls various aspects of the processing pipeline, including output locations, number of workers, error handling behavior, logging verbosity and more. The only mandatory parameter here is the <code>output_dir</code> - the local directory where you want to store the outputs.</p> <p data-svelte-h="svelte-1j049ps"><code>ReadConfig</code> can be used to customize the data reading process for different scenarios, such as re-downloading data, preserving downloaded files, or limiting the number of documents processed. In most cases the default <code>ReadConfig</code> will work.</p> <p data-svelte-h="svelte-45jyya">In the <code>PartitionConfig</code> you can choose whether to partition the documents locally or via API. This example uses API, and for this reason requires Unstructured API key. You can get yours <a href="https://unstructured.io/api-key-free" rel="nofollow">here</a>. The free Unstructured API is capped at 1000 pages, and offers better OCR models for image-based documents than a local installation of Unstructured.
If you remove these two parameters, the documents will be processed locally, but you may need to install additional dependencies if the documents require OCR and/or document understanding models. Namely, you may need to install poppler and tesseract in this case, which you can get with brew:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->!<span class="hljs-keyword">brew </span><span class="hljs-keyword">install </span>poppler
!<span class="hljs-keyword">brew </span><span class="hljs-keyword">install </span>tesseract<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-ewm525">If you’re on Windows, you can find alternative installation instructions in the <a href="https://docs.unstructured.io/open-source/installation/full-installation" rel="nofollow">Unstructured docs</a>.</p> <p data-svelte-h="svelte-1wt80e0">Finally, in the <code>SimpleLocalConfig</code> you need to specify where your original documents reside, and whether you want to walk through the directory recursively.</p> <p data-svelte-h="svelte-9x3e13">Once the documents are processed you’ll find 4 json files in the <code>local-ingest-output</code> directory, one per document that was processed.
Unstructured partitions all types of documents in a uniform manner, and returns json with document elements.</p> <p data-svelte-h="svelte-n92gq9"><a href="https://docs.unstructured.io/api-reference/api-services/document-elements" rel="nofollow">Document elements</a> have a type, e.g. <code>NarrativeText</code>, <code>Title</code>, or <code>Table</code>, they contain the extracted text, and metadata that Unstructured was able to obtain. Some metadata is common for all elements, such as filename of the document the element is from. Other metadata depends on file type or element type. For example, a <code>Table</code> element will contain table’s representation as html in the metadata, and metadata for emails will contain information about senders and recipients.</p> <p data-svelte-h="svelte-2t53hk">Let’s import element objects from these json files.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> unstructured.staging.base <span class="hljs-keyword">import</span> elements_from_json
elements = []
<span class="hljs-keyword">for</span> filename <span class="hljs-keyword">in</span> os.listdir(output_path):
filepath = os.path.join(output_path, filename)
elements.extend(elements_from_json(filepath))<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-x3hyox">Now that that you have extracted the elements from the documents, you can chunk them to fit the context window of the embeddings model.</p> <h2 class="relative group"><a id="chunking" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#chunking"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Chunking</span></h2> <p data-svelte-h="svelte-1mlxzmt">If you are familiar with chunking methods that split long text documents into smaller chunks, you’ll notice that Unstructured’s chunking methods slightly differ, since the partitioning step already divides an entire document into its structural elements: titles, list items, tables, text, etc. By partitioning documents this way, you can avoid a situation where unrelated pieces of text end up in the same element, and then same chunk.</p> <p data-svelte-h="svelte-1hwgds0">Now, when you chunk the document elements with Unstructured, individual elements are already small so they will only be split if they exceed the desired maximum chunk size. Otherwise, they will remain as is. You can also optionally choose to combine consecutive text elements such as list items, for instance, that will together fit within chunk size limit.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> unstructured.chunking.title <span class="hljs-keyword">import</span> chunk_by_title
chunked_elements = chunk_by_title(
elements,
<span class="hljs-comment"># maximum for chunk size</span>
max_characters=<span class="hljs-number">512</span>,
<span class="hljs-comment"># You can choose to combine consecutive elements that are too small</span>
<span class="hljs-comment"># e.g. individual list items</span>
combine_text_under_n_chars=<span class="hljs-number">200</span>,
)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1j40r14">The chunks are ready for RAG. To use them with LangChain, you can easily convert Unstructured elements to LangChain documents.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> langchain_core.documents <span class="hljs-keyword">import</span> Document
documents = []
<span class="hljs-keyword">for</span> chunked_element <span class="hljs-keyword">in</span> chunked_elements:
metadata = chunked_element.metadata.to_dict()
metadata[<span class="hljs-string">&quot;source&quot;</span>] = metadata[<span class="hljs-string">&quot;filename&quot;</span>]
<span class="hljs-keyword">del</span> metadata[<span class="hljs-string">&quot;languages&quot;</span>]
documents.append(Document(page_content=chunked_element.text, metadata=metadata))<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="setting-up-the-retriever" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#setting-up-the-retriever"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Setting up the retriever</span></h2> <p data-svelte-h="svelte-1r2fx9j">This example uses ChromaDB as a vector store and <a href="https://huggingface.co/BAAI/bge-base-en-v1.5" rel="nofollow"><code>BAAI/bge-base-en-v1.5</code></a> embeddings model, feel free to use any other vector store.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> langchain_community.vectorstores <span class="hljs-keyword">import</span> Chroma
<span class="hljs-keyword">from</span> langchain.embeddings <span class="hljs-keyword">import</span> HuggingFaceEmbeddings
<span class="hljs-keyword">from</span> langchain.vectorstores <span class="hljs-keyword">import</span> utils <span class="hljs-keyword">as</span> chromautils
<span class="hljs-comment"># ChromaDB doesn&#x27;t support complex metadata, e.g. lists, so we drop it here.</span>
<span class="hljs-comment"># If you&#x27;re using a different vector store, you may not need to do this</span>
docs = chromautils.filter_complex_metadata(documents)
embeddings = HuggingFaceEmbeddings(model_name=<span class="hljs-string">&quot;BAAI/bge-base-en-v1.5&quot;</span>)
vectorstore = Chroma.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_type=<span class="hljs-string">&quot;similarity&quot;</span>, search_kwargs={<span class="hljs-string">&quot;k&quot;</span>: <span class="hljs-number">3</span>})<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1cyhdx7">If you plan to use a gated model from the Hugging Face Hub, be it an embeddings or text generation model, you’ll need to authenticate yourself with your Hugging Face token, which you can get in your Hugging Face profile’s settings.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> huggingface_hub <span class="hljs-keyword">import</span> notebook_login
notebook_login()<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="rag-with-langchain" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#rag-with-langchain"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>RAG with LangChain</span></h2> <p data-svelte-h="svelte-3710om">Let’s bring everything together and build RAG with LangChain.
In this example we’ll be using <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct" rel="nofollow"><code>Llama-3-8B-Instruct</code></a> from Meta. To make sure it can run smoothly in the free T4 runtime from Google Colab, you’ll need to quantize it.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> langchain.prompts <span class="hljs-keyword">import</span> PromptTemplate
<span class="hljs-keyword">from</span> langchain.llms <span class="hljs-keyword">import</span> HuggingFacePipeline
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
<span class="hljs-keyword">from</span> langchain.chains <span class="hljs-keyword">import</span> RetrievalQA<!-- HTML_TAG_END --></pre></div> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->model_name = <span class="hljs-string">&quot;meta-llama/Meta-Llama-3-8B-Instruct&quot;</span>
bnb_config = BitsAndBytesConfig(
load_in_4bit=<span class="hljs-literal">True</span>, bnb_4bit_use_double_quant=<span class="hljs-literal">True</span>, bnb_4bit_quant_type=<span class="hljs-string">&quot;nf4&quot;</span>, bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids(<span class="hljs-string">&quot;&lt;|eot_id|&gt;&quot;</span>)]
text_generation_pipeline = pipeline(
model=model,
tokenizer=tokenizer,
task=<span class="hljs-string">&quot;text-generation&quot;</span>,
temperature=<span class="hljs-number">0.2</span>,
do_sample=<span class="hljs-literal">True</span>,
repetition_penalty=<span class="hljs-number">1.1</span>,
return_full_text=<span class="hljs-literal">False</span>,
max_new_tokens=<span class="hljs-number">200</span>,
eos_token_id=terminators,
)
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
prompt_template = <span class="hljs-string">&quot;&quot;&quot;
&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;
You are an assistant for answering questions using provided context.
You are given the extracted parts of a long document and a question. Provide a conversational answer.
If you don&#x27;t know the answer, just say &quot;I do not know.&quot; Don&#x27;t make up an answer.
Question: {question}
Context: {context}&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;
&quot;&quot;&quot;</span>
prompt = PromptTemplate(
input_variables=[<span class="hljs-string">&quot;context&quot;</span>, <span class="hljs-string">&quot;question&quot;</span>],
template=prompt_template,
)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever, chain_type_kwargs={<span class="hljs-string">&quot;prompt&quot;</span>: prompt})<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="results-and-next-steps" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#results-and-next-steps"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Results and next steps</span></h2> <p data-svelte-h="svelte-u2l4ip">Now that you have your RAG chain, let’s ask it about aphids. Are they a pest in my garden?</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->question = <span class="hljs-string">&quot;Are aphids a pest?&quot;</span>
qa_chain.invoke(question)[<span class="hljs-string">&quot;result&quot;</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1wuxk0l">Output:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->Yes, aphids are considered pests because they feed on the nutrient-rich liquids within plants, causing damage and potentially spreading disease. In fact, they<span class="hljs-string">&#x27;re known to multiply quickly, which is why it&#x27;</span>s essential to control them promptly. As mentioned <span class="hljs-keyword">in</span> the text, aphids can also attract ants, <span class="hljs-built_in">which</span> are attracted to the sweet, sticky substance they produce called honeydew. So, <span class="hljs-built_in">yes</span>, aphids are indeed a pest that requires attention to prevent further harm to your plants!<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1rxjghx">This looks like a promising start! Now that you know the basics of preprocessing complex unstructured data for RAG, you can continue improving upon this example. Here are some ideas:</p> <ul data-svelte-h="svelte-112y9e5"><li>You can connect to a different source to ingest the documents from, for example, an S3 bucket.</li> <li>You can add <code>return_source_documents=True</code> in the <code>qa_chain</code> arguments to make the chain return the documents that were passed to the prompt as context. This can be useful to understand what sources were used to generate the answer.</li> <li>If you want to leverage the elements metadata at the retrieval stage, consider using Hugging Face agents and creating a custom retriever tool as described in <a href="https://huggingface.co/learn/cookbook/agents#2--rag-with-iterative-query-refinement--source-selection" rel="nofollow">this other notebook</a>.</li> <li>There are many things you could do to improve search results. For instance, you could use Hybrid search instead of a single similarity-search retriever. Hybrid search combines multiple search algorithms to improve the accuracy and relevance of search results. Typically it’s a combination of keyword-based search algorithms with vector search methods.</li></ul> <p data-svelte-h="svelte-1g8numv">Have fun building RAG applications with Unstructured data!</p> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/cookbook/blob/main/notebooks/en/rag_with_unstructured_data.md" target="_blank"><span data-svelte-h="svelte-1kd6by1">&lt;</span> <span data-svelte-h="svelte-x0xyl0">&gt;</span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p>
<script>
{
__sveltekit_1l2350x = {
assets: "/docs/cookbook/main/en",
base: "/docs/cookbook/main/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/cookbook/main/en/_app/immutable/entry/start.96b44205.js"),
import("/docs/cookbook/main/en/_app/immutable/entry/app.e92a3d99.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 40],
data,
form: null,
error: null
});
});
}
</script>

Xet Storage Details

Size:
49 kB
·
Xet hash:
b7870611339cbc85cac4524d473989bfae512399a5efdf52fb2135b9bfcccaf2

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.