<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Code Search with Vector Embeddings and Qdrant&quot;,&quot;local&quot;:&quot;code-search-with-vector-embeddings-and-qdrant&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;The approach&quot;,&quot;local&quot;:&quot;the-approach&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3}],&quot;depth&quot;:2}">
<link href="/docs/cookbook/main/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/entry/start.96b44205.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/scheduler.65852ee5.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/singletons.a64a46c3.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/paths.f88132ad.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/entry/app.e92a3d99.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/index.aa74147d.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/nodes/0.0809e592.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/each.e59479a4.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/nodes/13.1def8a77.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/Tip.bb8ccac8.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/DocNotebookDropdown.479f4286.js">
<link rel="modulepreload" href="/docs/cookbook/main/en/_app/immutable/chunks/EditOnGithub.4eda6a96.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Code Search with Vector Embeddings and Qdrant&quot;,&quot;local&quot;:&quot;code-search-with-vector-embeddings-and-qdrant&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;The approach&quot;,&quot;local&quot;:&quot;the-approach&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3}],&quot;depth&quot;:2}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <div class="flex space-x-1 absolute z-10 right-0 top-0"> <a href="https://colab.research.google.com/github/huggingface/cookbook/blob/multiagent_assist_improvements/notebooks/en/code_search.ipynb" target="_blank"><img alt="Open In Colab" class="!m-0" src="https://colab.research.google.com/assets/colab-badge.svg"></a> </div> <h2 class="relative group"><a id="code-search-with-vector-embeddings-and-qdrant" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#code-search-with-vector-embeddings-and-qdrant"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Code Search with Vector Embeddings and 
Qdrant</span></h2> <p data-svelte-h="svelte-17bl742"><em>Authored by: <a href="https://qdrant.tech/" rel="nofollow">Qdrant Team</a></em></p> <p data-svelte-h="svelte-ae7bkg">In this notebook, we demonstrate how you can use vector embeddings to navigate a codebase, and find relevant code snippets. We’ll search codebases using natural semantic queries, and search for code based on a similar logic.</p> <p data-svelte-h="svelte-kszg01">You can check out the <a href="https://code-search.qdrant.tech/" rel="nofollow">live deployment</a> of this approach which exposes the Qdrant codebase for search with a web interface.</p> <h3 class="relative group"><a id="the-approach" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#the-approach"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>The approach</span></h3> <p data-svelte-h="svelte-1dvof7">We need two models to accomplish our goal.</p> <ul data-svelte-h="svelte-19x5x7t"><li><p>General usage neural encoder for Natural Language Processing (NLP), in our case <a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2" rel="nofollow">sentence-transformers/all-MiniLM-L6-v2</a>. 
We’ll call this the NLP model.</p></li> <li><p>Specialized embeddings for code-to-code similarity search. We’ll use the <a href="https://huggingface.co/jinaai/jina-embeddings-v2-base-code" rel="nofollow">jinaai/jina-embeddings-v2-base-code</a> model for the task. It supports English and 30 widely used programming languages, with a sequence length of up to 8192 tokens. Let’s call this the code model.</p></li></ul> <p data-svelte-h="svelte-d94nr5">To prepare our code for the NLP model, we need to convert it into a format that closely resembles natural language. The code model supports a variety of standard programming languages, so there is no need to preprocess the snippets. We can use the code as is.</p> <h2 class="relative group"><a id="installing-dependencies" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#installing-dependencies"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Installing Dependencies</span></h2> <p data-svelte-h="svelte-ut59o">Let’s install the packages we’ll work with.</p> <ul data-svelte-h="svelte-n6syf2"><li><a href="https://pypi.org/project/inflection/" rel="nofollow">inflection</a> - A string transformation library. 
It singularizes and pluralizes English words, and transforms CamelCase to underscored strings.</li> <li><a href="https://pypi.org/project/fastembed/" rel="nofollow">fastembed</a> - A CPU-first, lightweight library for generating vector embeddings. <a href="https://github.com/qdrant/fastembed#%EF%B8%8F-fastembed-on-a-gpu" rel="nofollow">GPU support is available</a>.</li> <li><a href="https://pypi.org/project/qdrant-client/" rel="nofollow">qdrant-client</a> - Official Python library to interface with the Qdrant server.</li></ul> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->%pip install inflection qdrant-client fastembed<!-- HTML_TAG_END --></pre></div> <h3 class="relative group"><a id="data-preparation" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 
with-hover:group-hover:opacity-100 with-hover:right-full" href="#data-preparation"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Data preparation</span></h3> <p data-svelte-h="svelte-1g3hp9b">Chunking the application sources into smaller parts is a non-trivial task. In general, functions, class methods, structs, enums, and all the other language-specific constructs are good candidates for chunks. They are big enough to contain some meaningful information, but small enough to be processed by embedding models with a limited context window. You can also use docstrings, comments, and other metadata to enrich the chunks with additional information.</p> <div style="text-align:center" data-svelte-h="svelte-aiydin"><img src="https://huggingface.co/datasets/Anush008/cookbook-images/resolve/main/data-chunking.png"></div> <p data-svelte-h="svelte-k9nbvr">Text-based search relies on function signatures, but code search may return smaller pieces, such as loops. 
So, if we receive a particular function signature from the NLP model and part of its implementation from the code model, we merge the results.</p> <h3 class="relative group"><a id="parsing-the-codebaase" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#parsing-the-codebaase"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Parsing the Codebase</span></h3> <p data-svelte-h="svelte-i8w8vo">We’ll use the <a href="https://github.com/qdrant/qdrant" rel="nofollow">Qdrant codebase</a> for this demo.
While this codebase uses Rust, you can use this approach with any other language. You can use a <a href="https://microsoft.github.io/language-server-protocol/" rel="nofollow">Language Server Protocol (LSP)</a> tool to build a graph of the codebase, and then extract chunks. We did our work with <a href="https://rust-analyzer.github.io/" rel="nofollow">rust-analyzer</a>. We exported the parsed codebase into the <a href="https://microsoft.github.io/language-server-protocol/specifications/lsif/0.4.0/specification/" rel="nofollow">LSIF</a> format, a standard for code intelligence data. Next, we used the LSIF data to navigate the codebase and extract the chunks.</p> <p data-svelte-h="svelte-1mfzfdl">You can use the same approach for other languages. There are <a href="https://microsoft.github.io/language-server-protocol/implementors/servers/" rel="nofollow">plenty of implementations</a> available.</p> <p data-svelte-h="svelte-15jgfrp">We then exported the chunks into JSON documents that contain not only the code itself, but also context describing the location of the code in the project.</p> <p data-svelte-h="svelte-14cl4i2">You can examine the Qdrant structures, parsed in JSON, in the <a href="https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl" rel="nofollow">structures.jsonl file</a> in our Google Cloud Storage bucket. 
Download it and use it as a source of data for our code search.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->!wget https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-6aidpg">Next, load the file and parse the lines into a list of dictionaries:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path 
d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> json
structures = []
<span class="hljs-comment"># Each line of the JSONL file is one JSON document describing a code chunk</span>
<span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(<span class="hljs-string">&quot;structures.jsonl&quot;</span>, <span class="hljs-string">&quot;r&quot;</span>) <span class="hljs-keyword">as</span> fp:
    <span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> fp:
        entry = json.loads(row)
        structures.append(entry)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-y6ssjb">Let’s see what one entry looks like.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->structures[<span class="hljs-number">0</span>]<!-- HTML_TAG_END --></pre></div> <h3 class="relative group"><a id="code-to-natural-language-conversion" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#code-to-natural-language-conversion"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 
1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Code to natural language conversion</span></h3> <p data-svelte-h="svelte-1oie9o1">Each programming language has its own syntax which is not a part of the natural language. Thus, a general-purpose model probably does not understand the code as is. We can, however, normalize the data by removing code specifics and including additional context, such as module, class, function, and file name. We take the following steps:</p> <ol data-svelte-h="svelte-xx6zbg"><li>Extract the signature of the function, method, or other code construct.</li> <li>Divide camel case and snake case names into separate words.</li> <li>Take the docstring, comments, and other important metadata.</li> <li>Build a sentence from the extracted data using a predefined template.</li> <li>Remove the special characters and replace them with spaces.</li></ol> <p data-svelte-h="svelte-1k8bmra">We can now define the <code>textify</code> function that uses the <code>inflection</code> library to carry out our conversions:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path 
d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> inflection
<span class="hljs-keyword">import</span> re
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> <span class="hljs-type">Dict</span>, <span class="hljs-type">Any</span>
<span class="hljs-keyword">def</span> <span class="hljs-title function_">textify</span>(<span class="hljs-params">chunk: <span class="hljs-type">Dict</span>[<span class="hljs-built_in">str</span>, <span class="hljs-type">Any</span>]</span>) -&gt; <span class="hljs-built_in">str</span>:
    <span class="hljs-comment"># Get rid of all the camel case / snake case</span>
    <span class="hljs-comment"># - inflection.underscore changes the camel case to snake case</span>
    <span class="hljs-comment"># - inflection.humanize converts the snake case to human readable form</span>
    name = inflection.humanize(inflection.underscore(chunk[<span class="hljs-string">&quot;name&quot;</span>]))
    signature = inflection.humanize(inflection.underscore(chunk[<span class="hljs-string">&quot;signature&quot;</span>]))
    <span class="hljs-comment"># Check if docstring is provided</span>
    docstring = <span class="hljs-string">&quot;&quot;</span>
    <span class="hljs-keyword">if</span> chunk[<span class="hljs-string">&quot;docstring&quot;</span>]:
        docstring = <span class="hljs-string">f&quot;that does <span class="hljs-subst">{chunk[<span class="hljs-string">&#x27;docstring&#x27;</span>]}</span> &quot;</span>
    <span class="hljs-comment"># Extract the location of that snippet of code</span>
    context = (
        <span class="hljs-string">f&quot;module <span class="hljs-subst">{chunk[<span class="hljs-string">&#x27;context&#x27;</span>][<span class="hljs-string">&#x27;module&#x27;</span>]}</span> &quot;</span>
        <span class="hljs-string">f&quot;file <span class="hljs-subst">{chunk[<span class="hljs-string">&#x27;context&#x27;</span>][<span class="hljs-string">&#x27;file_name&#x27;</span>]}</span>&quot;</span>
    )
    <span class="hljs-keyword">if</span> chunk[<span class="hljs-string">&quot;context&quot;</span>][<span class="hljs-string">&quot;struct_name&quot;</span>]:
        struct_name = inflection.humanize(inflection.underscore(chunk[<span class="hljs-string">&quot;context&quot;</span>][<span class="hljs-string">&quot;struct_name&quot;</span>]))
        context = <span class="hljs-string">f&quot;defined in struct <span class="hljs-subst">{struct_name}</span> <span class="hljs-subst">{context}</span>&quot;</span>
    <span class="hljs-comment"># Combine all the bits and pieces together</span>
    text_representation = (
        <span class="hljs-string">f&quot;<span class="hljs-subst">{chunk[<span class="hljs-string">&#x27;code_type&#x27;</span>]}</span> <span class="hljs-subst">{name}</span> &quot;</span>
        <span class="hljs-string">f&quot;<span class="hljs-subst">{docstring}</span>&quot;</span>
        <span class="hljs-string">f&quot;defined as <span class="hljs-subst">{signature}</span> &quot;</span>
        <span class="hljs-string">f&quot;<span class="hljs-subst">{context}</span>&quot;</span>
    )
    <span class="hljs-comment"># Remove any special characters and concatenate the tokens</span>
    tokens = re.split(<span class="hljs-string">r&quot;\W&quot;</span>, text_representation)
    tokens = <span class="hljs-built_in">filter</span>(<span class="hljs-keyword">lambda</span> x: x, tokens)
    <span class="hljs-keyword">return</span> <span class="hljs-string">&quot; &quot;</span>.join(tokens)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1fo0w5p">Now we can use <code>textify</code> to convert all chunks into text representations:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->text_representations = <span class="hljs-built_in">list</span>(<span class="hljs-built_in">map</span>(textify, structures))<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-3wwgsm">Let’s see what one of our representations looks like:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" 
type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->text_representations[<span class="hljs-number">1000</span>]<!-- HTML_TAG_END --></pre></div> <h3 class="relative group"><a id="natural-language-embeddings" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#natural-language-embeddings"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 
56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Natural language embeddings</span></h3> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> fastembed <span class="hljs-keyword">import</span> TextEmbedding
batch_size = <span class="hljs-number">5</span>
nlp_model = TextEmbedding(<span class="hljs-string">&quot;sentence-transformers/all-MiniLM-L6-v2&quot;</span>, threads=<span class="hljs-number">0</span>)
nlp_embeddings = nlp_model.embed(text_representations, batch_size=batch_size)<!-- HTML_TAG_END --></pre></div> <h3 class="relative group"><a id="code-embeddings" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#code-embeddings"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Code Embeddings</span></h3> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 
leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->code_snippets = [structure[<span class="hljs-string">&quot;context&quot;</span>][<span class="hljs-string">&quot;snippet&quot;</span>] <span class="hljs-keyword">for</span> structure <span class="hljs-keyword">in</span> structures]
code_model = TextEmbedding(<span class="hljs-string">&quot;jinaai/jina-embeddings-v2-base-code&quot;</span>)
code_embeddings = code_model.embed(code_snippets, batch_size=batch_size)<!-- HTML_TAG_END --></pre></div> <h3 class="relative group"><a id="building-qdrant-collection" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#building-qdrant-collection"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Building Qdrant collection</span></h3> <p data-svelte-h="svelte-8h1xvl">Qdrant supports multiple modes of deployment, including in-memory for prototyping, Docker, and Qdrant Cloud. Refer to the <a href="https://qdrant.tech/documentation/guides/installation/" rel="nofollow">installation instructions</a> for more information.</p> <p data-svelte-h="svelte-5jc3sw">We’ll continue the tutorial using an in-memory instance.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1xbb3g2">In-memory mode should only be used for quick prototyping and tests. 
It is a Python implementation of the Qdrant server methods.</p></div> <p data-svelte-h="svelte-1fvhhbg">Let’s create a collection to store our vectors.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> qdrant_client <span class="hljs-keyword">import</span> QdrantClient, models
COLLECTION_NAME = <span class="hljs-string">&quot;qdrant-sources&quot;</span>
client = QdrantClient(<span class="hljs-string">&quot;:memory:&quot;</span>) <span class="hljs-comment"># Use in-memory storage</span>
<span class="hljs-comment"># client = QdrantClient(&quot;http://localhost:6333&quot;) # For Qdrant server</span>
client.create_collection(
    COLLECTION_NAME,
    vectors_config={
        <span class="hljs-string">&quot;text&quot;</span>: models.VectorParams(
            size=<span class="hljs-number">384</span>,
            distance=models.Distance.COSINE,
        ),
        <span class="hljs-string">&quot;code&quot;</span>: models.VectorParams(
            size=<span class="hljs-number">768</span>,
            distance=models.Distance.COSINE,
        ),
    },
)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-dpehrl">Our newly created collection is ready to accept the data. Let’s upload the embeddings:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> tqdm <span class="hljs-keyword">import</span> tqdm
points = []
total = <span class="hljs-built_in">len</span>(structures)
<span class="hljs-built_in">print</span>(<span class="hljs-string">&quot;Number of points to upload: &quot;</span>, total)
<span class="hljs-keyword">for</span> <span class="hljs-built_in">id</span>, (text_embedding, code_embedding, structure) <span class="hljs-keyword">in</span> tqdm(
    <span class="hljs-built_in">enumerate</span>(<span class="hljs-built_in">zip</span>(nlp_embeddings, code_embeddings, structures)), total=total
):
    <span class="hljs-comment"># FastEmbed returns generators. Embeddings are computed as consumed.</span>
    points.append(
        models.PointStruct(
            <span class="hljs-built_in">id</span>=<span class="hljs-built_in">id</span>,
            vector={
                <span class="hljs-string">&quot;text&quot;</span>: text_embedding,
                <span class="hljs-string">&quot;code&quot;</span>: code_embedding,
            },
            payload=structure,
        )
    )
    <span class="hljs-comment"># Upload points in batches</span>
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(points) &gt;= batch_size:
        client.upload_points(COLLECTION_NAME, points=points, wait=<span class="hljs-literal">True</span>)
        points = []
<span class="hljs-comment"># Ensure any remaining points are uploaded</span>
<span class="hljs-keyword">if</span> points:
    client.upload_points(COLLECTION_NAME, points=points)
<span class="hljs-built_in">print</span>(<span class="hljs-string">f&quot;Total points in collection: <span class="hljs-subst">{client.count(COLLECTION_NAME).count}</span>&quot;</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-3ws99c">The uploaded points are immediately available for search. Next, query the collection to find relevant code snippets.</p> <h3 class="relative group"><a id="querying-the-codebase" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#querying-the-codebase"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Querying the codebase</span></h3> <p data-svelte-h="svelte-h95ly4">We use one of the models to search the collection via Qdrant’s new <a href="https://qdrant.tech/blog/qdrant-1.10.x/" rel="nofollow">Query API</a>. Start with text embeddings. Run the following query “How do I count points in a collection?”. 
Review the results.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->query = <span class="hljs-string">&quot;How do I count points in a collection?&quot;</span>
hits = client.query_points(
    COLLECTION_NAME,
    query=<span class="hljs-built_in">next</span>(nlp_model.query_embed(query)).tolist(),
    using=<span class="hljs-string">&quot;text&quot;</span>,
    limit=<span class="hljs-number">3</span>,
).points<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1yoto9z">Now, review the results. The following table lists the module, the file name
and score. Each line includes a link to the signature.</p> <table data-svelte-h="svelte-129t8ym"><thead><tr><th>module</th> <th>file_name</th> <th>score</th> <th>signature</th></tr></thead> <tbody><tr><td>operations</td> <td>types.rs</td> <td>0.5493385</td> <td><a href="https://github.com/qdrant/qdrant/blob/4aac02315bb3ca461a29484094cf6d19025fce99/lib/collection/src/operations/types.rs#L794" rel="nofollow"><code>pub struct CountRequestInternal</code></a></td></tr> <tr><td>map_index</td> <td>types.rs</td> <td>0.49973965</td> <td><a href="https://github.com/qdrant/qdrant/blob/4aac02315bb3ca461a29484094cf6d19025fce99/lib/segment/src/index/field_index/map_index/mod.rs#L89" rel="nofollow"><code>fn get_points_with_value_count</code></a></td></tr> <tr><td>map_index</td> <td>mutable_map_index.rs</td> <td>0.49941066</td> <td><a href="https://github.com/qdrant/qdrant/blob/4aac02315bb3ca461a29484094cf6d19025fce99/lib/segment/src/index/field_index/map_index/mutable_map_index.rs#L143" rel="nofollow"><code>pub fn get_points_with_value_count</code></a></td></tr></tbody></table> <p data-svelte-h="svelte-19cp8rk">It seems we were able to find some relevant code structures. 
Let’s try the same with the code embeddings:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->hits = client.query_points(
    COLLECTION_NAME,
    query=<span class="hljs-built_in">next</span>(code_model.query_embed(query)).tolist(),
    using=<span class="hljs-string">&quot;code&quot;</span>,
    limit=<span class="hljs-number">3</span>,
).points<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1wuxk0l">Output:</p> <table data-svelte-h="svelte-1d7gdtl"><thead><tr><th>module</th> <th>file_name</th> <th>score</th> <th>signature</th></tr></thead> <tbody><tr><td>field_index</td> <td>geo_index.rs</td> <td>0.7217579</td> <td><a href="https://github.com/qdrant/qdrant/blob/4aac02315bb3ca461a29484094cf6d19025fce99/lib/segment/src/index/field_index/geo_index/mod.rs#L319" rel="nofollow"><code>fn count_indexed_points</code></a></td></tr> <tr><td>numeric_index</td> <td>mod.rs</td> <td>0.7113214</td> <td><a href="https://github.com/qdrant/qdrant/blob/4aac02315bb3ca461a29484094cf6d19025fce99/lib/segment/src/index/field_index/numeric_index/mod.rs#L317" rel="nofollow"><code>fn count_indexed_points</code></a></td></tr> <tr><td>full_text_index</td> <td>text_index.rs</td> <td>0.6993165</td> <td><a href="https://github.com/qdrant/qdrant/blob/4aac02315bb3ca461a29484094cf6d19025fce99/lib/segment/src/index/field_index/full_text_index/text_index.rs#L179" rel="nofollow"><code>fn count_indexed_points</code></a></td></tr></tbody></table> <p data-svelte-h="svelte-6u9tc7">The scores retrieved by different models are not comparable, but we can
see that the results are different. Code and text embeddings can capture
different aspects of the codebase. We can use both models to query the collection
and then combine the results to get the most relevant code snippets.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> qdrant_client <span class="hljs-keyword">import</span> models
hits = client.query_points(
    collection_name=COLLECTION_NAME,
    prefetch=[
        models.Prefetch(
            query=<span class="hljs-built_in">next</span>(nlp_model.query_embed(query)).tolist(),
            using=<span class="hljs-string">&quot;text&quot;</span>,
            limit=<span class="hljs-number">5</span>,
        ),
        models.Prefetch(
            query=<span class="hljs-built_in">next</span>(code_model.query_embed(query)).tolist(),
            using=<span class="hljs-string">&quot;code&quot;</span>,
            limit=<span class="hljs-number">5</span>,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
).points<!-- HTML_TAG_END --></pre></div> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-keyword">for</span> hit <span class="hljs-keyword">in</span> hits:
<span class="hljs-meta">... </span>    <span class="hljs-built_in">print</span>(
<span class="hljs-meta">... </span>        <span class="hljs-string">&quot;| &quot;</span>,
<span class="hljs-meta">... </span>        hit.payload[<span class="hljs-string">&quot;context&quot;</span>][<span class="hljs-string">&quot;module&quot;</span>],
<span class="hljs-meta">... </span>        <span class="hljs-string">&quot; | &quot;</span>,
<span class="hljs-meta">... </span>        hit.payload[<span class="hljs-string">&quot;context&quot;</span>][<span class="hljs-string">&quot;file_path&quot;</span>],
<span class="hljs-meta">... </span>        <span class="hljs-string">&quot; | &quot;</span>,
<span class="hljs-meta">... </span>        hit.score,
<span class="hljs-meta">... </span>        <span class="hljs-string">&quot; | `&quot;</span>,
<span class="hljs-meta">... </span>        hit.payload[<span class="hljs-string">&quot;signature&quot;</span>],
<span class="hljs-meta">... </span>        <span class="hljs-string">&quot;` |&quot;</span>,
<span class="hljs-meta">... </span> )<!-- HTML_TAG_END --></pre></div> <pre data-svelte-h="svelte-13qauwq">| operations | lib/collection/src/operations/types.rs | 0.5 | ` &amp;num; [doc = &quot; Count Request&quot;] &amp;num; [doc = &quot; Counts the number of points which satisfy the given filter.&quot;] &amp;num; [doc = &quot; If filter is not provided, the count of all points in the collection will be returned.&quot;] &amp;num; [derive (Debug , Deserialize , Serialize , JsonSchema , Validate)] &amp;num; [serde (rename_all = &quot;snake_case&quot;)] pub struct CountRequestInternal &amp;#123; &amp;num; [doc = &quot; Look only for points which satisfies this conditions&quot;] &amp;num; [validate] pub filter : Option &lt; Filter &gt; , &amp;num; [doc = &quot; If true, count exact number of points. If false, count approximate number of points faster.&quot;] &amp;num; [doc = &quot; Approximate count might be unreliable during the indexing process. Default: true&quot;] &amp;num; [serde (default = &quot;default_exact_count&quot;)] pub exact : bool , } ` |
| field_index | lib/segment/src/index/field_index/geo_index.rs | 0.5 | ` fn count_indexed_points (&amp; self) -&gt; usize ` |
| map_index | lib/segment/src/index/field_index/map_index/mod.rs | 0.33333334 | ` fn get_points_with_value_count &lt; Q &gt; (&amp; self , value : &amp; Q) -&gt; Option &lt; usize &gt; where Q : ? Sized , N : std :: borrow :: Borrow &lt; Q &gt; , Q : Hash + Eq , ` |
| numeric_index | lib/segment/src/index/field_index/numeric_index/mod.rs | 0.33333334 | ` fn count_indexed_points (&amp; self) -&gt; usize ` |
| fixtures | lib/segment/src/fixtures/payload_context_fixture.rs | 0.25 | ` fn total_point_count (&amp; self) -&gt; usize ` |
| map_index | lib/segment/src/index/field_index/map_index/mutable_map_index.rs | 0.25 | ` fn get_points_with_value_count &lt; Q &gt; (&amp; self , value : &amp; Q) -&gt; Option &lt; usize &gt; where Q : ? Sized , N : std :: borrow :: Borrow &lt; Q &gt; , Q : Hash + Eq , ` |
| id_tracker | lib/segment/src/id_tracker/simple_id_tracker.rs | 0.2 | ` fn total_point_count (&amp; self) -&gt; usize ` |
| map_index | lib/segment/src/index/field_index/map_index/mod.rs | 0.2 | ` fn count_indexed_points (&amp; self) -&gt; usize ` |
| map_index | lib/segment/src/index/field_index/map_index/mod.rs | 0.16666667 | ` fn count_indexed_points (&amp; self) -&gt; usize ` |
| field_index | lib/segment/src/index/field_index/stat_tools.rs | 0.16666667 | ` fn number_of_selected_points (points : usize , values : usize) -&gt; usize ` |
</pre> <p data-svelte-h="svelte-rfg7y4">This is one example of how you can fuse the results from different models.
In a real-world scenario, you might run some reranking and deduplication, as well as additional processing of the results.</p> <h3 class="relative group"><a id="grouping-the-results" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#grouping-the-results"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Grouping the results</span></h3> <p data-svelte-h="svelte-t87f2d">You can improve the search results by grouping them by payload properties.
In our case, we can group the results by the module. If we use code embeddings,
we can see multiple results from the <code>map_index</code> module. Let’s group the
results and assume a single result per module:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->results = client.query_points_groups(
    COLLECTION_NAME,
    query=<span class="hljs-built_in">next</span>(code_model.query_embed(query)).tolist(),
    using=<span class="hljs-string">&quot;code&quot;</span>,
    group_by=<span class="hljs-string">&quot;context.module&quot;</span>,
    limit=<span class="hljs-number">5</span>,
    group_size=<span class="hljs-number">1</span>,
)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-keyword">for</span> group <span class="hljs-keyword">in</span> results.groups:
<span class="hljs-meta">... </span>    <span class="hljs-keyword">for</span> hit <span class="hljs-keyword">in</span> group.hits:
<span class="hljs-meta">... </span>        <span class="hljs-built_in">print</span>(
<span class="hljs-meta">... </span>            <span class="hljs-string">&quot;| &quot;</span>,
<span class="hljs-meta">... </span>            hit.payload[<span class="hljs-string">&quot;context&quot;</span>][<span class="hljs-string">&quot;module&quot;</span>],
<span class="hljs-meta">... </span>            <span class="hljs-string">&quot; | &quot;</span>,
<span class="hljs-meta">... </span>            hit.payload[<span class="hljs-string">&quot;context&quot;</span>][<span class="hljs-string">&quot;file_name&quot;</span>],
<span class="hljs-meta">... </span>            <span class="hljs-string">&quot; | &quot;</span>,
<span class="hljs-meta">... </span>            hit.score,
<span class="hljs-meta">... </span>            <span class="hljs-string">&quot; | `&quot;</span>,
<span class="hljs-meta">... </span>            hit.payload[<span class="hljs-string">&quot;signature&quot;</span>],
<span class="hljs-meta">... </span>            <span class="hljs-string">&quot;` |&quot;</span>,
<span class="hljs-meta">... </span> )<!-- HTML_TAG_END --></pre></div> <pre data-svelte-h="svelte-jxqdo">| field_index | geo_index.rs | 0.7217579 | ` fn count_indexed_points (&amp; self) -&gt; usize ` |
| numeric_index | mod.rs | 0.7113214 | ` fn count_indexed_points (&amp; self) -&gt; usize ` |
| fixtures | payload_context_fixture.rs | 0.6993165 | ` fn total_point_count (&amp; self) -&gt; usize ` |
| map_index | mod.rs | 0.68385994 | ` fn count_indexed_points (&amp; self) -&gt; usize ` |
| full_text_index | text_index.rs | 0.6660142 | ` fn count_indexed_points (&amp; self) -&gt; usize ` |
</pre> <p data-svelte-h="svelte-1rvelqz">That concludes our tutorial. Thanks for taking the time to get here. We’ve just begun exploring what’s possible with vector embeddings and how to improve it. Feel free to experiment your way; you could build something very cool! Do share it with us 🙏 We are <a href="https://qdrant.tech/contact-us/" rel="nofollow">here</a>.</p> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/cookbook/blob/main/notebooks/en/code_search.md" target="_blank"><span data-svelte-h="svelte-1kd6by1">&lt;</span> <span data-svelte-h="svelte-x0xyl0">&gt;</span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p>
<script>
{
__sveltekit_1l2350x = {
assets: "/docs/cookbook/main/en",
base: "/docs/cookbook/main/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/cookbook/main/en/_app/immutable/entry/start.96b44205.js"),
import("/docs/cookbook/main/en/_app/immutable/entry/app.e92a3d99.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 13],
data,
form: null,
error: null
});
});
}
</script>
