aigentic/Caselaw_Access_Project_embeddings-bucket

	---
	license: agpl-3.0
	task_categories:
	- feature-extraction
	tags:
	- legal
	size_categories:
	- 100M<n<1B
	---
	Original Repository:

	https://huggingface.co/datasets/justicedao/Caselaw_Access_Project_embeddings/

	This is an embeddings dataset for the Caselaw Access Project, created by a user named Endomorphosis.

	Each caselaw entry is hashed with IPFS / multiformats, so the document can be retrieved over the IPFS / Filecoin network.

	The IPFS content ID ("cid") is the primary key that links the dataset to the embeddings, should you want to retrieve a document from the dataset instead of over the network.
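
As a rough illustration of using the cid as a join key, the sketch below pairs embedding rows with their source documents. The record layout (dicts with `cid`, `text`, and `embedding` fields) and the function name `join_on_cid` are assumptions made for the example, not the dataset's documented schema.

```python
def join_on_cid(documents, embeddings):
    """Pair each embedding row with its source document via the shared cid.

    Both arguments are lists of dicts; the field names used here are
    hypothetical and may differ from the real files.
    """
    # Index the documents once so each lookup is O(1).
    docs_by_cid = {doc["cid"]: doc for doc in documents}

    joined = []
    for row in embeddings:
        doc = docs_by_cid.get(row["cid"])
        if doc is None:
            continue  # embedding without a matching document: skip it
        joined.append(
            {"cid": row["cid"], "text": doc["text"], "embedding": row["embedding"]}
        )
    return joined


# Toy records standing in for rows of the dataset and the embeddings.
documents = [
    {"cid": "cid-a", "text": "first opinion"},
    {"cid": "cid-b", "text": "second opinion"},
]
embeddings = [
    {"cid": "cid-b", "embedding": [0.1, 0.2]},
    {"cid": "cid-missing", "embedding": [0.3, 0.4]},
]
joined = join_on_cid(documents, embeddings)
```

Rows whose cid has no matching document are dropped here; a real pipeline might prefer to log them instead.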

	Embeddings have been generated for the dataset with three models: thenlper/gte-small, Alibaba-NLP/gte-large-en-v1.5, and Alibaba-NLP/gte-Qwen2-1.5B-instruct.

	Those models have context lengths of 512, 8192, and 32k tokens respectively, and produce embeddings with 384, 1024, and 1536 dimensions.

	The embeddings are grouped into 4096 clusters. For each model, the centroid of each cluster is provided, along with the content IDs that belong to each cluster.

	To search the embeddings on the client side, first compare the query against the centroids, then retrieve the closest gte-small cluster, and finally run the query against the embeddings in that cluster.
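
The two-stage search described above can be sketched as follows. This is a minimal illustration under stated assumptions: the centroids and the cluster members are already loaded in memory as plain Python lists, cosine similarity is used as the metric (the card does not say which metric was used for clustering), and the names `cosine` and `two_stage_search` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(query, centroids, clusters, top_k=5):
    """Return (cluster index, top_k (cid, score) pairs) for a query vector.

    centroids: list of vectors, one per cluster.
    clusters:  dict mapping cluster index -> list of (cid, vector) pairs.
    """
    # Stage 1: find the cluster whose centroid is closest to the query.
    best = max(range(len(centroids)), key=lambda i: cosine(query, centroids[i]))

    # Stage 2: score only the members of that cluster.
    scored = [(cid, cosine(query, vec)) for cid, vec in clusters[best]]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return best, scored[:top_k]


# Toy data: two clusters in two dimensions instead of 4096 clusters in 384.
centroids = [[1.0, 0.0], [0.0, 1.0]]
clusters = {
    0: [("cid-a", [1.0, 0.1]), ("cid-b", [0.5, 0.5])],
    1: [("cid-c", [0.0, 1.0])],
}
cluster_id, hits = two_stage_search([0.9, 0.1], centroids, clusters, top_k=2)
```

Searching only the nearest cluster trades some recall for speed; probing the few closest clusters instead of one is a common variation. At the real scale a vectorized library such as NumPy would be a more practical choice than pure Python loops.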
