| license: agpl-3.0 | |
| task_categories: | |
| - feature-extraction | |
| tags: | |
| - legal | |
| size_categories: | |
| - 100M<n<1B | |
| Original Repository: | |
| https://huggingface.co/datasets/justicedao/Caselaw_Access_Project_embeddings/ | |
| This is an embeddings dataset for the Caselaw Access Project, created by a user named Endomorphosis. | |
| Each caselaw entry is hashed with IPFS / multiformats, so the document can be retrieved over the IPFS / Filecoin network. | |
| The IPFS content ID ("cid") is the primary key that links the dataset to the embeddings, should you want to retrieve a document from the dataset instead. | |
| Embeddings for the dataset have been generated with three models: thenlper/gte-small, Alibaba-NLP/gte-large-en-v1.5, and Alibaba-NLP/gte-Qwen2-1.5B-instruct. | |
| These models have context lengths of 512, 8192, and 32k tokens and produce embeddings of 384, 1024, and 1536 dimensions, respectively. | |
| For each model, the embeddings are grouped into 4096 clusters; the centroid of each cluster is provided, along with the content IDs belonging to each cluster. | |
| To search the embeddings on the client side, it is advisable to first query against the centroids, then retrieve the closest gte-small cluster, and finally query within that cluster. |
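
The two-stage search described above could be sketched roughly as follows. This is a minimal illustration, not the dataset author's code: the function name `two_stage_search`, the in-memory layout of `centroids`, `cluster_embeddings`, and `cluster_cids`, and the use of cosine similarity are all assumptions, since the card does not specify file names, schemas, or a distance metric. The query vector is assumed to come from the same model as the centroids being searched (for example, gte-small with 384 dimensions).

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def two_stage_search(query, centroids, cluster_embeddings, cluster_cids, top_k=5):
    """Hypothetical helper: pick the nearest centroid, then rank that cluster.

    query              -- 1-D query embedding, shape (dim,)
    centroids          -- array of shape (n_clusters, dim)
    cluster_embeddings -- sequence of arrays, one (n_members, dim) per cluster
    cluster_cids       -- sequence of lists of content IDs, aligned with the above
    """
    q = normalize(query)

    # Stage 1: compare the query against every centroid (assumed cosine similarity).
    centroid_scores = normalize(centroids) @ q
    best = int(np.argmax(centroid_scores))

    # Stage 2: search only inside the closest cluster.
    members = normalize(cluster_embeddings[best])
    scores = members @ q
    order = np.argsort(-scores)[:top_k]

    # Return the cluster index and (cid, similarity) pairs, best match first.
    return best, [(cluster_cids[best][int(i)], float(scores[i])) for i in order]
```

In practice only the centroid array needs to be held in memory up front; the embeddings and content IDs of the selected cluster can be fetched on demand, and the returned content IDs can then be used to look up the original documents. Searching a single cluster is an approximation, so a true nearest neighbour that falls in an adjacent cluster may be missed.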