Update README.md
README.md
CHANGED
@@ -8,4 +8,88 @@ pinned: false
 license: mit
 ---
 
-
+RedPajama Dataset API
+
+A FastAPI-based Application for Exploring the RedPajama-Data-1T Dataset
+
+Overview
+
+This application provides an API for exploring the RedPajama-Data-1T dataset. Built with FastAPI, it lets users retrieve data chunks, run keyword searches, and view dataset summaries. It is intended for researchers and developers working with large-scale language model datasets.
+
+Features
+1. Retrieve Dataset Chunks
+Fetch smaller, manageable subsets of the dataset to explore or preprocess.
+2. Search Data
+Search for specific keywords in the dataset and retrieve the matching entries.
+3. Dataset Summary
+Get an overview of the dataset’s structure, including available splits.
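The three features above can be sketched with the standard library alone. This is a minimal illustration, not the Space's actual code: the function names, the `text` field, the case-insensitive substring match, and the in-memory `SAMPLE` records are all assumptions made for the example.

```python
from itertools import islice
from typing import Dict, Iterable, List

Record = Dict[str, str]


def take_chunk(records: Iterable[Record], chunk_size: int = 10) -> List[Record]:
    """Return the first `chunk_size` records without consuming the rest."""
    return list(islice(records, chunk_size))


def search(records: Iterable[Record], keyword: str, max_results: int = 10) -> List[Record]:
    """Return up to `max_results` records whose text contains `keyword`.

    Matching is a case-insensitive substring test (an assumption).
    """
    needle = keyword.lower()
    matches = (r for r in records if needle in r.get("text", "").lower())
    return list(islice(matches, max_results))


def summarize(splits: Dict[str, List[Record]]) -> Dict[str, int]:
    """Map each split name to its number of records."""
    return {name: len(rows) for name, rows in splits.items()}


# Hypothetical stand-in data; the real dataset is far too large to hold in memory.
SAMPLE: List[Record] = [
    {"text": "An example sentence."},
    {"text": "Another line of text."},
    {"text": "Example number two."},
]
```

Because `take_chunk` and `search` use `islice`, they stop reading as soon as enough records are collected, which matters when the source is a streamed iterator rather than a list.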
+
+Endpoints
+
+| Endpoint | Method | Parameters | Description |
+|---|---|---|---|
+| / | GET | None | Displays a welcome message. |
+| /get_data/ | GET | chunk_size (int, default: 10) | Fetches a subset of the dataset. |
+| /search_data/ | GET | keyword (str, required), max_results (int, default: 10) | Searches for entries containing the given keyword. |
+| /data_summary/ | GET | None | Displays a summary of the dataset. |
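To make the parameter rules in the table concrete, here is a hedged sketch of how the query strings could be interpreted. In a FastAPI app this is normally handled by declaring typed function parameters; the standard-library version below only mirrors the defaults and the required `keyword` from the table, and the function names are hypothetical.

```python
from typing import Tuple
from urllib.parse import parse_qs


def parse_get_data(query: str) -> int:
    """Read chunk_size from a query string, defaulting to 10."""
    params = parse_qs(query)
    return int(params.get("chunk_size", ["10"])[0])


def parse_search_data(query: str) -> Tuple[str, int]:
    """Read keyword (required) and max_results (default 10)."""
    params = parse_qs(query)
    if "keyword" not in params:
        # parse_qs drops blank values, so "keyword=" also lands here.
        raise ValueError("keyword is required")
    keyword = params["keyword"][0]
    max_results = int(params.get("max_results", ["10"])[0])
    return keyword, max_results
```

For example, `parse_search_data("keyword=example&max_results=3")` yields the keyword and limit as a tuple, while a query without `keyword` raises `ValueError`.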
+
+Getting Started
+
+Prerequisites
+• Python 3.8+
+• pip for dependency management
+
+Setup
+1. Clone the repository:
+
+git clone https://huggingface.co/spaces/Canstralian/DockerTester
+cd DockerTester
+
+
+2. Install dependencies:
+
+pip install -r requirements.txt
+
+
+3. Run the application:
+
+uvicorn app:app --host 0.0.0.0 --port 8000
+
+
+4. Access the API in your browser or with a tool such as Postman at:
+
+http://127.0.0.1:8000
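Before opening the URL from step 4, it can help to confirm that something is actually listening on the port. The helper below is a small sketch under that assumption; `is_listening` is a hypothetical name, and it only checks that a TCP connection succeeds, not that the API itself is healthy.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `is_listening()` with no arguments checks the address used in the steps above.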
+
+Example Usage
+1. Retrieve a Small Chunk of Data
+Fetch 5 examples from the dataset:
+
+curl "http://127.0.0.1:8000/get_data/?chunk_size=5"
+
+
+2. Search the Dataset
+Search for the keyword "example" and return up to 3 results:
+
+curl "http://127.0.0.1:8000/search_data/?keyword=example&max_results=3"
+
+
+3. View Dataset Summary
+Get an overview of available splits:
+
+curl "http://127.0.0.1:8000/data_summary/"
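The same three requests can be issued from Python. The sketch below uses only the standard library; `build_url` and `fetch_json` are hypothetical helper names, and `fetch_json` assumes the server from the setup steps is running locally and responds with JSON, which the README does not state explicitly.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"


def build_url(path: str, **params) -> str:
    """Join the base URL, an endpoint path, and optional query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def fetch_json(path: str, **params):
    """GET an endpoint and decode the body as JSON (assumed response format)."""
    with urlopen(build_url(path, **params), timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Mirrors the three curl commands above; requires the server to be running.
    print(fetch_json("/get_data/", chunk_size=5))
    print(fetch_json("/search_data/", keyword="example", max_results=3))
    print(fetch_json("/data_summary/"))
```

Using `urlencode` rather than string concatenation means keywords containing spaces or `&` are escaped correctly.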
+
+Technologies Used
+• FastAPI: For building the API.
+• Hugging Face Datasets: To access and process the RedPajama-Data-1T dataset.
+• Uvicorn: The ASGI server that runs the application.
+• Python: Backend language.
+
+Future Enhancements
+• Add support for advanced filtering (e.g., by metadata or specific fields).
+• Implement user authentication for restricted dataset access.
+• Add visualization endpoints for dataset insights.
+
+License
+
+This project is licensed under the Apache 2.0 License. Refer to the LICENSE file for details.
+
+Feel free to reach out with questions, feature requests, or contributions!