Upload README.md with huggingface_hub
README.md
CHANGED
@@ -9,6 +9,7 @@ tags:
 - research
 datasets:
 - junchenfu/llmpopcorn_prompts
+- junchenfu/microlens_rag
 pipeline_tag: text-generation
 ---
 
@@ -96,3 +97,35 @@ for item in dataset["train"]:
 ```
 
 This dataset contains both abstract and concrete prompts, which you can use as input for the video generation scripts in Step 2.
+
+## RAG Reference Dataset: MicroLens
+
+For the RAG-enhanced pipeline (`PE.py` + `pipline.py`), we provide a pre-processed version of the MicroLens dataset on Hugging Face so you don't need to download and process the raw files manually.
+
+The dataset is available at: [**junchenfu/microlens_rag**](https://huggingface.co/datasets/junchenfu/microlens_rag)
+
+It contains **19,560** video entries across **22 categories** with the following fields:
+
+| Column | Description |
+|--------|-------------|
+| `video_id` | Unique video identifier |
+| `title_en` | Cover image description (used as title) |
+| `cover_desc` | Cover image description |
+| `caption_en` | Full video caption in English |
+| `partition` | Video category (e.g., Anime, Game, Delicacy) |
+| `likes` | Number of likes |
+| `views` | Number of views |
+| `comment_count` | Number of comments (used as popularity signal) |
+
+### Load the RAG Dataset in Python
+
+```python
+from datasets import load_dataset
+
+rag_dataset = load_dataset("junchenfu/microlens_rag")
+
+# Access as a pandas DataFrame
+df = rag_dataset["train"].to_pandas()
+print(df.head())
+print(f"Total: {len(df)} videos, {df['partition'].nunique()} categories")
+```
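
The added README section describes `comment_count` as the popularity signal and `partition` as the category, but the diff does not show how `PE.py` selects reference videos. As a rough illustration only, the sketch below picks the most-commented entries within one category. The helper `top_by_comments` and the sample rows are hypothetical and are not taken from the repository; the rows stand in for the dict-style records you get when iterating over `rag_dataset["train"]`.

```python
import heapq
from typing import Any, Dict, Iterable, List


def top_by_comments(
    rows: Iterable[Dict[str, Any]], partition: str, k: int = 3
) -> List[Dict[str, Any]]:
    """Hypothetical helper: return the k rows in `partition` with the highest comment_count."""
    # Keep only rows from the requested category.
    in_category = (row for row in rows if row["partition"] == partition)
    # nlargest returns the results sorted from highest to lowest comment_count.
    return heapq.nlargest(k, in_category, key=lambda row: row["comment_count"])


# Made-up rows for illustration; real rows carry all columns listed in the README table.
sample_rows = [
    {"video_id": "a", "partition": "Game", "comment_count": 12},
    {"video_id": "b", "partition": "Game", "comment_count": 40},
    {"video_id": "c", "partition": "Anime", "comment_count": 99},
    {"video_id": "d", "partition": "Game", "comment_count": 7},
]

top_game = top_by_comments(sample_rows, "Game", k=2)
print([row["video_id"] for row in top_game])  # → ['b', 'a']
```

With the real dataset, the same call would presumably look like `top_by_comments(rag_dataset["train"], "Game")`, since iterating a `datasets` split yields one dict per row.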