junchenfu committed on
Commit a9e5ce0 · verified · 1 Parent(s): d0ad1cb

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +33 -0
README.md CHANGED
@@ -9,6 +9,7 @@ tags:
  - research
  datasets:
  - junchenfu/llmpopcorn_prompts
+ - junchenfu/microlens_rag
  pipeline_tag: text-generation
  ---

@@ -96,3 +97,35 @@ for item in dataset["train"]:
  ```

  This dataset contains both abstract and concrete prompts, which you can use as input for the video generation scripts in Step 2.
+
+ ## RAG Reference Dataset: MicroLens
+
+ For the RAG-enhanced pipeline (`PE.py` + `pipline.py`), we provide a pre-processed version of the MicroLens dataset on Hugging Face so you don't need to download and process the raw files manually.
+
+ The dataset is available at: [**junchenfu/microlens_rag**](https://huggingface.co/datasets/junchenfu/microlens_rag)
+
+ It contains **19,560** video entries across **22 categories** with the following fields:
+
+ | Column | Description |
+ |--------|-------------|
+ | `video_id` | Unique video identifier |
+ | `title_en` | Cover image description (used as title) |
+ | `cover_desc` | Cover image description |
+ | `caption_en` | Full video caption in English |
+ | `partition` | Video category (e.g., Anime, Game, Delicacy) |
+ | `likes` | Number of likes |
+ | `views` | Number of views |
+ | `comment_count` | Number of comments (used as popularity signal) |
+
+ ### Load the RAG Dataset in Python
+
+ ```python
+ from datasets import load_dataset
+
+ rag_dataset = load_dataset("junchenfu/microlens_rag")
+
+ # Access as a pandas DataFrame
+ df = rag_dataset["train"].to_pandas()
+ print(df.head())
+ print(f"Total: {len(df)} videos, {df['partition'].nunique()} categories")
+ ```
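
The added README section describes `comment_count` as the popularity signal and `partition` as the category, but this diff does not show how `PE.py` or `pipline.py` actually pick reference videos. The following is only a minimal sketch of one plausible use, assuming references are the most-commented videos within a category. The helper name `top_references` and the sample rows are hypothetical and are not part of the repository.

```python
from heapq import nlargest


def top_references(rows, category, k=3):
    """Return the k rows with the highest `comment_count` in one `partition`.

    Hypothetical helper: `rows` is any iterable of dict-like records that
    carry the `partition` and `comment_count` fields listed in the README.
    """
    in_category = (row for row in rows if row["partition"] == category)
    return nlargest(k, in_category, key=lambda row: row["comment_count"])


# Toy rows that mimic the documented columns; all values are made up.
sample_rows = [
    {"video_id": "v1", "title_en": "Sketching a hero", "partition": "Anime", "comment_count": 12},
    {"video_id": "v2", "title_en": "Speedrun attempt", "partition": "Game", "comment_count": 40},
    {"video_id": "v3", "title_en": "Boss fight guide", "partition": "Game", "comment_count": 95},
    {"video_id": "v4", "title_en": "Menu walkthrough", "partition": "Game", "comment_count": 7},
    {"video_id": "v5", "title_en": "Street noodles", "partition": "Delicacy", "comment_count": 30},
]

for row in top_references(sample_rows, "Game", k=2):
    print(row["video_id"], row["title_en"], row["comment_count"])
```

Because iterating over a `datasets` split yields one dict per row, the same function should also accept `rag_dataset["train"]` in place of `sample_rows`, provided the column names match the table in the README.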