| ---
|
| license: apache-2.0
|
| language:
|
| - en
|
| ---
|
| ## 2 Million Bluesky Posts |
|
|
| This dataset contains 2 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data. |
|
|
| The `with-language-predictions` config contains the same data as the default config, with a predicted language added to each post by the glotlid model. |
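As a minimal sketch of how either config might be selected with the Hugging Face `datasets` library: the helper names `config_for` and `load_posts` are hypothetical, the repository ID is not stated here and must be supplied by the caller, and the `"train"` split is an assumption.

```python
from typing import Optional


def config_for(with_language_predictions: bool) -> Optional[str]:
    """Return the config name to request; None selects the default config."""
    return "with-language-predictions" if with_language_predictions else None


def load_posts(repo_id: str, with_language_predictions: bool = False):
    """Stream posts from the Hub. `repo_id` must be supplied by the caller."""
    # Imported lazily so config_for() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        repo_id,
        name=config_for(with_language_predictions),
        split="train",   # assumption: the data lives in a single "train" split
        streaming=True,  # avoids downloading all 2 million posts up front
    )
```

Streaming is used so that a quick look at a few posts does not require a full download; drop `streaming=True` if random access is needed.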
|
| ## Dataset Details |
|
| ### Dataset Description |
|
|
| This dataset consists of 2 million public posts from Bluesky Social, collected through the platform's firehose API. Each post contains text content, metadata, and information about media attachments and reply relationships. |
|
|
| - **Curated by**: Alpin Dale |
| - **Language(s) (NLP)**: Multiple (primarily English) |
| - **License**: Dataset usage is subject to Bluesky's Terms of Service |
|
|
|
|
| ## Uses |
|
|
| This dataset could be used for: |
|
|
| - Training and testing language models on social media content |
| - Analyzing social media posting patterns |
| - Studying conversation structures and reply networks |
| - Research on social media content moderation |
| - Natural language processing tasks using social media data |
|
|
| ## Dataset Structure |
|
|
| The dataset is available in two configurations: |
| ### Default Configuration |
|
|
| Contains the following fields for each post: |
|
|
| - **text**: The main content of the post |
| - **created_at**: Timestamp of post creation |
| - **author**: The Bluesky handle of the post author |
| - **uri**: Unique identifier for the post |
| - **has_images**: Boolean indicating if the post contains images |
| - **reply_to**: URI of the parent post if this is a reply (null otherwise) |
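
The fields above could be modelled and used roughly as follows. This is a sketch under assumptions: `created_at` is treated as a string, the `Post` type and `group_replies` helper are hypothetical, and the sample records (including their URIs) are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, TypedDict


class Post(TypedDict):
    text: str
    created_at: str          # assumption: timestamp stored as a string
    author: str
    uri: str
    has_images: bool
    reply_to: Optional[str]  # None when the post is not a reply


def group_replies(posts: Iterable[Post]) -> Dict[str, List[str]]:
    """Map each parent URI to the URIs of its direct replies."""
    children: Dict[str, List[str]] = defaultdict(list)
    for post in posts:
        if post["reply_to"] is not None:
            children[post["reply_to"]].append(post["uri"])
    return dict(children)


# Made-up records for illustration only; real URIs will look different.
sample: List[Post] = [
    {"text": "hello", "created_at": "2024-01-01T00:00:00Z", "author": "a.example",
     "uri": "at://example/post/1", "has_images": False, "reply_to": None},
    {"text": "hi!", "created_at": "2024-01-01T00:01:00Z", "author": "b.example",
     "uri": "at://example/post/2", "has_images": False, "reply_to": "at://example/post/1"},
    {"text": "welcome", "created_at": "2024-01-01T00:02:00Z", "author": "c.example",
     "uri": "at://example/post/3", "has_images": True, "reply_to": "at://example/post/1"},
]

reply_map = group_replies(sample)
```

A mapping like `reply_map` is one possible starting point for studying conversation structure. Parents that fall outside the collected sample will still appear as keys, without a matching post.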
| |
| ### With Language Predictions Configuration |
| |
| Contains all fields from the default configuration plus: |
| |
| - **predicted_language**: The predicted language code (e.g., eng_Latn, deu_Latn) |
| - **language_confidence**: Confidence score for the language prediction (0-1) |
| |
| Language predictions were added using the [glotlid](https://huggingface.co/cis-lmu/glotlid) model via fasttext. |
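
One way the two extra fields might be used is to keep only posts in a given language whose prediction is reasonably confident. The helper name `filter_by_language`, the 0.8 threshold, and the sample records are assumptions for illustration, not part of the dataset.

```python
from typing import Any, Dict, Iterable, List


def filter_by_language(
    posts: Iterable[Dict[str, Any]],
    language: str,
    min_confidence: float = 0.8,  # arbitrary cut-off; tune for your use case
) -> List[Dict[str, Any]]:
    """Keep posts predicted as `language` with confidence >= min_confidence."""
    return [
        post
        for post in posts
        if post["predicted_language"] == language
        and post["language_confidence"] >= min_confidence
    ]


# Made-up records for illustration only.
sample = [
    {"text": "good morning", "predicted_language": "eng_Latn", "language_confidence": 0.97},
    {"text": "guten Morgen", "predicted_language": "deu_Latn", "language_confidence": 0.91},
    {"text": "ok", "predicted_language": "eng_Latn", "language_confidence": 0.42},
]

english_posts = filter_by_language(sample, "eng_Latn")
```

Very short posts tend to get low-confidence predictions, so a threshold like this trades coverage for cleaner language labels.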
| |
| ## Bias, Risks, and Limitations |
| The goal of this dataset is for you to have fun :) |