	---
	license: apache-2.0
	language:
	- en
	---
	## 2 Million Bluesky Posts

	This dataset contains 2 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data.

	The with-language-predictions config contains the same data as the default config but with language predictions added using the glotlid model.

	## Dataset Details

	### Dataset Description

	This dataset consists of 2 million public posts from Bluesky Social, collected through the platform's firehose API. Each post contains text content, metadata, and information about media attachments and reply relationships.

	- Curated by: Alpin Dale
	- Language(s) (NLP): Multiple (primarily English)
	- License: Dataset usage is subject to Bluesky's Terms of Service


	## Uses

	This dataset could be used for:

	- Training and testing language models on social media content
	- Analyzing social media posting patterns
	- Studying conversation structures and reply networks
	- Research on social media content moderation
	- Natural language processing tasks using social media data

	## Dataset Structure

	The dataset is available in two configurations:
	### Default Configuration

	Contains the following fields for each post:

	- text: The main content of the post
	- created_at: Timestamp of post creation
	- author: The Bluesky handle of the post author
	- uri: Unique identifier for the post
	- has_images: Boolean indicating if the post contains images
	- reply_to: URI of the parent post if this is a reply (null otherwise)
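
As a rough illustration of how the `uri` and `reply_to` fields can be combined to study reply networks, the sketch below groups replies under their parent post and counts the size of a thread. The sample rows, URIs, and handles are invented placeholders, not real records, and the helper names `build_reply_index` and `thread_size` are hypothetical. It also assumes reply chains contain no cycles.

```python
from collections import defaultdict

# Invented sample rows that follow the field list above (not real data).
posts = [
    {"text": "Hello!", "created_at": "2024-01-01T10:00:00Z", "author": "alice.example",
     "uri": "at://example/post/1", "has_images": False, "reply_to": None},
    {"text": "Hi Alice", "created_at": "2024-01-01T10:05:00Z", "author": "bob.example",
     "uri": "at://example/post/2", "has_images": False, "reply_to": "at://example/post/1"},
    {"text": "Welcome both", "created_at": "2024-01-01T10:07:00Z", "author": "carol.example",
     "uri": "at://example/post/3", "has_images": False, "reply_to": "at://example/post/2"},
    {"text": "A photo", "created_at": "2024-01-01T11:00:00Z", "author": "dave.example",
     "uri": "at://example/post/4", "has_images": True, "reply_to": None},
]


def build_reply_index(rows):
    """Map each parent URI to the URIs of its direct replies."""
    children = defaultdict(list)
    for row in rows:
        parent = row["reply_to"]
        if parent is not None:  # top-level posts have reply_to set to null/None
            children[parent].append(row["uri"])
    return dict(children)


def thread_size(root_uri, children):
    """Count a post plus every descendant reply present in the index."""
    stack = [root_uri]
    count = 0
    while stack:
        uri = stack.pop()
        count += 1
        stack.extend(children.get(uri, []))
    return count


reply_index = build_reply_index(posts)
print(reply_index)
print(thread_size("at://example/post/1", reply_index))
```

Because the dataset is a sample of the firehose, a parent referenced by `reply_to` may not itself be present in the data, so an index like this only covers the part of each thread that was captured.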

	### With Language Predictions Configuration

	Contains all fields from the default configuration plus:

	- predicted_language: The predicted language code (e.g., eng_Latn, deu_Latn)
	- language_confidence: Confidence score for the language prediction (0-1)

	Language predictions were added using the [glotlid](https://huggingface.co/cis-lmu/glotlid) model via fasttext.
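
One way the two extra fields might be used is to keep only posts in a given language whose prediction is reasonably confident. The sketch below is a minimal example on invented rows. The helper name `filter_by_language` and the 0.8 threshold are assumptions chosen for illustration, not values recommended by the dataset.

```python
# Invented sample rows with the two language fields (not real data).
rows = [
    {"text": "Good morning", "predicted_language": "eng_Latn", "language_confidence": 0.98},
    {"text": "Guten Morgen", "predicted_language": "deu_Latn", "language_confidence": 0.95},
    {"text": "ok", "predicted_language": "eng_Latn", "language_confidence": 0.41},
]


def filter_by_language(rows, language, min_confidence=0.8):
    """Keep rows predicted as `language` with confidence >= min_confidence.

    The default threshold of 0.8 is an arbitrary illustrative choice.
    """
    return [
        row for row in rows
        if row["predicted_language"] == language
        and row["language_confidence"] >= min_confidence
    ]


english = filter_by_language(rows, "eng_Latn")
print([row["text"] for row in english])
```

Very short posts tend to receive less reliable language predictions, so a confidence cutoff like this will often drop them. Whether that is acceptable depends on the task.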

	## Bias, Risks, and Limitations
	The goal of this dataset is for you to have fun :)