In a Training Loop
Khoi Truong
Kiy-K
5 followers · 38 following
https://kiy-k.github.io/My-Portfolio-K/
AI & ML interests
Building Computer Use for AI agents
Recent Activity
reacted to nyuuzyou's post with 🔥 about 21 hours ago
NNTP Discussion Archives - 387M Messages from Public Newsgroups - https://huggingface.co/datasets/nyuuzyou/nntp-text-387m

Here's something different from the code datasets: 20+ years of public discussion archives from NNTP newsgroups. Clean Parquet format, but this time it's conversations instead of code.

Key Stats:
- 386,629,949 messages from 159,345 newsgroups
- 191 GB compressed Parquet storage
- Spans 2002-2026
- Multilingual: English, German, French, Italian, Dutch, Polish, Russian, and others
- Email addresses redacted for privacy

The data is messy in the way real discussions are messy. Spam wasn't filtered out - you get the advertisements, the arguments, the off-topic threads, all of it. If you want sanitized text, this isn't it. If you want to see how people actually talked online before Discord and Reddit took over, here you go.

Processing kept it simple: convert everything to UTF-8, remove exact duplicates, strip binary attachments, redact emails. Legacy character encodings were a nightmare - had to handle Windows-1252, ISO-8859 variants, KOI8-R, Shift-JIS, GBK, and others just to get readable text.

At least it was fun to do, and I think the result turned out pretty well. I hope someone else will also be able to have fun or gain something useful from this project.
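The processing steps described in the post (convert to UTF-8, drop exact duplicates, redact emails) might look roughly like the sketch below. This is not the dataset author's pipeline. The fallback order in `CANDIDATE_ENCODINGS`, the `<email>` placeholder, the email regex, and all function names are assumptions made for illustration, and stripping binary attachments is left out.

```python
import hashlib
import re
from typing import Iterable, Iterator, Optional

# Assumed fallback order. The post names these encodings but not how they
# were tried. Order matters: a permissive single-byte codec such as koi8_r
# accepts almost any input, so it should come after stricter candidates.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "shift_jis", "gbk", "koi8_r")

# Deliberately simple pattern; real-world address matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def decode_message(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw bytes, trying the declared charset first, then fallbacks."""
    candidates = ((declared,) if declared else ()) + CANDIDATE_ENCODINGS
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers unknown or misspelled charset names.
            continue
    # Last resort: keep what is readable and mark the rest.
    return raw.decode("utf-8", errors="replace")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("<email>", text)


def dedupe_exact(messages: Iterable[str]) -> Iterator[str]:
    """Yield each distinct message once, keeping first-seen order."""
    seen = set()
    for text in messages:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield text
```

Storing digests instead of full message bodies keeps the duplicate check's memory use bounded per message, which matters at the hundreds-of-millions scale the post mentions.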
upvoted a collection 10 days ago: TranslateGemma
liked a Space 10 days ago: google/ehr-navigator-agent-with-medgemma
Organizations
Kiy-K's Spaces (1)
Fyodor Agent (Paused)