Activity Feed

AI & ML interests

None defined yet.

Recent Activity

MikeDoes 
posted an update 9 days ago
AI4Privacy datasets are being used to decide what data should never leave the device.

A new paper on privacy-preserving cloud computing uses the AI4Privacy PII-Masking-65K dataset to train models that classify text as private or public before it’s ever sent to the cloud.

This is a subtle but important shift.

Instead of encrypting everything or trusting the cloud by default, the authors ask a simpler question:

Can we detect sensitive text early enough to keep it local?

Using DistilBERT, trained partly on AI4Privacy PII data, the system learns to (see the sketch after this list):

route private text to local processing

send non-sensitive text to the cloud

train collaboratively using federated learning, without sharing raw data
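
To make the routing idea concrete, here is a minimal sketch in Python. The checkpoint name and the PRIVATE/PUBLIC labels are assumptions for illustration, not artifacts released with the paper:

# Minimal sketch of privacy-as-routing: classify first, transmit second.
# Checkpoint name and label set are illustrative assumptions.
from transformers import pipeline

router = pipeline("text-classification",
                  model="your-org/distilbert-private-vs-public")  # hypothetical

def process_locally(text: str) -> str:
    return f"[kept on device] {text}"   # stand-in for local handling

def send_to_cloud(text: str) -> str:
    return f"[sent to cloud] {text}"    # stand-in for the network call

def route(text: str) -> str:
    # Sensitive text never leaves the device.
    label = router(text)[0]["label"]
    return process_locally(text) if label == "PRIVATE" else send_to_cloud(text)

print(route("My IBAN is DE89 3704 0044 0532 0130 00"))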

The result:

99.9% accuracy in private vs public text detection

Near-centralized performance in downstream tasks like SMS spam detection

Privacy protection enforced by design, not policy

What stands out here is not just the model performance, but the architectural idea:
privacy as a routing decision, backed by large-scale PII annotations.

This work reinforces a pattern we keep seeing: scalable privacy systems don’t start with encryption; they start with good PII data.

📄 Full Paper here: https://dl.acm.org/doi/full/10.1145/3773276.3774872

#Ai4Privacy #DataPrivacy #PIIMasking #FederatedLearning #PrivacyEngineering #OpenSourceAI #ResponsibleAI #AcademicResearch #LLMSecurity
MikeDoes 
posted an update 10 days ago
This new preprint fine-tunes T5-small and Mistral-7B on the AI4Privacy PII-Masking-200K dataset and shows that lightweight models can rival, and sometimes match, much larger LLMs on privacy tasks.

The study tackles a real deployment question many teams face:

Is PII masking a model-size problem, or a data-quality problem?

Using AI4Privacy’s large-scale, standardized PII annotations, the authors systematically compare:

Encoder–decoder models (T5) vs

Decoder-only models (Mistral)

across accuracy, robustness, latency, and real-world conversational text.

What stood out:

Mistral-7B achieved higher recall and robustness across noisy, informal inputs, but at 10× higher latency

T5-small, trained on the same AI4Privacy data, delivered fast, structured, low-cost masking, making it viable for real-time systems (see the sketch after this list)

Dataset normalization (not model size) was one of the biggest drivers of performance gains
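
For a sense of how the T5 side works in practice, here is a hedged sketch of masking as a text-to-text task; the checkpoint name and prompt prefix are assumptions, since the preprint's exact setup isn't public:

# Sketch of PII masking with a fine-tuned T5-small.
# Model ID and the "mask pii:" prefix are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "your-org/t5-small-pii-masking"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Contact Jane Doe at jane.doe@example.com"
inputs = tokenizer("mask pii: " + text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# expected output shape: "Contact [NAME] at [EMAIL]"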

The models were then deployed in a live Discord bot, where performance dropped under real-world conditions: a reminder that benchmarks alone aren’t enough.

The takeaway is hard to ignore:

Privacy-preserving AI scales through data design, not just bigger models.

This work reinforces why open, well-curated datasets like AI4Privacy PII-Masking-200K are becoming foundational infrastructure for privacy-first AI, especially for teams that need self-hosted, transparent solutions.

📄 Read the paper: https://arxiv.org/abs/2512.18608
MikeDoes 
posted an update 11 days ago
PII leakage isn’t just a model problem; it’s a data problem.

A recent paper takes a hard look at how well current systems actually detect and redact personal data at scale. One of their key conclusions is something the privacy community keeps rediscovering: without large, structured, and diverse PII datasets, evaluation collapses into guesswork.

To ground their experiments, the authors benchmarked their approach using the 500K PII-Masking dataset from AI4Privacy, leveraging its scale and coverage to test real-world redaction behavior rather than toy examples.

What’s interesting here isn’t just the model performance; it’s what the evaluation reveals.

The paper shows that many systems appear robust under narrow tests but fail once PII appears in varied formats, contexts, and combinations. This gap between “works in theory” and “works in practice” is exactly where privacy risks emerge.
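
A toy sketch of that gap: a detector tuned to one surface form passes a narrow test and silently misses the same entity everywhere else. The regex below is a deliberate straw man, not any paper's system:

# Why format diversity matters: one phone number, three surface forms.
import re

variants = [
    "Call me at 555-123-4567",
    "Call me at (555) 123 4567",
    "Call me at +1 555.123.4567",
]

naive_phone = re.compile(r"\d{3}-\d{3}-\d{4}")  # passes narrow tests only

hits = sum(bool(naive_phone.search(v)) for v in variants)
print(f"recall across formats: {hits}/{len(variants)}")  # 1/3 in practice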

This is the value of open, research-grade datasets:

They expose failure modes early

They make comparisons reproducible

They let the community measure progress honestly

When researchers build on shared data foundations, everyone benefits: from academic insight to safer downstream applications.

🔗 Read the full paper here: https://arxiv.org/abs/2407.08792
Tonic 
posted an update 15 days ago
🙋🏻‍♂️ Hey there folks,

since everyone liked my previous announcement post ( https://huggingface.co/posts/Tonic/338509028435394 ) so much, i'm back with more high-quality procedural datasets in the geospatial domain for SFT training!

Check this one out :
NuTonic/sat-bbox-metadata-sft-v1

the goal is to be able to train vision models on multiple images for remote sensing analysis in one shot.
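
if you want to poke at it first, here's a quick loading sketch (it prints the schema instead of assuming column names; the split name is an assumption):

# Quick look at the dataset; print the schema rather than guessing field names.
from datasets import load_dataset

ds = load_dataset("NuTonic/sat-bbox-metadata-sft-v1", split="train")  # split assumed
print(ds.column_names)
print(ds[0])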

hope you like it! 🚀
Tonic 
posted an update 19 days ago
🙋🏻‍♂️ Hey there folks ,

I'm sharing Hugging Face's largest dataset of annotated satellite images today.

check it out here: NuTonic/sat-image-boundingbox-sft-full

I hope you like it! The idea is to be able to use this with small vision models 🚀
MikeDoes 
posted an update about 1 month ago
What happens when PII masking is treated as a trainable behavior, not just a detection task?

A new reinforcement learning environment tackles this question using a dataset derived from ai4privacy/open-pii-masking-500k-ai4privacy, transformed into a verifier-based training and evaluation setup.

Instead of evaluating PII masking as a one-off redaction step, this environment frames privacy as something models must consistently optimize for under feedback. The task requires models to correctly identify sensitive spans, replace them with [PII] tags, and comply with strict output formatting — all scored through explicit reward signals.

To make this realistic, the author filtered and normalized the dataset to focus on US-English examples, ensuring consistent masking targets while preserving the structural diversity needed to expose failure modes.
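
To illustrate the verifier idea, here is a minimal reward sketch. The environment's actual scoring code likely differs, and the partial-credit shaping below is my own assumption:

# Verifier-style reward sketch for [PII]-tag masking. Exact match of the
# reference earns full reward; otherwise partial credit for the fraction
# of expected [PII] tags produced. Shaping values are assumptions.
def reward(model_output: str, reference: str) -> float:
    if model_output == reference:
        return 1.0                          # correct spans and formatting
    expected = reference.count("[PII]")
    if expected == 0:
        return 0.0
    produced = model_output.count("[PII]")
    return 0.5 * min(produced, expected) / expected

print(reward("Hi [PII], meet me at [PII].", "Hi [PII], meet me at [PII]."))    # 1.0
print(reward("Hi [PII], meet me at the cafe.", "Hi [PII], meet me at [PII]."))  # 0.25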

What's notable here isn't just the environment itself, but the shift in perspective.

By turning PII masking into a reinforcement learning problem, privacy stops being a static rule and becomes a behavior models are trained to maintain even under optimization pressure.

This is a strong example of how open privacy datasets can move beyond benchmarks and become infrastructure for new learning paradigms.

🔗 Explore the PII Masking RL environment on Prime Intellect:
https://app.primeintellect.ai/dashboard/environments/adamlucek/pii-masking
MikeDoes 
posted an update about 1 month ago
Things our clients and open source actually said to us this year:

"Finally, someone built a synthetic PII training data for German."

"Does it cover have localised information? Not just the language, the actual format. That must have been a lot of work that we can save from our side."

"We operate in 12 EU countries. Your dataset is the only one that covers all of them which has helped us out a lot in compliance especially because it's synthetic."

Every language has strong PII localization: names, addresses, IDs, phone numbers, and dates in the real format of that country.

23 languages. 29 regions. 3 scripts. 1,428,143 examples.

100% synthetic. Zero real personal data. Free on Hugging Face.
MikeDoes 
posted an update about 1 month ago
Ai4Privacy has been working on this for the past year. 🙏

Today we're releasing the PII Masking 2M Series, the world's largest open source privacy masking dataset. (Again. 🚀🚀)

🔢 2M+ synthetic examples
🌍 32 locales across Europe
🏷️ 98 entity types
🏥💬🏦💼📍 5 industry verticals: Health, Finance, Digital, Work, Location
✅ 1M+ entries freely available on Hugging Face

Every example is 100% synthetic. No real personal data. Built so you can train and evaluate PII detection models without the legal headaches. 🔒

Thank you for 15,000,000+ downloads across our datasets, models, and libraries. This one's for you. ❤️


#privacy #ai #opensource #nlp #gdpr #pii #huggingface #machinelearning
MikeDoes 
posted an update 3 months ago
Stop sending sensitive data across the network. Sanitize it directly in the browser. 💡

A recent blog post by A. Christmas provides a practical guide on how to achieve exactly that. They demonstrated a powerful form of anonymization: PII masking at the edge. The vision is simple but profound: keep sensitive data off the network entirely by sanitizing it in the browser.

The Ai4Privacy pii-masking-200k dataset served as the foundation for their work, providing the high-quality, diverse examples of PII needed to fine-tune a specialized DistilBERT model: one accurate, fast, and light enough to run client-side.
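
The blog runs this in the browser with a JavaScript runtime; as a rough server-side analog, with a placeholder checkpoint name rather than the blog's released model, the masking step looks like this:

# Server-side analog of the client-side masking step: a token-classification
# model tags PII spans, which are replaced before text leaves the machine.
from transformers import pipeline

ner = pipeline("token-classification",
               model="your-org/distilbert-pii-ner",   # hypothetical fine-tune
               aggregation_strategy="simple")

def sanitize(text: str) -> str:
    # Replace detected spans right-to-left so earlier offsets stay valid.
    for ent in sorted(ner(text), key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(sanitize("Email jane.doe@example.com about invoice #4521."))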

This is the future we are working towards: a world where developers are empowered with the tools and data to build powerful AI systems that respect user privacy by design. This is exactly why we build our datasets, and we're thrilled to showcase this project that turns the principles of data privacy into a practical, deployable solution.

🔗 See their innovative approach in action: https://ronathan.esr-inc.com/automatically-sanitize-data-in-the-users-browser-with-ai/

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #WorldsLargestOpenPrivacyMaskingDataset
MikeDoes 
posted an update 3 months ago
At Ai4Privacy, our goal is to empower researchers to build a safer AI ecosystem. Today, we're highlighting crucial research that does just that by exposing a new vulnerability.

The paper "Forget to Flourish" details a new model poisoning technique. It's a reminder that as we fine-tune LLMs, our anonymization and privacy strategies must evolve to counter increasingly sophisticated threats.

We're proud that the Ai4Privacy dataset was instrumental in this study. It served two key purposes:

Provided a Realistic Testbed: It gave the researchers access to a diverse set of synthetic and realistic PII samples in a safe, controlled environment.

Enabled Impactful Benchmarking: It allowed them to measure the actual effectiveness of their data extraction attack, proving it could compromise specific, high-value information.

This work reinforces our belief that progress in AI security is a community effort. By providing robust tools for benchmarking, we can collectively identify weaknesses and build stronger, more resilient systems. A huge congratulations to the authors on this important contribution.

🔗 Read the full paper: https://arxiv.org/html/2408.17354v1

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #WorldsLargestOpenSourcePrivacyMaskingDataset
Tonic 
posted an update 3 months ago
🤔 Who would win?

- a fully subsidized ai lab
OR
- 3 random students named
kurakurai
?

demo: Tonic/fr-on-device

if you like it give the demo a little star and send a shoutout to: @MaxLSB @jddqd and @GAD-cell for absolutely obliterating the Pareto frontier of French language understanding.
MikeDoes 
posted an update 3 months ago
How do you prove a new AI privacy tool actually works? You test it against a world-class benchmark.

That's why we're proud our data played a key role in the research for "Rescriber," a new browser extension for user-led anonymization. To objectively measure their tool's performance against other methods, the researchers needed a diverse and challenging evaluation set.

They built their benchmark using 240 samples from the Ai4Privacy open dataset.

This is a win-win for the ecosystem: our open-source data helps researchers validate their innovative solutions, and in turn, their work pushes the entire field of privacy-preserving AI forward. The "Rescriber" tool is a fantastic step towards on-device, user-controlled privacy.

🔗 Learn more about their data-driven findings in the full paper: https://arxiv.org/pdf/2410.11876

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#DataPrivacy #AI #OpenSource #Anonymization #MachineLearning #HealthcareAI #Ai4Privacy
Tonic 
posted an update 3 months ago
🙋🏻‍♂️hello my lovelies ,

it is with great pleasure that i present my working one-click-deploy, 16GB-RAM, completely free Hugging Face Spaces deployment.

repo: Tonic/hugging-claw (use git clone to inspect)
literally the one-click link: Tonic/hugging-claw

you can also run it locally and see for yourself :

docker run -it -p 7860:7860 --platform=linux/amd64 \
-e HF_TOKEN="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_TRUSTED_PROXIES="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_PASSWORD="YOUR_VALUE_HERE" \
-e OPENCLAW_CONTROL_UI_ALLOWED_ORIGINS="YOUR_VALUE_HERE" \
registry.hf.space/tonic-hugging-claw:latest


just a few minor details i'll take care of, but i wanted to share here first
MikeDoes 
posted an update 3 months ago
State-of-the-art AI doesn't start with a model. It starts with the data.

Achieving near-perfect accuracy for PII & PHI anonymization is one of the toughest challenges in NLP. A model is only as good as the data it learns from, and providing this foundational layer is central to our mission. The ai4privacy/pii-masking-400k dataset was built for this exact purpose: to serve as a robust, large-scale, open-source training ground for building high-precision privacy tools.

To see the direct impact of this data-first approach, look at the ner_deid_aipii model for Healthcare NLP by John Snow Labs. By training on our 400,000 labeled examples, the model achieved incredible performance:

100% F1-score on EMAIL detection.

99% F1-score on PHONE detection.

97% F1-score on NAME detection.

This is the result of combining a cutting-edge architecture with a comprehensive, high-quality dataset. We provide the open-source foundation so developers can build better, safer solutions.


Explore the dataset that helps power these next-generation privacy tools: ai4privacy/pii-masking-400k

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#DataPrivacy #AI #OpenSource #Anonymization #MachineLearning #HealthcareAI #Ai4Privacy
MikeDoes 
posted an update 3 months ago
Can you teach a giant like Google's Gemini to protect user privacy? A new step-by-step guide shows that the answer is a resounding "yes."

While powerful, large language models aren't specialized for privacy tasks. This tutorial by Analytics Vidhya walks through how to fine-tune Gemini into a dedicated tool for PII anonymization.

To teach the model this critical skill, the author needed a robust dataset with thousands of clear 'before' and 'after' examples.

We're thrilled they chose the Ai4Privacy pii-masking-200k dataset for this task. Our data provided the high-quality, paired examples of masked and unmasked text necessary to effectively train Gemini to identify and hide sensitive information accurately.
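
As a sketch of the data-preparation step, here is how the paired examples might be exported for tuning. The dataset column names and the tuning-record keys below are assumptions to check against the actual pii-masking-200k schema and the tutorial:

# Export masked/unmasked pairs as JSONL tuning records. Column names
# (source_text / target_text) and record keys are assumptions; verify
# with ds.column_names and the tuning API you use.
import json
from datasets import load_dataset

ds = load_dataset("ai4privacy/pii-masking-200k", split="train")

with open("gemini_pii_pairs.jsonl", "w") as f:
    for row in ds.select(range(1000)):       # small tuning subset
        f.write(json.dumps({
            "text_input": row["source_text"],  # unmasked text (assumed field)
            "output": row["target_text"],      # masked text (assumed field)
        }) + "\n")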

This is a perfect example of how the community can use open-source data to add a crucial layer of safety to the world's most powerful models. Great work!

🔗 Check out the full tutorial here: https://www.analyticsvidhya.com/blog/2024/03/guide-to-fine-tuning-gemini-for-masking-pii-data/

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#DataPrivacy #AI #LLM #FineTuning #Anonymization #GoogleGemini #Ai4Privacy #WorldsLargestOpenPrivacyMaskingDataset
MikeDoes 
posted an update 3 months ago
You don't need a massive research lab to build a privacy-preserving AI tool, thanks to open datasets. With the right ingredients, anyone can.

A fantastic new guide shows how the democratization of AI is helping to advance safety. It walks through how to use Google's new fine-tuning API to turn Gemini into a powerful tool for PII anonymization.

This project was powered by two key components:

An accessible platform from Google.

High-quality, open-source training data.

We are honored that the author chose the Ai4Privacy pii-masking-200k dataset to provide the crucial data foundation. Our dataset delivered the volume and structure needed to successfully teach a state-of-the-art model how to perform a critical privacy function.

This is the future we're working towards: powerful platforms combined with open, safety-focused data to create tools that benefit everyone. Kudos to the author for showcasing what's possible!

🔗 Read the full step-by-step guide: https://www.analyticsvidhya.com/blog/2024/03/guide-to-fine-tuning-gemini-for-masking-pii-data/

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#AIforGood #DemocratizeAI #DataPrivacy #Anonymization #OpenSource #LLM #Ai4Privacy
MikeDoes 
posted an update 3 months ago
Are you sure the open-source model you just downloaded is safe?

A recent paper on "Privacy Backdoors" reports a new vulnerability where pre-trained models can be poisoned before fine-tuning. This is a serious challenge for everyone building on open-source AI.

Instead of just pointing out problems, we believe in finding better solutions. To understand this threat, the researchers needed to test their attack on realistic data structures. They needed a dataset that could effectively simulate a high-stakes privacy attack, and we're proud that our Ai4Privacy dataset was used to provide this crucial benchmark. The paper reports that for our complex dataset, the privacy leakage on a non-poisoned model was almost zero. After the backdoor attack, that number reportedly jumped to 87%.

Our dataset, composed of synthetic identities, helped them demonstrate how a poisoned model can dramatically amplify privacy leakage.

This is why we champion open source: it enables the community to identify these issues and develop better, safer solutions together.
Kudos to the authors Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, and Nicholas Carlini, of the University of Maryland and Google DeepMind.

🔗 Read the research to understand this new challenge: https://arxiv.org/pdf/2404.01231

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/