---
title: Open Chatbot
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---
# Open Chatbot

**Zero File Leakage AI Chatbot · Privacy by Design**

> Your documents never leave your server. Only text goes to the cloud.
This architecture adopts the principle of **Privacy by Design**: all documents are processed locally on the user's server before any interaction with AI models through an API. The original files (PDF, Word, Excel, or OCR-extracted text from images) are never transmitted to the AI provider. The system sends only parsed and sliced text segments structured as JSON; the textual content relevant for reasoning is transmitted, without raw files, full document structures, or any sensitive metadata.

The Problem · The Solution · How It Works · Quick Start · Features
## 🚨 The Problem: Document Data Leakage

Many companies and individuals want to leverage AI to analyze internal documents (financial reports, contracts, HR data, medical records, legal documents) but face serious risks:
| Risk | Explanation |
|---|---|
| Files sent raw to cloud | When uploading PDF/Word to ChatGPT, Claude, or other AI, the original file is sent to their servers |
| Binary files stored on AI servers | .pdf, .docx, .xlsx files are stored temporarily or permanently on AI provider infrastructure |
| Metadata leakage | Filenames, author info, revision history, hidden comments are all transmitted |
| No control | Once a file is sent, you have no control over data retention and usage |
| Compliance violation | Violates GDPR, HIPAA, SOC 2, or internal corporate policies |
**Real-world example:** You upload a Q4 financial report to ChatGPT. The 5 MB PDF is sent in full to OpenAI's servers, including metadata, embedded images, hidden text, and revision history. You have no idea how long that file is retained.
## ✅ How Open Chatbot Solves It
Open Chatbot uses a Local Processing + Text-Only Inference architecture that ensures original files never leave your infrastructure:
```
YOUR INFRASTRUCTURE (On-Premise / VPS)         CLOUD (AI Provider)
==========================================     =======================
📄 PDF  📄 DOCX  📊 XLSX  🖼️ Image             OpenAI / Claude /
        |                                      DeepSeek API
        v
+-----------------------+                      Only receives:
|  LOCAL FILE PROCESSOR |                      ✅ Sliced text (JSON)
|  • PDF   → pdftotext  |                      ✅ User questions
|  • DOCX  → mammoth    |      text JSON       ✅ System prompt
|  • XLSX  → SheetJS    |  ─────────────────►
|  • Image → Tesseract  |   (max 30KB/file)    Never receives:
|  • OCR scanned docs   |                      ❌ Original files (binary)
+-----------------------+                      ❌ Images / scans
        |                                      ❌ Document metadata
        v                                      ❌ Revision history
🗑️ File deleted from                           ❌ Hidden content
   memory after                                ❌ Embedded objects
   text extraction
```
### Security Principles

| Principle | Implementation |
|---|---|
| Files never sent to AI | Files are processed locally; only extracted text is transmitted |
| Text is sliced | Each file is capped at max 30KB of text before being sent as a JSON chunk |
| No server storage | Files are buffered in memory, extracted, then immediately deleted from temp |
| Stateless API inference | DeepSeek, OpenAI, and Claude APIs do not use API-call data for training (see provider policies below) |
| API keys in browser | Keys stored in browser localStorage, never on the server |
| Zero binary transfer | What's sent to AI: `{"role":"system","content":"text..."}`, not files |
## 🔬 How It Works Technically

### 1. Upload & Local Processing
When a user uploads a file, all processing happens on your own server:
```
POST /api/upload ← FormData { files: [File] }
        |
        v
┌─────────────────────┐
│  file-processor.ts  │
│                     │
│  PDF  ──► pdftotext │  CLI tool (poppler)
│  PDF  ──► OCR       │  tesseract CLI (fallback for scanned docs)
│  DOCX ──► mammoth   │  npm library
│  DOC  ──► word-ext  │  npm library
│  XLSX ──► SheetJS   │  npm library
│  IMG  ──► tesseract │  OCR engine
│  TXT  ──► buffer    │  native Node.js
└──────────┬──────────┘
           │
           v
{ id, filename, text, size } ← JSON response
     (text only, not the file)
```
What happens in memory:

- File is buffered as a `Buffer` in RAM
- Text is extracted by the appropriate library
- Buffer and temp files are immediately deleted after extraction
- Only a `string` of text is returned to the client
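The in-memory flow above can be sketched roughly as follows. This is an illustrative sketch only: the real `file-processor.ts` shells out to `pdftotext`/`tesseract` and uses `mammoth`/SheetJS, while the `extractors` dispatch table and `processUpload` below are hypothetical names.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical dispatch table keyed by file extension. Only the native
// text path is implemented here; the PDF/DOCX entries are placeholders
// for the CLI tools and npm libraries named in the diagram above.
type Extractor = (buf: Buffer) => Promise<string>;

const extractors: Record<string, Extractor> = {
  ".txt": async (buf) => buf.toString("utf-8"), // native Node.js path
  // ".pdf":  (buf) => runPdftotext(buf),  // poppler CLI (not shown)
  // ".docx": (buf) => runMammoth(buf),    // npm library (not shown)
};

async function processUpload(
  filename: string,
  buf: Buffer
): Promise<{ id: string; filename: string; text: string; size: number }> {
  const ext = filename.slice(filename.lastIndexOf(".")).toLowerCase();
  const extract = extractors[ext];
  if (!extract) throw new Error(`Unsupported file type: ${ext}`);
  const text = await extract(buf);
  // The Buffer goes out of scope here; only the extracted string survives.
  return { id: randomUUID(), filename, text, size: text.length };
}
```

The key property is the return type: the response object carries a `string` of text and some lightweight fields, never the file bytes.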
### 2. Sending to AI Provider
When the user sends a message, only JSON text is sent to the AI API:
```jsonc
// What is ACTUALLY sent to the API (from chat/route.ts)
{
  "model": "deepseek-chat",
  "system": "You are an AI assistant...\n\n=== File: report.pdf ===\nExtracted text here (max 30KB)...\n=== End File ===",
  "messages": [
    { "role": "user", "content": "Analyze this financial report" }
  ],
  "max_tokens": 8192,
  "stream": true
}
```
Notice:

- There is no `file`, `attachment`, or `binary` in the payload
- The `system` field contains plain text from extraction, not a file
- Each file context is truncated to max 30,000 characters (`fc.text.slice(0, 30000)`)
- Responses are streamed as `text-delta` chunks, not stored on the server
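The 30KB-per-file slicing can be sketched as a small prompt builder. This is a minimal illustration: `FileContext` and `buildSystemPrompt` are assumed names rather than the app's actual exports, though the cap matches the `fc.text.slice(0, 30000)` described above.

```typescript
// Minimal illustration of folding file contexts into the system prompt.
// The 30,000-character cap mirrors fc.text.slice(0, 30000) in chat/route.ts.
interface FileContext {
  filename: string;
  text: string;
}

function buildSystemPrompt(base: string, fileContexts: FileContext[]): string {
  const sections = fileContexts.map(
    (fc) =>
      `=== File: ${fc.filename} ===\n` +
      fc.text.slice(0, 30000) + // hard cap: max 30,000 chars per file
      `\n=== End File ===`
  );
  return [base, ...sections].join("\n\n");
}
```

However large the uploaded file, the AI provider only ever sees at most 30,000 characters of extracted text per file, framed between the `=== File ===` markers.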
### 3. Why AI Providers Don't Store Your Data
| Provider | API Policy |
|---|---|
| OpenAI | Data from API calls is not used for training and not permanently stored (unlike the ChatGPT web interface) |
| Anthropic | API calls are not stored for model training. Limited retention for abuse monitoring only |
| DeepSeek | API follows minimal data retention policy for inference |
**Note:** This applies to usage via the API (which Open Chatbot uses), not via the web interfaces (ChatGPT/Claude web), which have different policies.
## 📊 Comparison: Open Chatbot vs Direct Upload to AI

| Aspect | Upload to ChatGPT/Claude Web | Open Chatbot |
|---|---|---|
| Original file sent to cloud | ❌ Yes, full file | ✅ No, text only |
| Binary/images transmitted | ❌ Yes | ✅ No |
| Document metadata transmitted | ❌ Yes (author, revisions) | ✅ No |
| Data used for training | ⚠️ Possibly (depends on settings) | ✅ No (API mode) |
| File stored on AI server | ⚠️ Temporarily/permanently | ✅ No file at all |
| Data retention control | ❌ Minimal | ✅ Full (self-hosted) |
| Compliance friendly | ⚠️ Requires review | ✅ Data stays on-premise |
| Data size transmitted | Full file (MBs) | Text only (max 30KB/file) |
## 🚀 Quick Start

### Prerequisites

- Node.js 18+
- An API key from at least one provider: DeepSeek, OpenAI, or Anthropic
- (Optional) `poppler-utils` and `tesseract-ocr` for PDF OCR
### Install & Run

```bash
# Clone the repository
git clone https://github.com/romizone/chatbot-next.git
cd chatbot-next

# Install dependencies
npm install

# Start development server
npm run dev
```
Open http://localhost:3000 → click **Settings** → select a provider → enter your API key → start chatting!
### Environment Variables (Optional)

Create a `.env.local` file for server-side fallback keys:

```env
DEEPSEEK_API_KEY=sk-...
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```
### Install OCR Tools (Optional, for scanned PDFs & images)

```bash
# macOS
brew install poppler tesseract tesseract-lang

# Ubuntu/Debian
sudo apt install poppler-utils tesseract-ocr tesseract-ocr-eng

# Windows (via Chocolatey)
choco install poppler tesseract
```
## 🎯 Full Features

### Multi-Provider AI Engine
| Provider | Models | Max Output |
|---|---|---|
| DeepSeek | DeepSeek Chat, DeepSeek Reasoner | 8K - 16K tokens |
| OpenAI | GPT-4o, GPT-4o Mini, GPT-4.1, GPT-4.1 Mini | 16K - 32K tokens |
| Anthropic | Claude Sonnet 4.5, Claude Haiku 4.5 | 8K - 16K tokens |
### Local Document Processing

| Format | Engine | Capability |
|---|---|---|
| PDF | `pdftotext` CLI + Tesseract OCR | Text extraction + OCR for scanned PDFs (max 20 pages) |
| DOCX | `mammoth` | Full text and formatting extraction |
| DOC | `word-extractor` | Legacy Word document support |
| XLSX/XLS | `xlsx` (SheetJS) | Spreadsheet to structured text (CSV per sheet) |
| CSV | `xlsx` (SheetJS) | Direct parsing |
| Images | `tesseract` CLI | OCR: PNG, JPG, BMP, TIFF, WebP |
| Text | Native `Buffer` | TXT, MD, JSON, XML, HTML, source code (20+ formats) |
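To illustrate the "CSV per sheet" output shape, here is a minimal stand-in that works on plain row arrays. The real processor uses SheetJS (`XLSX.utils.sheet_to_csv`); `Sheet` and `sheetsToText` below are hypothetical names for illustration only.

```typescript
// Minimal sketch of "spreadsheet → structured text, CSV per sheet".
// Works on plain row arrays just to show the text the AI ends up seeing.
type Sheet = { name: string; rows: (string | number)[][] };

// Quote cells containing commas, quotes, or newlines, per CSV convention.
function csvEscape(cell: string | number): string {
  const s = String(cell);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

function sheetsToText(sheets: Sheet[]): string {
  return sheets
    .map(
      (sheet) =>
        `--- Sheet: ${sheet.name} ---\n` +
        sheet.rows.map((row) => row.map(csvEscape).join(",")).join("\n")
    )
    .join("\n\n");
}
```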
### Core Features

- **Zero File Leakage**: files processed locally, only text sent to AI via JSON
- **Per-Provider API Keys**: each provider has its own key slot in browser localStorage
- **Real-time Connection Indicator**: green/red status in sidebar
- **Auto-Continue**: detects truncated responses and automatically requests continuation
- **LaTeX Math (KaTeX)**: mathematical formula rendering for `$inline$` and `$$block$$`
- **Syntax Highlighting**: code blocks with language detection (Prism theme)
- **Multi-Session**: chat history with multiple sessions
- **File Context Persistence**: file contexts preserved per session
- **Responsive UI**: collapsible sidebar, Tailwind CSS + Radix UI
- **Streaming Response**: real-time responses via Vercel AI SDK `streamText`
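The Auto-Continue feature could be driven by a truncation heuristic like the one below. This is an assumption for illustration; the app's actual detection logic is not documented in this README.

```typescript
// Hedged sketch of truncation detection for Auto-Continue: a reply that
// ends mid-code-block or without terminal punctuation is likely cut off
// by the max_tokens limit and worth a follow-up "continue" request.
function looksTruncated(reply: string): boolean {
  const trimmed = reply.trimEnd();
  const fenceCount = (trimmed.match(/```/g) ?? []).length;
  if (fenceCount % 2 === 1) return true; // unclosed code fence
  return !/[.!?)\]}`"']$/.test(trimmed); // no sentence-final punctuation
}
```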
## 🏗️ Architecture

```
chatbot-next/
├── src/
│   ├── app/
│   │   ├── api/
│   │   │   ├── chat/route.ts         # Streaming chat endpoint (multi-provider)
│   │   │   └── upload/route.ts       # Local file processing endpoint
│   │   ├── layout.tsx
│   │   └── page.tsx
│   ├── components/
│   │   ├── chat/
│   │   │   ├── chat-page.tsx         # Main orchestrator
│   │   │   ├── chat-area.tsx         # Message display area
│   │   │   ├── chat-input.tsx        # Input with file upload
│   │   │   ├── chat-message.tsx      # Message bubble
│   │   │   ├── markdown-renderer.tsx # Markdown + KaTeX + syntax highlight
│   │   │   ├── settings-dialog.tsx   # Provider & API key settings
│   │   │   ├── sidebar.tsx           # Session list + connection indicator
│   │   │   └── welcome-screen.tsx    # Landing screen
│   │   └── ui/                       # Radix UI primitives
│   ├── hooks/
│   │   └── use-chat-store.ts         # State management (localStorage)
│   └── lib/
│       ├── constants.ts              # Model list, system prompt, defaults
│       ├── file-processor.ts         # Local document processing engine
│       └── types.ts                  # TypeScript interfaces
├── public/
├── package.json
└── README.md
```
### Data Flow

```
User uploads file
        |
        v
POST /api/upload
        |
        v
file-processor.ts ── [PDF → pdftotext]  [DOCX → mammoth]
        |            [XLSX → SheetJS]   [IMG → tesseract OCR]
        v
JSON response: { filename, text, size }  ← text only, file deleted
        |
        v
Browser stores text in memory
        |
        v
User sends message
        |
        v
POST /api/chat ──► { messages, provider, model, apiKey, fileContexts }
        |
        v
Build system prompt + file text (sliced max 30KB/file)
        |
        v
streamText() to AI Provider ──► text-delta chunks ──► UI
        |
        v
AI only receives JSON text, never file binaries
```
## 🐳 Deployment

### Self-Hosted (Recommended for Enterprise)

For maximum data privacy, deploy on your own infrastructure:

```bash
npm run build
npm start
```
### Docker

```dockerfile
FROM node:18-alpine

# Install OCR tools (optional)
RUN apk add --no-cache poppler-utils tesseract-ocr

WORKDIR /app
COPY package*.json ./
# Install all dependencies (dev dependencies are needed for `next build`)
RUN npm ci
COPY . .
RUN npm run build

EXPOSE 3000
CMD ["npm", "start"]
```

```bash
docker build -t open-chatbot .
docker run -p 3000:3000 open-chatbot
```
### Vercel

```bash
npx vercel
```

**Note:** On Vercel (serverless), OCR features require the Tesseract binary, which may not be available. Use Docker deployment for full OCR support.
## 🔒 Security Summary

| Aspect | Detail |
|---|---|
| File Processing | 100% local on your server, files deleted after extraction |
| Data to AI Provider | Plain text JSON chunks only (max 30KB/file) |
| API Keys | Stored in browser localStorage, sent per-request via HTTPS |
| File Binaries | Never sent to AI providers |
| Document Metadata | Stripped during extraction; only text content remains |
| Temp Files | Auto-cleanup after processing (using `finally` blocks) |
| Data Retention | Chat history in browser localStorage only |
| Network Payload | JSON `{ role, content }`, not `multipart/form-data` files |
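The `finally`-based temp-file cleanup noted in the table can be sketched like this. `withTempFile` is a hypothetical helper, not the app's actual code; the point is that the `finally` block removes the temp directory even when extraction throws.

```typescript
import { mkdtemp, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Write the uploaded buffer to a temp file, run `work` on it (e.g. shell
// out to pdftotext/tesseract), and guarantee cleanup via finally.
async function withTempFile<T>(
  buf: Buffer,
  work: (path: string) => Promise<T>
): Promise<T> {
  const dir = await mkdtemp(join(tmpdir(), "upload-"));
  const path = join(dir, "input.bin");
  try {
    await writeFile(path, buf);
    return await work(path);
  } finally {
    // Runs on success AND on error: the file never outlives the request.
    await rm(dir, { recursive: true, force: true });
  }
}
```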
## 🛠️ Tech Stack
| Layer | Technology |
|---|---|
| Framework | Next.js 16 (App Router, Turbopack) |
| UI | React 19, Tailwind CSS 4, Radix UI, Lucide Icons |
| AI Integration | Vercel AI SDK 6 (streamText, createUIMessageStream) |
| Document Processing | pdftotext (poppler), mammoth, word-extractor, SheetJS, Tesseract |
| Math Rendering | KaTeX, remark-math, rehype-katex |
| Code Highlighting | react-syntax-highlighter (Prism) |
| Language | TypeScript 5 (strict mode) |
## 🤝 Contributing

Contributions are welcome! Feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'feat: add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## 📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

**Built to prevent document data leakage when using AI.**
Your files stay on your server. Only text goes to the cloud.

Made with ❤️ by Romi Nur Ismanto