sachnun committed on
Commit
e7faa09
·
1 Parent(s): f98d63d

Add file existence check before upload for deduplication


- Create reusable checkFileExists() helper function
- Add file existence check in /upload endpoint to skip uploading duplicate files
- Refactor /upload/remote to use the shared checkFileExists() function
- Add CLAUDE.md with codebase architecture documentation

This prevents unnecessary uploads to HuggingFace Dataset when files with
the same hash already exist, saving bandwidth and processing time.

Files changed (2)
  1. CLAUDE.md +171 -0
  2. src/index.ts +42 -22
CLAUDE.md ADDED
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Hugstream Upload Server is an external upload proxy for the Hugstream file storage system. It runs as a Docker container (typically on Hugging Face Spaces) and handles direct file uploads from browsers to Hugging Face Datasets, offloading bandwidth from the main VPS and keeping the HF_TOKEN secure on the server side.

## Key Architecture Concepts

### Upload Flow
1. Browser sends the file + sessionToken directly to this upload server (not through the main VPS)
2. Upload server validates the sessionToken against the PostgreSQL database (shared with the main app)
3. File is uploaded to the HF Dataset using content-addressed storage (hash-based paths)
4. A database record is created with a fileId, which can be used to generate download URLs via the proxy

The whole flow is condensed in the sketch after this list.
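A minimal Hono-style outline of that flow. This is a sketch, not the actual handler: `validateSessionToken`, `checkFileExists`, and `uploadToHFWithRetry` are helpers documented below, while `md5()` and `createFileRecord()` are hypothetical stand-ins for inline logic in `src/index.ts`.

```typescript
import { Hono } from 'hono'

// Helpers documented elsewhere in this file (declarations for the sketch):
declare function validateSessionToken(token: string): Promise<string | null>
declare function checkFileExists(hfPath: string): Promise<boolean>
declare function uploadToHFWithRetry(hfPath: string, buffer: Buffer): Promise<unknown>
// Hypothetical stand-ins for inline logic in src/index.ts:
declare function md5(buffer: Buffer): string
declare function createFileRecord(fields: { name: string; hash: string; userId: string }): Promise<string>

const app = new Hono()

app.post('/upload', async (c) => {
  const body = await c.req.parseBody()
  const file = body['file'] as File
  const sessionToken = body['sessionToken'] as string

  // Steps 1-2: validate the session against the shared PostgreSQL database
  const userId = await validateSessionToken(sessionToken)
  if (!userId) return c.json({ error: 'Unauthorized' }, 401)

  // Step 3: content-addressed upload -- the HF path is just the MD5 hash
  const buffer = Buffer.from(await file.arrayBuffer())
  const hfPath = md5(buffer)
  if (!(await checkFileExists(hfPath))) {
    await uploadToHFWithRetry(hfPath, buffer)
  }

  // Step 4: create the DB record; the download proxy expects fileId, not hash
  const fileId = await createFileRecord({ name: file.name, hash: hfPath, userId })
  return c.json({ fileId })
})
```
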
### Authentication Pattern
- Uses session token authentication (from the `auth-session` cookie)
- Session tokens are SHA-256 hashed to produce the sessionId
- The database validates the session and retrieves the userId (see the sketch after this list)
- No separate upload-server token validation for user requests (the legacy UPLOAD_SERVER_TOKEN may be removed)

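A sketch of that lookup, assuming a hex SHA-256 digest, the `session` table from the schema section below, and that `db`/`sessionTable` are exported from `src/lib/db`; the concrete implementation may differ.

```typescript
import { createHash } from 'node:crypto'
import { eq } from 'drizzle-orm'
import { db, sessionTable } from './lib/db' // assumed export location

async function validateSessionToken(sessionToken: string): Promise<string | null> {
  // The session row ID is the SHA-256 digest of the raw token,
  // so the raw token itself is never stored server-side
  const sessionId = createHash('sha256').update(sessionToken).digest('hex')

  const [session] = await db
    .select()
    .from(sessionTable)
    .where(eq(sessionTable.id, sessionId))

  // Reject unknown or expired sessions
  if (!session || session.expiresAt < new Date()) return null
  return session.userId
}
```
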
### Content-Addressed Storage
- Files are stored by MD5 hash for global deduplication
- HF path format: `{hash}` (just the hash, no userId prefix); see the sketch after this list
- The database `file` table maps fileId -> hash -> hfPath
- The same file uploaded by different users = a single copy in HF storage + multiple DB records

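The path scheme in one helper (`buildHfPath` is an illustrative name; `src/index.ts` assigns the hash to `hfPath` directly):

```typescript
import { createHash } from 'node:crypto'

// Identical bytes always yield the same path, regardless of who uploads them
function buildHfPath(buffer: Buffer): string {
  return createHash('md5').update(buffer).digest('hex')
}
```
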
### Remote Upload System (URL Downloads)
- Supports downloading files from remote URLs and uploading them to HF
- Uses aria2c (if available) for multi-connection downloads with progress tracking
- Falls back to standard fetch if aria2c is not installed (see the fallback sketch after this list)
- For large files (>5GB), uses Git LFS with `huggingface-cli lfs-enable-largefiles`
- Progress is tracked via Server-Sent Events (SSE) or a polling API
- Temporary files are cleaned up automatically after upload

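A rough sketch of the fallback decision; `downloadWithAria2` is a hypothetical stand-in for the aria2 RPC logic described under Aria2 Integration below.

```typescript
import { spawnSync } from 'node:child_process'
import { writeFile } from 'node:fs/promises'

// Hypothetical: multi-connection aria2 download with progress tracking
declare function downloadWithAria2(url: string, destPath: string): Promise<void>

// aria2c is preferred whenever the binary is on PATH
function isAria2Available(): boolean {
  return spawnSync('aria2c', ['--version']).status === 0
}

async function downloadRemote(url: string, destPath: string): Promise<void> {
  if (isAria2Available()) {
    await downloadWithAria2(url, destPath)
  } else {
    // Plain fetch fallback (buffers in memory, so best suited to smaller files)
    const res = await fetch(url)
    if (!res.ok) throw new Error(`Download failed: ${res.status}`)
    await writeFile(destPath, Buffer.from(await res.arrayBuffer()))
  }
}
```
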
## Development Commands

```bash
# Development server (watch mode)
npm run dev

# Build TypeScript
npm run build

# Production server
npm start

# Docker build & run
docker build -t hugstream-upload .
docker run -p 7860:7860 --env-file .env hugstream-upload
```

## Database Schema

Uses Drizzle ORM with PostgreSQL (via the pg driver). Three main tables:
- `user`: User accounts (id, username, passwordHash, storageQuota)
- `session`: Session tokens (id = hashed token, userId, expiresAt)
- `file`: File metadata (id, name, userId, parentId, hash, hfPath, isUploaded, etc.)

The database connection uses a singleton pool pattern in `src/lib/db/index.ts` for performance. An approximate schema sketch follows.

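Roughly what those tables look like in Drizzle; column types here are assumptions, and the real definitions live in the shared schema.

```typescript
import { pgTable, text, integer, timestamp, boolean } from 'drizzle-orm/pg-core'

export const userTable = pgTable('user', {
  id: text('id').primaryKey(),
  username: text('username').notNull(),
  passwordHash: text('password_hash').notNull(),
  storageQuota: integer('storage_quota') // assumed integer; could be bigint
})

export const sessionTable = pgTable('session', {
  id: text('id').primaryKey(), // SHA-256 hash of the raw session token
  userId: text('user_id').notNull().references(() => userTable.id),
  expiresAt: timestamp('expires_at').notNull()
})

export const fileTable = pgTable('file', {
  id: text('id').primaryKey(),
  name: text('name').notNull(),
  userId: text('user_id').notNull().references(() => userTable.id),
  parentId: text('parent_id'),
  hash: text('hash'),
  hfPath: text('hf_path'),
  isUploaded: boolean('is_uploaded')
})
```
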
## DNS Workaround for Koyeb

The codebase includes a DNS resolution workaround for Koyeb-hosted PostgreSQL (which uses Neon). If DATABASE_URL contains `koyeb.app`, the DNS lookup is intercepted and mapped to the underlying Neon infrastructure. See `src/lib/db/index.ts:setupDnsWorkaround()`.

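Roughly how such an interception can work in Node (a sketch only; `resolveToNeonHost` is a hypothetical placeholder for the actual mapping inside `setupDnsWorkaround()`):

```typescript
import dns from 'node:dns'

// Hypothetical mapping from the Koyeb hostname to the Neon host
declare function resolveToNeonHost(hostname: string): string

const originalLookup = dns.lookup

// net.Socket resolves hostnames through dns.lookup, so patching it
// redirects the pg driver's connection without touching DATABASE_URL
;(dns as any).lookup = (hostname: string, options: any, callback: any) => {
  if (hostname.endsWith('koyeb.app')) {
    hostname = resolveToNeonHost(hostname)
  }
  return (originalLookup as any)(hostname, options, callback)
}
```
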
## File Upload Methods

### 1. Direct Upload (`POST /upload`)
- Accepts multipart/form-data with file, sessionToken, and parentId
- Validates the session, calculates the hash, and uploads to HF with retry logic
- Creates a database record and returns a downloadUrl (if PUBLIC_DOWNLOAD_PROXY_HOST is configured)

### 2. Remote Upload (`POST /upload/remote`)
- Downloads the file from a URL, then uploads it to HF
- Returns an SSE stream for real-time progress updates
- Uses aria2c for multi-connection downloads when available
- For files >5GB: uses git clone + LFS + `huggingface-cli lfs-enable-largefiles`
- For smaller files or when aria2 is unavailable: uses the Hugging Face Hub commit API
- Supports cancellation via `DELETE /upload/remote/:uploadId`

### 3. Progress Monitoring
- SSE endpoint: `GET /upload/remote/progress/:uploadId`
- Polling endpoint: `GET /upload/remote/progress-poll/:uploadId`

A browser-side consumer is sketched after this list.

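How a browser client might consume the SSE endpoint; the payload field names `progress` and `status` are assumptions based on the progress events emitted in `src/index.ts`.

```typescript
function watchRemoteUpload(uploadId: string): void {
  const source = new EventSource(`/upload/remote/progress/${uploadId}`)

  source.onmessage = (event) => {
    const data = JSON.parse(event.data)
    console.log(`progress: ${data.progress}% (${data.status})`)

    // Close the stream once the upload settles
    if (data.status === 'completed' || data.status === 'error') {
      source.close()
    }
  }
}
```
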
## HuggingFace Upload Strategy

The upload strategy varies by file size and method:

1. **Small files (<5GB) in direct upload**: uses the `@huggingface/hub` uploadFile API with retry logic for concurrent conflicts
2. **Large files in remote upload**: uses a git clone + Git LFS + push workflow
3. **Retry logic**: exponential backoff (1s, 2s, 4s, 8s, 16s) for concurrent commit conflicts; see the sketch after this list

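The backoff schedule in isolation (a generic sketch; the real `uploadToHFWithRetry` in `src/index.ts` additionally detects deduplication via conflict errors):

```typescript
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      if (attempt >= maxRetries - 1) throw err
      const delayMs = 1000 * 2 ** attempt // 1s, 2s, 4s, 8s, 16s
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
}
```
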
## Environment Variables

Required:
- `DATABASE_URL`: PostgreSQL connection string
- `HF_TOKEN`: Hugging Face API token with write access
- `HF_DATASET_REPO`: Dataset repository (format: `username/repo`)

Optional:
- `UPLOAD_SERVER_TOKEN`: Legacy token (may be unused)
- `ALLOWED_ORIGIN`: CORS origin (default: `*`)
- `PUBLIC_DOWNLOAD_PROXY_HOST`: Base URL for download links
- `PORT`: Server port (default: 7860)

A fail-fast startup sketch follows.

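An illustrative startup check (not necessarily how `src/index.ts` reads them, but it matches the required/optional split above):

```typescript
// Fail fast at startup if a required variable is missing
const required = ['DATABASE_URL', 'HF_TOKEN', 'HF_DATASET_REPO'] as const
for (const name of required) {
  if (!process.env[name]) {
    throw new Error(`Missing required environment variable: ${name}`)
  }
}

// Optional variables with their documented defaults
const PORT = Number(process.env.PORT ?? 7860)
const ALLOWED_ORIGIN = process.env.ALLOWED_ORIGIN ?? '*'
```
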
## Important Implementation Details

### Concurrent Upload Handling
- Multiple users may upload the same file simultaneously
- Retry logic with exponential backoff handles Hugging Face "commit operation in progress" errors
- Deduplication is detected via conflict errors or the pre-upload `checkFileExists()` call

### Git Configuration
- Git identity is configured at runtime via the `start.sh` script
- Required for git-based uploads to the HuggingFace Dataset
- Identity: "Hugstream Upload Bot <bot@hugstream.upload>"

### Aria2 Integration
- Spawns a temporary aria2c RPC server for each download (see the sketch after this list)
- Uses a random port (6800-7800 range) to avoid conflicts
- 16 concurrent connections per download for speed
- Automatic cleanup of the aria2 process and temp files

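Spawning such a per-download server might look like this; the flag values mirror the list above, but the exact flags and cleanup wiring in `src/index.ts` may differ.

```typescript
import { spawn } from 'node:child_process'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Random port in the 6800-7800 range so parallel downloads don't collide
const rpcPort = 6800 + Math.floor(Math.random() * 1000)
const downloadDir = join(tmpdir(), `aria2-${rpcPort}`) // per-download temp dir

const aria2 = spawn('aria2c', [
  '--enable-rpc',
  `--rpc-listen-port=${rpcPort}`,
  '--max-connection-per-server=16', // 16 connections per download
  '--split=16',
  `--dir=${downloadDir}`
])

// Tear the process down whether the download succeeds or fails
const cleanup = () => aria2.kill()
```
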
### Hash Calculation
- For direct uploads: buffer-based MD5 (in-memory)
- For remote uploads with aria2: streaming hash calculation (memory-efficient); see the sketch after this list
- Used for deduplication and file integrity

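A streaming MD5 sketch using Node's standard crypto and fs APIs:

```typescript
import { createHash } from 'node:crypto'
import { createReadStream } from 'node:fs'

// Hashes the file chunk by chunk, so memory use stays flat
// even for multi-gigabyte downloads
function md5OfFile(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash('md5')
    createReadStream(path)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
      .on('error', reject)
  })
}
```
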
## Testing Endpoints

- `GET /health`: Health check (returns status, service name, timestamp)
- `GET /test-db`: Database connection test (queries the user and file tables)

## Common Patterns

### Creating Database Records
```typescript
// 15 random bytes, base32-encoded, yield a 24-character file ID
const fileId = encodeBase32LowerCase(crypto.getRandomValues(new Uint8Array(15)))
await db.insert(fileTable).values({
  id: fileId,
  name: filename,
  hash: calculatedHash,
  hfPath,
  userId,
  // ... other fields
})
```

### Validating Sessions
```typescript
const userId = await validateSessionToken(sessionToken)
if (!userId) {
  return c.json({ error: 'Unauthorized' }, 401)
}
```

### Upload with Retry
```typescript
const result = await uploadToHFWithRetry(hfPath, buffer, maxRetries)
// result.deduplicated === true if the file already existed
```

## Docker Image

Based on `node:20-slim` with:
- aria2 for fast downloads
- git + git-lfs for HF uploads
- python3 + pip for huggingface-cli (large-file support)
- ca-certificates for SSL

The build-cache invalidation comment in the Dockerfile should be updated when dependencies change.
src/index.ts CHANGED
@@ -146,6 +146,27 @@ function getMimeType(filename: string): string {
   return mimeTypes[ext || ''] || 'application/octet-stream'
 }
 
+// Helper function to check if file exists in HF dataset
+async function checkFileExists(hfPath: string): Promise<boolean> {
+  try {
+    const { fileExists } = await import('@huggingface/hub')
+    const exists = await fileExists({
+      repo: {
+        type: 'dataset',
+        name: HF_DATASET_REPO
+      },
+      path: hfPath,
+      credentials: {
+        accessToken: HF_TOKEN
+      }
+    })
+    return exists
+  } catch (error: any) {
+    console.log(`[CHECK] Error checking file existence (will attempt upload anyway): ${error.message}`)
+    return false
+  }
+}
+
 // Helper function to retry HF upload with exponential backoff
 async function uploadToHFWithRetry(
   hfPath: string,
@@ -447,11 +468,26 @@ app.post('/upload', async (c) => {
   // Build HF path using only hash (global deduplication)
   const hfPath = calculatedHash
 
-  console.log(`[UPLOAD] Hash: ${calculatedHash}, uploading to HF`)
+  console.log(`[UPLOAD] Hash: ${calculatedHash}`)
+
+  // Check if file already exists in dataset
+  const fileExistsInDataset = await checkFileExists(hfPath)
+  console.log(`[UPLOAD] File exists check: ${fileExistsInDataset ? 'yes (deduplicated)' : 'no (will upload)'}`)
 
-  // Upload to Hugging Face Dataset with retry logic
+  // Upload to Hugging Face Dataset with retry logic (skip if file already exists)
+  let deduplicated = false
   try {
-    const result = await uploadToHFWithRetry(hfPath, buffer)
+    let result: { success: boolean; deduplicated: boolean }
+
+    if (fileExistsInDataset) {
+      console.log(`[UPLOAD] Deduplication detected, skipping upload`)
+      result = { success: true, deduplicated: true }
+      deduplicated = true
+    } else {
+      console.log(`[UPLOAD] Uploading to HF`)
+      result = await uploadToHFWithRetry(hfPath, buffer)
+      deduplicated = result.deduplicated
+    }
 
     // Generate download URL only if proxy is configured
     // Note: Download proxy expects fileId, not hash
@@ -788,25 +824,9 @@ async function processRemoteUploadWithStream(
 
   console.log(`[REMOTE_UPLOAD] Hash: ${calculatedHash}`)
 
-  // Check if file already exists in dataset using @huggingface/hub
-  let fileExistsInDataset = false
-  try {
-    const { fileExists } = await import('@huggingface/hub')
-    fileExistsInDataset = await fileExists({
-      repo: {
-        type: 'dataset',
-        name: HF_DATASET_REPO
-      },
-      path: hfPath,
-      credentials: {
-        accessToken: HF_TOKEN
-      }
-    })
-
-    console.log(`[REMOTE_UPLOAD] File exists check: ${fileExistsInDataset ? 'yes (deduplicated)' : 'no (will upload)'}`)
-  } catch (checkError: any) {
-    console.log(`[REMOTE_UPLOAD] Error checking file existence (will attempt upload anyway): ${checkError.message}`)
-  }
+  // Check if file already exists in dataset
+  const fileExistsInDataset = await checkFileExists(hfPath)
+  console.log(`[REMOTE_UPLOAD] File exists check: ${fileExistsInDataset ? 'yes (deduplicated)' : 'no (will upload)'}`)
 
   // Update progress to uploading (99%)
   sendProgress({