| # Adding Sources - Getting Content Into Your Notebook | |
| Sources are the raw materials of your research. This guide covers how to add different types of content. | |
| --- | |
| ## Quick-Start: Add Your First Source | |
| ### Option 1: Upload a File (PDF, Word, etc.) | |
| ``` | |
| 1. In your notebook, click "Add Source" | |
| 2. Select "Upload File" | |
| 3. Choose a file from your computer | |
| 4. Click "Upload" | |
| 5. Wait 30-60 seconds for processing | |
| 6. Done! Source appears in your notebook | |
| ``` | |
| ### Option 2: Add a Web Link | |
| ``` | |
| 1. Click "Add Source" | |
| 2. Select "Web Link" | |
| 3. Paste URL: https://example.com/article | |
| 4. Click "Add" | |
| 5. Wait for processing (usually faster than files) | |
| 6. Done! | |
| ``` | |
| ### Option 3: Paste Text | |
| ``` | |
| 1. Click "Add Source" | |
| 2. Select "Text" | |
| 3. Paste or type your content | |
| 4. Click "Save" | |
| 5. Done! Immediately available | |
| ``` | |
| --- | |
| ## Supported File Types | |
| ### Documents | |
| - **PDF** (.pdf): Best support, including scanned PDFs with OCR | |
| - **Word** (.docx, .doc): Full support | |
| - **PowerPoint** (.pptx): Slides converted to text | |
| - **Excel** (.xlsx, .xls): Spreadsheet data | |
| - **EPUB** (.epub): eBook files | |
| - **Markdown** (.md, .txt): Plain text formats | |
| - **HTML** (.html, .htm): Web page files | |
| **File size limits:** Up to ~100MB (varies by system) | |
| **Processing time:** 10 seconds - 2 minutes (depending on length and file type) | |
| ### Audio & Video | |
| - **Audio**: MP3, WAV, M4A, OGG, FLAC (~30 seconds - 3 minutes per hour) | |
| - **Video**: MP4, AVI, MOV, MKV, WebM (~3-10 minutes per hour) | |
| - **YouTube**: Direct URL support | |
| - **Podcasts**: RSS feed URL | |
| **Automatic transcription**: Audio/video is transcribed to text automatically. This requires enabling speech-to-text in settings. | |
| ### Web Content | |
| - **Articles**: Blog posts, news articles, Medium | |
| - **YouTube**: Full videos or playlists | |
| - **PDFs online**: Direct PDF links | |
| - **News**: News site articles | |
| **Just paste the URL** in the "Web Link" section. | |
| ### What Doesn't Work | |
| - Paywalled content (WSJ, FT, etc.): Can't be extracted | |
| - Password-protected PDFs: Can't be opened | |
| - Pure image files (.jpg, .png): Not supported (scanned PDFs are the exception, since they go through OCR) | |
| - Very large files (>100MB): Processing times out | |
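The format and size rules above can be checked before uploading. The following is a minimal sketch, not part of the product: `check_upload`, `SUPPORTED_EXTENSIONS`, and `MAX_SIZE_BYTES` are hypothetical names, the extension set is copied from the lists in this section, and the 100MB figure is only the approximate guideline given above.

```python
from pathlib import Path

# Extensions copied from the "Supported File Types" lists in this guide.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".xlsx", ".xls", ".epub",
    ".md", ".txt", ".html", ".htm",
    ".mp3", ".wav", ".m4a", ".ogg", ".flac",
    ".mp4", ".avi", ".mov", ".mkv", ".webm",
}
# Assumption: ~100MB guideline; the real limit varies by system.
MAX_SIZE_BYTES = 100 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file is larger than ~100MB; consider splitting it")
    return problems


print(check_upload("report.PDF", 5_000_000))  # []
print(check_upload("photo.png", 1_000))       # ['unsupported file type: .png']
```

A check like this cannot detect paywalls or password protection; those only show up when the system tries to process the source.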
| --- | |
| ## What Happens When You Add a Source | |
| The system automatically does four things: | |
| ``` | |
| 1. EXTRACT TEXT | |
| File/URL → Readable text | |
| (PDFs get OCR if scanned) | |
| (Videos get transcribed if enabled) | |
| 2. BREAK INTO CHUNKS | |
| Long text → ~500-word pieces | |
| (So search finds specific parts, not whole document) | |
| 3. CREATE EMBEDDINGS | |
| Each chunk → Vector representation | |
| (Enables semantic/concept search) | |
| 4. INDEX & STORE | |
| Everything → Database | |
| (Ready to search and retrieve) | |
| ``` | |
| **Time to use:** After the progress bar completes, the source is ready immediately. Embeddings are created in the background. | |
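To make the chunking step concrete, here is a minimal sketch of splitting text into pieces of roughly 500 words. It is an illustration only: the function name `chunk_words` is hypothetical, and the actual system may split on sentences or tokens, or use overlapping chunks.

```python
def chunk_words(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


document = "word " * 1200  # stand-in for a 1,200-word source
chunks = chunk_words(document)
print(len(chunks))                       # 3
print([len(c.split()) for c in chunks])  # [500, 500, 200]
```

Each chunk would then be embedded and indexed on its own, which is why search can point to a specific part of a long document.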
| --- | |
| ## Step-by-Step for Different Types | |
| ### PDFs | |
| **Best practices:** | |
| ``` | |
| Clean PDFs: | |
| 1. Upload → Done | |
| 2. Processing time: ~30-60 seconds | |
| Scanned/Image PDFs: | |
| 1. Upload same way | |
| 2. System auto-detects and uses OCR | |
| 3. Processing time: ~2-3 minutes (longer because of OCR overhead) | |
| Large PDFs (50+ pages): | |
| 1. Consider splitting into smaller files | |
| 2. Or upload as-is (system handles it) | |
| 3. Processing time scales with size | |
| ``` | |
| **Common issues:** | |
| - "Can't extract text": The PDF is corrupted or has copy protection | |
| - Solution: Try opening it in Adobe Acrobat. If it won't open there either, the PDF is likely corrupted or protected. | |
| ### Web Links / Articles | |
| **Best practices:** | |
| ``` | |
| 1. Copy full URL from browser: https://example.com/article-title | |
| 2. Paste in "Web Link" | |
| 3. Click Add | |
| 4. Wait for extraction | |
| Processing time: Usually 5-15 seconds | |
| ``` | |
| **What works:** | |
| - Standard web articles | |
| - Blog posts | |
| - News articles | |
| - Wikipedia pages | |
| - Medium posts | |
| - Substack articles | |
| **What doesn't work:** | |
| - Twitter threads (unreliable) | |
| - Paywalled articles (can't access) | |
| - JavaScript-heavy sites (content not extracted) | |
| **Pro tip:** If it doesn't work, copy the article text and paste as "Text" instead. | |
| ### Audio Files | |
| **Best practices:** | |
| ``` | |
| 1. Ensure speech-to-text is enabled in Settings | |
| 2. Upload MP3, WAV, or M4A file | |
| 3. System automatically transcribes to text | |
| 4. Processing time: ~1 minute per 5 minutes of audio | |
| Example: | |
| - 1-hour podcast → 12 minutes processing | |
| - 10-minute recording → 2 minutes processing | |
| ``` | |
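The two examples above follow directly from the rule of thumb of about 1 minute of processing per 5 minutes of audio. A small sketch of that arithmetic (the function name is hypothetical, and real times depend on audio quality and system load):

```python
def estimate_transcription_minutes(audio_minutes: float,
                                   audio_minutes_per_processing_minute: float = 5.0) -> float:
    """Estimate processing time using the ~1:5 rule of thumb from this guide."""
    return audio_minutes / audio_minutes_per_processing_minute


print(estimate_transcription_minutes(60))  # 12.0 (1-hour podcast)
print(estimate_transcription_minutes(10))  # 2.0  (10-minute recording)
```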
| **Quality matters:** | |
| - Clear audio: Fast transcription | |
| - Muffled/noisy audio: Slower, less accurate transcription | |
| - Background noise: Try to minimize before uploading | |
| **Tip:** If audio quality is poor, the AI might misinterpret content. You can manually correct the transcription if needed. | |
| ### YouTube Videos | |
| **Best practices:** | |
| ``` | |
| Two ways to add: | |
| Method 1: Direct URL | |
| 1. Copy YouTube URL: https://www.youtube.com/watch?v=... | |
| 2. Paste in "Web Link" | |
| 3. Click Add | |
| 4. System extracts captions (if available) + transcript | |
| Method 2: Playlist | |
| 1. Paste playlist URL | |
| 2. System adds all videos as separate sources | |
| 3. Each video processed separately | |
| 4. Takes longer (multiple videos) | |
| ``` | |
| **What's extracted:** | |
| - Captions/subtitles (if available) | |
| - Transcription (if captions aren't available) | |
| - Basic metadata (title, channel, length) | |
| **Processing:** | |
| - 10-minute video: ~2-3 minutes | |
| - 1-hour video: ~10-15 minutes | |
| ### Text / Paste Content | |
| **Best practices:** | |
| ``` | |
| 1. Select "Text" when adding source | |
| 2. Paste or type content | |
| 3. System processes immediately | |
| 4. No wait time needed | |
| Good for: | |
| - Notes you want to reference | |
| - Quotes from books | |
| - Transcripts you have handy | |
| - Quick research snippets | |
| ``` | |
| --- | |
| ## Managing Your Sources | |
| ### Viewing Source Details | |
| ``` | |
| Click on source → See: | |
| - Original file name/title | |
| - When it was added | |
| - Size and format | |
| - Processing status | |
| - Number of chunks | |
| ``` | |
| ### Organizing with Metadata | |
| You can add to each source: | |
| - **Title**: Better name than original filename | |
| - **Tags**: Category labels ("primary research", "background", "competitor analysis") | |
| - **Description**: A few notes about what it contains | |
| **Why this matters:** | |
| - Makes sources easier to find | |
| - Helps when contextualizing for Chat | |
| - Useful for organizing large notebooks | |
| ### Searching Within Sources | |
| ``` | |
| After sources are added, you can: | |
| Text search: "Find exact phrase" | |
| Vector search: "Find conceptually similar" | |
| Both search across all sources in notebook. | |
| Results show: | |
| - Which source | |
| - Which section | |
| - Relevance score | |
| ``` | |
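The difference between the two search modes can be sketched in a few lines. This is a toy illustration, not the product's implementation: all function names are hypothetical, and `toy_embed` is a bag-of-words stand-in that only rewards shared words, whereas a real embedding model also matches related concepts expressed with different words.

```python
import math
from collections import Counter


def text_search(chunks, phrase):
    """Return indexes of chunks that contain the exact phrase (case-insensitive)."""
    needle = phrase.lower()
    return [i for i, chunk in enumerate(chunks) if needle in chunk.lower()]


def toy_embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(chunks, query, top_k=3):
    """Rank chunks by similarity to the query; returns (index, score), best first."""
    q = toy_embed(query)
    scored = [(i, cosine_similarity(q, toy_embed(c))) for i, c in enumerate(chunks)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


chunks = [
    "solar panels convert sunlight into electricity",
    "wind turbines generate power",
    "bread recipes with sourdough starter",
]
print(text_search(chunks, "Wind turbines"))                       # [1]
print(vector_search(chunks, "how do solar panels work", top_k=1)[0][0])  # 0
```

The score returned by `vector_search` plays the role of the relevance score shown in the results.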
| --- | |
| ## Context Management: How Sources Get Used | |
| You control how AI accesses sources: | |
| ### Three Levels (for Chat) | |
| **Full Content:** | |
| ``` | |
| AI sees: Complete source text | |
| Cost: 100% of tokens | |
| Use when: Analyzing in detail, need precise citations | |
| Example: "Analyze this methodology paper closely" | |
| ``` | |
| **Summary Only:** | |
| ``` | |
| AI sees: AI-generated summary (not full text) | |
| Cost: ~10-20% of tokens | |
| Use when: Background material, reference context | |
| Example: "Use this as context but focus on the main source" | |
| ``` | |
| **Not in Context:** | |
| ``` | |
| AI sees: Nothing (excluded) | |
| Cost: 0 tokens | |
| Use when: Confidential, not relevant, or archived | |
| Example: "Keep this in notebook but don't use in this conversation" | |
| ``` | |
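The cost figures for the three levels can be combined into a rough token estimate for a conversation. A minimal sketch, assuming the summary level costs 20% of a source's tokens (the upper end of the ~10-20% range above); the names `LEVEL_FRACTION` and `estimate_context_tokens` are hypothetical, and the token counts are made-up example values.

```python
# Fraction of a source's tokens sent to the AI at each context level.
# Assumption: "summary" uses the upper end of the ~10-20% range as a conservative guess.
LEVEL_FRACTION = {"full": 1.0, "summary": 0.2, "excluded": 0.0}


def estimate_context_tokens(sources):
    """sources: iterable of (token_count, level) pairs. Returns estimated tokens sent."""
    total = 0
    for token_count, level in sources:
        total += round(token_count * LEVEL_FRACTION[level])
    return total


sources = [(20_000, "full"), (50_000, "summary"), (80_000, "excluded")]
print(estimate_context_tokens(sources))  # 30000
```

In this example, sending all three sources as full content would cost 150,000 tokens, so choosing levels per source reduces the estimate to a fifth of that.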
| ### How to Set Context (in Chat) | |
| ``` | |
| 1. Go to Chat | |
| 2. Click "Select Context Sources" | |
| 3. For each source: | |
| - Toggle ON/OFF (include/exclude) | |
| - Choose level (Full/Summary/Excluded) | |
| 4. Click "Save" | |
| 5. Now chat uses these settings | |
| ``` | |
| --- | |
| ## Common Mistakes | |
| | Mistake | What Happens | How to Fix | | |
| |---------|--------------|-----------| | |
| | Upload 200 sources at once | System gets slow, processing stalls | Add 10-20 at a time, wait for processing | | |
| | Use full content for all sources | Token usage skyrockets, expensive | Use "Summary" or "Excluded" for background material | | |
| | Add huge PDFs without splitting | Processing is slow, search results less precise | Consider splitting large PDFs into chapters | | |
| | Forget source titles | Can't distinguish between similar sources | Rename sources with descriptive titles right after uploading | | |
| | Don't tag sources | Hard to find and organize later | Add tags immediately: "primary", "background", etc. | | |
| | Mix languages in one source | Transcription/embedding quality drops | Keep each language in separate sources | | |
| | Add the same source multiple times | Takes up space, creates confusion | Add once; reuse in multiple chats/notebooks | |
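For the first row of the table, the advice to add 10-20 sources at a time amounts to simple batching. A minimal sketch (the `batches` helper and the file names are hypothetical; the actual upload happens in the UI):

```python
def batches(items, batch_size=10):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


files = [f"doc_{i}.pdf" for i in range(45)]  # made-up file names
for batch in batches(files, batch_size=20):
    # Upload this batch, then wait for processing to finish before the next one.
    print(len(batch))  # prints 20, 20, 5
```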
| --- | |
| ## Processing Status & Troubleshooting | |
| ### What the Status Indicators Mean | |
| ``` | |
| 🟡 Processing | |
| → Source is being extracted and embedded | |
| → Wait 30 seconds - 3 minutes depending on size | |
| → Don't use in Chat yet | |
| 🟢 Ready | |
| → Source is processed and searchable | |
| → Can use immediately in Chat | |
| → Can apply transformations | |
| 🔴 Error | |
| → Something went wrong | |
| → Common reasons: | |
| - Unsupported file format | |
| - File too large or corrupted | |
| - Network timeout | |
| ⚪ Not in Context | |
| → Source added but excluded from Chat | |
| → Still searchable, not sent to AI | |
| ``` | |
| ### Common Errors & Solutions | |
| **"Unsupported file type"** | |
| - You tried to upload a format not in the list (e.g., `.webp` image) | |
| - Solution: Convert to supported format (PDF for documents, MP3 for audio) | |
| **"Processing timeout"** | |
| - Very large file (>100MB) or very long audio | |
| - Solution: Split into smaller pieces or try uploading again | |
| **"Transcription failed"** | |
| - Audio quality too poor or language not detected | |
| - Solution: Re-record with better quality, or paste text transcript manually | |
| **"Web link won't extract"** | |
| - Website blocks automated access or uses JavaScript for content | |
| - Solution: Copy the article text and paste as "Text" instead | |
| --- | |
| ## Tips for Best Results | |
| ### For PDFs | |
| - Clean, digital PDFs work best | |
| - Remove copy protection if present (legally) | |
| - Scanned PDFs work but take longer | |
| ### For Web Articles | |
| - Use full URL including domain | |
| - Avoid cookie/popup-laden sites | |
| - If extraction fails, copy-paste text instead | |
| ### For Audio | |
| - Clear, well-recorded audio transcribes better | |
| - Remove background noise if possible | |
| - YouTube videos usually have good transcriptions built-in | |
| ### For Large Documents | |
| - Consider splitting into smaller sources | |
| - Gives more precise search results | |
| - Processing is faster for smaller pieces | |
| ### For Organization | |
| - Name sources clearly (not "document_2.pdf") | |
| - Add tags immediately after uploading | |
| - Use descriptions for complex documents | |
| --- | |
| ## What Comes After: Using Your Sources | |
| Once you've added sources, you can: | |
| - **Chat**: Ask questions (see [Chat Effectively](chat-effectively.md)) | |
| - **Search**: Find specific content (see [Search Effectively](search.md)) | |
| - **Transformations**: Extract structured insights (see [Working with Notes](working-with-notes.md)) | |
| - **Ask**: Get comprehensive answers (see [Search Effectively](search.md)) | |
| - **Podcasts**: Turn into audio (see [Creating Podcasts](creating-podcasts.md)) | |
| --- | |
| ## Summary Checklist | |
| When adding sources, confirm: | |
| - [ ] File is in a supported format | |
| - [ ] File is under 100MB (or you've split larger files) | |
| - [ ] Web links are full URLs (not shortened) | |
| - [ ] Audio files have clear speech (if transcription-dependent) | |
| - [ ] You've named the source clearly | |
| - [ ] You've added tags for organization | |
| - [ ] You understand context levels (Full/Summary/Excluded) | |
| Done! Sources are now ready for Chat, Search, Transformations, and more. | |