| # Course Project: FileOrganizer | |
| **A CLI tool that uses local LLMs and AI agents to organize files by content, with a special focus on research paper management.** | |
| --- | |
| ## Overview | |
| ``` | |
| ┌─────────────────────────────────────────────────────────────────┐ | |
| │                        FileOrganizer CLI                        │ | |
| ├─────────────────────────────────────────────────────────────────┤ | |
| │  $ fileorg scan ~/Downloads                                     │ | |
| │  $ fileorg organize ~/Papers --strategy=by-topic                │ | |
| │  $ fileorg deduplicate ~/Research --similarity=0.9              │ | |
| └─────────────────────────────────────────────────────────────────┘ | |
| ``` | |
| --- | |
| ## Architecture | |
| ``` | |
| Files ──► Content Analysis ──► AI Classification ──► Organized Structure | |
|                  │                     │ | |
|                  ▼                     ▼ | |
|           PDF Extraction      Docker Model Runner | |
|           Metadata Tools          (Local LLM) | |
|                  │                     │ | |
|                  └────────►MCP◄────────┘ | |
| ``` | |
| ### Data Flow | |
| ``` | |
| ┌──────────────┐     ┌──────────────┐     ┌──────────────┐ | |
| │  Files/PDFs  │────►│   Content    │────►│  MCP Server  │ | |
| │   (Input)    │     │  Extraction  │     │   (Tools)    │ | |
| └──────────────┘     └──────────────┘     └──────┬───────┘ | |
|                                                  │ | |
|                                                  ▼ | |
| ┌──────────────┐     ┌──────────────┐     ┌──────────────┐ | |
| │  Organized   │◄────│  Agent Crew  │◄────│  Local LLM   │ | |
| │  Structure   │     │   (CrewAI)   │     │   (Docker)   │ | |
| └──────────────┘     └──────────────┘     └──────────────┘ | |
| ``` | |
| --- | |
| ## Agent System | |
| | Agent | Role | Tools | Output | | |
| |-------|------|-------|--------| | |
| | **Scanner Agent** | Discovers files, extracts metadata | File I/O, PDF extraction, hash generation | File inventory, metadata catalog | | |
| | **Classifier Agent** | Categorizes files by content and context | LLM analysis, embeddings, similarity | Category assignments, topic tags | | |
| | **Organizer Agent** | Creates folder structure and moves files | File operations, naming strategies | Organized directory tree | | |
| | **Deduplicator Agent** | Finds and handles duplicate files | Hash comparison, content similarity | Duplicate reports, cleanup actions | | |
| ### Agent Workflow | |
| ``` | |
| User Request: "Organize research papers by topic" | |
|                 │ | |
|                 ▼ | |
|      ┌─────────────────────┐ | |
|      │    Scanner Agent    │ | |
|      │ "What files do we   │ | |
|      │  have and what      │ | |
|      │  are they about?"   │ | |
|      └──────────┬──────────┘ | |
|                 │ File Inventory | |
|                 ▼ | |
|      ┌─────────────────────┐ | |
|      │  Classifier Agent   │ | |
|      │ "What topics and    │ | |
|      │  categories emerge  │ | |
|      │  from the content?" │ | |
|      └──────────┬──────────┘ | |
|                 │ Categories | |
|                 ▼ | |
|      ┌─────────────────────┐ | |
|      │   Organizer Agent   │ | |
|      │ "Create folder      │ | |
|      │  structure and      │ | |
|      │  move files"        │ | |
|      └──────────┬──────────┘ | |
|                 │ Organization Plan | |
|                 ▼ | |
|      ┌─────────────────────┐ | |
|      │ Deduplicator Agent  │ | |
|      │ "Find and handle    │ | |
|      │  duplicate files"   │ | |
|      └──────────┬──────────┘ | |
|                 │ | |
|                 ▼ | |
|        Organized Directory | |
| ``` | |
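The hand-off order in the diagram can be sketched without any agent framework. The block below is a plain-Python stand-in, not CrewAI code: the stage functions, the context keys (`inventory`, `categories`, `plan`, `duplicates`), and `run_pipeline` are hypothetical names chosen for illustration, and the "classification" is deliberately trivial.

```python
from typing import Callable, Dict, Iterable, List

# Each stage reads the shared context and returns the keys it contributes.
Stage = Callable[[Dict], Dict]


def scanner(ctx: Dict) -> Dict:
    # Stand-in for the Scanner Agent: build a sorted file inventory.
    return {"inventory": sorted(ctx["files"])}


def classifier(ctx: Dict) -> Dict:
    # Stand-in for the Classifier Agent: here a "category" is just the
    # file extension (assumption; the real agent would use an LLM).
    return {"categories": {f: f.rsplit(".", 1)[-1] for f in ctx["inventory"]}}


def organizer(ctx: Dict) -> Dict:
    # Stand-in for the Organizer Agent: plan destinations, move nothing.
    return {"plan": {f: f"{cat}/{f}" for f, cat in ctx["categories"].items()}}


def deduplicator(ctx: Dict) -> Dict:
    # Stand-in for the Deduplicator Agent: flag names seen more than once.
    seen, dupes = set(), []
    for f in ctx["files"]:
        if f in seen:
            dupes.append(f)
        seen.add(f)
    return {"duplicates": dupes}


def run_pipeline(files: Iterable[str], stages: List[Stage]) -> Dict:
    """Run the stages in order, merging each stage's output into the context."""
    ctx: Dict = {"files": list(files)}
    for stage in stages:
        ctx.update(stage(ctx))
    return ctx


result = run_pipeline(
    ["b.pdf", "a.txt", "b.pdf"],
    [scanner, classifier, organizer, deduplicator],
)
print(result["plan"])
print(result["duplicates"])
```

Passing one shared context from stage to stage mirrors the arrows in the diagram: each agent consumes what the previous one produced.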
| --- | |
| ## CLI Commands | |
| ### `fileorg scan` | |
| Scan a directory and analyze its contents. | |
| ```bash | |
| # Scan a directory | |
| fileorg scan ~/Downloads | |
| # Scan with detailed analysis | |
| fileorg scan ~/Papers --analyze-content | |
| # Scan and export inventory | |
| fileorg scan ~/Research --export inventory.json | |
| # Scan specific file types | |
| fileorg scan ~/Documents --types pdf,docx,txt | |
| ``` | |
| **Options:** | |
| | Flag | Description | Default | | |
| |------|-------------|---------| | |
| | `--analyze-content` | Extract and analyze file contents | `false` | | |
| | `--export` | Export inventory to JSON/CSV | None | | |
| | `--types` | Comma-separated file extensions to scan | All | | |
| | `--recursive` | Scan subdirectories | `true` | | |
| | `--max-depth` | Maximum directory depth | `10` | | |
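One way the `--types`, `--recursive`, and `--max-depth` options could map onto a directory walk is sketched below. The function name `scan` and its signature are assumptions, as is the depth convention (the scanned root has depth 0, so `max_depth=1` includes files in its immediate subdirectories).

```python
import os
from pathlib import Path
from typing import Iterator, Optional, Set


def scan(
    root: Path,
    types: Optional[Set[str]] = None,
    recursive: bool = True,
    max_depth: int = 10,
) -> Iterator[Path]:
    """Yield files under root, honoring the type filter and depth limit."""
    root = Path(root)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = len(Path(dirpath).relative_to(root).parts)
        # Emptying dirnames in place stops os.walk from descending further.
        if not recursive or depth >= max_depth:
            dirnames[:] = []
        for name in sorted(filenames):
            # Compare extensions case-insensitively and without the dot.
            ext = Path(name).suffix.lower().lstrip(".")
            if types is None or ext in types:
                yield Path(dirpath) / name
```

For example, `list(scan(Path.home() / "Documents", types={"pdf", "docx", "txt"}))` would correspond to the `--types pdf,docx,txt` invocation above.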
| ### `fileorg organize` | |
| Organize files using AI-powered strategies. | |
| ```bash | |
| # Organize by topic (AI-powered) | |
| fileorg organize ~/Papers --strategy=by-topic | |
| # Organize by date | |
| fileorg organize ~/Photos --strategy=by-date --format="%Y/%m" | |
| # Organize with custom naming | |
| fileorg organize ~/Papers --rename --pattern="{year}_{author}_{title}" | |
| # Dry run to preview changes | |
| fileorg organize ~/Downloads --dry-run | |
| # Interactive mode | |
| fileorg organize ~/Research --interactive | |
| ``` | |
| **Options:** | |
| | Flag | Description | Default | | |
| |------|-------------|---------| | |
| | `--strategy` | Organization strategy: `by-topic`, `by-date`, `by-type`, `by-author`, `smart` | `smart` | | |
| | `--rename` | Rename files intelligently | `false` | | |
| | `--pattern` | Naming pattern for renamed files | `{original}` | | |
| | `--dry-run` | Preview changes without executing | `false` | | |
| | `--interactive` | Confirm each action | `false` | | |
| | `--output` | Output directory | Same as input | | |
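The `by-date` strategy with `--format="%Y/%m"` and `--dry-run` could be modeled as a two-step "plan, then apply" process. This is a minimal sketch under assumptions: the helper names `plan_by_date` and `apply_plan` are hypothetical, and the file's modification time is used as its date (the source does not say which timestamp the tool uses).

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Iterable, List, Tuple

Plan = List[Tuple[Path, Path]]


def plan_by_date(files: Iterable[Path], output: Path, fmt: str = "%Y/%m") -> Plan:
    """Map each file to output/<strftime(fmt) of its mtime>/<name>."""
    plan = []
    for src in files:
        stamp = datetime.fromtimestamp(src.stat().st_mtime)
        plan.append((src, Path(output) / stamp.strftime(fmt) / src.name))
    return plan


def apply_plan(plan: Plan, dry_run: bool = True) -> None:
    """Print every planned move; touch the disk only when dry_run is False."""
    for src, dst in plan:
        print(f"{src} -> {dst}")
        if not dry_run:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
```

Separating planning from execution keeps the dry run and the real run on the same code path, so the preview cannot drift from what is actually done.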
| ### `fileorg deduplicate` | |
| Find and handle duplicate files. | |
| ```bash | |
| # Find duplicates by hash | |
| fileorg deduplicate ~/Downloads | |
| # Find similar files (content-based) | |
| fileorg deduplicate ~/Papers --similarity=0.9 | |
| # Auto-delete duplicates (keep newest) | |
| fileorg deduplicate ~/Photos --auto-delete --keep=newest | |
| # Move duplicates to folder | |
| fileorg deduplicate ~/Documents --move-to=./duplicates | |
| ``` | |
| **Options:** | |
| | Flag | Description | Default | | |
| |------|-------------|---------| | |
| | `--similarity` | Similarity threshold (0.0-1.0) for content matching | `1.0` (exact) | | |
| | `--method` | Detection method: `hash`, `content`, `metadata` | `hash` | | |
| | `--auto-delete` | Automatically delete duplicates | `false` | | |
| | `--keep` | Which to keep: `newest`, `oldest`, `largest`, `smallest` | `newest` | | |
| | `--move-to` | Move duplicates to directory instead of deleting | None | | |
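The default `hash` method and the `--keep` option can be illustrated with `hashlib` alone. The sketch below assumes SHA-256 as the digest (the stack also lists xxhash, which is not used here), and the names `file_digest`, `find_duplicates`, and `choose_keeper` are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Iterable, List, Tuple


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never fully loaded."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths: Iterable[Path]) -> List[List[Path]]:
    """Group paths by digest and return only groups with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def choose_keeper(group: List[Path], keep: str = "newest") -> Tuple[Path, List[Path]]:
    """Return (file to keep, files to delete or move) for one group."""
    rankers = {
        "newest": lambda p: p.stat().st_mtime,
        "oldest": lambda p: -p.stat().st_mtime,
        "largest": lambda p: p.stat().st_size,
        "smallest": lambda p: -p.stat().st_size,
    }
    keeper = max(group, key=rankers[keep])
    return keeper, [p for p in group if p != keeper]
```

For exact duplicates every file in a group has the same size, so `largest` and `smallest` only make a difference for groups found by content similarity.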
| ### `fileorg research` | |
| Special commands for research paper management. | |
| ```bash | |
| # Extract metadata from PDFs | |
| fileorg research extract ~/Papers | |
| # Generate bibliography | |
| fileorg research bibliography ~/Papers --format=bibtex --output=refs.bib | |
| # Find related papers | |
| fileorg research related "attention mechanisms" --in ~/Papers | |
| # Create reading list | |
| fileorg research reading-list ~/Papers --topic "transformers" --order=citations | |
| ``` | |
| **Options:** | |
| | Flag | Description | Default | | |
| |------|-------------|---------| | |
| | `--format` | Bibliography format: `bibtex`, `apa`, `mla` | `bibtex` | | |
| | `--output` | Output file path | `stdout` | | |
| | `--order` | Sort order: `date`, `citations`, `relevance` | `relevance` | | |
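For `--format=bibtex`, extracted metadata has to be rendered as a BibTeX entry. A minimal sketch follows; the metadata keys (`title`, `authors`, `year`), the `@misc` entry type, and the citation-key scheme (first author's surname plus year) are all assumptions rather than the tool's documented behavior.

```python
from typing import Dict


def to_bibtex(meta: Dict) -> str:
    """Render one metadata record as a BibTeX @misc entry."""
    # Citation key: lowercase surname of the first author + year (assumed scheme).
    surname = meta["authors"][0].split()[-1].lower()
    key = f"{surname}{meta['year']}"
    fields = {
        "title": meta["title"],
        "author": " and ".join(meta["authors"]),  # BibTeX separates authors with "and"
        "year": str(meta["year"]),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


entry = to_bibtex(
    {
        "title": "Attention Is All You Need",
        "authors": ["Ashish Vaswani", "Noam Shazeer"],
        "year": 2017,
    }
)
print(entry)
```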
| ### `fileorg config` | |
| Manage configuration settings. | |
| ```bash | |
| # Show current config | |
| fileorg config show | |
| # Set LLM model | |
| fileorg config set llm.model "llama3.2:3b" | |
| # Set default strategy | |
| fileorg config set organize.default_strategy "by-topic" | |
| # Reset to defaults | |
| fileorg config reset | |
| ``` | |
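Keys such as `llm.model` suggest that `config set` addresses nested TOML tables with a dotted path. One possible way to apply such a key to an in-memory configuration is sketched here; `set_by_path` is a hypothetical helper, not a documented function of the tool.

```python
from typing import Any, Dict


def set_by_path(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config["a"]["b"]["c"] = value for dotted_key "a.b.c"."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        # Create intermediate tables on demand.
        node = node.setdefault(part, {})
    node[leaf] = value


config: Dict[str, Any] = {"llm": {"model": "old-model"}}
set_by_path(config, "llm.model", "llama3.2:3b")
set_by_path(config, "organize.default_strategy", "by-topic")
print(config)
```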
| ### `fileorg stats` | |
| Show statistics about files and organization. | |
| ```bash | |
| # Show directory statistics | |
| fileorg stats ~/Papers | |
| # Show organization suggestions | |
| fileorg stats ~/Downloads --suggest | |
| # Export statistics | |
| fileorg stats ~/Research --export stats.json | |
| ``` | |
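The kind of summary `fileorg stats` might report can be approximated with `pathlib`. This is only a sketch: the function name `directory_stats` and the shape of the returned dictionary are assumptions, since the source does not specify the statistics that are collected.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict


def directory_stats(root: Path) -> Dict:
    """Count files and bytes under root, broken down by extension."""
    by_ext: Counter = Counter()
    total_bytes = 0
    for p in Path(root).rglob("*"):
        if p.is_file():
            by_ext[p.suffix.lower() or "(none)"] += 1
            total_bytes += p.stat().st_size
    return {
        "files": sum(by_ext.values()),
        "bytes": total_bytes,
        "by_extension": dict(by_ext),
    }


def export_stats(stats: Dict, path: Path) -> None:
    """Write the statistics as JSON, as `--export stats.json` might."""
    Path(path).write_text(json.dumps(stats, indent=2, sort_keys=True))
```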
| --- | |
| ## Configuration | |
| Configuration is stored in `~/.config/fileorg/config.toml` or `./fileorg.toml` in the project directory. | |
| ```toml | |
| [fileorg] | |
| version = "1.0.0" | |
| [llm] | |
| provider = "docker" # docker, ollama, openai | |
| model = "llama3.2:3b" | |
| temperature = 0.7 | |
| max_tokens = 4096 | |
| base_url = "http://localhost:11434" | |
| [llm.docker] | |
| runtime = "nvidia" # nvidia, cpu | |
| memory_limit = "8g" | |
| [agents] | |
| verbose = false | |
| max_iterations = 10 | |
| [agents.scanner] | |
| role = "File Scanner" | |
| goal = "Discover and catalog all files with metadata" | |
| [agents.classifier] | |
| role = "Content Classifier" | |
| goal = "Categorize files by content and context" | |
| [agents.organizer] | |
| role = "File Organizer" | |
| goal = "Create optimal folder structure and organize files" | |
| [agents.deduplicator] | |
| role = "Duplicate Detector" | |
| goal = "Find and handle duplicate files efficiently" | |
| [organize] | |
| default_strategy = "smart" | |
| create_backups = true | |
| backup_dir = "./.fileorg_backup" | |
| [organize.naming] | |
| sanitize = true | |
| max_length = 255 | |
| replace_spaces = "_" | |
| [research] | |
| extract_metadata = true | |
| auto_rename = true | |
| naming_pattern = "{year}_{author}_{title}" | |
| generate_bibliography = true | |
| [deduplication] | |
| default_method = "hash" | |
| similarity_threshold = 0.95 | |
| auto_delete = false | |
| keep_strategy = "newest" | |
| [pdf] | |
| extract_text = true | |
| extract_metadata = true | |
| ocr_enabled = false # Enable OCR for scanned PDFs | |
| [observability] | |
| enabled = true | |
| provider = "langfuse" # langfuse, langsmith, console | |
| trace_agents = true | |
| log_tokens = true | |
| ``` | |
| --- | |
| ## Docker Stack | |
| ### docker-compose.yml | |
| ```yaml | |
| version: "3.9" | |
| services: | |
|   # Local LLM (Ollama image, exposed on port 11434) | |
|   llm: | |
|     image: ollama/ollama:latest | |
|     runtime: nvidia | |
|     environment: | |
|       - OLLAMA_HOST=0.0.0.0 | |
|     volumes: | |
|       - ollama_data:/root/.ollama | |
|     ports: | |
|       - "11434:11434" | |
|     deploy: | |
|       resources: | |
|         reservations: | |
|           devices: | |
|             - driver: nvidia | |
|               count: 1 | |
|               capabilities: [gpu] | |
|     healthcheck: | |
|       # The ollama/ollama image may not ship curl; `ollama list` queries the same local API | |
|       test: ["CMD", "ollama", "list"] | |
|       interval: 30s | |
|       timeout: 10s | |
|       retries: 3 | |
|   # MCP Server for file operations and PDF tools | |
|   mcp-server: | |
|     build: | |
|       context: ./src/fileorg/mcp | |
|       dockerfile: Dockerfile | |
|     environment: | |
|       - MCP_PORT=3000 | |
|     volumes: | |
|       - ./workspace:/workspace | |
|     ports: | |
|       - "3000:3000" | |
|     depends_on: | |
|       - llm | |
|   # Main application (for containerized usage) | |
|   fileorg: | |
|     build: | |
|       context: . | |
|       dockerfile: Dockerfile | |
|     environment: | |
|       - LLM_BASE_URL=http://llm:11434 | |
|       - MCP_SERVER_URL=http://mcp-server:3000 | |
|     volumes: | |
|       - ./workspace:/workspace | |
|       - ./config:/config:ro | |
|     depends_on: | |
|       llm: | |
|         condition: service_healthy | |
|       mcp-server: | |
|         condition: service_started | |
|     profiles: | |
|       - cli | |
| volumes: | |
|   ollama_data: | |
| ``` | |
| ### Running the Stack | |
| ```bash | |
| # Start LLM and MCP server | |
| docker compose up -d llm mcp-server | |
| # Pull the model (first time only) | |
| docker compose exec llm ollama pull llama3.2:3b | |
| # Run FileOrganizer commands | |
| docker compose run --rm fileorg scan /workspace/papers | |
| docker compose run --rm fileorg organize /workspace/papers --strategy=by-topic | |
| docker compose run --rm fileorg deduplicate /workspace/downloads | |
| # Or run locally with Docker backend | |
| fileorg scan ~/Papers | |
| fileorg organize ~/Papers --strategy=by-topic | |
| fileorg deduplicate ~/Downloads | |
| ``` | |
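Before running commands locally against the Docker backend, it can help to confirm that the LLM endpoint is reachable. The sketch below polls the same `/api/tags` path used by the Compose healthcheck and reads the `LLM_BASE_URL` variable from the stack definition; the function name `llm_is_ready` and the fallback to `http://localhost:11434` (the `base_url` from the configuration) are assumptions.

```python
import os
import urllib.error
import urllib.request
from typing import Optional


def llm_is_ready(base_url: Optional[str] = None, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/api/tags answers with HTTP 200."""
    base_url = base_url or os.environ.get("LLM_BASE_URL", "http://localhost:11434")
    url = f"{base_url.rstrip('/')}/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("LLM reachable:", llm_is_ready())
```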
| --- | |
| ## Project Structure | |
| ``` | |
| fileorg/ | |
| ├── pyproject.toml                # pixi/uv project config | |
| ├── pixi.lock | |
| ├── docker-compose.yml            # Full stack orchestration | |
| ├── Dockerfile | |
| ├── fileorg.toml                  # Default configuration | |
| ├── README.md | |
| │ | |
| ├── src/ | |
| │   └── fileorg/ | |
| │       ├── __init__.py | |
| │       ├── __main__.py           # Entry point | |
| │       ├── cli.py                # Typer CLI commands | |
| │       ├── config.py             # TOML configuration loader | |
| │       │ | |
| │       ├── scanner/              # File discovery and analysis | |
| │       │   ├── __init__.py | |
| │       │   ├── discovery.py      # File system traversal | |
| │       │   ├── metadata.py       # Metadata extraction | |
| │       │   ├── pdf_reader.py     # PDF text/metadata extraction | |
| │       │   └── hashing.py        # File hashing utilities | |
| │       │ | |
| │       ├── classifier/           # Content classification | |
| │       │   ├── __init__.py | |
| │       │   ├── embeddings.py     # Generate embeddings | |
| │       │   ├── clustering.py     # Topic clustering | |
| │       │   ├── categorizer.py    # AI-powered categorization | |
| │       │   └── similarity.py     # Content similarity | |
| │       │ | |
| │       ├── organizer/            # File organization | |
| │       │   ├── __init__.py | |
| │       │   ├── strategies.py     # Organization strategies | |
| │       │   ├── naming.py         # File naming logic | |
| │       │   ├── structure.py      # Directory structure creation | |
| │       │   └── mover.py          # Safe file operations | |
| │       │ | |
| │       ├── deduplicator/         # Duplicate detection | |
| │       │   ├── __init__.py | |
| │       │   ├── hash_based.py     # Hash-based detection | |
| │       │   ├── content_based.py  # Content similarity detection | |
| │       │   └── handler.py        # Duplicate handling | |
| │       │ | |
| │       ├── research/             # Research paper tools | |
| │       │   ├── __init__.py | |
| │       │   ├── extractor.py      # PDF metadata extraction | |
| │       │   ├── bibliography.py   # Bibliography generation | |
| │       │   ├── citation.py       # Citation parsing | |
| │       │   └── scholar.py        # Academic search integration | |
| │       │ | |
| │       ├── agents/               # CrewAI agents | |
| │       │   ├── __init__.py | |
| │       │   ├── crew.py           # Crew orchestration | |
| │       │   ├── scanner.py        # Scanner agent | |
| │       │   ├── classifier.py     # Classifier agent | |
| │       │   ├── organizer.py      # Organizer agent | |
| │       │   └── deduplicator.py   # Deduplicator agent | |
| │       │ | |
| │       ├── tools/                # Agent tools | |
| │       │   ├── __init__.py | |
| │       │   ├── file_tools.py     # File operation tools | |
| │       │   ├── pdf_tools.py      # PDF processing tools | |
| │       │   ├── search_tools.py   # Search and query tools | |
| │       │   └── analysis.py       # Content analysis tools | |
| │       │ | |
| │       ├── mcp/                  # MCP server | |
| │       │   ├── __init__.py | |
| │       │   ├── server.py         # MCP server implementation | |
| │       │   ├── tools.py          # MCP tool definitions | |
| │       │   └── Dockerfile        # MCP server container | |
| │       │ | |
| │       ├── llm/                  # LLM integration | |
| │       │   ├── __init__.py | |
| │       │   ├── client.py         # LLM client (Docker/Ollama/OpenAI) | |
| │       │   └── prompts.py        # Prompt templates | |
| │       │ | |
| │       └── observability/        # Logging & tracing | |
| │           ├── __init__.py | |
| │           ├── tracing.py        # Distributed tracing | |
| │           └── metrics.py        # Token/cost tracking | |
| │ | |
| ├── tests/ | |
| │   ├── __init__.py | |
| │   ├── conftest.py               # Pytest fixtures | |
| │   ├── test_cli.py | |
| │   ├── test_scanner.py | |
| │   ├── test_classifier.py | |
| │   ├── test_organizer.py | |
| │   ├── test_deduplicator.py | |
| │   ├── test_research.py | |
| │   └── fixtures/ | |
| │       ├── sample_papers/ | |
| │       │   ├── paper1.pdf | |
| │       │   ├── paper2.pdf | |
| │       │   └── paper3.pdf | |
| │       ├── sample_files/ | |
| │       └── expected_outputs/ | |
| │ | |
| ├── workspace/                    # Working directory | |
| │   └── .gitkeep | |
| │ | |
| └── docs/                         # Documentation (Quarto) | |
|     ├── _quarto.yml | |
|     ├── index.qmd | |
|     └── chapters/ | |
| ``` | |
| --- | |
| ## Technology Stack | |
| | Category | Tools | | |
| |----------|-------| | |
| | **Package Management** | pixi, uv | | |
| | **CLI Framework** | Typer, Rich | | |
| | **Local LLM** | Docker Model Runner, Ollama | | |
| | **LLM Framework** | LangChain | | |
| | **Multi-Agent** | CrewAI | | |
| | **MCP** | Docker MCP Toolkit | | |
| | **PDF Processing** | pypdf (successor to PyPDF2), pdfplumber | | |
| | **Embeddings** | sentence-transformers | | |
| | **File Operations** | pathlib, shutil | | |
| | **Hashing** | hashlib, xxhash | | |
| | **Metadata** | exifread, mutagen | | |
| | **Similarity** | scikit-learn, faiss | | |
| | **Observability** | Langfuse, OpenTelemetry | | |
| | **Testing** | pytest, DeepEval | | |
| | **Containerization** | Docker, Docker Compose | | |
| --- | |
| ## Example Usage | |
| ### End-to-End Workflow | |
| ```bash | |
| # 1. Start the Docker stack | |
| docker compose up -d | |
| # 2. Scan your messy Downloads folder | |
| fileorg scan ~/Downloads --analyze-content --export downloads_inventory.json | |
| # 3. Preview a smart-strategy organization of the folder | |
| fileorg organize ~/Downloads --strategy=smart --dry-run | |
| # Review the plan, then execute | |
| fileorg organize ~/Downloads --strategy=smart | |
| # 4. Organize research papers by topic | |
| fileorg scan ~/Papers --types=pdf --analyze-content | |
| fileorg organize ~/Papers --strategy=by-topic --rename --pattern="{year}_{author}_{title}" | |
| # 5. Find and handle duplicates | |
| fileorg deduplicate ~/Papers --similarity=0.95 --move-to=./duplicates | |
| # 6. Extract metadata and generate bibliography | |
| fileorg research extract ~/Papers | |
| fileorg research bibliography ~/Papers --format=bibtex --output=references.bib | |
| # 7. Create a reading list on a specific topic | |
| fileorg research reading-list ~/Papers --topic "transformers" --order=citations | |
| # 8. View statistics | |
| fileorg stats ~/Papers | |
| ``` | |
| ### Research Paper Organization Example | |
| ```bash | |
| # Before: | |
| ~/Papers/ | |
| ├── paper_final.pdf | |
| ├── attention_is_all_you_need.pdf | |
| ├── bert_paper.pdf | |
| ├── gpt3.pdf | |
| ├── vision_transformer.pdf | |
| ├── download (1).pdf | |
| ├── download (2).pdf | |
| └── thesis_draft_v5.pdf | |
| # Run organization | |
| fileorg organize ~/Papers --strategy=by-topic --rename | |
| # After: | |
| ~/Papers/ | |
| ├── Natural_Language_Processing/ | |
| │   ├── Transformers/ | |
| │   │   ├── 2017_Vaswani_Attention_Is_All_You_Need.pdf | |
| │   │   ├── 2018_Devlin_BERT_Pretraining.pdf | |
| │   │   └── 2020_Brown_GPT3_Language_Models.pdf | |
| │   └── Other/ | |
| │       └── 2023_Smith_Thesis_Draft.pdf | |
| ├── Computer_Vision/ | |
| │   └── Transformers/ | |
| │       └── 2020_Dosovitskiy_Vision_Transformer.pdf | |
| └── Uncategorized/ | |
|     └── 2024_Unknown_Document.pdf | |
| ``` | |
| ### Duplicate Detection Example | |
| ```bash | |
| # Find exact duplicates | |
| fileorg deduplicate ~/Downloads | |
| # Found 15 duplicate files (45 MB) | |
| # • download.pdf (3 copies) | |
| # • image.jpg (2 copies) | |
| # • report.docx (2 copies) | |
| # Find similar papers (different versions) | |
| fileorg deduplicate ~/Papers --similarity=0.9 --method=content | |
| # Found 2 similar file groups: | |
| # • attention_paper.pdf, attention_is_all_you_need.pdf (95% similar) | |
| # • bert_preprint.pdf, bert_final.pdf (98% similar) | |
| # Auto-cleanup (keep newest) | |
| fileorg deduplicate ~/Downloads --auto-delete --keep=newest | |
| # ✓ Deleted 15 duplicate files, freed 45 MB | |
| ``` | |
| --- | |
| ## Learning Outcomes | |
| By building FileOrganizer, learners will be able to: | |
| 1. ✅ Set up modern Python projects with pixi and reproducible environments | |
| 2. ✅ Build professional CLI tools with Typer and Rich | |
| 3. ✅ Run local LLMs using Docker Model Runner | |
| 4. ✅ Process and extract content from PDF files | |
| 5. ✅ Build MCP servers to connect AI agents to file systems | |
| 6. ✅ Design multi-agent systems with CrewAI | |
| 7. ✅ Implement content-based similarity and clustering | |
| 8. ✅ Generate embeddings for semantic search | |
| 9. ✅ Handle file operations safely with backups and dry-run modes | |
| 10. ✅ Implement observability for AI applications | |
| 11. ✅ Test non-deterministic systems effectively | |
| 12. ✅ Deploy self-hosted AI applications with Docker Compose | |
| --- | |
| ## Advanced Features | |
| ### Smart Organization Strategy | |
| The `smart` strategy uses AI to analyze file content and context to determine the best organization approach: | |
| ```python | |
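| # Sketch of the smart strategy. The agent objects and the keys of | |
| # `analysis` are illustrative assumptions, not a fixed API. | |
| def smart_organize(files, scanner_agent, classifier_agent, organizer_agent): | |
|     # 1. Analyze file types and content | |
|     analysis = scanner_agent.analyze(files) | |
|     # 2. Determine the strategy from the dominant kind of content | |
|     if analysis.get("mostly_academic_pdfs"): | |
|         strategy = "by-topic-hierarchical" | |
|     elif analysis.get("mostly_media_files"): | |
|         strategy = "by-date-and-type" | |
|     elif analysis.get("mixed_work_documents"): | |
|         strategy = "by-project-and-date" | |
|     else: | |
|         strategy = "by-type"  # assumed fallback so `strategy` is always set | |
|     # 3. Execute with AI-powered categorization | |
|     classifier_agent.categorize(files, strategy) | |
|     organizer_agent.execute(strategy) | |
|     return strategy | |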
| ``` | |
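The decision step of the smart strategy needs some signal for what a directory "mostly" contains. A cheap first approximation, before any LLM is involved, is the share of file extensions. The sketch below is an assumption-heavy illustration: the `MEDIA_EXTENSIONS` set, the 60% threshold, and the mapping onto the CLI's `by-topic`, `by-date`, and `by-type` strategies are not taken from the project, and extensions alone cannot tell whether a PDF is academic.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable

# Assumed list of media extensions; extend as needed.
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4", ".mov"}


def pick_strategy(paths: Iterable[str], threshold: float = 0.6) -> str:
    """Pick a strategy name from the dominant file extension share."""
    counts = Counter(Path(p).suffix.lower() for p in paths)
    total = sum(counts.values())
    if total == 0:
        return "by-type"
    if counts[".pdf"] / total >= threshold:
        return "by-topic"  # mostly documents: group by subject
    media = sum(counts[ext] for ext in MEDIA_EXTENSIONS)
    if media / total >= threshold:
        return "by-date"  # mostly media: group chronologically
    return "by-type"  # mixed content: fall back to file type


print(pick_strategy(["a.pdf", "b.pdf", "c.txt"]))
```

A heuristic like this could serve as a fast pre-filter, with the classifier agent consulted only when the extension mix is ambiguous.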
| ### Research Paper Features | |
| Special handling for academic PDFs: | |
| - **Metadata Extraction**: Title, authors, year, abstract, keywords | |
| - **Citation Parsing**: Extract and parse references | |
| - **Smart Naming**: `{year}_{first_author}_{short_title}.pdf` | |
| - **Topic Clustering**: Group papers by research area | |
| - **Citation Network**: Identify related papers | |
| - **Bibliography Generation**: BibTeX, APA, MLA formats | |
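The smart-naming pattern can be filled in with `str.format` and then cleaned up according to the `[organize.naming]` settings (`sanitize`, `max_length`, `replace_spaces`). This is a minimal sketch; `render_name` and the exact set of characters it keeps are assumptions.

```python
import re
from typing import Dict


def render_name(
    pattern: str,
    meta: Dict,
    ext: str = ".pdf",
    replace_spaces: str = "_",
    max_length: int = 255,
) -> str:
    """Fill the pattern from metadata and return a filesystem-safe name."""
    name = pattern.format(**meta)
    # Collapse each run of whitespace into the replacement string.
    name = re.sub(r"\s+", replace_spaces, name.strip())
    # Keep only word characters, dots, and hyphens (assumed safe set).
    name = re.sub(r"[^\w.-]", "", name)
    # Truncate the stem so that stem + extension fits within max_length.
    return name[: max_length - len(ext)] + ext


print(
    render_name(
        "{year}_{author}_{title}",
        {"year": 2017, "author": "Vaswani", "title": "Attention Is All You Need"},
    )
)
```

Truncating the stem rather than the full name preserves the extension, which matters when titles are long.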
| ### Deduplication Strategies | |
| Multiple methods for finding duplicates: | |
| 1. **Hash-based**: Exact file matches (fastest) | |
| 2. **Content-based**: Similar content using embeddings | |
| 3. **Metadata-based**: Same title/author but different files | |
| 4. **Fuzzy matching**: Handle renamed or modified files | |
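Content-based detection compares vectors rather than bytes. To keep the example dependency-free, the sketch below substitutes word-count vectors for the sentence-transformers embeddings named in the technology stack; the cosine-similarity threshold logic is the same either way. The function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def vectorize(text: str) -> Counter:
    """Bag-of-words vector: a stand-in for a real text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similar_pairs(docs: Dict[str, str], threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Return name pairs whose similarity meets the threshold."""
    names = sorted(docs)
    vectors = {name: vectorize(docs[name]) for name in names}
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if cosine(vectors[first], vectors[second]) >= threshold:
                pairs.append((first, second))
    return pairs


docs = {
    "attention_paper.pdf": "attention is all you need",
    "attention_draft.pdf": "attention is all you need draft",
    "vision.pdf": "vision transformer",
}
print(similar_pairs(docs, threshold=0.9))
```

Comparing every pair is quadratic in the number of files; for large collections an approximate nearest-neighbor index such as faiss, which the stack also lists, would be the usual replacement for the double loop.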
| --- | |
| *This project serves as the main example in the [Learning Path](../learning-path.md) for building AI-powered CLI tools.* | |