CV-Extractor

Sleeping

App Files Files Community

CV-Extractor / README.md

Sher1988

update sdk_version: 1.37.1

e811837 12 days ago

preview code

raw

history blame contribute delete

2.19 kB

	---
	title: CV-Extractor
	emoji: 📸
	sdk: streamlit
	sdk_version: 1.37.1
	app_file: app.py
	---


	# CV Analyzer (AI-Powered Resume Parser)

	A Streamlit-based app that extracts structured data from CVs (PDF) using Docling + Agentic AI + Pydantic schema, and converts it into a clean, downloadable CSV.

	---

	## Features

	- Upload CV (PDF)
	- Parse document using Docling
	- Extract structured data using LLM agent
	- Validate with Pydantic schema
	- Convert to Pandas DataFrame
	- View extracted data in UI
	- Download as CSV

	---

	## Tech Stack

	- Streamlit – UI
	- Docling – PDF parsing
	- Pydantic / pydantic-ai – structured extraction
	- Hugging Face / LLM – inference
	- Pandas – data processing

	---

	## Setup

	### 1. Clone repo
	```bash
	git clone https://github.com/your-username/cv-analyzer.git
	cd cv-analyzer
	````

	### 2. Create virtual environment

	```bash
	python -m venv .venv
	source .venv/bin/activate # Linux/macOS
	.venv\Scripts\activate # Windows
	```

	### 3. Install dependencies

	```bash
	pip install -r requirements.txt
	```

	### 4. Environment variables

	Create a `.env` file:

	```
	HF_TOKEN=your_huggingface_token
	```

	> `.env` is ignored via `.gitignore`

	---

	## Run App

	```bash
	streamlit run app.py
	```

	---

	## How it works

	1. User uploads CV (PDF)
	2. Docling converts PDF → structured text/markdown
	3. LLM agent extracts data using predefined schema
	4. Output is validated via Pydantic
	5. Data is converted into a DataFrame
	6. User can view and download CSV

	---

	## Notes

	* Schema is designed for AI/ML-focused resumes
	* Missing fields are returned as `null` (no hallucination policy)
	* Dates are stored as strings to avoid parsing errors
	* Validation is relaxed to improve LLM compatibility

	---

	## Limitations

	* LLM may still produce inconsistent outputs for poorly formatted CVs
	* Complex layouts (tables, multi-column PDFs) may affect parsing quality
	* Requires internet access for model inference

	---

	## Future Improvements

	* Multi-CV batch processing
	* Candidate scoring & ranking
	* Semantic search over resumes (FAISS)
	* UI improvements (filters, charts)
	* Export to JSON / Excel

	---

	## License

	MIT License

	```
	```