---
title: Medical Document Validator
emoji: 🏥
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
---

# Medical Document Validator

A backend service that validates medical documents (PDF, DOCX, PPTX) against predefined templates using a large language model (LLM).

## Features

- **Multi-format Support**: Validates PDF, DOCX, and PPTX documents
- **Template-based Validation**: Uses structured JSON templates to define required elements
- **LLM-powered**: Uses Anthropic's Claude API for context-aware document validation
- **RESTful API**: FastAPI-based endpoints for easy integration

## Project Structure

```
medical-validator/
├── app/
│   ├── __init__.py
│   ├── main.py          # FastAPI app with /templates and /validate endpoints
│   ├── validator.py     # Core validation logic with document extraction and LLM interaction
│   └── templates.json   # Template configuration (18 templates)
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables (create from .env.example)
└── README.md            # This file
```

## Setup Instructions

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Configure API Key

1. Copy the example environment file:
   ```bash
   cp .env.example .env
   ```

2. Get your Anthropic API key from [https://console.anthropic.com/](https://console.anthropic.com/)

3. Edit `.env` and replace `your_anthropic_api_key_here` with your actual API key:
   ```
   LLM_API_KEY=sk-ant-api03-...
   ```
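
As a rough illustration of how the key might be read at startup, the sketch below checks `LLM_API_KEY` and rejects the placeholder value. It assumes the variables from `.env` are already in the process environment (how the app loads them is not shown here), and `load_api_key` is a hypothetical helper, not a function from this project.

```python
import os


def load_api_key(env=None):
    """Return the LLM API key, or raise if it is missing or still the placeholder."""
    # Hypothetical helper: accepts a mapping so it can be exercised without touching os.environ.
    env = os.environ if env is None else env
    key = env.get("LLM_API_KEY", "").strip()
    if not key or key == "your_anthropic_api_key_here":
        raise RuntimeError("LLM_API_KEY is not set; edit .env with a real key")
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error on the first request.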

### 3. Run the Server

```bash
uvicorn app.main:app --reload
```

The API is then available at `http://localhost:8000` (uvicorn's default port).

## API Endpoints

### GET `/templates`

Returns a list of all available templates with their keys and friendly names.

**Response:**
```json
[
  {
    "template_key": "certificate_appreciation_speaker",
    "friendly_name": "Certificate of Appreciation (Speaker/Chairperson)"
  },
  ...
]
```

### POST `/validate`

Validates a document against a specified template.

**Parameters:**
- `file` (form-data): The document file to validate (PDF, DOCX, or PPTX)
- `template_key` (query): The template key to validate against

**Example using curl:**
```bash
curl -X POST "http://localhost:8000/validate?template_key=certificate_appreciation_speaker" \
  -F "file=@document.pdf"
```
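
For callers that prefer Python over curl, the same request can be sketched with the standard library only. `build_validate_request` is a hypothetical helper, and the `application/octet-stream` part type is an assumption; the server may or may not inspect it.

```python
import urllib.parse
import urllib.request
import uuid


def build_validate_request(base_url, template_key, filename, data):
    """Build a multipart POST to /validate with the file in the `file` form field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    # template_key travels in the query string, as in the curl example.
    query = urllib.parse.urlencode({"template_key": template_key})
    return urllib.request.Request(
        f"{base_url}/validate?{query}",
        data=head + data + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


req = build_validate_request(
    "http://localhost:8000", "certificate_appreciation_speaker", "document.pdf", b"%PDF-"
)
# To send it against a running server: urllib.request.urlopen(req)
```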

**Response:**
```json
{
  "template_key": "certificate_appreciation_speaker",
  "status": "PASS",
  "summary": "All required elements found",
  "elements_report": [
    {
      "id": "certificate_title",
      "label": "Certificate Title",
      "required": true,
      "is_present": true,
      "reason": "Found phrase 'Certificate of Appreciation' in document"
    },
    ...
  ]
}
```
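
A client will often only care about what is missing. The sketch below pulls the labels of required elements that were not found out of a response shaped like the one above. The second element in the sample (`signature`) is made up for illustration.

```python
def missing_required(report):
    """Return the labels of required elements the validator did not find."""
    return [
        el["label"]
        for el in report.get("elements_report", [])
        if el.get("required") and not el.get("is_present")
    ]


# Sample data: the first element mirrors the response above, the second is hypothetical.
sample_report = {
    "template_key": "certificate_appreciation_speaker",
    "elements_report": [
        {"id": "certificate_title", "label": "Certificate Title",
         "required": True, "is_present": True, "reason": "Found title phrase"},
        {"id": "signature", "label": "Signature",
         "required": True, "is_present": False, "reason": "No signature text found"},
    ],
}
missing = missing_required(sample_report)
```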

### GET `/health`

Health check endpoint to verify API key configuration.

## Available Templates

The system includes 18 predefined templates, including:

1. Certificate of Appreciation (Speaker/Chairperson)
2. Certificate of Attendance
3. CPD Certificate of Accreditation (Generic)
4. HTML Email Reminder
5. HTML Invitation
6. PDF Invitation
7. PDF Save the Date
8. Printed Invitation
9. RCP Certificate of Attendance
10. Agenda Page
11. DHA Certificate of Accreditation (President + Chairs)
12. Certificate of Appreciation (Sponsor)
13. Evaluation Form (Post-Event)
14. Event Booklet
15. Landing Page & Registration
16. Slides Permission Form

## Validation Logic

The validator:

1. **Extracts text** from the uploaded document based on file type
2. **Loads the template** configuration for the specified template key
3. **Generates a detailed prompt** for the LLM with all template requirements
4. **Calls Claude API** to analyze the document against the template
5. **Returns a structured report** with element-by-element validation results
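
Steps 3 to 5 can be sketched roughly as follows, with text extraction assumed to have happened already. This is not the code in `validator.py`: the template shape (`friendly_name`, `elements`), the prompt wording, the `"FAIL"` status value, and the injected `call_llm` callable are all assumptions made so the example runs without an API key.

```python
import json


def build_prompt(template, document_text):
    """Compose a prompt listing each template element, followed by the document text."""
    lines = [
        f"Validate the document against the template '{template['friendly_name']}'.",
        "Reply with a JSON list; each item has id, label, required, is_present, reason.",
        "Elements to check:",
    ]
    for el in template["elements"]:
        kind = "required" if el["required"] else "optional"
        lines.append(f"- {el['id']} ({kind}): {el['label']}")
    lines += ["", "Document text:", document_text]
    return "\n".join(lines)


def validate(template_key, template, document_text, call_llm):
    """Run the prompt through call_llm and turn its JSON reply into a report."""
    reply = call_llm(build_prompt(template, document_text))
    elements = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    passed = all(el["is_present"] for el in elements if el["required"])
    return {
        "template_key": template_key,
        "status": "PASS" if passed else "FAIL",  # "FAIL" is an assumed value
        "elements_report": elements,
    }


# Hypothetical template and a stand-in for the real Claude call.
template = {
    "friendly_name": "Certificate of Appreciation (Speaker/Chairperson)",
    "elements": [{"id": "certificate_title", "label": "Certificate Title", "required": True}],
}


def fake_llm(prompt):
    return json.dumps([{
        "id": "certificate_title", "label": "Certificate Title", "required": True,
        "is_present": True, "reason": "Found phrase 'Certificate of Appreciation'",
    }])


report = validate("certificate_appreciation_speaker", template,
                  "Certificate of Appreciation", fake_llm)
```

Passing the model call in as a function keeps the prompt-building and report logic testable without network access.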

### Limitations

- **Visual Elements**: Logos, signatures, and QR codes require image/OCR processing beyond basic text extraction. The validator will note these limitations in the report.
- **Table Structure**: Complex table structures with specific column validation may need advanced parsing. Basic text extraction may not preserve table structure perfectly.
- **Image-based PDFs**: PDFs that are image scans (not text-based) will require OCR preprocessing.
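
One cheap way to flag a likely scan before calling the LLM is to check how much text extraction actually produced. The heuristic below is an assumption, not part of the validator; `looks_image_based` and its 20-character threshold are illustrative and would need tuning on real documents.

```python
def looks_image_based(page_texts, min_chars_per_page=20):
    """Guess that a PDF is a scan when its pages yield almost no extractable text."""
    if not page_texts:
        return True  # nothing was extracted at all
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

A document flagged this way could be rejected with a clear message or routed to an OCR step first.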

## Development

### Running in Development Mode

```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

### API Documentation

Once the server is running, visit:
- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`

## Error Handling

The API returns appropriate HTTP status codes:

- `200`: Success
- `400`: Bad request (unsupported file format, empty file)
- `404`: Template not found
- `422`: Validation error (extraction failure, LLM parsing error)
- `500`: Internal server error
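
The mapping above can be pictured as a small lookup from error type to status code. The exception classes and helpers below are hypothetical and use plain Python rather than FastAPI's own exception handling, so the sketch stays self-contained.

```python
import os


class UnsupportedUpload(Exception):
    """Unsupported file format or empty file."""


class TemplateNotFound(Exception):
    """Unknown template_key."""


class ProcessingError(Exception):
    """Text extraction failed or the LLM reply could not be parsed."""


STATUS_FOR = {UnsupportedUpload: 400, TemplateNotFound: 404, ProcessingError: 422}
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx"}


def check_upload(filename, data):
    """Reject uploads that should produce a 400 before any processing starts."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise UnsupportedUpload(f"unsupported file format: {ext or '(none)'}")
    if not data:
        raise UnsupportedUpload("empty file")


def status_code_for(exc):
    """Map an exception to an HTTP status code; anything unexpected becomes 500."""
    for exc_type, code in STATUS_FOR.items():
        if isinstance(exc, exc_type):
            return code
    return 500
```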

## License

This project is for internal use.