4  Practical Application: Advisor Index Card Extraction

4.1 Introduction

This chapter demonstrates a practical application of VLM-based structured extraction on a real-world GLAM digitization project: extracting structured metadata from historical index cards from the National Library of Scotland’s Advocate’s Library collection.

Unlike the previous chapter, which focused on explaining VLM concepts and setup, this chapter assumes you’re familiar with the basics and focuses on:

  1. Designing schemas for real catalog requirements
  2. Running extractions at scale
  3. Evaluating extraction quality using different strategies for assessing accuracy
  4. Handling edge cases and failures

4.2 The Task: Advisor Index Cards

The National Library of Scotland has a collection of historical index cards documenting manuscripts and correspondence. Each card follows a fairly consistent format:

  • Surname: Family name
  • Forenames: Given names
  • Epithet: Role, title, or occupation
  • MS no: Manuscript reference number
  • Description: Document type and date
  • Folios: Page references

The goal is to extract this structured information to enable:

  • A searchable digital catalog
  • Integration with library management systems
  • Research access to historical collections

4.2.1 Example Cards

Let’s look at a few sample cards from the collection:

from pathlib import Path
import matplotlib.pyplot as plt

images = list(Path("../../assets/vllm-structured-generation/indexes/").rglob("*.JPG"))
images
[PosixPath('../../assets/vllm-structured-generation/indexes/DSC00172.JPG'),
 PosixPath('../../assets/vllm-structured-generation/indexes/DSC00173.JPG'),
 PosixPath('../../assets/vllm-structured-generation/indexes/DSC00171.JPG'),
 PosixPath('../../assets/vllm-structured-generation/indexes/DSC00170.JPG'),
 PosixPath('../../assets/vllm-structured-generation/indexes/DSC00169.JPG'),
 PosixPath('../../assets/vllm-structured-generation/indexes/DSC00168.JPG')]
# Display the cards in a grid using matplotlib
number_of_images = len(images)
cols = 3
rows = (number_of_images + cols - 1) // cols
fig, axs = plt.subplots(rows, cols, figsize=(15, 5 * rows))
axs = axs.flatten()
for i, img_path in enumerate(images):
    img = plt.imread(img_path)
    axs[i].imshow(img)
    axs[i].axis('off')
    axs[i].set_title(img_path.stem)
# Hide any unused axes in the grid
for ax in axs[number_of_images:]:
    ax.axis('off')
plt.tight_layout()
plt.show()

4.3 Schema Design

Working with the library curators, we designed a schema that matches their cataloging requirements. The schema is intentionally simple: complex schemas are harder for VLMs to extract reliably.

This schema is something we can iterate on later based on extraction quality but gives us a solid starting point.

from pydantic import BaseModel, Field
from typing import Optional

class IndexCardEntry(BaseModel):
    """Schema for index card extraction matching curator specification"""
    
    surname: str = Field(..., description="Family name as written on card")
    forenames: Optional[str] = Field(None, description="Given names")
    epithet: Optional[str] = Field(None, description="Title, occupation, or role")
    ms_no: str = Field(..., description="Manuscript number")
    description: str = Field(..., description="Document description with date")
    folios: str = Field(..., description="Folio reference")
    
    failed_to_parse: bool = Field(
        False,
        description="Set to True if the card cannot be reliably extracted (illegible, damaged, etc.)"
    )
    notes: Optional[str] = Field(
        None, 
        description="Optional notes about the card: handwritten annotations, ambiguities, "
                    "corrections, or reasons for failed parsing."
    )

Let’s take a look at the schema definition we’ll use for extraction:

# Display the schema
from rich import print
print(IndexCardEntry.model_json_schema())
{
    'description': 'Schema for index card extraction matching curator specification',
    'properties': {
        'surname': {'description': 'Family name as written on card', 'title': 'Surname', 'type': 'string'},
        'forenames': {
            'anyOf': [{'type': 'string'}, {'type': 'null'}],
            'default': None,
            'description': 'Given names',
            'title': 'Forenames'
        },
        'epithet': {
            'anyOf': [{'type': 'string'}, {'type': 'null'}],
            'default': None,
            'description': 'Title, occupation, or role',
            'title': 'Epithet'
        },
        'ms_no': {'description': 'Manuscript number', 'title': 'Ms No', 'type': 'string'},
        'description': {
            'description': 'Document description with date',
            'title': 'Description',
            'type': 'string'
        },
        'folios': {'description': 'Folio reference', 'title': 'Folios', 'type': 'string'},
        'failed_to_parse': {
            'default': False,
            'description': 'Set to True if the card cannot be reliably extracted (illegible, damaged, etc.)',
            'title': 'Failed To Parse',
            'type': 'boolean'
        },
        'notes': {
            'anyOf': [{'type': 'string'}, {'type': 'null'}],
            'default': None,
        'description': 'Optional notes about the card: handwritten annotations, ambiguities, corrections, or reasons for failed parsing.',
            'title': 'Notes'
        }
    },
    'required': ['surname', 'ms_no', 'description', 'folios'],
    'title': 'IndexCardEntry',
    'type': 'object'
}

4.4 Setup

We’ll reuse the VLM setup from the previous chapter. If you haven’t already, make sure LM Studio is running with a VLM loaded.

from openai import OpenAI
import base64
from io import BytesIO
from PIL import Image as PILImage


client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"
)
client.models.list()  
SyncPage[Model](data=[Model(id='qwen3-vl-2b-instruct-mlx', created=None, object='model', owned_by='organization_owner'), Model(id='qwen/qwen3-vl-8b', created=None, object='model', owned_by='organization_owner'), Model(id='qwen/qwen3-vl-4b', created=None, object='model', owned_by='organization_owner'), Model(id='text-embedding-nomic-embed-text-v1.5', created=None, object='model', owned_by='organization_owner'), Model(id='qwen3-vl-30b-a3b-instruct', created=None, object='model', owned_by='organization_owner'), Model(id='qwen3-vl-30b-a3b-thinking@4bit', created=None, object='model', owned_by='organization_owner'), Model(id='qwen3-vl-30b-a3b-thinking@3bit', created=None, object='model', owned_by='organization_owner'), Model(id='qwen/qwen3-4b-thinking-2507', created=None, object='model', owned_by='organization_owner'), Model(id='google/gemma-3-12b', created=None, object='model', owned_by='organization_owner'), Model(id='google/gemma-3-4b', created=None, object='model', owned_by='organization_owner'), Model(id='qwen2-0.5b-instruct-fingreylit', created=None, object='model', owned_by='organization_owner'), Model(id='google/gemma-3n-e4b', created=None, object='model', owned_by='organization_owner'), Model(id='granite-vision-3.3-2b', created=None, object='model', owned_by='organization_owner'), Model(id='ibm/granite-4-h-tiny', created=None, object='model', owned_by='organization_owner'), Model(id='iconclass-vlm', created=None, object='model', owned_by='organization_owner'), Model(id='mlx-community/qwen2.5-vl-3b-instruct', created=None, object='model', owned_by='organization_owner'), Model(id='lmstudio-community/qwen2.5-vl-3b-instruct', created=None, object='model', owned_by='organization_owner'), Model(id='lfm2-vl-1.6b', created=None, object='model', owned_by='organization_owner'), Model(id='mimo-vl-7b-rl-2508@q4_k_s', created=None, object='model', owned_by='organization_owner'), Model(id='mimo-vl-7b-rl-2508@q8_0', created=None, object='model', 
owned_by='organization_owner'), Model(id='qwen3-30b-a3b-instruct-2507', created=None, object='model', owned_by='organization_owner'), Model(id='qwen3-4b-instruct-2507-mlx', created=None, object='model', owned_by='organization_owner'), Model(id='openai/gpt-oss-20b', created=None, object='model', owned_by='organization_owner'), Model(id='qwen/qwen2.5-vl-7b', created=None, object='model', owned_by='organization_owner'), Model(id='mistralai/mistral-small-3.2', created=None, object='model', owned_by='organization_owner'), Model(id='qwen3-30b-a3b-instruct-2507-mlx', created=None, object='model', owned_by='organization_owner'), Model(id='liquid/lfm2-1.2b', created=None, object='model', owned_by='organization_owner'), Model(id='smollm3-3b-mlx', created=None, object='model', owned_by='organization_owner'), Model(id='unsloth/smollm3-3b', created=None, object='model', owned_by='organization_owner'), Model(id='ggml-org/smollm3-3b', created=None, object='model', owned_by='organization_owner'), Model(id='mlx-community/smollm3-3b', created=None, object='model', owned_by='organization_owner')], object='list')
from typing import Union
def query_image_structured(image: Union[PILImage.Image, str], prompt: str, schema: type[BaseModel], model='qwen/qwen3-vl-4b'):
    """
    Query a VLM with an image and get structured output based on a Pydantic schema.
    
    Args:
        image: PIL Image or file path to the image
        prompt: Text prompt describing what to extract
        schema: Pydantic model class defining the expected output structure
        model: Model ID to use for the query
    
    Returns:
        Parsed Pydantic model instance with the extracted data
    """
    # Convert image to base64
    if isinstance(image, PILImage.Image):
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        image_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
    else:
        with open(image, "rb") as f:
            image_base64 = base64.b64encode(f.read()).decode('utf-8')
    
    # Query with structured output
    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
            ]
        }],
        response_format=schema,
        temperature=0.3  # Lower temperature for more consistent extraction
    )
    
    # Return the parsed structured data
    return completion.choices[0].message.parsed

4.5 Extraction Examples

Let’s run extraction on several sample cards to see how the model performs.

prompt = """Extract structured information from this historical library index card and return it as JSON.

  This is an index card from the National Library of Scotland's Advocate's Library collection. Each card documents a person and associated manuscript references.

  Return a JSON object with these exact fields:

  {
    "surname": "Family name exactly as typed (e.g., 'ABAD', 'ABARACA Y BOLEA')",
    "forenames": "Given names (e.g., 'Joseph', 'Thomas') or null if not present",
    "epithet": "Title, occupation, or role (e.g., 'Captain, Spanish Army') or null if not present",
    "ms_no": "Manuscript number exactly as written (e.g., '5538', '5529')",
    "description": "Document description with date (e.g., 'letter of (1783)', 'copy of petition of (ca. 1783)')",
    "folios": "Folio reference exactly as written (e.g., 'f.11', 'f.169')",
    "failed_to_parse": false (or true if card is illegible/severely damaged),
    "notes": "Optional notes about handwritten corrections, ambiguities, or parsing issues"
  }

  Guidelines:
  - Extract text exactly as it appears - do not correct spelling or expand abbreviations
  - Preserve original punctuation and formatting
  - If a field is unclear but you can make a reasonable inference, extract it and note the ambiguity in "notes"
  - Only set "failed_to_parse" to true if you genuinely cannot extract the required fields
  - Use null for optional fields (forenames, epithet, notes) if they are not present or marked with a line"""
image = PILImage.open(images[0])
image 

from rich import print
result = query_image_structured(image, prompt, IndexCardEntry, model='qwen/qwen3-vl-4b')
print(result)
IndexCardEntry(
    surname='ABBAATE',
    forenames='Itala',
    epithet='Daughter of the Physician',
    ms_no='2633',
    description='letter of (1878)',
    folios='f. 38',
    failed_to_parse=False,
    notes="Handwritten corrections and annotations present: 'Cairo' (instead of 'ABBAATE'), 'Cairo' (instead of 'ABBAATE'), 'Physician' (instead of 'Physician'), '2633' (instead of '2633'), 'f. 38' (instead of 'f. 38'). Also, 'Cairo' appears to be a scribbled correction or miswriting of 'ABBAATE'."
)

4.5.1 Comparing Extraction to Ground Truth

Let’s compare a few extractions to the actual card content:

from tqdm.auto import tqdm

results = []
for img_path in tqdm(images):
    image = PILImage.open(img_path)
    result = query_image_structured(image, prompt, IndexCardEntry, model='qwen/qwen3-vl-4b')
    results.append((img_path.stem, result))

4.6 Evaluation Strategies

How do we know if the extraction is working well? There are several approaches to evaluation, each with different tradeoffs.

4.6.1 Looking at lots of samples

It sounds simple, but reviewing a large number of random samples gives a good sense of overall quality. You can spot common errors, get a feel for how reliable the extraction is, and quickly build intuition about what might be going wrong and where to focus improvement efforts. Realistically, you will spend some time iterating on the prompt and schema at this stage. Looking at more than one example is important to avoid overfitting to a single case, but you don’t immediately need to review hundreds of examples or set up complex metrics or evaluations; that can come later.

for i, (img_stem, result) in enumerate(results):
    fig, (ax_img, ax_text) = plt.subplots(1, 2, figsize=(16, 6), 
                                           gridspec_kw={'width_ratios': [1, 1]})

    # Left: Display image
    img = plt.imread(images[i])
    ax_img.imshow(img)
    ax_img.axis('off')
    ax_img.set_title(f"Card {i+1}: {img_stem}", fontsize=14, fontweight='bold')

    # Right: Display extracted data as formatted text
    ax_text.axis('off')

    # Format the extracted data nicely
    text_lines = [
        "Extracted Data:",
        "",
        f"Surname: {result.surname}",
        f"Forenames: {result.forenames or 'N/A'}",
        f"Epithet: {result.epithet or 'N/A'}",
        f"MS No: {result.ms_no}",
        f"Description: {result.description}",
        f"Folios: {result.folios}",
        "",
        f"Failed to Parse: {result.failed_to_parse}",
    ]

    # Add notes if present
    if result.notes:
        text_lines.extend(("", "Notes:"))
        # Wrap long notes
        import textwrap
        wrapped_notes = textwrap.fill(result.notes, width=60)
        text_lines.append(wrapped_notes)

    # Join and display
    formatted_text = "\n".join(text_lines)
    ax_text.text(0.05, 0.95, formatted_text, 
                 transform=ax_text.transAxes,
                 fontsize=11,
                 verticalalignment='top',
                 fontfamily='monospace',
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.3))

    plt.tight_layout()
    plt.show()

4.6.1.1 What we learned from these samples

  • In these examples the notes field isn’t adding much value; it mostly introduces noise.
  • While the failed_to_parse flag sounds useful, we may want to rely on other signals to identify failures, since the model may not always set this flag correctly. In this case we have other ways to spot failures, such as checking for missing critical fields.
  • Overall, we should prioritize extracting the most relevant information and drop fields that don’t contribute to understanding the card. The simpler the schema, the less there is to check and the fewer tokens the model has to generate. With small test batches this doesn’t seem to matter much, but at the scale of thousands of cards it adds up.
from pydantic import BaseModel, Field
from typing import Optional

class IndexCardEntry(BaseModel):
    """Schema for index card extraction matching curator specification"""
    
    surname: str = Field(..., description="Family name as written on card")
    forenames: Optional[str] = Field(None, description="Given names")
    epithet: Optional[str] = Field(None, description="Title, occupation, or role")
    ms_no: str = Field(..., description="Manuscript number")
    description: str = Field(..., description="Document description with date")
    folios: str = Field(..., description="Folio reference")
    

prompt = """Extract structured information from this historical library index card and return it as JSON.

  This is an index card from the National Library of Scotland's Advocate's Library collection. Each card documents a person and associated manuscript references.

  Return a JSON object with these exact fields:

  {
    "surname": "Family name exactly as typed (e.g., 'ABAD', 'ABARACA Y BOLEA')",
    "forenames": "Given names (e.g., 'Joseph', 'Thomas') or null if not present",
    "epithet": "Title, occupation, or role (e.g., 'Captain, Spanish Army') or null if not present",
    "ms_no": "Manuscript number exactly as written (e.g., '5538', '5529')",
    "description": "Document description with date (e.g., 'letter of (1783)', 'copy of petition of (ca. 1783)')",
    "folios": "Folio reference exactly as written (e.g., 'f.11', 'f.169')"
  }

  Guidelines:
  - Extract text exactly as it appears - do not correct spelling or expand abbreviations
  - Preserve original punctuation and formatting
  - Use null for optional fields (forenames, epithet) if they are not present or marked with a line"""
results = []
for img_path in tqdm(images):
    image = PILImage.open(img_path)
    result = query_image_structured(image, prompt, IndexCardEntry, model='qwen/qwen3-vl-8b')
    results.append((img_path.stem, result))
# Display images with extracted data side-by-side
# Two columns: left = image, right = extracted text

for i, (img_stem, result) in enumerate(results):
    fig, (ax_img, ax_text) = plt.subplots(1, 2, figsize=(16, 6), 
                                           gridspec_kw={'width_ratios': [1, 1]})
    
    # Left: Display image
    img = plt.imread(images[i])
    ax_img.imshow(img)
    ax_img.axis('off')
    ax_img.set_title(f"Card {i+1}: {img_stem}", fontsize=14, fontweight='bold')
    
    # Right: Display extracted data as formatted text
    ax_text.axis('off')
    
    # Format the extracted data nicely
    text_lines = [
        "Extracted Data:",
        "",
        f"Surname: {result.surname}",
        f"Forenames: {result.forenames or 'N/A'}",
        f"Epithet: {result.epithet or 'N/A'}",
        f"MS No: {result.ms_no}",
        f"Description: {result.description}",
        f"Folios: {result.folios}",
        "",
    ]
    # Join and display
    formatted_text = "\n".join(text_lines)
    ax_text.text(0.05, 0.95, formatted_text, 
                 transform=ax_text.transAxes,
                 fontsize=11,
                 verticalalignment='top',
                 fontfamily='monospace',
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.3))
    
    plt.tight_layout()
    plt.show()

4.6.2 Manual Ground Truth Evaluation

The Gold Standard: Manually annotate a sample of cards and compare.

Pros:

  • Most accurate measure of performance
  • Catches all types of errors
  • Builds training data for future improvements

Cons:

  • Time consuming
  • Requires expert annotators
  • Limited sample size

Best for: Final validation, establishing baselines, understanding failure modes

# TODO: Load manually annotated ground truth
# Compare predictions to ground truth
# Calculate field-level accuracy

# Example metrics:
# - Exact match accuracy per field
# - Character error rate
# - Common error patterns
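To make the comparison concrete, here is a minimal sketch of field-level exact-match accuracy. The records below are made-up stand-ins for illustration, not real catalog data:

```python
# Sketch: field-level exact-match accuracy against manual annotations.
# The records below are illustrative stand-ins, not real catalog data.

def field_accuracy(predictions: list[dict], ground_truth: list[dict]) -> dict[str, float]:
    """Return the fraction of exact matches per field (case/whitespace normalized)."""
    fields = ground_truth[0].keys()
    totals = {f: 0 for f in fields}
    for pred, gold in zip(predictions, ground_truth):
        for f in fields:
            p = (pred.get(f) or "").strip().lower()
            g = (gold.get(f) or "").strip().lower()
            if p == g:
                totals[f] += 1
    n = len(ground_truth)
    return {f: totals[f] / n for f in fields}

preds = [{"surname": "ABBAATE", "ms_no": "2633"}, {"surname": "ABAD", "ms_no": "5538"}]
gold  = [{"surname": "ABBATE",  "ms_no": "2633"}, {"surname": "ABAD", "ms_no": "5538"}]
print(field_accuracy(preds, gold))  # surname matches 1/2, ms_no 2/2
```

Normalizing case and whitespace before comparing avoids penalizing trivial differences; a stricter pass without normalization can be run separately to catch formatting drift.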

4.6.3 Cross-Model Evaluation (Model-as-Judge)

The Pragmatic Approach: Use a stronger/different model to evaluate outputs.

Pros:

  • Much faster than manual annotation
  • Can evaluate the full dataset
  • Good for catching obvious errors

Cons:

  • Requires access to multiple models
  • May miss subtle errors
  • Judge model can be wrong too

Best for: Large-scale quality monitoring, automated testing, identifying problem areas for manual review

# TODO: Implement model-as-judge evaluation
# - Extract with Model A (e.g., local Qwen)
# - Show image + extraction to Model B (e.g., Claude/GPT-4)
# - Ask Model B to rate accuracy and identify errors
# - Aggregate results

# Example judge prompt:
# """
# Compare this extracted data to the index card image:
# [extraction]
# 
# For each field, rate accuracy:
# - Correct: Field matches card exactly
# - Minor error: Small typo or formatting difference
# - Major error: Wrong information
# - Missing: Field is on card but not extracted
# """
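The commented sketch above can be made concrete with a small request builder. This is a hypothetical implementation: the verdict categories mirror the judge prompt sketched above, and the judge model and response schema are assumptions you would adapt to your own setup:

```python
# Hypothetical model-as-judge request builder. The verdict categories and
# prompt wording are assumptions; reuse whatever structured-output client you have.
def build_judge_messages(extraction_json: str, image_base64: str) -> list[dict]:
    """Build a chat payload asking a second model to grade an extraction."""
    judge_prompt = (
        "Compare this extracted data to the index card image:\n"
        f"{extraction_json}\n\n"
        "For each field, rate accuracy as one of: correct, minor_error, "
        "major_error, missing, and explain any errors briefly."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": judge_prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}},
        ],
    }]

# Usage with the same client pattern as extraction (judge model is an assumption):
# completion = client.beta.chat.completions.parse(
#     model="qwen/qwen3-vl-8b",
#     messages=build_judge_messages(result.model_dump_json(), image_base64),
#     response_format=JudgeResult)  # JudgeResult: a Pydantic schema you define
```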

4.6.4 Internal Consistency Checks

The Automated Approach: Use business rules and patterns to identify suspicious outputs.

Examples:

  • Manuscript numbers should follow known patterns
  • Dates should be within expected ranges
  • Folio references have consistent formats
  • Certain fields should always be present

Pros:

  • Completely automated
  • Fast: can run on the full dataset
  • No additional model costs

Cons:

  • Only catches specific error types
  • Requires domain knowledge to design rules
  • Can miss errors that follow valid patterns

Best for: Flagging outliers for review, automated quality gates, monitoring production systems

# Consistency checks: flag extractions that don't match expected patterns
import re

def validate_extraction(entry: IndexCardEntry) -> list[str]:
    """Run validation checks and return a list of warnings."""
    warnings = []

    # Check MS number format (expected to start with digits)
    if not re.match(r'^\d+', entry.ms_no):
        warnings.append(f"Unusual MS number format: {entry.ms_no}")

    # Check for dates in the expected range
    for date in re.findall(r'\d{4}', entry.description):
        if not (1500 <= int(date) <= 1950):
            warnings.append(f"Date outside expected range: {date}")

    # Check folio format (e.g., 'f.11', 'ff. 3-7')
    if not re.match(r'^f+\.?\s*\d+', entry.folios, re.IGNORECASE):
        warnings.append(f"Unusual folio format: {entry.folios}")

    return warnings

4.6.5 Confidence Scoring

Many VLM APIs return confidence scores or logprobs. We can use these to identify uncertain extractions.

Pros:

  • No additional cost or models needed
  • Can prioritize review efforts
  • Helps establish quality thresholds

Cons:

  • Not all models/APIs provide confidence scores
  • High confidence doesn’t guarantee correctness
  • Requires calibration

Best for: Prioritizing manual review, quality-based routing, understanding model uncertainty

# TODO: If available, extract and analyze confidence scores
# Plot distribution of confidence scores
# Correlate confidence with manual evaluation results
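If your server returns token logprobs (the OpenAI chat completions API accepts `logprobs=True`; LM Studio support may vary by model), a crude sequence-level confidence is the geometric mean of token probabilities. A minimal sketch, using made-up logprob values:

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp(mean logprob). 1.0 = fully confident."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Example with made-up logprobs (0.0 means probability 1.0 for that token)
print(round(sequence_confidence([0.0, -0.1, -0.5]), 3))  # 0.819
```

A useful refinement is to score only the tokens belonging to a specific field's value rather than the whole JSON output, since structural tokens (braces, quotes) are nearly always high-confidence and inflate the average.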

4.6.6 Combining Evaluation Approaches

In practice, a robust evaluation strategy uses multiple approaches:

  1. Start with manual ground truth on a small sample (~50-100 cards) to establish baseline accuracy
  2. Use consistency checks to automatically flag suspicious outputs
  3. Apply model-as-judge on a larger sample to monitor quality
  4. Prioritize review using confidence scores or validation warnings
  5. Continuous monitoring as you process the full collection

This gives you both rigorous accuracy metrics and practical quality assurance at scale.

4.7 Batch Processing

Now let’s process a larger batch of cards and analyze the results.

# TODO: Process all available cards
# Track timing, failures, warnings
# Save results to file
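As a sketch of what such a batch runner could look like: the record format and output filename below are assumptions, and the extraction function is injected as a callable so the loop itself stays model-agnostic:

```python
import json
import time

def process_batch(paths, extract_fn, out_path="extractions.jsonl"):
    """Run extract_fn over paths, recording timing and failures, and save as JSONL.

    extract_fn should take a path and return a JSON-serializable dict
    (hypothetical interface; adapt to your extraction function).
    """
    records = []
    for p in paths:
        start = time.perf_counter()
        try:
            record = {"file": str(p), "ok": True, "data": extract_fn(p)}
        except Exception as exc:
            # Record the failure instead of aborting the whole batch
            record = {"file": str(p), "ok": False, "error": str(exc)}
        record["seconds"] = round(time.perf_counter() - start, 3)
        records.append(record)
    with open(out_path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    return records

# Usage with the real model (sketch):
# records = process_batch(images, lambda p: query_image_structured(
#     PILImage.open(p), prompt, IndexCardEntry).model_dump())
```

Writing one JSON object per line (JSONL) means a crash partway through loses at most the current card, and results can be re-loaded incrementally for analysis.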

4.7.1 Results Analysis

# TODO: Analyze batch results
# - Success rate
# - Failed to parse rate
# - Validation warnings distribution
# - Processing time statistics
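A minimal analysis sketch over batch records, assuming a hypothetical log format with `ok` (bool) and `seconds` (float) per card:

```python
def summarize_batch(records: list[dict]) -> dict:
    """Basic quality stats over batch records shaped like {'ok': bool, 'seconds': float}."""
    n = len(records)
    ok = [r for r in records if r.get("ok")]
    times = [r["seconds"] for r in records if "seconds" in r]
    return {
        "total": n,
        "success_rate": len(ok) / n if n else 0.0,
        "mean_seconds": sum(times) / len(times) if times else 0.0,
    }

# Illustrative records, not real measurements
records = [
    {"ok": True, "seconds": 1.2},
    {"ok": True, "seconds": 0.8},
    {"ok": False, "seconds": 2.0},
]
print(summarize_batch(records))
```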

4.8 Edge Cases and Failure Modes

What kinds of cards are hard for the model to process?

# TODO: Examine failed/problematic extractions
# Common patterns:
# - Handwritten corrections/additions
# - Faded or damaged cards
# - Unusual formats or layouts
# - Multiple entries per card
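One cheap way to surface problematic cards is to flag entries with missing critical fields or suspiciously short values. A sketch, where the threshold and field names are assumptions to tune against your data:

```python
def flag_suspect_entries(entries: list[dict]) -> list[tuple[int, list[str]]]:
    """Flag entries with missing critical fields or suspiciously short values."""
    critical = ("surname", "ms_no", "description", "folios")
    flagged = []
    for i, e in enumerate(entries):
        problems = [f"missing {f}" for f in critical if not (e.get(f) or "").strip()]
        # Threshold of 5 characters is an arbitrary starting point
        if len(e.get("description") or "") < 5:
            problems.append("very short description")
        if problems:
            flagged.append((i, problems))
    return flagged

# Illustrative entries: the second one should be flagged for review
entries = [
    {"surname": "ABAD", "ms_no": "5538", "description": "letter of (1783)", "folios": "f.11"},
    {"surname": "", "ms_no": "5529", "description": "??", "folios": "f.169"},
]
print(flag_suspect_entries(entries))
```

Flagged indices can then be routed to a human review queue rather than silently entering the catalog.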

4.9 Export for Cataloging

Convert the extracted data to formats suitable for library systems.

# TODO: Export to CSV/JSON/XML
# Consider catalog system requirements (MARC, Dublin Core, etc.)
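A minimal CSV export sketch using the standard library. The column choices here are illustrative; real catalog targets such as MARC or Dublin Core would need a proper field mapping:

```python
import csv
import io

def to_csv(entries: list[dict], fieldnames: list[str]) -> str:
    """Serialize extracted entries to CSV, keeping only the requested columns."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(entries)
    return buf.getvalue()

# Illustrative entry; the "notes" key is dropped because it isn't in fieldnames
entries = [{"surname": "ABAD", "ms_no": "5538", "folios": "f.11", "notes": "dropped"}]
print(to_csv(entries, ["surname", "ms_no", "folios"]))
```

`extrasaction="ignore"` lets the export schema evolve independently of the extraction schema: extra keys in the data are silently skipped rather than raising an error.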

4.10 Next Steps

This notebook demonstrates the core extraction and evaluation workflow. For production deployment, you would need:

  1. Robust error handling - retry logic, fallbacks, logging
  2. Quality assurance workflow - human review interface for flagged items
  3. Batch processing infrastructure - queue management, progress tracking
  4. Model optimization - prompt tuning, model selection, cost optimization

These production considerations are covered in the appendices and separate infrastructure documentation.

4.11 Key Takeaways

  1. Simple schemas work better - Don’t over-engineer the structure
  2. Multiple evaluation strategies - Combine automated and manual approaches
  3. Plan for failure - Build in quality flags and review workflows
  4. Domain expertise matters - Work closely with catalogers to define requirements
  5. Iterate based on results - Start small, evaluate, adjust, scale