satomitheito's picture
hf streamlit
908ff10

A newer version of the Streamlit SDK is available: 1.57.0

Upgrade

Data

CUAD (Contract Understanding Atticus Dataset)

Contains:

  • 510 commercial contracts (NDAs, service agreements, licensing agreements, etc.)
  • 13,000+ expert annotations
  • 41 clause type categories (e.g., "Governing Law", "Termination for Convenience", "Limitation of Liability", "Non-Compete", "Indemnification", etc)
  • Each annotation is a span within a contract + the question it answers

Uses:

  1. For Classification Agent taxonomy: the 41 clause categories are the classification schema. When the classification agent receives a clause segment, it uses the 41 types as the set of all labels it can assign. Approach: put all 41 category names + brief descriptions in the system prompt as a reference list

  2. For Risk Analysis Agent: The annotations tell which categories expert annotators were trained to flag as legally significant, which can support the risk scoring logic (gives LLM structured guidance about what to look for)

  3. Benchmark corpus (secondary): the 510 CUAD contracts can be embedded and added to the vector store alongside EDGAR contracts

  4. Sample input contracts for testing/demo (maybe): Since CUAD contracts are real commercial contracts, a subset (20-30?) could be used as "uploaded contracts" during testing and demo, rather than needing to find separate test documents.

Need to collect:

  • Full taxonomy: list of 41 category names + description of each -> data/cuad/taxonomy.json
  • Contract texts: save the 510 contract full texts -> data/cuad/contracts/
  • Annotations (the labeled spans per contract) -> data/cuad/annotations.json