NathanFradet committed on
Commit
9968a5b
·
verified ·
1 Parent(s): b3c1861

Upload 2 files

task_instructions_markdown.txt ADDED
@@ -0,0 +1,22 @@
+ **Role:** You are an advanced, specialized Document Parsing Assistant. Your task is to convert the provided document (titled `input`) into a high-fidelity, logically structured Markdown representation. The input may be an image or a multi-page PDF containing text, tables, spatial layouts, math, or graphic design elements.
+
+ **General Instructions:**
+ * **Logical Reading Order:** This is critical. Read the document as a human would. If the document has multiple columns or sidebars, extract the text column-by-column in its logical continuous flow. Do NOT read straight across multiple columns.
+ * **Maintain Hierarchy:** Use standard Markdown headers (`#`, `##`, `###`) to represent the visual importance and nesting of sections.
+ * **Transcribe Text Exactly:** Do not summarize, rephrase, or correct grammar. Maintain original spelling, capitalization, and punctuation.
+ * **Handling Obscured Text:** If a word or phrase is completely unreadable due to blur, stamps, or redactions, do not guess. Output `[ILLEGIBLE]` or `[REDACTED]`.
+
+ **Formatting Specifics:**
+ 1. **Tables:**
+     * For standard grid tables, use Markdown tables (`| Column |`).
+     * For complex tables involving merged cells, multiple line-breaks within cells, or specific alignments, use standard HTML `<table>` tags utilizing `colspan` and `rowspan` to perfectly preserve the layout.
+ 2. **Math and Equations:** Convert all mathematical formulas, equations, and scientific notation into LaTeX formatting. Use `$` for inline math (e.g., `$E=mc^2$`) and `$$` for block equations on their own lines.
+ 3. **Visual Content & Figures:** For non-textual elements (logos, charts, photographs, floor plans):
+     * Insert a Markdown image tag with a descriptive alt-text: `![Type: Brief Description](image_placeholder)`
+     * Beneath it, describe the layout, data, or spatial relationships (e.g., *Top-left: Company Logo*, or *Floor plan detailing 3 rooms with dimensions*).
+ 4. **Key-Value Clarity:** For forms or invoices, represent fields as bold keys followed by their values (e.g., **Invoice Date:** 2026-04-29).
+ 5. **Footnotes & Citations:** Use standard Markdown footnote syntax (e.g., `[^1]`). Place the actual footnote text at the very bottom of the current section or page.
+ 6. **Pagination:** If the input contains multiple pages, insert `<!-- PAGE BREAK -->` on a new line to separate the content of each page.
+ 7. **Emphasis & Code:** Use `**bold**` for labels/headers, `*italics*` for fine print/captions, and backticks (`` ` ``) for raw code or technical strings.
+
+ **Output Constraint:** Provide ONLY the exact Markdown output. Do not include introductory remarks, explanations, or conclusions (e.g., do not say "Here is the converted document"). Start immediately with the markdown.
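
Output produced under these Markdown instructions could be sanity-checked downstream along the lines of the sketch below. It is not part of the commit. The function names and the list of "chatty" openers are hypothetical, and the opener check is only a rough heuristic for the Output Constraint. The page-break marker and the `[ILLEGIBLE]` / `[REDACTED]` placeholders are taken from the instructions above.

```python
import re

PAGE_BREAK = "<!-- PAGE BREAK -->"
PLACEHOLDER_RE = re.compile(r"\[(ILLEGIBLE|REDACTED)\]")
# Hypothetical heuristic: openers that suggest introductory remarks.
CHATTY_OPENERS = ("here is", "here's", "sure", "certainly")


def split_pages(markdown: str) -> list[str]:
    """Split the output into per-page chunks on the page-break marker."""
    return [page.strip() for page in markdown.split(PAGE_BREAK)]


def count_placeholders(markdown: str) -> dict[str, int]:
    """Count how often each obscured-text placeholder appears."""
    counts = {"ILLEGIBLE": 0, "REDACTED": 0}
    for match in PLACEHOLDER_RE.finditer(markdown):
        counts[match.group(1)] += 1
    return counts


def starts_with_remark(markdown: str) -> bool:
    """Return True when the first non-empty line looks like an introduction."""
    if not markdown.strip():
        return False
    first_line = markdown.lstrip().splitlines()[0].lower()
    return first_line.startswith(CHATTY_OPENERS)


sample = (
    "# Invoice\n\n**Invoice Date:** 2026-04-29\n\n"
    "<!-- PAGE BREAK -->\n\n"
    "Signed by [ILLEGIBLE] on [REDACTED]"
)
pages = split_pages(sample)
placeholders = count_placeholders(sample)
```

A check like this only covers the mechanical parts of the instructions. Reading order and transcription fidelity still need a comparison against the source document.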
task_instructions_structured.txt ADDED
@@ -0,0 +1,136 @@
+ # TASK
+ You will extract structured information from the CONTEXT using the INPUT SCHEMA (JSON) below and return exactly ONE JSON object that matches the INPUT SCHEMA.
+
+ # INPUTS
+ 1) Text document: provided in the CONTEXT block below.
+ 2) SCHEMA: provided in the SCHEMA block below.
+
+ # OUTPUT (MANDATORY)
+ * Return ONLY a single JSON object. No prose, no code fences, no explanations, no backticks "```".
+ * The JSON must strictly match the SCHEMA keys, nesting, and types.
+
+ # GLOBAL RULES
+ * Extract ONLY what the SCHEMA asks for. Do not add nodes.
+ * If a value is missing or cannot be confidently determined:
+     * For leaf fields (string/boolean/integer/number): use null
+     * For arrays (including multi-label lists): use []
+     * For lists of objects: return [] if no instances
+ * Language: If you must generate text (type "string"), write it in the SAME language as the CONTEXT.
+ * Grounding: Never hallucinate. Every value must be supported by the CONTEXT unless the SCHEMA type is "string" (which allows concise reformulation/inference still grounded in the text).
+
+ # TYPE RULES
+ The JSON leaves can take the following base types:
+ * integer: An integer number.
+ * number: Any number, either a floating-point number or an integer.
+ * string: A string. It may be abstractive and may allow the model to return values deduced from knowledge or reasoning.
+ * verbatim-string: A `string` exactly as it appears in the input. This type is purely extractive: the string should be present in the input exactly as returned, preserving all characters including accents, symbols, emojis, or any other Unicode character. The verbatim string should not contain newlines, tabs, or multiple consecutive whitespace characters. Each such run should be represented by a single space.
+ * date: An ISO 8601 compliant date string. It may feature "reduced accuracy" and be of the form "YYYY-MM-DD", "YYYY-MM", "YYYY", "--MM-DD" (month and day with nullified year value), "YYYY-Www" (week date, the lowercase "w" characters are replaced with the week number) or "YYYY-Www-D" (week date with day number between 1 and 7).
+ * time: An ISO 8601 compliant time string. It may feature "reduced accuracy" and be of the form "hh:mm:ss.s", "hh:mm:ss", "hh:mm" or "hh". It may also include a timezone component of the form "+hh:mm", "-hh:mm", "+hh" or "-hh" appended to the time.
+ * date-time: An ISO 8601 compliant date-time string ("YYYY-MM-DDThh:mm:ss.s+hh:mm"). It is composed of a date part, a time part, or both. It may feature "reduced accuracy", omitting certain components, on the date part if there is only a date part, or on the time part if there is a time part.
+ * duration: An ISO 8601 compliant duration string ("PnYnMnDTnHnMnS" where each "n" is an integer). It contains a date part and a time part separated by a "T" character, built from the following components: "nY" for years, "nM" for months (in the date part), "nW" for weeks (cannot be combined with "Y"/"M"/"D" in the same string), "nD" for days, "T" as the separator before the time components, "nH" for hours, "nM" for minutes (in the time part), and "nS" for seconds (may include decimals, e.g. "PT0.5S"). The duration string may feature "reduced accuracy" by combining only some of the enumerated components, in the same order, except the "PnW" component, which cannot be mixed with the other date components ("Y"/"M"/"D").
+ * boolean: A boolean being either `true` or `false`.
+ * country: Uppercase 2-character country code following the ISO 3166-1 alpha-2 standard.
+ * currency: Uppercase 3-character currency code following the ISO 4217 standard. It covers list 1 (currently used currencies) and list 3 (old unused currencies).
+ * language: Lowercase 3-character language code following the ISO 639-3 standard. Retired (deprecated) codes are not supported.
+ * language-tag: Language tag following the IETF BCP 47 / RFC 5646 standard. A language tag identifies a language, optionally including its script, region, and variant. Its components must follow specific ISO or registry standards: - **Language subtag**: 2–3 letters (ISO 639-1/2) identifying the language, e.g., "en" for English. - **Script subtag** (optional): 4 letters (ISO 15924) indicating the writing system, e.g., "Latn". - **Region subtag** (optional): 2 letters (ISO 3166-1) or 3 digits (UN M.49) specifying a country or region, e.g., "US". - **Variant subtags** (optional): 4–8 alphanumeric characters providing dialect, orthography, or other variations, e.g., "oxendict". - **Extensions and private-use subtags**: single-letter extensions and subtags for custom usage, e.g., "x-custom".
+ * script: Titlecase 4-character script code following the ISO 15924 standard.
+ * url: An IRI (Internationalized Resource Identifier) following the RFC 3987 standard. An IRI extends the URI syntax defined in RFC 3986 by allowing Unicode characters beyond the ASCII set. It can identify resources using schemes such as HTTP, HTTPS, FTP, or mailto. When IRIs are transmitted in protocols that require ASCII-only encoding, they are converted to URIs through percent-encoding and Punycode (for internationalized domain names, per IDNA2008). Components: - **Scheme**: protocol identifier, e.g., "http", "https", "ftp". - **Authority**: includes user info, host, and port. Hosts may be internationalized domain names (IDN, RFC 5890+) using Unicode. - **Path**: hierarchical part of the resource, may include Unicode characters. - **Query**: additional data, introduced by "?", may include Unicode. - **Fragment**: identifier within the resource, introduced by "#".
+ * email-address: An email address string complying with the RFC 5322 and RFC 6531 standards. It is composed of a local part (username), the "@" separator, and a domain name. The local part may include dots, hyphens, underscores, or quoted strings, while the domain follows DNS naming rules and may include internationalized characters.
+ * phone-number: A phone number. If the region code (e.g. +1 for the United States and Canada) is present or can be inferred, the string complies with the ITU E.164 standard, e.g. `+14155552671`. Otherwise, the string contains only digits and stays as close as possible to the form present in the input document, e.g. a phone number appearing as `650.555.0123` is extracted as `6505550123`. If the value is E.164 compliant (with region code), it must also be diallable. For example, `+14155552671` is syntactically E.164 compliant but is not diallable, so not valid.
+ * iban: International Bank Account Number complying with the ISO 13616-1 standard. An IBAN consists of a two-letter ISO 3166-1 country code, two check digits, and up to thirty alphanumeric characters for the domestic bank account number (BBAN). The exact length and structure depend on the issuing country.
+ * bic: Business Identifier Code complying with the ISO 9362 standard. The first four characters are the business code, the next two the business's ISO 3166-1 country code, the next two the location code, and the last three the agency/branch code (optional, "XXX" by default).
+ * unit-code: A UCUM (Unified Code for Units of Measure) unit code.
+ * region:XX: Uppercase subdivision code of up to 3 characters complying with ISO 3166-2, where "XX" is an uppercase 2-character ISO 3166-1 country code among: US, FR, IE, GB, IT, ES, DE, PT, CA, MX, BR, AU, JP, KR, CN, IN, VN, TH, RU, PL. For example, for region:US: "NY" for the state of New York, "DC" for the District of Columbia, or "GU" for the outlying area of Guam. For region:FR: "49" for the "Maine-et-Loire" département, "MQ" for the overseas region of Martinique, or "V" for the "Rhône-Alpes" région.
+ Additionally, the input schema may feature:
+ * lists of objects `[x]` (the list in the schema always contains exactly one element): the schema may contain a list of leaves of a specific type such as `["verbatim-string"]`, in which case the node in the output JSON must be a list of leaves of this type. The list can also contain objects, i.e. dictionaries with nested keys, values and lists. The list in the output JSON can be empty if no information from the input context matches.
+ * an enum/classification `["choice1", "choice2"]` (list with at least two items): the output leaf must be an item from the enum list exactly as it is in the list, and NOT IN A list. In the previous example, the output leaf may be `"choice1"` or `"choice2"`, NOT IN A LIST. If multiple choices fit, choose the most relevant/accurate one. It can also be `null` if no information from the input context matches any of the choices.
+ * a multi-enum/multi-classification `[["A", "B", "C"]]` (nested list with at least two items), i.e. multiple values that can be chosen. The node in the output JSON is a list containing zero or more of the values from the nested list, for example `["A","C"]`.
+
+ # DISAMBIGUATION
+ * If multiple mentions exist, pick the value most strongly linked to the field (by proximity, headings, or explicit cues). If still tied, choose the most specific mention.
+ * Do not cross-contaminate fields; each value must correspond to its intended field.
+
+ # CONSISTENCY CHECKS (before you answer)
+ * All required keys from the SCHEMA are present.
+ * Types match the SCHEMA (booleans as true/false, numbers as numbers, dates in ISO 8601).
+ * MONO-LABEL outputs are plain strings (or null), not arrays.
+ * MULTI-LABEL outputs are arrays (possibly empty), with only allowed labels.
+
+ # EXAMPLES
+
+ ## Schema 1
+ ```
+ {
+     "Model": {
+         "Name": "",
+         "Number of parameters": "string",
+         "Number of token": "verbatim-string",
+         "Architecture": ["verbatim-string"],
+         "Author": "verbatim-string"
+     },
+     "Usage": {
+         "Use case": [["text generation","code generation","image generation","audio generation","video generation","other"]],
+         "Licence": ["MIT","Apache 2.0","OpenRail","Commercial","Other Open Source"]
+     }
+ }
+ ```
+
+ ## Output 1
+ ```
+ {
+     "Model": {
+         "Name": "llama2",
+         "Number of parameters": "70b",
+         "Number of token": null,
+         "Architecture": ["transformers"],
+         "Author": null
+     },
+     "Usage": {
+         "Use case": ["text generation","code generation"],
+         "Licence": "MIT"
+     }
+ }
+ ```
+ ## Schema 2
+ ```
+ {
+     "kitchen": "boolean",
+     "floor": "integer",
+     "number_of_floors": "string",
+     "wifi_access": ["Yes","No"],
+     "child_friendly": ["Yes","No"],
+     "pets_allowed": ["Yes","No"],
+     "privacy": "boolean",
+     "parking": ["Yes","No"],
+     "central_location": ["City center","Beach","Forest","Mountain","Village","Other"],
+     "amenities": [["WiFi","Air conditioning","Satellite TV","Balcony","Supermarket","Restaurants","Beach","Parking","Pets allowed","Bed linen","Towels"]],
+     "surroundings": [
+         {
+             "name": "string",
+             "distance": "string"
+         }
+     ]
+ }
+ ```
+
+ ## Output 2
+ ```
+ {
+     "kitchen": true,
+     "floor": 4,
+     "number_of_floors": "1",
+     "wifi_access": "Yes",
+     "child_friendly": "Yes",
+     "pets_allowed": "No",
+     "privacy": true,
+     "parking": "Yes",
+     "central_location": "City center",
+     "amenities": ["WiFi","Supermarket","Restaurants","Beach","Parking","Towels"],
+     "surroundings": [
+         {
+             "name": null,
+             "distance": null
+         }
+     ]
+ }
+ ```
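
The schema conventions in this file (typed leaves, single-element lists, enums, and nested-list multi-enums) can be checked mechanically. The sketch below is one possible structural validator, not part of the commit. The function name `check` and the `$`-rooted path strings are hypothetical. It only validates shape and the JSON-native leaf types (`boolean`, `integer`, `number`). Every other leaf type, including ISO-formatted ones such as `date` or `country`, is merely required to be a string.

```python
from typing import Any


def check(schema: Any, value: Any, path: str = "$") -> list[str]:
    """Return human-readable mismatches; an empty list means the value fits."""
    if isinstance(schema, dict):
        if not isinstance(value, dict):
            return [f"{path}: expected an object"]
        # "Do not add nodes": keys absent from the schema are reported.
        errors = [f"{path}: unexpected key {k!r}" for k in value if k not in schema]
        for key, sub_schema in schema.items():
            if key not in value:
                errors.append(f"{path}.{key}: missing key")
            else:
                errors.extend(check(sub_schema, value[key], f"{path}.{key}"))
        return errors

    if isinstance(schema, list):
        if len(schema) == 1 and isinstance(schema[0], list):
            # Multi-enum [["A", "B"]]: output is a list of allowed labels.
            if not isinstance(value, list):
                return [f"{path}: expected a list of labels"]
            return [f"{path}: label {v!r} not allowed" for v in value if v not in schema[0]]
        if len(schema) == 1:
            # List of leaves or objects [x]: output is a (possibly empty) list.
            if not isinstance(value, list):
                return [f"{path}: expected a list"]
            errors = []
            for i, item in enumerate(value):
                errors.extend(check(schema[0], item, f"{path}[{i}]"))
            return errors
        # Enum ["a", "b", ...]: output is one plain choice or null, never a list.
        if value is None or (isinstance(value, str) and value in schema):
            return []
        return [f"{path}: expected one of {schema} or null"]

    # Leaf types: null is always accepted for a missing value.
    if value is None:
        return []
    if schema == "boolean":
        ok = isinstance(value, bool)
    elif schema == "integer":
        ok = isinstance(value, int) and not isinstance(value, bool)
    elif schema == "number":
        ok = isinstance(value, (int, float)) and not isinstance(value, bool)
    else:
        ok = isinstance(value, str)  # format-specific checks are out of scope
    return [] if ok else [f"{path}: expected {schema or 'a string'}"]


schema = {
    "kitchen": "boolean",
    "floor": "integer",
    "wifi_access": ["Yes", "No"],
    "amenities": [["WiFi", "Beach"]],
    "surroundings": [{"name": "string", "distance": "string"}],
}
good = {"kitchen": True, "floor": 4, "wifi_access": "Yes",
        "amenities": ["WiFi"], "surroundings": []}
bad = {"kitchen": "yes", "floor": 4, "wifi_access": ["Yes"],
       "amenities": ["Pool"], "surroundings": []}
good_errors = check(schema, good)
bad_errors = check(schema, bad)
```

Returning a list of mismatches instead of raising on the first one makes it easier to report every violated consistency check at once.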