writinwaters
committed on
Commit
·
4b50c07
1
Parent(s):
25a190f
Updated parser_config description (#3104)
### What problem does this PR solve?
### Type of change
- [x] Documentation Update
- api/http_api_reference.md +31 -16
- api/python_api_reference.md +64 -15
api/http_api_reference.md
CHANGED
|
@@ -78,7 +78,7 @@ curl --request POST \
|
|
| 78 |
- `"chunk_method"`: (*Body parameter*), `enum<string>`
|
| 79 |
The chunking method of the dataset to create. Available options:
|
| 80 |
- `"naive"`: General (default)
|
| 81 |
-
- `"manual`: Manual
|
| 82 |
- `"qa"`: Q&A
|
| 83 |
- `"table"`: Table
|
| 84 |
- `"paper"`: Paper
|
|
@@ -88,16 +88,23 @@ curl --request POST \
|
|
| 88 |
- `"picture"`: Picture
|
| 89 |
- `"one"`: One
|
| 90 |
- `"knowledge_graph"`: Knowledge Graph
|
| 91 |
-
- `"email"`: Email
|
| 92 |
|
| 93 |
- `"parser_config"`: (*Body parameter*), `object`
|
| 94 |
-
The configuration settings for the dataset parser
|
| 95 |
-
- `"
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
|
| 102 |
### Response
|
| 103 |
|
|
@@ -256,7 +263,6 @@ curl --request PUT \
|
|
| 256 |
- `"picture"`: Picture
|
| 257 |
- `"one"`: One
|
| 258 |
- `"knowledge_graph"`: Knowledge Graph
|
| 259 |
-
- `"email"`: Email
|
| 260 |
|
| 261 |
### Response
|
| 262 |
|
|
@@ -511,13 +517,22 @@ curl --request PUT \
|
|
| 511 |
- `"picture"`: Picture
|
| 512 |
- `"one"`: One
|
| 513 |
- `"knowledge_graph"`: Knowledge Graph
|
| 514 |
-
- `"email"`: Email
|
| 515 |
- `"parser_config"`: (*Body parameter*), `object`
|
| 516 |
-
The
|
| 517 |
-
- `"
|
| 518 |
-
|
| 519 |
-
|
| 520 |
-
|
| 521 |
|
| 522 |
### Response
|
| 523 |
|
|
|
|
| 78 |
- `"chunk_method"`: (*Body parameter*), `enum<string>`
|
| 79 |
The chunking method of the dataset to create. Available options:
|
| 80 |
- `"naive"`: General (default)
|
| 81 |
+
- `"manual"`: Manual
|
| 82 |
- `"qa"`: Q&A
|
| 83 |
- `"table"`: Table
|
| 84 |
- `"paper"`: Paper
|
|
|
|
| 88 |
- `"picture"`: Picture
|
| 89 |
- `"one"`: One
|
| 90 |
- `"knowledge_graph"`: Knowledge Graph
|
|
|
|
| 91 |
|
| 92 |
- `"parser_config"`: (*Body parameter*), `object`
|
| 93 |
+
The configuration settings for the dataset parser. The attributes in this JSON object vary with the selected `"chunk_method"`:
|
| 94 |
+
- If `"chunk_method"` is `"naive"`, the `"parser_config"` object contains the following attributes:
|
| 95 |
+
- `"chunk_token_count"`: Defaults to `128`.
|
| 96 |
+
- `"layout_recognize"`: Defaults to `true`.
|
| 97 |
+
- `"html4excel"`: Indicates whether to convert Excel documents into HTML format. Defaults to `false`.
|
| 98 |
+
- `"delimiter"`: Defaults to `"\n!?。;!?"`.
|
| 99 |
+
- `"task_page_size"`: Defaults to `12`. For PDF only.
|
| 100 |
+
- `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
|
| 101 |
+
- If `"chunk_method"` is `"qa"`, `"manual"`, `"paper"`, `"book"`, `"laws"`, or `"presentation"`, the `"parser_config"` object contains the following attribute:
|
| 102 |
+
- `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
|
| 103 |
+
- If `"chunk_method"` is `"table"` or `"one"`, `"parser_config"` is an empty JSON object.
|
| 104 |
+
- If `"chunk_method"` is `"knowledge_graph"`, the `"parser_config"` object contains the following attributes:
|
| 105 |
+
- `"chunk_token_count"`: Defaults to `128`.
|
| 106 |
+
- `"delimiter"`: Defaults to `"\n!?。;!?"`.
|
| 107 |
+
- `"entity_types"`: Defaults to `["organization","person","location","event","time"]`
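A minimal Python sketch of the mapping above, assuming the listed defaults are exactly what a client would send in the request body. `default_parser_config` and `build_body_fragment` are hypothetical helper names, not part of the documented API, and `"manual"` is spelled as in the `"chunk_method"` option list.

```python
import json

# Methods whose parser_config only carries the "raptor" setting.
RAPTOR_ONLY_METHODS = {"qa", "manual", "paper", "book", "laws", "presentation"}


def default_parser_config(chunk_method: str) -> dict:
    """Return the documented default parser_config for a chunk method."""
    if chunk_method == "naive":
        return {
            "chunk_token_count": 128,
            "layout_recognize": True,
            "html4excel": False,
            "delimiter": "\n!?。;!?",
            "task_page_size": 12,  # documented as PDF only
            "raptor": {"use_raptor": False},
        }
    if chunk_method in RAPTOR_ONLY_METHODS:
        return {"raptor": {"use_raptor": False}}
    if chunk_method in {"table", "one"}:
        return {}  # documented as an empty JSON object
    if chunk_method == "knowledge_graph":
        return {
            "chunk_token_count": 128,
            "delimiter": "\n!?。;!?",
            "entity_types": ["organization", "person", "location", "event", "time"],
        }
    raise ValueError(f"no documented parser_config for {chunk_method!r}")


def build_body_fragment(chunk_method: str = "naive") -> str:
    """Serialize only the two body parameters discussed here (hypothetical helper)."""
    body = {
        "chunk_method": chunk_method,
        "parser_config": default_parser_config(chunk_method),
    }
    return json.dumps(body, ensure_ascii=False)
```

Calling `build_body_fragment("table")` yields a body whose `"parser_config"` is `{}`, matching the rule for `"table"` and `"one"`.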
|
| 108 |
|
| 109 |
### Response
|
| 110 |
|
|
|
|
| 263 |
- `"picture"`: Picture
|
| 264 |
- `"one"`: One
|
| 265 |
- `"knowledge_graph"`: Knowledge Graph
|
|
|
|
| 266 |
|
| 267 |
### Response
|
| 268 |
|
|
|
|
| 517 |
- `"picture"`: Picture
|
| 518 |
- `"one"`: One
|
| 519 |
- `"knowledge_graph"`: Knowledge Graph
|
|
|
|
| 520 |
- `"parser_config"`: (*Body parameter*), `object`
|
| 521 |
+
The configuration settings for the dataset parser. The attributes in this JSON object vary with the selected `"chunk_method"`:
|
| 522 |
+
- If `"chunk_method"` is `"naive"`, the `"parser_config"` object contains the following attributes:
|
| 523 |
+
- `"chunk_token_count"`: Defaults to `128`.
|
| 524 |
+
- `"layout_recognize"`: Defaults to `true`.
|
| 525 |
+
- `"html4excel"`: Indicates whether to convert Excel documents into HTML format. Defaults to `false`.
|
| 526 |
+
- `"delimiter"`: Defaults to `"\n!?。;!?"`.
|
| 527 |
+
- `"task_page_size"`: Defaults to `12`. For PDF only.
|
| 528 |
+
- `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
|
| 529 |
+
- If `"chunk_method"` is `"qa"`, `"manual"`, `"paper"`, `"book"`, `"laws"`, or `"presentation"`, the `"parser_config"` object contains the following attribute:
|
| 530 |
+
- `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
|
| 531 |
+
- If `"chunk_method"` is `"table"` or `"one"`, `"parser_config"` is an empty JSON object.
|
| 532 |
+
- If `"chunk_method"` is `"knowledge_graph"`, the `"parser_config"` object contains the following attributes:
|
| 533 |
+
- `"chunk_token_count"`: Defaults to `128`.
|
| 534 |
+
- `"delimiter"`: Defaults to `"\n!?。;!?"`.
|
| 535 |
+
- `"entity_types"`: Defaults to `["organization","person","location","event","time"]`
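Because the accepted attributes depend on `"chunk_method"`, a client may want to check an update before sending it. The sketch below is one way to do that in Python. The text does not say how the server treats unexpected keys, so this is a client-side assumption, and `ALLOWED_KEYS` and `unexpected_keys` are hypothetical names.

```python
# Attribute names per chunk method, taken from the list above.
ALLOWED_KEYS = {
    "naive": {
        "chunk_token_count", "layout_recognize", "html4excel",
        "delimiter", "task_page_size", "raptor",
    },
    "knowledge_graph": {"chunk_token_count", "delimiter", "entity_types"},
    "table": set(),  # parser_config is an empty object
    "one": set(),
}
for _method in ("qa", "manual", "paper", "book", "laws", "presentation"):
    ALLOWED_KEYS[_method] = {"raptor"}


def unexpected_keys(chunk_method: str, parser_config: dict) -> list:
    """Return the parser_config keys that are not documented for chunk_method."""
    allowed = ALLOWED_KEYS[chunk_method]  # KeyError for methods not covered above
    return sorted(set(parser_config) - allowed)
```

An empty result means every key in the update is documented for the chosen method.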
|
| 536 |
|
| 537 |
### Response
|
| 538 |
|
api/python_api_reference.md
CHANGED
|
@@ -75,16 +75,31 @@ The chunking method of the dataset to create. Available options:
|
|
| 75 |
- `"picture"`: Picture
|
| 76 |
- `"one"`: One
|
| 77 |
- `"knowledge_graph"`: Knowledge Graph
|
| 78 |
-
- `"email"`: Email
|
| 79 |
|
| 80 |
#### parser_config
|
| 81 |
|
| 82 |
-
The parser configuration of the dataset. A `ParserConfig` object
|
| 83 |
-
|
| 84 |
-
- `
|
| 85 |
-
|
| 86 |
-
- `
|
| 87 |
-
|
| 88 |
|
| 89 |
### Returns
|
| 90 |
|
|
@@ -225,7 +240,6 @@ A dictionary representing the attributes to update, with the following keys:
|
|
| 225 |
- `"picture"`: Picture
|
| 226 |
- `"one"`: One
|
| 227 |
- `"knowledge_graph"`: Knowledge Graph
|
| 228 |
-
- `"email"`: Email
|
| 229 |
|
| 230 |
### Returns
|
| 231 |
|
|
@@ -296,11 +310,6 @@ Updates configurations for the current document.
|
|
| 296 |
A dictionary representing the attributes to update, with the following keys:
|
| 297 |
|
| 298 |
- `"display_name"`: `str` The name of the document to update.
|
| 299 |
-
- `"parser_config"`: `dict[str, Any]` The parsing configuration for the document:
|
| 300 |
-
- `"chunk_token_count"`: Defaults to `128`.
|
| 301 |
-
- `"layout_recognize"`: Defaults to `True`.
|
| 302 |
-
- `"delimiter"`: Defaults to `'\n!?。;!?'`.
|
| 303 |
-
- `"task_page_size"`: Defaults to `12`.
|
| 304 |
- `"chunk_method"`: `str` The parsing method to apply to the document.
|
| 305 |
- `"naive"`: General
|
| 306 |
- `"manual"`: Manual
|
|
@@ -313,7 +322,27 @@ A dictionary representing the attributes to update, with the following keys:
|
|
| 313 |
- `"picture"`: Picture
|
| 314 |
- `"one"`: One
|
| 315 |
- `"knowledge_graph"`: Knowledge Graph
|
| 316 |
-
|
| 317 |
|
| 318 |
### Returns
|
| 319 |
|
|
@@ -412,7 +441,6 @@ A `Document` object contains the following attributes:
|
|
| 412 |
- `thumbnail`: The thumbnail image of the document. Defaults to `None`.
|
| 413 |
- `dataset_id`: The dataset ID associated with the document. Defaults to `None`.
|
| 414 |
- `chunk_method`: The chunk method name. Defaults to `"naive"`.
|
| 415 |
-
- `parser_config`: `ParserConfig` Configuration object for the parser. Defaults to `{"pages": [[1, 1000000]]}`.
|
| 416 |
- `source_type`: The source type of the document. Defaults to `"local"`.
|
| 417 |
- `type`: Type or category of the document. Defaults to `""`. Reserved for future use.
|
| 418 |
- `created_by`: `str` The creator of the document. Defaults to `""`.
|
|
@@ -430,6 +458,27 @@ A `Document` object contains the following attributes:
|
|
| 430 |
- `"DONE"`
|
| 431 |
- `"FAIL"`
|
| 432 |
- `status`: `str` Reserved for future use.
|
| 433 |
|
| 434 |
### Examples
|
| 435 |
|
|
|
|
| 75 |
- `"picture"`: Picture
|
| 76 |
- `"one"`: One
|
| 77 |
- `"knowledge_graph"`: Knowledge Graph
|
|
|
|
| 78 |
|
| 79 |
#### parser_config
|
| 80 |
|
| 81 |
+
The parser configuration of the dataset. A `ParserConfig` object's attributes vary based on the selected `"chunk_method"`:
|
| 82 |
+
|
| 83 |
+
- `"chunk_method"`=`"naive"`:
|
| 84 |
+
`{"chunk_token_num":128,"delimiter":"\\n!?;。;!?","html4excel":False,"layout_recognize":True,"raptor":{"use_raptor":False}}`.
|
| 85 |
+
- `chunk_method`=`"qa"`:
|
| 86 |
+
`{"raptor": {"use_raptor": False}}`
|
| 87 |
+
- `chunk_method`=`"manual"`:
|
| 88 |
+
`{"raptor": {"use_raptor": False}}`
|
| 89 |
+
- `chunk_method`=`"table"`:
|
| 90 |
+
`None`
|
| 91 |
+
- `chunk_method`=`"paper"`:
|
| 92 |
+
`{"raptor": {"use_raptor": False}}`
|
| 93 |
+
- `chunk_method`=`"book"`:
|
| 94 |
+
`{"raptor": {"use_raptor": False}}`
|
| 95 |
+
- `chunk_method`=`"laws"`:
|
| 96 |
+
`{"raptor": {"use_raptor": False}}`
|
| 97 |
+
- `chunk_method`=`"presentation"`:
|
| 98 |
+
`{"raptor": {"use_raptor": False}}`
|
| 99 |
+
- `chunk_method`=`"one"`:
|
| 100 |
+
`None`
|
| 101 |
+
- `chunk_method`=`"knowledge_graph"`:
|
| 102 |
+
`{"chunk_token_num":128,"delimiter":"\\n!?;。;!?","entity_types":["organization","person","location","event","time"]}`
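One way to work with these per-method defaults from Python is to keep them in a lookup table and layer caller overrides on top. This is a sketch under stated assumptions: `DEFAULTS` and `parser_config_for` are hypothetical names, the delimiter literal is copied verbatim from the list above, the nested key is written `use_raptor` to match the HTTP reference changed in this same commit, and plain dictionaries stand in for `ParserConfig`.

```python
import copy

_RAPTOR_OFF = {"raptor": {"use_raptor": False}}

# Documented defaults per chunk method (Python-side key names).
DEFAULTS = {
    "naive": {
        "chunk_token_num": 128,
        "delimiter": "\\n!?;。;!?",  # copied verbatim from the reference text
        "html4excel": False,
        "layout_recognize": True,
        "raptor": {"use_raptor": False},
    },
    "qa": _RAPTOR_OFF,
    "manual": _RAPTOR_OFF,
    "paper": _RAPTOR_OFF,
    "book": _RAPTOR_OFF,
    "laws": _RAPTOR_OFF,
    "presentation": _RAPTOR_OFF,
    "table": None,
    "one": None,
    "knowledge_graph": {
        "chunk_token_num": 128,
        "delimiter": "\\n!?;。;!?",
        "entity_types": ["organization", "person", "location", "event", "time"],
    },
}


def parser_config_for(chunk_method, overrides=None):
    """Return a fresh copy of the defaults with top-level overrides applied."""
    base = copy.deepcopy(DEFAULTS[chunk_method])
    if base is None:
        if overrides:
            raise ValueError(f"{chunk_method!r} takes no parser_config")
        return None
    # Shallow merge: an override for "raptor" replaces the whole nested dict.
    base.update(overrides or {})
    return base
```

The deep copy keeps callers from mutating the shared defaults, which matters because several methods point at the same `_RAPTOR_OFF` dictionary.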
|
| 103 |
|
| 104 |
### Returns
|
| 105 |
|
|
|
|
| 240 |
- `"picture"`: Picture
|
| 241 |
- `"one"`: One
|
| 242 |
- `"knowledge_graph"`: Knowledge Graph
|
|
|
|
| 243 |
|
| 244 |
### Returns
|
| 245 |
|
|
|
|
| 310 |
A dictionary representing the attributes to update, with the following keys:
|
| 311 |
|
| 312 |
- `"display_name"`: `str` The name of the document to update.
|
| 313 |
- `"chunk_method"`: `str` The parsing method to apply to the document.
|
| 314 |
- `"naive"`: General
|
| 315 |
- `"manual"`: Manual
|
|
|
|
| 322 |
- `"picture"`: Picture
|
| 323 |
- `"one"`: One
|
| 324 |
- `"knowledge_graph"`: Knowledge Graph
|
| 325 |
+
- `"parser_config"`: `dict[str, Any]` The parsing configuration for the document. Its attributes vary based on the selected `"chunk_method"`:
|
| 326 |
+
- `"chunk_method"`=`"naive"`:
|
| 327 |
+
`{"chunk_token_num":128,"delimiter":"\\n!?;。;!?","html4excel":False,"layout_recognize":True,"raptor":{"use_raptor":False}}`.
|
| 328 |
+
- `chunk_method`=`"qa"`:
|
| 329 |
+
`{"raptor": {"use_raptor": False}}`
|
| 330 |
+
- `chunk_method`=`"manual"`:
|
| 331 |
+
`{"raptor": {"use_raptor": False}}`
|
| 332 |
+
- `chunk_method`=`"table"`:
|
| 333 |
+
`None`
|
| 334 |
+
- `chunk_method`=`"paper"`:
|
| 335 |
+
`{"raptor": {"use_raptor": False}}`
|
| 336 |
+
- `chunk_method`=`"book"`:
|
| 337 |
+
`{"raptor": {"use_raptor": False}}`
|
| 338 |
+
- `chunk_method`=`"laws"`:
|
| 339 |
+
`{"raptor": {"use_raptor": False}}`
|
| 340 |
+
- `chunk_method`=`"presentation"`:
|
| 341 |
+
`{"raptor": {"use_raptor": False}}`
|
| 342 |
+
- `chunk_method`=`"one"`:
|
| 343 |
+
`None`
|
| 344 |
+
- `chunk_method`=`"knowledge_graph"`:
|
| 345 |
+
`{"chunk_token_num":128,"delimiter":"\\n!?;。;!?","entity_types":["organization","person","location","event","time"]}`
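The update dictionary described above can be assembled as follows. This is a sketch only: `build_document_update` is a hypothetical helper, and leaving out `"parser_config"` when there is nothing to set (as for `"table"` and `"one"`, whose documented value is `None`) is an assumption rather than documented behaviour.

```python
from typing import Any, Optional


def build_document_update(
    chunk_method: str,
    parser_config: Optional[dict[str, Any]] = None,
    display_name: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble the update dictionary, leaving out keys that were not supplied."""
    update: dict[str, Any] = {"chunk_method": chunk_method}
    if parser_config is not None:
        update["parser_config"] = parser_config
    if display_name is not None:
        update["display_name"] = display_name
    return update
```

For example, `build_document_update("qa", {"raptor": {"use_raptor": False}})` produces a dictionary with exactly the `"chunk_method"` and `"parser_config"` keys.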
|
| 346 |
|
| 347 |
### Returns
|
| 348 |
|
|
|
|
| 441 |
- `thumbnail`: The thumbnail image of the document. Defaults to `None`.
|
| 442 |
- `dataset_id`: The dataset ID associated with the document. Defaults to `None`.
|
| 443 |
- `chunk_method`: The chunk method name. Defaults to `"naive"`.
|
|
|
|
| 444 |
- `source_type`: The source type of the document. Defaults to `"local"`.
|
| 445 |
- `type`: Type or category of the document. Defaults to `""`. Reserved for future use.
|
| 446 |
- `created_by`: `str` The creator of the document. Defaults to `""`.
|
|
|
|
| 458 |
- `"DONE"`
|
| 459 |
- `"FAIL"`
|
| 460 |
- `status`: `str` Reserved for future use.
|
| 461 |
+
- `parser_config`: `ParserConfig` Configuration object for the parser. Its attributes vary based on the selected `chunk_method`:
|
| 462 |
+
- `chunk_method`=`"naive"`:
|
| 463 |
+
`{"chunk_token_num":128,"delimiter":"\\n!?;。;!?","html4excel":False,"layout_recognize":True,"raptor":{"use_raptor":False}}`.
|
| 464 |
+
- `chunk_method`=`"qa"`:
|
| 465 |
+
`{"raptor": {"use_raptor": False}}`
|
| 466 |
+
- `chunk_method`=`"manual"`:
|
| 467 |
+
`{"raptor": {"use_raptor": False}}`
|
| 468 |
+
- `chunk_method`=`"table"`:
|
| 469 |
+
`None`
|
| 470 |
+
- `chunk_method`=`"paper"`:
|
| 471 |
+
`{"raptor": {"use_raptor": False}}`
|
| 472 |
+
- `chunk_method`=`"book"`:
|
| 473 |
+
`{"raptor": {"use_raptor": False}}`
|
| 474 |
+
- `chunk_method`=`"laws"`:
|
| 475 |
+
`{"raptor": {"use_raptor": False}}`
|
| 476 |
+
- `chunk_method`=`"presentation"`:
|
| 477 |
+
`{"raptor": {"use_raptor": False}}`
|
| 478 |
+
- `chunk_method`=`"one"`:
|
| 479 |
+
`None`
|
| 480 |
+
- `chunk_method`=`"knowledge_graph"`:
|
| 481 |
+
`{"chunk_token_num":128,"delimiter": "\\n!?;。;!?","entity_types":["organization","person","location","event","time"]}`
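Since `parser_config` may be `None` (for `"table"` and `"one"`) or may lack a `"raptor"` entry (for `"knowledge_graph"`), code that reads it should tolerate both cases. A small sketch, assuming the configuration is available as a plain dictionary; whether a `ParserConfig` object supports this access pattern is not stated here, `raptor_enabled` is a hypothetical helper, and the nested key is written `use_raptor` as in the HTTP reference.

```python
from typing import Any, Optional


def raptor_enabled(parser_config: Optional[dict[str, Any]]) -> bool:
    """Return True only when the config explicitly switches RAPTOR on."""
    if not parser_config:
        return False  # None or empty: nothing configured
    raptor = parser_config.get("raptor") or {}
    return bool(raptor.get("use_raptor", False))
```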
|
| 482 |
|
| 483 |
### Examples
|
| 484 |
|