| # Parsing Model Output | |
| The `common` library contains a PEG parser implementation suitable for parsing | |
| model output. | |
| Types with the prefix `common_peg_*` are intended for general use and may have | |
| applications beyond parsing model output, such as parsing user-provided regex | |
| patterns. | |
| Types with the prefix `common_chat_peg_*` are specialized helpers for model | |
| output. | |
| The parser features: | |
| - Partial parsing of streaming input | |
| - Built-in JSON parsers | |
| - AST generation with semantics via "tagged" nodes | |
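To make "partial parsing of streaming input" concrete, here is a minimal standalone sketch. It does not use the `common` library: `match_status` and `match_literal` are hypothetical names invented for this illustration, and the library's own result type may look different. The idea is that a matcher distinguishes "this can never match" from "this could still match once more tokens arrive".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical three-state result (not the library's type).
enum class match_status { success, need_more_input, fail };

// Match `lit` at `pos` in `input`. If the input ends while it still agrees
// with `lit`, report need_more_input instead of failing outright.
match_status match_literal(const std::string & input, std::size_t pos, const std::string & lit) {
    assert(pos <= input.size());
    for (std::size_t i = 0; i < lit.size(); ++i) {
        if (pos + i >= input.size()) {
            return match_status::need_more_input; // ran out of input mid-literal
        }
        if (input[pos + i] != lit[i]) {
            return match_status::fail; // definite mismatch
        }
    }
    return match_status::success;
}
```

With this shape, a streamed prefix such as `<tool_` is reported as incomplete rather than as a parse error, so the caller can wait for the next chunk.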
| ## Example | |
| Below is a contrived example demonstrating how to use the PEG parser to parse | |
| output from a model that emits arguments as JSON. | |
| ```cpp | |
| auto parser = build_chat_peg_native_parser([&](common_chat_peg_native_builder & p) { | |
|     // Build a choice of all available tools | |
|     auto tool_choice = p.choice(); | |
|     for (const auto & tool : tools) { | |
|         const auto & function = tool.at("function"); | |
|         std::string name = function.at("name"); | |
|         const auto & schema = function.at("parameters"); | |
|         auto tool_name = p.json_member("name", "\"" + p.literal(name) + "\""); | |
|         auto tool_args = p.json_member("arguments", p.schema(p.json(), "tool-" + name + "-schema", schema)); | |
|         tool_choice |= p.rule("tool-" + name, "{" << tool_name << "," << tool_args << "}"); | |
|     } | |
|     // Define the tool call structure: <tool_call>[{tool}]</tool_call> | |
|     auto tool_call = p.trigger_rule("tool-call", | |
|         p.sequence({ | |
|             p.literal("<tool_call>["), | |
|             tool_choice, | |
|             p.literal("]</tool_call>") | |
|         }) | |
|     ); | |
|     // Parser accepts content, optionally followed by a tool call | |
|     return p.sequence({ | |
|         p.content(p.until("<tool_call>")), | |
|         p.optional(tool_call), | |
|         p.end() | |
|     }); | |
| }); | |
| ``` | |
| For a more complete example, see `test_example_native()` in | |
| [tests/test-chat-peg-parser.cpp](/tests/test-chat-peg-parser.cpp). | |
| ## Parsers/Combinators | |
| ### Basic Matchers | |
| - **`eps()`** - Matches nothing and always succeeds (epsilon/empty match) | |
| - **`start()`** - Matches the start of input (anchor `^`) | |
| - **`end()`** - Matches the end of input (anchor `$`) | |
| - **`literal(string)`** - Matches an exact literal string | |
| - **`any()`** - Matches any single character (`.`) | |
| ### Combinators | |
| - **`sequence(...)`** - Matches parsers in order; all must succeed | |
| - **`choice(...)`** - Matches the first parser that succeeds from alternatives (ordered choice) | |
| - **`one_or_more(p)`** - Matches one or more repetitions (`+`) | |
| - **`zero_or_more(p)`** - Matches zero or more repetitions (`*`) | |
| - **`optional(p)`** - Matches zero or one occurrence (`?`) | |
| - **`repeat(p, min, max)`** - Matches between min and max repetitions (use `-1` for unbounded) | |
| - **`repeat(p, n)`** - Matches exactly n repetitions | |
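The semantics of `sequence` and ordered `choice` can be sketched with a few lines of standalone C++. This is a toy model, not the library's implementation: `toy_parser`, `toy_literal`, `toy_sequence`, and `toy_choice` are hypothetical names, and the real combinators also build an AST and handle partial input.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Toy parser: given the input and a position, return the new position on success.
using toy_parser = std::function<std::optional<std::size_t>(const std::string &, std::size_t)>;

// Matches an exact string at the current position.
toy_parser toy_literal(std::string lit) {
    return [lit](const std::string & in, std::size_t pos) -> std::optional<std::size_t> {
        if (pos <= in.size() && in.compare(pos, lit.size(), lit) == 0) {
            return pos + lit.size();
        }
        return std::nullopt;
    };
}

// All parts must succeed, each starting where the previous one stopped.
toy_parser toy_sequence(std::vector<toy_parser> parts) {
    return [parts](const std::string & in, std::size_t pos) -> std::optional<std::size_t> {
        for (const auto & part : parts) {
            auto next = part(in, pos);
            if (!next) {
                return std::nullopt;
            }
            pos = *next;
        }
        return pos;
    };
}

// Ordered choice: the first alternative that succeeds wins; later ones are never tried.
toy_parser toy_choice(std::vector<toy_parser> alts) {
    return [alts](const std::string & in, std::size_t pos) -> std::optional<std::size_t> {
        for (const auto & alt : alts) {
            if (auto next = alt(in, pos)) {
                return next;
            }
        }
        return std::nullopt;
    };
}
```

Because the choice is ordered, `toy_choice({toy_literal("ab"), toy_literal("abc")})` commits to `"ab"` on the input `"abcd"`, so a following `toy_literal("d")` fails even though the second alternative would have let the sequence succeed. Listing longer alternatives first avoids this.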
| ### Lookahead | |
| - **`peek(p)`** - Positive lookahead: succeeds if parser succeeds without consuming input (`&`) | |
| - **`negate(p)`** - Negative lookahead: succeeds if parser fails without consuming input (`!`) | |
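The key property of both lookahead forms is that the position does not advance. The following standalone sketch shows this for literal strings only; `toy_pos`, `consume_literal`, `peek_literal`, and `negate_literal` are hypothetical helpers written for this illustration, not library functions.

```cpp
#include <cstddef>
#include <optional>
#include <string>

using toy_pos = std::optional<std::size_t>;

// Consuming matcher: on success, returns the position just past `lit`.
toy_pos consume_literal(const std::string & in, std::size_t pos, const std::string & lit) {
    if (pos <= in.size() && in.compare(pos, lit.size(), lit) == 0) {
        return pos + lit.size();
    }
    return std::nullopt;
}

// Positive lookahead: succeeds where `lit` would match, but stays at `pos`.
toy_pos peek_literal(const std::string & in, std::size_t pos, const std::string & lit) {
    if (consume_literal(in, pos, lit)) {
        return pos;
    }
    return std::nullopt;
}

// Negative lookahead: succeeds where `lit` would NOT match, and stays at `pos`.
toy_pos negate_literal(const std::string & in, std::size_t pos, const std::string & lit) {
    if (consume_literal(in, pos, lit)) {
        return std::nullopt;
    }
    return pos;
}
```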
| ### Character Classes & Utilities | |
| - **`chars(classes, min, max)`** - Matches repetitions of characters from a character class | |
| - **`space()`** - Matches zero or more whitespace characters (space, tab, newline) | |
| - **`until(delimiter)`** - Matches characters until delimiter is found (delimiter not consumed) | |
| - **`until_one_of(delimiters)`** - Matches characters until any delimiter in the list is found | |
| - **`rest()`** - Matches everything remaining (`.*`) | |
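As an illustration of what "until the delimiter, delimiter not consumed" means for streamed text, here is a standalone sketch. `toy_until` is a hypothetical function, and holding back a trailing partial delimiter is an assumption about one reasonable streaming strategy; the library's actual behavior may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Returns how many characters starting at `pos` can safely be treated as
// text before the delimiter. The delimiter itself is never consumed.
// If the input ends with a proper prefix of the delimiter, that tail is
// held back, since more input may complete it.
std::size_t toy_until(const std::string & in, std::size_t pos, const std::string & delim) {
    assert(pos <= in.size());
    assert(!delim.empty());

    const std::size_t found = in.find(delim, pos);
    if (found != std::string::npos) {
        return found - pos; // full delimiter present: stop right before it
    }

    const std::size_t avail    = in.size() - pos;
    const std::size_t max_tail = std::min(avail, delim.size() - 1);
    for (std::size_t k = max_tail; k > 0; --k) {
        // Does the last k characters of the input equal the first k of the delimiter?
        if (in.compare(in.size() - k, k, delim, 0, k) == 0) {
            return avail - k;
        }
    }
    return avail; // no delimiter in sight: everything is content
}
```

For the input `Hello<to` and the delimiter `<tool_call>`, this reports five safe characters (`Hello`) and leaves `<to` pending until the next chunk settles whether a tool call is starting.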
| ### JSON Parsers | |
| - **`json()`** - Complete JSON parser (objects, arrays, strings, numbers, booleans, null) | |
| - **`json_object()`** - JSON object parser | |
| - **`json_array()`** - JSON array parser | |
| - **`json_string()`** - JSON string parser | |
| - **`json_number()`** - JSON number parser | |
| - **`json_bool()`** - JSON boolean parser | |
| - **`json_null()`** - JSON null parser | |
| - **`json_string_content()`** - JSON string content without surrounding quotes | |
| - **`json_member(key, p)`** - JSON object member with specific key and value parser | |
| ### Grammar Building | |
| - **`ref(name)`** - Creates a lightweight reference to a named rule (for recursive grammars) | |
| - **`rule(name, p, trigger)`** - Creates a named rule and returns a reference | |
| - **`trigger_rule(name, p)`** - Creates a trigger rule (entry point for lazy grammar generation) | |
| - **`schema(p, name, schema, raw)`** - Wraps parser with JSON schema metadata for grammar generation | |
| ### AST Control | |
| - **`atomic(p)`** - Prevents AST node creation for partial parses | |
| - **`tag(tag, p)`** - Creates AST nodes with semantic tags (multiple nodes can share tags) | |
| ## GBNF Grammar Generation | |
| The PEG parser also acts as a convenient DSL for generating GBNF grammars, with | |
| some exceptions. | |
| ```cpp | |
| data.grammar = build_grammar([&](const common_grammar_builder & builder) { | |
|     foreach_function(params.tools, [&](const json & fn) { | |
|         builder.resolve_refs(fn.at("parameters")); | |
|     }); | |
|     parser.build_grammar(builder, data.grammar_lazy); | |
| }); | |
| ``` | |
| The notable exception is the `negate(p)` lookahead parser, which cannot be | |
| expressed in a context-free grammar and therefore does not produce a rule. Its | |
| usage should be limited and preferably hidden behind a `schema()` parser. In | |
| many cases, `until(delimiter)` or `until_one_of(delimiters)` is a better choice. | |
| Another limitation is that the PEG parser requires an unambiguous grammar. In | |
| contrast, the `llama-grammar` implementation can support ambiguous grammars, | |
| though they are difficult to parse. | |
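The "parser as grammar DSL" idea can be sketched independently of the library: the same expression tree that drives parsing is walked to print a GBNF rule body. Everything below (`toy_expr`, `lit`, `seq`, `alt`, `to_gbnf`) is hypothetical and covers only literals, sequences, and alternations. There is no case for a negative lookahead because GBNF has no construct to print for it.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy expression tree shared by a (not shown) parser and the grammar emitter.
struct toy_expr {
    enum kind_t { LITERAL, SEQUENCE, CHOICE } kind;
    std::string           text;     // used by LITERAL
    std::vector<toy_expr> children; // used by SEQUENCE and CHOICE
};

toy_expr lit(std::string s)            { return {toy_expr::LITERAL,  std::move(s), {}}; }
toy_expr seq(std::vector<toy_expr> cs) { return {toy_expr::SEQUENCE, "", std::move(cs)}; }
toy_expr alt(std::vector<toy_expr> cs) { return {toy_expr::CHOICE,   "", std::move(cs)}; }

// Print the expression as a GBNF rule body.
std::string to_gbnf(const toy_expr & e) {
    if (e.kind == toy_expr::LITERAL) {
        std::string out = "\"";
        for (char c : e.text) {
            if (c == '"' || c == '\\') {
                out += '\\'; // escape quotes and backslashes inside the literal
            }
            out += c;
        }
        return out + "\"";
    }
    const std::string sep = e.kind == toy_expr::SEQUENCE ? " " : " | ";
    std::string out;
    for (std::size_t i = 0; i < e.children.size(); ++i) {
        if (i > 0) {
            out += sep;
        }
        out += to_gbnf(e.children[i]);
    }
    // Alternations are parenthesized so they nest safely inside sequences.
    return e.kind == toy_expr::CHOICE ? "(" + out + ")" : out;
}
```

For example, `to_gbnf(seq({lit("<tool_call>"), alt({lit("a"), lit("b")}), lit("</tool_call>")}))` yields `"<tool_call>" ("a" | "b") "</tool_call>"`.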
| ### Lazy Grammars | |
| During lazy grammar generation, only rules reachable from a | |
| `trigger_rule(name, p)` are emitted in the grammar. All trigger rules are added | |
| as alternations in the root rule. Trigger patterns must still be defined | |
| separately, because the parser does not interact with grammar sampling. | |
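"Reachable from a trigger rule" is a plain graph traversal. The sketch below models a grammar as a map from rule name to the rule names it references and collects everything reachable from the trigger rules. `toy_rules` and `reachable_rules` are hypothetical names for this illustration, and the rule names in the usage note are made up.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy grammar: rule name -> names of the rules it references.
using toy_rules = std::map<std::string, std::vector<std::string>>;

// Collect every rule reachable from the given trigger rules (depth-first).
std::set<std::string> reachable_rules(const toy_rules & rules, const std::vector<std::string> & triggers) {
    std::set<std::string>    seen;
    std::vector<std::string> stack(triggers.begin(), triggers.end());
    while (!stack.empty()) {
        const std::string name = stack.back();
        stack.pop_back();
        const auto it = rules.find(name);
        if (it == rules.end()) {
            continue; // dangling reference: ignored in this toy
        }
        if (!seen.insert(name).second) {
            continue; // already visited (also guards against recursive rules)
        }
        for (const auto & ref : it->second) {
            stack.push_back(ref);
        }
    }
    return seen;
}
```

Given rules `root -> {content, tool-call}`, `tool-call -> {tool-get_weather}`, `tool-get_weather -> {json}`, and `json -> {json}`, starting from the trigger `tool-call` yields `tool-call`, `tool-get_weather`, and `json`, while `root` and `content` are left out.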
| ### JSON Schema | |
| The `schema(p, name, schema, raw)` parser will use the `json-schema-to-grammar` | |
| implementation to generate the grammar instead of the underlying parser. | |
| The `raw` option emits a grammar suitable for a raw string instead of a JSON | |
| string. In other words, it won't be wrapped in quotes or require escaping | |
| quotes. It should only be used when `type == "string"`. | |
| The downside of `raw` is that it can lead to ambiguous grammars. For | |
| example, if a user provides the pattern `^.*$`, the following grammar may be | |
| generated: | |
| ``` | |
| root ::= "<arg>" .* "</arg>" | |
| ``` | |
| This creates an ambiguous grammar that cannot be parsed by the PEG parser. To | |
| help mitigate this, if `.*` is found in the pattern, the grammar from the | |
| underlying parser will be emitted instead. | |
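Why `"<arg>" .* "</arg>"` defeats a PEG parser can be shown with a small standalone model. Under PEG semantics, `.*` is greedy and never gives characters back, so nothing is left for the closing tag. The function names below are hypothetical, and the second function models the `until("</arg>")` style of value matching rather than any library code.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Consume `lit` at `pos`, returning the position after it on success.
std::optional<std::size_t> expect_literal(const std::string & in, std::size_t pos, const std::string & lit) {
    if (pos <= in.size() && in.compare(pos, lit.size(), lit) == 0) {
        return pos + lit.size();
    }
    return std::nullopt;
}

// "<arg>" .* "</arg>" evaluated the PEG way: `.*` swallows the rest of the input.
bool parse_arg_greedy(const std::string & in) {
    const auto open = expect_literal(in, 0, "<arg>");
    if (!open) {
        return false;
    }
    const std::size_t after_any = in.size(); // greedy `.*`, no backtracking
    return expect_literal(in, after_any, "</arg>").has_value();
}

// Same shape, but the value is "everything up to the closing tag".
bool parse_arg_until(const std::string & in) {
    const std::string close_tag = "</arg>";
    const auto open = expect_literal(in, 0, "<arg>");
    if (!open) {
        return false;
    }
    const std::size_t close = in.find(close_tag, *open);
    if (close == std::string::npos) {
        return false;
    }
    return close + close_tag.size() == in.size(); // closing tag must end the input
}
```

On the input `<arg>hello</arg>`, `parse_arg_greedy` fails while `parse_arg_until` succeeds.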
| ## Common AST Shapes for Chat Parsing | |
| Most model output can be placed in one of the following categories: | |
| - Content only | |
| - Tool calling with arguments emitted as a single JSON object | |
| - Tool calling with arguments emitted as separate entities, either XML | |
| (Qwen3-Coder, MiniMax M2) or pseudo-function calls (LFM2) | |
| To provide broad coverage, | |
| [`common/chat-peg-parser.h`](/common/chat-peg-parser.h) contains builders and | |
| mappers that help create parsers and visitors/extractors for these types. They | |
| require parsers to tag nodes to conform to an AST "shape". This normalization | |
| makes it easy to extract information and generalize parsing. | |
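The benefit of a fixed AST shape is that extraction code no longer depends on the model's surface syntax. A minimal standalone sketch of that idea follows. `toy_node`, `toy_msg`, and `map_simple` are hypothetical, the flat node list stands in for a real AST, and the tag strings `"reasoning"` and `"content"` are assumptions chosen for this example.

```cpp
#include <string>
#include <vector>

// A tagged piece of matched text (stand-in for an AST node).
struct toy_node {
    std::string tag;
    std::string text;
};

// The fields a content-only shape is expected to fill.
struct toy_msg {
    std::string reasoning_content;
    std::string content;
};

// Walk the nodes in order and route each one by tag.
toy_msg map_simple(const std::vector<toy_node> & nodes) {
    toy_msg msg;
    for (const auto & node : nodes) {
        if (node.tag == "reasoning") {
            msg.reasoning_content += node.text;
        } else if (node.tag == "content") {
            msg.content += node.text;
        }
        // Nodes with other tags (or none) carry no meaning for this shape.
    }
    return msg;
}
```

Two models that wrap reasoning in different markers can share this mapper as long as their parsers tag the inner text the same way.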
| ### Simple | |
| The `common_chat_peg_builder` builds a `simple` parser that supports | |
| content-only models with optional reasoning. | |
| - **`reasoning(p)`** - Tag node for extracting `reasoning_content` | |
| - **`content(p)`** - Tag node for extracting `content` | |
| ```cpp | |
| build_chat_peg_parser([&](common_chat_peg_builder & p) { | |
|     return p.sequence({ | |
|         p.optional("<think>" + p.reasoning(p.until("</think>")) + "</think>"), | |
|         p.content(p.until("<tool_call>")), | |
|         p.end() | |
|     }); | |
| }); | |
| ``` | |
| Use `common_chat_peg_mapper` to extract the content. Note that this is already | |
| done for you in `common_chat_peg_parser` when | |
| `chat_format == COMMON_CHAT_FORMAT_PEG_SIMPLE`. | |
| ```cpp | |
| auto result = parser.parse(ctx); | |
| common_chat_msg msg; | |
| auto mapper = common_chat_peg_mapper(msg); | |
| mapper.from_ast(ctx.ast, result); | |
| ``` | |
| ### Native | |
| The `common_chat_peg_native_builder` builds a `native` parser suitable for | |
| models that emit tool arguments as a direct JSON object. | |
| - **`reasoning(p)`** - Tag node for `reasoning_content` | |
| - **`content(p)`** - Tag node for `content` | |
| - **`tool(p)`** - Tag entirety of a single tool call | |
| - **`tool_open(p)`** - Tag start of a tool call | |
| - **`tool_close(p)`** - Tag end of a tool call | |
| - **`tool_id(p)`** - Tag the tool call ID (optional) | |
| - **`tool_name(p)`** - Tag the tool name | |
| - **`tool_args(p)`** - Tag the tool arguments | |
| ```cpp | |
| build_chat_peg_native_parser([&](common_chat_peg_native_builder & p) { | |
|     auto get_weather_tool = p.tool(p.sequence({ | |
|         p.tool_open(p.literal("{")), | |
|         p.json_member("name", "\"" + p.tool_name(p.literal("get_weather")) + "\""), | |
|         p.literal(","), | |
|         p.json_member("arguments", p.tool_args(p.json())), | |
|         p.tool_close(p.literal("}")) | |
|     })); | |
|     return p.sequence({ | |
|         p.content(p.until("<tool_call>")), | |
|         p.literal("<tool_call>"), | |
|         get_weather_tool, | |
|         p.literal("</tool_call>"), | |
|         p.end() | |
|     }); | |
| }); | |
| ``` | |
| ### Constructed | |
| The `common_chat_peg_constructed_builder` builds a `constructed` parser | |
| suitable for models that emit tool arguments as separate entities, such as XML | |
| tags. | |
| - **`reasoning(p)`** - Tag node for `reasoning_content` | |
| - **`content(p)`** - Tag node for `content` | |
| - **`tool(p)`** - Tag entirety of a single tool call | |
| - **`tool_open(p)`** - Tag start of a tool call | |
| - **`tool_close(p)`** - Tag end of a tool call | |
| - **`tool_name(p)`** - Tag the tool name | |
| - **`tool_arg(p)`** - Tag a complete tool argument (name + value) | |
| - **`tool_arg_open(p)`** - Tag start of a tool argument | |
| - **`tool_arg_close(p)`** - Tag end of a tool argument | |
| - **`tool_arg_name(p)`** - Tag the argument name | |
| - **`tool_arg_string_value(p)`** - Tag string value for the argument | |
| - **`tool_arg_json_value(p)`** - Tag JSON value for the argument | |
| ```cpp | |
| build_chat_peg_constructed_parser([&](common_chat_peg_constructed_builder & p) { | |
|     // tool_arg(p) takes a single parser, so the parts are grouped with sequence() | |
|     auto location_arg = p.tool_arg(p.sequence({ | |
|         p.tool_arg_open("<parameter name=\"" + p.tool_arg_name(p.literal("location")) + "\">"), | |
|         p.tool_arg_string_value(p.until("</parameter>")), | |
|         p.tool_arg_close(p.literal("</parameter>")) | |
|     })); | |
|     auto get_weather_tool = p.tool(p.sequence({ | |
|         p.tool_open("<function name=\"" + p.tool_name(p.literal("get_weather")) + "\">"), | |
|         location_arg, | |
|         p.tool_close(p.literal("</function>")) | |
|     })); | |
|     return p.sequence({ | |
|         p.content(p.until("<tool_call>")), | |
|         p.literal("<tool_call>"), | |
|         get_weather_tool, | |
|         p.literal("</tool_call>"), | |
|         p.end() | |
|     }); | |
| }); | |
| ``` | |
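To show why the constructed shape distinguishes string values from JSON values, here is a standalone sketch of how tagged nodes might be assembled into a JSON arguments object. It is not the library's mapper: `toy_tagged`, `toy_call`, `json_quote`, and `map_constructed` are hypothetical, the tag strings mirror the helper names above as an assumption, and the string escaping is deliberately minimal.

```cpp
#include <string>
#include <vector>

// A tagged piece of matched text (stand-in for an AST node).
struct toy_tagged {
    std::string tag;
    std::string text;
};

struct toy_call {
    std::string name;
    std::string arguments; // JSON object, serialized
};

// Minimal JSON string quoting: escapes only quotes and backslashes.
std::string json_quote(const std::string & s) {
    std::string out = "\"";
    for (char c : s) {
        if (c == '"' || c == '\\') {
            out += '\\';
        }
        out += c;
    }
    return out + "\"";
}

// Assemble one tool call from nodes in document order.
toy_call map_constructed(const std::vector<toy_tagged> & nodes) {
    toy_call    call;
    std::string pending_name;
    bool        first = true;
    call.arguments = "{";
    for (const auto & node : nodes) {
        if (node.tag == "tool_name") {
            call.name = node.text;
        } else if (node.tag == "tool_arg_name") {
            pending_name = node.text;
        } else if (node.tag == "tool_arg_string_value" || node.tag == "tool_arg_json_value") {
            if (!first) {
                call.arguments += ",";
            }
            first = false;
            call.arguments += json_quote(pending_name) + ":";
            // String values need quoting; JSON values are already valid JSON text.
            call.arguments += node.tag == "tool_arg_string_value" ? json_quote(node.text) : node.text;
        }
    }
    call.arguments += "}";
    return call;
}
```

A `location` argument tagged as a string value with text `Paris`, followed by a `days` argument tagged as a JSON value with text `3`, produces `{"location":"Paris","days":3}`.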