# Structured Output (JSON)

LoRAX can enforce that responses consist only of valid JSON and adhere to a provided JSON schema.
## Background: Structured Generation
LoRAX enforces adherence to a schema through a process known as structured generation (also called constrained decoding). Unlike guess-and-check validation methods, structured generation manipulates the next token likelihoods (logits) to enforce adherence to a schema at the token level. During each forward pass of inference, LLMs produce a probability distribution over their vocabulary of tokens. The token that is actually generated is selected by sampling from this distribution.
Suppose you've tasked an LLM with generating some valid JSON, and so far the LLM has produced the text `{ "name"`. When considering the next token to output, it's clear that tokens like `A` or `<` will not result in valid JSON. Structured generation prevents the LLM from selecting an invalid token by modifying the probability distribution and setting the likelihood of invalid tokens to `-infinity`. In this way, we can guarantee that, at each step, only tokens that will produce valid JSON can be selected.
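The masking step described above can be sketched in a few lines. The vocabulary, logits, and allowed set below are made up purely for illustration; a real model's vocabulary has tens of thousands of tokens:

```python
import math

# Toy vocabulary and (made-up) logits from one forward pass.
logits = {'"': 2.0, "A": 3.5, "<": 1.0, ":": 0.5, "}": 0.1}

# Suppose the schema says only '"' or '}' may come next.
allowed = {'"', "}"}

# Set the likelihood of every invalid token to -infinity.
masked = {tok: (lg if tok in allowed else -math.inf) for tok, lg in logits.items()}

# Softmax over the masked logits: invalid tokens get probability exactly 0,
# so sampling can never select them.
total = sum(math.exp(lg) for lg in masked.values())
probs = {tok: math.exp(lg) / total for tok, lg in masked.items()}

print(probs["A"])                 # 0.0 -- an invalid token can never be sampled
print(max(probs, key=probs.get))  # the most likely *valid* token
```

Note that `A` had the highest raw logit, but after masking it can never be sampled; the distribution is renormalized over the valid tokens only.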
## Caveats

- Structured generation does not guarantee the quality of generated text, only its form. Structured generation may force the LLM to output valid JSON, but it can't ensure that the content of the JSON is desirable or accurate.
- Even with structured generation enabled, LLM output may not be fully valid JSON if `max_new_tokens` is too low, as this could result in necessary tokens (e.g., a closing `}`) being cut off.
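The second caveat is easy to reproduce with the standard library alone. The string below is a hypothetical generation that ran out of tokens before the closing brace:

```python
import json

# Hypothetical output cut off by a too-small max_new_tokens:
# the closing brace was never generated.
truncated = '{"name": "Aldric", "age": 32, "armor": "plate", "strength": 9'

try:
    json.loads(truncated)
except json.JSONDecodeError as err:
    print(f"Invalid JSON: {err.msg}")
```

In practice, set `max_new_tokens` generously enough for the largest object your schema can describe, and treat a `JSONDecodeError` as a signal to retry or raise the limit.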
## Structured Generation with Outlines

Outlines is an open-source library supporting various ways of specifying and enforcing structured generation rules on LLM outputs.

LoRAX uses Outlines to support structured generation following a user-provided JSON schema. The JSON schema is converted into a regular expression, and then into a finite-state machine (FSM). At each decoding step, LoRAX uses this FSM to determine the set of valid next tokens and sets the likelihood of invalid tokens to `-infinity`.
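As a toy illustration (not Outlines' actual implementation, which operates on model tokens rather than single characters), here is a hand-rolled FSM that recognizes the strings `plate` or `leather` — like the `Armor` enum in the examples below — and reports which characters are valid from each state:

```python
# Character-level FSM for the language {"plate", "leather"}.
# States are integers; state 9 is the accepting state.
transitions = {
    0: {"p": 1, "l": 5},
    1: {"l": 2}, 2: {"a": 3}, 3: {"t": 4}, 4: {"e": 9},
    5: {"e": 6}, 6: {"a": 7}, 7: {"t": 8}, 8: {"h": 10},
    10: {"e": 11}, 11: {"r": 9},
    9: {},  # accepting: nothing more may follow
}

def valid_next(state: int) -> set[str]:
    """The set of characters the FSM allows from this state."""
    return set(transitions[state])

print(valid_next(0))  # {'p', 'l'} -- only these can start a match

state = 0
for ch in "plate":
    assert ch in valid_next(state)
    state = transitions[state][ch]
print(state)  # 9 -- accepting state
```

At decoding time, every token that would step this machine into a dead end gets its logit set to `-infinity`, so generation can only follow paths through the FSM.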
## Example: Python client
This example follows the JSON-structured generation example in the Outlines quickstart.
We assume that you have already deployed LoRAX using a suitable base model and installed the LoRAX Python Client. Alternatively, see below for an example of structured generation using an OpenAI client.
```python
import json
from enum import Enum

from lorax import Client
from pydantic import BaseModel, constr


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"


class Character(BaseModel):
    name: constr(max_length=10)
    age: int
    armor: Armor
    strength: int


client = Client("http://127.0.0.1:8080")

prompt = "Generate a new character for my awesome game: name, age (between 1 and 99), armor and strength. "

response = client.generate(prompt, response_format={
    "type": "json_object",
    "schema": Character.model_json_schema(),
})

my_character = json.loads(response.generated_text)
print(my_character)
```
You can also specify the JSON schema directly rather than using Pydantic:
```python
schema = {
    "$defs": {
        "Armor": {
            "enum": ["leather", "chainmail", "plate"],
            "title": "Armor",
            "type": "string"
        }
    },
    "properties": {
        "name": {"maxLength": 10, "title": "Name", "type": "string"},
        "age": {"title": "Age", "type": "integer"},
        "armor": {"$ref": "#/$defs/Armor"},
        "strength": {"title": "Strength", "type": "integer"}
    },
    "required": ["name", "age", "armor", "strength"],
    "title": "Character",
    "type": "object"
}
```
## Example: OpenAI-compatible API

Structured generation of JSON following a schema is supported via the `response_format` parameter.
> **Note:** Currently a schema is required. This differs from the existing OpenAI JSON mode, in which no schema is supported.
```python
import json
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel, constr


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"


class Character(BaseModel):
    name: constr(max_length=10)
    age: int
    armor: Armor
    strength: int


client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="",  # optional: specify an adapter ID here
    messages=[
        {
            "role": "user",
            "content": "Generate a new character for my awesome game: name, age (between 1 and 99), armor and strength. ",
        },
    ],
    max_tokens=100,
    response_format={
        "type": "json_object",
        "schema": Character.model_json_schema(),
    },
)

my_character = json.loads(resp.choices[0].message.content)
print(my_character)
```