Python API

Note: this API is in preview and is subject to change.

Install and import

The Python API is delivered by the onnxruntime-genai Python package.

pip install onnxruntime-genai
import onnxruntime_genai

Model class

Load a model

Loads the ONNX model(s) and configuration from a folder on disk.

onnxruntime_genai.Model(model_folder: str) -> onnxruntime_genai.Model

Parameters

  • model_folder: Location of model and configuration on disk

Returns

onnxruntime_genai.Model

Generate method

onnxruntime_genai.Model.generate(params: GeneratorParams) -> numpy.ndarray[int, int]

Parameters

  • params: (Required) A GeneratorParams object, created with the GeneratorParams constructor.

Returns

numpy.ndarray[int, int]: a two-dimensional numpy array whose dimensions are the batch size passed in and the maximum token sequence length.

Device type

Return the device type that the model has been configured to run on.

onnxruntime_genai.Model.device_type

Returns

str: a string describing the device that the loaded model will run on
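As a minimal sketch of the two calls above, the helper below loads a model folder and reads back its device type. The name `load_model` is hypothetical, and the import sits inside the function only so that the helper can be defined before the package is installed.

```python
def load_model(model_folder):
    """Load a model folder and report the device it is configured for."""
    # Imported lazily; requires `pip install onnxruntime-genai`.
    import onnxruntime_genai as og

    model = og.Model(model_folder)
    # device_type is documented above as a string describing the device.
    return model, model.device_type
```

A call might look like `model, device = load_model("path/to/model_folder")`, where the path is a placeholder for a folder containing the ONNX model(s) and configuration.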

Tokenizer class

Create tokenizer object

onnxruntime_genai.Tokenizer(model: onnxruntime_genai.Model) -> onnxruntime_genai.Tokenizer

Parameters

  • model: (Required) The model loaded by Model()

Returns

  • Tokenizer: The tokenizer object

Encode

onnxruntime_genai.Tokenizer.encode(text: str) -> numpy.ndarray[numpy.int32]

Parameters

  • text: (Required) The text to encode, such as a prompt

Returns

numpy.ndarray[numpy.int32]: an array of tokens representing the prompt

Decode

onnxruntime_genai.Tokenizer.decode(tokens: numpy.ndarray[int]) -> str

Parameters

  • tokens: (Required) a sequence of generated tokens, as a numpy.ndarray[numpy.int32]

Returns

str: the decoded generated tokens

Encode batch

onnxruntime_genai.Tokenizer.encode_batch(texts: list[str]) -> numpy.ndarray[int, int]

Parameters

  • texts: A list of inputs

Returns

numpy.ndarray[int, int]: The batch of tokenized strings

Decode batch

onnxruntime_genai.Tokenizer.decode_batch(tokens: [[numpy.int32]]) -> list[str]

Parameters

  • tokens: A batch of token sequences to decode

Returns

list[str]: a batch of decoded text
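The batch methods above can be paired to check how a tokenizer treats several inputs at once. The sketch below assumes `tokenizer` is an already-created Tokenizer; the helper name `round_trip_batch` is hypothetical.

```python
def round_trip_batch(tokenizer, texts):
    """Encode a list of strings as one batch, then decode the batch back."""
    token_batch = tokenizer.encode_batch(texts)  # two-dimensional batch of tokens
    decoded = tokenizer.decode_batch(token_batch)
    return token_batch, decoded
```

Whether the decoded strings match the inputs exactly depends on the tokenizer, so treat any difference as something to inspect rather than as an error.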

Create tokenizer decoding stream

onnxruntime_genai.Tokenizer.create_stream() -> TokenizerStream

Parameters

None

Returns

onnxruntime_genai.TokenizerStream: The tokenizer stream object

TokenizerStream class

This class accumulates the next displayable string (according to the tokenizer’s vocabulary).

Decode method

onnxruntime_genai.TokenizerStream.decode(token: int32) -> str

Parameters

  • token: (Required) A token to decode

Returns

str: If a displayable string has accumulated, this method returns it. If not, this method returns the empty string.
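Because decode can return an empty string while text is still accumulating, callers usually skip empty results. The generator function below is a small sketch of that pattern; `decode_incrementally` is a hypothetical name, and `tokenizer_stream` is assumed to come from Tokenizer.create_stream().

```python
def decode_incrementally(tokenizer_stream, tokens):
    """Yield displayable text chunks as tokens are fed to the stream."""
    for token in tokens:
        chunk = tokenizer_stream.decode(token)
        # An empty string means no displayable text has accumulated yet.
        if chunk:
            yield chunk
```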

GeneratorParams class

Create a GeneratorParams object

onnxruntime_genai.GeneratorParams(model: Model) -> GeneratorParams

Pad token id member

onnxruntime_genai.GeneratorParams.pad_token_id

EOS token id member

onnxruntime_genai.GeneratorParams.eos_token_id

Vocab size member

onnxruntime_genai.GeneratorParams.vocab_size

input_ids member

onnxruntime_genai.GeneratorParams.input_ids: numpy.ndarray[numpy.int32, numpy.int32]

Set model input

onnxruntime_genai.GeneratorParams.set_model_input(name: str, value: [])

Set search options method

onnxruntime_genai.GeneratorParams.set_search_options(options: dict[str, Any])

Try graph capture with max batch size

onnxruntime_genai.GeneratorParams.try_graph_capture_with_max_batch_size(max_batch_size: int)
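To show how these members fit together with Model.generate, here is a hedged sketch of whole-sequence generation for a single prompt. The helper name `generate_text` is hypothetical, and search options are passed as a dict to match the signature listed above. Option names are not listed in this reference, so the caller supplies them, and since the API is in preview the call shape may differ between releases.

```python
def generate_text(model, tokenizer, params, prompt, search_options=None):
    """Run whole-sequence generation for one prompt and decode the result."""
    if search_options:
        # Passed as a dict, following the signature documented above.
        params.set_search_options(search_options)
    # input_ids is documented as two-dimensional, so encode a batch of one.
    params.input_ids = tokenizer.encode_batch([prompt])
    output = model.generate(params)
    # generate() returns one row of tokens per batch entry; decode the first.
    return tokenizer.decode(output[0])
```

A call might look like `generate_text(model, tokenizer, onnxruntime_genai.GeneratorParams(model), "Hello", {"max_length": 64})`, where `max_length` is an assumed option name used only for illustration.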

Generator class

Create a Generator

onnxruntime_genai.Generator(model: Model, params: GeneratorParams) -> Generator

Parameters

  • model: (Required) The model to use for generation
  • params: (Required) The set of parameters that control the generation

Returns

onnxruntime_genai.Generator: The Generator object

Is generation done

onnxruntime_genai.Generator.is_done() -> bool

Returns

bool: True when every sequence has reached the maximum length or the end of sequence.

Compute logits

Runs the model through one iteration.

onnxruntime_genai.Generator.compute_logits()

Get output

Returns an output of the model.

onnxruntime_genai.Generator.get_output(name: str) -> numpy.ndarray

Parameters

  • name: the name of the model output

Returns

  • numpy.ndarray: a multi-dimensional array of the model output. The shape of the array is the shape of the output.

Example

The following code returns the output logits of a model.

logits = generator.get_output("logits")

Generate next token

Calculates the next batch of tokens from the current set of logits and the specified generator parameters, using Top P sampling.

onnxruntime_genai.Generator.generate_next_token()

Get next tokens

onnxruntime_genai.Generator.get_next_tokens() -> numpy.ndarray[numpy.int32]

Returns

numpy.ndarray[numpy.int32]: The most recently generated tokens
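The Generator methods above are typically called in a loop, one token at a time. The sketch below combines them with a TokenizerStream to emit text as it becomes displayable. The helper name `stream_generation` and its `on_text` callback are hypothetical, and a batch size of 1 is assumed.

```python
def stream_generation(generator, tokenizer_stream, on_text=print):
    """Generate token by token, emitting text as it becomes displayable."""
    pieces = []
    while not generator.is_done():
        generator.compute_logits()       # run the model through one iteration
        generator.generate_next_token()  # pick the next token from the logits
        # Batch size 1 is assumed, so only the first entry is read.
        token = generator.get_next_tokens()[0]
        text = tokenizer_stream.decode(token)
        if text:
            on_text(text)
            pieces.append(text)
    return "".join(pieces)
```

Passing a different `on_text` callable, for example one that writes to a UI widget, keeps the loop itself unchanged.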

Get sequence

onnxruntime_genai.Generator.get_sequence(index: int) -> numpy.ndarray[numpy.int32]

Parameters

  • index: (Required) The index of the sequence in the batch to return

Returns

numpy.ndarray[numpy.int32]: The sequence of tokens at the given index
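Once generation has finished, get_sequence can be called once per batch entry to collect the results. The helper below is a minimal sketch under that assumption; `decode_sequences` is a hypothetical name, and the caller is assumed to know the batch size.

```python
def decode_sequences(generator, tokenizer, batch_size):
    """Fetch and decode the token sequence for each entry in the batch."""
    texts = []
    for index in range(batch_size):
        tokens = generator.get_sequence(index)  # tokens for one batch entry
        texts.append(tokenizer.decode(tokens))
    return texts
```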