(gpt-runtime)= # C++ GPT Runtime TensorRT-LLM includes a C++ component to execute TensorRT engines built with the Python API as described in the {ref}`architecture-overview` section. That component is called the C++ runtime. The API of the C++ runtime is composed of the classes declared in [`cpp/include/tensorrt_llm/runtime`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/include/tensorrt_llm/runtime) and implemented in [`cpp/tensorrt_llm/runtime`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/tensorrt_llm/runtime). Even if the different components described in that document mention GPT in their name, they are not restricted to this specific model. Those classes can be used to implement auto-regressive models like BLOOM, GPT-J, GPT-NeoX or LLaMA, for example. Complete support of encoder-decoder models, like T5, will be added to TensorRT-LLM in a future release. An experimental version, only in Python for now, can be found in the [`examples/enc_dec`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) folder. ## Overview Runtime models are described by an instance of the [`ModelConfig`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime//modelConfig.h) class and a pointer to the TensorRT engine that must be executed to perform the inference. The environment is configured through the [`WorldConfig`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime/worldConfig.h) (that name comes from [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) and its "famous" `MPI_COMM_WORLD` default communicator). The [`SamplingConfig`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime/samplingConfig.h) class encapsulates parameters that control the [generation](https://huggingface.co/blog/how-to-generate) of new tokens. ### Model Configuration The model configuration is an instance of the [`ModelConfig`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime//modelConfig.h) class. That class encapsulates the following parameters (they are declared as private member variables and exposed through getters and setters): * `vocabSize`, the size of the vocabulary, * `numLayers`, the number of layers in the model, * `numHeads`, the number of heads in the attention block, * `numKvHeads`, the number of heads for K and V in the attention component. When the number of K/V heads is the same as the number of (Q) heads, the model uses multi-head attention. When the number of K/V heads is 1, it uses multi-query attention. Otherwise, it uses group-query attention. Refer to {ref}`gpt-attention` for more information, * `hiddenSize`, the size of the hidden dimension, * `dataType`, the datatype that was used to build the TensorRT engine and that must be used to run the model during inference, * `useGptAttentionPlugin`, indicates if the {ref}`gpt-attention` operator was compiled using the [GPT Attention plugin](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/tensorrt_llm/plugins/gptAttentionPlugin), * `inputPacked`, indicates that the input must be packed (or padded when set to `false`). For performance reasons, it is recommended to always use packed, even if its default is set to `false` (will be changed in a future release). Refer to {ref}`gpt-attention` for more information, * `pagedKvCache`, indicates if the K/V cache uses paging. Refer to {ref}`gpt-attention` for more information, * `tokensPerBlock`, is the number of tokens in each block of the K/V cache. It's relevant when the paged K/V cache is enabled. By default, the value is 64. Refer to {ref}`gpt-attention` for more information, * `quantMode`, controls the quantization method. Refer to {ref}`precision` for more information. * `maxBatchSize`, indicates the maximum batch size that the TensorRT engine was built for, * `maxInputLen`, the maximum size of the input sequences, * `maxSequenceLen`, the maximum total size (input+output) of the sequences. ### World Configuration Familiarity with [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface), is not required to utilize the TensorRT-LMM C++ runtime. There are two main things you need to know: * The C++ Runtime in TensorRT-LLM uses [processes](https://en.wikipedia.org/wiki/Process_(computing)) to execute TensorRT engines on the different GPUs. Those GPUs can be located on a single node as well as on different nodes in a cluster. Each process is called a *rank* in MPI. * The ranks are grouped in communication groups. The TensorRT-LLM C++ Runtime calls that group the *world*. The world configuration is an instance of the [`WorldConfig`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime/worldConfig.h) class, which encapsulates the following parameters: * `tensorParallelism`, the number of ranks that collaborate together to implement Tensor Parallelism (TP). With TP, each GPU performs computations for all the layers of the model. Some of those computations are distributed across the GPU. TP is more balanced than Pipeline Parallelism (PP), in most cases, but requires higher bandwidth between the GPUs. It is the recommended setting in the presence of NVLINK between GPUs, * `pipelineParallelism`, the number of ranks that collaborate together to implement Pipeline Parallelism (PP). With PP, each GPU works on a subset of consecutive layers. Communications between the GPUs happen only at the boundaries of the subsets of layers. It is harder to guarantee the full utilization of the GPUs with PP but it requires less memory bandwidth. It is the recommended setting in the absence of NVLINK between GPUs, * `rank`, the unique identifier of the rank, * `gpusPerNode`, indicates the number of GPUs on each node. Having that information allows the C++ runtime to optimize communications between GPUs in a node (like taking advantage of the [NVLINK](https://www.nvidia.com/en-us/data-center/nvlink/) interconnect between GPUs of an A100 [DGX](https://www.nvidia.com/en-us/data-center/dgx-platform/) node). ### Sampling Parameters The [`SamplingConfig`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime/samplingConfig.h) class encapsulates parameters that control the [generation](https://huggingface.co/blog/how-to-generate) of new tokens. Except for the `beamWidth` parameter, all the fields are optional and the runtime will use a default value if no values are provided by the user. For vector fields, the TensorRT-LLM runtime supports one value per sequence (that is, the vector contains `batchSize` values). If all the sequences use the same value for a given parameter, the vector can be limited to a single element (that is, `size() == 1`). ***General*** * `temperature`, a vector of floating-point numbers to control the modulation of logits when sampling new tokens. It can have any value `> 0.0f`. The default value is `1.0f`(no modulation). Note: the recommended way to enable greedy sampling is to set `temperature` to `1.0f` and `topK` to `1`. * `minLength`, a vector of integers to set a lower-bound on the number of tokens generated. It can have any value `>= 0`. Value `0` has no effect, the first generated token can be EOS. The default value is `1` (at least one non-EOS token is generated). * `repetitionPenalty`, a vector of float-point numbers to penalize tokens (irrespective of the number of appearances). It is multiplicative penalty. It can have any value `> 0.0f`. Repetition penalty `< 1.0f` encourages repetition, `> 1.0f` discourages it. The default value is `1.0f` (no effect). * `presencePenalty`, a vector of float-point numbers to penalize tokens already present in the sequence (irrespective of the number of appearances). It is additive penalty. It can have any value, values `< 0.0f` encourage repetition, `> 0.f` discourage it. The default value is `0.0f` (no effect). * `frequencyPenalty`, a vector of float-point numbers to penalize tokens already present in the sequence (dependent on the number of appearances). It is additive penalty. It can have any value, values `< 0.0f` encourage repetition, `> 0.0f` discourage it. The default value is `0.0f`(no effect). The parameters `repetitionPenalty`, `presencePenalty`, and `frequencyPenalty` are not mutually exclusive. ***Sampling*** * `randomSeed`, a vector of 64-bit integers to control the random seed used by the random number generator in sampling. Its default value is `0`, * `topK`, a vector of integers to control the number of logits to sample from. Must be in range of `[0, 1024]`. Its default value is `0`. Note that if different values are provided for the different sequences in the batch, the performance of the implementation will depend on the largest value. For efficiency reasons, we recommend to batch requests with similar `topK` values together, * `topP`, a vector of floating-point values to control the top-P probability to sample from. Must be in range of `[0.f, 1.f]`. Its default value is `0.f`, * `topPDecay`, `topPMin` and `topPResetIds`, vectors to control the decay in the `topP` algorithm. The `topP` values are modulated by a decay that exponentially depends on the length of the sequence as explained in [_Factuality Enhanced Language Models for Open-Ended Text Generation_](https://arxiv.org/abs/2206.04624). `topPDecay` is the decay, `topPMin` is the lower-bound and `topPResetIds` indicates where to reset the decay. `topPDecay`, `topPMin` must be in ranges of `(0.f, 1.f]` and `(0.f, 1.f]` respectively. Defaults are `1.f`, `1.0e-6,f` and `-1`, If both `topK` and `topP` fields are set, the `topK` method will be run for sequences with a `topK` value greater than `0.f`. In that case, the `topP` value for that sequence also influences the result. If the `topK` values for some sequences are `0.f`, the `topP` method will be used for those remaining sequences. If both `topK` and `topP` are zero, greedy search is performed. ***Beam-search*** * `beamWidth`, is the width used for the [beam search](https://en.wikipedia.org/wiki/Beam_search) sampling algorithm. There is no explicit upper-bound on the beam width but increasing the beam width will likely increase the latency. Use `1` to disable beam-search, * `beamSearchDiversityRate`, a floating-point value that controls the diversity in beam-search. It can have any value `>= 0.0f`. The default value is `0.f`, * `lengthPenalty`, a floating-point value that controls how to penalize the longer sequences in beam-search (the log-probability of a sequence will be penalized by a factor that depends on `1.f / (length ^ lengthPenalty)`). The default is value `0.f`, * `earlyStopping`, a integer value that controls whether the generation process finishes once `beamWidth` sentences are generated (end up with `end_token`). Default value `1` means `earlyStopping` is enabled, value `0` means `earlyStopping` is disable, other values means the generation process is depended on `length_penalty`. The `beamWidth` parameter is a scalar value. It means that in this release of TensorRT-LLM, it is not possible to specify a different width for each input sequence. This limitation is likely to be removed in a future release. ## The Session *The runtime session is deprecated in favor of the {ref}`executor`. It will be removed in a future release of TensorRT-LLM.* An example of how to use the `GptSession` to run a GPT-like auto-regressive model can be found in [`cpp/tests/runtime/gptSessionTest.cpp`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tests/runtime/gptSessionTest.cpp). ### Internal Components The `GptSession` class encapsulates two main components. The [`TllmRuntime`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/runtime/tllmRuntime.h) is in charge of the execution of the TensorRT engine. The [`GptDecoder`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime/gptDecoder.h) does the generation of the tokens from the logits. The `TllmRuntime` class is an internal component and you are not expected to use that class directly. The `GptDecoder` can be used directly to implement custom generation loop and for use cases that cannot be satisfied by the implementation in `GptSession`. ## In-flight Batching Support In-flight batching is supported using separate decoders per request. The biggest difference compared to using a single decoder is in how the token generation from logits is managed. A batch is split into `batchSize` individual requests and kernels are issued using separated CUDA streams. This behavior may be revisited in a future release to maintain the structure of the batch and improve efficiency. ## Know Issues and Future Changes * In the current release of TensorRT-LLM, the C++ and Python runtimes are two separate software components and the C++ runtime is being more actively developed (with features like in-flight batching). An objective, for a future release, could be to rebuild the Python runtime on top of the C++ one.