tensorrt_llm

Getting Started

  • Overview
  • Quick Start Guide
  • Release Notes

Installation

  • Installing on Linux
  • Building from Source Code on Linux
  • Installing on Windows
  • Building from Source Code on Windows

Architecture

  • TensorRT-LLM Architecture
  • Model Definition
  • Compilation
  • Runtime
  • Multi-GPU and Multi-Node Support
  • TensorRT-LLM Checkpoint
  • TensorRT-LLM Build Workflow
  • Adding a Model

Advanced

  • Multi-Head, Multi-Query, and Group-Query Attention
  • C++ GPT Runtime
  • Graph Rewriting Module
  • The Batch Manager in TensorRT-LLM
  • Inference Request
  • Responses
  • Run gpt-2b + LoRA using GptManager / C++ Runtime
  • Expert Parallelism in TensorRT-LLM

Performance

  • Overview
  • Best Practices for Tuning the Performance of TensorRT-LLM
  • Performance Analysis

Reference

  • Troubleshooting
  • Support Matrix
  • Numerical Precision
  • Memory Usage of TensorRT-LLM

C++ API

  • Runtime

Python API

  • Layers
  • Functionals
  • Models
  • Plugin
  • Quantization
  • Runtime

Blogs

  • H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
  • Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
  • Speed up inference with SOTA quantization techniques in TRT-LLM
  • New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget