Runtime