tensorrt_llm

Getting Started

  • Overview
  • Quick Start Guide
  • Release Notes

Installation

  • Installing on Linux
  • Building from Source Code on Linux
  • Installing on Windows
  • Building from Source Code on Windows

Architecture

  • TensorRT-LLM Architecture
  • Model Definition
  • Compilation
  • Runtime
  • Multi-GPU and Multi-Node Support
  • TensorRT-LLM Checkpoint
  • TensorRT-LLM Build Workflow
  • Adding a Model

Advanced

  • Multi-Head, Multi-Query, and Group-Query Attention
  • C++ GPT Runtime
  • Graph Rewriting Module
  • The Batch Manager in TensorRT-LLM
  • Inference Request
  • Responses
  • Run gpt-2b + LoRA using GptManager / C++ Runtime
  • Expert Parallelism in TensorRT-LLM

Performance

  • Overview
  • Best Practices for Tuning the Performance of TensorRT-LLM
  • Performance Analysis

Reference

  • Troubleshooting
  • Support Matrix
  • Numerical Precision
  • Memory Usage of TensorRT-LLM

C++ API

  • Runtime

Python API

  • Layers
  • Functionals
  • Models
  • Plugin
  • Quantization
  • Runtime

Blogs

  • H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
  • Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
  • Speed up inference with SOTA quantization techniques in TRT-LLM
  • New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget