
Jetson Thor Made LLMs 3.5× Faster in 5 Weeks — But How?

JetsonHacks 9:04

3,348 views · 127 likes Watch on YouTube ↗

Join this channel to get access to perks:
https://www.youtube.com/channel/UCQs0lwV6E4p7LQaGJ6fgy5Q/join

Jetson Holiday Sale is live now! (11/26/2025)
Jetson AGX Thor Developer Kit (20% off): https://amzn.to/4oezEZn
Jetson AGX Orin Developer Kit (50% off): https://amzn.to/4inNE1A
Cyber Monday 2025 Deals: https://amzn.to/3GzOpWM

Full article on JetsonHacks: https://wp.me/p7ZgI9-3TU

Only 5 weeks after NVIDIA launched the Jetson AGX Thor Developer Kit, the vLLM inference engine was generating tokens 3.5x faster! The question is: how? In this video, we go through how LLMs actually work and how several bottlenecks were removed. The techniques include Automatic Prefix Caching, Paged Attention, GPU kernels implemented in FlashInfer and xFormers, and CUDA graphs, among others.
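
To give a feel for the idea behind Automatic Prefix Caching, here is a minimal Python sketch. It is a toy model under stated assumptions, not vLLM's actual implementation: the names ToyPrefixCache, block_hashes, and BLOCK_SIZE are hypothetical, and a string stands in for the cached key/value data. The prompt is cut into fixed-size blocks, each block is hashed together with the hash of the block before it, and a new prompt reuses every leading block whose chained hash is already in the cache.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # hypothetical block size, kept small for illustration


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens.

    Each hash covers the block's tokens plus the previous block's hash, so two
    prompts share a hash only if they share the whole prefix up to that block.
    """
    hashes = []
    prev = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest
    return hashes


class ToyPrefixCache:
    """Toy stand-in for a KV-cache block store keyed by chained prefix hash."""

    def __init__(self):
        self._blocks = {}  # chained hash -> placeholder for a cached KV block

    def lookup_and_fill(self, token_ids):
        """Return (reused_tokens, computed_tokens) for this prompt."""
        reused_blocks = 0
        still_hitting = True
        for h in block_hashes(token_ids):
            if still_hitting and h in self._blocks:
                reused_blocks += 1  # prefix block already cached: skip recompute
            else:
                still_hitting = False
                self._blocks[h] = "kv-block"  # pretend we computed and stored it
        reused = reused_blocks * BLOCK_SIZE
        return reused, len(token_ids) - reused


cache = ToyPrefixCache()
system_prompt = list(range(8))  # 8 made-up token ids shared by both requests
first = cache.lookup_and_fill(system_prompt + [100, 101, 102])
second = cache.lookup_and_fill(system_prompt + [200, 201])
print(first, second)
```

In this toy run the second request reuses the two blocks that cover the shared system prompt, so only its last two tokens would need fresh computation. A real engine also has to manage eviction and GPU memory, which this sketch leaves out.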

NVIDIA blog post: "Unlock Faster, Smarter Edge Models with 7x Gen AI Performance on NVIDIA Jetson AGX Thor" https://developer.nvidia.com/blog/unlock-faster-smarter-edge-models-with-7x-gen-ai-performance-on-nvidia-jetson-agx-thor/

Transformer Explainer
https://poloclub.github.io/transformer-explainer/

As an Amazon Associate I earn from qualifying purchases.
Visit the JetsonHacks storefront on Amazon: https://www.amazon.com/shop/jetsonhacks

Visit the website at https://jetsonhacks.com
Sign up for the newsletter! https://newsletter.jetsonhacks.com
Github accounts: https://github.com/jetsonhacks
https://github.com/jetsonhacksnano
Twitter: http://twitter.com/jetsonhacks

Some of the links here are affiliate links. As an Amazon Associate I earn from qualifying purchases at no extra cost to you.
