I Made Qwen 3.6 Long Prompts 7X Faster on Jetson Thor
JetsonHacks 11:12
8,426 views · 170 likes
Join this channel to get access to perks:
https://www.youtube.com/channel/UCQs0lwV6E4p7LQaGJ6fgy5Q/join
Raw hardware is only half the battle. When running frontier models like Qwen 3.6 on the edge, your software stack and server configuration are what determine whether the experience is "real-time" or a total bottleneck.
Compared in the video:
NVIDIA Jetson AGX Thor: https://amzn.to/3QK6o1u
NVIDIA Jetson AGX Orin: https://amzn.to/4cVPJAR
In this video, we deploy Alibaba’s Qwen 3.6 27B dense model and the Qwen 3.6 35B-A3B Mixture-of-Experts (MoE) model on the NVIDIA Jetson AGX Thor. We move beyond simple tokens-per-second benchmarks to look at the "Hidden Wait" (the prefill time) and at how inference servers such as vLLM and llama-server (llama.cpp) handle long-context inputs and multimodal vision tasks.
What we cover:
The Prefill Bottleneck: Why a "faster" stream can actually result in a slower total turnaround time.
vLLM vs. llama-server: Benchmarking deployment-ready stacks against exploration tools on the Blackwell architecture.
MoE & DeltaNet: How Qwen 3.6 uses hybrid linear attention to maintain a massive context window on edge memory.
Orin vs. Thor: Real-world performance scaling across the 410 GB/s memory bandwidth of the Jetson Thor.
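To make the prefill point concrete, here is a minimal sketch of the arithmetic: total turnaround is the time to read the prompt (prefill) plus the time to stream the reply (decode). The function name and every number below are hypothetical illustrations, not measurements from the video.

```python
def turnaround_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Total wall-clock time: prefill (reading the prompt) plus decode (streaming the reply)."""
    prefill = prompt_tokens / prefill_tps
    decode = output_tokens / decode_tps
    return prefill + decode

# Hypothetical workload: a long prompt with a short answer.
prompt_tokens, output_tokens = 16_000, 300

# Hypothetical server A: faster stream (decode), slower prefill.
a = turnaround_seconds(prompt_tokens, output_tokens, prefill_tps=400, decode_tps=30)
# Hypothetical server B: slower stream (decode), faster prefill.
b = turnaround_seconds(prompt_tokens, output_tokens, prefill_tps=2000, decode_tps=20)

print(f"A: {a:.1f} s, B: {b:.1f} s")  # prints "A: 50.0 s, B: 23.0 s"
```

With these assumed rates, server A streams tokens 50% faster yet takes more than twice as long overall, because the 16,000-token prompt dominates the wait.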
Resources:
Jetson AI Lab: https://jetson-ai-lab.com
Model Card (Hugging Face): https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
Leaderboard: https://artificialanalysis.ai
00:00 Introduction
00:50 Image Analysis
01:38 Summarize Article
02:15 Benchmarks
05:08 Faster Prefill
06:07 Jetson AGX Orin vs AGX Thor
06:40 Speculative Decoding
08:40 Additional Resources
10:01 Final Thoughts
As an Amazon Associate I earn from qualifying purchases.
Visit the JetsonHacks storefront on Amazon: https://www.amazon.com/shop/jetsonhacks
Visit the website at https://jetsonhacks.com
Sign up for the newsletter! https://newsletter.jetsonhacks.com
GitHub accounts: https://github.com/jetsonhacks
https://github.com/jetsonhacksnano
Twitter: http://twitter.com/jetsonhacks
Some of the links here are affiliate links. As an Amazon Associate I earn from qualifying purchases at no extra cost to you.