
I Made Qwen 3.6 Long Prompts 7X Faster on Jetson Thor

JetsonHacks 11:12

8,426 views · 170 likes

Join this channel to get access to perks:
https://www.youtube.com/channel/UCQs0lwV6E4p7LQaGJ6fgy5Q/join

Raw hardware is only half the battle. When you run frontier models like Qwen 3.6 on the edge, your software stack and server configuration determine whether the experience feels "real-time" or becomes a total bottleneck.

Compared in the video:
NVIDIA Jetson AGX Thor: https://amzn.to/3QK6o1u
NVIDIA Jetson AGX Orin: https://amzn.to/4cVPJAR

In this video, we deploy Alibaba’s Qwen 3.6 27B dense model and the Qwen 3.6 35B-A3B Mixture-of-Experts (MoE) model on the NVIDIA Jetson AGX Thor. We move beyond simple tokens-per-second benchmarks to look at the "Hidden Wait" (the prefill time) and at how different inference servers, vLLM and llama-server (llama.cpp), handle long-context inputs and multimodal vision tasks.
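To see why prefill time can dominate on long prompts, here is a minimal sketch of the turnaround arithmetic. It assumes a simplified model where total time is prompt tokens divided by the prefill rate plus output tokens divided by the decode rate, and ignores network, tokenization, and scheduling overhead. The two server configurations and all rates are hypothetical placeholders, not measurements from the video.

```python
def turnaround_seconds(prompt_tokens: int, output_tokens: int,
                       prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    """Approximate end-to-end time: prefill (prompt processing) plus decode (generation)."""
    prefill = prompt_tokens / prefill_tok_per_s   # the "hidden wait" before the first token
    decode = output_tokens / decode_tok_per_s     # the visible token stream
    return prefill + decode


# Hypothetical long-context request (illustrative numbers only).
PROMPT, OUTPUT = 20_000, 500

# Config A streams faster but processes the prompt slowly.
fast_stream = turnaround_seconds(PROMPT, OUTPUT, prefill_tok_per_s=400.0, decode_tok_per_s=50.0)
# Config B streams slower but processes the prompt quickly.
slow_stream = turnaround_seconds(PROMPT, OUTPUT, prefill_tok_per_s=2_000.0, decode_tok_per_s=30.0)

print(f"faster stream: {fast_stream:.1f} s total")
print(f"slower stream: {slow_stream:.1f} s total")
```

With these placeholder rates, the configuration with the slower stream finishes the whole request first, because the 20,000-token prompt costs 50 s of prefill on config A versus 10 s on config B.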

What we cover:

The Prefill Bottleneck: Why a "faster" stream (more tokens per second during generation) can still result in a slower total turnaround time when prefill is slow.

vLLM vs. llama-server: Benchmarking deployment-ready stacks against exploration tools on the Blackwell architecture.

MoE & DeltaNet: How Qwen 3.6 uses hybrid linear attention to maintain a large context window within the limited memory of an edge device.

Orin vs. Thor: How real-world performance scales with the 273 GB/s memory bandwidth of the Jetson AGX Thor.
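Memory bandwidth matters because, during decode, each generated token has to stream the model's active weights from memory. A rough, back-of-the-envelope ceiling is bandwidth divided by bytes read per token. The sketch below assumes about 0.5 bytes per parameter (4-bit weights, ignoring quantization scales, KV-cache reads, and activations) and takes the "A3B" in the MoE model's name to mean roughly 3B active parameters per token. The 200 GB/s bandwidth is a hypothetical round number; substitute your board's published figure. Real throughput will land below these ceilings.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float,
                             active_params_billion: float,
                             bytes_per_param: float) -> float:
    """Upper bound on decode speed if every token must read all active weights once.

    Ignores KV-cache traffic, activations, and compute limits (assumption).
    """
    gb_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_per_token


BANDWIDTH_GB_S = 200.0   # hypothetical placeholder, not a spec for any board
BYTES_PER_PARAM = 0.5    # assumed 4-bit weights, overhead ignored

dense = decode_ceiling_tok_per_s(BANDWIDTH_GB_S, 27.0, BYTES_PER_PARAM)  # 27B dense: all weights active
moe = decode_ceiling_tok_per_s(BANDWIDTH_GB_S, 3.0, BYTES_PER_PARAM)     # MoE: ~3B active per token

print(f"dense 27B ceiling: {dense:.1f} tok/s")
print(f"MoE ~3B-active ceiling: {moe:.1f} tok/s")
print(f"MoE / dense ratio: {moe / dense:.1f}x")
```

Because the ceiling is linear in bandwidth, moving the same model between boards scales it by the ratio of their memory bandwidths, while the dense-versus-MoE gap depends only on how many parameters are active per token.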

Resources:

Jetson AI Lab: https://jetson-ai-lab.com

Model Card (Hugging Face): https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

Leaderboard: https://artificialanalysis.ai

00:00 Introduction
00:50 Image Analysis
01:38 Summarize Article
02:15 Benchmarks
05:08 Faster Prefill
06:07 Jetson AGX Orin vs AGX Thor
06:40 Speculative Decoding
08:40 Additional Resources
10:01 Final Thoughts

As an Amazon Associate I earn from qualifying purchases.
Visit the JetsonHacks storefront on Amazon: https://www.amazon.com/shop/jetsonhacks

Visit the website at https://jetsonhacks.com
Sign up for the newsletter! https://newsletter.jetsonhacks.com
GitHub accounts: https://github.com/jetsonhacks
https://github.com/jetsonhacksnano
Twitter: http://twitter.com/jetsonhacks

Some of these links here are affiliate links. As an Amazon Associate I earn from qualifying purchases at no extra cost to you.
