
Inference Chips for Agent Workflows

Y Combinator

1m 20s · 182 words · ~1 min read
Auto-Generated

[0:00] Most AI chips are designed for a world where inference means prompt in, response out. Agents don't work that way. They loop: calling tools, branching, backtracking, holding context across dozens of steps. That's a completely different hardware problem. Current GPUs hit 30% to 40% of peak utilization on these workloads because the work is bursty, bouncing between memory-bound model calls, IO-bound tool use, and CPU-bound orchestration. That gap is where purpose-built silicon wins. NVIDIA bought Groq for $20 billion because they saw this coming. Google built TPUv7 for inference specifically, but nobody's designing for the agent loop itself: fast context switching between models, native speculative decoding, memory built for KV caches that persist across an entire execution graph. Groq's real insight wasn't the chip; it was the compiler that made the chip work. We think that will be true for whoever builds this next. If you understand both the chip architecture and how agents actually execute, this is a rare moment where both halves of that experience matter. If you're building inference silicon for agentic AI, we'd love to hear from you.
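The loop the talk describes can be sketched in a few lines. This is a hypothetical illustration, not any real agent framework: `model_call`, `tool_call`, and `run_agent` are stand-in names, and the comments mark where each kind of work (memory-bound, IO-bound, CPU-bound) would land on real hardware.

```python
# Minimal sketch of an agent execution loop (all names hypothetical).
# Each iteration alternates between three kinds of work, which is why a
# prompt-in/response-out accelerator sits idle so much of the time.

def model_call(context):
    # Memory-bound in practice: autoregressive decoding is dominated by
    # KV-cache reads. Stubbed here as a trivial decision function.
    return "tool" if len(context) < 3 else "final_answer"

def tool_call(action):
    # IO-bound in practice: a real agent would hit a search API,
    # database, or shell here.
    return f"result_of_{action}"

def run_agent(task, max_steps=10):
    # The context plays the role of the KV cache the talk mentions:
    # state that must persist across the whole execution graph.
    context = [task]
    for _ in range(max_steps):
        action = model_call(context)      # memory-bound model step
        if action == "final_answer":
            return context
        observation = tool_call(action)   # IO-bound tool step
        context.append(observation)       # CPU-bound orchestration step
    return context

trace = run_agent("summarize the logs")
print(len(trace))
```

Even in this toy version, the hardware never sees one long, uniform stream of work; it sees short bursts of decoding interleaved with waits, which is the utilization gap the talk is pointing at.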
