Elon and SpaceX Have Made AI Training 10 Times Faster

Breaking News

The Rise of the Ellison Empire

Getting Away with Murder: Under Trump, the Buck Stops Nowhere

Ukraine Fires on Iranian Ships. Russia and Ukraine Dragged Into the Iran War?

Meltdown? The Real Reason Why The Media Hates Elon Musk

Top Tech News

Elon Vows AI-Made 'Odyssey' After Blasting Nolan's Take On Homer

Anthropic is launching its own drug discovery programs for rare diseases using Claude...

SpaceX AI Satellites Will Have 250 Kilowatts of Power

Chinese researchers have developed a sodium-metal battery that can fully charge in just 4 minutes...

SpaceX Starship Flight 13 in 3 Days - Thursday July 13

Chinese Scientists Develop Nuclear Battery Using Carbon-14

Teleoperated humanoid robots complete first-ever live surgery

Floating capsule auto-disinfects water without chemicals or battery

Modular Reactors To Solve Data Center Hysteria?

DeepSeek Developing In-House AI Chip In Bid To Cut Nvidia Reliance

News Link • Robots and Artificial Intelligence • 2026-05-29

Elon and SpaceX Have Made AI Training 10 Times Faster

It's designed to run on a massive cluster of 220,000 cutting-edge NVIDIA GB300 GPUs connected by ultra-fast 800G networks. They use pipeline parallelism and get as close to raw hardware (bare metal) as possible to minimize overhead.

This is a major departure from the higher-level frameworks (JAX, with some custom Rust components previously used at xAI) that most labs rely on.

What "over an order of magnitude" speedup vs JAX actually means

Elon stated the potential speed improvement for large training runs is over an order of magnitude (over 10× faster wall-clock time for equivalent work).

Why this is possible?

Standard frameworks like JAX (or PyTorch) have significant overhead at extreme scale. Python interpreter layers, generic abstractions, compiler passes (XLA), collective communication libraries that aren't perfectly tuned to one specific cluster topology, and scaling inefficiencies beyond ~10k–50k GPUs. Typical Model FLOPS Utilization (MFU) on frontier clusters today is roughly 50–67% even in highly optimized setups.

What the new stack does differently

Written in pure C → eliminates interpreter and high-level runtime overhead.

Exact-maps to the exact 220k-GB300 + 800G NIC topology → every GPU, every link, every memory hierarchy is known at compile time. No generic runtime discovery or indirection.

Heavy pipeline parallelism (model layers split into pipeline stages across GPUs, with micro-batches flowing through like an assembly line) is hand-tuned to hide all communication latency behind computation.

Bare-metal kernels and custom collectives → direct hardware control, maximal bandwidth use on the 800G networking, and far higher sustained utilization (potentially 80%+ MFU or better at this scale).

How it applies to AI pretraining (and other training)

Pretraining is where it matters most: 80–95%+ of the compute for a new frontier LLM (like the next Grok foundation models) is spent in the initial pretraining phase on trillions of tokens. This stack is optimized exactly for that.

Other training (post-training, SFT, RL, supplemental/mid-training) will also benefit, but the gains are largest on the longest, most communication-heavy runs.

Read More...

Reported By Robert Lee

Forums

Shop

Breaking News

Top Tech News

Elon and SpaceX Have Made AI Training 10 Times Faster