>
Delcy's 'Gatekeeper': Sources Say Ex-Trump Official Claver-Carone Holds Keys to Caracas
Elon and SpaceX Have Made AI Training 10 Times Faster
SR-71 successor? Quarterhorse Mk 2.1 prototype cracks the sound barrier
EU Wants Crisis Powers To Seize Control Of Chip Supplies, Seeks Restrictions On Chinese Imports
Cars Are Fast Becoming Dystopian Prison Pods...
Our Emergency Water Plan Wasn't Good Enough - So We Built This
Sodium Ion Batteries Can Reach 100 Gigawatt Per Hour Per Year Scale in 2027
Juiced Bikes proves capable electric motorcycles don't have to cost a lot
Headlight projectors turn your car into a drive-in theater
US To Develop Small Modular Nuclear Reactors For Commercial Shipping
New York Mandates Kill Switch and Surveillance Software in Your 3D Printer ...
Cameco Sees As Many As 20 AP1000 Nuclear Reactors On The Horizon
His grandparents had heart disease.
At 11, Laurent Simons decided he wanted to fight aging.
Mayo Clinic's AI Can Detect Pancreatic Cancer up to 3 Years Before Diagnosis–When Treatment...

It's designed to run on a massive cluster of 220,000 cutting-edge NVIDIA GB300 GPUs connected by ultra-fast 800G networks. They use pipeline parallelism and get as close to raw hardware (bare metal) as possible to minimize overhead.
This is a major departure from the higher-level frameworks (JAX, with some custom Rust components previously used at xAI) that most labs rely on.
What "over an order of magnitude" speedup vs JAX actually means
Elon stated the potential speed improvement for large training runs is over an order of magnitude (over 10× faster wall-clock time for equivalent work).
Why this is possible?
Standard frameworks like JAX (or PyTorch) have significant overhead at extreme scale. Python interpreter layers, generic abstractions, compiler passes (XLA), collective communication libraries that aren't perfectly tuned to one specific cluster topology, and scaling inefficiencies beyond ~10k–50k GPUs. Typical Model FLOPS Utilization (MFU) on frontier clusters today is roughly 50–67% even in highly optimized setups.
What the new stack does differently
Written in pure C → eliminates interpreter and high-level runtime overhead.
Exact-maps to the exact 220k-GB300 + 800G NIC topology → every GPU, every link, every memory hierarchy is known at compile time. No generic runtime discovery or indirection.
Heavy pipeline parallelism (model layers split into pipeline stages across GPUs, with micro-batches flowing through like an assembly line) is hand-tuned to hide all communication latency behind computation.
Bare-metal kernels and custom collectives → direct hardware control, maximal bandwidth use on the 800G networking, and far higher sustained utilization (potentially 80%+ MFU or better at this scale).
How it applies to AI pretraining (and other training)
Pretraining is where it matters most: 80–95%+ of the compute for a new frontier LLM (like the next Grok foundation models) is spent in the initial pretraining phase on trillions of tokens. This stack is optimized exactly for that.
Other training (post-training, SFT, RL, supplemental/mid-training) will also benefit, but the gains are largest on the longest, most communication-heavy runs.