China’s “SpikingBrain 1.0” claims 100× faster long-context AI—running on homegrown MetaX chips
TL;DR: A Beijing team says its new brain-inspired language models, SpikingBrain-7B and SpikingBrain-76B, can handle ultra-long prompts dramatically faster by firing only the "neurons" that matter. The kicker: training and inference reportedly ran entirely on China's MetaX C550 GPUs rather than Nvidia hardware. The 7B model hit 100× faster time-to-first-token on a 4M-token input, was trained with ~150B tokens (≈1% of today's mega-corpora), and the code for the 7B variant is up on GitHub. A public demo nicknamed "Shunxi" is also being promoted. Impressive… if the results hold up under independent testing.
What just launched
Two models: SpikingBrain-7B (linear-attention architecture) and SpikingBrain-76B (hybrid linear + local + softmax attention with MoE). The central idea is event-driven spiking: instead of activating every unit at every step (classic Transformer behavior), the network encodes activations into sparse "spike" counts so that only salient signals trigger compute (see the sketch after this list).
Hardware angle: The team trained and served on MetaX C550 GPU clusters—part of China’s push to de-Nvidia its AI stack amid export controls.
Long-context claim: On a 4M-token stress test, the 7B model achieved >100× faster time-to-first-token (TTFT) than a conventional Transformer baseline, and training reportedly ran stably for weeks across hundreds of C550s at a model FLOPs utilization (MFU) of ~23.4%.
Data efficiency: They report ~150B tokens of continual pretraining. For scale, Meta says Llama 3 was trained on >15T tokens—so SpikingBrain’s figure is ~1% of that.
Availability: A SpikingBrain-7B repo is live; media and state outlets say a free “Shunxi” demo is up for public testing.
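For intuition on the event-driven spiking mentioned under "Two models," here is a minimal PyTorch sketch (not the team's code; the threshold, shapes, and function names are illustrative assumptions): activations are quantized into non-negative integer spike counts, and a downstream linear layer gathers only the columns whose count is nonzero, so compute scales with activity rather than with layer width.

```python
import torch

def to_spike_counts(x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Quantize real-valued activations into non-negative integer spike counts.
    # Anything below `threshold` yields zero spikes and is skipped downstream.
    # (Illustrative scheme only; the actual SpikingBrain encoding differs in detail.)
    return torch.clamp(x, min=0.0).div(threshold).floor().to(torch.int32)

def event_driven_linear(counts: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # y = W @ counts, but only columns with a nonzero spike count contribute.
    active = counts.nonzero(as_tuple=True)[0]      # indices of "firing" units
    if active.numel() == 0:
        return torch.zeros(weight.shape[0])
    return weight[:, active] @ counts[active].float()

x = torch.randn(1024)                              # pretend activations
counts = to_spike_counts(x)
W = torch.randn(256, 1024)
y_sparse = event_driven_linear(counts, W)
y_dense = W @ counts.float()                       # dense reference

print(f"fraction of silent units: {(counts == 0).float().mean().item():.1%}")
print("sparse result matches dense:", torch.allclose(y_sparse, y_dense, atol=1e-3))
```

The point of the toy: the dense and event-driven paths agree, but the event-driven path only touches the "firing" units, which is where the claimed speed and energy savings would come from.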
Why this matters
A different path to speed: The models replace heavy, quadratic attention with linear/hybrid attention plus spike coding, yielding near-constant memory and big gains on very long contexts. The 100× claim is specifically for TTFT (the time until the first token appears), a user-visible metric that often suffers on million-token prompts; a measurement sketch follows this list.
Geopolitics of compute: Running end-to-end on MetaX rather than Nvidia—if broadly reproducible—signals real progress toward a domestic Chinese AI stack. MetaX has been racing to fill the Nvidia gap and has sought STAR-Market listings to scale.
Data diet: Training on ~150B tokens (vs. the multi-trillion-token norm noted above) is eyebrow-raising. If accuracy truly stays "comparable to open-source baselines," that's a meaningful step toward cheaper, greener model development.
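Since the headline 100× number is about TTFT, it is worth being precise about what gets timed: the wall clock from submitting the prompt to the first generated token, which for standard softmax attention is dominated by the prefill pass over the entire prompt. Below is a hedged sketch of one common way to measure it with Hugging Face transformers; the checkpoint name is a placeholder and this is a generic harness, not the benchmark setup from the paper.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "some-org/some-causal-lm"   # placeholder; substitute the model you benchmark

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")

def time_to_first_token(prompt: str) -> float:
    # Wall-clock seconds from submitting the prompt to receiving the first new token.
    # Generating exactly one token isolates the prefill cost, which is what blows up
    # with prompt length under full quadratic softmax attention.
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=1, do_sample=False)
    return time.perf_counter() - start

print(f"TTFT: {time_to_first_token('lorem ipsum ' * 10_000):.3f} s")
```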
How it reportedly works (in a nutshell)
Hybrid attention: Mixes linear attention (state-based, compressive) with sliding-window (local) attention and some full softmax attention to balance reach vs. cost, particularly in the 76B MoE variant. A toy sketch of the two cheap components appears at the end of this section.
Spiking scheme: Converts activations to integer spike counts and expands them into sparse spike trains (binary/ternary/bitwise encodings). The paper reports ~69% sparsity at the micro level, a big lever for power savings. A ternary-encoding sketch also appears at the end of this section.
System plumbing: Custom Triton kernels/operators and a vLLM-based path adapted for MetaX C550 clusters; the team frames this as the first large-scale brain-inspired LLM trained on a non-Nvidia platform.
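To make the "Hybrid attention" item concrete, here is a toy single-head sketch of the two cheap ingredients: a linear-attention decode step that carries a fixed-size running state (memory does not grow with context length), and a sliding-window softmax step that only attends to the last few positions. The feature map, window size, and shapes are illustrative assumptions; how SpikingBrain actually interleaves these layers (and the remaining full-softmax layers in the 76B MoE model) is described in the paper, not here.

```python
import torch
import torch.nn.functional as F

def linear_attention_step(q_t, k_t, v_t, state, z):
    # One decode step of kernelized linear attention.
    # state: (d_k, d_v) running sum of phi(k) v^T; z: (d_k,) running sum of phi(k).
    # Memory stays O(d_k * d_v) no matter how many tokens have been consumed.
    phi = lambda x: F.elu(x) + 1.0                     # simple positive feature map
    state = state + torch.outer(phi(k_t), v_t)
    z = z + phi(k_t)
    out = (phi(q_t) @ state) / (phi(q_t) @ z + 1e-6)
    return out, state, z

def sliding_window_attention(q, k, v, window: int = 4):
    # Causal softmax attention where position i only sees the last `window` keys.
    T, d = q.shape
    scores = (q @ k.T) / d ** 0.5
    idx = torch.arange(T)
    mask = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= window)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

d_k, d_v, T = 8, 8, 16
q, k, v = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_v)

# Linear-attention decode: constant-size state, one token at a time.
state, z = torch.zeros(d_k, d_v), torch.zeros(d_k)
outputs = []
for t in range(T):
    out_t, state, z = linear_attention_step(q[t], k[t], v[t], state, z)
    outputs.append(out_t)

# Local softmax attention over the same toy sequence.
local_out = sliding_window_attention(q, k, v, window=4)
print(torch.stack(outputs).shape, local_out.shape)
```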
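The "Spiking scheme" item describes a two-stage encoding: quantize activations into integer spike counts, then unroll each count into a short spike train so downstream hardware sees sparse {-1, 0, +1} events rather than dense floats. Below is one plausible ternary expansion plus a micro-level sparsity measurement; the threshold, train length, and exact encoding here are assumptions, and the paper's binary/ternary/bitwise schemes differ in detail.

```python
import torch

def ternary_spike_train(x: torch.Tensor, threshold: float = 0.5, steps: int = 4) -> torch.Tensor:
    # Expand real activations into a {-1, 0, +1} spike train of length `steps`.
    # The integer spike count is floor(|x| / threshold), capped at `steps`;
    # the sign of x decides whether the emitted spikes are +1 or -1.
    counts = (x.abs() / threshold).floor().clamp(max=steps)
    t = torch.arange(steps, device=x.device)
    fires = t < counts.unsqueeze(-1)                   # `count` leading spikes, rest zero
    return fires.to(x.dtype) * x.sign().unsqueeze(-1)  # shape: (*x.shape, steps)

x = torch.randn(1024)
train = ternary_spike_train(x)

# Micro-level sparsity = fraction of time steps that carry no spike at all.
sparsity = (train == 0).float().mean().item()
print(f"spike-train sparsity: {sparsity:.1%}")
```

With thresholds like this, most time steps are silent, which is the kind of micro-level sparsity the ~69% figure describes.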
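Lastly, "custom Triton kernels/operators" means hand-written GPU kernels expressed in Triton's Python DSL; the paper describes adapting such operators to the MetaX C550 stack. The skeleton below shows the general shape of an elementwise Triton operator (here, the toy spike-count quantizer from earlier); it is not one of the team's kernels, and the block size and function names are arbitrary.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def spike_count_kernel(x_ptr, out_ptr, n_elements, threshold, BLOCK_SIZE: tl.constexpr):
    # Elementwise: out[i] = int(max(x[i], 0) / threshold), one block of elements per program.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    counts = tl.maximum(x, 0.0) / threshold
    tl.store(out_ptr + offsets, counts.to(tl.int32), mask=mask)

def spike_counts(x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    out = torch.empty_like(x, dtype=torch.int32)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    spike_count_kernel[grid](x, out, n, threshold, BLOCK_SIZE=1024)
    return out

# Requires a GPU backend with Triton support.
x = torch.randn(1 << 20, device="cuda")
print(spike_counts(x).float().mean())   # average spike count per unit
```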