The 10,000x Explosion: Reproducing DeepSeek’s mHC at Scale

In the world of Large Language Models, “stability” is often the difference between a state-of-the-art model and an expensive pile of NaNs. When DeepSeek released their paper on Manifold Hyper-Connections (mHC), they proposed a solution to a hidden rot in transformer architectures: the explosion of signal through residual connections.

Developer Taylor Kolasinski recently took these claims to the lab [1][2][3], and the results were nothing short of explosive. Here is the breakdown of how he found “the bomb” ticking inside the architecture.

1. The “Stream Persistence” Breakthrough

Reproduction started small: a 10M-parameter sandbox. At first everything looked identical to the baseline; the math was right, but the results were flat. The culprit? A “collapsing stream” bug.

In Hyper-Connections, the model maintains multiple parallel streams of information. Taylor realized he was accidentally merging these streams back into one at every layer, effectively neutering the architecture. Once he allowed the streams to stay independent (“persistent”), the beast woke up. The signal amplification jumped 9.2× almost immediately.
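
To see what the fix changes, here is a minimal sketch of the two behaviors (illustrative PyTorch, not Taylor’s actual code; the `alpha`/`beta` mixing weights and tensor shapes are assumptions):

```python
import torch

# x holds the parallel residual streams: (batch, seq, n_streams, d_model).
# `layer` is any transformer block; alpha/beta are per-stream mixing weights.

def hc_step_buggy(x, layer):
    # The "collapsing stream" bug: sum the streams into one before the layer...
    h = x.sum(dim=2)                       # (batch, seq, d_model)
    out = layer(h)
    # ...then broadcast the single result back. Every stream is now identical,
    # so the n-stream architecture degenerates to a plain residual network.
    return out.unsqueeze(2).expand_as(x)

def hc_step_persistent(x, layer, alpha, beta):
    # Width connection: a weighted mix of the streams feeds the layer.
    h = torch.einsum("bsnd,n->bsd", x, alpha)
    out = layer(h)
    # Depth connection: the output is written back per stream, while the
    # original streams survive untouched in the residual ("persistent").
    return x + beta.view(1, 1, -1, 1) * out.unsqueeze(2)
```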

2. Scaling Until It Breaks

Small models are one thing; 1.7 billion parameters is another. Moving to the “big leagues” meant swapping Colab for a $20/hour 8× H100 cluster from Lambda Labs.

Engineering at this scale is a war of attrition. Taylor documented a gauntlet of OOM (Out of Memory) errors, “thundering herd” problems where multiple GPUs tried to download the 300GB C4 dataset simultaneously, and even a physical hardware failure on “GPU 3.”
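
The thundering-herd fix is worth showing, because it is a pattern every multi-GPU run eventually needs: gate the download behind rank 0 and make everyone else wait at a barrier. A sketch of the standard pattern, not Taylor’s code; `download_c4` is a hypothetical helper:

```python
import os
import torch.distributed as dist

def fetch_dataset(path: str) -> str:
    """Download the corpus on rank 0 only, then release the other ranks.

    Without this gate, all 8 ranks try to pull the ~300GB C4 dataset
    at once: the "thundering herd".
    """
    if dist.get_rank() == 0 and not os.path.exists(path):
        download_c4(path)  # hypothetical helper: fetch and unpack the corpus
    dist.barrier()         # every rank waits here until the files exist
    return path
```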

The lesson? When training LLMs, assume everything will break. Memory math matters more than your code, and cloud GPUs are as fallible as any other hardware.
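
The memory math in question fits in a few lines. A back-of-envelope sketch, assuming mixed-precision Adam with fp32 master weights and no sharding:

```python
params = 1.7e9  # model size

bytes_per_param = (
    2 +    # bf16 weights
    2 +    # bf16 gradients
    4 +    # fp32 master copy of the weights
    4 + 4  # fp32 Adam moments (m and v)
)
print(f"{params * bytes_per_param / 1e9:.1f} GB before a single activation")
# ~27 GB of an 80GB H100 is spoken for before activations, optimizer
# workspace, or dataloader buffers enter the picture.
```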

3. Detonating “The Bomb”

The climax of the experiment came at the 1.7B parameter mark. As the training runs progressed, the standard Hyper-Connections (HC) model began to exhibit what Taylor calls “quiet violence.”

The signal didn’t just drift; it amplified by more than 10,000×. The startling part was where it began: the instability doesn’t start deep in the network; it starts at Layer 0, right at the input embedding.
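
Catching this means measuring amplification per layer rather than waiting for the loss to explode. A sketch of such a probe, assuming `model.layers` holds the transformer blocks:

```python
import torch

def attach_gain_probes(model: torch.nn.Module) -> dict:
    """Record each layer's amplification factor ||out|| / ||in||.

    A probe like this is how a blow-up at Layer 0 shows itself, instead of
    being blamed on the deep layers where the loss finally diverges.
    """
    gains = {}

    def make_hook(idx):
        def hook(module, inputs, output):
            x = inputs[0]
            out = output[0] if isinstance(output, tuple) else output
            gains[idx] = (out.norm() / (x.norm() + 1e-8)).item()
        return hook

    for i, block in enumerate(model.layers):
        block.register_forward_hook(make_hook(i))
    return gains  # populated on every forward pass
```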

The only reason the model didn’t instantly vaporize into NaN (Not a Number) errors was gradient clipping: a safety net working overtime to mask the architecture’s structural instability.
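
Usefully, PyTorch’s clipper reports the norm it measured before clipping, so the safety net can double as a smoke detector. A sketch; `MAX_NORM` is a typical threshold, not Taylor’s setting:

```python
import torch
from torch.nn.utils import clip_grad_norm_

MAX_NORM = 1.0  # a common clip threshold; an assumption here

def clip_and_audit(model: torch.nn.Module) -> float:
    # clip_grad_norm_ returns the *pre-clip* gradient norm.
    total_norm = clip_grad_norm_(model.parameters(), MAX_NORM).item()
    if total_norm > 100 * MAX_NORM:
        # Clipping that fires this hard on every step is masking a
        # structural instability, not smoothing an occasional bad batch.
        print(f"warning: raw grad norm {total_norm:.1f} >> clip threshold")
    return total_norm
```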

The Verdict

The experiment was a total success for the DeepSeek thesis. While the unconstrained Hyper-Connections created a signal bomb, the Manifold Hyper-Connections (mHC) fix kept the signal gain pinned at a perfect 1.0 across every seed and every depth.
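
The name hints at why: mHC constrains the stream-mixing map to a manifold on which it cannot amplify. One way to realize such a constraint, shown here as an assumption rather than the paper’s exact construction, is a Sinkhorn-style projection toward doubly stochastic matrices, whose rows and columns each sum to 1:

```python
import torch

def sinkhorn_project(W: torch.Tensor, n_iters: int = 10,
                     eps: float = 1e-8) -> torch.Tensor:
    """Push a stream-mixing matrix toward the doubly stochastic manifold.

    Alternately normalize rows and columns (Sinkhorn-Knopp). With every row
    and column summing to 1, chaining these mixers across depth cannot blow
    up the stream norms, which is consistent with a gain pinned at 1.0.
    (An illustrative construction, not Taylor's code.)
    """
    P = W.exp()  # ensure strictly positive entries
    for _ in range(n_iters):
        P = P / (P.sum(dim=-1, keepdim=True) + eps)  # rows sum to 1
        P = P / (P.sum(dim=-2, keepdim=True) + eps)  # columns sum to 1
    return P
```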

Taylor’s journey shows that as we push toward 10B- and 100B-parameter models, we can no longer rely on “heroic” gradient clipping to save us. We need architectures that are stable by design.


References

[1] https://taylorkolasinski.com/devlog/mhc-reproduction-session-4/
[2] https://taylorkolasinski.com/devlog/mhc-reproduction-session-5/
[3] https://taylorkolasinski.com/devlog/mhc-reproduction-session-6/