Amazing write-up, and I wish more people showed the discovery process, which is often even more interesting than the result itself.
Still, the result is really interesting: being able to stack abstract reasoning layers and get better performance, plus the heat maps showing the probability results.
The academic literature seems to be catching up:
- *[SOLAR / DUS (Kim et al., 2023)](https://arxiv.org/abs/2312.15166)* — duplicated transformer layers to build a 10.7B-parameter model that outperformed 30B-parameter baselines.
- *[The Curse of Depth (2025)](https://arxiv.org/abs/2502.05795)* — explains why this works: Pre-LN causes deep transformer layers to converge toward identity functions, meaning middle layers are where real computation happens, and duplicating them concentrates that capacity.
- *[Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (Geiping et al., NeurIPS 2025)](https://arxiv.org/abs/2502.05171)* — takes the idea to its logical conclusion: a model trained with a single recurrent block repeated at inference time, scaling reasoning depth without adding parameters.
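The DUS recipe above can be sketched in a few lines. This is a toy illustration, not the actual SOLAR implementation: it assumes a model is just an ordered list of layer modules, and shows how concatenating two offset copies makes the middle layers appear twice.

```python
def depth_upscale(layers, m):
    """DUS-style depth up-scaling (toy sketch): drop the last m layers
    from one copy of the model and the first m layers from another,
    then concatenate. The middle layers end up duplicated, which is
    where, per the Curse of Depth argument, the real computation lives."""
    return layers[: len(layers) - m] + layers[m:]

# Toy example: an 8-"layer" model up-scaled to 12 layers.
base = [f"layer_{i}" for i in range(8)]
scaled = depth_upscale(base, 2)
print(len(scaled))  # 12
print(scaled)       # layer_2 .. layer_5 each appear twice
```

SOLAR's actual recipe works on a 32-layer base with m=8, giving 48 layers, followed by continued pretraining to heal the seam between the two copies.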