Paradigm Shifts in LLMs and Compute

AI development has reached a turning point. The 2020 Kaplan scaling laws showed that model performance improves predictably with more parameters, more data, and more compute, and for years the industry treated that as a recipe: bigger would inevitably mean smarter. As we enter 2026, however, that recipe is hitting a ceiling, and “bigger” is no longer enough. The new paradigm isn’t about how much data a model can access, but how deeply it can reason and how efficiently data can move within the hardware.

This evolution begins with a fundamental change in how AI “thinks”. Traditional models were, at their core, sophisticated next-token predictors, guessing the most likely next word or pixel from learned probabilities. The new generation, led by architectures like Meta’s VL-JEPA and OpenAI’s o-series, introduces System 2 thinking. Instead of producing an instantaneous, reactive response, these models engage in a deliberate “thinking” phase. By predicting abstract representations rather than every surface-level detail, they can model the physical logic of the world. They explore multiple reasoning paths and attempt to correct their own errors before providing an answer, making them more reliable on complex tasks like mathematics or engineering; a minimal sketch of this sample-verify-vote pattern appears below.
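To make the sample-verify-vote idea concrete, here is a minimal Python sketch. This is not Meta’s or OpenAI’s actual method: `generate_candidate` and `verify` are hypothetical stand-ins for a sampled reasoning path from a model and an external checker, and the loop simply keeps verified answers and takes a majority vote.

```python
import random

def generate_candidate(question: str, rng: random.Random) -> int:
    # Hypothetical stand-in for one sampled reasoning path from a model.
    # Here: a noisy attempt at computing 17 * 24.
    return 17 * 24 + rng.choice([-10, 0, 0, 0, 10])

def verify(question: str, answer: int) -> bool:
    # Hypothetical stand-in for a checker (calculator, unit test,
    # or a self-critique pass by the model itself).
    return answer == 17 * 24

def deliberate_answer(question: str, n_paths: int = 8, seed: int = 0) -> int | None:
    """Sample several reasoning paths, discard those that fail
    verification, and return the majority answer among survivors."""
    rng = random.Random(seed)
    survivors = []
    for _ in range(n_paths):
        candidate = generate_candidate(question, rng)
        if verify(question, candidate):
            survivors.append(candidate)
    if not survivors:
        return None
    # Majority vote over verified candidates (self-consistency).
    return max(set(survivors), key=survivors.count)

print(deliberate_answer("What is 17 * 24?"))  # 408
```

The design choice worth noticing is that extra compute is spent at inference time, after training, which is exactly the trade the “thinking phase” makes.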

Beneath it all sits the silicon layer, where electrical signals can only travel so fast. This critical bottleneck is known as the “memory wall”: the physical limit on how quickly data can move from memory to the processor. To overcome it, the industry has pivoted to a 3D “compute sandwich” that shortens the distance data must travel. Through a process called hybrid bonding, memory stacks are now fused directly on top of the logic chip instead of being placed beside it. This vertical integration drastically shortens data paths, allowing transfer speeds that exceed 2 terabytes per second. The rough arithmetic below shows why bandwidth, not raw compute, is so often the limiter.
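A quick roofline-style calculation makes the memory wall tangible. The numbers below are illustrative assumptions, not specifications for any real chip; the point is that for a bandwidth-hungry operation like a matrix-vector multiply, data movement dominates compute time, and stacking memory on logic attacks exactly that term.

```python
# Back-of-envelope roofline check: is a matrix-vector multiply (GEMV)
# compute-bound or memory-bound? All figures are illustrative.

peak_flops = 1000e12              # 1,000 TFLOP/s of compute
bandwidth_side_by_side = 0.8e12   # 0.8 TB/s: memory placed beside the die
bandwidth_stacked = 2.0e12        # 2+ TB/s: hybrid-bonded 3D stack

# GEMV on an N x N matrix: ~2*N^2 FLOPs, and every FP16 weight
# (2 bytes) must be read from memory exactly once.
n = 16384
flops = 2 * n * n
bytes_moved = 2 * n * n

for name, bw in [("side-by-side", bandwidth_side_by_side),
                 ("3D-stacked", bandwidth_stacked)]:
    t_compute = flops / peak_flops   # time if compute were the limit
    t_memory = bytes_moved / bw      # time if bandwidth were the limit
    bound = "memory" if t_memory > t_compute else "compute"
    print(f"{name}: compute {t_compute * 1e6:.1f} us, "
          f"memory {t_memory * 1e6:.1f} us -> {bound}-bound")
```

Even at 2 TB/s the operation remains memory-bound, but the stacked configuration finishes the data movement roughly 2.5 times faster, which is precisely the gain hybrid bonding is chasing.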

The shift in hardware is not just about physical layout but also about the medium of the signal. Copper interconnects generate substantial heat at high data rates, hurting not only efficiency but also large-scale power consumption in data centres. New platforms like NVIDIA’s Rubin are adopting silicon photonics, in which microscopic lasers transmit data as optical signals (light). This allows a massive increase in bandwidth at significantly lower power. When paired with FP4 precision, a reduced-precision number format that stores each value in just 4 bits, AI clusters can now handle trillion-parameter models with a level of efficiency previously thought unattainable. The sketch below illustrates what FP4-style quantization actually does to a tensor.
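To illustrate the FP4 idea, the following sketch quantizes values to the E2M1 grid (the eight magnitudes 0, 0.5, 1, 1.5, 2, 3, 4 and 6, plus their negatives) with a single per-tensor scale. Production formats typically add finer-grained block scaling and hardware-specific bit packing; this shows only the core rounding step.

```python
# Minimal sketch of FP4 (E2M1-style) quantization: scale the tensor,
# then snap each value to the nearest 4-bit representable number.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * g for s in (1.0, -1.0) for g in E2M1_GRID})

def quantize_fp4(xs: list[float]) -> tuple[list[float], float]:
    """Scale so the max magnitude maps to 6.0 (the largest FP4 value),
    then round every element to the nearest grid point."""
    scale = max(abs(x) for x in xs) / 6.0 or 1.0  # avoid zero scale
    quantized = [min(FP4_VALUES, key=lambda v: abs(x / scale - v)) for x in xs]
    return quantized, scale

def dequantize(qs: list[float], scale: float) -> list[float]:
    return [q * scale for q in qs]

weights = [0.03, -0.7, 1.2, 0.001, -2.5]
q, s = quantize_fp4(weights)
print(dequantize(q, s))  # approximations of the originals, 4 bits each
```

The trade is explicit: each weight now occupies a quarter of the space of FP16, quadrupling effective memory bandwidth and capacity, at the cost of coarser rounding that models must be trained or calibrated to tolerate.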

These developments mark a shift from resource-hungry scaling to efficient reasoning, mirroring classic innovation cycles in which raw capability gains eventually give way to efficiency gains. As efficient AI clusters democratise capability, competition will shift from sheer computational power to regulatory and supply-chain control.