
A perspective shift: From computation to cognition

Evelyn Grevelink, Felippe Vieira Zacarias | April 2025

As large language models (LLMs) push the boundaries of AI, high-bandwidth memory (HBM) is key for next-generation LLMs, enabling intelligent, context-aware reasoning at unprecedented speeds.

Traditionally, computer systems were designed around a deterministic, linear model of processing: 

input → compute → output


But the success of artificial intelligence (AI) in recent years, particularly LLMs, has required a paradigm shift. We're no longer dealing with machines that simply process and calculate. With the emergence of generative AI, an AI-powered chatbot — famously, ChatGPT — now has the sophistication to interpret context, generate novel insights, adapt to new information, and even reason. These systems are not yet sentient helpers in the vein of the iconic and beloved Rosey, the Jetsons' robotic housekeeper, but you still get a useful and intelligent conversation partner.

You might be wondering what role memory plays in powering these intelligent systems. Memory becomes increasingly important as LLMs grow in parameter size — now reaching into the trillions — since these massive parameter sets must be stored in and quickly accessed from memory during inference and training. And HBM is specifically designed to handle this enormous data movement, which involves frequent and high-volume memory access. For the past decade, Micron has advanced memory technology to keep pace with the rapid growth and success of these models. In this blog, we'll explore the significance of high-bandwidth memory — specifically, HBM3E from Micron — in advancing AI models and enabling them to be more powerful, capable and intelligent.
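To put those parameter counts in perspective, here is a quick back-of-the-envelope sketch in Python. The parameter counts and precisions are illustrative assumptions, not figures from our testing, and the estimate covers only the model weights (no activations, KV cache or framework overhead).

    # Rough estimate of the memory needed just to hold model weights.
    # Parameter counts and precisions below are illustrative assumptions.
    BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

    def weight_footprint_gb(num_params: float, precision: str) -> float:
        """Approximate size of the model weights alone, in gigabytes."""
        return num_params * BYTES_PER_PARAM[precision] / 1e9

    for params in (7e9, 70e9, 1e12):        # 7B, 70B and a 1-trillion-parameter model
        for precision in ("FP16", "INT4"):
            print(f"{params / 1e9:>6.0f}B params @ {precision}: "
                  f"{weight_footprint_gb(params, precision):,.0f} GB")

At FP16, a 70-billion-parameter model already needs roughly 140 GB for its weights alone, and a trillion-parameter model needs terabytes, which is why both the capacity and the bandwidth of the memory holding those weights matter so much.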


Figure 1. How memory technology has kept pace with the needs of ever-larger models

Milliseconds matter

The meteoric rise of LLMs has set a new challenge for researchers and engineers to fundamentally rethink how computational systems process and move information. Just as the advent of computer graphics forced a shift in thinking — where we didn't just improve rendering speed but redefined how machines could perceive and process visual information — we now stand at a similar point with AI. The integration of LLMs like Anthropic's Claude, Google's Gemini, Meta's Llama and others into mainstream applications demands more than incremental gains in performance. It calls for a new class of systems capable of supporting dynamic and context-aware interactions between humans and machines. When designing today's hardware, engineers must go beyond optimizing traditional metrics like latency and power efficiency. The systems they design must enhance understanding for inference tasks, support real-time learning, and maintain continuity in conversation-like exchanges.

In AI-powered interactions, a few milliseconds can make the difference between a harmonious, human-like experience and a fragmented or frustrating one. In high-load data center scenarios that support thousands, or even millions, of concurrent users, such as real-time translation or AI copilots, the higher bandwidth and greater capacity of next-generation memory like HBM3E are key. This technology ensures consistency in system response, preserves output quality under high loads, and enables equitable, high-fidelity interactions for all users.

HBM3E and AI inference

Next-generation memory hardware is often characterized by improvements in bandwidth and capacity, with the mantra that "more, bigger, and faster is better." However, in the context of contemporary AI systems, particularly LLMs, this approach is more nuanced. Take HBM3E, for example: faster data transfer rates (higher bandwidth) and increased memory capacity have a more complex effect on AI inference. While bandwidth and capacity remain critical metrics for memory hardware, they influence LLM performance in distinctly different ways. Our goal extends beyond increasing speed for the sake of going faster or capacity for the purpose of holding more data; we now need to improve these metrics to enable greater levels of intelligence — the ability to synthesize information and reason. Let’s now look at some specs of HBM3E and explain what those higher values actually mean in the context of AI models.

Bandwidth determines the computational potential

Per cube, HBM3E delivers more than 1.2 terabytes per second (TB/s) of bandwidth, but this isn't just a higher number.1 It represents computational potential. The ability to transfer data at this rate means AI models can access, process and synthesize information at unprecedented speeds, dramatically reducing latency and enhancing model performance (that is, how quickly and responsively the system can answer).
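As a rough illustration of why bandwidth caps responsiveness, consider that single-stream LLM decoding is typically memory-bound: generating each token requires streaming roughly the full set of weights from memory. The sketch below turns that observation into an upper bound on tokens per second. The 35 GB weight footprint (a 70B-parameter model quantized to INT4) and the 3.35 TB/s figure for an HBM3-class system are illustrative assumptions; 4.8 TB/s is the HBM3E system bandwidth discussed in the results below.

    # Memory-bound upper bound on decode speed: each token reads ~all weights once.
    # Ignores KV-cache traffic, compute time and software overhead.
    def max_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
        """Upper bound: each generated token streams roughly all weights once."""
        return bandwidth_bytes_per_s / weight_bytes

    WEIGHTS_GB = 35  # ~70B parameters quantized to INT4 (assumption)
    for label, bw_tb_s in (("HBM3-class system", 3.35), ("HBM3E-class system", 4.8)):
        tps = max_tokens_per_second(WEIGHTS_GB * 1e9, bw_tb_s * 1e12)
        print(f"{label} at {bw_tb_s} TB/s: at most ~{tps:.0f} tokens/s per decode stream")

Real systems fall short of this ceiling, but the proportionality is the point: moving the same weights over a faster memory interface directly raises how many tokens per second a single conversation can receive.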

Capacity determines the depth and complexity of reasoning 

An expanded capacity of 24 gigabytes (GB) per cube2 means more than just more storage; it enables greater cognitive potential for a neural network, where a larger model capacity allows intelligent machines to execute even more complex tasks. Unlike traditional computing models, where memory served primarily as a storage mechanism, modern AI architectures need memory capacity for cognition, translating directly into deeper understanding, more nuanced reasoning and more comprehensive answers. We can think of access to a larger memory capacity as compounding, or multiplying, an LLM's ability to reason.

With HBM3E, we are not just chasing better numbers; we're designing memory to fundamentally expand the cognitive potential of machine intelligence. The combined impact of higher bandwidth and higher capacity allows an LLM to be thoughtful and exact in how it interacts with you. At a technical level, that increase means LLMs can process larger datasets, more tokens per second, longer input sequences, and higher-precision data formats like FP16. Essentially, without enough bandwidth, these very capable models would struggle to quickly access relevant information. And without an enormous memory capacity, they would lack the depth to generate a comprehensive, contextually rich response beyond surface-level analysis.
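To make the capacity side concrete, the sketch below estimates the key/value (KV) cache that a model keeps in memory for each active conversation. It uses a Llama-2-70B-like attention configuration (80 layers, 8 grouped KV heads, 128-dimensional heads, FP16 cache) as an illustrative assumption rather than a figure from this blog.

    # Approximate KV-cache memory for one sequence: each layer stores a key and a
    # value vector per token for every KV head. Configuration values are assumptions.
    def kv_cache_gb(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                    head_dim: int = 128, bytes_per_value: int = 2) -> float:
        """Approximate KV-cache size for one sequence, in gigabytes."""
        per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
        return seq_len * per_token_bytes / 1e9

    for seq_len in (4_096, 32_768, 128_000):
        print(f"{seq_len:>7}-token context: ~{kv_cache_gb(seq_len):.1f} GB of KV cache per sequence")

Longer contexts quickly consume tens of gigabytes per conversation on top of the weights themselves, which is why a larger per-cube capacity translates so directly into how much context, and how many simultaneous conversations, a model can keep in mind.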

Figure 2. Increase in throughput for Micron HBM3E (NVIDIA H200)

Experimental results

Now let’s get into some actual test results3 using Meta Llama 2 70B with DeepSpeed ZeRO-Inference to show the transformative potential of next-gen HBM:

  • Performance boost: HBM3E increases inference performance by 1.8 times, with memory bandwidth reaching 4.8 TB/s.4
  • Scalability: The technology supports 2.5 times larger batch sizes, enabling more concurrent client processing (see the sketch after this list).4,5
  • Precision and capacity: Expanded memory capacity (144GB, 80% more than the previous generation) allows higher-precision model operations.
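As a rough plausibility check on the scalability result, the sketch below estimates how many concurrent 4K-token requests fit in memory once the weights are resident. It reuses the illustrative figures from earlier (about 35 GB of INT4 weights and about 1.3 GB of KV cache per 4K-token sequence); the 80 GB and 144 GB capacities correspond to the HBM3 and HBM3E systems in these tests.

    # Rough estimate of concurrent 4K-token requests that fit after loading the weights.
    # Real serving stacks add activation memory, fragmentation and framework overhead.
    WEIGHTS_GB = 35      # ~70B parameters quantized to INT4 (assumption)
    KV_PER_SEQ_GB = 1.3  # ~4K-token sequence, from the KV-cache sketch above

    for label, capacity_gb in (("HBM3 (80 GB)", 80), ("HBM3E (144 GB)", 144)):
        max_requests = int((capacity_gb - WEIGHTS_GB) / KV_PER_SEQ_GB)
        print(f"{label}: roughly {max_requests} concurrent 4K-token requests")

The resulting ratio, roughly 2.4 times, lands close to the measured 2.5-times batch-size gain, which is the intuition behind the scalability number: capacity left over after loading the weights goes straight into serving more clients at once.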

These results demonstrate how advanced memory technologies like next-generation HBM can address critical challenges in LLM infrastructure, balancing compute performance with power efficiency6. The improvements in inference performance, capacity, and power use highlight potential pathways for more intelligent and powerful AI systems. Looking ahead, future generations of HBM technology will enable many capabilities, including rapid computational scaling and support for increasingly complex model architectures. Data centers that embrace this technology will be better positioned to deliver faster, more power-efficient, scalable AI services that focus on the user, ultimately advancing progress across industries.


1 More than 1.2 TB/s of bandwidth per cube. Total GPU memory bandwidth varies among AI platforms; NVIDIA's Blackwell GPU, for example, reaches 8 TB/s.

2 Compared to 16GB capacity for the previous generation HBM (HBM3). 

3 We analyzed the performance of Meta Llama 2 70B with DeepSpeed ZeRO-Inference, testing a single NVIDIA HGX H200 (HBM3E) against an NVIDIA HGX H100 (HBM3).

4 Result based on INT4 quantized model execution. When we consider both the higher memory bandwidth and capacity of HBM3E in an NVIDIA H200 system (4.8 TB/s), the inference performance of Llama 2 70B increases by 1.8 times over previous HBM generations.

5 Result based on INT4 quantized model execution. HBM3E enables the processing of 2.5 times more batch sizes (inference requests) than the previous HBM generation, supporting more concurrent clients for a single GPU by processing more data simultaneously.

6 To stress the memory bandwidth, we use BabelStream, a microbenchmark designed to simulate a worst-case scenario requiring maximum bandwidth use. This approach allows us to assess peak memory usage while measuring power consumption. By operating at 100% bandwidth utilization, we can isolate the power draw due to memory. Our results show up to 30% more power consumption at 100% bandwidth utilization for competing HBM3E offerings.

Content Strategy Marketing Lead

Evelyn Grevelink

Evelyn leads the content strategy for the Cloud Memory Business Unit (CMBU) Strategic Marketing team at Micron Technology. She is passionate about acting as a bridge between engineering and marketing through creative, strategic storytelling. Evelyn specializes in writing compelling narratives and designing illustrations to communicate complex concepts for large language models, AI, and advanced memory technologies. She holds a bachelor's degree in physics from California State University, Sacramento.  

Systems Performance Engineer

Felippe Vieira Zacarias

Felippe is a Systems Performance Engineer at Micron Technology, where he works with the Data Center Workload Engineering team to provide an end-to-end systems perspective on understanding memory hierarchy usage for data center workloads. Felippe has extensive expertise in high-performance computing and workload analysis, having worked as a research engineer at renowned supercomputing centers. He holds a Ph.D. in Computer Architecture from Universitat Politècnica de Catalunya.