
Bridging the AI Inference Gap: Hyper-Scalers vs Enterprise

Until now, the AI landscape has been heavily focused on developing new AI Training models and catering to the needs of hyper-scalers – the major cloud service providers such as Amazon Web Services, Google Cloud Platform (GCP), IBM, Microsoft Azure, and Oracle. Unfortunately, this emphasis on the top tier of the data center market has left AI Inference in the shadows, causing 75% of U.S. businesses and government entities to lag. Currently, only a mere 25% of U.S. businesses have adopted AI, often opting for pre-packaged solutions in the cloud rather than implementing their own customized generative AI solutions.

Now, we are starting to see early signs that the tech industry is awakening to the crucial need for a collective effort to revamp and modernize AI system architecture. Without this essential shift, companies risk being left behind as Nvidia and other innovators continue down a path focused solely on raw performance, potentially hindering mass adoption of AI in the business world.

Efficient AI infrastructure IS key to success

At the AIAI Summit last month in San Jose, NeuReality CEO Moshe Tanach highlighted the high cost of AI Inference due to outdated CPU infrastructure, and urged fellow engineers and technologists to turn their attention to AI data center efficiency over raw performance in handling multi-modal AI queries and responses.

It’s a multi-faceted problem, requiring system-level thinking to reroute the AI pathways, so to speak. In our previous blog, we dove into the distinctions between AI Training and AI Inference, emphasizing the pitfalls of using the same AI system architecture and hardware for both and the resulting waste of time and dollars.

In this post, we delve into the key disparities in AI Inference data center setups between hyper-scalers and enterprise customers across various sectors like retail, healthcare, government, and more. These differences have stacked up against the majority, leaving them as the AI “have-nots” unless the tech industry unites to provide Affordable Intelligence with a heightened sense of urgency.

The AI Infrastructure Mismatch

A fundamentally different AI Inference system is essential for mainstream business and government to fully leverage AI’s potential in public safety, medical advancements, personalized experiences, and more.

Moshe shared the chart below to break down the differences that were top of mind when NeuReality set out to design the ideal AI Inference solution.

[Chart: GPU, energy, location, and business-purpose differences in AI Inference hardware and systems, hyper-scalers versus enterprise]

Let’s delve into specifics. Nvidia GPUs were primarily designed for AI Training rather than AI Inference, and the systems built around them resemble supercomputers in their capabilities. As a result, the DGX-H100 and newer Blackwell models are excessive in both cost and power for typical enterprise servers. Businesses often end up overspending on GPUs that are not even fully utilized in AI deployment, mainly due to significant CPU performance limitations.

Consequently, businesses and cloud providers find themselves purchasing more GPUs to compensate for this inefficiency, resulting in unnecessary expenses and wasted resources. Moreover, lower-margin businesses with limited financial resources face significant barriers to entry into the AI market, especially when GPU prices soar to $50,000 or even $100,000 during periods of low supply.
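To make the economics concrete, here is a back-of-envelope sketch. The throughput, price, and target-load numbers below are illustrative assumptions, not NeuReality benchmarks, but they show how low utilization directly inflates the GPU count, and therefore the budget, required to serve a fixed inference load:

```python
import math

def gpus_needed(target_qps: float, qps_per_gpu_peak: float, utilization: float) -> int:
    """GPUs required when each GPU reaches only `utilization` of its peak throughput."""
    effective_qps = qps_per_gpu_peak * utilization
    return math.ceil(target_qps / effective_qps)

TARGET_QPS = 10_000      # queries per second the service must sustain (assumed)
PEAK_QPS_PER_GPU = 500   # peak inference throughput of one GPU (assumed)
GPU_PRICE = 30_000       # dollars per GPU (assumed)

for util in (0.30, 0.60, 1.00):
    n = gpus_needed(TARGET_QPS, PEAK_QPS_PER_GPU, util)
    print(f"utilization {util:.0%}: {n} GPUs, ~${n * GPU_PRICE:,}")
# utilization 30%: 67 GPUs, ~$2,010,000
# utilization 60%: 34 GPUs, ~$1,020,000
# utilization 100%: 20 GPUs, ~$600,000
```

Under these assumed numbers, a CPU-bottlenecked fleet stuck at 30% utilization buys more than three times the GPUs, and the dollars, that a fully utilized one would.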

Real Estate realities

Hyper-scalers run extensive data centers spanning millions of square feet and housing thousands of servers, strategically optimizing their IT infrastructure for maximum efficiency and scalability. These mammoth facilities are typically situated away from bustling urban hubs, encompassing both the data centers themselves and their power sources.

In contrast, other businesses tend to position their data centers in closer proximity to their main headquarters or key campuses, whether nestled in the heart of the mid-west, along the east coast, or down in the southern regions. While over 75% of these operations rely on traditional electrical power, an increasing number are transitioning towards more sustainable energy sources like hydroelectric, geothermal, wind, and solar power.

Geography plays a crucial role, and here’s why. The U.S. Energy Information Administration notes that half of the nation’s hydroelectric capacity is concentrated in just three states – Washington, California, and Oregon. Moreover, Washington, Idaho, Oregon, and Vermont rely on hydroelectric facilities for at least half of their in-state utility-scale energy generation. On the other hand, some states have limited hydroelectric potential, while states like Delaware and Mississippi lack utility-scale hydroelectric facilities altogether.

This disparity puts enterprise customers at a cost and energy disadvantage, impacting their profit margins.

It doesn’t have to be that way. NeuReality designed NR1 to reduce energy consumption at the source: the AI data centers. That means the U.S. can do far more than thoughtlessly build out ever-bigger electrical grids at the environment’s expense; it can shift the mindset and reduce skyrocketing demand through far more energy-efficient AI data centers.

Mismatch in power

This brings us to the next challenge overlooked in the realm of modern GPU designs, particularly Nvidia’s latest generation of Blackwell GPUs. Frankly, these powerhouse GPUs are “overkill” for the average business relying on power from local utilities, where the typical rack can only support an average of 10-15 kilowatts.

“Customers cannot even deploy Nvidia DGX-H100 due to power consumption, and now they’re moving from 700W GPUs to 900-1200W GPUs requiring 120KW racks. That might work for a hyper-scaler but it’s hardly something to be excited about for the rest of us. Today’s data centers are limited to 13KW per rack!” explains Moshe.

“What blows my mind is that deep inside the data center, those Blackwell GPUs are still under-utilized and the problem will only worsen every year,” he says. “With Generative AI, that system inefficiency will only grow. It’s critical to fortify your AI infrastructure now.”
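That power gap is straightforward to quantify. Here is a rough sketch using the rack and GPU wattages quoted above; the non-GPU overhead per 8-GPU server (CPUs, NICs, fans, and so on) is an assumption for the example:

```python
# Rough rack power budgeting using the figures quoted above. The non-GPU
# overhead per 8-GPU server is an assumed number, not a measured one.

SERVER_OVERHEAD_W = 1_500   # assumed non-GPU power per 8-GPU server
GPUS_PER_SERVER = 8

def gpus_per_rack(rack_budget_w: int, gpu_w: int) -> int:
    """Whole servers that fit within the rack power budget, expressed as a GPU count."""
    server_w = GPUS_PER_SERVER * gpu_w + SERVER_OVERHEAD_W
    return (rack_budget_w // server_w) * GPUS_PER_SERVER

for rack_w in (13_000, 120_000):     # typical enterprise rack vs. hyper-scaler rack
    for gpu_w in (700, 1_200):       # 700 W class vs. 900-1200 W class GPUs
        print(f"{rack_w // 1000} kW rack, {gpu_w} W GPUs: {gpus_per_rack(rack_w, gpu_w)} GPUs")
# 13 kW rack:  a single 8-GPU server fits at either wattage (~7.1 kW or ~11.1 kW)
# 120 kW rack: 128 GPUs at 700 W, 80 GPUs at 1,200 W
```

In other words, at enterprise rack power budgets one server of next-generation GPUs can consume nearly the entire rack, while the 120 kW racks these designs assume only exist inside hyper-scaler facilities.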

50-90% cost savings with NR1 NAPU™

Recognizing the distinct design needs of hyper-scalers versus enterprises (businesses and governments large and small), the NeuReality team ideated, designed, and delivered the innovative NR1 system exclusively for scalable AI inference. Benchmarks to date show significant cost savings of 50-90% depending on the AI application, delivering affordable intelligence and superior performance while using less space and less power.

NR1 is not your average AI chip; it elevates and complements the capabilities of all AI Accelerator chips within an optimal AI data center configuration. With compatibility across GPUs, FPGAs, or ASICs, the NR1-S™ AI Inference Appliance raises GPU utilization from 30% to 100%, allowing you to extract more value from your GPU investments without the need to purchase excess hardware.

And, unlike other AI Inference solutions, NeuReality offers 100% linear scalability at the hardware level, ensuring no drop-offs in performance as AI workloads grow now and into the future.
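To see why that matters, here is a purely illustrative comparison of linear versus sub-linear scaling as an inference cluster grows; the 0.92 per-doubling efficiency of the hypothetical "typical" system and the base throughput are assumed numbers, not measured results for any product:

```python
import math

BASE_QPS = 1_000  # throughput of a single appliance (assumed)

def scaled_qps(nodes: int, efficiency_per_doubling: float) -> float:
    """Aggregate throughput when each doubling of nodes keeps only a fraction of the ideal gain."""
    doublings = math.log2(nodes)
    return BASE_QPS * nodes * (efficiency_per_doubling ** doublings)

for nodes in (1, 4, 16, 64):
    linear = scaled_qps(nodes, 1.00)   # 100% linear scaling
    lossy = scaled_qps(nodes, 0.92)    # hypothetical sub-linear system
    print(f"{nodes:>3} nodes: linear {linear:>8,.0f} qps | sub-linear {lossy:>8,.0f} qps "
          f"({lossy / linear:.0%} of ideal)")
# At 64 nodes the hypothetical sub-linear system delivers only ~61% of the ideal throughput.
```

Small per-step losses compound: in this sketch, a system that gives up 8% per doubling ends up leaving roughly 40% of its aggregate capacity on the table by the time it reaches 64 nodes.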

Key Takeaways:

  • Significant Disparities in Data Center Infrastructure Needs: Understand the key differences in AI economics, energy, and geography between a hyper-scaler (where the data is their core business) and a typical business (where data centers are a business expense).

  • Business and Technology Differences between AI Training and AI Inference: Gain insights into why top-performing AI Training GPUs are not the optimal choice for AI Inference setups in businesses, governmental organizations, and the majority of cloud service providers. Using the same AI hardware and system architecture for both is a waste of your time and money. 

  • Nvidia Power Overkill: In the realm of everyday AI data center operations, it’s not about having the most powerful GPU, but rather about having the most sustainable, efficient AI system infrastructure. Build the best AI data highway versus the fastest race car. While the average enterprise data center operates at 10-15 KW per rack, Nvidia’s designs cater to hyper-scalers, which may demand 120KW racks – an excessively high power requirement for most businesses and governments that average 13 KW per rack.

  • Take Action Now: Request a comparative analysis of price and performance for the NR1 NAPU + AI Accelerator versus CPU (x86) + AI Accelerator. Whether you’re in discussions with your cloud service provider, server OEM, or AI hardware supplier (e.g. AMD, IBM, Qualcomm, Lenovo, Supermicro) for future AI Inference deployment, be sure to ask. And, brace yourself for mind-blowing differences in NR1 price/performance coupled with your AI Accelerator of choice. Or, ask us directly. We’ll be happy to walk through the numbers in detail.