Earlier this month, we reported on ExaAILabs’s Exacluster, a cluster of 18 machines with 144 Nvidia H200 GPUs and one of the first clusters based on these processors. Since then, Hydra Host, the company that facilitated the construction of the cluster, has given us additional details about the system. The cluster uses Lenovo servers with multiple customizations from Hydra Host, which played a significant role in the build. The cluster can also be rented through Hydra’s Brokkr platform when its owner is not using it.
A Lot of Compute Power
The cluster’s backbone consists of 18 Lenovo nodes with eight Nvidia H200 GPUs each, for a total of 144 GPUs and 20TB of HBM3E memory, enabling compute performance of up to 570 FP8 petaFLOPS for AI. Sixteen nodes are configured and fine-tuned by Hydra Host for training, which requires massive computation and memory performance, while the remaining two serve as inference nodes. In addition, Hydra Host installed its Brokkr platform for GPU provisioning, management, and remote renting (more on this later).
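The aggregate figures above can be reproduced with simple arithmetic. The sketch below assumes per-GPU values of 141GB of HBM3E and 3,958 FP8 TFLOPS (the with-sparsity peak); those two numbers come from Nvidia's public H200 SXM specifications, not from Hydra Host, so treat them as assumptions.

```python
# Figures stated in the article
NODES = 18
GPUS_PER_NODE = 8

# Assumptions: per-GPU values from Nvidia's public H200 SXM spec sheet
HBM3E_GB_PER_GPU = 141
FP8_TFLOPS_PER_GPU = 3958  # peak, with sparsity

total_gpus = NODES * GPUS_PER_NODE
total_hbm_tb = total_gpus * HBM3E_GB_PER_GPU / 1000        # GB -> TB (decimal)
total_fp8_pflops = total_gpus * FP8_TFLOPS_PER_GPU / 1000  # TFLOPS -> PFLOPS

print(f"GPUs: {total_gpus}")
print(f"HBM3E: {total_hbm_tb:.1f} TB")
print(f"FP8 peak: {total_fp8_pflops:.0f} PFLOPS")
```

Under these assumptions the totals line up with the quoted 144 GPUs, roughly 20TB of HBM3E, and about 570 petaFLOPS of FP8 compute. The compute figure is a theoretical peak, not a measured result.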
Hydra Host collaborated with Computacenter to design a high-performance networking architecture tailored to the cluster’s needs. The setup uses 3.2Tbps InfiniBand for east-west traffic and 400Gbps Ethernet for north-south communication, including dual 200Gbps connections per server and 400Gbps Dell Ethernet switches. Computacenter’s networking engineers ensured all components aligned with Nvidia’s reference architecture for seamless compatibility.
“We supplied the 18 Lenovo nodes with H200 GPUs (16 interconnected and two inference nodes), designed the networking architecture in collaboration with Computacenter, and facilitated colocation through Patmos,” explained Andrea Holt, a spokesperson for Hydra Host.
The cluster itself is quite powerful, even in terms of general-purpose computing. The servers feature 36 96-core processors (two per node, for a total of 3,456 cores) paired with 36TB of DDR5 memory and 270TB of NVMe solid-state storage. There are spare bays so that storage space can be expanded easily. The supercomputer uses a network custom-built by Hydra Host.
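The totals in this paragraph can be broken down per node. Dividing 3,456 cores by 96 cores per processor gives 36 processors, which works out to two sockets and 192 cores per server. The sketch below assumes resources are split evenly across all 18 nodes, which the article does not state explicitly.

```python
# Figures stated in the article
NODES = 18
TOTAL_CORES = 3456
CORES_PER_CPU = 96
TOTAL_DDR5_TB = 36
TOTAL_NVME_TB = 270

# Assumption: every node has an identical CPU, memory, and storage configuration
total_cpus = TOTAL_CORES // CORES_PER_CPU
cpus_per_node = total_cpus // NODES
cores_per_node = TOTAL_CORES // NODES
ddr5_tb_per_node = TOTAL_DDR5_TB / NODES
nvme_tb_per_node = TOTAL_NVME_TB / NODES

print(f"CPUs: {total_cpus} total, {cpus_per_node} per node")
print(f"Cores per node: {cores_per_node}")
print(f"DDR5 per node: {ddr5_tb_per_node:.0f} TB")
print(f"NVMe per node: {nvme_tb_per_node:.0f} TB")
```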
Hydra Host also brought in Patmos to handle colocation, providing enough power (around 100kW) and cooling for the power-hungry and hot machines.
Best Performance at Best Price
The Exacluster costs $5 million, or roughly $277,778 per machine on average, which is comparable to the price of a single 8-way H200 baseboard rather than a full server. Here is where it gets interesting: who facilitated that price?
On the one hand, Hydra Host is a close Nvidia partner and only offers Nvidia GPUs as a service. In addition, its Brokkr software is optimized primarily for CUDA. On the other hand, ExaAI is a company backed by Nvidia, so it can potentially get preferential pricing.
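For reference, the per-machine average quoted for the $5 million cluster is simple division, and the same calculation gives a rough per-GPU figure. This sketch attributes the entire price to the nodes and GPUs, which is a simplification: the article does not break out how much went to networking, storage, or services.

```python
# Figures stated in the article
TOTAL_COST_USD = 5_000_000
NODES = 18
GPUS_PER_NODE = 8

# Simplifying assumption: the whole price is spread evenly across nodes/GPUs
cost_per_node = TOTAL_COST_USD / NODES
cost_per_gpu = TOTAL_COST_USD / (NODES * GPUS_PER_NODE)

print(f"Average per node: ${cost_per_node:,.0f}")
print(f"Average per GPU:  ${cost_per_gpu:,.0f}")
```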
“We are best in market at getting our customers the right GPU for their needs and at the best price,” said Ryan Horjus, Lead Sales Engineer at Hydra. “This cluster was supported by Nvidia from an architecture design and their Inception program. Hydra handled it for Exa, as we do for other companies.”
Hydra also specializes in building custom solutions for startups and even monetizes their machines when not in use.
“Hydra has helped startups get into their own clusters for better pricing through bulk purchasing,” Horjus added. “They can achieve ideal pricing through our network. They are also able to monetize the servers when not in use via the Brokkr management platform.”
Speaking of Brokkr, it is GPU management and provisioning software as well as a monetization platform for GPUs. It provides datacenters and startups with a turnkey software solution for getting their hardware into customers’ hands and getting paid for it, explained Ariel Deschapell, chief technology officer and co-founder of Hydra.
“One of its key features is automated bare metal provisioning and lifecycle management,” described Deschapell. “That means the platform does all the work of configuring and managing the base server OS and firmware, setting up drivers and other supporting software, and running tests on the GPUs and other components. That speeds up and standardizes the delivery process significantly, reducing idle time on servers and GPUs. It also makes it easy to resell unused servers later to other users on the Brokkr platform looking for bare metal GPUs, if capacity needs change.”