Ship Racks with Performance Confidence.

Automated AI rack and cluster validation. Ready to generate revenue at delivery on Day 1.

Data Centers Receive Racks Failing to Perform / Generate Revenue on Day 1

“Data centers receive misassembled, misconfigured, dead-on-arrival components.” — Together AI

Components tested in isolation, not as a system under AI workloads.

Silent failures like RCCL bandwidth and NVLink errors go undetected.

Operators discover underperformance only after the rack is live.

$2M+

at stake in a single 32-GPU B200 rack.

Every additional week in L12 validation or post-delivery debug is a direct hit to OEM gross margin and operator time-to-revenue.

ClusterReady closes this gap — before the rack ships.

ClusterReady

Validate with AI workloads and XPerf’s proprietary test suite at 10× faster validation. Get the remedy and optimization — not just the failure.

10× FASTER

Faster Validation

AI workloads + XPerf proprietary test suite take validation from weeks to hours.

HIDDEN ISSUES

Find What L11/L12 Misses

Silent failures — RCCL bandwidth, NVLink errors — caught before shipment.

SUGGESTED FIX

Root Cause + Remedy

Not just a failure flag — the ONLY tool providing diagnosis, remedy, and optimization.

See ClusterReady in action

Bring Assembled Rack to Production Ready

Four automated stages take every rack from arrival to a verified, optimized, revenue-ready state.

01/

Pre-Flight Health Check

30+ automated checks. GPU · CPU · memory · network · storage.

02/

AI Workloads Validate

MLPerf benchmarks + XPerf database. Catches what standard testing misses.

03/

Stress Testing — Push to Limits

XPerf’s patented stress testing algorithm. Every server and every link, at peak. Surfaces marginal hardware before customers do.

04/

Production Ready — Fix & Optimize

Suggested remedy on every finding. Suggested optimization guidance.

Validated at Every Layer

ClusterReady tests the rack as a system — hardware, software, and AI workloads — not isolated components.

Hardware
  • GPU count, model & health.
  • PCIe speed, width & ACS.
  • IB / RoCE ports & link quality.
  • Transceiver & cabling health.
  • Storage filesystem I/O.
  • Thermals & power draw.
Software
  • OS, kernel & IOMMU config.
  • GPU drivers (NVIDIA & AMD).
  • GPUDirect / RDMA stack.
  • Container GPU passthrough.
  • Firewall & security state.
  • Cross-node connectivity.
AI Workloads
  • Tokens / sec throughput.
  • Watt / token efficiency.
  • TTFT, TPOT & p50–p99 latency.
  • Training MFU % & convergence.
  • Multi-node scaling.
  • LLM (LLaMA, Mixtral), Text-to-Image (FLUX), Text-to-Video (Wan2.2), Text-to-Speech (Whisper).

Weeks of Uncertainty → Hours of Confidence

Before·L11/L12 only
  • 2–6 weeks to validate.
  • Components only — no system test.
  • No AI workloads.
  • Silent failures missed.
  • Manual triage.
Avg. validation: 2–6 weeks
After·With ClusterReady
  • Hours or days to validate.
  • Full rack/cluster tested as a system.
  • Inference + training workloads.
  • Hidden failures caught pre-shipment.
  • Root cause + remedy + optimization.
Avg. validation: Hours to Days

Why ClusterReady Wins

ClusterReady complements L11/L12 testing — what L11/L12 ships, ClusterReady proves at the system level.

XPerf ClusterReady
Test full rack?
Yes
Use AI workloads?
Yes
Automated diagnosis + suggested remedy?
Yes
Serve any OEM / ODM?
Yes
Time to validate
2–8 hours for healthy racks
L11/L12 Testing
Test full rack?
No
Use AI workloads?
No
Automated diagnosis + suggested remedy?
No
Serve any OEM / ODM?
No
Time to validate
2–6 weeks
NVIDIA / AMD Tools
Test full rack?
No
Use AI workloads?
Synthetic only
Automated diagnosis + suggested remedy?
No
Serve any OEM / ODM?
Yes
Time to validate
Hours
OEM Built-in (Dell, SMC)
Test full rack?
Yes
Use AI workloads?
Dell yes / SMC partial
Automated diagnosis + suggested remedy?
No
Serve any OEM / ODM?
No
Time to validate
Days–weeks
Custom Scripts
Test full rack?
Varies by team
Use AI workloads?
Some custom
Automated diagnosis + suggested remedy?
No
Serve any OEM / ODM?
No
Time to validate
Weeks
Aptly Technology
Test full rack?
SMC + NVIDIA only
Use AI workloads?
Yes
Automated diagnosis + suggested remedy?
No
Serve any OEM / ODM?
Yes
Time to validate
Weeks

Works With the GPUs You Ship Today

Coverage across both major accelerator vendors and current network fabrics, with new SKUs added continuously.

AMD logo
MI300X | MI325X | MI350X | MI355X
256GB HBM3e validated | RoCE/RDMA
NVIDIA logo
GB300 | B300 | H200 | H100
NVLink 4/5 validated | InfiniBand | NVLink, RoCE

AMD MI325X Rack — First Public Performance Result

4 nodes · 32× AMD MI325X GPUs · 256 GB HBM3e · RoCE / RDMA

FLUX.1-DEV+20%

vs AMD’s previous performance results.

Multi-Node95.8%

scaling efficiency from 1 node → 4 nodes.

Llama 2 70B33,695

tokens/sec inference, single node.

RCCL10×

bandwidth gain from RCCL config optimization.

The first published Llama 3.1 pretraining performance result on MI325X submitted to MLPerf Training v6.0.

Download ClusterReady

Available on macOS, Windows, and Linux. Pick your platform to grab the latest build.

Windows

x64

Loading…Windows 10 or later

macOS

Apple Silicon & Intel

Loading…macOS 11.0 or later

Linux

x64

Loading…Ubuntu 22.04 or compatible

About XPerf Inc.

Founded by ex-Intel engineers with extensive experience deploying clusters with thousands of accelerators, XPerf Inc. is the control plane for data centers — AI infrastructure software for GPU cluster performance validation and optimization, taking an AI-native approach to cluster operation problems. Austin / Round Rock, Texas.

alex.carter@xperf.ai
Austin/Round Rock, TX