AMD MI300X vs H100: Cloud LLM Inference, Price-Per-Token
We pitted AMD's new challenger against Nvidia's incumbent for Llama 3 70B inference in the wild.
Tags: gpu, comparison, amd, nvidia, mi300x, h100, inference, llm
The buzz around AMD’s Instinct MI300X has been palpable. Nvidia’s H100 has reigned supreme in the LLM space for a while now, largely due to its raw performance and, crucially, its widespread availability and robust software ecosystem. But AMD has been making noise about the MI300X’s impressive memory bandwidth and capacity—192GB of HBM3 memory. On paper, it looks like a serious contender, especially for memory-hungry large language models. The question, as always, isn’t just about what’s on paper, but what you can actually rent, what it costs, and how it performs in practice.
Availability, Or Lack Thereof
Finding an MI300X instance to rent for a proper comparative test proved to be the first significant hurdle. While a few major cloud providers have announced MI300X offerings, actually spinning one up for a short-term rental, or even getting onto a waitlist with an estimated delivery date, felt like trying to find a unicorn in a data center. We spent two weeks checking daily with various providers, both the hyperscalers and the smaller, more agile outfits. Most listed it as ‘coming soon’ or required an enterprise agreement with significant upfront commitment. One smaller provider had a single MI300X instance available for a fleeting 48 hours, which we promptly snatched up.
In stark contrast, H100 instances are relatively abundant. While demand still means they’re not always instantly available in every region, we’ve had little trouble finding H100s across several platforms for previous reviews, including on platforms like Runpod (see our full Runpod review for details). This difference in accessibility alone is a major factor for anyone looking to deploy an LLM inference service today, not six months from now.
The Cost of Access
Given the scarcity, pricing for the MI300X is less standardized. The instance we managed to secure was priced at $2.70 per hour. This is surprisingly competitive, suggesting providers are keen to get the hardware into users’ hands, or perhaps they’re testing the waters. For our H100 baseline, we settled on an average hourly rate of $2.85, consistent with what we’ve paid on various community and secure cloud platforms for the 80GB variant.
It’s worth noting that these are bare-metal or near-bare-metal GPU prices. Managed services or more abstracted serverless platforms often add a significant premium on top, though their value proposition includes cold-start management and scaling. For a deeper dive into token costs across various providers, you might find our analysis on cheapest Llama 3 70B hosting informative.
Benchmarking Llama 3 70B
To keep things consistent, we deployed a quantized version of Llama 3 70B (specifically, the Q4_K_M variant, which weighs in at roughly 45GB) on both GPUs. Our standard [benchmarking playbook](/blog/benchmark-playbook/) involves a series of inference tasks with varying prompt and generation lengths. For this comparison, we focused on average tokens-per-second throughput for a mix of 512-token prompts and 256-token completions.
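For the curious, here's a minimal sketch of the kind of throughput probe we run. It assumes an OpenAI-compatible completions endpoint (vLLM and similar inference servers expose one); the URL, model id, and prompt below are placeholders rather than our exact harness.

```python
# Minimal throughput probe against an OpenAI-compatible completions
# endpoint (vLLM, TGI, and similar servers expose one). The URL,
# model id, and prompt are placeholders -- adjust for your deployment.
import time
import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical local server
MODEL = "llama-3-70b-q4km"                         # placeholder model id
PROMPT = "Summarize the history of GPU computing." # padded to ~512 tokens in practice

def measure_tokens_per_sec(runs: int = 10) -> float:
    total_tokens, total_time = 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(ENDPOINT, json={
            "model": MODEL,
            "prompt": PROMPT,
            "max_tokens": 256,  # matches our 256-token completion target
        }, timeout=300)
        resp.raise_for_status()
        total_time += time.perf_counter() - start
        # OpenAI-compatible servers report generated tokens under `usage`.
        total_tokens += resp.json()["usage"]["completion_tokens"]
    return total_tokens / total_time

if __name__ == "__main__":
    print(f"~{measure_tokens_per_sec():.0f} tokens/sec")
```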
Here’s what we observed:
| GPU | Model | VRAM Utilized | Average Tokens/Sec |
|---|---|---|---|
| Nvidia H100 | Llama 3 70B Q4_K_M | ~46GB | 175 |
| AMD MI300X | Llama 3 70B Q4_K_M | ~46GB | 205 |
The MI300X consistently outperformed the H100 by a noticeable margin, delivering 205 tokens/second compared to the H100’s 175 tokens/second. This 17% performance uplift is significant, especially for high-throughput inference workloads. The MI300X’s substantial memory bandwidth, along with its 192GB capacity (though not fully utilized by this particular model variant), certainly seems to pay dividends here.
Price-Performance Reality
Now, for the numbers that really matter: the cost per million output tokens. (A short script after the breakdown reproduces the arithmetic.)
Nvidia H100:
- Hourly rate: $2.85
- Tokens per hour: 175 tokens/sec * 3600 sec/hr = 630,000 tokens/hr
- Cost per million tokens: ($2.85 / 630,000) * 1,000,000 = $4.52
AMD MI300X:
- Hourly rate: $2.70
- Tokens per hour: 205 tokens/sec * 3600 sec/hr = 738,000 tokens/hr
- Cost per million tokens: ($2.70 / 738,000) * 1,000,000 = $3.66
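The arithmetic is simple enough to script; a few lines of Python reproduce the figures above and make it easy to plug in your own hourly rates and measured throughput:

```python
# Reproduces the cost-per-million-token figures above.
def cost_per_million_tokens(hourly_rate: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate / tokens_per_hour * 1_000_000

print(f"H100:   ${cost_per_million_tokens(2.85, 175):.2f}")  # -> $4.52
print(f"MI300X: ${cost_per_million_tokens(2.70, 205):.2f}")  # -> $3.66
```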
This is where the MI300X really shines. At $3.66 per million tokens, it’s roughly 19% cheaper than the H100 for this specific LLM inference task. For anyone running large-scale inference, this difference translates to substantial savings over time, assuming you can consistently rent the hardware.
The ROCm Elephant in the Room
Performance and price are compelling, but they don’t tell the whole story. AMD’s software ecosystem, ROCm, is still not as mature or as broadly supported as Nvidia’s CUDA. While progress has been made, especially with PyTorch’s ROCm backend, getting a model to run optimally often requires more fiddling, specific ROCm versions, and a good deal of patience. We found that deploying our Llama 3 instance on the MI300X took about 30% longer to set up than on the H100, primarily due to environment configuration and library compatibility checks. For teams with deep CUDA experience and existing workflows, this migration cost is a real factor. The learning curve for ROCm, while diminishing, is still present.
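To give a flavor of the sanity-checking involved: PyTorch’s ROCm builds reuse the familiar torch.cuda API, so a short script like the following (a generic check, not our exact setup) confirms which backend you’re actually on before any benchmarking starts.

```python
# Quick environment check before benchmarking on the MI300X.
# PyTorch's ROCm builds reuse the torch.cuda API, so the same code path
# works on both vendors; torch.version.hip distinguishes the backend.
import torch

assert torch.cuda.is_available(), "No GPU visible to PyTorch"

if torch.version.hip is not None:
    print(f"ROCm/HIP backend, HIP {torch.version.hip}")
else:
    print(f"CUDA backend, CUDA {torch.version.cuda}")

print(f"Device: {torch.cuda.get_device_name(0)}")

# A quick matmul confirms kernels actually launch on this stack.
x = torch.randn(1024, 1024, device="cuda")
print((x @ x).sum().item())
```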
For most production deployments, the stability and breadth of the CUDA ecosystem remain a significant advantage for Nvidia. Frameworks like JAX, and many niche libraries, often have better or exclusive CUDA support. This isn’t to say ROCm is unusable—it’s improving rapidly—but it’s not a drop-in replacement just yet for all workloads.
Verdict
The AMD Instinct MI300X is a formidable piece of hardware for LLM inference. Our testing showed a clear performance advantage over the H100, which, combined with its competitive hourly rate, results in a significantly lower cost per million tokens. If you can get your hands on an MI300X instance, and you’re comfortable navigating the ROCm ecosystem, it represents a genuinely economical option for Llama 3 70B inference.
However, the caveats are substantial. Availability remains the biggest blocker. As we noted in our H200 availability piece, announced hardware doesn’t always translate to rentable hardware. Furthermore, the software friction with ROCm, while improving, means that for many teams, the H100’s slightly higher cost is a worthwhile premium for the stability, maturity, and broader support of the CUDA ecosystem. For now, the H100 remains the pragmatic choice for most, readily available on platforms like Runpod, but keep an eye on AMD. If they can solve their supply chain and continue refining ROCm, the MI300X could reshape the LLM inference landscape within the next year.