At scale, H100 GPUs fail frequently. Meta's Llama 3.1 training run — 16,384 H100 GPUs over 54 days — logged 419 unexpected interruptions, roughly one failure every 3 hours. GPU hardware and HBM3 memory accounted for 52.5% of those interruptions (according to Meta's Llama 3.1 paper and DCD reporting).
This guide covers the primary H100 failure modes documented in published research, what causes them, how to identify them, and what the repair options look like. Every statistic cited here comes from a published study, manufacturer data, or verified reporting.
1. HBM3 Memory Failures
HBM3 memory shows the largest reliability regression of any H100 component relative to the A100. Per a March 2025 study covering 11.7 million GPU-hours (arXiv 2503.11901), the H100's HBM3 has a per-GPU mean time between errors (MTBE) of 88,768 hours — 3.2 times worse than the A100's HBM2e at 283,271 hours. The same study found that H100 row remapping (a built-in error mitigation) succeeds only 59% of the time.
Meta's Llama 3.1 training data confirms this: HBM3 memory accounted for 72 of the 419 unexpected interruptions (17.2%), making it the second-largest single failure category after general GPU hardware faults (148 interruptions, 35.3%).
Symptoms
- Uncorrectable ECC errors reported by nvidia-smi — the most reliable early indicator. The March 2025 study found that ECC error-recovery mechanisms mitigate approximately 92% of uncorrectable memory errors, but the remaining 8% cause job failures.
- Reduced memory capacity — the GPU reports less than 80 GB available. The firmware may disable a failed HBM stack entirely. A quick programmatic check is sketched after this list.
- Training job crashes — out-of-memory errors or CUDA memory corruption faults during AI workloads.
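For the reduced-capacity symptom, a quick programmatic check is possible through the NVML Python bindings (the nvidia-ml-py package, imported as pynvml). This is a minimal sketch, not a validated diagnostic; the 79 GB lower bound is an illustrative assumption rather than an NVIDIA specification.

```python
# Minimal sketch: flag H100s reporting noticeably less than the nominal 80 GB of HBM3.
# Requires the nvidia-ml-py package (imported as pynvml). The 79 GB threshold is an
# illustrative assumption, not an NVIDIA specification.
import pynvml

EXPECTED_BYTES = 79 * 10**9  # assumed lower bound for a healthy 80 GB H100

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        status = "OK" if mem.total >= EXPECTED_BYTES else "CHECK: possible disabled HBM stack"
        print(f"GPU {i}: {mem.total / 1e9:.1f} GB reported - {status}")
finally:
    pynvml.nvmlShutdown()
```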
Causes
- Thermal stress — published research indicates HBM errors increase exponentially above 75 degrees C. The H100 SXM5 draws up to 700W TDP, creating significant thermal load on HBM stacks.
- Manufacturing variability — the 3.2x worse MTBE compared to A100 suggests that HBM3's higher density and bandwidth come at a reliability cost.
- Thermal cycling — repeated heat-cool cycles stress micro-bump interconnects between HBM die layers.
Diagnosis Approach
Start with nvidia-smi -q -d ECC to pull ECC error counts. Non-zero uncorrectable errors point directly to HBM. Follow up with cuda-memcheck --tool memcheck (or compute-sanitizer --tool memcheck on CUDA 12 and later, which replaces the deprecated cuda-memcheck) to identify which memory address ranges are affected. NVIDIA's RMA process requires diagnostic logs from nvidia-bug-report and the Field Diagnostic tool.
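The same counters that nvidia-smi -q -d ECC prints are also exposed through NVML, which makes fleet-wide sweeps easier to script. The sketch below uses the nvidia-ml-py (pynvml) bindings and assumes a driver recent enough to expose the row-remapping query; it is a starting point, not a substitute for NVIDIA's Field Diagnostic.

```python
# Sketch: pull the ECC and row-remapping counters behind `nvidia-smi -q -d ECC` via NVML.
# Requires nvidia-ml-py (pynvml) and a driver new enough to expose row-remapping queries.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # Volatile = since last reboot; aggregate = lifetime of the board.
        volatile_ue = pynvml.nvmlDeviceGetTotalEccErrors(
            handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_VOLATILE_ECC)
        aggregate_ue = pynvml.nvmlDeviceGetTotalEccErrors(
            handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_AGGREGATE_ECC)
        # Row remapping is the mitigation the March 2025 study found succeeds ~59% of the time.
        corr_rows, unc_rows, pending, failed = pynvml.nvmlDeviceGetRemappedRows(handle)
        print(f"GPU {i}: uncorrectable ECC volatile={volatile_ue} aggregate={aggregate_ue} "
              f"remapped(corr={corr_rows}, unc={unc_rows}) pending={pending} remap_failed={failed}")
        if volatile_ue or failed:
            print(f"GPU {i}: candidate for HBM diagnosis and RMA log collection")
finally:
    pynvml.nvmlShutdown()
```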
Repair Method
HBM repair uses BGA (Ball Grid Array) rework — removing the failed HBM stack, cleaning the substrate, placing a new HBM module, and reflowing the solder. According to Reuters reporting (July 2025), approximately 12 repair shops in Shenzhen perform this work on H100 GPUs, with the largest handling around 500 GPUs per month. These shops validate repairs using 256-node server room testing.
Based on general BGA rework data (not H100-specific), success rates range from 70-96% overall, with 85% for outer rows and 60% for inner rows. Server boards are generally limited to 2 rework cycles before copper dissolution becomes an issue.
Repair costs range from $1,400 to $8,000 depending on provider and complexity (per Reuters, Shenzhen shops charge $1,400-$2,800). Get a diagnostic assessment.
2. GPU Hardware Failures
GPU hardware failures — covering the GPU die, power delivery, and associated board components — are the single largest failure category. In Meta's Llama 3.1 training, GPU hardware faults accounted for 148 of 419 interruptions (35.3%). A separate Meta cluster study (arXiv 2410.21680) found that among "lemon nodes" — nodes with recurring failures — GPU issues were the top root cause at 28.2%, followed by DIMM at 20.5% and PCIe at 15.4%.
ByteDance's analysis (arXiv 2509.16293, covering 778,000 jobs) provides additional granularity: CUDA errors accounted for 36.1% of failures and CPU overload for 11%, while GPU memory errors were 0.3% (suggesting most memory issues manifest as CUDA errors rather than direct memory faults).
VRM (Voltage Regulator Module) Failures
The H100 SXM5 draws up to 700W TDP, placing extreme demands on its voltage regulator modules. VRM repair is a confirmed service offered by Shenzhen repair shops (per Reuters), involving power-delivery circuit diagnosis and component replacement. However, no published H100-specific VRM failure rate data exists — the 700W power envelope makes VRM stress plausible, but we cannot cite a specific percentage.
Symptoms
- Board won't POST — the GPU is detected by the system but fails to initialize.
- Unexpected shutdowns under load — the board powers on at idle but trips overcurrent or undervoltage protection under workload.
- Power throttling beyond normal limits — nvidia-smi shows power limit throttling even when the power target hasn't been reached (see the sketch below).
- CUDA errors — ByteDance's data shows CUDA errors as the largest single failure category (36.1%), which can stem from various hardware issues including power delivery problems.
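One way to separate ordinary power-cap throttling from the suspicious kind is to read the throttle-reason bitmask and the instantaneous power draw together. The following sketch uses the nvidia-ml-py (pynvml) bindings; the 90% headroom heuristic is an illustrative assumption, not a documented threshold.

```python
# Sketch: spot power/thermal throttling that occurs well below the enforced power limit,
# one of the VRM-related symptoms described above. Uses nvidia-ml-py (pynvml).
import pynvml

HEADROOM = 0.90  # illustrative assumption: flag throttling below 90% of the power limit

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
        limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0
        sw_power_cap = bool(reasons & pynvml.nvmlClocksThrottleReasonSwPowerCap)
        hw_slowdown = bool(reasons & pynvml.nvmlClocksThrottleReasonHwSlowdown)
        if (sw_power_cap or hw_slowdown) and power_w < HEADROOM * limit_w:
            print(f"GPU {i}: throttling at {power_w:.0f} W of a {limit_w:.0f} W limit "
                  f"(sw_power_cap={sw_power_cap}, hw_slowdown={hw_slowdown}) - investigate power delivery")
finally:
    pynvml.nvmlShutdown()
```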
Diagnosis Approach
Measure voltage rails with an oscilloscope at the VRM output test points. Thermal imaging under load reveals hot spots on specific power stages. NVIDIA's RMA process requires nvidia-bug-report and Field Diagnostic logs. For DGX systems, the warranty is 3 years minimum, extendable to 5. OEM-channel GPUs go through the OEM for RMA, not NVIDIA directly.
Repair Method
VRM repair involves identifying and replacing failed power stage components — typically MOSFETs, gate drivers, or output capacitors. This is a confirmed repair service in the Shenzhen repair ecosystem (per Reuters). Component-level replacement is performed under microscope with hot-air rework, followed by load validation.
Repair costs: $1,400-$4,000 depending on provider and complexity. Request a diagnostic.
3. Thermal Issues
NVIDIA keeps H100 thermal specifications proprietary (confirmed by an NVIDIA employee on public forums). However, empirical data provides useful reference points: throttling is commonly observed at 83-88 degrees C, and published research indicates HBM errors increase exponentially above 75 degrees C.
With the H100 SXM5 drawing up to 700W TDP, thermal management is critical. Cooling system failures or inadequate airflow can push components into damage zones, particularly affecting HBM reliability.
Symptoms
- Persistent thermal throttling — even with proper cooling restored, the GPU throttles at temperatures lower than expected, indicating sensor damage or degraded thermal interface.
- Intermittent compute errors — correct output at low temperatures, errors at higher temps. Temperature-dependent failures strongly suggest damaged solder joints or thermally stressed HBM.
- Escalating ECC errors under load — given that HBM errors increase exponentially above 75 degrees C, thermal issues often first manifest as rising memory error counts during sustained workloads; the sketch below shows one way to watch for that correlation.
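Because the temperature and memory-error signals are correlated, it helps to sample them together during a sustained workload. The sketch below (nvidia-ml-py / pynvml) polls the GPU die temperature as a proxy alongside the volatile uncorrectable ECC counter; the 75 degrees C alert level mirrors the research figure cited above, and the polling interval is arbitrary.

```python
# Sketch: poll temperature and uncorrectable ECC counts together to surface
# temperature-correlated HBM errors. Uses nvidia-ml-py (pynvml). The 75 C alert
# threshold and the 60 s interval are illustrative choices, not NVIDIA specifications.
import time
import pynvml

ALERT_TEMP_C = 75   # mirrors the published HBM error-growth threshold cited above
POLL_SECONDS = 60   # arbitrary sampling interval

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
last_ue = pynvml.nvmlDeviceGetTotalEccErrors(
    handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_VOLATILE_ECC)
try:
    while True:
        # GPU die temperature; per-stack HBM temperatures are not exposed the same way.
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        ue = pynvml.nvmlDeviceGetTotalEccErrors(
            handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_VOLATILE_ECC)
        new_errors = ue - last_ue
        last_ue = ue
        flag = " <-- new ECC errors at elevated temperature" if new_errors and temp >= ALERT_TEMP_C else ""
        print(f"{time.strftime('%H:%M:%S')} temp={temp}C new_uncorrectable={new_errors}{flag}")
        time.sleep(POLL_SECONDS)
finally:
    pynvml.nvmlShutdown()
```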
Causes
- Cooling system failure — fan failures, coolant leaks (in liquid-cooled systems), clogged heatsinks, or thermal pad degradation.
- Environmental overload — datacenter HVAC failure causing ambient temperature to exceed design limits.
- Chronic thermal cycling — daily power-cycle patterns creating repeated thermal expansion/contraction stress on solder joints.
Repair Method
For cracked solder joints caused by thermal stress: controlled reflow using a profiled thermal cycle. For severe cases: complete reball. Based on general BGA rework data, success rates range from 70-96%, though boards are limited to approximately 2 rework cycles.
Cost: $1,400-$6,000 depending on provider and extent of damage. Get an assessment.
4. PCIe and System-Level Failures
Meta's lemon node analysis (arXiv 2410.21680) identified PCIe issues as the third most common root cause, responsible for 15.4% of recurring node failures. PCIe failures affect the GPU's communication with the host system and can be difficult to distinguish from GPU-level faults without systematic diagnosis.
Symptoms
- GPU not detected by system — complete communication failure due to damaged PCIe lanes or connectors.
- Intermittent detection — GPU appears and disappears from the device list.
- Reduced bandwidth — PCIe link training at a lower width or speed than expected.
- Post-shipping failures — a board that worked before shipping arrives non-functional due to mechanical stress on connectors or traces.
NVLink: High Reliability but System-Level Complexity
Notably, the March 2025 study (arXiv 2503.11901) found zero NVLink errors in 2.1 million GPU-hours of H100 operation. This is in sharp contrast to the A100, which experienced 1,922 NVLink errors in the same study. NVLink4 on the H100 appears significantly more reliable than NVLink3 on the A100 at the link level.
However, NVLink configuration and connectivity issues can still occur at the system level — topology mismatches, fabric manager errors, and connector wear from maintenance operations remain practical concerns in DGX/HGX deployments.
Diagnosis Approach
Use nvidia-smi topo -m to verify expected NVLink topology. For PCIe issues, check link width and speed via lspci. NVIDIA's diagnostic tools (nvidia-bug-report, Field Diagnostic) are required for RMA evaluation and provide detailed interconnect status.
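The link width and generation values that lspci reports can also be read per GPU through NVML, which makes it straightforward to sweep a fleet for down-trained links. A minimal sketch with the nvidia-ml-py (pynvml) bindings follows; note that an idle GPU may legitimately report a lower PCIe generation for power saving, so confirm under load.

```python
# Sketch: flag PCIe links training below their maximum width/generation and report
# NVLink link state. Uses nvidia-ml-py (pynvml).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        cur_w = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
        max_w = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
        # Generation can drop at idle for power saving; check while the GPU is loaded.
        cur_g = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
        max_g = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle)
        note = "" if (cur_w == max_w and cur_g == max_g) else "  <-- link training below capability"
        print(f"GPU {i}: PCIe gen {cur_g}/{max_g}, width x{cur_w}/x{max_w}{note}")
        # NVLink state per link; boards or link indices without NVLink raise an NVML error.
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
                print(f"  NVLink {link}: {'up' if state else 'down'}")
            except pynvml.NVMLError:
                break
finally:
    pynvml.nvmlShutdown()
```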
Repair Method
Connector damage is repairable through component replacement. PCB trace breaks on surface layers are repaired with micro-soldering. Inner-layer breaks under BGA components may be unrepairable. Shenzhen repair shops validate all interconnect repairs through multi-node server testing (per Reuters).
Cost: $1,400-$5,000 depending on provider and complexity. Contact us for diagnosis.
Lemon Nodes: The Outsized Impact of Recurring Failures
One of the most actionable findings from Meta's cluster research (arXiv 2410.21680) is the "lemon node" phenomenon. A small percentage of nodes — just 1.2-1.7% of the fleet — account for a disproportionate share of failures. Removing these lemon nodes from the scheduling pool reduced large-job failures by 67%.
Lemon node root causes break down as: GPU 28.2%, DIMM 20.5%, PCIe 15.4%, with the remainder spread across other components. For fleet operators, this means proactive identification and repair (or replacement) of these chronic problem nodes delivers outsized reliability improvements.
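Meta's paper does not spell out its exact lemon-detection criteria, so the sketch below is only an illustrative heuristic: count failure events per node over a window and flag the outliers for proactive diagnosis. The log format and the three-failure threshold are assumptions made for the example.

```python
# Illustrative lemon-node heuristic (not Meta's published method): flag nodes whose
# failure count over a window exceeds a threshold. The event format and threshold
# are assumptions for the sketch.
from collections import Counter

FAILURE_THRESHOLD = 3  # assumed: 3+ failures in the window marks a node for review

def find_lemon_candidates(failure_events):
    """failure_events: iterable of (node_id, timestamp) tuples from your job/RAS logs."""
    counts = Counter(node for node, _ in failure_events)
    return sorted(
        (node for node, n in counts.items() if n >= FAILURE_THRESHOLD),
        key=lambda node: -counts[node],
    )

# Example usage with made-up data:
events = [("node-017", "2025-03-01"), ("node-017", "2025-03-04"),
          ("node-017", "2025-03-09"), ("node-042", "2025-03-05")]
print(find_lemon_candidates(events))  # ['node-017']
```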
GPUs typically require repair after 2-5 years of continuous operation. Having a repair strategy in place before failures accumulate prevents lemon nodes from degrading overall fleet performance.
Failure Rates: What the Research Shows
Multiple independent studies provide converging data on H100 failure rates at scale:
| Source | Scale | Key Finding |
|---|---|---|
| Meta Llama 3.1 paper | 16,384 H100s, 54 days | 419 interruptions (1 every ~3 hours); GPU hardware 35.3%, HBM3 17.2% |
| arXiv 2410.21680 (Meta) | Production clusters | 2.34-6.50 failures per 1,000 node-days; 1,024-GPU job MTTF: 7.9 hours |
| arXiv 2503.11901 | 11.7M GPU-hours | H100 HBM3 MTBE: 88,768 hrs (3.2x worse than A100); NVLink4: zero errors |
| arXiv 2509.16293 (ByteDance) | 778K jobs | Hardware failure every ~2.78 hrs at 16K+ GPUs; CUDA errors 36.1% |
The consistent pattern: at scale (thousands of GPUs), hardware failures are frequent and unavoidable. The question is not whether failures will occur, but how quickly you can diagnose and resolve them.
NVIDIA RMA Process
Before considering third-party repair, understand your warranty options. NVIDIA's RMA process requires:
- Diagnostic logs: nvidia-bug-report output plus Field Diagnostic results.
- DGX warranty: 3 years minimum, extendable to 5 years.
- OEM-channel GPUs: RMA goes through the OEM (Dell, Supermicro, etc.), not NVIDIA directly.
- Consumer warranty: Void if the GPU was used in an enterprise/datacenter context.
When RMA is not available — out of warranty, OEM complications, or faster turnaround needed — board-level repair becomes the practical alternative.
Key Takeaways
- GPU hardware + HBM3 = 52.5% of failures — according to Meta's Llama 3.1 training data (16,384 H100s, 54 days, 419 interruptions).
- H100 HBM3 is 3.2x less reliable than A100 HBM2e — per a March 2025 study covering 11.7 million GPU-hours (arXiv 2503.11901).
- NVLink4 shows high reliability — zero errors in 2.1 million GPU-hours, a significant improvement over A100's NVLink3.
- Lemon node removal cuts failures by 67% — identifying and addressing the 1.2-1.7% of problem nodes has outsized impact (Meta, arXiv 2410.21680).
- Repair is a viable option — Shenzhen shops repair ~500 H100s/month using BGA rework and VRM repair, at $1,400-$2,800 per GPU (Reuters, July 2025). Get a diagnostic assessment.
Frequently Asked Questions
What is the most common H100 GPU failure mode?
GPU hardware faults and HBM3 memory failures together account for the majority of interruptions. According to Meta's Llama 3.1 training report (16,384 H100 GPUs over 54 days), GPU hardware failures caused 35.3% and HBM3 memory caused 17.2% of all 419 unexpected interruptions — a combined 52.5%. A separate study (arXiv 2503.11901) covering 11.7 million GPU-hours found that H100 HBM3 has a per-GPU mean time between errors of 88,768 hours, which is 3.2x worse than the A100's 283,271 hours.
Can H100 GPUs be repaired at the board level?
Yes. According to Reuters reporting (July 2025), approximately 12 repair shops in Shenzhen perform H100 board-level repairs including BGA rework for HBM and VRM (power-delivery circuit) repair. The largest handles roughly 500 GPUs per month. Based on general BGA rework data, success rates range from 70-96%. Repair costs range from $1,400 to $8,000 depending on the provider and failure complexity, compared to $25,000-$40,000 for a new H100 SXM5.
How do you diagnose which component failed on an H100?
NVIDIA's RMA process requires diagnostic logs from nvidia-bug-report and the Field Diagnostic tool. For board-level diagnosis, technicians use nvidia-smi ECC error reporting to identify HBM faults, thermal imaging for hotspot detection, X-ray inspection for BGA solder joint integrity, and oscilloscope probing for power rail analysis. Shenzhen repair shops validate their repairs using 256-node server room testing (per Reuters).
What is the typical failure rate for H100 GPUs in production?
Published data shows significant failure rates at scale. Meta's Llama 3.1 training (16,384 H100 GPUs, 54 days) experienced 419 unexpected interruptions — roughly one every 3 hours. A separate Meta cluster study (arXiv 2410.21680) measured 2.34 to 6.50 failures per 1,000 node-days, with a mean time to failure of just 7.9 hours for 1,024-GPU jobs. ByteDance (arXiv 2509.16293) reported hardware failures every approximately 2.78 hours during 16,000+ GPU training runs.