H100 GPU Repair vs. Replacement: Cost Analysis for Data Center Operators

Board-level H100 GPU repair costs $2,000-$8,000, which is 5-20% of the price of a new H100 SXM5 at the top of its $25,000-$40,000 range (up to 32% against a $25,000 unit), with a 1-2 week turnaround versus 2-4 weeks for procurement (as of Q1 2026). According to Reuters (July 2025), approximately 12 repair shops in Shenzhen now handle H100 board-level repairs at $1,400-$2,800 per unit, with the largest processing around 500 GPUs per month. No known US or EU companies publicly offer this service, which makes Japan-based repair a unique option for operators outside China.

This article breaks down the real numbers behind repair vs. replacement for datacenter operators. We cover direct costs from verified sources, hidden costs on both sides, a decision framework for your team, and fleet-level economics that change the calculation entirely at scale. GPUs typically require repair after 2-5 years of continuous operation (per Reuters, citing repair shop data), making a repair strategy essential for any fleet operator.

The Numbers: Repair vs. Replacement Cost Breakdown

Here's the direct cost comparison across all major H100 failure types. Repair costs are based on Reuters reporting (July 2025) and our Japan service pricing. New GPU prices reflect verified Q1 2026 market pricing from multiple sources.

| Factor | Repair | Replacement |
|---|---|---|
| Unit cost (H100 SXM5) | $2,000-$8,000 | $25,000-$40,000 (new) |
| Unit cost (H200 NVL) | N/A (too new for repair market) | $31,000-$32,000 (new) |
| Turnaround time | 1-2 weeks | H100: 2-4 weeks; H200: 3-6 months |
| Warranty | 90-day repair warranty | Standard NVIDIA warranty |
| Success rate | 80-95% (varies by failure type) | 100% (new unit) |
| Diagnostic cost | Free (included in repair quote) | N/A |
| Shipping | Inbound: customer. Return: included. | Vendor shipping terms vary |
| Cost as % of new unit | 5-20% | 100% |

The repair cost varies by failure type. Here's the breakdown:

| Failure Type | Repair Cost Range | Typical Turnaround |
|---|---|---|
| VRM / Power Circuit | $2,000-$4,000 | 3-5 days |
| Connector Damage | $2,000-$3,500 | 3-5 days |
| PCB Trace Repair | $2,000-$5,000 | 5-10 days |
| Thermal Damage | $3,000-$6,000 | 5-7 days |
| NVLink / NVSwitch | $3,000-$7,000 | 5-10 days |
| HBM Memory Stack | $4,000-$8,000 | 5-7 days |

For full technical details on each failure type, see our H100 failure modes guide.
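To see how each failure type compares with the price of a new unit, the ranges in the failure-type table can be converted to a share of replacement cost. The sketch below is only an illustration: the dictionary and the `repair_share` function are hypothetical names, and the $40,000 reference price is an assumption taken from the top of the new-unit range quoted in this article.

```python
# Repair cost ranges in USD, copied from the failure-type table.
REPAIR_COST_BY_FAILURE = {
    "VRM / Power Circuit": (2_000, 4_000),
    "Connector Damage": (2_000, 3_500),
    "PCB Trace Repair": (2_000, 5_000),
    "Thermal Damage": (3_000, 6_000),
    "NVLink / NVSwitch": (3_000, 7_000),
    "HBM Memory Stack": (4_000, 8_000),
}


def repair_share(repair_range, new_unit_price):
    """Return the (low, high) repair cost as a fraction of one new unit's price."""
    low, high = repair_range
    return low / new_unit_price, high / new_unit_price


# Assumed reference price: the top of the $25,000-$40,000 range.
for failure, cost_range in REPAIR_COST_BY_FAILURE.items():
    low, high = repair_share(cost_range, new_unit_price=40_000)
    print(f"{failure}: {low:.1%} to {high:.1%} of a $40,000 unit")
```

Against a $40,000 unit the overall $2,000-$8,000 range works out to 5-20%. Passing `new_unit_price=25_000` instead raises the top of the range to 32%, so the share depends on which replacement price you actually face.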

Hidden Costs of Replacement

The sticker price of a new H100 is just the beginning. Replacement carries several costs that don't show up in the purchase order:

Lead Time and Procurement Complexity

H100 availability has improved dramatically since the extreme shortages of 2023, when lead times stretched to 8-11 months. As of Q1 2026, H100 lead times have normalized to 2-4 weeks for standard orders. However, next-generation GPUs remain supply-constrained: H200 NVL units still carry lead times of 3-6 months.

Depreciation and Asset Lifecycle

A failed GPU that's written off is a total loss on a $25,000-$40,000 asset. A repaired GPU retains most of its productive value for a fraction of the cost. Consider the accounting impact:

Configuration and Reintegration

Swapping a new GPU into an existing node isn't always plug-and-play:

Hidden Costs of NOT Repairing

Every hour a GPU sits idle is lost revenue. Here's how to quantify that:

Lost Compute Revenue

The revenue impact of a single idle H100 depends on your workload and pricing model. This article uses $50-$100 per day per idle GPU as a working range.

Downtime Duration Comparison

The total downtime calculation includes diagnosis, decision-making, repair/procurement, shipping, and reintegration:

| Phase | Repair Path | Replacement Path |
|---|---|---|
| Initial diagnosis | 1-3 days | 1-3 days |
| Decision / approval | 1 day | 1-5 days (procurement approval) |
| Shipping to repair facility | 1-3 days | N/A |
| Repair / procurement wait | 3-7 days | 14-28 days (H100, Q1 2026) |
| Return shipping / receiving | 1-3 days | 1-5 days |
| Reintegration + testing | 1 day | 1-3 days |
| Total downtime | 8-18 days | 18-44 days |

At $50-$100/day in lost revenue per GPU, the downtime difference alone (0-36 days, worth up to $3,600 per GPU when repairing instead of replacing an H100) adds to the hardware cost savings. For H200 replacements with 3-6 month lead times, the downtime cost becomes the dominant factor.
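The downtime arithmetic can be reproduced directly from the phase rows. The following is a minimal sketch, assuming the per-phase day ranges listed in the downtime table and the $50-$100/day revenue range; the dictionary and function names are hypothetical. Summing the phase rows as listed gives 8-18 days for repair and 18-44 days for replacement.

```python
# (min_days, max_days) for each phase, copied from the downtime table.
REPAIR_PHASES = {
    "initial_diagnosis": (1, 3),
    "decision_approval": (1, 1),
    "shipping_to_repair_facility": (1, 3),
    "repair": (3, 7),
    "return_shipping": (1, 3),
    "reintegration_testing": (1, 1),
}
REPLACEMENT_PHASES = {
    "initial_diagnosis": (1, 3),
    "decision_approval": (1, 5),
    "procurement_wait": (14, 28),
    "receiving": (1, 5),
    "reintegration_testing": (1, 3),
}


def total_days(phases):
    """Sum the per-phase minimums and maximums into a (min, max) total."""
    return (
        sum(low for low, _ in phases.values()),
        sum(high for _, high in phases.values()),
    )


def lost_revenue(days, revenue_per_day):
    """Revenue lost while one GPU sits idle for the given number of days."""
    return days * revenue_per_day


repair_days = total_days(REPAIR_PHASES)
replacement_days = total_days(REPLACEMENT_PHASES)

# Widest gap: slowest replacement versus fastest repair.
max_gap_days = replacement_days[1] - repair_days[0]
# Narrowest gap: fastest replacement versus slowest repair, floored at zero.
min_gap_days = max(0, replacement_days[0] - repair_days[1])

print("repair:", repair_days, "replacement:", replacement_days)
print("gap in days:", min_gap_days, "to", max_gap_days)
print("gap in USD:", lost_revenue(min_gap_days, 50), "to", lost_revenue(max_gap_days, 100))
```

Treating the two paths as independent ranges is a simplification. In practice the diagnosis phase is shared, so the realistic gap sits somewhere inside the printed bounds.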

SLA and Contract Penalties

If you're selling GPU compute (cloud, managed inference, ML platform), GPU downtime can trigger contractual penalties:

Decision Framework: When to Repair vs. Replace

Use this framework to make the repair/replace decision quickly and consistently. Walk through the questions in order:

Step 1: Is the failure repairable?

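Some failures are not candidates for board-level repair: a failed GPU die (die-level damage is not field-repairable), simultaneous damage to multiple major subsystems (e.g., HBM + VRM + PCB) that pushes repair cost toward replacement cost, and boards that have already been repaired multiple times and show diminishing reliability.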

For all other failure types (HBM, VRM, thermal, PCB trace, connector, NVLink), repair is viable. Proceed to Step 2.

Step 2: What's your time constraint?

Step 3: Strategic considerations

Step 4: Run the numbers

For most scenarios, the calculation is straightforward: add the downtime cost to the hardware cost on each path and compare the totals.

In nearly every case where repair is technically viable, the economics favor repair. Get a free diagnostic assessment to know exactly what you're dealing with.
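Steps 1 and 4 of the framework can be sketched as a small function. This is an illustrative model rather than a quoting tool: the `Scenario` fields, `path_cost`, and `recommend` are hypothetical names, and the sketch assumes the only costs on each path are the hardware price plus lost revenue during downtime. Steps 2 and 3 (time constraints and strategic considerations) are judgment calls and are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    repairable: bool                  # outcome of Step 1
    repair_quote: float               # USD
    replacement_price: float          # USD
    repair_downtime_days: float
    replacement_downtime_days: float
    revenue_per_gpu_day: float        # USD lost per idle GPU per day


def path_cost(hardware_cost, downtime_days, revenue_per_gpu_day):
    """Hardware cost plus revenue lost while the GPU is out of service."""
    return hardware_cost + downtime_days * revenue_per_gpu_day


def recommend(s):
    """Return 'repair' or 'replace' for one failed GPU."""
    if not s.repairable:
        return "replace"
    repair_total = path_cost(
        s.repair_quote, s.repair_downtime_days, s.revenue_per_gpu_day
    )
    replace_total = path_cost(
        s.replacement_price, s.replacement_downtime_days, s.revenue_per_gpu_day
    )
    return "repair" if repair_total < replace_total else "replace"


# Least favorable case for repair using this article's ranges: the most
# expensive repair, the cheapest new unit, and no downtime advantage.
least_favorable = Scenario(
    repairable=True,
    repair_quote=8_000,
    replacement_price=25_000,
    repair_downtime_days=18,
    replacement_downtime_days=18,
    revenue_per_gpu_day=100,
)
print(recommend(least_favorable))
```

Even with the inputs stacked against it, the repair path totals $9,800 versus $26,800 for replacement, so the recommendation only flips when the failure is not repairable in the first place.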

Fleet Economics: Repair ROI at Scale

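The repair vs. replace calculation changes significantly at fleet scale. Here's the annual cost model at different fleet sizes, assuming a 3% annual GPU failure rate (conservative, based on published large-scale failure data), an average repair cost of $5,000, and a replacement cost of $30,000-$40,000 per unit (the range the table's totals imply). Expected failures are rounded up to whole units.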

| Fleet Size | Expected Failures/Year | Repair Cost (Total) | Replacement Cost (Total) | Annual Savings |
|---|---|---|---|---|
| 50 GPUs | ~2 | $10,000 | $60,000-$80,000 | $50,000-$70,000 |
| 100 GPUs | ~3 | $15,000 | $90,000-$120,000 | $75,000-$105,000 |
| 500 GPUs | ~15 | $75,000 | $450,000-$600,000 | $375,000-$525,000 |
| 1,000 GPUs | ~30 | $150,000 | $900,000-$1,200,000 | $750,000-$1,050,000 |
| 5,000 GPUs | ~150 | $750,000 | $4,500,000-$6,000,000 | $3,750,000-$5,250,000 |

These figures don't include downtime cost savings, which add another 20-40% to the repair ROI depending on your revenue-per-GPU-hour.
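The fleet table can be regenerated for any fleet size with a few lines of arithmetic. This is a sketch under stated assumptions: `fleet_model` is a hypothetical helper, expected failures are rounded up to whole GPUs (which matches the "~2" in the 50-GPU row), and the per-unit replacement price of $30,000-$40,000 is the range the table's totals imply rather than the full $25,000-$40,000 range quoted elsewhere in this article.

```python
import math


def fleet_model(fleet_size, failure_rate_pct=3, avg_repair_cost=5_000,
                replacement_price_range=(30_000, 40_000)):
    """Annual repair vs. replacement cost for a fleet, all amounts in USD."""
    # Assumption: round expected failures up to whole GPUs (1.5 becomes 2).
    failures = math.ceil(fleet_size * failure_rate_pct / 100)
    repair_total = failures * avg_repair_cost
    replace_low = failures * replacement_price_range[0]
    replace_high = failures * replacement_price_range[1]
    return {
        "failures": failures,
        "repair_total": repair_total,
        "replacement_total": (replace_low, replace_high),
        "annual_savings": (replace_low - repair_total,
                           replace_high - repair_total),
    }


for size in (50, 100, 500, 1_000, 5_000):
    print(size, fleet_model(size))
```

Changing `failure_rate_pct` or `avg_repair_cost` shows how sensitive the savings are to your own fleet's failure history and repair mix. Like the table, the model covers hardware cost only and leaves out downtime.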

Maintenance Contracts vs. Ad-Hoc Repair

At 10+ GPUs, a maintenance contract changes the economics further:

Learn more about fleet maintenance contracts or contact us for fleet pricing.

Key Takeaways

  • Repair saves 80-95% vs. replacement — $2,000-$8,000 repair cost vs. $25,000-$40,000 for a new H100 SXM5. Shenzhen shops charge $1,400-$2,800 (Reuters, July 2025).
  • Faster return to production — 1-2 weeks for repair vs. 2-4 weeks for H100 procurement or 3-6 months for H200 (Q1 2026 lead times).
  • Downtime costs add up fast — at $50-$100/day per idle GPU, every extra week waiting for a replacement adds to the total cost.
  • Fleet-scale savings are massive — a 1,000-GPU fleet saves $750K-$1M+ annually by repairing instead of replacing.
  • Not every failure needs replacement — 80-95% of common H100 failures are repairable. Get a free diagnostic before deciding.

Frequently Asked Questions

How much does it cost to repair an H100 GPU?

H100 GPU board-level repair costs $2,000-$8,000 depending on the failure type and service provider. According to Reuters (July 2025), Shenzhen repair shops charge $1,400-$2,800 per GPU — approximately 12 shops operate there, with the largest handling around 500 GPUs per month. Japan-based service with full diagnostics and validation runs $2,000-$8,000. This represents 5-20% of the $25,000-$40,000 replacement cost for a new H100 SXM5. No known US or EU companies publicly offer board-level H100 repair.

How long does H100 GPU repair take compared to buying a replacement?

Board-level H100 repair takes 1-2 weeks from receiving the board to shipping it back repaired. As of Q1 2026, procurement of a new H100 typically takes 2-4 weeks — a significant improvement from the 8-11 month lead times seen in 2023. H200 NVL GPUs ($31,000-$32,000) still have lead times of 3-6 months. Repair remains the fastest path back to production. GPUs typically require repair after 2-5 years of continuous operation (per Reuters, citing repair shop data).

What is the ROI of repairing GPUs for a fleet of 100+ units?

For a 100-GPU fleet with a 3% annual failure rate, repairing instead of replacing saves approximately $75,000-$105,000 per year in hardware costs alone. At 1,000 GPUs, the annual savings scale to $750,000-$1,050,000. Additionally, repaired GPUs can return to production up to about 5 weeks faster than replacements, avoiding $50-$100+ per day in lost compute revenue per idle GPU.

When should you replace an H100 instead of repairing it?

Replace rather than repair when: (1) the GPU die itself has failed — die-level damage is not field-repairable, (2) multiple major subsystems are damaged simultaneously (e.g., HBM + VRM + PCB), making repair cost approach replacement cost, (3) the board has been repaired multiple times and is showing diminishing reliability, or (4) a technology upgrade (e.g., to H200 or B200) makes replacement strategically preferable.

Need GPU repair?

Get a free diagnostic assessment for your H100 or H200 GPU. We respond within 1 business day.
