How much does repair cost?

$2,000-$8,000 depending on what failed. HBM rework is at the higher end, connector repair at the lower. You get an exact quote after diagnostics—no surprises.

What’s the turnaround time?

1-2 weeks total. Diagnostics take 1-3 business days. Once you approve the quote, repair is 3-7 business days. Priority queue available for fleet contracts.

What if you can’t fix it?

You pay nothing. Diagnostics are free, and we send you a report showing what failed and why it’s not repairable.

Do you offer on-site repair?

Yes. Our engineer comes to your datacenter with tooling and parts, so there is no shipping and no chain-of-custody concern. On-site visits are possible worldwide, subject to availability. Tell us where you are and how many boards need work.

Do you ship internationally?

Yes. You cover inbound shipping to Japan. We cover return shipping. We handle customs documentation for GPU hardware.

What warranty do you offer?

90 days on every repair. If the same fault comes back within that window, we fix it again at no charge. You also get a detailed repair report with diagnostic imaging and test results.

Which GPUs do you repair?

We mainly work on H100 SXM5, H100 PCIe, and H200 SXM5, but other datacenter GPUs may be possible too. Send us a quote inquiry and we’ll assess your case.

Who runs GPU Repair Lab?

GPU Repair Lab is a Japan-based operation specializing in board-level repair on datacenter GPUs, with direct access to component supply chains in China for parts sourcing.

H100 GPU Repair vs. Replacement: Cost Analysis for Data Center Operators

Board-level H100 GPU repair costs $2,000-$8,000 — that's 5-20% of the $25,000-$40,000 replacement cost for a new H100 SXM5, with a 1-2 week turnaround versus 2-4 weeks for procurement (as of Q1 2026). According to Reuters (July 2025), approximately 12 repair shops in Shenzhen now handle H100 board-level repairs at $1,400-$2,800 per unit, with the largest processing around 500 GPUs per month. No known US or EU companies publicly offer this service — making Japan-based repair a unique option for operators outside China.

This article breaks down the real numbers behind repair vs. replacement for datacenter operators. We cover direct costs from verified sources, hidden costs on both sides, a decision framework for your team, and fleet-level economics that change the calculation entirely at scale. GPUs typically require repair after 2-5 years of continuous operation (per Reuters, citing repair shop data), making a repair strategy essential for any fleet operator.

The Numbers: Repair vs. Replacement Cost Breakdown

Here's the direct cost comparison between repairing and replacing a failed GPU. Repair costs are based on Reuters reporting (July 2025) and our Japan service pricing. New GPU prices reflect verified Q1 2026 market pricing from multiple sources.

Factor	Repair	Replacement
Unit cost (H100 SXM5)	$2,000-$8,000	$25,000-$40,000 (new)
Unit cost (H200 NVL)	N/A (too new for repair market)	$31,000-$32,000 (new)
Turnaround time	1-2 weeks	H100: 2-4 weeks; H200: 3-6 months
Warranty	90-day repair warranty	Standard NVIDIA warranty
Success rate	80-95% (varies by failure type)	100% (new unit)
Diagnostic cost	Free (included in repair quote)	N/A
Shipping	Inbound: customer. Return: included.	Vendor shipping terms vary
Cost as % of new unit	5-20%	100%

The repair cost varies by failure type. Here's the breakdown:

Failure Type	Repair Cost Range	Typical Turnaround
VRM / Power Circuit	$2,000-$4,000	3-5 days
Connector Damage	$2,000-$3,500	3-5 days
PCB Trace Repair	$2,000-$5,000	5-10 days
Thermal Damage	$3,000-$6,000	5-7 days
NVLink / NVSwitch	$3,000-$7,000	5-10 days
HBM Memory Stack	$4,000-$8,000	5-7 days

For full technical details on each failure type, see our H100 failure modes guide.

Hidden Costs of Replacement

The sticker price of a new H100 is just the beginning. Replacement carries several costs that don't show up in the purchase order:

Lead Time and Procurement Complexity

H100 availability has improved dramatically since the extreme shortages of 2023, when lead times stretched to 8-11 months. As of Q1 2026, H100 lead times have normalized to 2-4 weeks for standard orders. However, next-generation GPUs remain supply-constrained:

H100 SXM5 (Q1 2026): 2-4 weeks for standard orders. Supply has largely stabilized, but pricing remains at $25,000-$40,000 depending on volume and vendor.
H200 NVL: 3-6 months lead time. Priced at $31,000-$32,000 per unit, these remain allocation-constrained as of early 2026.
Procurement overhead: Purchase orders, vendor management, import/customs (for international procurement), and receiving/inventory processing all add labor cost and calendar time — even when the GPU itself ships quickly.

Depreciation and Asset Lifecycle

A failed GPU that's written off is a total loss on a $25,000-$40,000 asset. A repaired GPU retains most of its productive value for a fraction of the cost. Consider the accounting impact:

Write-off vs. repair capitalization: A $2,000-$8,000 repair extends the useful life of a $25,000-$40,000 asset. The alternative is writing off the full value and capitalizing $25,000-$40,000 for the replacement.
Resale value: Even if you're upgrading to H200 or B200, a repaired H100 has significant resale value on the secondary market. A dead H100 is worth scrap.
GPU lifespan: According to Reuters (July 2025, citing Shenzhen repair shop operators), GPUs typically require repair after 2-5 years of continuous operation — making repair a natural part of the asset lifecycle, not a last resort.

Configuration and Reintegration

Swapping a new GPU into an existing node isn't always plug-and-play:

Firmware and VBIOS matching: New units may ship with different firmware versions than the rest of your fleet, requiring updates to maintain consistency.
NVLink topology reconfiguration: In DGX/HGX systems, replacing a single GPU may require NVLink topology recalibration and fabric manager updates.
Burn-in and validation: New hardware still needs burn-in testing before production deployment. This adds 1-3 days to the timeline regardless of how fast the unit arrives.

Hidden Costs of NOT Repairing

Every hour a GPU sits idle is lost revenue. Here's how to quantify that:

Lost Compute Revenue

The revenue impact of a single idle H100 depends on your workload and pricing model:

Cloud GPU rental rates: H100 SXM5 instances typically rent for $2-$4/hour on major cloud providers. That's $48-$96/day in lost gross revenue per idle GPU.
AI training throughput: For internal ML teams, the cost is measured in delayed model iterations. If your training cluster is GPU-bound (and it almost certainly is), one missing GPU slows every job that would have used it.
Inference capacity: For production inference serving, a missing GPU directly reduces your maximum concurrent request capacity and may trigger SLA penalties.
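The daily figure above follows directly from the hourly rental rate. A minimal sketch, assuming the GPU would otherwise be billed around the clock; the `utilization` parameter is a hypothetical knob for fleets that are not fully booked, not something taken from the pricing data:

```python
def daily_lost_revenue(hourly_rate_usd: float, utilization: float = 1.0) -> float:
    """Gross revenue lost per day while one GPU sits idle.

    utilization is an assumed fraction of hours the GPU would have been
    billed (1.0 means rented 24 hours a day).
    """
    return hourly_rate_usd * 24 * utilization


if __name__ == "__main__":
    low = daily_lost_revenue(2.0)   # $2/hour rental rate
    high = daily_lost_revenue(4.0)  # $4/hour rental rate
    print(f"${low:.0f}-${high:.0f} per idle GPU per day")  # $48-$96 per idle GPU per day
```

Lowering `utilization` scales the loss proportionally, which may be closer to reality for clusters with spare capacity.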

Downtime Duration Comparison

The total downtime calculation includes diagnosis, decision-making, repair/procurement, shipping, and reintegration:

Phase	Repair Path	Replacement Path
Initial diagnosis	1-3 days	1-3 days
Decision / approval	1 day	1-5 days (procurement approval)
Shipping to repair facility	1-3 days	N/A
Repair / procurement wait	3-7 days	14-28 days (H100, Q1 2026)
Return shipping / receiving	1-3 days	1-5 days
Reintegration + testing	1 day	1-3 days
Total downtime	8-18 days	18-44 days

At $50-$100/day in lost revenue per GPU, the downtime difference alone (10-26 days when comparing best case to best case and worst case to worst case, or roughly $500-$2,600 per GPU) adds to the hardware cost savings. For H200 replacements with 3-6 month lead times, the downtime cost becomes the dominant factor.
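The totals can be checked by adding up the phase rows of the table. The sketch below does that and prices the gap between the two paths; the dictionary keys and function names are illustrative, and comparing best case to best case (and worst to worst) is an assumption about how the ranges line up:

```python
# (min_days, max_days) for each phase, copied from the table rows above.
REPAIR_PHASES = {
    "initial diagnosis": (1, 3),
    "decision / approval": (1, 1),
    "shipping to repair facility": (1, 3),
    "repair": (3, 7),
    "return shipping / receiving": (1, 3),
    "reintegration + testing": (1, 1),
}
REPLACEMENT_PHASES = {
    "initial diagnosis": (1, 3),
    "decision / approval": (1, 5),
    "procurement wait": (14, 28),
    "receiving": (1, 5),
    "reintegration + testing": (1, 3),
}


def total_days(phases):
    """Sum the best-case and worst-case days across all phases."""
    best = sum(lo for lo, _ in phases.values())
    worst = sum(hi for _, hi in phases.values())
    return best, worst


repair_best, repair_worst = total_days(REPAIR_PHASES)
replace_best, replace_worst = total_days(REPLACEMENT_PHASES)

# Extra downtime on the replacement path, priced at $50-$100/day.
extra_low = (replace_best - repair_best) * 50
extra_high = (replace_worst - repair_worst) * 100
print(f"repair: {repair_best}-{repair_worst} days")
print(f"replacement: {replace_best}-{replace_worst} days")
print(f"extra downtime cost of replacing: ${extra_low}-${extra_high}")
```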

SLA and Contract Penalties

If you're selling GPU compute (cloud, managed inference, ML platform), GPU downtime can trigger contractual penalties:

Availability SLAs: Most cloud GPU services guarantee 99.9%+ uptime. A single GPU being down for weeks blows through any SLA budget.
Training job SLAs: MLaaS contracts often include completion time guarantees. A missing GPU that extends a training job's wall clock time may trigger penalty clauses.
Customer churn: Repeated availability issues drive customers to competitors. The lifetime value of a lost customer far exceeds any single GPU's cost.
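To see how small an availability budget is next to a multi-day outage, the arithmetic can be sketched as follows. The 30-day measurement period is an assumption, and whether one GPU's outage counts against a service-level SLA depends on the contract:

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Downtime allowed in one measurement period at a given availability target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - availability_pct) / 100


budget = allowed_downtime_minutes(99.9)  # roughly 43 minutes per 30 days
outage = 8 * 24 * 60                     # an 8-day outage, in minutes
print(f"budget: {budget:.1f} min, 8-day outage: {outage} min")
```

Even the shortest repair path is hundreds of times larger than a 99.9% monthly budget, which is why spare capacity or fast turnaround matters for operators selling compute.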

Decision Framework: When to Repair vs. Replace

Use this framework to make the repair/replace decision quickly and consistently. Walk through the questions in order:

Step 1: Is the failure repairable?

Some failures are not candidates for board-level repair:

GPU die failure — not field-repairable. Replace.
Catastrophic physical damage (board cracked, severe water/fire damage) — replace.
Multiple simultaneous major failures (e.g., HBM + VRM + GPU die) — repair cost may exceed replacement cost. Get a diagnostic assessment to confirm.

For all other failure types (HBM, VRM, thermal, PCB trace, connector, NVLink), repair is viable. Proceed to Step 2.

Step 2: What's your time constraint?

Need it back in <2 weeks: Repair is the only realistic option. Replacement procurement almost never meets this timeline.
Can wait 4-8 weeks: Both options are on the table. Compare costs.
No time pressure: Consider strategic factors (Step 3).
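Steps 1 and 2 can be written down as a small triage function. This is a sketch under stated assumptions, not a definitive policy: the string labels are illustrative, two or more simultaneous failures are treated as "multiple major failures", and any window of 2 weeks or longer is routed to a cost comparison:

```python
from typing import Optional

# Failure categories taken from Step 1; the string labels are illustrative.
NOT_REPAIRABLE = {"gpu_die", "catastrophic_physical_damage"}
REPAIRABLE = {"hbm", "vrm", "thermal", "pcb_trace", "connector", "nvlink"}


def triage(failures: set, weeks_available: Optional[float] = None) -> str:
    """Return a suggested next step for a failed board, following Steps 1 and 2."""
    if not failures:
        raise ValueError("at least one failure type is required")
    unknown = failures - NOT_REPAIRABLE - REPAIRABLE
    if unknown:
        raise ValueError(f"unrecognized failure types: {sorted(unknown)}")

    # Step 1: die failure or catastrophic damage means replace.
    if failures & NOT_REPAIRABLE:
        return "replace"
    # Assumption: two or more simultaneous failures need a diagnostic first.
    if len(failures) >= 2:
        return "get diagnostic assessment"

    # Step 2: time constraint (None means no time pressure).
    if weeks_available is None:
        return "consider strategic factors"
    if weeks_available < 2:
        return "repair"
    return "compare costs"


print(triage({"hbm"}, weeks_available=1))   # repair
print(triage({"gpu_die"}))                  # replace
```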

Step 3: Strategic considerations

Planning a fleet upgrade? If you're migrating to H200/B200 within 6 months, repair the H100 and resell it to offset upgrade costs.
Board has been repaired before? A second repair on the same board is still viable but should be evaluated case-by-case. If the same component fails again, it may indicate a systemic issue.
Fleet standardization: If your fleet is standardized on specific firmware/VBIOS versions, a repaired board maintains that consistency. A new board may introduce version mismatches.

Step 4: Run the numbers

For most scenarios, the calculation is straightforward:

Repair cost ($2K-$8K) + shipping (~$200) + downtime cost (8-18 days x daily revenue loss)
vs. Replacement cost ($25K-$40K) + downtime cost (18-44 days x daily revenue loss) + procurement overhead

In nearly every case where repair is technically viable, the economics favor repair. Get a free diagnostic assessment to know exactly what you're dealing with.
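The comparison in Step 4 can be sketched as a small calculator. All inputs in the example are illustrative picks from within the ranges quoted in this article (a $5,000 repair, a $30,000 replacement, 13 and 30 days of downtime, $75/day in lost revenue), and procurement overhead is left at zero because no figure is given for it:

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One way of getting a failed GPU back into production."""
    hardware_cost: float      # repair quote or new-unit price, in USD
    downtime_days: float      # total days out of service
    extra_cost: float = 0.0   # shipping or procurement overhead, in USD

    def total(self, daily_revenue_loss: float) -> float:
        """Hardware cost plus extras plus revenue lost during downtime."""
        return self.hardware_cost + self.extra_cost + self.downtime_days * daily_revenue_loss


# Illustrative inputs, not quotes.
repair = Path(hardware_cost=5_000, downtime_days=13, extra_cost=200)
replacement = Path(hardware_cost=30_000, downtime_days=30)

daily_loss = 75
print(f"repair:      ${repair.total(daily_loss):,.0f}")       # $6,175
print(f"replacement: ${replacement.total(daily_loss):,.0f}")  # $32,250
```

Plugging in your own quote, lead time, and revenue per GPU-day gives a like-for-like number for the approval request.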

Fleet Economics: Repair ROI at Scale

The repair vs. replace calculation changes significantly at fleet scale. Here's the annual cost model at different fleet sizes, assuming a 3% annual GPU failure rate (conservative, based on published large-scale failure data) and an average repair cost of $5,000. The replacement column uses $30,000-$40,000 per unit, and expected failures are rounded to whole GPUs.

Fleet Size	Expected Failures/Year	Repair Cost (Total)	Replacement Cost (Total)	Annual Savings
50 GPUs	~2	$10,000	$60,000-$80,000	$50,000-$70,000
100 GPUs	~3	$15,000	$90,000-$120,000	$75,000-$105,000
500 GPUs	~15	$75,000	$450,000-$600,000	$375,000-$525,000
1,000 GPUs	~30	$150,000	$900,000-$1,200,000	$750,000-$1,050,000
5,000 GPUs	~150	$750,000	$4,500,000-$6,000,000	$3,750,000-$5,250,000

These figures don't include downtime cost savings, which add another 20-40% to the repair ROI depending on your revenue-per-GPU-hour.
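The table can be reproduced for any fleet size with a few lines of arithmetic. In this sketch the $30,000-$40,000 per-unit replacement range is an assumption chosen because it matches the table's totals, and expected failures are rounded half up to whole GPUs; the function and parameter names are illustrative:

```python
def fleet_model(fleet_size, failure_rate_pct=3, repair_cost=5_000,
                replacement_range=(30_000, 40_000)):
    """Annual repair vs. replacement cost for a fleet, in USD."""
    # Integer arithmetic avoids float rounding; +50 rounds half up.
    failures = (fleet_size * failure_rate_pct + 50) // 100
    repair_total = failures * repair_cost
    replace_low = failures * replacement_range[0]
    replace_high = failures * replacement_range[1]
    return {
        "failures": failures,
        "repair_total": repair_total,
        "replacement_total": (replace_low, replace_high),
        "savings": (replace_low - repair_total, replace_high - repair_total),
    }


for size in (50, 100, 500, 1_000, 5_000):
    row = fleet_model(size)
    low, high = row["savings"]
    print(f"{size:>5} GPUs: ~{row['failures']} failures, "
          f"repair ${row['repair_total']:,}, savings ${low:,}-${high:,}")
```

Changing `failure_rate_pct` or `repair_cost` shows how sensitive the savings are to your own fleet's failure history and quoted repair prices.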

Maintenance Contracts vs. Ad-Hoc Repair

At 10+ GPUs, a maintenance contract changes the economics further:

Volume pricing: Per-repair costs decrease with committed volume. 10+ units qualify for volume discounts.
Priority queue: Contract repairs go to the front of the line, reducing turnaround by 2-3 days on average.
Pre-negotiated process: No procurement approval delays. When a board fails at 2 AM, the repair path is already established.
Predictable budgeting: Annualized repair costs are forecastable based on fleet size and expected failure rates.

Learn more about fleet maintenance contracts or contact us for fleet pricing.

Key Takeaways

Repair saves 80-95% vs. replacement — $2,000-$8,000 repair cost vs. $25,000-$40,000 for a new H100 SXM5. Shenzhen shops charge $1,400-$2,800 (Reuters, July 2025).
Faster return to production — 1-2 weeks for repair vs. 2-4 weeks for H100 procurement or 3-6 months for H200 (Q1 2026 lead times).
Downtime costs add up fast — at $50-$100/day per idle GPU, every extra week waiting for a replacement adds to the total cost.
Fleet-scale savings are massive — a 1,000-GPU fleet saves $750K-$1M+ annually by repairing instead of replacing.
Not every failure needs replacement — 80-95% of common H100 failures are repairable. Get a free diagnostic before deciding.

Frequently Asked Questions

How much does it cost to repair an H100 GPU?

Board-level H100 repair costs $2,000-$8,000 through Japan-based service with full diagnostics and validation, depending on the failure type. According to Reuters (July 2025), Shenzhen repair shops charge $1,400-$2,800 per GPU; approximately 12 shops operate there, with the largest handling around 500 GPUs per month. The Japan-based price represents 5-20% of the $25,000-$40,000 replacement cost for a new H100 SXM5. No known US or EU companies publicly offer board-level H100 repair.

How long does H100 GPU repair take compared to buying a replacement?

Board-level H100 repair takes 1-2 weeks from receiving the board to shipping it back repaired. As of Q1 2026, procurement of a new H100 typically takes 2-4 weeks — a significant improvement from the 8-11 month lead times seen in 2023. H200 NVL GPUs ($31,000-$32,000) still have lead times of 3-6 months. Repair remains the fastest path back to production. GPUs typically require repair after 2-5 years of continuous operation (per Reuters, citing repair shop data).

What is the ROI of repairing GPUs for a fleet of 100+ units?

For a 100-GPU fleet with a 3% annual failure rate, repairing instead of replacing saves approximately $75,000-$105,000 per year in hardware costs alone. At 1,000 GPUs, the annual savings scale to $750,000-$1,050,000. Additionally, repaired GPUs return to production roughly 1.5-4 weeks faster than replacements, avoiding $50-$100+ per day in lost compute revenue per idle GPU.

When should you replace an H100 instead of repairing it?

Replace rather than repair when: (1) the GPU die itself has failed — die-level damage is not field-repairable, (2) multiple major subsystems are damaged simultaneously (e.g., HBM + VRM + PCB), making repair cost approach replacement cost, (3) the board has been repaired multiple times and is showing diminishing reliability, or (4) a technology upgrade (e.g., to H200 or B200) makes replacement strategically preferable.