Board-level H100 GPU repair costs $2,000-$8,000. That is 5-20% of a $40,000 H100 SXM5, and up to about 32% when an $8,000 repair is set against a $25,000 unit, with a 1-2 week turnaround versus 2-4 weeks for procurement (as of Q1 2026). According to Reuters (July 2025), approximately 12 repair shops in Shenzhen now handle H100 board-level repairs at $1,400-$2,800 per unit, with the largest processing around 500 GPUs per month. No known US or EU companies publicly offer this service, which makes Japan-based repair a unique option for operators outside China.
This article breaks down the real numbers behind repair vs. replacement for datacenter operators. We cover direct costs from verified sources, hidden costs on both sides, a decision framework for your team, and fleet-level economics that change the calculation entirely at scale. GPUs typically require repair after 2-5 years of continuous operation (per Reuters, citing repair shop data), making a repair strategy essential for any fleet operator.
The Numbers: Repair vs. Replacement Cost Breakdown
Here's the direct cost comparison across all major H100 failure types. Repair costs are based on Reuters reporting (July 2025) and our Japan service pricing. New GPU prices reflect verified Q1 2026 market pricing from multiple sources.
| Factor | Repair | Replacement |
|---|---|---|
| Unit cost (H100 SXM5) | $2,000-$8,000 | $25,000-$40,000 (new) |
| Unit cost (H200 NVL) | N/A (too new for repair market) | $31,000-$32,000 (new) |
| Turnaround time | 1-2 weeks | H100: 2-4 weeks; H200: 3-6 months |
| Warranty | 90-day repair warranty | Standard NVIDIA warranty |
| Success rate | 80-95% (varies by failure type) | 100% (new unit) |
| Diagnostic cost | Free (included in repair quote) | N/A |
| Shipping | Inbound: customer. Return: included. | Vendor shipping terms vary |
| Cost as % of new unit | 5-20% of a $40,000 unit (up to ~32% of a $25,000 unit) | 100% |
The repair cost varies by failure type. Here's the breakdown:
| Failure Type | Repair Cost Range | Typical Turnaround |
|---|---|---|
| VRM / Power Circuit | $2,000-$4,000 | 3-5 days |
| Connector Damage | $2,000-$3,500 | 3-5 days |
| PCB Trace Repair | $2,000-$5,000 | 5-10 days |
| Thermal Damage | $3,000-$6,000 | 5-7 days |
| NVLink / NVSwitch | $3,000-$7,000 | 5-10 days |
| HBM Memory Stack | $4,000-$8,000 | 5-7 days |
For full technical details on each failure type, see our H100 failure modes guide.
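To put each failure type in proportion, the repair ranges can be expressed as a share of a new unit. The following is a minimal sketch, assuming the $25,000-$40,000 new H100 SXM5 price range quoted in this article; the function and constant names are illustrative, not part of any pricing tool.

```python
# Repair cost ranges (USD) copied from the failure-type table above.
REPAIR_COST_USD = {
    "VRM / Power Circuit": (2_000, 4_000),
    "Connector Damage": (2_000, 3_500),
    "PCB Trace Repair": (2_000, 5_000),
    "Thermal Damage": (3_000, 6_000),
    "NVLink / NVSwitch": (3_000, 7_000),
    "HBM Memory Stack": (4_000, 8_000),
}
NEW_H100_USD = (25_000, 40_000)  # new H100 SXM5 price range quoted in the article


def repair_share_of_new(failure_type: str) -> tuple:
    """Return (lowest, highest) repair cost as a fraction of a new unit."""
    low_repair, high_repair = REPAIR_COST_USD[failure_type]
    low_new, high_new = NEW_H100_USD
    # Lowest share: cheapest repair vs. most expensive new unit.
    # Highest share: most expensive repair vs. cheapest new unit.
    return low_repair / high_new, high_repair / low_new


for name in REPAIR_COST_USD:
    best, worst = repair_share_of_new(name)
    print(f"{name:<22} {best:.0%} to {worst:.0%} of a new unit")
```

Pairing the cheapest repair with the most expensive new unit (and the reverse) gives the widest plausible band, so real quotes should fall somewhere inside it.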
Hidden Costs of Replacement
The sticker price of a new H100 is just the beginning. Replacement carries several costs that don't show up in the purchase order:
Lead Time and Procurement Complexity
H100 availability has improved dramatically since the extreme shortages of 2023, when lead times stretched to 8-11 months. As of Q1 2026, H100 lead times have normalized to 2-4 weeks for standard orders. However, next-generation GPUs remain supply-constrained:
- H100 SXM5 (Q1 2026): 2-4 weeks for standard orders. Supply has largely stabilized, but pricing remains at $25,000-$40,000 depending on volume and vendor.
- H200 NVL: 3-6 months lead time. Priced at $31,000-$32,000 per unit, these remain allocation-constrained as of early 2026.
- Procurement overhead: Purchase orders, vendor management, import/customs (for international procurement), and receiving/inventory processing all add labor cost and calendar time — even when the GPU itself ships quickly.
Depreciation and Asset Lifecycle
A failed GPU that's written off is a total loss on a $25,000-$40,000 asset. A repaired GPU retains most of its productive value for a fraction of the cost. Consider the accounting impact:
- Write-off vs. repair capitalization: A $2,000-$8,000 repair extends the useful life of a $25,000-$40,000 asset. The alternative is writing off the full value and capitalizing $25,000-$40,000 for the replacement.
- Resale value: Even if you're upgrading to H200 or B200, a repaired H100 has significant resale value on the secondary market. A dead H100 is worth scrap.
- GPU lifespan: According to Reuters (July 2025, citing Shenzhen repair shop operators), GPUs typically require repair after 2-5 years of continuous operation — making repair a natural part of the asset lifecycle, not a last resort.
Configuration and Reintegration
Swapping a new GPU into an existing node isn't always plug-and-play:
- Firmware and VBIOS matching: New units may ship with different firmware versions than the rest of your fleet, requiring updates to maintain consistency.
- NVLink topology reconfiguration: In DGX/HGX systems, replacing a single GPU may require NVLink topology recalibration and fabric manager updates.
- Burn-in and validation: New hardware still needs burn-in testing before production deployment. This adds 1-3 days to the timeline regardless of how fast the unit arrives.
Hidden Costs of NOT Repairing
Every hour a GPU sits idle is lost revenue. Here's how to quantify that:
Lost Compute Revenue
The revenue impact of a single idle H100 depends on your workload and pricing model:
- Cloud GPU rental rates: H100 SXM5 instances typically rent for $2-$4/hour on major cloud providers. That's $48-$96/day in lost gross revenue per idle GPU.
- AI training throughput: For internal ML teams, the cost is measured in delayed model iterations. If your training cluster is GPU-bound (and it almost certainly is), one missing GPU slows every job that would have used it.
- Inference capacity: For production inference serving, a missing GPU directly reduces your maximum concurrent request capacity and may trigger SLA penalties.
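For the rental case, the lost revenue is simple arithmetic. Here is a small sketch of it; the `utilization` parameter is an assumption added for illustration (the article's $48-$96/day figures correspond to a GPU that would have been rented every hour of the day).

```python
HOURS_PER_DAY = 24


def idle_revenue_loss(hourly_rate_usd: float, idle_days: float,
                      utilization: float = 1.0) -> float:
    """Gross rental revenue lost while one GPU sits idle.

    utilization is a hypothetical knob: the share of hours the GPU would
    actually have been rented. 1.0 reproduces the $48-$96/day figures.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return hourly_rate_usd * HOURS_PER_DAY * idle_days * utilization


# One idle day at the low and high end of the $2-$4/hour range.
print(idle_revenue_loss(2.0, 1))  # 48.0
print(idle_revenue_loss(4.0, 1))  # 96.0
```

Operators with lower occupancy can pass a smaller `utilization` to get a less optimistic estimate.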
Downtime Duration Comparison
The total downtime calculation includes diagnosis, decision-making, repair/procurement, shipping, and reintegration:
| Phase | Repair Path | Replacement Path |
|---|---|---|
| Initial diagnosis | 1-3 days | 1-3 days |
| Decision / approval | 1 day | 1-5 days (procurement approval) |
| Shipping to repair facility | 1-3 days | N/A |
| Repair / procurement wait | 3-7 days | 14-28 days (H100, Q1 2026) |
| Return shipping / receiving | 1-3 days | 1-5 days |
| Reintegration + testing | 1 day | 1-3 days |
| Total downtime | 8-18 days | 18-44 days |
At roughly $50-$100/day in lost revenue per GPU, the downtime difference adds to the hardware cost savings. Comparing like with like (best case against best case, worst against worst), the H100 replacement path adds 10-26 days of downtime, or about $500-$2,600 per GPU. For H200 replacements with 3-6 month lead times, the downtime cost becomes the dominant factor.
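The totals can be checked by summing the phase rows directly. This sketch copies the per-phase ranges from the table and compares the two paths like for like (fastest against fastest, slowest against slowest), which is a simplifying assumption; the names are illustrative.

```python
# (min_days, max_days) per phase, copied from the phase rows of the table above.
REPAIR_PHASES = {
    "Initial diagnosis": (1, 3),
    "Decision / approval": (1, 1),
    "Shipping to repair facility": (1, 3),
    "Repair wait": (3, 7),
    "Return shipping / receiving": (1, 3),
    "Reintegration + testing": (1, 1),
}
REPLACEMENT_PHASES = {
    "Initial diagnosis": (1, 3),
    "Decision / approval": (1, 5),
    "Procurement wait": (14, 28),
    "Return shipping / receiving": (1, 5),
    "Reintegration + testing": (1, 3),
}


def total_days(phases: dict) -> tuple:
    """Sum the fastest and slowest duration of every phase."""
    fastest = sum(low for low, _ in phases.values())
    slowest = sum(high for _, high in phases.values())
    return fastest, slowest


repair = total_days(REPAIR_PHASES)
replacement = total_days(REPLACEMENT_PHASES)
# Like-for-like gap: fastest vs. fastest, slowest vs. slowest.
extra_days = (replacement[0] - repair[0], replacement[1] - repair[1])
# Cost of that gap at $50/day (low end) and $100/day (high end).
extra_cost = (extra_days[0] * 50, extra_days[1] * 100)
print(repair, replacement, extra_days, extra_cost)
```

Swapping in your own phase estimates is usually more informative than the generic ranges, since approval and receiving times vary widely between organizations.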
SLA and Contract Penalties
If you're selling GPU compute (cloud, managed inference, ML platform), GPU downtime can trigger contractual penalties:
- Availability SLAs: Most cloud GPU services guarantee 99.9%+ uptime. A single GPU being down for weeks blows through any SLA budget.
- Training job SLAs: MLaaS contracts often include completion time guarantees. A missing GPU that extends a training job's wall clock time may trigger penalty clauses.
- Customer churn: Repeated availability issues drive customers to competitors. The lifetime value of a lost customer far exceeds any single GPU's cost.
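To see how little room a 99.9% target leaves, the downtime budget can be computed directly. This is a rough sketch that assumes availability is measured per GPU over a 30-day window; real SLAs define their own measurement window and scope, so treat both defaults as placeholders.

```python
HOURS_PER_DAY = 24


def downtime_budget_hours(availability_target: float, window_days: int = 30) -> float:
    """Hours of downtime allowed in the window before the target is missed."""
    return (1.0 - availability_target) * window_days * HOURS_PER_DAY


def breaches_sla(outage_days: float, availability_target: float = 0.999,
                 window_days: int = 30) -> bool:
    """True if an outage of the given length exceeds the downtime budget."""
    budget = downtime_budget_hours(availability_target, window_days)
    return outage_days * HOURS_PER_DAY > budget


print(round(downtime_budget_hours(0.999), 2))  # 0.72 hours, about 43 minutes
print(breaches_sla(outage_days=8))             # True
```

Under these assumptions even the fastest repair path exceeds the budget many times over, which is why spare capacity matters alongside a fast repair channel.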
Decision Framework: When to Repair vs. Replace
Use this framework to make the repair/replace decision quickly and consistently. Walk through the questions in order:
Step 1: Is the failure repairable?
Some failures are not candidates for board-level repair:
- GPU die failure — not field-repairable. Replace.
- Catastrophic physical damage (board cracked, severe water/fire damage) — replace.
- Multiple simultaneous major failures (e.g., HBM + VRM + GPU die) — repair cost may exceed replacement cost. Get a diagnostic assessment to confirm.
For all other failure types (HBM, VRM, thermal, PCB trace, connector, NVLink), repair is viable. Proceed to Step 2.
Step 2: What's your time constraint?
- Need it back in <2 weeks: Repair is the only realistic option. Replacement procurement almost never meets this timeline.
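- Can wait 2-8 weeks: Both options may be on the table, although replacement only fits the short end of this window when approval and procurement run at their fastest. Compare costs.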
- No time pressure: Consider strategic factors (Step 3).
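Steps 1 and 2 can be sketched as a small triage function. The failure labels and return strings below are hypothetical identifiers chosen for this example, and treating any combination of two or more failures as "multiple major failures" is a simplification.

```python
from typing import Optional, Set

# Hypothetical labels for failures that rule out board-level repair (Step 1).
NOT_REPAIRABLE = {"gpu_die", "catastrophic_physical_damage"}


def triage(failures: Set[str], weeks_available: Optional[float]) -> str:
    """Apply Step 1 and Step 2 and return a recommended next action."""
    # Step 1: some failures cannot be repaired at board level.
    if failures & NOT_REPAIRABLE:
        return "replace"
    if len(failures) > 1:
        # Several simultaneous failures: repair cost may exceed replacement.
        return "get diagnostic assessment"
    # Step 2: time constraint (None means no time pressure).
    if weeks_available is None:
        return "weigh strategic factors (Step 3)"
    if weeks_available < 2:
        return "repair"
    # Assumption: any window of 2 weeks or more is treated as "compare costs".
    return "compare costs"


print(triage({"hbm"}, weeks_available=1))      # repair
print(triage({"gpu_die"}, weeks_available=6))  # replace
```

Encoding the rules this way mainly helps consistency: two engineers looking at the same failed board should reach the same starting recommendation.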
Step 3: Strategic considerations
- Planning a fleet upgrade? If you're migrating to H200/B200 within 6 months, repair the H100 and resell it to offset upgrade costs.
- Board has been repaired before? A second repair on the same board is still viable but should be evaluated case-by-case. If the same component fails again, it may indicate a systemic issue.
- Fleet standardization: If your fleet uses mixed firmware/VBIOS versions, a repaired board maintains consistency. A new board may introduce version mismatches.
Step 4: Run the numbers
For most scenarios, the calculation is straightforward:
- Repair cost ($2K-$8K) + shipping (~$200) + downtime cost (8-18 days x daily revenue loss)
- vs. Replacement cost ($25K-$40K) + downtime cost (18-44 days x daily revenue loss) + procurement overhead
In nearly every case where repair is technically viable, the economics favor repair. Get a free diagnostic assessment to know exactly what you're dealing with.
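The comparison can be written as one formula applied to both paths. A minimal sketch follows, pitting a worst-case repair against a best-case replacement; procurement overhead is left at zero because this article gives no figure for it, and the 18-day values are the sums of the relevant phase durations in the downtime table (slowest repair, fastest replacement).

```python
def path_cost(hardware_usd: float, downtime_days: float,
              daily_loss_usd: float, overhead_usd: float = 0.0) -> float:
    """Total cost of one path: hardware + overhead + lost revenue while down."""
    return float(hardware_usd + overhead_usd + downtime_days * daily_loss_usd)


# Worst-case repair: top repair price, ~$200 shipping, slowest path (18 days),
# highest daily revenue loss.
worst_repair = path_cost(8_000, downtime_days=18, daily_loss_usd=100, overhead_usd=200)

# Best-case replacement: cheapest unit, fastest path (18 days), lowest daily
# revenue loss, and no procurement overhead (an assumption, not a measured value).
best_replacement = path_cost(25_000, downtime_days=18, daily_loss_usd=50)

print(worst_repair, best_replacement)  # 10000.0 25900.0
```

Under these assumptions the most pessimistic repair scenario still costs well under half of the most optimistic replacement scenario, which is the gap the framework relies on.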
Fleet Economics: Repair ROI at Scale
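The repair vs. replace calculation changes significantly at fleet scale. Here's the annual cost model at different fleet sizes, assuming a 3% annual GPU failure rate (conservative, based on published large-scale failure data), an average repair cost of $5,000, and a replacement cost of $30,000-$40,000 per unit (the per-unit figures the replacement column works out to). Expected failures are rounded to whole units.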
| Fleet Size | Expected Failures/Year | Repair Cost (Total) | Replacement Cost (Total) | Annual Savings |
|---|---|---|---|---|
| 50 GPUs | ~2 | $10,000 | $60,000-$80,000 | $50,000-$70,000 |
| 100 GPUs | ~3 | $15,000 | $90,000-$120,000 | $75,000-$105,000 |
| 500 GPUs | ~15 | $75,000 | $450,000-$600,000 | $375,000-$525,000 |
| 1,000 GPUs | ~30 | $150,000 | $900,000-$1,200,000 | $750,000-$1,050,000 |
| 5,000 GPUs | ~150 | $750,000 | $4,500,000-$6,000,000 | $3,750,000-$5,250,000 |
These figures don't include downtime cost savings, which add another 20-40% to the repair ROI depending on your revenue-per-GPU-hour.
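The table can be reproduced with a few lines of arithmetic. This sketch assumes a per-unit replacement price of $30,000-$40,000, which is what the replacement column implies (for example, $90,000-$120,000 for 3 failures), and rounds expected failures half up to match the "~2" shown for a 50-GPU fleet. The function name and dictionary keys are illustrative.

```python
import math


def fleet_model(fleet_size: int, failure_rate_pct: int = 3,
                repair_usd: int = 5_000,
                replacement_usd: tuple = (30_000, 40_000)) -> dict:
    """Annual hardware cost of repairing vs. replacing failed GPUs."""
    # Round half up so 1.5 expected failures becomes 2, as in the table.
    failures = math.floor(fleet_size * failure_rate_pct / 100 + 0.5)
    repair_total = failures * repair_usd
    replacement_total = tuple(failures * price for price in replacement_usd)
    savings = tuple(total - repair_total for total in replacement_total)
    return {
        "failures": failures,
        "repair_total": repair_total,
        "replacement_total": replacement_total,
        "savings": savings,
    }


for size in (50, 100, 500, 1_000, 5_000):
    print(size, fleet_model(size))
```

Changing `failure_rate_pct` or `repair_usd` shows how sensitive the savings are to your own fleet's failure history and repair mix.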
Maintenance Contracts vs. Ad-Hoc Repair
At 10+ GPUs, a maintenance contract changes the economics further:
- Volume pricing: Per-repair costs decrease with committed volume. 10+ units qualify for volume discounts.
- Priority queue: Contract repairs go to the front of the line, reducing turnaround by 2-3 days on average.
- Pre-negotiated process: No procurement approval delays. When a board fails at 2 AM, the repair path is already established.
- Predictable budgeting: Annualized repair costs are forecastable based on fleet size and expected failure rates.
Learn more about fleet maintenance contracts or contact us for fleet pricing.
Key Takeaways
- Repair saves roughly 68-95% vs. replacement: $2,000-$8,000 repair cost vs. $25,000-$40,000 for a new H100 SXM5 (80-95% against a $40,000 unit). Shenzhen shops charge $1,400-$2,800 (Reuters, July 2025).
- Faster return to production — 1-2 weeks for repair vs. 2-4 weeks for H100 procurement or 3-6 months for H200 (Q1 2026 lead times).
- Downtime costs add up fast — at $50-$100/day per idle GPU, every extra week waiting for a replacement adds to the total cost.
- Fleet-scale savings are massive — a 1,000-GPU fleet saves $750K-$1M+ annually by repairing instead of replacing.
- Not every failure needs replacement: board-level repair succeeds in 80-95% of cases, depending on failure type. Get a free diagnostic before deciding.
Frequently Asked Questions
How much does it cost to repair an H100 GPU?
H100 GPU board-level repair costs $2,000-$8,000 depending on the failure type and service provider. According to Reuters (July 2025), Shenzhen repair shops charge $1,400-$2,800 per GPU; approximately 12 shops operate there, with the largest handling around 500 GPUs per month. Japan-based service with full diagnostics and validation runs $2,000-$8,000. This represents 5-20% of a $40,000 H100 SXM5, and up to about 32% when an $8,000 repair is set against a $25,000 unit. No known US or EU companies publicly offer board-level H100 repair.
How long does H100 GPU repair take compared to buying a replacement?
Board-level H100 repair takes 1-2 weeks from receiving the board to shipping it back repaired. As of Q1 2026, procurement of a new H100 typically takes 2-4 weeks — a significant improvement from the 8-11 month lead times seen in 2023. H200 NVL GPUs ($31,000-$32,000) still have lead times of 3-6 months. Repair remains the fastest path back to production. GPUs typically require repair after 2-5 years of continuous operation (per Reuters, citing repair shop data).
What is the ROI of repairing GPUs for a fleet of 100+ units?
For a 100-GPU fleet with a 3% annual failure rate, repairing instead of replacing saves approximately $75,000-$105,000 per year in hardware costs alone. At 1,000 GPUs, the annual savings scale to $750,000-$1,050,000. Additionally, repaired GPUs return to production roughly 1.5-4 weeks faster than replacements when the two paths are compared like for like, avoiding $50-$100+ per day in lost compute revenue per idle GPU.
When should you replace an H100 instead of repairing it?
Replace rather than repair when: (1) the GPU die itself has failed — die-level damage is not field-repairable, (2) multiple major subsystems are damaged simultaneously (e.g., HBM + VRM + PCB), making repair cost approach replacement cost, (3) the board has been repaired multiple times and is showing diminishing reliability, or (4) a technology upgrade (e.g., to H200 or B200) makes replacement strategically preferable.