01 Worldwide common field issues & failure modes
The three universal enemies of any cooling loop are corrosion, mineral scale and microbiological fouling (ASHRAE TC 9.9, 5th ed.). The CDU's core job is to hydraulically decouple the chip-side technology cooling system (TCS) loop from the facility water system (FWS) loop. Below are the failure modes operators actually report, with the prevention that addresses each.
| Issue | Symptom | Root cause | Prevention |
|---|---|---|---|
| Leaks REP | Slow seepage / droplets at QDs, gaskets, plate-HX joints; most start small. | QD not fully seated, bypassed interlocks, debris in coupling, rushed swaps, pump-seal failure; unclear who owns wet connections. | Dry-break / dripless blind-mate couplers; drip trays + secondary containment; leak sensors at QDs/hoses/manifolds/low points; clear IT-vs-facilities boundary. ASHRAE 5th ed. formalises dry-break QDs + segmented shutoff valves. |
| Biofilm / biological fouling REP | Biofilm restricts microchannel flow, raises hydraulic resistance, cuts efficiency. | Microbial growth; worst with plain DI water (no biocidal protection). | PG-25 (25% propylene glycol) suppresses microbes far better than DI water (OCP rationale); coolant-quality + filtration monitoring; tightening TCS filtration from 50 to 25 µm cuts the particulate substrate. |
| Galvanic / chloride corrosion REP | Localised pitting through the oxide layer; particles shed into coolant; eventual cold-plate/HX failure. | Dissimilar metals in a conductive fluid (Al anode vs Cu cathode); chloride drives pitting; falling inhibitor reserve = active attack. | Chloride < 25 ppm (consensus, not one named standard); mandatory azole inhibitor if mixed metals; prefer single-metal (copper) wetted paths; monthly inhibitor-reserve test. |
| Glycol / PG degradation VEN | pH drifts down; fluid turns acidic and corrodes Cu/Al/steel. | PG oxidises into organic acids (glycolic/lactic/formic), consuming alkalinity; makeup water dilutes inhibitor. | Hold pH ~8.0–10.5 (below 7.0, nonferrous metals corrode fast); test pH, reserve alkalinity, glycol % and conductivity: baseline at commissioning, full panel at 3 and 6 months, then annually, with quarterly pH spot-checks. Inhibited OAT glycol can extend fluid life from ~2–3 yr to ~8–10 yr (vendor claim). |
| Particulate / microchannel clog VEN | Clogged filters/fins, insulating scale, rising ΔP, abrupt shutdowns. | Poor water quality + weak filtration; corrosion fines abrade narrow channels at high velocity. | CDU supply filtration 50→25 µm as chip microchannel pitch narrows; side-stream/sub-micron polishing; trend filter ΔP; commissioning flush before operation. |
| Flow maldistribution REP | Uneven flow → hot racks/sleds; in two-phase, cold-plate dry-out (vapour quality → 100%). | Nonuniform heating raises ΔP as vapour fraction grows, so hotter sleds get less flow (self-reinforcing); unbalanced hose routing. | Manifolds with balancing valves; dedicated inlet/outlet paths; flow restrictors for even two-phase split; CDU VFD tied to ΔT + dP setpoints. |
| Air entrainment / cavitation VEN | Vapour in low-pressure zones; impeller erosion; pumps short of rated life. | Low NPSH available, poor inlet geometry, trapped air, incomplete fill/bleed after service. | Cavitation-resistant inlet; purge/bleed modes; ≥10–20% NPSH margin; size pump near best-efficiency point; air-ingress alarms. |
| Condensation / dew-point sweat VEN | Sweating on cold plates/pipes/IT; catastrophic on electronics. | TCS supply driven below room dew point. | Hold TCS supply ≥2–3 °C above room dew point (dew-point-aware reset); coolant preheat; humidity sensors in controls. ASHRAE W-classes cap facility supply temps. |
| Pump wear / failed N+1 VEN | Premature pump failure; loss of flow if failover doesn't engage. | Oversizing (off-BEP), inadequate NPSH, seal wear, vibration; unproven failover. | N+1 with auto-failover + isolation valves (PLC <100 ms); seal-less / mag-coupled pumps for online service; size near BEP; predictive analytics. |
| Controls / integration gaps REP | False leak alarms; sensor drift/EMI; out-of-range flow/pressure; unclear alarm response. | Rope sensors tripped by residue; aging/uncalibrated sensors; CDU↔BMS↔DCIM integration gaps; blurred monitoring ownership. | Integrate via Modbus/BACnet/dry-contact into BMS/DCIM; calibrate at commissioning; standard false-alarm procedure (dry, clean, function-test); continuous trending + automated alarms. |
| Commissioning defects REP | Early-life fouling, contamination, leaks, weak heat transfer, "blame-the-chiller" misdiagnosis. | Pre-charge fluid not flushed; wrong glycol % (high → viscosity penalty, low → lost freeze/biocidal protection); incompatible refill; cross-mated QDs. | Manufacturer fill/flush with a flushing skid; staged filtration + adequate flush velocity; QC on pH/turbidity/inhibitor/glycol %. Reference: OCP pre-commissioning prep for TCS row manifolds. |
| Fluid / standardization gaps STD | Proprietary coolants/connectors with poor cross-vendor interoperability across refresh cycles. | Historically vendor-specific connectors, hoses, manifolds and fluids. | OCP Cooling Environments standardises interfaces/operating params; OCP PG-25 guideline specifies wetted-material compatibility, tubing, temp/pressure, filtration + safety for multi-vendor supply. |
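The fluid-chemistry limits quoted in the table (pH window of roughly 8.0–10.5, the 7.0 danger line, chloride below 25 ppm, PG-25 concentration) lend themselves to a simple sample check. The following is a minimal sketch only: the `CoolantSample` type, the `check_sample` function and the ±3 percentage-point glycol tolerance are illustrative assumptions, not taken from any vendor or standard.

```python
from dataclasses import dataclass
from typing import List

# Limits quoted in the table above.
PH_MIN, PH_MAX = 8.0, 10.5
PH_CRITICAL = 7.0            # below this, nonferrous metals corrode fast
CHLORIDE_LIMIT_PPM = 25.0    # consensus figure, not one named standard
GLYCOL_TARGET_PCT = 25.0     # PG-25
GLYCOL_TOLERANCE_PCT = 3.0   # ASSUMPTION: illustrative tolerance, not from the source


@dataclass
class CoolantSample:
    """One lab or field-test result for the TCS loop (hypothetical structure)."""
    ph: float
    chloride_ppm: float
    glycol_pct: float


def check_sample(sample: CoolantSample) -> List[str]:
    """Return a list of findings; an empty list means all checks passed."""
    findings: List[str] = []
    if sample.ph < PH_CRITICAL:
        findings.append("pH below 7.0: rapid nonferrous corrosion risk")
    elif not (PH_MIN <= sample.ph <= PH_MAX):
        findings.append("pH outside 8.0-10.5 window")
    if sample.chloride_ppm >= CHLORIDE_LIMIT_PPM:
        findings.append("chloride at or above 25 ppm")
    if abs(sample.glycol_pct - GLYCOL_TARGET_PCT) > GLYCOL_TOLERANCE_PCT:
        findings.append("glycol % off PG-25 target")
    return findings


if __name__ == "__main__":
    for finding in check_sample(CoolantSample(ph=7.5, chloride_ppm=30.0, glycol_pct=25.0)):
        print(finding)
```

A check like this would sit behind the test cadence in the table (baseline, 3 and 6 months, annual), with the findings trended rather than treated as one-off pass/fail results.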
02 Control systems & BMS / DCIM integration
A CDU runs two loops at once: a VFD pump loop for hydraulic delivery and a temperature loop that modulates the primary control valve to hold secondary supply temp. The fullest public description of the actual algorithms is the Lenovo Neptune RM100 O&M guide (built by Cooltera); DMTF Redfish + OCP anchor the standards layer.
How the loops are regulated VENDOR — Lenovo RM100
| Loop | Controls | Strategy / detail |
|---|---|---|
| Pump (VFD) | Flow or ΔP control | Ramp pump speed until measured flow = setpoint, or until supply-return ΔP = setpoint (10 s scan to avoid oscillation). ΔP control is the common default for direct-to-chip with many parallel server branches (holds head as node valves open/close); flow control where a fixed total delivery is the target. Supply temp is not a pump-loop variable. |
| Temperature (PID + valve) | PID on primary 2-way / 3-way valve | Modulates 0–100% flow (or 0% bypass → 100% HX). Demand-vs-feedback checked every 15 min; >10% deviation → valve fault. Loss of valve signal fails to bypass/closed = no cooling. Retune via Ziegler-Nichols (PI for slow loads, PID for fast). |
| Setpoint / reset | Fixed SP / SP + dew-point offset | RM100 default secondary setpoint 18 °C. Vertiv XDU1350 secondary range 10–52 °C, dew-point control standard. Framed by the ASHRAE W-classes W17, W27, W32, W40 (new), W45 and W+ (STANDARD). |
| Dew-point reset | Room temp + RH monitor | If the dew point rises to within 3 °C of the setpoint, the CDU re-adjusts the setpoint to stay ≥3 °C above dew point; RH-sensor failure → safe fallback to Fixed SP 18 °C. ASHRAE concurs: coolant must stay above dew point. |
| N+1 pump changeover | Lead/standby, 7-day duty | Changeover ~0.25 s; on restart picks lowest-runtime pump; if a pump can't reach 90% of demand within 100 s it stops, standby starts, alarms raise; both failing → latching shutdown. |
| Leak → action | Level sensor + pressure interlock | Latching SHUTDOWN if level-open AND flow/dP <50% setpoint (1 s delay); external rope/spot leak-tape connector. Boyd alternative = bypass-loop standby. A dedicated motorised leak-isolation-valve algorithm is not publicly disclosed — documented responses are pump-shutdown or bypass-standby. |
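Two of the rules in the table above, the dew-point reset and the leak shutdown condition, can be sketched as plain functions. This is a hedged illustration, not the RM100 firmware: the Magnus approximation for dew point is an assumption (the source does not say which formula the controller uses), the function names are hypothetical, and the 1 s delay and latching behaviour are deliberately left out.

```python
import math
from typing import Optional

FIXED_SP_C = 18.0     # RM100 default secondary setpoint, per the table
DEW_MARGIN_C = 3.0    # minimum margin above room dew point, per the table


def dew_point_c(temp_c: float, rh_percent: float) -> float:
    """Dew point via the Magnus approximation (ASSUMED formula, not from the source)."""
    a, b = 17.62, 243.12
    gamma = math.log(rh_percent / 100.0) + a * temp_c / (b + temp_c)
    return b * gamma / (a - gamma)


def secondary_setpoint_c(room_temp_c: float, rh_percent: Optional[float]) -> float:
    """Dew-point-aware setpoint: never less than dew point + 3 degC.

    A missing or out-of-range RH reading is treated as a sensor failure and
    falls back to the fixed 18 degC setpoint, as the table describes.
    """
    if rh_percent is None or not (0.0 < rh_percent <= 100.0):
        return FIXED_SP_C
    return max(FIXED_SP_C, dew_point_c(room_temp_c, rh_percent) + DEW_MARGIN_C)


def leak_shutdown_condition(level_switch_open: bool, measured: float, setpoint: float) -> bool:
    """True when the level switch is open AND flow (or dP) is below 50% of setpoint.

    The real controller applies a 1 s delay and latches; this only evaluates
    the instantaneous condition.
    """
    return level_switch_open and measured < 0.5 * setpoint


if __name__ == "__main__":
    print(round(secondary_setpoint_c(25.0, 60.0), 2))  # humid room: setpoint lifts above 18
    print(secondary_setpoint_c(22.0, 40.0))            # dry room: stays at the fixed setpoint
    print(secondary_setpoint_c(22.0, None))            # RH sensor failed: fixed fallback
```

Requiring both a low level and a flow or dP collapse before shutting down is what keeps a single noisy sensor from tripping the unit, which matters given the false-alarm issues listed in section 01.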
BMS / DCIM protocols actually supported (per vendor)
| Vendor / product | Modbus | BACnet | SNMP | Redfish | Notes |
|---|---|---|---|---|---|
| Vertiv XDU1350 | RTU+TCP | — | Yes | — | CLI, web server, 7″ HMI |
| CoolIT CHx/AHx (CHx2000) | Yes | Yes | Yes | Yes | "+ many others"; group control 20 units |
| Motivair / by Schneider | Yes (+LON) | MS/TP + IP | Yes | — | EcoStruxure integration |
| Lenovo Neptune (XCC) | RS485/CAN | — | v3 | Redfish | Dual Ethernet; XClarity Call-Home |
| ZutaCore | — | — | — | RESTful | HyperCool Cloud fleet ops |
| OCP requirement | legacy | legacy | legacy | REQUIRED | Prometheus telemetry; pump-RPM + valve-aperture commands |
03 After-sales, support & warranty
Serviceability and support are where deployments live or die over a 5–10 year life. Note two recent ownership moves that change the after-sales entity: Boyd Thermal → Eaton (CDU service is now Eaton's) and Motivair → Schneider Electric (75% stake). Published warranty lengths are rare: most vendors don't disclose them, so the table marks undisclosed items "n/d" rather than guessing.
| Vendor | Warranty (published only) | SLA / response | Serviceability | Remote monitoring |
|---|---|---|---|---|
| Vertiv | n/d (optional Prime Labor Warranty; length n/d) | Guaranteed Emergency Response tiers (hours n/d); agreements 1→5 yr | Redundant pumps, dual feeds, 50 µm filter, 7″ HMI | Proactive remote monitoring + leak algorithms |
| CoolIT | Implied 2–5 yr via SLA alignment | SLA 2–5 yr; PM every 6 mo | Hot-swap pumps/filters/sensors, front+back, N+N (CHx2000); 25 µm | Redfish/SNMP/Modbus; 80+ countries direct, 157+ via ASPs |
| Motivair / Schneider | 12-mo parts; 4-yr compressor; 2–5 yr extended (needs PM) | Platinum 24×7 + 6-hr onsite; 2 PM visits/yr | Redundant pumps each with own VFD; corridor service access | Centurion cellular cloud (read-only) + EcoStruxure; 600+ field techs |
| Boyd / Eaton | n/d | "quick response" (hours n/d) | Hot-swap cold plates + pumps; FRU + depots (Taiwan/USA/Poland) | Predictive PM (loss-of-flow, pump-life, hours-based) |
| nVent | Up to 5 yr via annual PM, then lifetime PM (base n/d) | PM yearly (hours n/d) | Hot-swap pump-filter-drive cartridge, 1 tech <30 min; N+1 seal-less pumps; N+1 isolatable filters; live maintenance | Liquid-quality + leak + telemetry in pump modules |
| Delta | n/d | n/d | GoCool L2L: hot-swap filtration + dual power (N+1 count unconfirmed) | SNMP/Modbus TCP/BACnet; InfraSuite |
| Stulz | n/d | 8-hr response; 12–36 mo contracts | Optional redundant pumps; front+rear serviceable; quick-release sanitary couplings; 50 µm | Filter status + level via Modbus/BACnet/SNMP |
| Accelsius (2-phase) | "Multi-year" custom, CNA-underwritten; up to $100k/rack leak cover (NeuGuard) | Standard + white-glove (numbers n/d) | Hot-swap pumps/PSUs/control boards/sensors; retrofit-ready; non-conductive dielectric leak | n/d |
| ZutaCore (2-phase) | References obligations (length n/d) | "Time-response objectives" (n/d) | Waterless closed loop; minimal scheduled maintenance (no glycol/strainer) | Integrated CDU health monitoring; strongest training tier (certification + LMS) |
| Lenovo | Premier Support tiers (base n/d; record-retention condition) | Scalable 24×7; quarterly health checks | RM100 4U ~100 kW; drip-free blind-mate couplers; Commissioning Kit included | XClarity suite + auto Call-Home + Energy Manager (most mature) |
Per-vendor published-spec & capability matrix
What each vendor actually publishes and supports — distilled from datasheet research. VENDOR figures; ✓ = published/supported, — = not published, ~ = partial/implied. Differential pressure and ASHRAE water class are the most-suppressed specs industry-wide.
| Vendor | Redfish | dP / head published | ASHRAE class | Hot-swap service | Secondary filtration | Notable |
|---|---|---|---|---|---|---|
| Vertiv | — | ✓ (XDU1350: 2.44 bar) | ✓ W3 / W45 | ~ (XDU070 pumps) | 25–50 µm | Modbus/SNMP/CLI; 7″ HMI |
| CoolIT | ✓ | ✓ (35–44 psi) | ✓ W17–W+ | ✓ N+N pumps/filters | 25 µm | 80+ countries service |
| Motivair / Schneider | — | ✓ (32 psi head) | — | ~ redundant pumps | n/p | Platinum 6-hr onsite SLA |
| Boyd / Eaton | — | ✓ (up to 80 psi) | — | ✓ plates + pumps | 0.2 µm side-stream | Deschutes 2 MW; predictive PM |
| nVent | ~ (Deschutes) | ✓ (2.7 bar) | ✓ W4 | ✓ cartridge <30 min | 50 µm (25 opt) | seal-less N+N; lifetime PM |
| Delta | — | — | — | ~ filtration | 50 µm | SNMP/Modbus/BACnet |
| Stulz | — | — | ✓ W32–W+ | ~ front+rear | 50 µm | 8-hr response; sanitary QR |
| Accelsius (2-phase) | ✓ | — | ~ W27/W45 | ✓ pumps/PSU/boards | 20 µm | NeuGuard up to $100k/rack leak |
| ZutaCore (2-phase) | ~ (REST) | ✓ (3 / 4.5 bar) | ✓ W3 | ~ minimal maint | n/p | waterless; strong training tier |
| Lenovo | ✓ (XCC) | ~ (relief 3.5 bar) | — | ~ blind-mate | 50 µm | XClarity + Call-Home (most mature) |
04 TCO & maintenance burden by type
No neutral $/kW CDU price is publicly sourceable, so this is a relative trade-off read, not a quote. The biggest single capex fork is whether the type needs a facility-water plant (liquid-to-liquid, L2L) or not (liquid-to-air, L2A, and 2-phase). The biggest maintenance fork is many small units vs few large ones. High-temp water (W40/W45) is what turns direct-to-chip into near-free cooling: the facility loop can reject heat through a dry cooler or tower for most or all hours, eliminating the mechanical chiller.
| Type | CapEx | OpEx / efficiency | Density | Water (WUE) | Retrofit | Maint. burden | Redundancy / SPOF |
|---|---|---|---|---|---|---|---|
| In-rack | ⚠ many units | ⚠ good | ✅ 100–300 kW | depends | ✅ easy | ❌ many small, high PM | ✅ small domain / ❌ N+1 per rack costly |
| In-row | ⚠ plumbing | ✅ good | ✅ 0.6–2.3 MW | depends | ⚠ row piping | ✅ few large, easy service | ❌ central SPOF → N+1 |
| Sidecar | ⚠ moderate | ⚠ good | ✅ ~200 kW | ✅ if L2A | ✅ strong | ⚠ per-rack + airside | ✅ contained |
| L2L | ❌ water plant | ✅ best (PUE<1.1, free-cool) | ✅ multi-MW | ❌ high if evaporative | ❌ needs water infra | ✅ pump/filter/HX focus | ❌ large central SPOF |
| L2A | ✅ no water plant | ❌ PUE<1.2, summer-sensitive | ⚠ 16–200 kW | ✅ ~zero | ✅ best | ⚠ fan/filter/finned-tube | ⚠ many small units |
| 2-phase D2C | ⚠ costly fluid, no water plant | ✅ ~35% OpEx*, low pump power | ✅ >100 kW/rack | ✅ waterless | ✅ good | ✅ no fluid replacement, ITE-safe | ✅ low-flow/low-leak; ⚠ newer |
The read. Lowest capex / best retrofit / zero water = L2A (and waterless two-phase). Best efficiency at scale = L2L with warm-water free cooling — at the cost of a water plant, WUE, and a large central single-point-of-failure that makes N+1 mandatory. Maintenance burden is the many-small-vs-few-large axis: in-rack and L2A multiply units and leak surface area; in-row and L2L concentrate into fewer serviceable units but raise SPOF. Two-phase is the emerging density play (≈1/10 the flow → smaller pumps, waterless), tempered by expensive dielectric fluid (two-phase fluids can exceed $50/L) and lower maturity.
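The warm-water free-cooling argument can be made concrete with a rough hour count: a dry cooler can carry the load whenever ambient dry-bulb temperature plus the cooler's approach stays at or below the facility supply temperature. The sketch below is illustrative only; `free_cooling_fraction`, the 8 K approach and the sample temperatures are assumptions, not figures from the source, and a real study would use site weather data and the dry cooler's rated approach.

```python
from typing import Sequence


def free_cooling_fraction(
    ambient_dry_bulb_c: Sequence[float],
    facility_supply_c: float,
    approach_c: float = 8.0,  # ASSUMPTION: illustrative dry-cooler approach
) -> float:
    """Fraction of sampled hours in which a dry cooler alone can meet the supply temp."""
    if not ambient_dry_bulb_c:
        raise ValueError("need at least one ambient temperature sample")
    ok_hours = sum(1 for t in ambient_dry_bulb_c if t + approach_c <= facility_supply_c)
    return ok_hours / len(ambient_dry_bulb_c)


if __name__ == "__main__":
    # Hypothetical hourly samples (degC), not real weather data.
    samples = [10.0, 20.0, 30.0, 38.0]
    for w_class_c in (17.0, 27.0, 45.0):
        print(w_class_c, free_cooling_fraction(samples, w_class_c))
```

With these made-up samples the fraction rises from 0.0 at a 17 °C supply to 0.75 at 45 °C, which is the direction of the effect described above; the absolute numbers carry no meaning.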