RFP (Request for Proposal) document focusing solely on the cooling solution for an AI/HPC data center
- Kommu .
- Apr 9
This is a Request for Proposal (RFP) document focusing solely on the cooling solution for an AI/HPC data center. It references public information from Supermicro’s solutions for liquid cooling, AI/Deep Learning, and HPC workloads.
1. RFP Overview
Title: RFP – High-Efficiency Liquid Cooling Solution for AI and HPC Data Center
Purpose: Solicit proposals for a comprehensive liquid cooling infrastructure designed to support high-density AI and HPC clusters. This solution must align with best practices that reduce total power usage, manage heat dissipation efficiently, and enable future scalability with minimal retrofitting costs.
2. Project Background & Objectives
Project Background
We are building or upgrading a data center to host compute-heavy workloads, including AI model training, large-scale inference, and HPC simulations.
The data center will feature high-power GPUs/CPUs (e.g., 300 W+ TDP each) that require advanced cooling solutions beyond standard air cooling.
Objectives
High Efficiency: Reduce Power Usage Effectiveness (PUE) via liquid cooling or similar next-gen cooling methodologies.
Thermal Management: Ensure that each rack and node receives stable cooling, supporting maximum system performance under sustained load.
Scalability & Modularity: The system should accommodate future expansion (additional compute racks or higher-wattage CPU/GPU modules).
Reliability & Resilience: Provide redundant cooling paths or failover mechanisms, ensuring minimal downtime even if a single cooling component fails.
3. Cooling Requirements & Scope
Cooling Approach
Direct Liquid Cooling (DLC), Immersion Cooling, or Hybrid Air-Liquid:
DLC: Direct-to-chip cooling loops that remove heat through cold plates mounted on the CPUs/GPUs.
Immersion: Liquid submersion of entire server chassis.
Hybrid: May include chilled water rear-door heat exchangers plus direct liquid loops for high-wattage components.
Reference the Supermicro Complete Data Center Liquid Cooling solutions that integrate direct liquid loops or immersion tanks.
Performance Targets
Support rack power densities of 30–50 kW per rack.
Maintain CPU/GPU temperature thresholds within vendor-recommended specs during peak loads.
Achieve a PUE target of ≤1.2 (or better, if site conditions allow).
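The PUE target above can be checked with simple arithmetic. The sketch below assumes the usual definition (total facility power divided by IT power); the site figures (20 racks at 50 kW, 180 kW of overhead) are hypothetical and only illustrate the calculation.

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness: total facility power divided by IT power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


def max_overhead_kw(it_load_kw: float, target_pue: float = 1.2) -> float:
    """Largest non-IT power (cooling, distribution losses, lighting)
    that still meets the target PUE."""
    return it_load_kw * (target_pue - 1.0)


# Hypothetical site: 20 racks at 50 kW each, 180 kW of cooling and other overhead.
it_kw = 20 * 50.0
overhead_kw = 180.0
site_pue = pue(it_kw + overhead_kw, it_kw)

print(f"PUE = {site_pue:.2f}, meets 1.2 target: {site_pue <= 1.2}")
print(f"Overhead budget at PUE 1.2: {max_overhead_kw(it_kw):.0f} kW")
```

Bidders could be asked to supply the same two inputs (IT load and total facility power at design conditions) so that claimed PUE values are comparable across proposals.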
Compatibility
Server-Level Compatibility: The proposed cooling infrastructure must align with existing or planned HPC/AI servers—such as Supermicro’s GPU-optimized systems or comparable platforms—that support liquid cooling attachments.
Facility Integration: The design must interface with existing building infrastructure (plumbing, CRAC/CRAH units, chillers, or heat rejection loops).
Redundancy & Resilience
At least N+1 redundancy for pumps, manifolds, and heat exchangers where feasible.
Emergency fallback or safe shutdown procedures in case of coolant loop failure.
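As a rough illustration of the N+1 requirement, the sketch below sizes a bank of identical cooling units (pumps or heat exchangers) and checks whether the load is still covered after one unit fails. The 400 kW row load and 150 kW unit capacity are assumed values, not figures from this RFP.

```python
import math


def units_required(load_kw: float, unit_capacity_kw: float, spares: int = 1) -> int:
    """Units needed to carry the load (N) plus the requested number of spares."""
    n = math.ceil(load_kw / unit_capacity_kw)
    return n + spares


def survives_single_failure(load_kw: float, unit_capacity_kw: float,
                            installed_units: int) -> bool:
    """True if the remaining units can still carry the full load
    after any one unit is lost."""
    return (installed_units - 1) * unit_capacity_kw >= load_kw


# Hypothetical row: 400 kW of heat, units rated at 150 kW each.
row_load_kw = 400.0
unit_kw = 150.0

needed = units_required(row_load_kw, unit_kw)  # N = 3, plus 1 spare
print(f"Install {needed} units for N+1")
print("3 units survive a failure:", survives_single_failure(row_load_kw, unit_kw, 3))
print("4 units survive a failure:", survives_single_failure(row_load_kw, unit_kw, 4))
```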
Environmental Impact & Water Usage
Describe approaches to reduce overall water usage and any opportunities for waste heat reuse.
Provide solutions or best practices for water treatment and closed-loop system maintenance.
4. Technical Specifications
Bidders should propose a complete system that covers the following points:
Liquid Cooling Infrastructure
Manifold Design: Each rack or row should have a modular manifold to distribute coolant to server nodes.
Coolant Type: Indicate recommended coolant (e.g., water/glycol mix, dielectric fluid if immersion).
Flow Control & Monitoring: Automated flow regulation, real-time monitoring of flow rate, pressure, and coolant temperature (in/out).
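To give a sense of the flow rates the manifolds and flow control must handle, the sketch below applies the basic heat balance Q = m·cp·ΔT. It assumes plain water properties (a water/glycol mix has a lower specific heat and would need more flow) and a hypothetical 10 K temperature rise across the rack.

```python
# Approximate properties of plain water near room temperature (assumption).
WATER_CP_J_PER_KG_K = 4186.0
WATER_DENSITY_KG_PER_L = 0.997


def required_flow_lpm(heat_kw: float, delta_t_k: float,
                      cp: float = WATER_CP_J_PER_KG_K,
                      density: float = WATER_DENSITY_KG_PER_L) -> float:
    """Coolant flow in liters per minute needed to carry heat_kw
    with a supply-to-return temperature rise of delta_t_k."""
    if delta_t_k <= 0:
        raise ValueError("Temperature rise must be positive")
    mass_flow_kg_s = heat_kw * 1000.0 / (cp * delta_t_k)
    return mass_flow_kg_s / density * 60.0


# Rack densities from the performance targets, with an assumed 10 K rise.
for rack_kw in (30.0, 50.0):
    print(f"{rack_kw:.0f} kW rack: {required_flow_lpm(rack_kw, 10.0):.1f} L/min")
```

Halving the allowed temperature rise doubles the required flow, which is one reason bidders should state the design supply and return temperatures alongside the flow rates.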
Distribution & Loop Topology
Primary Cooling Loop: Piping from facility chillers (or an external cooling plant) to in-rack manifolds.
Secondary / Rack-Level Loop: Within each rack, direct supply and return lines to cooled components (CPUs, GPUs, memory, power modules).
Immersion Option (if proposed): Tanks or enclosures for fluid submersion, plus fluid pumping/filtering systems.
Heat Exchangers & Chillers
Type and capacity of heat exchangers required for server inlets and outlets.
Redundant chiller units with configurable setpoints that can adapt to HPC or AI workloads.
Controls & Management Software
Integration with data center infrastructure management (DCIM) software to monitor coolant usage, temperature, flow rates, and alarms.
APIs or dashboards for real-time performance data (e.g., Supermicro IPMI or third-party management tools).
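The kind of alarm logic a DCIM integration might apply to coolant telemetry can be sketched as follows. The `LoopReading` and `Limits` types, the alarm names, and all threshold values are hypothetical; they do not correspond to any Supermicro or DCIM vendor API, and real limits would come from the equipment vendor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoopReading:
    """One telemetry sample from a rack-level coolant loop (hypothetical schema)."""
    rack_id: str
    flow_lpm: float
    pressure_kpa: float
    supply_temp_c: float
    return_temp_c: float


@dataclass
class Limits:
    """Illustrative alarm thresholds; placeholder values only."""
    min_flow_lpm: float = 40.0
    max_pressure_kpa: float = 350.0
    max_return_temp_c: float = 55.0
    max_delta_t_c: float = 15.0


def evaluate(reading: LoopReading, limits: Limits) -> List[str]:
    """Return the alarm codes raised by a single reading."""
    alarms = []
    if reading.flow_lpm < limits.min_flow_lpm:
        alarms.append("LOW_FLOW")
    if reading.pressure_kpa > limits.max_pressure_kpa:
        alarms.append("HIGH_PRESSURE")
    if reading.return_temp_c > limits.max_return_temp_c:
        alarms.append("HIGH_RETURN_TEMP")
    if reading.return_temp_c - reading.supply_temp_c > limits.max_delta_t_c:
        alarms.append("HIGH_DELTA_T")
    return alarms


sample = LoopReading("rack-07", flow_lpm=35.0, pressure_kpa=300.0,
                     supply_temp_c=30.0, return_temp_c=48.0)
print(sample.rack_id, evaluate(sample, Limits()))
```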
Operational & Maintenance Requirements
Schedule of recommended periodic servicing (e.g., filter replacement, fluid analysis).
Leak detection systems and containment strategies.
5. Implementation Approach
5.1 Phased Deployment
Phase 1: Proof-of-Concept (PoC) or pilot installation in one or two racks to validate performance (heat removal, server stability).
Phase 2: Scaling out to all HPC and AI racks, finalizing piping and manifold installation.
Phase 3: Integration with DCIM, operational readiness testing, and sign-off.
5.2 Site Preparation & Dependencies
Facility Assessments: Evaluate chiller capacity, existing plumbing, floor load capacity (especially for immersion tanks).
Retrofit vs. New Build: If an existing data center is being upgraded, vendors must outline how to install liquid cooling with minimal disruption to ongoing operations.
5.3 Training & Knowledge Transfer
Basic operational training for data center staff on:
Starting/stopping cooling loops
Monitoring fluid levels and temperatures
Handling potential coolant leaks
Advanced training on system management software, performance tuning, and alarm resolution.
6. Proposal Format & Required Deliverables
Vendors must submit proposals containing the following sections:
Executive Summary
High-level overview of the cooling architecture and rationale for choosing the recommended technology (direct liquid, immersion, or hybrid).
Technical Solution
Detailed design diagrams showing coolant loops, rack manifolds, pumps, heat exchangers, etc.
Explanation of integration with HPC/AI servers (e.g., Supermicro GPU platforms or similar).
Bill of Materials (BoM)
All hardware components: manifolds, pumps, pipe fittings, immersion enclosures (if applicable), sensors, software licenses, etc.
Indicate each item’s function, part number, and associated cost.
Implementation & Timeline
Installation phases and milestones, from site inspection to final acceptance.
Risk Management & Contingencies
Potential challenges (e.g., leak risks, expansions) and mitigation strategies.
Contingency plans for supply chain delays or facility constraints.
Maintenance & Support
Warranty details, SLAs, and recommended maintenance intervals.
Service-level commitments for on-site repairs and response times.
Cost & Payment Terms
Detailed cost breakdown (equipment, labor, support).
Payment schedule tied to project milestones.
7. Evaluation Criteria
Technical Fit & Completeness
How well the proposal meets or exceeds the stated cooling requirements (heat removal capacity, redundancy, efficiency).
Thoroughness in addressing site integration, future scalability, and maintenance.
Vendor Experience & References
Demonstrated success in implementing large-scale liquid or immersion cooling solutions, particularly for AI and HPC data centers.
Customer references or case studies showing improved PUE and performance results.
Cost-Effectiveness & TCO
Balance between upfront CAPEX and long-term OPEX savings (energy and water usage).
Clarity in total cost of ownership (TCO) analysis over a 3–5-year horizon.
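A simple way to compare proposals on TCO is to add CAPEX to cumulative OPEX over the evaluation horizon. The sketch below uses entirely hypothetical cost figures for an air-cooled baseline and a liquid-cooled option, and it ignores discounting, inflation, and residual value.

```python
def tco(capex: int, annual_opex: int, years: int) -> int:
    """Undiscounted total cost of ownership over the given number of years."""
    return capex + annual_opex * years


def breakeven_years(extra_capex: int, annual_savings: int) -> float:
    """Years until the OPEX savings repay the additional upfront CAPEX."""
    if annual_savings <= 0:
        raise ValueError("Annual savings must be positive to break even")
    return extra_capex / annual_savings


# Hypothetical figures in USD, for illustration only.
air = {"capex": 2_000_000, "annual_opex": 900_000}
liquid = {"capex": 2_600_000, "annual_opex": 700_000}

for years in (3, 5):
    print(f"{years}-year TCO: air = {tco(air['capex'], air['annual_opex'], years):,}, "
          f"liquid = {tco(liquid['capex'], liquid['annual_opex'], years):,}")

payback = breakeven_years(liquid["capex"] - air["capex"],
                          air["annual_opex"] - liquid["annual_opex"])
print(f"Break-even after {payback:.1f} years")
```

With these assumed numbers the two options cost the same at year 3 and the liquid option is cheaper by year 5, which shows why the choice of horizon matters when comparing bids.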
Project Timeline & Deployment Risk
Feasibility of proposed schedule.
Vendor’s ability to handle site-specific constraints, potential expansions, or retrofits.
Support & Warranty
Level of ongoing technical support, training, and local presence (if applicable).
Length and coverage of warranties for cooling components and associated hardware.
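The five criteria above could be combined into a single weighted score per proposal. The weights and the vendor scores in this sketch are hypothetical (this RFP does not assign weights); only the criterion names are taken from this section.

```python
# Hypothetical weights for the five evaluation criteria; they must sum to 1.
WEIGHTS = {
    "Technical Fit & Completeness": 0.35,
    "Vendor Experience & References": 0.20,
    "Cost-Effectiveness & TCO": 0.25,
    "Project Timeline & Deployment Risk": 0.10,
    "Support & Warranty": 0.10,
}


def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (0 to 10) into one weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"Missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Illustrative evaluator scores for two fictional bidders.
vendor_a = dict(zip(WEIGHTS, (8, 7, 6, 9, 7)))
vendor_b = dict(zip(WEIGHTS, (7, 9, 8, 6, 8)))

for name, scores in (("Vendor A", vendor_a), ("Vendor B", vendor_b)):
    print(f"{name}: {weighted_score(scores):.2f}")
```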
8. Submission Guidelines
Deadline:
All proposals must be submitted by [Date/Time].
Submission Method:
Electronic copies in PDF format, plus any supplemental design documents (drawings, charts) in commonly readable formats (Visio, DWG, etc.) to [Contact Email].
Contact Point:
Questions or clarifications must be directed to [Name], [Title], at [Email / Phone].
Proposal Validity Period:
Proposals shall remain valid for a minimum period of 90 days.
9. Conclusion
This RFP seeks an advanced, efficient, and future-proof liquid cooling solution that can handle the rapidly increasing thermal footprints of AI and HPC hardware. Drawing on Supermicro’s integrated data center liquid cooling offerings (direct-to-chip, immersion, or rear-door heat exchangers) or equivalent solutions, vendors must demonstrate how their proposed design will:
Sustain stable operations for dense GPU/CPU nodes,
Deliver measurable energy savings (lower PUE),
Provide flexible scaling to meet next-generation hardware demands.
Vendors are encouraged to include innovative approaches (e.g., waste heat reuse, advanced coolant monitoring) that bolster the data center’s overall sustainability and performance.