
RFP-style design configuration for an AI Data Center built around NVIDIA’s DGX B200 platform

  • Writer: Kommu
  • Apr 9
  • 5 min read

This post presents an RFP-style design configuration for an AI data center built around NVIDIA’s DGX B200 platform (using the new Blackwell GPUs) that is future-resilient for both large-scale training and inference. The proposal is based on typical data center best practices, public information on NVIDIA’s Blackwell architecture, and reference configurations such as those published by AMAX in their comparison of NVIDIA Blackwell systems.


  1. RFP OVERVIEW


Title: Request for Proposal (RFP) – Next-Generation AI Data Center Design Featuring NVIDIA DGX B200 Nodes


Objective: Design and build a future-resilient AI-focused data center solution to support large-scale model training, high-throughput inference workloads, and anticipated technology refreshes. The solution should incorporate best practices for networking, power, cooling, and high-density compute, with a flexible architecture that can expand to meet evolving AI demands.


PROPOSED SOLUTION SUMMARY

  1. Compute Layer

    • NVIDIA DGX B200 Nodes (Blackwell Architecture)

      • Each DGX B200 node is equipped with eight Blackwell GPUs, designed for high-density AI acceleration.

      • Onboard high-performance interconnect (NVLink) for intra-node GPU communication.

      • CPU subsystem configured with high-core-count processors (the NVIDIA reference design uses dual-socket Intel Xeon CPUs).

      • Ample system memory (min 2–3 TB per node, or as recommended by NVIDIA) to handle large training and inference datasets.

    • Scalability: Start with a foundational set of DGX B200 nodes (e.g., 4–8 per rack), with the ability to add nodes or racks easily as AI demands grow (a capacity-sizing sketch follows this section).

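To make the compute-layer sizing concrete, here is a minimal capacity sketch in Python. The per-node figures (8 GPUs, 180 GB of HBM per GPU, 2 TB of system RAM) are illustrative assumptions drawn from public Blackwell material and should be replaced with values from the final NVIDIA DGX B200 datasheet.

```python
from dataclasses import dataclass

@dataclass
class NodeSpec:
    """Illustrative DGX B200 per-node figures; replace with final datasheet values."""
    gpus: int = 8            # Blackwell GPUs per node
    gpu_mem_gb: int = 180    # HBM per GPU (assumed)
    sys_mem_tb: float = 2.0  # system RAM per node (assumed minimum)

def cluster_summary(nodes_per_rack: int, racks: int, spec: NodeSpec = NodeSpec()) -> dict:
    """Aggregate node, GPU, and memory totals for an initial deployment."""
    nodes = nodes_per_rack * racks
    return {
        "nodes": nodes,
        "gpus": nodes * spec.gpus,
        "gpu_mem_tb": round(nodes * spec.gpus * spec.gpu_mem_gb / 1024, 1),
        "sys_mem_tb": nodes * spec.sys_mem_tb,
    }

if __name__ == "__main__":
    # Phase 1 foundation: 2 racks with 4 DGX B200 nodes each
    print(cluster_summary(nodes_per_rack=4, racks=2))
```
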
  2. Network & Interconnect

    • Network Architecture: Spine-leaf topology for high bandwidth and low-latency communication.

    • Interconnect Speeds:

      • Use high-speed InfiniBand (HDR 200 Gb/s or NDR 400 Gb/s) or 400 Gb Ethernet, as recommended for Blackwell-based systems.

      • Each DGX B200 node should have multiple high-speed network interfaces (the NVIDIA reference design pairs each GPU with a 400 Gb/s ConnectX-7 port for the compute fabric) to ensure sufficient throughput for distributed training (a leaf-spine oversubscription sketch follows this section).

    • Top-of-Rack (ToR) Switches:

      • Redundant 1U or 2U high-throughput switches that aggregate node traffic (InfiniBand or Ethernet).

      • Connectivity from each ToR switch up to spine switches at 400 Gb/s, ensuring no bandwidth bottlenecks.

    • Management Network:

      • Separate 1/10 Gb Ethernet management layer to handle node provisioning, monitoring, and out-of-band management.

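A quick sanity check for the spine-leaf fabric is the leaf oversubscription ratio: node-facing bandwidth divided by uplink bandwidth. The sketch below assumes eight 400 Gb/s fabric ports per DGX B200 and 400 Gb/s uplinks; both port counts are assumptions to be replaced with the final design numbers.

```python
def leaf_oversubscription(nodes_per_rack: int,
                          ports_per_node: int = 8,   # assumed 400 Gb/s fabric ports per node
                          port_gbps: int = 400,
                          uplinks: int = 16,
                          uplink_gbps: int = 400) -> float:
    """Return the leaf downlink-to-uplink ratio (1.0 means non-blocking)."""
    downlink_gbps = nodes_per_rack * ports_per_node * port_gbps
    uplink_total_gbps = uplinks * uplink_gbps
    return downlink_gbps / uplink_total_gbps

if __name__ == "__main__":
    # 4 nodes per rack: 32 uplinks gives a non-blocking 1:1 fabric, 16 uplinks gives 2:1
    print(leaf_oversubscription(nodes_per_rack=4, uplinks=32))
    print(leaf_oversubscription(nodes_per_rack=4, uplinks=16))
```
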
  3. Storage & Data Management

    • High-Performance Primary Storage:

      • NVMe-based parallel file system (e.g., WekaIO, Lustre, IBM Spectrum Scale, or BeeGFS) offering multi-terabyte-per-second aggregate read/write throughput to feed data-hungry training jobs (a throughput sizing sketch follows this section).

      • Scalable capacity to accommodate tens of petabytes of data as AI workloads expand.

    • Secondary / Archive Storage:

      • Larger-capacity object storage or scale-out NAS for long-term data retention and model archives.

      • Data lifecycle management to move older or less frequently used datasets off the performance tier.

    • Disaster Recovery (DR) / Backup Strategy:

      • Integration with DR site or public cloud solutions for replication and/or versioning of critical data.

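One way to turn the storage requirement into a number is to estimate the sustained read bandwidth needed to keep the GPUs fed. The sketch below treats GB/s of input per GPU and the headroom factor as workload-dependent assumptions; real targets should come from profiling representative training jobs.

```python
def required_read_bandwidth_gbs(total_gpus: int,
                                gbs_per_gpu: float = 0.5,  # assumed input rate per GPU
                                headroom: float = 1.5) -> float:
    """Estimate aggregate read bandwidth (GB/s) the parallel file system must sustain.

    The headroom factor covers checkpoint writes and bursty access patterns.
    """
    return total_gpus * gbs_per_gpu * headroom

if __name__ == "__main__":
    # Initial deployment of 8 nodes x 8 GPUs = 64 GPUs
    print(f"{required_read_bandwidth_gbs(64):.0f} GB/s sustained read")
```
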
  4. Rack Layout & Density

    • Rack Size: 42U or 48U standard racks, with custom rack heights if required by specific data center constraints.

    • Nodes Per Rack: Each rack can host up to 4–8 NVIDIA DGX B200 nodes (depending on node form factor, power, and cooling).

    • Networking Gear: Leaf switches placed at top or middle of rack.

    • Power Distribution Units (PDUs): High-capacity PDUs designed for GPU-dense racks, supporting single-phase or three-phase power as required.

  5. Power & Cooling

    • Power Requirements per Rack:

      • Each DGX B200 system can draw on the order of 14 kW at full load per NVIDIA’s published specifications; actual figures should be confirmed against the final NVIDIA B200 datasheet.

      • Plan for roughly 40–60 kW (or more) per rack depending on node count, networking gear, and overhead (see the power-budget sketch after this section).

      • N+1 or 2N redundancy in power feeds, ensuring high availability (HA).

    • Cooling Specifications:

      • Hot-aisle/cold-aisle containment or liquid cooling (depending on final vendor guidance for Blackwell-based systems).

      • In-row cooling or rear-door heat exchangers if the data center has high compute density demands.

    • Environmental Monitoring:

      • Sensors for temperature, humidity, and airflow to ensure compliance with ASHRAE standards for HPC/AI data centers.

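The rack power budget reduces to simple arithmetic once the per-node draw is known. In the sketch below, the node figure reflects NVIDIA’s published maximum for the DGX B200 (roughly 14 kW); the switch and overhead numbers are placeholders to be replaced with actual equipment specifications.

```python
def rack_power_kw(nodes: int,
                  node_kw: float = 14.3,      # NVIDIA-published DGX B200 maximum draw
                  switch_kw: float = 2.0,     # assumed ToR/leaf switch allowance
                  overhead_frac: float = 0.10 # assumed PDU, fan, and conversion overhead
                  ) -> float:
    """Estimate worst-case rack power in kW for budgeting power feeds and cooling."""
    return (nodes * node_kw + switch_kw) * (1 + overhead_frac)

if __name__ == "__main__":
    for n in (2, 4):
        print(f"{n} DGX B200 nodes per rack -> ~{rack_power_kw(n):.0f} kW")
```
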
  6. Software Stack & Management

    • AI Frameworks & Libraries:

      • Pre-install frameworks such as PyTorch, TensorFlow, RAPIDS, and NVIDIA’s CUDA toolkit (a brief node sanity-check sketch follows this section).

      • Containerized environment (e.g., Docker, Singularity, or Kubernetes) to streamline deployment and orchestration.

    • Cluster Management:

      • Slurm or Kubernetes for job scheduling, cluster monitoring, and container orchestration.

      • NVIDIA DGX system management tools (e.g., Base Command Manager and the DGX software stack) for GPU health checks, driver updates, and cluster operations.

    • Security & Access Controls:

      • Integration with LDAP/AD for user authentication and role-based access.

      • Secure management VLANs and out-of-band management via the node BMC (IPMI or Redfish).

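As a small illustration of exercising the pre-installed framework stack, the sketch below reports what a node sees after provisioning (PyTorch version, CUDA version, visible GPUs). It assumes PyTorch is installed, for example via an NGC container, and that all eight GPUs should be visible.

```python
import torch

def node_sanity_report() -> dict:
    """Confirm the CUDA/PyTorch stack is healthy on a freshly provisioned node."""
    assert torch.cuda.is_available(), "CUDA not visible: check drivers or container runtime"
    count = torch.cuda.device_count()
    return {
        "torch_version": torch.__version__,
        "cuda_version": torch.version.cuda,
        "visible_gpus": count,  # expected: 8 on a DGX B200
        "gpu_models": sorted({torch.cuda.get_device_name(i) for i in range(count)}),
    }

if __name__ == "__main__":
    print(node_sanity_report())
```
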
  7. High Availability & Future Resilience

    • HA / Redundancy:

      • Redundant power supplies and redundant management/network paths for each DGX B200 and for all critical equipment (switches, servers, storage).

      • Geographic redundancy or multi-site design, if feasible, for large-scale enterprise deployments.

    • Modular Expansion:

      • Ability to add additional racks with matching DGX B200 servers, network spines, and storage expansions.

      • Allocate additional overhead in power and cooling capacity to support incremental expansion as GPU technology evolves.

    • Technology Refresh Planning:

      • Rack design that supports future generations of NVIDIA GPUs or CPU architectures, especially as next-generation PCIe, NVLink, or CPU memory channels evolve.

      • Software stack based on containerization to abstract away hardware details and enable easier updates.

TECHNICAL & BUSINESS REQUIREMENTS


  1. Performance Requirements

    • Minimum 2 petaFLOPS of AI compute per rack (FP16/BF16) to meet initial HPC/AI training demands.

    • Low-latency networking (<2 μs at scale with InfiniBand or 400 Gb Ethernet) to support distributed training efficiency.

  2. Reliability & Uptime

    • Target SLA of 99.9% or higher.

    • Comprehensive monitoring solution (Prometheus, Grafana, or vendor-supplied) that alerts on GPU/CPU performance, temperature, network usage, and storage capacity (a GPU telemetry sketch follows this section).

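For the monitoring requirement, GPU health can be polled through NVML (the library behind nvidia-smi) and exported to Prometheus, Grafana, or a vendor tool. NVIDIA’s DCGM exporter covers this out of the box in production; the sketch below is only a minimal illustration and assumes the nvidia-ml-py (pynvml) package is installed.

```python
import pynvml  # pip install nvidia-ml-py

def gpu_health_snapshot() -> list:
    """Collect temperature, power, and memory usage for every GPU on this node."""
    pynvml.nvmlInit()
    try:
        rows = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            rows.append({
                "gpu": i,
                "temp_c": pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU),
                "power_w": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,  # NVML reports milliwatts
                "mem_used_gib": round(mem.used / 2**30, 1),
            })
        return rows
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    for row in gpu_health_snapshot():
        print(row)
```
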
  3. Manageability

    • Automated provisioning for DGX nodes, including GPU firmware updates.

    • Integration with existing data center orchestration tools where applicable.

  4. Support & Warranty

    • Vendor support for NVIDIA DGX B200 hardware and replacement SLAs.

    • Access to advanced RMA services for GPU modules or critical networking components.

  5. Cost & Timeline

    • Provide a breakdown of estimated CAPEX (hardware, networking, storage, installation) and OPEX (power consumption, maintenance).

    • Timeline for delivery, installation, acceptance testing, and training for in-house IT staff.


IMPLEMENTATION & EVALUATION


  1. Deployment Phases

    • Phase 1: Delivery of initial rack(s) with 4–8 DGX B200 nodes, storage system, and network fabric.

    • Phase 2: Integration with enterprise environment, user acceptance testing, final performance verification.

    • Phase 3: Operational readiness for AI/ML workloads.

    • Phase 4: Expansion or additional racks based on capacity triggers.

  2. Acceptance Criteria

    • All DGX B200 nodes pass burn-in testing with stable GPU performance.

    • Network throughput and latency validated through standard benchmarks (e.g., NCCL all-reduce tests, HPCG, and ResNet-50 training throughput; a minimal all-reduce timing sketch follows this section).

    • Storage performance verified (bandwidth, IOPS) with representative AI dataset usage.

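For the network acceptance criterion, a minimal fabric check is a timed all-reduce across all GPUs, in the spirit of the standard nccl-tests; formal acceptance would normally use the vendor-supplied benchmark binaries. The sketch below assumes launch via torchrun (or a Slurm wrapper) across the nodes under test.

```python
import os
import time
import torch
import torch.distributed as dist

def allreduce_busbw_gbs(size_mb: int = 1024, iters: int = 20) -> float:
    """Time a large float16 all-reduce and return the approximate bus bandwidth in GB/s."""
    dist.init_process_group(backend="nccl")          # NCCL runs over InfiniBand / RoCE
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    x = torch.ones(size_mb * 1024 * 1024 // 2, dtype=torch.float16, device="cuda")
    for _ in range(3):                               # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    n = dist.get_world_size()
    bus_bytes = x.numel() * x.element_size() * 2 * (n - 1) / n  # ring all-reduce traffic per rank
    return bus_bytes / elapsed / 1e9

if __name__ == "__main__":
    # Example launch: torchrun --nnodes=2 --nproc_per_node=8 fabric_check.py
    bw = allreduce_busbw_gbs()
    if dist.get_rank() == 0:
        print(f"approx. all-reduce bus bandwidth: {bw:.1f} GB/s")
    dist.destroy_process_group()
```
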
  3. Future-Resilience Testing

    • Stress tests for scale-out expansion: verifying new DGX B200 nodes can be added seamlessly.

    • Confirmation that the rack design accommodates next-generation CPU/GPU add-ons (e.g., new NVIDIA GPU modules) with minimal changes to power, cooling, or network.


VENDOR REQUIREMENTS & RESPONSE FORMAT


  1. Vendor Experience

    • Demonstrated expertise deploying NVIDIA DGX reference architectures or HPC/AI data centers of similar scale.

    • Capability to provide on-site installation, support, and training.

  2. Bill of Materials (BoM)

    • Itemized list of all required hardware: DGX B200 nodes, networking components, storage, racks, PDUs.

    • Licensing costs for software, if any (e.g., HPC schedulers, container management systems).

  3. Delivery & Support Plan

    • Detailed timeline from purchase order (PO) to go-live.

    • SLAs for ongoing support, including hardware warranties and software updates.

  4. Pricing

    • Break down total cost by major categories: compute, network, storage, services, and support.

    • Provide multi-year TCO estimates that include power consumption and data center overhead (a simple TCO arithmetic sketch follows this section).

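To make the multi-year TCO requirement concrete, a rough model combines CAPEX, facility-adjusted energy cost, and annual support. Every input below (hardware cost, average IT load, PUE, tariff, support rate) is a placeholder to be filled from vendor quotes and the site’s actual figures.

```python
def multi_year_tco_usd(capex_usd: float,
                       avg_it_load_kw: float,
                       years: int = 3,
                       pue: float = 1.3,                  # assumed facility efficiency
                       usd_per_kwh: float = 0.10,         # assumed energy tariff
                       annual_support_frac: float = 0.08  # assumed support/maintenance rate
                       ) -> float:
    """Rough TCO: hardware + facility-adjusted energy + support over the given horizon."""
    energy_usd = avg_it_load_kw * pue * 24 * 365 * years * usd_per_kwh
    support_usd = capex_usd * annual_support_frac * years
    return capex_usd + energy_usd + support_usd

if __name__ == "__main__":
    # Example: $6M of hardware, 120 kW average IT load, 3-year horizon
    print(f"${multi_year_tco_usd(6_000_000, 120):,.0f}")
```
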
  5. Evaluation Criteria

    • Technical compliance with AI HPC needs.

    • Financial viability and TCO.

    • Vendor track record and service level guarantees.


CONCLUSION


By specifying NVIDIA DGX B200 nodes (Blackwell-based architecture) and a high-speed network fabric (InfiniBand or 400 Gb Ethernet), this design ensures that the AI data center will deliver exceptional performance for training and inference, while remaining flexible enough to adopt next-generation GPUs, advanced CPUs, and evolving storage technologies. Properly selected storage tiers will avoid bottlenecks for large datasets and accelerate both training and inference workflows.


With modular power and cooling scalability, plus robust management and monitoring, this proposal delivers a truly future-resilient foundation for AI workloads. Vendors should respond to each requirement in this RFP with detailed technical and financial proposals, ensuring that the recommended solution meets or exceeds all performance and reliability targets set forth.
