Traffic jam image

What image does the word “sprawl” conjure up in your mind? Probably not a good one. If you think of urban sprawl, you think of a large American city with far-flung, low-density suburbs connected by traffic-choked freeways. You certainly know that sprawl leads to economic inefficiency, wasted hours, and tons of pollution. Indeed, urban sprawl handicaps our aspiration for a sustainable future by unnecessarily increasing CO2 emissions.

Another kind of sprawl that unnecessarily increases CO2 emissions is data center sprawl, in which IT equipment is irregularly distributed across an increasing number of racks in a data center. What drives this sprawl? Issues of thermal management and limitations in power availability.

Most data centers today are designed, from a power and cooling perspective, to accommodate racks that draw around 7-10 kW. For some time this has been fine, as rack power densities have stayed below that figure. What is interesting, though, is how fast rack power densities have been increasing: they have quadrupled over the last decade as server CPUs have become hotter (roughly doubling in thermal power from 100 W to 200 W in this time frame).

However, CPU thermal power is now increasing even more rapidly. With AMD's Genoa series (released in late 2022) and Intel's Sapphire Rapids (released in early 2023), the top CPUs now exceed 300 W each. According to AMD, the rate of increase will accelerate over the next few years.

The direct result of this is that rack thermal density will increase significantly, with numbers like 50 kW or even 100 kW being considered for future racks.

The problem is that data centers are not designed for racks with such high thermal densities. Most current data centers use chilled air, which has limits in its cooling capability at the individual rack and server level.

To compensate for this cooling limitation, IT staff are reducing the server density within each rack to fit within the thermal cooling envelope of the rack (typically around 10-15 kW for many data centers). In the past, you could expect a fully loaded 42U rack to hold on the order of 18-21 2U servers. But now, IT staff have to reduce the number of servers in a given rack by 30-50% and spread them to other racks.

And this is how you end up with sprawl. 
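
To make the arithmetic concrete, here is a minimal sketch of how a rack's thermal envelope, rather than its physical space, ends up setting the server count. The function names, the 210-server fleet, and the 1 kW per-server draw are illustrative assumptions, not figures from any particular data center; the 12 kW and 100 kW envelopes are simply picked from the ranges mentioned above.

```python
import math

def servers_per_rack(rack_units, server_units, rack_limit_w, server_draw_w):
    """Servers that fit in one rack: the smaller of the space limit and the power/cooling limit."""
    by_space = rack_units // server_units        # how many physically fit
    by_power = rack_limit_w // server_draw_w     # how many the envelope can cool
    return min(by_space, by_power)

def racks_needed(total_servers, per_rack):
    """Racks required to house the fleet at a given per-rack server count."""
    return math.ceil(total_servers / per_rack)

# Hypothetical inputs (assumptions for illustration only).
FLEET = 210            # total 2U servers to deploy
SERVER_DRAW_W = 1000   # assumed draw of one dual-CPU 2U server

air_cooled = servers_per_rack(42, 2, 12_000, SERVER_DRAW_W)      # 12 kW envelope
liquid_cooled = servers_per_rack(42, 2, 100_000, SERVER_DRAW_W)  # 100 kW envelope

print("air-cooled:   ", air_cooled, "servers/rack,", racks_needed(FLEET, air_cooled), "racks")
print("liquid-cooled:", liquid_cooled, "servers/rack,", racks_needed(FLEET, liquid_cooled), "racks")
```

Under these assumed numbers the air-cooled rack is capped at 12 servers instead of 21, which is a reduction of roughly 43% and sits inside the 30-50% range described above. The same fleet then needs 18 racks instead of 10.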

The consequences of this sprawl, of having your servers spread across a large number of half-full or quarter-full racks, are:

  • More floor space used, which results in poor utilization, or, worst case, running out of expansion space within the data center and thus needing to build a new one. 
  • Reduced compute density means lower revenue and profit opportunity per square foot of data center space.
  • If you are renting space in a colocation (colo) facility, then using more racks directly results in higher costs, as billing is a function of floor space used.
  • Also in a Colo, power costs go up with the number of racks. You get charged per circuit regardless of what fraction of a circuit’s power you actually use, and the number of circuits is proportional to the number of racks. So, more racks = more power cost even if each rack is only partly filled.
  • As a workload gets physically spread out among racks, application performance suffers from added latency.
  • More racks = more complexity to manage and troubleshoot; more cabling and switching required.
  • More racks = more charges for networking and cross-connects in a colo environment.
  • More racks result in a higher CO2 footprint due to inefficiencies in power and cooling.
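
The colo billing points above can be sketched as a simple cost model in which both space and power are charged per rack, regardless of how full each rack is. Everything here is a hypothetical illustration: the function name, the fee amounts, the two circuits per rack, and the rack counts are assumptions, and real colo contracts vary widely.

```python
def monthly_colo_cost(racks, circuits_per_rack, fee_per_circuit, fee_per_rack_space):
    """Monthly bill when space and power are charged per rack and per circuit, not per kW used."""
    space = racks * fee_per_rack_space
    power = racks * circuits_per_rack * fee_per_circuit
    return space + power

# Hypothetical fees and rack counts (assumptions for illustration only).
CIRCUITS_PER_RACK = 2    # e.g. redundant A/B feeds
FEE_PER_CIRCUIT = 500    # flat monthly charge per circuit
FEE_PER_RACK = 1000      # monthly charge for one rack position

sprawled = monthly_colo_cost(18, CIRCUITS_PER_RACK, FEE_PER_CIRCUIT, FEE_PER_RACK)
consolidated = monthly_colo_cost(10, CIRCUITS_PER_RACK, FEE_PER_CIRCUIT, FEE_PER_RACK)

print("sprawled:    ", sprawled)
print("consolidated:", consolidated)
print("extra cost of sprawl:", sprawled - consolidated)
```

Note that utilization never appears in the formula, which is the point: a half-full rack is billed like a full one. The sketch deliberately ignores that a denser rack may need larger or more expensive circuits, so treat the gap it shows as an upper bound rather than a forecast.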

A solution to this sprawl is to use liquid cooling within the racks, which has been shown to dramatically reduce the energy needed for cooling in a data center (https://datacenters.lbl.gov/liquid-cooling) from an average of 57% down to as little as 15%. 

Direct-to-chip liquid cooling (https://ieeexplore.ieee.org/document/7992537) can enable much higher rack power densities, easily up to 100 kW per rack, without increasing the energy needed in a data center for cooling. This enables you to fully populate the rack even with high-performance CPU or GPU servers. Thus, instead of having to use a large number of partially populated racks, you can consolidate to a small number of fully populated racks and get the benefits of liquid cooling, including higher performance and productivity, lower operating costs, and better sustainability metrics.

That’s how you say no to sprawl.