The engineer in charge of keeping Google available 24/7 predicts the enormous datacenters that underpin online services worldwide will soon be run with the help of AI.
Ben Treynor Sloss, Google's VP of engineering, is basing this on the profound savings identified by a machine-learning system charged with helping run a Google datacenter in 2016.
The Google DeepMind system significantly improved the power efficiency of the Google datacenter, via tweaks to how servers were run and the operation of power and cooling equipment. Following the system's advice allowed Google to reduce the energy needed to cool servers by about 40 percent. If that reduction were replicated across Google's datacenters worldwide, it could add up to a saving of tens of millions of dollars each year.
"What we developed was so compelling that our challenge has been much more one of engineering 'How quickly can we get it rolled out everywhere?'," says Sloss.
"If you can save that amount of power, what you want is to grab that gain, and we'll continue to train the model, and continue to probably put more systems under its care because the initial results were just so profound."
Sloss says that it won't just be Google that is clamouring to put its datacenters under the stewardship of AI. The results achieved by self-learning systems are such an unambiguous improvement over manual decision-making, he says, that using machine-learning systems will fast become essential when running large datacenters.
"That will simply be the way that process is done not too many years from now. It's so much better than the current state of the art," he said.
"Training a network to do essentially analog process controls, my guess is that will become pervasive."
Perhaps more surprising is that the DeepMind system achieved these results by turning conventional logic on its head. While the traditional approach to minimizing power consumption was to run as few cooling systems as possible, the AI instead recommended running all the systems at lower power levels.
Google first revealed its attempts to apply AI to running its datacenters back in 2014, when it said it had used a neural network to pick out patterns of power usage and identify opportunities to cut consumption.
Speaking last year, DeepMind co-founder Demis Hassabis said Google had stepped up its use of AI since then, using a DeepMind AI that modelled the running of a datacenter and adjusted 120 variables related to its operation to achieve the highest level of energy efficiency. When the recommendations from that model were applied, the datacenter improved its Power Usage Effectiveness (PUE) by 15 percent. PUE is a measure that reflects how much of the electricity used by a facility ends up powering the servers, rather than driving associated infrastructure handling cooling and power distribution.
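For readers who want the arithmetic, PUE is conventionally defined as the total energy drawn by a facility divided by the energy delivered to its IT equipment, so a lower figure is better and 1.0 is the theoretical ideal. The short Python sketch below shows that calculation alongside the inverse view, the share of electricity that reaches the servers. The function names and the meter readings are made up for illustration and are not Google's figures.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def it_share(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Fraction of facility energy that reaches IT equipment (the inverse of PUE)."""
    if total_facility_kwh <= 0:
        raise ValueError("Total facility energy must be positive")
    return it_equipment_kwh / total_facility_kwh


# Hypothetical readings: 1,200 kWh drawn by the site, 1,000 kWh used by servers.
example_pue = pue(1200.0, 1000.0)
example_share = it_share(1200.0, 1000.0)
print(f"PUE: {example_pue:.2f}, share reaching IT equipment: {example_share:.1%}")
```

In this made-up case, 200 kWh of every 1,200 kWh goes to cooling and power distribution rather than to computing.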
Andy Lawrence, VP of research for datacenters and critical infrastructure at 451 Research, agrees that using AI to help run datacenters, as Google has done in its experiment, will eventually become mainstream.
"Google's use of DeepMind to reduce the PUE of its datacenter is an interesting application of AI/machine learning, and clearly points to what will eventually be achievable," he says.
"The long term trend is towards automatic or autonomic management of datacenters using software tools."
However, he says that Google datacenters are already so efficient that the gains represented only "a datacenter power efficiency improvement from around 86% to 88%".
"Even so, at Google global scale, that would represent a very significant saving: Google uses over 5m MWh of electricity a year," he said, adding the approach could make sense for the largest tech firms, but would require large-scale investment.
"One of the challenges, even for Google, is that a lot of sensors are required, and these can be expensive to install at scale."
Services are already springing up to support AI-driven management of datacenters, with US company Vigilent applying a learning-based algorithmic approach to optimize cooling for customers on several continents. In the longer term, Lawrence expects to see "AI-based efficiency services to be delivered as a service to datacenters".
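Lawrence's efficiency figures can be turned into a rough energy saving with some back-of-envelope arithmetic. The Python sketch below rests on three assumptions that do not come from Google or Lawrence: that the 86% and 88% figures are the share of total electricity reaching IT equipment, that 5m MWh is total annual consumption before the improvement, and that the IT load itself stays the same. The function name is hypothetical.

```python
def annual_saving_mwh(total_before_mwh: float,
                      efficiency_before: float,
                      efficiency_after: float) -> float:
    """Estimate energy saved when the IT share of total power improves.

    Assumes the IT load is unchanged, so only overhead (cooling,
    power distribution) shrinks.
    """
    it_load_mwh = total_before_mwh * efficiency_before   # energy the servers use
    total_after_mwh = it_load_mwh / efficiency_after     # new total for the same IT load
    return total_before_mwh - total_after_mwh


# Assumed inputs: 5m MWh a year, efficiency rising from 86% to 88%.
saving = annual_saving_mwh(5_000_000, 0.86, 0.88)
print(f"Estimated saving: {saving:,.0f} MWh a year")
```

Under those assumptions the saving comes to a little over 110,000 MWh a year, a small fraction of the total, yet still a large amount of electricity in absolute terms.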
'I'm agog at what we've been able to do'
Perhaps the most famous demonstration of the efficacy of DeepMind's machine-learning systems was the recent triumph of the DeepMind AlphaGo AI over a human grandmaster in Go, an ancient Chinese game whose complexity stumped computers for decades. Go has about 200 possible moves per turn, compared to about 20 in chess. Over the course of a game of Go there are so many possible moves that searching through each of them in advance to identify the best play is too costly from a computational point of view. Instead, AlphaGo was trained to play the game by taking moves played by human experts in 30 million Go games and feeding them into deep learning neural networks.
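A little arithmetic shows why exhaustive search breaks down. The Python sketch below takes the rough figures of 200 moves per turn for Go and 20 for chess and counts how many move sequences a brute-force search would face when looking a given number of turns ahead. It assumes the number of available moves stays constant from turn to turn, which is a simplification, and the lookahead depths are arbitrary choices for illustration.

```python
def sequences(branching_factor: int, depth: int) -> int:
    """Number of distinct move sequences when looking `depth` turns ahead,
    assuming the same number of moves is available on every turn."""
    return branching_factor ** depth


GO_MOVES_PER_TURN = 200     # rough figure quoted for Go
CHESS_MOVES_PER_TURN = 20   # rough figure quoted for chess

for depth in (2, 4, 8):
    go = sequences(GO_MOVES_PER_TURN, depth)
    chess = sequences(CHESS_MOVES_PER_TURN, depth)
    print(f"{depth} turns ahead: Go {go:.2e}, chess {chess:.2e}, "
          f"Go/chess ratio {go // chess:,}")
```

Because Go offers roughly ten times as many moves each turn, every extra turn of lookahead widens the gap with chess by another factor of ten.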
Training these deep learning networks can take a very long time, requiring vast amounts of data to be ingested and iterated over as the system gradually refines its model in order to achieve the best outcome.
To streamline that training process Google built its own specialized chips, known as Tensor Processing Units (TPUs), which accelerate the rate at which useful machine-learning models can be built using Google's TensorFlow software library. These chips are not just used to train models for DeepMind and Google Brain, but also the models that underpin Google Translate and the image recognition in Google Photos, as well as services that allow the public to build machine-learning models using Google's TensorFlow Research Cloud. The second generation of these chips was unveiled at Google's I/O conference in May this year, with an array of these new TPUs able to train a Google machine-learning model used for translation in half the time it would take an array of the top-end graphics processing units (GPUs).
"TPUs offer a huge performance advantage over currently available technology," says Sloss.
"Everybody who's working hard on ML [machine learning] at this point is chasing after performance. It gives you a large competitive advantage, because you can get to the point where you have modelled something useful in a fraction of the time that it would otherwise take."
While not making a firm commitment with regard to future rollouts of TPUs within Google's datacenters, he says: "I suspect that we will continue to make TPUs more widely available".
Even as an insider at Google, Sloss admits to being surprised at the rate at which machine-learning capabilities are advancing on the back of processors capable of manipulating huge amounts of data in parallel and the availability of enormous training datasets.
"I'm still fairly agog at what we've been able to do collectively with machine learning over the last few years," he said.
"I'm an expert chess player and if you had told me three years ago that the Go champion of the world in 2017 would be a computer, I would have chuckled politely, and yet here we are.
"I'm very interested to see what ML is able to do for the world over the next five years."