Though everyone seems to be piling onto the machine learning bandwagon, it's a game that only the rich can play, as I've written. While open source machine learning projects like Google's TensorFlow and Amazon's DSSTNE lower the bar for would-be machine learning engineers, helping to address the skills deficit that Gartner analyst Merv Adrian called the biggest hurdle to machine learning success, no amount of training can resolve a thornier issue: lack of data.
Yandex, the Google of Russia, has plenty of data, coupled with experience wrangling it to machine learning success. It's therefore fascinating to hear Yandex COO Alexander Khaytin talk through the best ways to bridge the data divide that keeps the vast majority of enterprises from achieving machine learning success.
But first, you're going to need data. Lots of data.
Teaching your data to fish
Data, of course, is needed to train machine learning algorithms. Many companies simply don't have the data assets necessary for such training. However, according to Khaytin, for the kinds of companies that undertake serious machine learning projects, volume of data isn't the issue—getting it into one place is:
"While most companies undertaking machine learning projects inevitably own and store vast quantities of data, this data is not always ready to use. With data often siloed in separate storage and processing systems, the aggregation of data can be time-consuming and difficult. Additionally, when extracting data, companies must take data security into consideration with almost all data being 'poisoned' by personal or sensitive kind of data."
Compounding the problem, many organizations lack the willingness to experiment, a key component of machine learning, and are especially reluctant to do so on live production systems. As Khaytin stated, "[W]hen it comes to prescriptive analytics, the measure of business impact can only truly be assessed by actually applying a machine learning model in the real business process. For most companies, often at the start of their digital transformation, the prospect of launching large scale machine learning projects which haven't already demonstrated their value in previous trials can be daunting."
Kissing cousin to this willingness to experiment, Khaytin concludes, is business agility. "There are no beaten paths with machine learning yet: The technology is new, the success is not guaranteed, and the experimentation is crucial. By ensuring agile and flexible business processes, companies will spend less time, effort, and money on unsuccessful projects."
All of which is easier said than done. How can enterprises overcome data silos and embrace a culture of experimentation and agility?
Open source can help
While not a panacea, open source offers a way for organizations to experiment without locking themselves into expensive software or infrastructure that inhibits agility. Though open source won't aid in eradicating data silos, it lowers the bar to trial-and-error.
As Michael St. James wrote to me of his machine learning work in the music industry, "In my world, open source makes it easier to try to invent/deploy ML stuff that may not be monetized." MuckRock founder Michael Morisy agreed, telling me that open source machine learning projects like TensorFlow "make it easy to experiment and in some domains [enable you to] get meaningful results without [a] ton of expertise."
Because the only cost to getting started is one's time (and renting infrastructure), open source makes it easier to learn to scale machine learning projects, starting with exceptional, trusted code from Google, Facebook, and more. Over time, such open source tinkering can bleed into the larger organization, fostering the curiosity and agility that Khaytin insists are critical to machine learning success.