A new open source framework that aims to make building machine learning pipelines easier for data scientists was released today by QuantumBlack, the data analytics outfit that has its roots in data work for Formula 1 racing teams and was snapped up by McKinsey in 2015.
The firm hopes that the fully open source development workflow, called Kedro, will become an industry standard for production-ready code in machine learning and data science.
Speaking with Computerworld UK, Michele Battelli, global head of engineering and product at QuantumBlack, explains: "Kedro is a library of code that can be used to create data and machine learning pipelines - basically the building blocks of what we do in an analytics or machine learning project.
"It changes the way data scientists and engineers collaborate and work together with large workflows and data sets, so that the output of their work is something that will be production-ready. In essence, it allows teams to collaborate more easily because they rely on a set of rules and a structure in their code that is uniform throughout the project."
The framework was developed in-house as a "pet project" by engineers Aris Valtazanos and Nikolaos Tsaousis several years ago, along with the former product manager at the firm. It came about because the engineers were trying to manage multiple work streams at the same time, and a prototype Python package was born.
The tool was soon picked up by other teams, and came to be used in much of the firm's work with clients. Staff were finding it useful internally, and the hope is that open sourcing the framework will also provide clarity to clients who want to better understand, or build on, their work together.
QuantumBlack emerged roughly 10 years ago, in the Formula 1 space, where the company's founders were using a large variety of data to improve performance with their clients at scale. They built solutions for the clients but then transferred these over so that they could use the analytics tools independently.
Yetunde Dada, senior consultant product manager for Kedro at QuantumBlack, explained that the firm decided to open source the code as a way of giving back to its clients, several of whom had requested access to the software after working on projects with the firm.
Kedro has so far been used internally for more than 50 projects throughout McKinsey and QuantumBlack. Both Dada and Battelli point to the fact that it has been built in a way that is intended to be technology agnostic across the industry, as well as easily extendable, so if there are issues around interoperability, in theory it shouldn't be too gruelling a task to build those bridges.
The library includes easy-to-use project templates, as well as data abstraction and code management capabilities that address reproducibility across environments, configuration management, and modularity, so that large chunks of code can be broken down into smaller self-contained units. The team also promises that it is easy for non-experts to use, and that it promotes a culture of test-driven development within organisations.
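To give a flavour of what breaking a pipeline into smaller self-contained units can look like, here is a minimal sketch in plain Python. It does not use Kedro's actual API: the `Node` class, the `run_pipeline` function and the dataset names are all hypothetical, chosen only to illustrate the idea of small functions wired together by named inputs and outputs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Node:
    """Hypothetical self-contained unit: one function plus named inputs and one output."""
    func: Callable[..., Any]
    inputs: List[str]
    output: str


def run_pipeline(nodes: List[Node], catalog: Dict[str, Any]) -> Dict[str, Any]:
    """Run nodes in the order given, reading and writing named datasets."""
    data = dict(catalog)  # copy so the caller's catalog is not mutated
    for n in nodes:
        args = [data[name] for name in n.inputs]
        data[n.output] = n.func(*args)
    return data


def drop_missing(rows):
    """Remove missing (None) readings."""
    return [r for r in rows if r is not None]


def mean(rows):
    """Average of a non-empty list of numbers."""
    return sum(rows) / len(rows)


# Illustrative data only: three lap times and one missing reading.
pipeline = [
    Node(drop_missing, ["raw_laps"], "clean_laps"),
    Node(mean, ["clean_laps"], "average_lap"),
]

result = run_pipeline(pipeline, {"raw_laps": [90.0, None, 92.0, 91.0]})
print(result["average_lap"])
```

Because each step is an ordinary function with explicit inputs and outputs, it can be unit tested on its own, which is the kind of test-driven habit the team says it wants to encourage.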
QuantumBlack cofounder Simon Williams told the Financial Times earlier this year that the outfit would remain independent even after the McKinsey buy-out. It now employs 300 people globally and, since the acquisition, has opened offices in Montreal, Sao Paulo, Chicago, Boston and Sydney.
Battelli says there were no qualms at the consulting firm over the decision to open source a project for the first time, and that there had been buy-in from "senior stakeholders across the firm [regarding] anything as our path forward when it comes to open sourcing new technologies."
"The partnership offers a very unique proposition because it allows clients to leverage the best industry expertise from McKinsey, and with change management techniques, and the analytics horsepower that QuantumBlack offers. Kedro is underpinning the technology aspect of this partnership," he added.
Next steps are, naturally, raising awareness through initiatives such as conference engagement and meetups, where the firm can talk more about the project.
"But a lot of it is actually focused on how we actually drive usage of the library," says Dada. "Some of the usage tactics include doing workshops and webinars to actually get people using the library, externally as well.
"The community-building aspects of Kedro is actually primarily driven by those funnels. Awareness first, and then you search and we'll be able to look at contribution further down the line."
The library and documentation for Kedro are available on GitHub.