Anyone who’s ever been involved with project management will understand how quickly a project can turn into a web of information, with data on schedules, finances, locations, assets and resources (to name a few) all rapidly expanding. Further complexity can arise from the dependencies between these projects – a single project may impact the delivery of many others in different locations or programmes across the organisation.
This was a challenge that we faced with one client: an infrastructure provider that runs thousands of projects across the UK every year, ranging in value from tens of thousands to hundreds of millions. It wanted to make sense of its past projects, but collating projects from just the last ten years produced a mass of data that was difficult both to comprehend and to use properly.
To give an indication of the scale: once a subset of key tables is extracted and denormalised, you are looking at a very large and complex dataset spanning several disparate domains, with varying levels of aggregation. Storing such data in a standard SQL data warehouse is within the normal realm of data engineering. Working with it to gain strategic insights is more of a challenge, yet it is the part that could yield the most benefit.
This is where Databricks and its cutting-edge platform come into play, utilising what are now fairly common big data technologies to extract and manipulate data.
Extracting data using Databricks
We started with a standard cloud-based SQL database to house the project data, set up in a star schema with various dimension and fact tables, linked with system-assigned keys to allow joins and denormalisation. In order to access all this, Databricks provides a managed Spark cluster set-up, where we could easily build and deploy a distributed compute resource pre-loaded with the required libraries to pull and manipulate seriously large datasets in just a few minutes.
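As a rough illustration of the extraction step described above, the sketch below reads one fact table and one dimension table over JDBC and joins them on a shared key. It assumes PySpark with an active `spark` session (as a Databricks notebook provides) and a SQL Server style JDBC URL. The host, database, table and column names are hypothetical and not taken from the client project.

```python
def jdbc_options(host, database, table, user, password, port=1433):
    """Build the option map for Spark's JDBC reader.

    Assumption: a SQL Server style URL; other databases use a different prefix.
    """
    return {
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_table(spark, options):
    """Load one database table into a Spark DataFrame."""
    return spark.read.format("jdbc").options(**options).load()


def denormalise(fact_df, dim_df, key):
    """Left-join a dimension table onto a fact table using a shared key column."""
    return fact_df.join(dim_df, on=key, how="left")


# Hypothetical usage in a notebook where `spark` already exists:
# facts = read_table(spark, jdbc_options("db.example.com", "projects",
#                                        "dbo.fact_cost", "reader", "<secret>"))
# dims = read_table(spark, jdbc_options("db.example.com", "projects",
#                                       "dbo.dim_project", "reader", "<secret>"))
# wide = denormalise(facts, dims, "project_key")
```

A left join keeps every fact row even when a dimension record is missing, which helps when the aim is to surface data quality problems rather than hide them.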
Once the data was in memory, we could begin to understand and clean it, alongside setting up dedicated storage tables to house the data in different states (raw, cleaned, aggregated, etc). This is again made easy by the APIs available on the managed cluster, which can be used from Scala, Python, SQL or R to perform a variety of optimised operations. As Databricks was founded by the original creators of Apache Spark, it also has several further optimisations over standard self-build Spark set-ups, such as those on Azure HDInsight or Amazon EMR. Circling back to storage, this has been improved further by the introduction of Delta Lake, an open-source storage layer that makes a standard data lake more stable. It stores data as Parquet files, enforces a schema on write, and keeps versioned snapshots so that a table can be rolled back to an earlier state.
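To make the storage idea more concrete, here is a minimal sketch of writing a DataFrame to a Delta table per data state and reading back an earlier snapshot. It assumes a Spark environment with Delta Lake available (as on Databricks). The base path and the three stage names are illustrative assumptions based on the states mentioned above.

```python
def stage_path(base_path, stage):
    """Return the storage location for one data state."""
    allowed = ("raw", "cleaned", "aggregated")  # assumed stage names
    if stage not in allowed:
        raise ValueError(f"unknown stage: {stage!r}")
    return f"{base_path.rstrip('/')}/{stage}"


def write_stage(df, base_path, stage):
    """Write a DataFrame as a Delta table; each write creates a new table version."""
    path = stage_path(base_path, stage)
    df.write.format("delta").mode("overwrite").save(path)
    return path


def read_version(spark, path, version):
    """Read an earlier snapshot of a Delta table by its version number."""
    return spark.read.format("delta").option("versionAsOf", version).load(path)


# Hypothetical usage:
# path = write_stage(cleaned_df, "/mnt/projects", "cleaned")
# previous = read_version(spark, path, 0)
```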
A treasure trove of data
Through the standard Databricks offering we now had a link into this treasure trove of data, allowing us to dig through it and find issues with data quality. We could also devise new denormalised, very wide table models with frequently changing schemas. Such tables are inefficient to hold in a rigid, row-based data model, such as a SQL database, but they are essential for gaining a cross-domain overview of a project via aggregation and for developing predictive models.
Focusing on the second point for a moment, the integration between Databricks and cloud providers such as AWS or Azure allowed for even greater extensibility. We could combine the raw compute power with the columnar storage of data warehouses such as Amazon Redshift or Azure Synapse, or with the cheap, almost limitless object storage of Amazon S3 or Azure Blob Storage. With this link in place, aggregation tables could be developed quickly, pushed into a database and then served to a standard enterprise reporting tool such as Power BI or Qlik, giving people access to views of the data that were previously very cumbersome to produce. Combined with the Databricks API for running scheduled pipeline jobs, this made it much easier to provide a variety of overviews, at differing levels of detail, in an easy-to-consume format.
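The aggregate-and-publish flow might look something like the sketch below, which sums a set of value columns per group and writes the result to a database table over JDBC for a reporting tool to pick up. The column names and connection options are hypothetical, and a dedicated warehouse connector could well be a better fit than plain JDBC in practice.

```python
def sum_spec(columns):
    """Map each value column to Spark's built-in 'sum' aggregate."""
    return {column: "sum" for column in columns}


def aggregate_by(df, group_cols, value_cols):
    """Group a wide DataFrame and sum the chosen value columns.

    Spark names the resulting columns like 'sum(budget)'.
    """
    return df.groupBy(*group_cols).agg(sum_spec(value_cols))


def publish(df, jdbc_options):
    """Overwrite the target database table with the aggregated result."""
    df.write.format("jdbc").options(**jdbc_options).mode("overwrite").save()


# Hypothetical usage:
# summary = aggregate_by(wide_df, ["programme", "region"], ["budget", "actual_cost"])
# publish(summary, {"url": "jdbc:...", "dbtable": "reporting.programme_summary",
#                   "user": "writer", "password": "<secret>"})
```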
Moving on to the more interesting ML modelling aspect, Spark's built-in ML library (MLlib) provided a usable API for building standard machine learning models (clustering, trees, SVMs, etc) while also speeding up their development through the parallel nature of Spark, something that can be difficult to set up and manage with other ML libraries such as scikit-learn. It was through this library that we were able to develop the Oakland Intelligent Forecasting Platform, with Databricks at its heart. The platform uses both the large-scale processing and the optimised ML development tools within Databricks to predict at what point, and by how much, a project may go over budget. This is innovative work that is delivering real-world insights, and it shows what is possible when access to data is no longer a barrier to entry.
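For readers unfamiliar with the Spark ML API, the following sketch shows the general shape of a pipeline that could predict overspend. It is not the platform's actual model: the gradient-boosted tree regressor is a stand-in, and the feature and label column names are assumptions made for illustration.

```python
def overspend(budget, actual_cost):
    """Amount by which actual cost exceeds budget (negative means under budget)."""
    return actual_cost - budget


def build_pipeline(feature_cols, label_col="overspend"):
    """Assemble numeric feature columns and fit a tree-based regressor on them."""
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import GBTRegressor

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    regressor = GBTRegressor(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[assembler, regressor])


# Hypothetical usage on DataFrames that contain the named columns:
# model = build_pipeline(["budget", "duration_days", "percent_complete"]).fit(train_df)
# predictions = model.transform(test_df)  # adds a 'prediction' column
```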
During the development of the platform a further aspect of Databricks' tooling proved invaluable: the model versioning and run tracking support offered via its managed MLflow offering. By providing a management layer on top of the open-source MLflow platform, Databricks gives the user an easily understood system for tracking experiments during model development, along with the numerous model versions that accompany this process. Once a model is complete, there is also functionality for quick and scalable deployment.
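Experiment tracking of this kind can be sketched with the open-source MLflow tracking API, as below. The run name, parameters and the choice of mean absolute error as the metric are illustrative assumptions, not details of the platform itself.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def log_experiment(run_name, params, actual, predicted):
    """Record one training run's parameters and error metric in MLflow."""
    # Imported lazily so this module can be loaded without MLflow installed.
    import mlflow

    with mlflow.start_run(run_name=run_name) as run:
        mlflow.log_params(params)
        mlflow.log_metric("mae", mean_absolute_error(actual, predicted))
        return run.info.run_id


# Hypothetical usage:
# run_id = log_experiment("gbt-baseline", {"max_depth": 5},
#                         actual=[10.0, 20.0], predicted=[12.0, 17.0])
```

Logging the same metric for every run makes it straightforward to compare model versions side by side in the tracking UI.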
Finally, while generally seen as a less exciting aspect of the platform, its integration with both AWS and Azure has allowed us to move with client preferences, establishing security via role-based permissions in either cloud environment to ensure access to data and compute power is secured.
Using this platform, our team have been able to delve into years of project-level data, delivering insights around project delivery and progress, and providing forecasts of a project's expected spending outcome, i.e. whether it will come in under or over budget. While all of this is possible using unmanaged cloud infrastructure, Databricks' additional features and ease of use enabled a smoother workflow that benefited both our team and the client.