If you’re looking to build your Business Intelligence capabilities and create bespoke dashboards, you’ll most likely need, or already have a Data Warehouse. Add Data Science into the mix, and you’ll also need a Data Lake; However, running both in tandem on a Data Platform can incur some serious costs. So, can you have the best of both worlds using a Databricks Lakehouse (one cloud platform for massive-scale data engineering and collaborative data science) Well, the answer is yes, if your current or new Data Platform is a good fit?
In this two-part series, I’ll discuss the Lakehouse, including how it can behave both like a Data Lake and Data Warehouse, along with some potential alternatives.
So, What is a Data Warehouse and a Data Lake?
A Data Warehouse is a data architecture that has been around since the 90s and is still relevant today. It is where you store your tabular data in a way that can be easily used by Business Intelligence solutions, such as Tableau or Power BI, web applications, and even other data warehouses. The two most common Data Warehouse architectures are the Kimball Star Schema and the Data Vault.
The name is also confusingly used to identify a type of Database, such as AWS Redshift, Azure Synapse, and Snowflake, which specialise at storing and querying large amounts of data. Whilst effective, Data Warehouses have their issues. They are often costly when processing large amounts of data and work best only when data is of reasonable quality and in a tabular structure. Additionally, the focus on data science has increased the demand for large amounts of raw data, often of dubious quality, in various formats that are not feasible using a data warehouse, leading consumers to Data Lakes.
A Data Lake is, at its heart, a distributed file system that allows for cheap storage of large amounts of data in varied formats (images, videos, and music), that Data Warehouses cannot. They are usually hosted in the cloud, such as on AWS S3 or Azure Data Lake, with large files split by a key, to save on processing costs by loading only the partitions of data you need.
So, with this in mind why would organisations not fully shift to Data Lakes to solve their business needs? The major issue is that they cannot easily or efficiently do row-level updates and inserts, nor connect easily to Business Intelligence applications; Two mainstays of a Data Warehouse. As such, there has been a drive to utilise both, which as I mentioned earlier starts to incur serious costs! However, a recent innovation from Databricks, the scalable analytics platform, may solve this through their Data Lakehouse offering.
What is Databricks Lakehouse?
A year or two ago, Databricks was mainly an easy-to-deploy and maintain platform for running Apache Spark, a distributed data processing library for large-scale Data Engineering and Data Science. It worked primarily in tandem with a Data Lake, with similar advantages and drawbacks. Then Databricks released Delta Lake, a file format with attributes only found previously in databases and Data Warehouses: data schemas, ACID transactions, row-level security, version control, row-level updates, and inserts. This can be beneficial as a data warehouse but can be much more expensive and complex to run than a data lake.
Along with Databricks bringing a Business Intelligence / Data Visualisation component soon in SQL Analytics and building better integrations with Power BI and Tableau, you could be able to replace your Data Warehouse or use it less often.
Architecture of a simple data platform using just both a Data Lake and Data Warehouse.
So, do I never need to use a Data Warehouse again?
Maybe, but some use cases do not suit the Lakehouse, such as when your existing data team want to continue using mainly SQL: while Spark has a SQL client, most Databricks users write in either Python or Scala, so making the switch from SQL to Python or Scala maybe not worth the benefits for your data team.
Also, Data Warehouse allows for more complex transformations across tables such as checking Foreign Key integrity (for example, does the ids in this table match all the ids in another table). Delta Lake does not support this. This can be replicated in Spark code, though arguably more inefficient in Spark than a Database.
Finally, Data Warehouses generally have better integration with other software than Spark / Delta Lake has; however, Delta Lake is gaining more support all the time as it is an open-source library and Spark has many data connectors, such as Power BI and Tableau, so it’s common to pull data initially from another format like CSV or a Database, then convert to Delta Lake for later use using Spark.
Can I build a Lakehouse somewhere other than Databricks?
Yes, with some caveats: Delta Lake is open source, so it can be used wherever Spark is used. However, Databricks builds special optimisations into Delta Lake just for Databricks and provides a user interface for viewing Delta Lake schemas, versions, and sample data. So it is likely running Delta Lake will be a bit slower and harder to maintain elsewhere.
There is also an alternative to Delta lake, called Apache Iceberg, which comes with its own pros and cons.
In short, similar to other data products and architectures, the answer is it depends on the make-up of your data team, the size and structure of your data, and how the data is used. That said, if you are consuming a lot of data in your data platform, or struggling to manage both a Data Lake and Data Warehouse, or trying to figure out how to use advanced analytics like Machine Learning with your data, Databricks Lakehouse makes a convincing proposition in our opinion.
In my next post, I’ll show how the Lakehouse can behave like a Data Lake and Data Warehouse with some Delta Lake, and Apache Spark code examples run in Azure Databricks.
(Jake Watson is a senior data engineer at Oakland)