Is Starburst The Analytics Engine for Data Mesh?

24th July 2022

At Oakland, we’re proudly tech agnostic. What does that actually mean? We can work with whatever tech our clients already have or have set their sights on using. That doesn’t mean we can’t review or recommend tech solutions, but we don’t receive any incentives to push one solution over another. We like it that way as we can choose precisely the right tools for the job, and like anyone, we have our favourites, but these are carefully chosen as they are tried and tested by our tech team.

In the first of a series of blogs, we aim to give you some insight into just some of the tech we work with. We begin with one of the most exciting tech launches of the past few years.

Starburst markets itself as the analytics engine for all your data, but what does that mean?

What is Starburst?  

Starburst is a highly scalable, distributed, in-memory SQL processing engine. That's a lot of jargon, but most developers will already be familiar with SQL, and that familiarity is a big part of why Starburst matters. It is distributed and scalable in a similar way to Spark, which is why it can compute over very large datasets. That, in our view, is what makes it such a successful product.

Its ability to ingest vast amounts of data from disparate data sources is why it is important. Starburst is one of the biggest data integration engines. You may not have heard of it yet, but we think you will. The engine began life at Facebook as Presto; a fork of that project was later renamed Trino, and Starburst is built on it. It was designed to let large, complex organisations with domain teams in key business areas query data in a fast, decentralised manner.

So why should you care? 

Starburst is one of the hottest products to be launched in the last few years. We compare it to Spark and Databricks (which we will cover in our next blog): five or six years ago, few of us had heard of either, and now they are multi-billion-dollar businesses.

Starburst’s strength lies in its ability to allow teams to connect the dots by sharing and ingesting data quickly and easily from wherever that data is held.

When a business needs insight, getting hold of the data often takes longer than producing the insight itself. The delay could come from the time to transfer data, data residency concerns, compliance requirements, or iterating to identify the right data. To save time, data analysts take shortcuts which limit traceability and re-usability.

Starburst offers a unified way to query large amounts of data, enabling faster time to insights, which makes it incredibly useful for spinning up MVPs and POCs.

Where is it useful? 

The larger the company, the more data sources it will typically rely on, which poses a massive challenge to how the data is ingested. Starburst’s ability to join data across different databases and stores is very compelling. Traditionally, this is time-consuming, and we have seen many projects held up at this stage, impacting an organisation’s ability to report accurately. For example:

  • Sensitive data, or data held across geographies: “What is my revenue per product category, per geography?”
  • Frequently changing data such as segments, channels, and digital journeys: “What is the performance of a segment, knowing that segments evolve every year?”
  • Post-merger reporting: “What is our consolidated performance?”

In those use cases, typical approaches move more data than required and bring implementation challenges and ongoing maintenance effort. Querying only the required data makes both insight and governance easier.
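To make the idea concrete, here is a hedged sketch of what a federated query looks like in Starburst’s SQL dialect (inherited from Trino), where each source is addressed as catalog.schema.table. The catalog, schema, table, and column names below are purely illustrative placeholders, not a real deployment:

```sql
-- Hypothetical federated query: joins a data-lake table with a live
-- relational table in a single statement. "lake" and "crm" stand in
-- for two configured catalogs; all names are illustrative.
SELECT o.category,
       c.region,
       SUM(o.amount) AS revenue
FROM lake.sales.orders AS o      -- e.g. a Hive/Iceberg-backed catalog
JOIN crm.public.customers AS c   -- e.g. a PostgreSQL-backed catalog
  ON o.customer_id = c.id
GROUP BY o.category, c.region;
```

The point is that neither dataset has to be copied into the other system first: the engine pushes down what it can to each source and joins the results in memory.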

Furthermore, where organisations have a centralised data function, teams can often be waiting for their central data engineering team, who we know are incredibly busy. You now don’t have to wait for the data to be put into your data lake or warehouse (or Lakehouse…or whatever we’re calling it these days).

It allows you to start ingesting data from source into memory and processing it straight away. You can create connections to all your sources and query them with a single SQL statement, saving both money and memory because queries run on a cost-efficient cluster. That’s where we believe the power lies: helping demonstrate value quickly.
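“Creating a connection to a source” is, in practice, a small catalog configuration file handed to the cluster. The fragment below is a hedged sketch of one such file in the properties format Trino-based engines use; the filename, hostname, database, and credentials are placeholders, not a real setup:

```properties
# Hypothetical catalog file, e.g. etc/catalog/crm.properties,
# registering a PostgreSQL source as the "crm" catalog.
# All values below are illustrative placeholders.
connector.name=postgresql
connection-url=jdbc:postgresql://crm-db.example.com:5432/crm
connection-user=analytics
connection-password=change-me
```

Once a catalog like this is in place, its tables become queryable alongside every other configured source, with no data movement required up front.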

Starburst helps connect teams, which means you can slip data mesh in under the radar. It’s a tool that opens up connections to other groups, other data sources, repositories, and data lakes – without needing to provision three or four accounts for each. It can also let you create and maintain a user group that everyone can access, so you know who is using it, who can access and modify particular tables, and what the full lineage is, which helps bring data mesh to life. It should be said that, technically, it does this by integrating with tools like Ranger, Atlas, and Purview to provide role-based access controls and other governance capabilities.

Teams can understand the reporting logic inside out, and build and own the reporting layer without waiting for a transformation programme or change project to deliver their report. Individuals can provide their own reporting layer in a mesh, data hub, or data mart within their domain-specific teams, leaving the centralised data layer to be maintained by the central data team.

We think Starburst will be big simply because of its use cases, flexibility, and power, although we have to say it isn’t the right solution for everyone. If your organisation isn’t looking to decentralise its data, has significant ongoing investments in other tech, has a low level of data maturity, or has lukewarm senior stakeholder buy-in, it’s probably not the right solution for you.

In our next blog, we’ll look at another US power player: Databricks.