“Open data lake architecture gives companies the advantage of using open-source technologies to run analytics on the data lake without proprietary software that creates vendor lock-in,” says Dipti Borkar, Co-Founder and CPO of Ahana, in an exclusive interview with EnterpriseTalk.
ET Bureau: In the current business and economic environment, what do you think is driving the need for open data lake architecture?
Dipti Borkar: Companies today have data stored in many different systems, and a lot of that data is in the cloud (or it will be in the cloud soon). It’s a mix of structured, unstructured, static, and streaming data, usually in many different formats. Eventually, most of this data will end up in a data lake, because data lakes are cheap and make it easy to store massive amounts of data.
The value of data comes from running analytics and making decisions based on those results. These decisions are increasingly based on evaluating and processing not just parts of the data but all of it, both within the data lake and across the data lake and other databases. An open data lake architecture approach helps companies avoid relying on proprietary systems and proprietary data formats to do that.
ET Bureau: In what ways does an open data lake analytics approach benefit businesses?
- No lock-in (open formats, open source)
Open data lake architecture gives companies the advantage of using open-source technologies to run analytics on the data lake without proprietary software that creates vendor lock-in.
- Flexibility to apply multiple data processing techniques on the same data without copying it in many different places
Over the past few years, many optimized open formats have emerged that store data in a structured yet highly compressed form. Open query engines that support these formats give users the power to choose which engine to use for different use cases on the same set of data. This is extremely powerful – using open formats gives companies the flexibility to pick the right engine for the right job without the need for an expensive migration.
- Best technologies from the best engineers
Many open-source projects like Presto were built at internet giants and are used by other companies such as Uber and Twitter. These companies continue to innovate on these projects, and end users get the testing and innovation built for Facebook scale. In addition, companies get the best of what comes with open-source software – flexibility plus the power of a community that can provide help, fixes, and quick development.
ET Bureau: A challenge with data lakes is not getting locked into proprietary formats or systems. Do you think open data lake architecture can help ensure there is no vendor lock-in?
Dipti Borkar: In short, yes. In addition to open-source, open data lake architecture includes open formats, open interfaces, and open clouds. Open formats like Apache ORC, Apache Parquet, JSON, and Avro give you the flexibility to easily migrate from one query engine to another.
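As a toy illustration of the format portability Borkar describes, the sketch below uses JSON Lines – JSON being the simplest of the open formats named above (Parquet and ORC need third-party libraries, so plain JSON stands in here). The file names and records are invented for the example; the point is that data written once in an open format can be read back by any tool that understands it, with no vendor library involved:

```python
import json
import os
import tempfile

# Sample records of the kind that might land in a data lake.
records = [
    {"user_id": 1, "event": "login", "ts": "2021-06-01T10:00:00Z"},
    {"user_id": 2, "event": "purchase", "ts": "2021-06-01T10:05:00Z"},
]

# Write the data once, in an open, self-describing format (JSON Lines).
path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Any consumer that speaks JSON can read the same file back --
# no proprietary driver or format-specific SDK required.
with open(path) as f:
    loaded = [json.loads(line) for line in f]

assert loaded == records
```

Migrating to a different query engine then means pointing the new engine at the same files, not rewriting the data.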
Open interfaces mean that there’s seamless integration with existing SQL systems and support for ANSI SQL. For example, one should be able to access data through standard drivers like ODBC and JDBC. For open cloud, the query engine should be able to access any cloud storage, natively align with containers, and run on any cloud.
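The "open interfaces" idea can be sketched in Python using the DB-API, Python's standard database interface, playing the role that ODBC and JDBC play above. Here sqlite3 stands in for a real lake query engine (an assumption for illustration only): because the analytics code depends only on the standard cursor/execute/fetchall contract, swapping the engine means swapping the connection, not the query logic.

```python
import sqlite3

def top_events(conn, limit=2):
    """Run ANSI SQL through a standard DB-API connection.

    This function knows nothing about which engine is underneath;
    it relies only on the DB-API cursor/execute/fetchall contract.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT event, COUNT(*) AS n FROM events "
        "GROUP BY event ORDER BY n DESC, event LIMIT ?",
        (limit,),
    )
    return cur.fetchall()

# sqlite3 stands in here for any DB-API-compliant engine driver.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "login"), (2, "login"), (3, "purchase")],
)

print(top_events(conn))  # [('login', 2), ('purchase', 1)]
```

The same `top_events` function would work unchanged against any engine that exposes the standard interface, which is the portability argument for ODBC/JDBC support in a lake query engine.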
A common thread between all these pieces is how much flexibility a company can get with the technologies. A company might decide to change their query engine, which format they store their data in, or which cloud they use. With an open data lake approach, it’s possible to do all these things – without getting locked into a proprietary system or technology.
ET Bureau: Does open data lake architecture enable a vendor-agnostic solution to compliance needs?
Dipti Borkar: Typically, requirements from users on compliance are around security. Because open source technologies are widely adopted, many of the security features may already be available. This could help achieve faster compliance.
In addition, open-source projects that are vendor-neutral and community-driven via foundations like the Linux Foundation and the Apache Software Foundation are better technologies for users to bet on. Since multiple organizations are involved in these projects, particularly the large internet companies, there are fairly mature security features built in.
ET Bureau: What are some of the best practices businesses can follow when it comes to open data lake architecture?
Dipti Borkar: Encourage the developers and engineers to do their research on open source projects. Bonus points if they can participate in those projects and become part of the community.
Decide which use case to start with – usually, it’s the one that will bring the most value to the company – and prioritize it as the starting point for this architecture.
And finally, pick the best approach that works for the company to build this architecture. Many teams don’t have all the expertise in-house, and that’s ok. Leverage a cloud-native managed service that will help reduce the learning curve for the data platform team so that they can take advantage of the technology without having to be experts.
Dipti Borkar is the Co-Founder and CPO of Ahana with over 15 years of experience in distributed data and database technology, including relational, NoSQL, and federated systems.
She is also the elected Presto Foundation Outreach Chairperson. Before Ahana, Dipti held VP roles at Alluxio, Kinetica, and Couchbase. At Alluxio, she was Vice President of Products, and at Couchbase she held several leadership positions, including VP, Product Marketing, Head of Global Technical Sales, and Head of Product Management.
Earlier in her career, Dipti managed development teams at IBM DB2 Distributed, where she started her career as a database software engineer.