
Data lakehouse company Onehouse secures $35M to capitalize on the GenAI revolution

Nowadays you can barely go an hour without reading about generative AI. Although we are still in the embryonic phase of what some have nicknamed the “steam engine” of the fourth industrial revolution, there is little doubt that “GenAI” is poised to transform nearly every industry, from finance and health care to law and beyond.

Cool user-facing applications may attract most of the fanfare, but the companies powering this revolution are currently benefiting the most. Just this month, chipmaker Nvidia briefly became the most valuable company in the world, a $3.3 trillion giant driven substantially by demand for AI computing power.

But in addition to GPUs (graphics processing units), businesses also need infrastructure to manage the flow of data: to store, process, train, analyze, and ultimately unlock the full potential of AI.

One company looking to take advantage of this is Onehouse, a three-year-old Californian startup founded by Vinoth Chandar, who created the open source Apache Hudi project while serving as a data architect at Uber. Hudi brings the benefits of data warehouses to data lakes, creating what is known as a “data lakehouse,” which supports actions such as indexing and performing real-time queries on large datasets, be they structured, unstructured, or semi-structured.

For example, an e-commerce company that continually collects customer data spanning orders, comments, and related digital interactions will need a system to ingest all that data and ensure it stays up to date, which could help it recommend products based on a user’s activity. Hudi enables data ingestion from multiple sources with minimal latency, with support for deletes, updates, and inserts (“upserts”), which is vital for such real-time data use cases.
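For a concrete sense of what that looks like in practice, here is a minimal sketch, assuming a PySpark environment with the Apache Hudi Spark bundle installed, of upserting an orders table into a Hudi-managed data lakehouse. The table name, record fields, and storage path are hypothetical, and this is not Onehouse’s own code.

```python
# A hedged sketch of a Hudi upsert with PySpark (field names and paths are illustrative).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    # The Apache Hudi Spark bundle must be on the classpath (e.g. via --packages).
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Pretend this batch arrived from an operational source such as an orders stream.
orders = spark.createDataFrame(
    [("order-1", "user-42", 29.99, "2024-06-01 10:00:00")],
    ["order_id", "user_id", "amount", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",     # identifies each record
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest version wins
    "hoodie.datasource.write.operation": "upsert",             # insert new, update existing
}

# Re-running this with a changed row updates it in place rather than duplicating it.
(
    orders.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lakehouse/orders")
)
```

The precombine field is what lets Hudi keep only the latest version of a record when the same key arrives more than once, which is the behavior that makes continuous, low-latency ingestion workable.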

Onehouse builds on this with a fully managed data lakehouse that helps businesses deploy Hudi. Or, as Chandar puts it, “it drives data ingestion and standardization into open data formats” that can be used with almost all major tools in the data science, artificial intelligence, and machine learning ecosystems.

“Onehouse abstracts the building of low-level data infrastructure, helping AI companies focus on their models,” Chandar told TechCrunch.

Today, Onehouse announced that it has raised $35 million in a Series B funding round as it brings two new products to market to improve Hudi performance and reduce cloud processing and storage costs.

Down at the (data) lakehouse

Onehouse advertisement on billboards in London.
Image credits: Onehouse

Chandar created Hudi as an internal project at Uber in 2016, and since the transportation company donated the project to the Apache Software Foundation in 2019, Hudi has been adopted by the likes of Amazon, Disney, and Walmart.

Chandar left Uber in 2019 and, after a brief stint at Confluent, founded Onehouse. The startup emerged from stealth in 2022 with $8 million in seed funding, and shortly thereafter followed up with a $25 million Series A round. Both rounds were co-led by Greylock Partners and Addition.

Those venture capital firms have once again joined forces for the Series B follow-up, though this time David Sacks’ Craft Ventures is leading the round.

“The data lakehouse is quickly becoming the standard architecture for organizations that want to centralize their data to power new services like real-time analytics, predictive machine learning, and GenAI,” said Michael Robinson, partner at Craft Ventures, in a statement.

For context, data warehouses and data lakes are similar in the way they serve as a central repository for pooling data. But they do it in different ways: a data warehouse is ideal for processing and querying historical structured data, while data lakes have emerged as a more flexible alternative for storing large amounts of raw data in its original format, with support for multiple data types and high-performance queries.

This makes data lakes ideal for AI and machine learning workloads, since it is cheaper to store pre-transformed raw data, while more complex queries are also supported because the data can be kept in its original form.

However, the trade-off is a whole new set of data management complexities, which risks worsening data quality given the wide range of data types and formats. This is partly what Hudi aims to solve, bringing some key features of data warehouses to data lakes, such as ACID transactions to support data integrity and reliability, as well as improved metadata management for more diverse datasets.
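To illustrate, below is a hedged sketch of how a downstream job might use Hudi’s incremental query mode to read only the changes committed to such a table after a given point in time; the commit timeline behind this is the same mechanism that underpins Hudi’s ACID guarantees. The table path and instant timestamp are illustrative, not taken from the article.

```python
# A hedged sketch of an incremental read from a Hudi table with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-read-sketch").getOrCreate()

# Only records committed after this instant time are returned; the timestamp
# and table path below are illustrative.
changes = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240601000000")
    .load("s3://example-bucket/lakehouse/orders")
)

changes.show()
```

Pulling only new commits this way is what keeps downstream transformations from rescanning the entire lake on every run, one of the warehouse-like conveniences the lakehouse model is meant to provide.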

Configuring data pipelines in Onehouse.
Image credits: Onehouse

Since it is an open source project, any company can deploy Hudi. A quick look at the logos on the Onehouse website reveals some impressive users: AWS, Google, Tencent, Disney, Walmart, ByteDance, Uber, and Huawei, to name a few. But the fact that big-name companies are leveraging Hudi internally is indicative of the effort and resources required to build it as part of an in-house data lakehouse setup.

“While Hudi provides rich functionality for ingesting, managing and transforming data, enterprises still need to integrate about half a dozen open source tools to achieve their goals of a production-quality data lake,” Chandar said.

That’s why Onehouse offers a fully managed cloud-native platform that ingests, transforms and optimizes data in a fraction of the time.

“Users can get an open data lake up and running in less than an hour, with extensive interoperability with all major cloud-native services, warehouses and data lake engines,” Chandar said.

The company was shy about naming its commercial customers, other than the couple featured in its case studies, such as the Indian unicorn Apna.

“As a young company, we do not publicly share the full list of Onehouse commercial customers at this time,” Chandar said.

With a fresh $35 million in the bank, Onehouse is now expanding its platform with a free tool called Onehouse LakeView, which provides observability into lakehouse functionality, offering insights on table statistics, trends, file sizes, timeline history, and more. This builds on the existing observability metrics provided by the core Hudi project, adding extra context around workloads.

“Without LakeView, users must spend a lot of time interpreting metrics and deeply understanding the entire stack to detect performance issues or inefficiencies in pipeline configuration,” Chandar said. “LakeView automates this and provides email alerts on good or bad trends, flagging data management needs that would improve query performance.”

Additionally, Onehouse is also introducing a new product called Table Optimizer, a cloud-managed service that optimizes existing tables to accelerate data ingestion and transformation.

‘Open and interoperable’

It would be remiss to ignore the countless other big-name players in the space. The likes of Databricks and Snowflake are increasingly embracing the lakehouse paradigm: earlier this month, Databricks reportedly doled out $1 billion to acquire a company called Tabular, with a view to creating a common lakehouse standard.

Onehouse has surely entered a hot space, but it hopes its focus on an “open and interoperable” system that makes it easy to avoid vendor lock-in will help it stand the test of time. Basically, it promises the ability to make a single copy of data universally accessible from virtually anywhere, including Databricks, Snowflake, Cloudera, and native AWS services, without having to create separate data silos in each.

As with Nvidia in the GPU space, the opportunities that await any company in the data management space cannot be ignored. Data is the cornerstone of AI development, and not having enough good-quality data is a major reason why many AI projects fail. But even where the data exists in abundance, companies still need the infrastructure to ingest, transform, and standardize it to make it useful. That bodes well for Onehouse and its ilk.

“From a data management and processing point of view, I believe that quality data delivered by a solid data infrastructure foundation will play a crucial role in bringing these AI projects to real-world production use cases, to avoid wasting time solving data problems,” Chandar said. “We’re starting to see that demand from data lakehouse users as they struggle to scale the data processing and query needs of building these new AI applications on enterprise-scale data.”