LanceDB, which has Midjourney as a customer, is building databases for multimodal AI

Chang She, formerly VP of Engineering at Tubi and a Cloudera veteran, has years of experience building data infrastructure and tools. But when he started working in the AI space, he quickly ran into problems with traditional data infrastructure, problems that prevented him from putting AI models into production.

“Machine learning engineers and AI researchers often get trapped in a poor development experience,” he told TechCrunch in an interview. “Data infrastructure companies don’t really understand the problem of machine learning data at a fundamental level.”

So Chang, one of the co-creators of Pandas, the wildly popular Python data science library, teamed up with software engineer Lei Xu to co-launch LanceDB.

LanceDB is building the open source database software of the same name, which is designed to support multimodal AI models: models that train on and generate images, videos and more, in addition to text. Backed by Y Combinator, LanceDB this month raised $8 million in a seed funding round led by CRV, Essence VC and Swift Ventures, bringing the total raised to $11 million.

“If multimodal AI is critical to your company’s future success, you’ll want your expensive AI team to focus on modeling and tying AI to business value,” Chang said. “Unfortunately, today, AI teams spend most of their time dealing with low-level data infrastructure details. LanceDB provides the foundation AI teams need so they can have the freedom to focus on what really matters for business value and bring AI products to market much faster than would otherwise be possible.”

LanceDB is essentially a vector database: a database containing series of numbers (“vectors”) that encode the meaning of unstructured data (e.g. images, text, etc.).
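To make the idea concrete, here is a minimal sketch of the kind of lookup a vector database performs, written in plain Python rather than with LanceDB's own API. The three-number vectors and the item names are made up for illustration; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumption: invented values, not model output).
items = {
    "photo of a cat": [0.9, 0.1, 0.0],
    "photo of a dog": [0.7, 0.3, 0.1],
    "quarterly sales report": [0.0, 0.2, 0.9],
}

def nearest(query, items, k=1):
    """Return the names of the k items whose vectors are most similar to the query."""
    ranked = sorted(
        items,
        key=lambda name: cosine_similarity(query, items[name]),
        reverse=True,
    )
    return ranked[:k]

# A query vector that sits close to the two animal photos.
query = [0.85, 0.15, 0.05]
print(nearest(query, items, k=2))
```

This brute-force scan compares the query against every stored vector. Production vector databases generally rely on indexes to avoid that cost at scale.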

As my colleague Paul Sawers recently wrote, vector databases are having a moment as the AI hype cycle peaks. That is because they are useful for all types of AI applications, from content recommendations on e-commerce and social media platforms to reducing hallucinations.

Vector database competition is fierce: see Qdrant, Vespa, Weaviate, Pinecone and Chroma, to name a few vendors (not counting the Big Tech incumbents). So what makes LanceDB unique? Better flexibility, performance and scalability, according to Chang.

For one thing, Chang says, LanceDB, which is built on Apache Arrow, runs on a custom data format, Lance Format, that is optimized for multimodal AI training and analysis. Lance Format enables LanceDB to handle up to billions of vectors and petabytes of text, images and videos, and allows engineers to manage various forms of metadata associated with that data.

“Until now, there has never been a system that can unite training, exploration, search, and large-scale data processing,” Chang said. “Lance Format enables AI researchers and engineers to have a single source of truth and achieve ultra-fast performance across their entire AI pipeline. It’s not just about storing vectors.”

LanceDB makes money by selling fully managed versions of its open source software with additional features like hardware acceleration and governance controls, and business appears to be going strong. The company’s client list includes text-to-image platform Midjourney, chatbot unicorn Character.ai, autonomous vehicle startup WeRide, and Airtable.

However, Chang insisted that LanceDB’s recent venture capital backing would not divert his attention from the open source project, which he said now clocks around 600,000 downloads per month.

“We wanted to create something that would make it 10 times easier for AI teams to work with large-scale multimodal data,” he said. “LanceDB offers, and will continue to offer, a very rich set of ecosystem integrations to minimize adoption effort.”