Lakehouse
Data storage has historically been one of the biggest challenges in implementing decentralized data marketplaces. Some data products might be massive archival corpora used for AI training, while others demand near-real-time delivery to power onchain protocols. Rather than force all datasets into a single storage mechanism or rely on purely onchain solutions, Portex takes a novel approach we call the Lakehouse paradigm. This architecture embodies a design principle of “marketplace agnosticism”: we do not mandate the exact location or technology for storing data. Instead, we simply enforce minimal file format standards and empower data sellers to choose the storage solutions that best suit their needs.
A Lakehouse merges the file-centric flexibility of a data lake with the transactional integrity of a traditional data warehouse. Data is stored in logical “lakes” and can be versioned or queried at scale, but without the restrictions that come with rigid warehouse schemas. For large tabular datasets, Portex requires Apache Parquet as a standardized format due to its strong compression, columnar reads, and rich metadata capabilities. Meanwhile, for computer vision datasets or image collections, we rely on optimized image formats like WebP. These minimal format constraints ensure that data remains both portable and consistent, giving buyers confidence in interoperability while leaving room for seller-side innovation in how they actually host or manage files.
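The compression advantage of a columnar layout like Parquet's can be seen even with standard-library tools. The sketch below is purely illustrative (synthetic data, zlib instead of Parquet's real encodings): it compresses the same 10,000-row table once in row-major order and once in column-major order, and the column-major layout compresses substantially better because similar values end up adjacent.

```python
import struct
import zlib

# Synthetic table: 10,000 rows of (user_id, price, country_code).
rows = [(i, 9.99, "US" if i % 3 else "DE") for i in range(10_000)]

# Row-oriented layout: values of different types interleaved per record.
row_major = b"".join(
    struct.pack("<id", uid, price) + cc.encode() for uid, price, cc in rows
)

# Column-oriented layout (the idea behind Parquet): each column stored
# contiguously, so repetitive values sit next to each other.
col_major = (
    b"".join(struct.pack("<i", uid) for uid, _, _ in rows)
    + b"".join(struct.pack("<d", price) for _, price, _ in rows)
    + b"".join(cc.encode() for _, _, cc in rows)
)

row_size = len(zlib.compress(row_major))
col_size = len(zlib.compress(col_major))
print(f"row-major compressed:    {row_size} bytes")
print(f"column-major compressed: {col_size} bytes")
```

Parquet layers dictionary and run-length encodings on top of this basic idea, which is why it is such a strong default for large tabular data products.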
The Lakehouse approach acknowledges that not every dataset is the same. Some sellers may wish to host a massive archive of raw satellite images that requires local file downloads or one-time encrypt-and-forget hosting. Other datasets, such as live transactional feeds or risk analytics, must be delivered in near real time. By decoupling our marketplace contracts from the physical storage layer, we let each dataset owner optimize for whatever performance, availability, or cost parameters they see fit.
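One way to picture this decoupling is a listing record that holds only a storage pointer and a content commitment, never the data itself. The sketch below is a hypothetical schema (the field names and `verify_payload` helper are assumptions for illustration, not Portex's actual contract interface): integrity verification works identically whether the seller hosts on S3, IPFS, or a plain HTTPS endpoint.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetListing:
    """Hypothetical marketplace record: the contract stores only a pointer
    to wherever the seller chose to host the data, plus a content hash so
    buyers can verify integrity after download."""
    dataset_id: str
    storage_uri: str      # s3://, ipfs://, https://, ... seller's choice
    content_sha256: str   # commitment to the (possibly encrypted) payload
    format_hint: str      # e.g. "parquet" or "webp", per the format standards

def verify_payload(listing: DatasetListing, payload: bytes) -> bool:
    # The integrity check is independent of the storage backend.
    return hashlib.sha256(payload).hexdigest() == listing.content_sha256

payload = b"example parquet payload fetched from the seller's backend"
listing = DatasetListing(
    dataset_id="ds-001",
    storage_uri="ipfs://bafyexamplecid/dataset.parquet",
    content_sha256=hashlib.sha256(payload).hexdigest(),
    format_hint="parquet",
)
print(verify_payload(listing, payload))
```

Because the contract commits only to the hash, a seller can later migrate the same dataset between storage providers without touching the marketplace record.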
While some datasets may live in centralized cloud services, others can benefit from distributed storage—especially when data availability is critical and outages or censorship must be avoided. Our future vision involves applying Reed-Solomon codes to shard data across multiple nodes, ensuring the dataset remains reconstructible even if some nodes become unavailable. This redundancy also helps maintain privacy, as no single operator (beyond the seller) would hold the entire unencrypted dataset. Data can be encrypted before sharding, using techniques like VECK, thereby removing single points of failure without sacrificing confidentiality.
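The reconstruction idea behind this sharding can be sketched with a deliberately simplified scheme: a single XOR parity shard, which tolerates the loss of any one of k+1 shards. Real Reed-Solomon coding generalizes this to survive m lost shards out of k+m; the code below is a stand-in that shows only the core recover-from-survivors mechanic, not a production erasure coder.

```python
def shard(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal shards plus one XOR parity shard."""
    size = -(-len(data) // k)                    # ceil(len(data) / k)
    padded = data.ljust(k * size, b"\0")
    shards = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = shards[0]
    for s in shards[1:]:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return shards + [parity]

def reconstruct(shards: list, orig_len: int) -> bytes:
    """Recover the original payload even if any single shard is missing."""
    missing = [i for i, s in enumerate(shards) if s is None]
    assert len(missing) <= 1, "single parity tolerates only one lost shard"
    if missing:
        present = [s for s in shards if s is not None]
        rec = present[0]
        for s in present[1:]:
            rec = bytes(a ^ b for a, b in zip(rec, s))
        shards = shards[:missing[0]] + [rec] + shards[missing[0] + 1:]
    return b"".join(shards[:-1])[:orig_len]

data = b"encrypted dataset payload"   # encrypt before sharding (e.g. VECK)
pieces = shard(data, k=4)             # 4 data shards + 1 parity shard
pieces[2] = None                      # one storage node goes offline
print(reconstruct(pieces, len(data)) == data)
```

Note that the payload is encrypted before it is ever sharded, so even a node holding a full shard learns nothing about the plaintext.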
One of the more exciting directions we’re exploring is the integration of Trusted Execution Environments (TEEs) for constrained data access. With TEEs, AI agents can be granted ephemeral “peeks” into a dataset to perform computations without exposing the raw data. This is especially relevant to data marketplaces, where sellers often worry that revealing too many samples or too much detail can undermine the commercial value of the dataset. TEEs help mitigate this concern by enabling buyers (or their AI agents) to explore and validate portions of the data within a secure enclave. The output can then be shared while the underlying dataset remains hidden. Over time, we believe this model will lower friction, making sellers more comfortable listing valuable datasets in the marketplace and, in turn, providing buyers and AI agents a more robust array of data products to choose from.
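The access pattern an enclave enables can be modeled in a few lines. The class below is an illustrative stand-in only: a real deployment would rely on hardware attestation (e.g. SGX or TDX) rather than a Python object, and the query whitelist is an assumption for this sketch. What it captures is the constraint itself: raw rows never leave the session, while whitelisted aggregates do.

```python
import statistics

class EnclaveSession:
    """Illustrative stand-in for a TEE session: raw values stay inside,
    and callers receive only the results of whitelisted aggregate queries."""

    _ALLOWED = {"count": len, "mean": statistics.mean, "stdev": statistics.stdev}

    def __init__(self, column: list[float]):
        self._column = column   # never returned directly

    def peek(self, query: str) -> float:
        if query not in self._ALLOWED:
            raise PermissionError(f"query {query!r} not permitted by enclave policy")
        return self._ALLOWED[query](self._column)

session = EnclaveSession([3.2, 4.1, 5.0, 4.4])
print(session.peek("mean"))   # an aggregate result leaves the "enclave"
# session.peek("dump")        # raises PermissionError: raw data never leaves
```

In a hardware TEE, the buyer's agent would additionally verify an attestation quote before trusting the returned aggregates, giving the seller confidence that only the approved computation ran against the data.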