DataMesh: How Uber laid the foundations for the data lake cloud migration
published on 2024/09/19
We introduced generic concepts modeled around the data mesh principles, to abstract the cloud constructs and easily adapt it to any chosen cloud provider.
- Domain: A hierarchical unit of organizing all the resources in an organization or team. In data mesh terms, domains can be both at the organization level or at a sub-organization/team level.
- Domain Resources: Cloud providers have resource containers, which are a unit of access control, resource grouping, and cost accounting.
- Data Collection: A logical container for data (DB/tables/files) belonging to the same category (e.g., raw data, tables for internal consumption or modeled tables for external broad access). This may be captured as labels in the corresponding cloud resources.
- Domain Storage: The storage buckets which are a unit of access control, cost accounting, geographic location, storage class (hot, warm, cold), data lifecycle policies (TTL, legal holds, etc.), and data replication (to either standby store or archival store).
- Data Product: A curated, discoverable, and usable representation of data, which includes the Hive tables, BI dashboards, workflow pipelines, etc.