16 November 2023, by Attila Papp
Amazon DataZone: A Brief Overview
What is Amazon DataZone?
According to AWS, Amazon DataZone is a data management service that simplifies the process of cataloging, discovering, sharing, and governing data. It provides administrators with fine-grained controls to manage and govern data access, ensuring the right level of privileges and context.
DataZone integrates with various AWS data services, including Amazon Redshift, Amazon Athena, Amazon QuickSight, AWS Glue, and AWS Lake Formation. Its key capabilities are:
- Data Governance: DataZone allows you to govern data access across organizational boundaries, ensuring the right data is accessed by the right user for the right purpose.
- Collaboration: DataZone connects data workers through shared data and tools, increasing business team efficiency. It provides self-service access to data and analytics tools.
- Automated Data Discovery and Cataloging: DataZone uses machine learning to automate data discovery and cataloging.
Integration with Other AWS Services
DataZone supports three types of integrations with other AWS services:
- Producer Data Sources: You can publish data assets to the DataZone catalog from data stored in AWS Glue Data Catalog and Amazon Redshift tables and views. You can also manually publish objects from Amazon Simple Storage Service (S3) to the DataZone catalog.
- Consumer Tools: You can use Amazon Athena or Amazon Redshift query editors to access and analyze your data assets.
- Access Control and Fulfillment: DataZone supports granting access to AWS Lake Formation managed AWS Glue tables and Amazon Redshift tables and views.
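As a quick reference, the three integration types above can be written down as a small lookup. This is only an illustrative sketch: the string labels, the "standard"/"manual" distinction, and the helper names are shorthand invented for this example, not DataZone API identifiers.

```python
# Illustrative lookup of the integrations described above.
# All keys and values are made-up shorthand labels, not DataZone identifiers.

# Producer data sources and how their assets reach the catalog.
PRODUCER_SOURCES = {
    "glue_data_catalog": "standard",
    "redshift_table_or_view": "standard",
    "s3_object": "manual",  # S3 objects have to be published by hand
}

# Tools a consumer can use to query subscribed assets.
CONSUMER_TOOLS = {"athena_query_editor", "redshift_query_editor"}

# Asset types for which DataZone can grant access itself.
FULFILLMENT_TARGETS = {
    "lake_formation_glue_table",
    "redshift_table",
    "redshift_view",
}


def publishing_mode(source: str) -> str:
    """Return how assets from the given producer source are published."""
    try:
        return PRODUCER_SOURCES[source]
    except KeyError:
        raise ValueError(f"unsupported producer source: {source}") from None


def can_grant_access(target: str) -> bool:
    """Return True if access to this asset type can be granted by DataZone."""
    return target in FULFILLMENT_TARGETS
```

For example, `publishing_mode("s3_object")` returns `"manual"`, while `can_grant_access("s3_object")` is `False` because S3 objects are not among the fulfillment targets listed above.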
DataZone's concepts are well thought out and suitable for many use cases. Its main components are:
- Domain: A DataZone ‘instance’.
- Blueprints: These allow for the connection and consumption of data. For example, a blueprint can deploy Glue databases and the necessary cross-account permissions, or even a basic Athena setup.
- Environment Profiles: These define which blueprints will be used and deployed, such as smaller data lakes, data warehouses, or custom profiles.
- Environments: These describe an associated account and an environment profile.
- Projects: These are collections of resources.
- Data Source: This connects an environment to a specified database and its tables. It acts like a crawler for existing Glue metadata, populating DataZone's internal catalog.
- Asset: Essentially a data product that can be published. Once published, consumers can subscribe to it. DataZone can generate business names from technical column names using AI.
- Catalog: This is a collection of data assets.
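To make the relationships between these components more concrete, here is a minimal toy model in Python. It is a sketch under stated assumptions, not the DataZone API: every class, field, and method name is hypothetical, and it assumes that environments and data sources belong to a project and that the catalog lives at the domain level, which the list above does not spell out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Blueprint:
    name: str  # hypothetical label, e.g. "data-lake"


@dataclass
class EnvironmentProfile:
    name: str
    blueprints: list[Blueprint]


@dataclass
class Environment:
    name: str
    account_id: str  # the associated AWS account
    profile: EnvironmentProfile


@dataclass
class Asset:
    name: str
    columns: list[str]
    published: bool = False
    subscribers: set[str] = field(default_factory=set)


@dataclass
class DataSource:
    environment: Environment
    database: str
    tables: list[str]

    def crawl(self, glue_metadata: dict[str, list[str]]) -> list[Asset]:
        """Turn existing table metadata into (not yet published) assets."""
        return [
            Asset(name=f"{self.database}.{table}", columns=glue_metadata[table])
            for table in self.tables
            if table in glue_metadata
        ]


@dataclass
class Project:
    name: str
    environments: list[Environment] = field(default_factory=list)
    data_sources: list[DataSource] = field(default_factory=list)


@dataclass
class Domain:
    name: str
    projects: list[Project] = field(default_factory=list)
    catalog: dict[str, Asset] = field(default_factory=dict)

    def publish(self, asset: Asset) -> None:
        """Make an asset discoverable in the catalog."""
        asset.published = True
        self.catalog[asset.name] = asset

    def subscribe(self, asset_name: str, consumer_project: str) -> None:
        """Record a subscription; raises KeyError for unpublished assets."""
        self.catalog[asset_name].subscribers.add(consumer_project)


# Usage: one producer project publishes a table, another project subscribes.
profile = EnvironmentProfile("lake-profile", [Blueprint("data-lake")])
env = Environment("sales-dev", "111122223333", profile)
source = DataSource(env, "sales_db", ["orders", "customers"])
domain = Domain("corp")
domain.projects.append(Project("sales", [env], [source]))

# Only "orders" has metadata here, so only that table becomes an asset.
for asset in source.crawl({"orders": ["order_id", "cust_id"]}):
    domain.publish(asset)
domain.subscribe("sales_db.orders", "marketing")
```

The model deliberately leaves out the approval step and the actual permission grants; it only shows how a data source feeds assets into the catalog and how subscriptions attach to published assets.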
At the time of writing, DataZone does not support Lake Formation data filters, which is a deal-breaker for many, especially those impacted by GDPR. According to AWS support, this feature is scheduled for Q4 2023.
Furthermore, IAM-based connections are not yet available; only SSO-based login is supported. This means only end users can log in, so there is no programmatic consumption yet.
DataZone is also not yet supported by CloudFormation/CDK.
During the evaluation, I found no concept of 'super-admins.' I had to remove a project created by another user, but even as the domain creator I couldn't, so I had to reach out to support to remove it. This has been partly patched since then: domain creators can now assign themselves to any project they did not create.
Enabling SSO login rolled it out to the whole AWS organization by default, which was a bit problematic.
Based on these limitations and lessons learned, I believe DataZone is not yet ready for enterprise usage. However, it is a good option for mid-size companies on AWS. It provides seamless integration with other AWS services (although only a basic one in the case of Lake Formation) and makes consumption convenient through its built-in access request mechanism and consumer tools.