19. December 2023 By Attila Papp
Defining data product development guidelines
The journey from raw data to actionable insights is often paved with challenges: teams repeat the same mistakes, lose sight of the broader context, and so on. Because of that, establishing clear data product development guidelines is very important, especially once you take the consumer experience into account. In this blog post, we will look at why these guidelines matter and present a baseline set of guidelines to start the process.
What is a data product?
A data product is a cohesive package that consists of:
- Data sets: maybe just a table, a view, a stream, or even an ML model.
- Domain information: business-friendly terms and their explanations. This makes it easier to understand the context.
- Access to the data: via APIs, data mesh, or visualization with access control policies included. Note: in the case of a data mesh, this is usually handled by a separate platform team.
A data product is prepared by data producers (ideally the teams where the transactional data originates) and is consumed by data consumers (usually analysts, data scientists, other teams, etc.).
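To make the three parts of the package concrete, here is a minimal sketch of a data product descriptor. The class, its field names, and the example values are all hypothetical and only illustrate the definition above; they do not come from any particular platform or tool.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor for a data product; all names are illustrative."""

    name: str
    producer_team: str
    # Data sets: tables, views, streams, or ML models.
    data_sets: list[str] = field(default_factory=list)
    # Domain information: business-friendly term -> explanation.
    glossary: dict[str, str] = field(default_factory=dict)
    # Access: how consumers reach the data (API, data mesh, visualization).
    access_channels: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """Treat the product as a cohesive package only if all three parts exist."""
        return bool(self.data_sets and self.glossary and self.access_channels)


# Example with made-up values.
orders = DataProduct(
    name="orders",
    producer_team="checkout",
    data_sets=["orders_v1"],
    glossary={"order": "A confirmed purchase made by a customer."},
    access_channels=["rest_api"],
)
```

A check like `is_complete()` is one possible way to notice a data set that was published without its domain information or access path.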
The Crucial Role of Data Product Guidelines
Guidelines in data product development serve as a guiding light for teams, providing a standardized framework that ensures consistency, efficiency, and interoperability across various parts of the organization. They facilitate seamless collaboration among teams by establishing standard practices, thereby reducing the likelihood of errors and misinterpretations, not to mention removing friction from combining data products.
Striking the Right Balance
While guidelines and standardization are important, it's equally important not to overburden the development process with unnecessary constraints. Flexibility is key, and guidelines should be viewed as adaptable recommendations rather than rigid directives. Striking a balance ensures that teams can innovate within the established framework, fostering creativity while keeping their data products interoperable with other data products.
An example set of guidelines
Use open table formats: Organizations should agree on an open table format, such as Apache Iceberg, to ensure data consistency and portability.
Naming Conventions: use common naming conventions like snake_case to enhance readability and maintain uniformity across different data products.
Inclusion of transformation date or timestamp: I consider including the date of the state (or the transformation timestamp) a best practice when converting transactional data to analytical data. This helps with traceability and with finding specific rows in raw data files.
Default Column Types: agree on default column types for common elements like timestamp, date, and country code. This streamlines the process of joining data products, making it more efficient and convenient for data consumers. Use ISO formats wherever possible.
Avoiding Ambiguity in Column Names: Ambiguous column names can lead to confusion. Guidelines should emphasize clarity, ensuring each column name accurately reflects its content. Shared business terms should be named the same way across all data products.
Reflecting Aggregation Type in Column Names: Column names should transparently indicate the aggregation type, promoting clarity and simplifying understanding of data products. For example, use names like avg_time or daily_avg_time.
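The naming guidelines above are easy to check automatically. The following is a minimal sketch, assuming a team wants to lint column names before publishing a data product. The function names, the regular expression, and the list of aggregation tokens are my own assumptions, not an agreed standard.

```python
import re

# Lowercase words separated by single underscores, e.g. daily_avg_time.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")

# Assumed vocabulary of aggregation markers; adjust to your own conventions.
AGGREGATION_TOKENS = {"avg", "sum", "min", "max", "count"}


def is_snake_case(name: str) -> bool:
    """Return True if the whole name follows snake_case."""
    return SNAKE_CASE.fullmatch(name) is not None


def names_aggregation(name: str) -> bool:
    """Return True if any underscore-separated part is an aggregation token."""
    return any(part in AGGREGATION_TOKENS for part in name.split("_"))


def check_columns(columns, aggregated):
    """Return (column, problem) pairs for names that break the guidelines.

    `aggregated` is the set of columns known to hold aggregated values.
    """
    problems = []
    for col in columns:
        if not is_snake_case(col):
            problems.append((col, "not snake_case"))
        elif col in aggregated and not names_aggregation(col):
            problems.append((col, "aggregation type missing from name"))
    return problems


report = check_columns(
    ["daily_avg_time", "orderTotal", "time"],
    aggregated={"daily_avg_time", "time"},
)
```

Here `report` flags `orderTotal` for its casing and `time` for hiding the fact that it is an aggregate, while `daily_avg_time` passes both checks. Which columns count as aggregated has to come from the producer, since it cannot be inferred from the name alone.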
Here are a few more guidelines on data product versioning:
Major Versions in Table Names: implement a versioning system by appending major versions to table names (e.g., my_table_v1). This ensures smooth transitions after breaking changes.
Breaking Changes vs. Non-breaking Changes: clearly define breaking changes, such as column removals or aggregation logic modifications. Non-breaking changes, like adding new columns, should also be distinguished.
Retirement of Old Versions: agree on a policy for retiring old versions and communicate it to consumers transparently. Actively manage and retire old versions by engaging with consumers and helping them migrate to the new version. A migration window ensures a smooth transition and avoids downtime caused by breaking changes.
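The versioning rules can be sketched in a few lines. This is an illustration only: it compares two schemas given as plain column-to-type dictionaries and derives the next table name. Treating a changed column type as breaking is my own assumption, and changes to aggregation logic cannot be detected from the schema at all, so those still need to be declared by the producer. All function names are hypothetical.

```python
def classify_change(old_schema: dict[str, str], new_schema: dict[str, str]) -> str:
    """Classify a schema change as 'breaking', 'non-breaking', or 'none'."""
    removed = old_schema.keys() - new_schema.keys()
    # Assumption: a changed type on an existing column breaks consumers.
    retyped = {
        col
        for col in old_schema.keys() & new_schema.keys()
        if old_schema[col] != new_schema[col]
    }
    if removed or retyped:
        return "breaking"
    if new_schema.keys() - old_schema.keys():
        return "non-breaking"  # only new columns were added
    return "none"


def next_table_name(base: str, major: int, change: str) -> str:
    """Bump the major version in the table name only after a breaking change."""
    if change == "breaking":
        major += 1
    return f"{base}_v{major}"


v1 = {"id": "bigint", "avg_time": "double"}
added_column = {"id": "bigint", "avg_time": "double", "country_code": "string"}
dropped_column = {"id": "bigint"}

name_after_add = next_table_name("my_table", 1, classify_change(v1, added_column))
name_after_drop = next_table_name("my_table", 1, classify_change(v1, dropped_column))
```

With these inputs, adding `country_code` keeps the name `my_table_v1`, while dropping `avg_time` yields `my_table_v2`, so the old table can stay available during the migration window.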
In conclusion, data product development guidelines are the foundation for successful data-driven organizations. Organizations can navigate the complexities of data product development efficiently by balancing levels of standardization and flexibility.
I believe data product guidelines should aim to improve the consumer experience, since a good data consumer experience leads to better actionable insights. By adhering to standardized formats, naming conventions, and versioning practices, teams can ensure their data products' reliability, clarity, and stability, laying the foundation for informed decision-making and sustainable growth.