Thursday, May 4, 2023

Data Version Control: The Enabler Of Data Engineering Best Practices

Data is the backbone of every business organization today, and its importance will only grow in 2023. There has been a lot of discussion lately about adopting version control practices for data. Many engineers believe that data version control is the obvious next step, one that would transform data pipelines from something organizations maintain into something they engineer, just like code.

But what exactly does “data version control” mean?

Version control systems emerged in the 1960s to help solve problems engineers encountered while building software applications. Data version control brings the same versioning capabilities to data practitioners, enabling best practices such as:

• Smooth collaboration during development.

• Ability to develop and test in isolation.

• Reverting the code repository to a stable version in case an error occurs.

• Reproducing and troubleshooting issues with a given version of the code.

• Continuously integrating and deploying new code (CI/CD).

This raises the question: What does an organization get from implementing data version control?

Collaboration During Development


When building a data-intensive application, everyone involved must work from the same version of the data. This is challenging because data accumulates and changes over time. Versioning the data and referring to it by commit ID or branch name keeps everyone synchronized on the right version.
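
One way to picture a commit ID is as a fingerprint of a data snapshot. The sketch below is a minimal illustration that assumes the ID is a hash of the serialized records; `commit_id` is a hypothetical helper, and real data version control tools compute their IDs in their own ways.

```python
import hashlib
import json


def commit_id(records):
    """Return a deterministic ID for a dataset snapshot (hypothetical helper)."""
    # Serialize with sorted keys so identical data always yields identical bytes.
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


snapshot_monday = [{"customer": "a", "orders": 3}, {"customer": "b", "orders": 1}]
snapshot_tuesday = snapshot_monday + [{"customer": "c", "orders": 7}]

# Two teammates holding identical data compute the identical ID...
id_alice = commit_id(snapshot_monday)
id_bob = commit_id([dict(r) for r in snapshot_monday])

# ...while any change to the data produces a different ID.
id_changed = commit_id(snapshot_tuesday)

print(id_alice == id_bob)      # True
print(id_alice == id_changed)  # False
```

If two people quote the same ID, they are looking at the same data, which is what makes an ID a useful thing to share in a ticket or a report.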

Developing And Testing Data In Isolation

Isolation is made possible by a data version control mechanism called branching. A branch works like a snapshot of the data repository taken at the time of its creation. From that point on, changes applied to data on the branch are recorded only on that branch.

Many teams experience tension between the flexibility and robustness of their systems when they want to quickly test the effects of a research finding on their data assets. A branching mechanism resolves this tension as it lets them quickly deploy the change in an isolated data pipeline and measure the results against everything running in their production environment.
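
The branching idea can be sketched with a toy in-memory repository. This is an illustration under simplifying assumptions, not the API of any real tool: `DataRepo` and its methods are hypothetical, commits are plain dictionaries, and a branch is just a name pointing at a commit.

```python
class DataRepo:
    """Toy in-memory data repository (illustrative only, not a real tool's API)."""

    def __init__(self, initial):
        # Each commit is a snapshot: a dict mapping key -> value.
        self.commits = [dict(initial)]
        # A branch is just a name pointing at a commit index.
        self.branches = {"main": 0}

    def create_branch(self, name, source="main"):
        # No data is copied: the new branch points at the same commit.
        self.branches[name] = self.branches[source]

    def read(self, branch):
        return dict(self.commits[self.branches[branch]])

    def write(self, branch, key, value):
        # Record the change as a new commit visible only on this branch.
        snapshot = self.read(branch)
        snapshot[key] = value
        self.commits.append(snapshot)
        self.branches[branch] = len(self.commits) - 1


repo = DataRepo({"price_model": "v1", "rows": 1000})
repo.create_branch("experiment")
repo.write("experiment", "price_model", "v2")

print(repo.read("experiment")["price_model"])  # v2
print(repo.read("main")["price_model"])        # v1
```

Because creating a branch only adds a pointer, it is cheap, and production readers on `main` never see the experimental change.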

Reverting In Case Of An Error

Revert is a data version control operation that lets engineers travel back in time within their data repository, returning to any point marked by a commit, branch, or merge. If a data quality issue surfaces just as you’re about to make an important decision, such as opening a new business location, there’s no need to postpone that decision by hours or days. Engineers can quickly revert the data to the last stable snapshot of the repository.
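
A revert can be sketched as re-committing an earlier snapshot on top of the history. The example below is a minimal illustration with hypothetical names (`BranchHistory`, `revert_to`, the commit IDs `c1` and `c2`); it assumes the faulty commit should stay in the log for later debugging, which is one possible design rather than how every tool behaves.

```python
class BranchHistory:
    """Toy commit log for one branch (illustrative, not a real tool's API)."""

    def __init__(self):
        self.log = []  # list of (commit_id, snapshot) in commit order

    def commit(self, commit_id, snapshot):
        self.log.append((commit_id, dict(snapshot)))

    def head(self):
        return self.log[-1]

    def revert_to(self, commit_id):
        # Re-commit the old snapshot on top, keeping the bad commit in history.
        for cid, snapshot in self.log:
            if cid == commit_id:
                self.commit(f"revert-to-{cid}", snapshot)
                return
        raise KeyError(f"unknown commit: {commit_id}")


history = BranchHistory()
history.commit("c1", {"revenue_eur": 120_000})
history.commit("c2", {"revenue_eur": -5})  # faulty load

history.revert_to("c1")
print(history.head())  # ('revert-to-c1', {'revenue_eur': 120000})
```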

Reproducing And Troubleshooting Issues

If data consumers across your organization have come to rely on a healthy state of the data, a mistake in a new version can disrupt their work. Keeping earlier versions available through version control protects them.

Suppose you need to troubleshoot an issue that data consumers face. Consumers can always fall back to the previous data state by reading directly from the version they need. Meanwhile, engineers can troubleshoot the issue and update the logic to create a new data version without the bug.

Your organization will be equipped with tooling that makes it possible to fix issues fast and serve customers exactly what they need, even if they change their minds.

Continuous Integration And Delivery

Combining these data version control operations lets teams automate the testing of new data before it’s exposed to consumers. This process can be called continuous integration and deployment of data. Engineers define a set of tests that data must pass to ensure its quality and trigger them automatically. If the tests pass, the data is merged into a public branch; if a test fails, the data version is left in place for debugging.
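
The test-then-merge gate described above can be sketched in a few lines. Everything here is an assumption made for illustration: the two checks, the branch names, and the `integrate` function are hypothetical, and the "merge" is simplified to pointing the target branch at the validated rows.

```python
def no_missing_ids(rows):
    return all(row.get("id") is not None for row in rows)


def non_negative_amounts(rows):
    return all(row["amount"] >= 0 for row in rows)


CHECKS = [no_missing_ids, non_negative_amounts]


def integrate(branches, candidate, target="main"):
    """Merge `candidate` into `target` only if every check passes."""
    rows = branches[candidate]
    failed = [check.__name__ for check in CHECKS if not check(rows)]
    if failed:
        # Leave the candidate branch untouched so it can be debugged.
        return failed
    # Simplified "merge": the target now serves the validated data.
    branches[target] = list(rows)
    del branches[candidate]
    return []


branches = {
    "main": [{"id": 1, "amount": 10}],
    "ingest-good": [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}],
    "ingest-bad": [{"id": 1, "amount": 10}, {"id": None, "amount": -4}],
}

print(integrate(branches, "ingest-bad"))   # ['no_missing_ids', 'non_negative_amounts']
print(integrate(branches, "ingest-good"))  # []
print(len(branches["main"]))               # 2
```

The failing branch is still present afterwards, so whoever investigates can inspect exactly the rows that were rejected.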

But how can you start implementing data version control?

Start With The Psychological Aspects Of Change Management

Analyze the risks, benefits, and challenges. Identify your actual requirements and check what would deliver impact over both the short and long term. Identify the people you need to convince about the change. Prioritize a tool or set of tools that will build the foundation for your process.

Get Buy-In And Build Alignment

Introducing a data version control system means upgrading the processes in the organization. To achieve the expected impact, teams must adopt it and develop workflows around it. Help them understand the potential impact and show how the tooling can serve them. Identify their fears and doubts.

Assess Multiple Tools Together With Your Stakeholders

Make sure to involve team members in the process and keep communication open before settling on any tools. Build versus buy, open-source versus closed-source—these are the conversations you need to have to keep everyone on the same page. Some common industry choices for data version control are Git LFS, DVC, Weights & Biases, Neptune, Dolt, lakeFS and FastDS.

Design A Proof-Of-Concept Project

Make sure that your PoC reflects your actual workload and workflows. Running a PoC using someone else’s benchmarks isn’t going to cut it.

Identify Champions And Create A Communication Plan

After assessing the value of the PoC, it’s time to find champions who will lead the change. Implementing data version control everywhere is a massive process change and should be taken seriously.

Carry Out A Go/No-Go Test

After assessing the tools and completing the PoC, it’s time to have a candid conversation with stakeholders about whether to use a given tool or keep searching. Come to this meeting prepared with answers drawn from the previous steps.

Adoption And Education

Work with engineers to design and implement an adoption plan, including workflows, processes and education for everyone who will be using the tool.

If you operate in an industry where data changes frequently or you constantly receive a stream of new data, data version control can make a real difference. It opens the door to faster processes, improved decision making and fewer errors that tend to be costly in the data world. Ultimately, it helps you build greater trust in data among all its consumers—from finance to operations.

Looking to hire skilled software developers? Contact TP&P Technology - Leading Software Outsourcing Company in Vietnam Today

Article resource: https://www.forbes.com/sites/quickerbettertech/2022/11/10/on-crm-what-are-the-most-popular-add-ons-for-crm-applications/?sh=3487e26650f6/
