Data lineage guide
With the rise of big data and analytics, organizations need a system that can track and monitor data, its changes, and its usage. As the volume of incoming data grows, enterprises face many challenges, and handling them manually becomes impractical. In this article, we will discuss data lineage – a monitoring and tracking system that can be used to understand the flow of data. We will also discuss the use cases and benefits of data lineage. In a later section, we will implement a simple data lineage system and discuss how lineage systems are implemented at an industry level.
Data lineage contains the details regarding:
- the origin of data,
- where it is moved to, and
- the transformations performed on the data.
In an enterprise, data gets transformed, split, or merged as it passes through various phases. The same data is later used for analytics and data-driven decision-making. Data lineage therefore helps track the transformations applied to the data over time. Lineage thus represents the life cycle of data: how it moves or changes over time as it passes through different processes and systems.
Basic data lineage system
Data moves from the source(s) into an ETL process. The ETL process applies any number of transformations that process and reshape the received data according to analytics requirements. The data is then stored for future use. This whole life cycle of data – starting from its source, going through changes, and finally reaching storage – is called data lineage.
Data lineage contains the records of origin, transformations, and transfers that a given data has undergone. Data Provenance is similar to data lineage but has extra information on systems and processes that have influenced the data flow and with which the data can be reproduced.
The main goal of data lineage is to track the source of the data and the transformations it has undergone until the end. Data provenance, on the other hand, tracks the origin of the data and segregates it into three states – data-in-motion, data-in-process, and data-at-rest. Data provenance can be visualized as data lineage with extra information: it records the process transformations with which the data can be reproduced.
In the previous section, we discussed data lineage and its functioning and systematic diagram. This section will discuss the necessity of such systems and the benefits they provide to an organization.
Raw data is barely useful in practice. Hence raw data requires transformation and processing before a company can put it to use. When the data is transformed, it is essential to track where it goes, what transformations are applied, and where it originated. This matters because future business decisions depend entirely on the data.
Big data companies must comply with government regulations; failing to do so results in penalties or even legal action. Data lineage helps organizations track the travel of data across various systems and the transformations applied to it. Risk management and data governance teams follow the audit trail of the data to ensure all legal requirements are met.
When an issue pops up during a project, data lineage can help solve it quicker than manual methods. This happens because, with data lineage systems deployed, organizations can easily track what happens to the data once it starts from its source. This helps find the issue and resolve it.
Industries that work with vast amounts of data often find it challenging to track and manage that data; as the data grows, manageability becomes even tougher.
Data lineage can track the data flow and the changes that take place in an automated way. Data is often replicated as backup or for business purposes. A data lineage system will also track what data is replicated and at which location.
Data scientists and analysts often require access to specific data, especially up-to-date data for correct analysis of the present situation. They generally request access to the data source from the IT team, which causes delayed deliveries; the delay can leave the data outdated, especially in applications where the latest data is required. With data lineage systems in place, they get easy access to data sources, enabling on-time deliveries and the latest available data.
The following diagram depicts a simple data lineage diagram. Let us explore what this diagram means and what role each component has in the lineage system.
Simple data lineage system
It contains three phases – Source, Workflow, and Target.
The source is the origin of the data. The source can be of different types, like a file server, database server, or any other storage medium. Depending on the source type, the storage format of the data also changes: a CSV is stored as a text file, while a database stores data in tabular format.
In certain situations, data can have multiple sources; that is, data from multiple sources can be combined to form a complete set. In that case, the lineage will record multiple source entries.
Workflow refers to the batch of jobs that transform the data obtained from the source. The transformations depend on the data provided. If the data is already well processed or has enough attributes, fewer transformations may suffice. Raw data may require more processing and transformation so that in-depth analytics can be performed.
The workflow is not constant and changes for each organization, since it is specific to the type and amount of data collected. The workflow need not run on a single system: the data can be passed across several systems, with the final one sending it to the target. The whole transformation process is collectively called the workflow.
Targets are where the transformed data is put to use, either stored for the future or consumed by analytics. Similar to the source, a target can be a file, database, warehouse, etc. The target is the end phase of the data's journey, where it is finally put to use. Data is fetched for business purposes only from the target, and this data can be referred to as “clean data”.
This is the general shape of a data lineage diagram. Depending on the use case, the diagram can have more nodes and operations, but the basic idea of Source -> Transformation / Transfer -> Target still holds true.
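The Source -> Transformation / Transfer -> Target chain can be represented directly in code. Below is a minimal sketch in Python; the class names (`LineageNode`, `LineageGraph`) and the example node names are illustrative, not taken from any particular lineage tool.

```python
from dataclasses import dataclass, field

@dataclass
class LineageNode:
    name: str   # e.g. "orders.csv"
    kind: str   # "source", "transformation", or "target"

@dataclass
class LineageGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (upstream, downstream) pairs

    def add_node(self, name, kind):
        self.nodes[name] = LineageNode(name, kind)

    def add_edge(self, upstream, downstream):
        self.edges.append((upstream, downstream))

# One source, one transformation, one target – the general diagram above
graph = LineageGraph()
graph.add_node("orders.csv", "source")
graph.add_node("dedupe", "transformation")
graph.add_node("warehouse.orders", "target")
graph.add_edge("orders.csv", "dedupe")
graph.add_edge("dedupe", "warehouse.orders")

print(graph.edges)  # [('orders.csv', 'dedupe'), ('dedupe', 'warehouse.orders')]
```

A richer diagram with more nodes or operations simply adds more entries to the same graph.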
There are several approaches to implementing data lineage. This section will look at various methods along with their advantages and disadvantages.
Pattern-based lineage
The pattern-based lineage approach estimates lineage without looking at any code. The system reads metadata about tables, columns, and reports, along with data profiles. It then builds a lineage based on similarities between data values and column names. The advantage of this approach is that you only watch the data and need not worry about the underlying algorithms or technologies.
The downside is that the system can be inaccurate at times, which puts your data at risk. It is strictly restricted to database-related information and ignores important, often required details like transformation logic. This approach suits scenarios where reading the logic hidden in your code is not possible.
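The core idea – guessing links from name similarity – can be sketched in a few lines of Python using the standard library's `difflib`. The column names and the 0.7 cutoff are illustrative assumptions:

```python
from difflib import SequenceMatcher

source_cols = ["cust_id", "cust_name", "order_total"]
target_cols = ["customer_id", "customer_name", "total_amount"]

def similarity(a, b):
    # Ratio of matching characters between the two names (0.0 to 1.0)
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Link columns whose names look alike; 0.7 is an arbitrary cutoff
links = [(s, t) for s in source_cols for t in target_cols
         if similarity(s, t) >= 0.7]

print(links)  # [('cust_id', 'customer_id'), ('cust_name', 'customer_name')]
```

Note that the renamed pair `order_total` / `total_amount` is missed entirely, which illustrates the inaccuracy downside described above: without reading the transformation logic, the system can only guess.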
Manual lineage
Manual lineage starts by interviewing the people involved with the company's data to learn where the sources are, where the data travels, and what transformations are applied. This method has many downsides: missing even one person's knowledge of the data can cause errors in the final lineage. For vast amounts of data, tracking lineage manually is a tedious task, and it requires people with significant experience and knowledge. In practice, industries working with large amounts of data cannot make this work. There is also the risk that the lineage is erroneous because of human error or the sheer difficulty of tracking.
Lineage by data tagging
In lineage by data tagging, each piece of data that is moved or transformed is tagged with a label by a transformation engine, from start to finish. For this method to work, the transformation engine has to be consistent, which is not always the case, since transformations can change from project to project.
The method works well within the transformation engine, but whatever happens outside it is not captured, since tagging occurs only inside the engine. Moreover, transformations are not mandatory in practice: sometimes developers or analysts simply transfer data to another system without transforming it. In such scenarios, tagging does not work at all.
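Tagging can be sketched as a hypothetical transformation engine that stamps every record it touches. The function names and the `_lineage` field below are illustrative assumptions, not part of any real engine:

```python
import uuid

def tag(record, step):
    """Append a (step, tag) entry to the record's lineage trail."""
    trail = record.setdefault("_lineage", [])
    trail.append({"step": step, "tag": uuid.uuid4().hex[:8]})
    return record

def uppercase_name(record):
    # A toy transformation; the engine tags the record after applying it
    record["name"] = record["name"].upper()
    return tag(record, "uppercase_name")

rec = tag({"name": "alice"}, "ingest")  # tagged on entry to the engine
rec = uppercase_name(rec)               # tagged again after the transform

print([entry["step"] for entry in rec["_lineage"]])  # ['ingest', 'uppercase_name']
```

A record that bypasses these functions carries no trail at all, which is exactly the blind spot described above.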
Self-contained lineage
Some organizations use an all-in-one environment that provides processing logic, lineage, and data management under a single roof. Using this type of software makes things cleaner and easier, since everything is automated and self-managed and all operations are available in one place.
The downside is that the lineage is exclusive to the controlled environment: the system is blind to whatever happens outside it. This creates dead ends in the lineage as new tools are added when requirements change.
Lineage by parsing
For a data lifecycle that is complex, heterogeneous, and constantly evolving, the most effective way to manage lineage is to automate it. This is achieved by automatically reading through the logic, understanding it, and reverse engineering it to implement end-to-end tracking.
Understanding all the programming languages and tools (Including the tools used for data transformations and data movements) is necessary. To efficiently build a parsing-based lineage system, it is important to parse certain parameters like:
- input parameters,
- default values in the system,
- runtime information, etc.
Building such a solution for even a single programming language or tool is practically difficult, so the complexity of creating one for multiple languages and tools is easy to imagine.
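To make the idea concrete, here is a deliberately tiny Python sketch that parses one SQL statement to recover its target and source tables. A real parsing-based system would need a full grammar for every language and tool in the pipeline; this regex only covers the toy statement below:

```python
import re

sql = """
INSERT INTO sales_summary
SELECT region, SUM(amount) FROM raw_sales
JOIN regions ON raw_sales.region_id = regions.id
GROUP BY region
"""

# Target: the table written to; sources: every table read from
target = re.search(r"INSERT\s+INTO\s+(\w+)", sql, re.I).group(1)
sources = re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.I)

print(target, "<-", sources)  # sales_summary <- ['raw_sales', 'regions']
```

Even this single statement already yields a lineage edge from each source table to the target, which is the information a parsing-based system aggregates across the whole codebase.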
As the data grows, data lineage experiences specific challenges in various aspects. Overcoming them is necessary to make an efficient and up-to-date system.
This section will discuss specific challenges that data lineage faces when implemented on a large scale and what approaches are taken to overcome them.
Systems that perform Data-Intensive Scalable Computing (DISC) are distributed batch-processing systems designed for high throughput and performance. They execute several jobs in a single batch, and a single cluster can have any number of operators or nodes (hundreds or even thousands) executing a job, depending on the job and cluster size.
Data lineage for such systems can get complex because it has to scale to vast amounts of data and many operators; if it does not, it becomes a bottleneck for the DISC system. Thus, as the data and the complexity of these systems grow, the lineage system must scale in step to keep everything up and running.
Fault tolerance is another property lineage systems must provide. Re-running entire data flows just to capture lineage after a fault is not affordable. A lineage system must identify failed DISC tasks so it can accommodate failures within the DISC system, and it must avoid merging the lineage of a failed task with that of its restarted task. This prevents meaningless and inconsistent lineage records.
Also, the system should be able to handle the crashing of multiple instances of local lineage systems. To handle crashes, the system can store replicas of lineage associations across multiple machines; the replicated copies act as a backup if the primary copy is lost.
DISC dataflows require lineage systems that can capture lineage across black-box operators to enable fine-grained debugging. Current approaches include Prober, which finds the minimal set of inputs producing a specified output of a black-box operator by replaying the dataflow several times, and the dynamic slicing of Zhang et al., which uses binary rewriting to compute dynamic slices for NoSQL operators.
Producing highly accurate lineage incurs significant time overhead for capture and tracking. The overhead can be so high that it becomes preferable to trade lineage accuracy for faster performance. Thus, techniques that generate lineage with reasonable accuracy and low time overhead are required.
Replaying specific portions of inputs or data flow is essential for debugging and in-depth analysis to test various use cases. A methodology termed lineage-based refresh selectively replays updated inputs to recompute affected outputs. When a bad input in the data flow has been fixed, this method comes in handy to recheck the output.
There are times when a user wants to remove a bad input and replay the lineage of only the outputs that were affected by it – a method termed “exclusive replay”. A similar method, termed selective replay, enables stepwise debugging.
Present systems are not developed enough to perform both exclusive and selective replays. Hence, there is a need for a system that can perform both selective and exclusive replays.
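The selective idea behind lineage-based refresh can be sketched in Python. The lineage map (which outputs were derived from which inputs) is an illustrative assumption; a real system would extract it from captured lineage records:

```python
# output_id -> set of input_ids it was derived from
lineage = {
    "out_a": {"in_1", "in_2"},
    "out_b": {"in_3"},
    "out_c": {"in_2", "in_3"},
}

def refresh(fixed_input, lineage):
    """Return only the outputs whose lineage traces back to the fixed input."""
    return sorted(out for out, ins in lineage.items() if fixed_input in ins)

# After fixing bad input in_2, only out_a and out_c need recomputing
print(refresh("in_2", lineage))  # ['out_a', 'out_c']
```

Everything else (here, `out_b`) is untouched by the fix and need not be replayed, which is the whole point of the technique.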
In dataflows that are complex and lengthy, manual debugging gets tedious and impractical. Finding faulty operators that cause errors in the data flow becomes near impossible as the system gets bigger.
Even if manual methods narrow down the operators, the lineage can still span many operators. Hence the need for automated debugging systems that narrow the set of potentially faulty operators to a point where manual examination becomes possible.
Data lineage tools implement the data lineage process and system. Using these tools, organizations and data scientists can monitor and track any data source, the transfer, and the transformations done on the data as it moves. These tools even provide an interactive visualization of the whole process, thus making it more explanatory.
This functions as a separate tool, since default ETL systems do not track lineage on their own. Hence, data lineage tools are configured alongside them to produce reports and visualizations of the lineage process. Examples of data lineage tools are Apache Atlas, Azure Purview, IBM DataStage, and Layer.
Apart from implementing the data lineage, a tool also provides several sub-tools for monitoring and maintaining the lineage system. These sub-tools come in handy especially when the lineage system is so complex and large that manual inspection becomes impractical.
This section will look at some use cases of data lineage in the real world and their purposes.
Companies that use data for analytics must know:
- where the data is coming from,
- transformations done on the data, and
- how the data is moving in the systems.
Data tracking involves end-to-end tracking of the data flow. Tracking ensures the consistency, legal compliance, and security of the data. In addition, data can move across several systems, taking up modifications. Therefore, tracking is essential to ensure the purity of the data.
Data scientists often require access to data that is not easily obtained through the IT team, because delivery of data from the IT team takes time. Using lineage, data scientists can easily locate and access the data they require.
Outdated data gives obsolete results that cannot inform business decisions. With lineage, data scientists can access the source directly and always work with up-to-date data.
Data governance ensures that the data is trustworthy and is not breaking any legal policies in any way. Similarly, it also ensures the consistency of the data since inconsistent data leads to incorrect business decisions.
A lineage system has two ends – source and destination. We can start from any end and trace back to the other. The rest of all operations (transformations and transfer) fall in between these two ends.
Types of data lineage
Data lineage is divided into three types based on the direction of traversal performed on the lineage diagram:
Backward data lineage
As the name suggests, tracking proceeds from the end back to the source. That is, starting from the warehouse or repository where the data is stored, we traverse back through its transformations to its source or origin. This tells us how the data got into its present form.
Forward data lineage
Forward data lineage is the opposite of backward data lineage: you trace the data flow from its source to its end. This is the most commonly used form of data lineage.
Forward data lineage helps us know how the data changes once it starts from its source and moves across processes and systems.
End-to-end data lineage
End-to-end data lineage combines the above two types, looking at the system from the source to its end and back.
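All three types can be viewed as traversals of the same lineage graph: forward lineage walks the edges as-is, and backward lineage walks them reversed. A minimal Python sketch (the node names and edges are illustrative):

```python
# upstream -> list of downstream nodes
edges = {
    "source.csv": ["clean_step"],
    "clean_step": ["warehouse"],
    "warehouse": [],
}

def downstream(node, edges):
    """Everything reachable from node by following edges (forward lineage)."""
    seen, stack = [], list(edges.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.append(n)
            stack.extend(edges.get(n, []))
    return seen

# Backward lineage is the same walk over reversed edges
reversed_edges = {}
for up, downs in edges.items():
    for down in downs:
        reversed_edges.setdefault(down, []).append(up)

print(downstream("source.csv", edges))          # forward:  ['clean_step', 'warehouse']
print(downstream("warehouse", reversed_edges))  # backward: ['clean_step', 'source.csv']
```

End-to-end lineage simply combines both walks for a given node.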
Metadata is data about data: data that describes other data. Data alone does the work, but metadata describes the data and its relationships in detail, making tasks easier. Hence, metadata is an essential component for understanding and working with any data. Metadata can be divided into the following categories:
Technical or physical metadata gives information about the physical storage of data. It typically covers databases, their views, schemas, tables, columns, etc. For example, column details include column information like data type, size, and description and even granular information like value pattern, value frequencies, completeness, and domain of that value.
Logical metadata defines a relationship to other assets or entities within the system. For example, logical metadata covers entities like customers, parties, addresses, etc. This acts as a base for the creation of physical assets where the entities exist.
Business metadata stores details regarding business processes. For example, an entity such as customer can belong to different departments – Finance, IT, Sales, etc. – and the definition of a customer may differ for each department. It is therefore necessary for business metadata to be defined and tracked along with the related business processes.
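The three categories can be illustrated side by side for a single column. The field names below are assumptions for illustration, not taken from any specific metadata standard:

```python
# Illustrative metadata for one warehouse column, split into the
# three categories discussed above
column_metadata = {
    "technical": {            # physical storage details
        "table": "customers",
        "column": "email",
        "data_type": "VARCHAR(255)",
        "nullable": False,
    },
    "logical": {              # relationships to other entities
        "entity": "Customer",
        "related_entities": ["Address", "Order"],
    },
    "business": {             # department-specific definitions
        "Sales": "Primary contact address for outreach",
        "Finance": "Address used on invoices",
    },
}

print(sorted(column_metadata))  # ['business', 'logical', 'technical']
```

Note how the same column carries one physical description but potentially several business definitions, one per department.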
Implementing data lineage is distributed among three phases:
- Preparation phase,
- Execution phase and
- Subscribe to survive phase.
Let us look at how each phase is executed and what technicalities are considered under each stage.
In the preparatory phase, the goal of the lineage is taken into consideration. This helps the organization determine:
- how lineage is used at present,
- what they expect from the system, and
- how they can put it to use.
Consider the users who will be using the lineage system or interacting with it – Is it an Analyst, auditing team, or any other employee? The source of the data is identified, and architecture diagrams are created. They define the transformation and transfer of data before it reaches the final state. Thus, the preparatory phase brings a clear idea of the purpose of lineage, the interacting users, and its design.
The execution phase is where the data lineage is implemented. The business logic is defined, generated, and linked to the data source. The data-fetching part is developed first; it links to the data source(s) and fetches the data for transformation and transfer.
The transformation and transfer logic of the lineage system is then implemented and connected to the data-fetching system. The fetched data is sent to this part of the workflow for processing. Finally, the target is linked to the end of the lineage, storing the data in a database.
Subscribe to survive is the final phase, where the lineage chain is monitored for further change. The source(s), target(s), or transformation/transfer systems can all change as business requirements change. Hence, the lineage system must be updated as the business situation changes so that it stays in sync with the business processes.
Let us now implement a simple data lineage system in practice and see how it works in real life. We will require a data source, a data warehouse (for demonstration purposes, we’ll use SQL database), and any programming language like Python for reading data from the source and writing to the warehouse.
For this demo, we will use multiple data sources, i.e., two CSV files containing details of students at a university. Each CSV file will have two columns – Name and Department. Assume that these files need to be stored together in a data warehouse. Given this scenario, let us design a lineage system that can track the source of the data received.
File1.csv (image created by author)
File2.csv (image created by author)
Here, the data sources are two CSV files with student details in them. Both data items are stored in a warehouse.
Here we use an SQL database as the data warehouse, where two columns represent the entities Name and Department, and a third column, named Source, shows the file from which each entry came. This way, it is easy to track the source of specific information, which helps resolve any mistakes that might come up later.
Data lineage often involves performing specific transformations on the data and moving it across many systems before it reaches the final stage. Here, we do not require any logical or mathematical transformations. Hence, we can use any programming language like Python or Java to read the files, extract the information, and insert it into the warehouse along with the filename. Once that process is done, this is what the warehouse would look like:
View of the final data in the warehouse (Image created by author)
The records from the CSV files have been inserted into the database, along with the source file to which each entry belongs. I will skip a full walkthrough of the programming part, as it is out of the scope of this article.
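For readers who want to try it, here is a minimal sketch of the load step using Python's built-in `csv` and `sqlite3` modules. The student rows are illustrative, and the CSV contents are held in memory for self-containment; real code would open File1.csv and File2.csv from disk instead:

```python
import csv
import io
import sqlite3

# Illustrative stand-ins for File1.csv and File2.csv
files = {
    "File1.csv": "Name,Department\nAlice,Physics\nBob,Chemistry\n",
    "File2.csv": "Name,Department\nCarol,Mathematics\n",
}

db = sqlite3.connect(":memory:")  # stand-in for the warehouse
db.execute("CREATE TABLE students (Name TEXT, Department TEXT, Source TEXT)")

for filename, content in files.items():
    for row in csv.DictReader(io.StringIO(content)):
        # The third column records the originating file – the lineage
        db.execute(
            "INSERT INTO students VALUES (?, ?, ?)",
            (row["Name"], row["Department"], filename),
        )

for name, dept, source in db.execute("SELECT * FROM students"):
    print(name, dept, source)
```

Every row in the warehouse now carries its source filename, so a bad entry can be traced straight back to the file it came from.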
Now that the lineage process is complete, let us generate a lineage diagram for this process. For developing a lineage diagram, the following components have to be taken into consideration:
- Data Source(s): Data sources have to be listed and where the data flows from has to be shown using arrows.
- Transfer / Transformation engine: For our specific demo, the only transfer occurs from source to warehouse. If any transformations happen, each of them should be specified separately using block diagrams. In this example, the transfer engine gets the data and the source name and forwards the data to the warehouse.
- Warehouse: The warehouse gets the data from the transfer engine and stores it in tabular format.
Lineage Diagram for above Scenario (Image created by author)
We have developed a simple data lineage system with two data sources (CSV files) and a single target warehouse in the above example. The transfer engine was also developed. The engine takes each data item from the data sources and stores it in the warehouse along with the data source name.
Certain practices can ensure that the full potential of the lineage system is achieved. Following these best practices ensures that the system is monitored and updated as the enterprise needs change. In this section, we will discuss a few practices which can ensure that the lineage system is put to the best use at all times:
- Assigning a person for tracking: Once the data owner has granted rights to the company, a data analyst/scientist can track who is using and who is modifying the data. This ensures the consistency of the data.
- Track data modification: Apart from authorized employees of the company, the owner also has the right to know where and how the data is being used. Through monitoring, the owner stays updated on the modifications being made to the data, so even the owner knows who is using it.
- Monitoring critical data: Not all data can be made available to the public since some are confidential to the company. Such data should be marked, tracked, and stored separately from the rest of the data. This ensures the privacy of the data.
- Up-to-date report generation: Generating the latest lineage reports helps track any errors that may come up. These errors can be fixed once they are spotted. Hence constant report generation is essential for troubleshooting.
- Storing data for the long term: Storing data for the long term is essential for future business purposes. Appending the incoming data to the already existing data adds value to the whole dataset. Instead of purging existing data, set up a dataset for storing all the data for the long term.
We have seen what a data lineage system is and how it functions. We also looked at the different types of lineage and how they benefit industries, and discussed the challenges data lineage faces in the modern world. We implemented a simple data lineage system using two file sources and one warehouse and observed how it helps ensure data purity. Finally, we discussed approaches to data lineage and best practices for bringing out the system's full potential.