Over the past few years, my work has involved a lot of Extract-Transform-Load (ETL) processes which use flat files to transmit data. These are typically a CSV file or some equivalent. For most of my projects, I've been on the receiving end of the data transmission. In my experience receiving data from many 3rd parties, most of the time spent is centered around validating the file and providing feedback to the sending party.
Typically an exchange of data is based on a standard data format, where the data fields, with their requirements and constraints, have been specified and agreed to by both parties. We might be sending something that looks like "FirstName,LastName,MiddleName,DateOfBirth,Deceased", but all that list expresses is the ordinal number of the fields. Each of these fields also has a list of rules associated with it to ensure its validity. For example, we might see the following in a data specification document:
FirstName Required: Yes Minimum Length: 1 character Maximum Length: 100 characters Format: None LastName Required: Yes Minimum Length: 1 character Maximum Length: 100 characters Format: None MiddleName Required: No MinimumLength: 1 character Maximum Length: 100 characters Format: None DateOfBirth Required: Yes MinimumLength: 10 characters MaximumLength: 10 characters Format: MM/DD/YYYY Constraints: Must not be a future date compared to file date Must be a valid date (can't be Feb. 30th for example) Must be after 1/1/1900 StatusCode Required: Yes Constraints: Must be a value from the set ["Yes","No","Unknown"]
We can identify a set of common rules or rule checks from this list of fields. For example:
- Required check
- Length check
- Format check
- Value in Code Set
- Value is a Valid Date
We also have some rules that don't seem very general. DateOfBirth must not be a future date in relation to the file date, and it must be after 1/1/1900. These might be rules that are reusable, but they also might only be useful for the concept of "Date of Birth." We can probably abstract a bit, and define a "range" of values. So we might have:
- Value in Valid Range
You could make the argument that "Value in Code Set" and "Value in Valid Range" are two specializations of the same concept - "Value in a Domain of Values" but I'm going to stay at this level of abstraction for now. That said, I might venture that way when it comes to write the actual code.
Surprisingly (or unsurprisingly, depending on your experience), organizations will often send data that does not meet the rules in the spec. This is especially common during the development and testing of a new interface, when the sending organization is building their programs for extracting the data and producing the files. For the receiving organization, it's can be a tedious process to validate files and send feedback. As I stated earlier, having an automated validation tool that can identify errors in a data file and produce a report of such errors is critical to making this feedback process efficient. It's even better when the sending organization can leverage this automated tool before sending the file to the receiver.
Since I've repeatedly seen the value of these validation tools, and Kai Labs continues to have a need for such tools moving forward, and I'm in search of a fun side project, my plan is to develop an open source tool to meet our needs at Kai Labs. Hopefully it will meet the needs of other developers, who may also improve the tool by contributing their expertise to the codebase.
The tool is called ETElevate, which is a pun on "ETL." We'll be publishing code on GitHub here:
There is nothing there yet, but we'll start contributing code soon. We will track and share the process here through this blog series.