In my last post, I talked about ETElevate mostly as a flat file rules engine and briefly mentioned that it might be a file ingestion tool as well. In this post, I want to go into more detail about the full scope of what I would like to build and include a high-level list of requirements.
I'm aware that there are other ETL tools on the market, such as SQL Server Integration Services (SSIS) and Informatica. Those are great tools for a lot of projects, and I encourage anyone who gets value from them to continue using them. In fact, I'll happily implement ETL processes in those tools for our customers.
ETElevate, at least for now, is supposed to be a fun open source project and an open development journal. I'm not trying to replace those aforementioned tools any time soon. However, like most developers who have worked a lot with a particular type of problem, I have ideas about how I think the tools and libraries to solve that problem should work. This blog series is really just about exploring those ideas and experimenting with implementing them.
So what is the full scope of our requirements? Here is a high-level guide to all the features I want to tackle in the current product roadmap:
Format Definition and Validation
I want to be able to specify the format and validation rules for an incoming flat file. I want to be able to specify this in code or in some external data format such as an XML or JSON file. Then, I want to be able to validate an actual file's content against those format and validation rules and produce a report of validation errors and successes.
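As a sketch only, an external JSON definition might look something like this. The schema and field names here are hypothetical, not a finalized format:

```json
{
  "fileFormat": {
    "type": "delimited",
    "delimiter": ",",
    "hasHeaderRow": true
  },
  "fields": [
    { "name": "CustomerId", "type": "string",  "required": true, "maxLength": 20 },
    { "name": "OrderDate",  "type": "date",    "required": true, "format": "yyyy-MM-dd" },
    { "name": "Amount",     "type": "decimal", "required": true, "min": 0 }
  ]
}
```

The same information could equally be expressed in code or XML; the point is that the definition is data the engine can validate files against and report on.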
File Delivery
I want ETElevate to support multiple types of file delivery. It should be able to monitor folders on a local drive, poll HTTP endpoints, or receive webhook file deliveries from SaaS systems.
Once a file is received, ETElevate should begin processing it, applying validation and processing logic to each record.
File manipulation before and after processing should be directly supported. Examples would be unzipping the file, decrypting it, or archiving it for later reference.
Notifications
I want to configure ETElevate to send notifications of different kinds when it encounters events that are important to me: file received, processing complete, errors encountered, and so on.
These should be configurable to notify interested parties in different ways, including emails and text messages.
Record Processing in C#
I want to write the code to process incoming records in C#. That way, I can do pretty much anything with the data and my only limits are what can be done with C#. For those common things that one would want to do with data, like call a web service, or insert into another database, we should provide some library help to make that as easy as possible. It might even be possible to make that configurable and part of a job definition.
For all the other cases that are more complex, I want to implement that code in C#.
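To make that concrete, here is a minimal sketch of what a record processor might look like. The `IRecordProcessor` interface is a hypothetical shape I'm imagining, not an existing ETElevate API:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical API sketch: ETElevate does not define this interface yet;
// the names here are illustrative only.
public interface IRecordProcessor
{
    // Return true if the record was processed, false to report it as an error.
    bool Process(IDictionary<string, string> record);
}

public class OrderRecordProcessor : IRecordProcessor
{
    public bool Process(IDictionary<string, string> record)
    {
        // Anything C# can do is fair game here: call a web service,
        // insert into a database, transform and forward, etc.
        if (!decimal.TryParse(record["Amount"], NumberStyles.Number,
                              CultureInfo.InvariantCulture, out var amount))
            return false;

        Console.WriteLine($"Order from {record["CustomerId"]}: {amount}");
        return amount >= 0;
    }
}
```

The engine would instantiate the processor, feed it validated records, and route failures according to the job's error-handling policy.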
Eventually we may want to support other languages like Python, but that's not a high priority right now.
Robust Error Handling and Interruptible Processing
ETElevate should be robust in the face of server or network failures. If a host server goes offline during a file processing job, that job should be resumable when the host comes back up. We should be able to inspect status at any stage of file processing and be able to view all system activities in real time.
Unexpected exceptions that occur during record processing should be handled in a way that is configurable per job. Some jobs may want the entire process to fail, while others may prefer to skip the bad record and move on to the next.
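As a sketch, per-job error behavior might be expressed in a hypothetical job configuration like this (the field names are illustrative, not a committed design):

```json
{
  "jobName": "IngestOrders",
  "errorHandling": {
    "onRecordError": "skipAndLog",
    "maxErrorsBeforeAbort": 100
  }
}
```

Here `"skipAndLog"` would move on to the next record, while a value like `"failJob"` would stop the whole job on the first error.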
Performance Tuning
We should be able to tune processing performance based on the type of job. If records can be processed in parallel, we should do that. We should not try to load huge files into memory at once; instead, we should process in discrete blocks that are big enough to provide good performance but small enough to conserve server resources appropriately.
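The block-based reading idea is easy to sketch in .NET today: `File.ReadLines` streams lines lazily, so only one batch needs to be in memory at a time. The batch size would be a per-job tuning knob; there's nothing special about any particular number:

```csharp
using System.Collections.Generic;
using System.IO;

public static class BatchReader
{
    // Lazily yields batches of lines so only one batch is in memory at a time.
    public static IEnumerable<List<string>> ReadInBatches(string path, int batchSize)
    {
        var batch = new List<string>(batchSize);
        foreach (var line in File.ReadLines(path)) // streams; never loads the whole file
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<string>(batchSize);
            }
        }
        if (batch.Count > 0)
            yield return batch; // final partial batch
    }
}
```

Each yielded batch could then be dispatched to a record processor, sequentially or in parallel depending on the job's configuration.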
Declarative Transformations
ETElevate should support configurable data transformations that can be applied declaratively to incoming data as it's being processed. We should be able to define available transforms at a global or job level and then use those transforms as needed within a processing job.
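A hypothetical job definition fragment might declare transforms like this (the transform names and shape are illustrative only):

```json
{
  "transforms": [
    { "name": "trimWhitespace", "appliesTo": ["CustomerId"] },
    { "name": "parseDate", "appliesTo": ["OrderDate"], "args": { "format": "yyyy-MM-dd" } }
  ]
}
```

Globally defined transforms would be referenced by name, so a job only needs to say which ones apply to which fields.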
Scheduling and Output Delivery
In addition to near-real-time processing and webhook support, we should also be able to schedule jobs to run at a certain time of day.
A job that ingests files may produce output that must be sent to another job or to some external party. We should be able to configure delivery mechanisms for these output steps, such as writing a file, posting to an HTTP endpoint, or sending an email with the results.
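A hypothetical output configuration for such a job might look like this (all fields illustrative):

```json
{
  "outputs": [
    { "type": "file",  "path": "/outbound/orders-summary.csv" },
    { "type": "http",  "url": "https://example.com/ingest-complete", "method": "POST" },
    { "type": "email", "to": "ops@example.com", "subject": "Order ingest results" }
  ]
}
```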
I think that's enough to keep us busy for a while! In the next post, we will start with Format Definition and Validation.
Thank you for reading.