Last time I talked about the high level requirements for ETElevate. In this post, I want to start tackling something that is at the very core of the software - Format Definition and Validation. This will just be the first pass at that functionality as there are sure to be additional complexities that are uncovered as we go. We'll implement this one piece at a time, making multiple passes throughout the whole project

Thinking about the formats we might deal with, there are two types of plain text flat files that I want to target right now:

Comma Separated Value (CSV) files

These files contain data fields separated by a comma delimiter with one record per line. They often contain a header row which contains a column name. You will sometimes find other metadata such as the row count or creation date in a header or a trailer record.

It is also common to see other types of delimiters such as the pipe | character when dealing with character delimited files.

Fixed Width Files

In these files, data fields are organized in fixed width columns. Each column has an allowed width and the data cannot be shorter or longer than that column width. There are no column delimiters as the character width defines the length of the column. Values are often padded with zeros or empty spaces depending on the type of the field.

Our Test Case

We're going to be developing this code against a number of test cases which will probably increase in complexity over time. Our first test case will be based on a somewhat realistic scenario involving healthcare data integration.

In our test scenario, we take the role of a health system which needs to receive lab results from community labs in order to monitor its patient population. In this case, we want to monitor and verify that every diabetic in our population had had a fasting blood glucose test and an A1C test in the last 6 months.

We have asked the community labs to send the data to us in a format that we have specified and provided to them. One of the sources can send these fields in a CSV file and the other can only send a fixed-width. The technical aspects of CSV vs fixed-width here are a bit contrived, but otherwise this is a very realistic scenario, and it will serve as a useful example for us.

We need to be able to ingest both formats (CSV and Fixed-Width) and apply the same validation rules before sending the records into our data processing pipeline.

We will be implementing this in C#, using unit testing techniques to validate our code along the way. Here is our starting solution which contains a class library and a test library:

The CSV Format

The first case I want to consider is the CSV. The format will look like this:

* Comma separated
* Each field is enclosed in double quotes
* First row contains header with field names

Fields are:
	1. First Name
		Required: Yes
		Max Length: 100 characters
	
	2. Last Name
		Required: Yes
		Max Length: 100 characters
		
	3. Patient Date of Birth
		Required: Yes
		Max Length: 10 characters
		Min Length: 10 characters
		Format: MM/DD/YYYY
	
	3. Patient ID
		Required: Yes
		Max Length: 15 characters
		Must match our patient ID format
		
	5. Observation Date (the date the test was taken) (we're not worrying about time right now)
		Required: Yes
		Max Length: 10 characters
		Min Length: 10 characters
		Format: MM/DD/YYYY

	6. Result Date (the date the test was resulted in the lab)
		Required: Yes
		Max Length: 10 characters
		Min Length: 10 characters
		Format: MM/DD/YYYY
		Must be greater than or equal to Observation date	
	
	4. Lab Test Type
		Required: Yes
		Max Length: 15 characters
		Must be a valid CPT code for either Fasting Blood Glucose or A1C		
		
	5. Result Value
		Required: Yes
		Max Length: 50 characters

There will eventually be a lot more to say about Result Value and its valid formats but let's start with this definition and see where it takes us.

Ideally, we will get to a point one day where we have a GUI (graphical user interface) that enables us write these definitions and then use them to validate files. We will want to have some definition format that lets us store the file definition in a file or database for later use. Right now, we need to concentrate on getting the behavior of the application correct, which will give us a solid foundation for building a GUI and an externalized file format.

One big reason why I'm choosing to start with unit tests from the beginning is that I don't want to start with implementing file ingestion. We will get to that eventually, but I'm most interested in resolving all the unknowns around format specification and how we will validate fields. I want to start implementing those features first. Using unit tests will allow us to run the data validation code without establishing the whole end-to-end solution.

With that all said, let's look at each field. We have a number of basic rules right from the start. We have Required (not null, not empty check), Max and Min lengths, and Format specifications. It's important to understand that when all this data comes into our program it will come in as raw text from flat files. We are dealing with simple primitive strings here, not the datatypes that we might be used to in C# like DateTime or Int32.

It looks to me that we have the following responsibilities that must be distributed to objects in our solution:

  • Store the string data in memory after it's been read from the file
  • Execute a validation rule against the string data and produce a result
  • Store the parameters of the validation rules (max length is 50 characters, for example)
  • Validate for required checks
  • Validate for min length checks
  • Validate for max length checks
  • Validate that test type is a valid CPT code
  • Validate that one date is greater than or equal to another date
  • Store all of this information about the definition of a file in some kind of code or data structure.
  • We will start with our data storage. At the moment, I only have the requirement to store the field data as a string, so we can start very simple with a dictionary that holds all data for a record and go from there.

    Again, we don't need to say anything about how this gets populated from a text file because we are deferring file ingestion for later. We have a very basic dictionary wrapper:

    public class DataRecord
    {
    	private Dictionary fieldValues = new Dictionary();
    
    	public void SetValue(string name, string value)
    	{
    		fieldValues[name] = value;
    	}
    
    	public string GetValue(string name)
    	{
    		if (!fieldValues.ContainsKey(name))
    		{
    			throw new ArgumentException($"{name} is not a known field name.");
    		}
    
    		return fieldValues[name];
    	}
    }
    

    And some simple nunit tests to validate its behavior:

    [TestFixture]
    public class DataRecordTests
    {
    	[Test]
    	public void CanSetAndGetValue()
    	{
    		var fieldName = "FirstName";
    		var fieldValue = "Michael";
    
    		var dataRecord = new DataRecord();
    		dataRecord.SetValue(fieldName, fieldValue);
    
    		Assert.AreEqual(fieldValue, dataRecord.GetValue(fieldName));
    	}
    
    	[Test]
    	public void CanOverrideValue()
    	{
    		var fieldName = "FirstName";
    		var fieldValue = "Michael";
    		var fieldValue2 = "John";
    
    		var dataRecord = new DataRecord();
    		dataRecord.SetValue(fieldName, fieldValue);
    		dataRecord.SetValue(fieldName, fieldValue2);
    
    		Assert.AreEqual(fieldValue2, dataRecord.GetValue(fieldName));
    	}
    }
    

    Next, we want something that can execute a rule against a particular field in our data record and produce a result. Let's start with our most basic case, which is: FirstName is required.

    We can implement some unit tests to verify this functionality. Note: we have chosen to let an unset field be an exception at this point. I'm not sure if that will need to change later, but if we try to access a field that the DataRecord does not know about, it throws an exception and there is a test to specify that.

    [TestFixture]
    public class ValidatorTests
    {
    	private const string fieldName = "FirstName";
    
    	[Test]
    	public void WhenFieldIsNotEmpty_Required_ReturnsTrue()
    	{
    		var dataRecord = new DataRecord();
    		dataRecord.SetValue(fieldName, "Michael");
    
    		var validator = new DataRecordValidator();
    		var isValid = validator.Required(dataRecord, fieldName);
    
    		Assert.IsTrue(isValid);
    	}
    
    	[Test]
    	[TestCase(null)]
    	[TestCase("")]        
    	public void WhenFieldIsNullOrEmpty_Required_ReturnsFalse(string fieldValue)
    	{
    		var dataRecord = new DataRecord();
    		dataRecord.SetValue(fieldName, fieldValue);
    		var validator = new DataRecordValidator();
    		var isValid = validator.Required(dataRecord, fieldName);
    
    		Assert.IsFalse(isValid);
    	}
    
    	[Test]
    	public void WhenFieldIsNeverSet_Required_ThrowsException()
    	{
    		var dataRecord = new DataRecord();            
    		var validator = new DataRecordValidator();
    
    		Assert.Throws(() => validator.Required(dataRecord, fieldName));
    	}
    }
    

    Next, let's consider the max length rules. They are implemented and tested in much the same way as Required. We just need to add the MaxLength method to our DataRecordValidator class:

    public bool MaxLength(DataRecord dataRecord, string fieldName, int maxLength)
    {
    	var value = dataRecord.GetValue(fieldName);
    
    	return string.IsNullOrEmpty(value) || value.Length <= maxLength;
    }
    

    Now let's jump to format because it's a bit more interesting. We want to validate that the incoming data meets a specific format and in the case of dates we want to ensure that it matches an MM/DD/YYYY format. We will use regular expressions to match the format and decide whether or not the field value is valid. Let's look at the code for that:

    [Test]
    [TestCase("BirthDate", "01/01/1980", @"[0-9]{2}\/[0-9]{2}\/[0-9]{4}")]
    [TestCase("BirthDate", "01/01/0000", @"[0-9]{2}\/[0-9]{2}\/[0-9]{4}")]
    [TestCase("BirthDate", "01/01/0001", @"[0-9]{2}\/[0-9]{2}\/[0-9]{4}")]
    [TestCase("BirthDate", "12/31/9999", @"[0-9]{2}\/[0-9]{2}\/[0-9]{4}")]        
    public void WhenFieldIsValidFormat_Format_ReturnsTrue(string fieldName, string fieldValue, string formatRegex)
    {
    	var dataRecord = new DataRecord();
    	dataRecord.SetValue(fieldName, fieldValue);
    
    	var validator = new DataRecordValidator();
    	var isValid = validator.FormatRegex(dataRecord, fieldName, formatRegex);
    
    	Assert.IsTrue(isValid);
    }
    

    Our test code here is pretty interesting. Note that some of the supplied values are valid in format but not necessarily valid in content. In other words, 01/01/0000 is not a valid date for our system and 12/31/9999 is probably not a valid date either. However, it does match the format, so it passes validation.

    Validating a format is a pretty low level general case. If we want to validate that something is a "valid date" or within the range of valid values, that is a new layer that we need to place on top of the format validation. We are now talking about the content of the values and not just the format/structure of the test.

    The interesting thing here is that content validation can take many forms. We want to validate that our date falls within a certain range, but we also want to validate that our result code is a valid CPT code. These are two implementations of the same abstract concept, which is to check if a value exists within a set of known valid values. Each implementation of this abstract concept will be different, but each can share the same outward-facing interface:

    interface IValueContentValidator
    {
    	bool CheckValue(string value);
    }
    

    Our first implementation of this will be ValidDateContentValidator. We want to allow for different date formats and cultures in this validator, so that will be part of the object constructor. We will attempt to convert the value from a string to a DateTime so that we can use the built in date logic functionality of C#'s DateTime structure. When we perform this conversion, we will automatically validate the value against the range of the values that the platform can support (i.e. DateTime.MinValue and DateTime.MaxValue). In addition to these platform limits, we also want to provide constructor parameters to allow for constraining the set of valid dates even further. For example, a birth date in our system should not be 1000 years in the future, even if C# can support that and recognizes it as a valid date in time.

    public class ValidDateContentValidator : IValueContentValidator
    {
    	private string dateFormat;
    	private CultureInfo cultureInfo;
    	private readonly DateTime minDate;
    	private readonly DateTime maxDate;
    
    	public ValidDateContentValidator(string dateFormat, CultureInfo cultureInfo)
    		: this(dateFormat, cultureInfo, DateTime.MinValue, DateTime.MaxValue)
    	{
    	}
    
    	public ValidDateContentValidator(string dateFormat, CultureInfo cultureInfo, DateTime minDate, DateTime maxDate)
    	{
    		this.dateFormat = dateFormat;
    		this.cultureInfo = cultureInfo;
    		this.minDate = minDate;
    		this.maxDate = maxDate;
    	}
    
    	public bool CheckValue(string value)
    	{
    		if (!DateTime.TryParseExact(value,
    			dateFormat,
    			cultureInfo,
    			DateTimeStyles.None,
    			out DateTime parsedDate))
    		{
    			return false;
    		}
    
    		return parsedDate >= minDate && parsedDate <= maxDate;
    	}
    }
    

    We then write a suite of tests to exercise this "valid date" functionality:

    [TestFixture]
    public class ValidDateContentValidatorTests
    {
    	private const string dateFormat = "MM/dd/yyyy";
    
    	[Test]
    	[TestCase("01/01/2020", dateFormat)]
    	[TestCase("12/31/2020", dateFormat)]
    	public void WhenIsValidDate_CheckValue_ReturnsTrue(string dateString, string dateFormat)
    	{
    		var dataRecord = new DataRecord();
    		dataRecord.SetValue("BirthDate", dateString);
    
    		var validDateValidator = new ValidDateContentValidator(dateFormat, CultureInfo.CurrentCulture);
    		var validator = new DataRecordValidator();
    
    		var isValid = validator.CheckContent(dataRecord, "BirthDate", validDateValidator);
    
    		Assert.IsTrue(isValid);
    	}
    
    	[Test]
    	[TestCase("1/1/20", dateFormat)]
    	[TestCase("1/1/2020", dateFormat)]
    	[TestCase("02/29/2019", dateFormat)]
    	[TestCase("02/30/2020", dateFormat)]
    	[TestCase("13/30/2020", dateFormat)]
    	[TestCase("99/30/9999", dateFormat)]
    	public void WhenIsNotValidDate_CheckValue_ReturnsFalse(string dateString, string dateFormat)
    	{
    		var dataRecord = new DataRecord();
    		dataRecord.SetValue("BirthDate", dateString);
    
    		var validDateValidator = new ValidDateContentValidator(dateFormat, CultureInfo.CurrentCulture);
    		var validator = new DataRecordValidator();
    
    		var isValid = validator.CheckContent(dataRecord, "BirthDate", validDateValidator);
    
    		Assert.IsFalse(isValid);
    	}
    
    	[Test]
    	[TestCase("01/01/1900", dateFormat, "01/01/1900", "12/31/9999")]
    	[TestCase("12/31/2020", dateFormat, "01/01/1900", "12/31/2020")]
    	public void WhenDateIsWithinRange_CheckValue_ReturnsTrue(string dateString, string dateFormat, string minDate, string maxDate)
    	{
    		var dataRecord = new DataRecord();
    		dataRecord.SetValue("BirthDate", dateString);
    
    		var validDateValidator = new ValidDateContentValidator(dateFormat, CultureInfo.CurrentCulture, 
    			DateTime.ParseExact(minDate, dateFormat, CultureInfo.CurrentCulture, DateTimeStyles.None),
    			DateTime.ParseExact(maxDate, dateFormat, CultureInfo.CurrentCulture, DateTimeStyles.None));
    
    		var validator = new DataRecordValidator();
    		var isValid = validator.CheckContent(dataRecord, "BirthDate", validDateValidator);
    
    		Assert.IsTrue(isValid);
    	}
    
    	[Test]
    	[TestCase("12/31/1899", dateFormat, "01/01/1900", "12/31/9999")]
    	[TestCase("01/01/1899", dateFormat, "01/01/1900", "12/31/2020")]
    	[TestCase("01/01/2021", dateFormat, "01/01/1900", "12/31/2020")]
    	[TestCase("01/01/2021", dateFormat, "01/02/2021", "12/31/2021")]
    	[TestCase("12/31/2021", dateFormat, "01/01/2022", "12/31/2022")]
    	public void WhenDateIsOutsideRange_CheckValue_ReturnsFalse(string dateString, string dateFormat, string minDate, string maxDate)
    	{
    		var dataRecord = new DataRecord();
    		dataRecord.SetValue("BirthDate", dateString);
    
    		var validDateValidator = new ValidDateContentValidator(dateFormat, CultureInfo.CurrentCulture,
    			DateTime.ParseExact(minDate, dateFormat, CultureInfo.CurrentCulture, DateTimeStyles.None),
    			DateTime.ParseExact(maxDate, dateFormat, CultureInfo.CurrentCulture, DateTimeStyles.None));
    
    		var validator = new DataRecordValidator();
    		var isValid = validator.CheckContent(dataRecord, "BirthDate", validDateValidator);
    
    		Assert.IsFalse(isValid);
    	}
    }
    

    Finally, we add a method to our DataRecordValidator that allows it to use any implementation of our IValueContentValidator interface:

    public bool CheckContent(DataRecord dataRecord, string fieldName, ValidDateContentValidator validDateValidator)
    {
    	var value = dataRecord.GetValue(fieldName);
    	
    	return validDateValidator.CheckValue(value);
    }
    

    Now, as we turn to look at our Result Code validator, we can re-use much of this functionality and only implement the parts that are specific to CPT codes. We can generalize the concept of "CPT Codes" a bit further, and think in terms of a general "Code List." As we start to implement different file formats, there will be many instances of "Code Lists" that contain the valid list of values. However, given a list of codes, there will be a limited number of ways to look up a code within that list. So, we will implement the simplest solution here, which is to provide our validator with a List of valid codes at construction time.

    The ValidCodeContentValidator is fairly simple:

    public class ValidCodeContentValidator : IValueContentValidator
    {
    	private readonly IList codeList;
    
    	public ValidCodeContentValidator(IList codeList)
    	{
    		this.codeList = codeList;
    	}
    
    	public bool CheckValue(string value)
    	{
    		if (string.IsNullOrEmpty(value))
    		{
    			return false;
    		}
    
    		return codeList.Contains(value);
    	}
    }
    

    It's worth noting that this code has the following characteristics:

    1. It's case sensitive. The incoming value casing must match the case of the code value. This may not be what we want in practice but I'm going to defer any discussion for later review. Right now we don't know if we always want it to be case insensitive or not. We could always add a flag to the constructor to construct it in a case sensitive vs case insensitive way.

    2. We do not allow nulls and string.Empty to pass through and validate against this list. I can't think of a valid case for that right now, so I'm going to explitly exclude those values from validity and write a test to confirm that.

    Here are the tests for our new validator:

    [TestFixture]
    public class ValidCodeContentValidatorTests
    {
    	private static readonly List validCodes = new List { "001", "002", "003", "CodeFour", "CodeFive"};
    
    	[Test]
    	[TestCase("001")]
    	[TestCase("002")]
    	[TestCase("003")]
    	[TestCase("CodeFour")]
    	[TestCase("CodeFive")]
    	public void WhenCodeIsInList_CheckValue_ReturnsTrue(string fieldValue)
    	{
    		var validator = new ValidCodeContentValidator(validCodes);            
    		var isValid = validator.CheckValue(fieldValue);
    
    		Assert.IsTrue(isValid);
    	}
    
    	[Test]
    	[TestCase("CodeOne")]
    	[TestCase("CodeTwo")]
    	[TestCase("CodeThree")]
    	[TestCase("CODEFOUR")]
    	[TestCase("CODEFIVE")]
    	public void WhenCodeIsNotInList_CheckValue_ReturnsFalse(string fieldValue)
    	{
    		var validator = new ValidCodeContentValidator(validCodes);
    		Assert.IsFalse(validator.CheckValue(fieldValue));            
    	}
    
    	[Test]
    	public void WhenValueIsNullAndNotInList_CheckValue_ReturnsFalse()
    	{
    		var validator = new ValidCodeContentValidator(validCodes);
    
    		Assert.IsFalse(validator.CheckValue(null));
    	}
    
    	[Test]
    	public void WhenValueIsNullAndInList_CheckValue_ReturnsFalse()
    	{
    		var validCodesWithNullAndEmpty = validCodes.Concat(new[] { null, string.Empty }).ToList();
    		var validator = new ValidCodeContentValidator(validCodesWithNullAndEmpty);
    
    		Assert.IsFalse(validator.CheckValue(null));
    		Assert.IsFalse(validator.CheckValue(string.Empty));
    	}
    }
    

    Now we have pretty good coverage of the rules that we have as part of our format and content validation according to our file spec. We have not touched them all. For example, we haven't looked at validating the relationship between ObservationDate and ResultDate yet.

    Next time we will expand on the code above and think about how this code might need to change to support an external file defintiion. We will consider how to define our file format in code and how that translates directly into the execution of the rules that we've already written today.

    As we close this post, our solution looks like this:

    Browse the GitHub repository at this point in its commit history

    GitHub Repository Home

    Thanks for reading!