In today's post, which is part 4 of our ETElevate Project development journal, I would like to explore a model for specifying our file formats.

From our sample format described in part 3, we know that we want to specify several fields, each having its own name, format specifications, and content validations.

Let's take them one by one and figure out how to specify this in a code model. I'm addressing this as a code model first, before thinking about how a user might specify these externally (in a config file or database table), because the external specification is just a way of expressing the configuration that will ultimately be reified into objects at runtime. It's more productive for me to work out the code model that is needed to express our requirements and then figure out how to specify that in the external configuration.

We know that our file is a CSV (comma-separated values) flat file with variable-length fields. We also know that the first row contains column headers. Our first field is:

First Name
	Required: Yes
	Max Length: 100 characters

We can develop a structure for holding this specification. We will keep it intentionally generic, data-oriented, and structural rather than object-oriented and behavioral, because we know that we will need to translate this into a data file at some point, and we want it to be easy to serialize this format to disk. This is only the first step in bringing the file spec into our system.

[Test]
public void CanBuildFileSpecWithFields()
{
    var fileSpec = new FileSpec();
    fileSpec.FileType = FileType.CommaSeparatedValues;
    fileSpec.FirstLineIsColumnHeaders = true;

    var firstNameFieldSpec = new FieldSpec();            
    firstNameFieldSpec.Name = "First Name";
    
    firstNameFieldSpec.ValidatorSpecs.Add(
        new ValidatorSpec 
        { 
            Type = ValidatorType.Required 
        });

    firstNameFieldSpec.ValidatorSpecs.Add(
        new ValidatorSpec 
        { 
            Type = ValidatorType.MaxLength,
            Parameters = new List<ValidatorSpecParameter>
            {
                new ValidatorSpecParameter { Name = "MaxLength", Value = 100 }
            }
        });

    fileSpec.FieldSpecs.Add(firstNameFieldSpec);

    var createdFieldSpec = fileSpec.FieldSpecs.SingleOrDefault(fs => fs.Name == "First Name");
    Assert.IsNotNull(createdFieldSpec);
    Assert.IsTrue(createdFieldSpec.ValidatorSpecs.Any(vs => vs.Type == ValidatorType.Required));
    Assert.IsTrue(createdFieldSpec.ValidatorSpecs.Any(vs => vs.Type == ValidatorType.MaxLength));
}

Our test is deliberately very basic. In fact, it doesn't verify much of anything. It's just there to give us a workspace to play around with defining a file spec. Now that we have the file spec object, we can think about how to process this file spec and turn it into executable code.

We are able to create a FileSpec object, which is the root object of the spec. This contains all the basic information that will be needed in order to know how to process the file. In this test, we denote this file spec as a CSV with the first line being column headers.

Then, our FileSpec contains a list of FieldSpecs. We will need one instance of a FieldSpec for each field in our file. We give it basic info such as a "Name" and then we add a list of ValidatorSpec objects to its ValidatorSpecs list property.

Each of these ValidatorSpec objects describes what type of validator we are creating and contains a generic list for providing parameters to that validator. A "Required" validator doesn't have any parameters right now, but a MaxLength validator needs to know what the max length actually is. In this case it's 100 characters, and we specify this as a ValidatorSpecParameter named "MaxLength."
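I haven't shown the spec classes themselves, but based on how the test uses them, they could be as simple as the following sketch. The property names come straight from the test; everything else here (the enum members, the use of object for parameter values) is my assumption, not necessarily the final shape:

// Sketch of the structural spec classes implied by the test above.
// Plain properties and lists keep this easy to serialize later.
public enum FileType
{
    CommaSeparatedValues
}

public enum ValidatorType
{
    Required,
    MaxLength
}

public class FileSpec
{
    public FileType FileType { get; set; }
    public bool FirstLineIsColumnHeaders { get; set; }
    public List<FieldSpec> FieldSpecs { get; set; } = new List<FieldSpec>();
}

public class FieldSpec
{
    public string Name { get; set; }
    public List<ValidatorSpec> ValidatorSpecs { get; set; } = new List<ValidatorSpec>();
}

public class ValidatorSpec
{
    public ValidatorType Type { get; set; }
    public List<ValidatorSpecParameter> Parameters { get; set; } = new List<ValidatorSpecParameter>();
}

public class ValidatorSpecParameter
{
    public string Name { get; set; }

    // The test assigns an int here; object keeps parameters generic.
    public object Value { get; set; }
}

Because these are pure data holders with no behavior, serializing a FileSpec to disk later should be trivial.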

Now we need to do something with this spec. The important concept here is that we have a generic structural data model that we need to transform into a dynamic behavioral object model. What I'm going to do now is write the behavioral object model. Once that is functional, we will create a FileSpec reader which translates our static specification into a dynamic object model at runtime.

To start, we will consider the behavior of taking a Stream and reading it into DataRecord objects. An object which can read CSV data from an input stream would look something like this:

public class CommaSeparatedValuesReader : IFileReader
{
    private readonly bool firstLineIsColumnHeaders;
    private readonly DataRecordBuilder dataRecordBuilder;
    private int currentLine = 0;

    public CommaSeparatedValuesReader(bool firstLineIsColumnHeaders, DataRecordBuilder dataRecordBuilder)
    {
        this.firstLineIsColumnHeaders = firstLineIsColumnHeaders;
        this.dataRecordBuilder = dataRecordBuilder;
    }

    public DataRecord ReadNextDataRecord(StreamReader reader)
    {
        var fields = ReadNextDataLine(reader);
        return dataRecordBuilder.Build(fields);
    }

    private IList<string> ReadNextDataLine(StreamReader reader)
    {
        if (currentLine == 0 && firstLineIsColumnHeaders)
        {
            DiscardHeaderLine(reader);
        }

        var lineData = reader.ReadLine();

        // Deliberately naive implementation placeholder for parsing the CSV line.
        return lineData.Split(',');
    }

    private void DiscardHeaderLine(StreamReader reader)
    {
        reader.ReadLine();
        currentLine++;
    }
}

The IFileReader interface is there to allow other components to leverage this object for producing DataRecord instances without knowing the specific implementation. This will become important when we start looking at other file types such as fixed-width files. It looks like this:

public interface IFileReader
{
    DataRecord ReadNextDataRecord(StreamReader reader);
}

The CommaSeparatedValuesReader object has some basic functionality for reading CSVs. It is able to discard the header line automatically if a header line is expected and to split the comma-separated values using the string.Split() method. Keep in mind, this is an extremely naive implementation. Real-world CSVs will not yield to a simple string.Split(), and we will have to revisit this code. However, it's good enough for now because we're just trying to get the general shape of the code in place.
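To see why string.Split() isn't enough, consider a field that contains a comma. RFC 4180 allows this by wrapping the field in quotes, but string.Split() knows nothing about quoting (made-up data for illustration):

// A quoted field containing a comma, as RFC 4180 allows:
var line = "\"Bledsoe, Jr.\",Michael";

// string.Split() splits inside the quoted field, yielding three
// values instead of the intended two.
var fields = line.Split(',');
// fields[0] == "\"Bledsoe"
// fields[1] == " Jr.\""
// fields[2] == "Michael"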

Aside from the field splitting, this object doesn't do much else on its own. It delegates the bulk of the work to an instance of DataRecordBuilder. This object has the knowledge of the fields contained in the file and it uses that knowledge to extract the data and put it into the DataRecord object.

DataRecordBuilder is a behavioral version of our list of FieldSpecs:

public class DataRecordBuilder
{
    private readonly Dictionary<int, string> fields = new Dictionary<int, string>();

    public DataRecord Build(IList<string> fieldDataList)
    {
        var dataRecord = new DataRecord();

        foreach (var index in fields.Keys)
        {
            var fieldData = fieldDataList[index];
            var name = fields[index];

            dataRecord.SetValue(name, fieldData);
        }

        return dataRecord;
    }

    public void AddField(int index, string name)
    {
        fields.Add(index, name);            
    }
}
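I also haven't shown DataRecord itself. Given the SetValue and GetValue calls it needs to support, a minimal dictionary-backed version would do (a sketch; the real class may grow more behavior once validation comes into play):

// Minimal sketch of DataRecord: a named bag of field values.
public class DataRecord
{
    private readonly Dictionary<string, string> values = new Dictionary<string, string>();

    public void SetValue(string name, string value)
    {
        values[name] = value;
    }

    public string GetValue(string name)
    {
        return values[name];
    }
}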

But where does the list of fields come from, and who calls AddField? We will get to that soon. First, a couple of unit tests that exercise our model to get the fields out of a stream and into DataRecord objects:

[Test]
public void CanReadFieldFromStreamWithHeaderRow()
{
    var stream = new MemoryStream();
    var writer = new StreamWriter(stream);

    writer.WriteLine("FIRST NAME,LAST NAME");
    writer.WriteLine("Michael,Bledsoe");
    writer.WriteLine("John,Doe");
    writer.Flush();
    
    stream.Position = 0;

    var reader = new StreamReader(stream);

    var processor = new CommaSeparatedValuesReader(true, CreateTestDataRecordBuilder());
    var dataRecord1 = processor.ReadNextDataRecord(reader);

    Assert.AreEqual("Michael", dataRecord1.GetValue("First Name"));
    Assert.AreEqual("Bledsoe", dataRecord1.GetValue("Last Name"));

    var dataRecord2 = processor.ReadNextDataRecord(reader);
    Assert.AreEqual("John", dataRecord2.GetValue("First Name"));
    Assert.AreEqual("Doe", dataRecord2.GetValue("Last Name"));

    stream.Close();
}

[Test]
public void CanReadFieldFromStreamWithoutHeaderRow()
{
    var stream = new MemoryStream();
    var writer = new StreamWriter(stream);

    writer.WriteLine("Michael,Bledsoe");
    writer.WriteLine("John,Doe");
    writer.Flush();

    stream.Position = 0;

    var reader = new StreamReader(stream);

    var processor = new CommaSeparatedValuesReader(false, CreateTestDataRecordBuilder());
    var dataRecord1 = processor.ReadNextDataRecord(reader);

    Assert.AreEqual("Michael", dataRecord1.GetValue("First Name"));
    Assert.AreEqual("Bledsoe", dataRecord1.GetValue("Last Name"));

    var dataRecord2 = processor.ReadNextDataRecord(reader);
    Assert.AreEqual("John", dataRecord2.GetValue("First Name"));
    Assert.AreEqual("Doe", dataRecord2.GetValue("Last Name"));

    stream.Close();
}

private DataRecordBuilder CreateTestDataRecordBuilder()
{
    var dataRecordBuilder = new DataRecordBuilder();
    dataRecordBuilder.AddField(0, "First Name");
    dataRecordBuilder.AddField(1, "Last Name");

    return dataRecordBuilder;
}

In these tests, we set up a stream with our sample data. The first test writes a header line followed by two data lines; the second writes only the two data lines. Then, we construct our CommaSeparatedValuesReader and validate that we can read the data out of the Stream using the field names that we've chosen.

At this point, we have a structural data specification, which is the FileSpec object. We also have a behavioral object model, which is the CommaSeparatedValuesReader object. What we still need to build is the bridge between these two objects. We need to load up the FileSpec at runtime and transform it into a CommaSeparatedValuesReader. We will create a new object to do that, the FileReaderFactory:

public class FileReaderFactory
{
    public IFileReader CreateFileReader(FileSpec fileSpec)
    {
        switch (fileSpec.FileType)
        {
            case FileType.CommaSeparatedValues:
                return new CommaSeparatedValuesReader(fileSpec.FirstLineIsColumnHeaders, CreateDataRecordBuilder(fileSpec));
            default:
                throw new ArgumentException($"Unable to construct reader for FileType: {fileSpec.FileType}");
        }            
    }

    private DataRecordBuilder CreateDataRecordBuilder(FileSpec fileSpec)
    {
        var dataRecordBuilder = new DataRecordBuilder();

        for (int i = 0; i < fileSpec.FieldSpecs.Count; i++)
        {
            var fieldSpec = fileSpec.FieldSpecs[i];
            dataRecordBuilder.AddField(i, fieldSpec.Name);
        }

        return dataRecordBuilder;
    }
}

This class is meant to encapsulate the creational logic for our CommaSeparatedValuesReader and any other flat file readers we develop in the future. At this point, I don't know if the creational logic will stay here forever, but it is good enough for now.
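To tie the pieces together, here is roughly how these objects would be used end to end. The file name and the console output are placeholders for illustration:

// Build a spec, turn it into a reader, and pull a record from a file.
var fileSpec = new FileSpec
{
    FileType = FileType.CommaSeparatedValues,
    FirstLineIsColumnHeaders = true
};
fileSpec.FieldSpecs.Add(new FieldSpec { Name = "First Name" });
fileSpec.FieldSpecs.Add(new FieldSpec { Name = "Last Name" });

var factory = new FileReaderFactory();
IFileReader fileReader = factory.CreateFileReader(fileSpec);

// "people.csv" is a hypothetical input file.
using (var reader = new StreamReader("people.csv"))
{
    var dataRecord = fileReader.ReadNextDataRecord(reader);
    Console.WriteLine(dataRecord.GetValue("First Name"));
}

Note that the naive reader doesn't yet signal end-of-stream, so a real read loop is another thing we'll need to revisit.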

The next concern to address is the execution of our ValidatorSpec definitions, but we will save that for the next post.

Browse the GitHub repository at this point in its commit history

GitHub Repository Home

Thank you for reading!