Friday, February 7, 2014

Normalization Musings

Musings on Database Normalization

Much of the work I do has to do with formatting the presentation layer of data. The typical data flow is (a) a vendor sends data, (b) we transform and store it in a format somewhere between what they send us and what is required by the display layer (website), and (c) procedures are written against this data.

Having the luxury of placing the burden of data integrity on the vendor, in combination with an asynchronous replication model, means that most of the time we cannot and do not use surrogate keys or foreign keys. Some of the time this works fine; actually, most of the time it works fine. Of course, if the data consumption architecture deviates from a simple truncate/insert, correcting data can be vexing.

I had my first taste of self-inflicted bad database design on a recent project. Hindsight being what it is, I would have done many things differently (not the least of which would have been more planning before developing), but as it is, it was a good learning experience.

I was given a data set which I didn't understand well and a limited amount of time to develop against it. The data sets were effectively fact tables from a data warehouse; some with provided dimension files and some without. (By the way, for reasons I won't get into, SSAS was not an option.) The data sets were also quite large (fact tables around 100 million records with all the different dimensions factored in). To add insult to injury, the default format the data came in was name-value pairs rather than columns (i.e., {Name: Price, Value: 3.14} as opposed to {Price: 3.14}).

Without surrogate keys, a primary key might easily have 8 columns, including a generic “Name” column, multiplying the number of records by a factor of however many members there were in the table.
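
To make that concrete, here's a rough sketch of the NVP shape I'm describing; the table and column names are invented for illustration, not the actual schema:

```sql
-- Hypothetical NVP-style fact table: one row per measure per business key.
-- The generic "Name" column has to be part of the primary key, and every
-- distinct measure name multiplies the row count.
CREATE TABLE dbo.FactMeasureNVP
(
    VendorID     INT           NOT NULL,
    ProductID    INT           NOT NULL,
    RegionID     INT           NOT NULL,
    ChannelID    INT           NOT NULL,
    ScenarioID   INT           NOT NULL,
    CurrencyCode CHAR(3)       NOT NULL,
    AsOfDate     DATE          NOT NULL,
    MeasureName  VARCHAR(50)   NOT NULL,  -- the generic "Name" column
    MeasureValue DECIMAL(18,6) NULL,
    CONSTRAINT PK_FactMeasureNVP PRIMARY KEY
        (VendorID, ProductID, RegionID, ChannelID, ScenarioID,
         CurrencyCode, AsOfDate, MeasureName)
);
```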

There were quite a few revisions I undertook to optimize both retrieval and storage. The first was to do away with the NVP structure and use dedicated columns. Sure, new values required a little bit of dev work to hook up, but even with a schema-less NVP format, some amount of work would still need to be done to hook up new data points. That cut table sizes dramatically. In combination with data compression, the data set was now looking quite manageable.
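
For comparison, the pivoted version looked something like this sketch (again, made-up names), with one row per business key and page compression layered on top:

```sql
-- Same grain, but with dedicated measure columns. New measures need a small
-- schema change, but the row count drops by a factor of the number of
-- distinct measure names, and page compression shrinks it further.
CREATE TABLE dbo.FactMeasure
(
    VendorID     INT           NOT NULL,
    ProductID    INT           NOT NULL,
    RegionID     INT           NOT NULL,
    ChannelID    INT           NOT NULL,
    ScenarioID   INT           NOT NULL,
    CurrencyCode CHAR(3)       NOT NULL,
    AsOfDate     DATE          NOT NULL,
    Price        DECIMAL(18,6) NULL,
    Quantity     DECIMAL(18,6) NULL,
    MarketValue  DECIMAL(18,6) NULL,
    CONSTRAINT PK_FactMeasure PRIMARY KEY
        (VendorID, ProductID, RegionID, ChannelID, ScenarioID,
         CurrencyCode, AsOfDate)
)
WITH (DATA_COMPRESSION = PAGE);
```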

But that’s not the main point I want to make (although it is a good example of why NVP models suck).

The next step I took (and what I should have done from the beginning) was to really get to know the data and understand the natural hierarchies and unique entities. Since not all of this data was provided as reference data by the vendor, I built a process to construct inferred dimensions from the incoming files and maintain surrogate identifiers as necessary.
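
The inferred-dimension process was essentially an "insert anything we haven't seen yet" step, run before the facts were loaded. Roughly (names made up for illustration):

```sql
-- An inferred dimension: the surrogate key is handed out by IDENTITY,
-- the natural key comes from the vendor file.
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ProductCode VARCHAR(20)       NOT NULL UNIQUE,
    ProductName VARCHAR(100)      NULL
);

-- Add any member that shows up in the incoming file but isn't known yet.
INSERT INTO dbo.DimProduct (ProductCode, ProductName)
SELECT DISTINCT s.ProductCode, s.ProductName
FROM   staging.FactFile AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.DimProduct AS d
                   WHERE  d.ProductCode = s.ProductCode);
```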

What I ended up with was two conceptual layers in the database: a decomposed and normalized version of the fact tables the vendor sent, and a set of custom pre-aggregated tables to serve specific uses and surface data quickly (where indexing alone was insufficient). The main advantage of the normalized structure was the amount of work it removed from building and maintaining these “second-level” aggregate data structures. Being able to reference master records of data ahead of time, and logically know where to look for everything, made the process of pre-populating tables significantly easier. Plus it put much more control over the integrity of the data into my hands.
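
The payoff showed up when rebuilding those "second-level" tables: each one boiled down to a truncate-and-reload from the normalized layer, along these lines (illustrative names, and assuming the fact table has already been rekeyed to the surrogate keys):

```sql
-- Rebuild one pre-aggregated table from the normalized layer. Names and
-- hierarchies live in the dimensions, so the aggregate only stores keys
-- and numbers and can be regenerated at will.
TRUNCATE TABLE dbo.AggDailyProductTotals;

INSERT INTO dbo.AggDailyProductTotals (AsOfDate, ProductKey, TotalValue, RowCnt)
SELECT f.AsOfDate,
       d.ProductKey,
       SUM(f.MarketValue),
       COUNT(*)
FROM   dbo.FactMeasure AS f
       JOIN dbo.DimProduct AS d
         ON d.ProductKey = f.ProductKey
GROUP BY f.AsOfDate, d.ProductKey;
```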

The next project I worked on was a departure from the model I described in the first paragraph (vendor -> store -> present). Instead I built a database drawing upon both internal and external data sets. I used many lessons from the project I described earlier. First and foremost, I took (had?) the time to plan out the database ahead of time. Wherever possible, I normalized things out; in a few cases even so far as to have reference tables with only one or two id/value rows in them. I created surrogate identifiers where disparate data sources needed to be standardized, I documented the database, and I made use of foreign keys for reasonably static reference data.
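
Even the tiny lookups earned their keep. An invented example: two rows, but the value still has a single home that everything else references by key:

```sql
-- A deliberately small reference table: two id/value rows.
CREATE TABLE dbo.RefAccountSide
(
    AccountSideID TINYINT     NOT NULL PRIMARY KEY,
    AccountSide   VARCHAR(10) NOT NULL UNIQUE
);

INSERT INTO dbo.RefAccountSide (AccountSideID, AccountSide)
VALUES (1, 'Long'),
       (2, 'Short');
```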

First the bad. Although surrogate keys are, to an extent, supposed to be business-process agnostic, there was a need to uniquely identify entities which had no natural key. As such, while the value of a surrogate key should be irrelevant when setting up new items, those identifiers made their way into descriptive models and processes. There was distinct logic written around EntityID:1 vs. EntityID:2. This means that when moving data from environment to environment, updating values, and correcting data, these identifiers have to be maintained. MyFirstEntity must always be ID:1, and MySecondEntity must always be ID:2, not just randomly assigned numbers. Perhaps there’s a better way to uniquely identify elements with no natural key, but short of fuzzy logic, I’m not aware of one.
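
The practical consequence is that those "irrelevant" surrogate values have to be deployed deliberately rather than left to chance. In SQL Server that tends to look something like this (dbo.Entity is an invented stand-in for one of those tables):

```sql
-- Pin the surrogate values so downstream logic keyed to EntityID 1 and 2
-- behaves the same in every environment.
SET IDENTITY_INSERT dbo.Entity ON;

INSERT INTO dbo.Entity (EntityID, EntityName)
VALUES (1, 'MyFirstEntity'),   -- must be 1 everywhere
       (2, 'MySecondEntity');  -- must be 2 everywhere

SET IDENTITY_INSERT dbo.Entity OFF;
```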

Additionally, foreign keys, while central to database integrity, tie your hands a bit in terms of how you can modify an immature database. For instance, during early development cycles, I was changing table structures every few days. Foreign keys did their job admirably and prevented me from taking shortcuts to restructure the database and data; objectively a good thing, but it takes some getting used to.
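
For example, restructuring a referenced table meant taking the constraint off first and putting it back (and re-checking existing rows) afterwards, something along these lines (names invented):

```sql
ALTER TABLE dbo.FactPosition
    DROP CONSTRAINT FK_FactPosition_DimProduct;

-- ...restructure dbo.DimProduct here...

ALTER TABLE dbo.FactPosition WITH CHECK
    ADD CONSTRAINT FK_FactPosition_DimProduct
    FOREIGN KEY (ProductKey) REFERENCES dbo.DimProduct (ProductKey);
```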

Now the good. As with the first project I mentioned, I had to build several consolidated tables in order for front-end queries to perform optimally, but having everything structured logically made building the new structures embarrassingly simple once the foundation was configured. Highly normalized (maybe 3NF) databases that must support a wide variety of complex queries do present problems for systems with low latency tolerance, not all of which can be ameliorated through indexing and query design alone. But again, building de-normalized structures off of the normalized data, over which I had full control, was simple compared to trying to maintain 20 de-normalized tables and having to make sure the name of an entity never got out of sync between them.

I've worked with several highly normalized systems in the past, and while the learning curve to understand the data can be steeper than if you were looking at a de-normalized table, report or spreadsheet, once you understand the data, it makes getting what you need a much more sane process.

Foreign keys were the unsung heroes of this database. They’re not as obviously helpful as normalization or other structural advantages, but they prevent things from going wrong. Every day that I don’t have orphaned records, they’re doing their job. Having the assurance that the schema I've spent so long tinkering with and refining will work correctly is well worth the effort. Sure, it takes time to ensure I delete from tables in the correct order and don’t add sub-entities before their parent, but I doubt anyone would say that avoiding the cost of learning to do things in the right order is worth having an unstable database.
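
A minimal sketch of what that looks like day to day (invented names again): the constraint is one line, and it quietly dictates the order of every insert and delete that follows:

```sql
-- Children may only reference parents that exist.
ALTER TABLE dbo.SubEntity WITH CHECK
    ADD CONSTRAINT FK_SubEntity_Entity
    FOREIGN KEY (EntityID) REFERENCES dbo.Entity (EntityID);

-- Deletes have to run children-first; doing it the other way round fails
-- instead of silently leaving orphans behind.
DELETE FROM dbo.SubEntity WHERE EntityID = 2;
DELETE FROM dbo.Entity    WHERE EntityID = 2;
```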

There were certain instances where I made considered decisions to avoid foreign keys (such as on tables with very heavy traffic, and/or where an orphaned record would have no impact on the system), and I stand by those decisions.

I’m guessing normalization fans have been thinking “we told you so” since the start of the article, if not earlier. These are things I had no idea about when I started working with SQL. I was educated in a certain way, and the methods used by my mentors just became ingrained. I really had no significant exposure to highly normalized databases or foreign keys when I started, and nobody took the time to explain their implications (positive and negative); I wish someone had. There are still many cases where I forgo these database practices and probably will again in the future, and maybe some of the time they’ll come back to haunt me, but if I can give anyone who doesn’t use these tools some advice, it would be to strongly consider them. These were tools I dismissed because others dismissed them, and I assumed the problems that arose from them were just the way things were. I found excuses not to break form, but fortunately over time my curiosity won out, and I’m glad it did. As Robert Frost said, “… And I, I took the [road] less traveled by, and that has made all the difference.”
