Friday, February 7, 2014

Kaleidoscopics

I got to thinking about brains today, which, as any person with an overactive imagination knows, is sort of a black hole. But regardless, my thoughts shifted to consciousness, our understanding of the world around us, and psychedelics (specifically DMT and other chemicals which occur naturally in the brain).

I'm not one to believe that much (if any) truly astounding wisdom can be derived from the use of drugs, but it does raise the question of why such chemicals naturally exist in our brains, and why adding more of them (or sympathetic chemicals) has such a profound effect on us perceptually, spiritually and emotionally.

As I started to think about the problem from a programming perspective, I wondered what their role could be, and I came up with the idea of "kaleidoscopics". Really this is just a made-up word that I thought sounded cool, but the idea is that it's an agent which casts doubt upon normal inputs.

It's difficult to design a system which is both deterministic and "intelligent" in the sense which humans typically ascribe to the word. A simple system would be an if-then statement. More complex systems introduce more and more variables, until you have a system which, like a brain or an ant colony, takes on emergent properties. There's obviously a lot more going on than the simple interaction of one method or another, but to some extent, after a time, a system (we'll use the brain as the example from here on out) reaches a state of relative homeostasis.

So I'm rambling a little bit. The point is, how do you program something which can learn any language, lie, love, play, laugh, calculate, trust and kill? You've got to have some pretty flexible rules around what defines reality.

So my hypothesis (like most of the hypotheses I make, being grounded in absolutely no scientific study) is that these chemicals, or if not them, some other set of chemicals in the brain, have the job of turning things on their heads: randomly flipping bits and presenting the mind with alternative interpretations of what's going on. In childhood, when the brain is young and the neurochemical soup of grey matter is still solidifying into a mind, consciousness benefits greatly from the ability to try out theories and discard them at a moment's notice.

As we get older, we start to know what the flipping of a bit will do (what if I COULD put my hand through the wall?), and some of our experimentation, sadly, ends.

It also brings up the somewhat terrifying concept of The Singularity, in that to program a mind that can perceive reality as flexibly as we do, we would have to take the safety off and really give the program the ability to come up with whatever it wants. It's pretty cool when you see it on the scale of Conway's Game of Life, but scale it up to (super)human intelligence and allow the possibility for it to veer off in a direction like Skynet, and the human race would be in trouble. Or we could end up with something closer to Peter F. Hamilton's concept of the singularity: a more benign entity whose motivations are beyond our comprehension.

Again, I'm rambling. But if I can tie a nice bow on this, it's that reality is highly subjective. If you think about how YOU would try to program something to "perceive" reality, it cascades into all sorts of deliciously complex issues. But one of them, I'm sure, is a mechanism for self-doubt. It plagues us and drives us at the same time, but always has a profound effect.

Normalization Musings


Much of the work I do has to do with formatting the presentation layer of data. The typical data flow is (a) the vendor sends data, (b) we transform and store it in a format somewhere between what they send us and what is required by the display layer (website), and (c) procedures are written against this data.

Having the luxury of placing the burden of data integrity on the vendor, in combination with an asynchronous replication model, means that most of the time we cannot and do not use surrogate keys or foreign keys. Some of the time this works fine; actually, most of the time it works fine. Of course, if the data consumption architecture deviates from a simple truncate/insert, correcting data can be vexing.
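For concreteness, here is a minimal sketch of that truncate/insert consumption pattern; the table and column names are made up for illustration.

    -- Hypothetical truncate/insert load: the vendor resends the full set each cycle,
    -- so the target is simply emptied and repopulated from the staged file.
    BEGIN TRANSACTION;

    TRUNCATE TABLE dbo.ProductPrice;

    INSERT INTO dbo.ProductPrice (ProductCode, PriceDate, Price)
    SELECT ProductCode, PriceDate, Price
    FROM staging.ProductPrice_Vendor;

    COMMIT TRANSACTION;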

I had my first taste of self-inflicted bad database design on a recent project. Hindsight being what it is, I would have done many things differently (not the least of which would have been more planning before developing), but as it is, it was a good learning experience.

I was given a data set which I didn't understand well and a limited amount of time to develop against it. The data sets were effectively fact tables from a data warehouse, some with provided dimension files and some without. (By the way, for reasons I won't get into, SSAS was not an option.) The data sets were also quite large (fact tables around 100 million records with all the different dimensions factored in). To add insult to injury, the default format the data came in was name-value pairs rather than columns (i.e. {Name: Price, Value: 3.14} as opposed to {Price: 3.14}).

Without surrogate keys, a table might easily have eight columns in its primary key, including a generic "Name" column, multiplying the number of records by however many name-value members there were in the table.
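To make the shape of the problem concrete, here is a hypothetical sketch (not the actual schema) of the same fact stored as name-value pairs versus dedicated columns:

    -- NVP shape: one row per measure, with the measure name as part of the key.
    CREATE TABLE dbo.FactMeasureNVP
    (
        SecurityId int           NOT NULL,
        AsOfDate   date          NOT NULL,
        [Name]     varchar(50)   NOT NULL,  -- e.g. 'Price', 'Yield', ...
        [Value]    decimal(18,6) NULL,
        CONSTRAINT PK_FactMeasureNVP PRIMARY KEY (SecurityId, AsOfDate, [Name])
    );

    -- Columnar shape: one row per entity/date, one dedicated column per measure.
    CREATE TABLE dbo.FactMeasure
    (
        SecurityId int           NOT NULL,
        AsOfDate   date          NOT NULL,
        Price      decimal(18,6) NULL,
        Yield      decimal(18,6) NULL,
        CONSTRAINT PK_FactMeasure PRIMARY KEY (SecurityId, AsOfDate)
    );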

There were quite a few revisions I undertook to optimize both retrieval and storage. The first was to do away with the NVP structure and use dedicated columns. Sure, new values required a little bit of dev work to hook up, but even with a schema-less NVP format, some amount of work would still need to be done to hook up new data points. That cut table sizes dramatically, and in combination with data compression, the data set was now looking quite manageable.
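A rough sketch of that conversion, again with hypothetical names: the NVP rows are pivoted into the dedicated columns, and the rebuilt table is page-compressed.

    -- Collapse the name-value rows into one row per entity/date.
    INSERT INTO dbo.FactMeasure (SecurityId, AsOfDate, Price, Yield)
    SELECT SecurityId,
           AsOfDate,
           MAX(CASE WHEN [Name] = 'Price' THEN [Value] END) AS Price,
           MAX(CASE WHEN [Name] = 'Yield' THEN [Value] END) AS Yield
    FROM dbo.FactMeasureNVP
    GROUP BY SecurityId, AsOfDate;

    -- Data compression further shrinks the columnar table.
    ALTER TABLE dbo.FactMeasure REBUILD WITH (DATA_COMPRESSION = PAGE);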

But that’s not the main point I want to make (although it is a good example of why NVP models suck).

The next step I took (and what I should have done from the beginning) was to really get to know the data and understand the natural hierarchies and unique entities. Since not all of this data was provided as reference data by the vendor, I built a process to derive inferred dimensions from the incoming files and maintain surrogate identifiers as necessary.
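A minimal sketch of what such an inferred-dimension step can look like (the table and column names are assumptions, not the project's actual objects):

    -- Any security code seen in the incoming file but not yet in the dimension
    -- gets a new row; an IDENTITY column on DimSecurity supplies the surrogate key.
    INSERT INTO dbo.DimSecurity (SecurityCode)
    SELECT DISTINCT f.SecurityCode
    FROM staging.FactMeasure_Vendor AS f
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM dbo.DimSecurity AS d
        WHERE d.SecurityCode = f.SecurityCode
    );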

What I ended up with was two conceptual layers in the database: a decomposed and normalized version of the fact tables the vendor sent, and a set of custom pre-aggregated tables to serve specific uses and surface data quickly (where indexing alone was insufficient). The main advantage of the normalized structure was the amount of work it removed from building and maintaining these "second-level" aggregate data structures. Being able to reference master records of data ahead of time, and logically know where to look for everything, made the process of pre-populating tables significantly easier. Plus, it gave me much more control over the integrity of the data.
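As an illustration of that second layer, a pre-aggregated table can be rebuilt from the normalized tables with a single, unambiguous query; the aggregate table and the Sector attribute below are hypothetical:

    -- Rebuild one "second-level" aggregate from the normalized fact and dimension.
    TRUNCATE TABLE agg.DailyPriceBySector;

    INSERT INTO agg.DailyPriceBySector (Sector, AsOfDate, AvgPrice, SecurityCount)
    SELECT d.Sector,
           f.AsOfDate,
           AVG(f.Price),
           COUNT(*)
    FROM dbo.FactMeasure AS f
    JOIN dbo.DimSecurity AS d
        ON d.SecurityId = f.SecurityId
    GROUP BY d.Sector, f.AsOfDate;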

The next project I worked on was a departure from the model I described in the first paragraph (vendor -> store -> present). Instead, I built a database drawing upon both internal and external data sets. I used many lessons from the project I described earlier. First and foremost, I took (had?) the time to plan out the database ahead of time. Wherever possible, I normalized things out, in a few cases even going so far as to have reference tables with only one or two id/value rows in them. I created surrogate identifiers where disparate data sources needed to be standardized, I documented the database, and I made use of foreign keys for reasonably static reference data.
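By way of example only (not the actual schema), even a tiny reference table earns its keep once a foreign key points at it:

    -- A small reference table, perhaps with only a couple of rows...
    CREATE TABLE dbo.RefStatus
    (
        StatusId   tinyint     NOT NULL PRIMARY KEY,
        StatusName varchar(50) NOT NULL UNIQUE
    );

    -- ...and a table that references it, so only valid statuses can ever be stored.
    CREATE TABLE dbo.Position
    (
        PositionId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        SecurityId int     NOT NULL,
        StatusId   tinyint NOT NULL
            CONSTRAINT FK_Position_RefStatus REFERENCES dbo.RefStatus (StatusId)
    );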

First, the bad. Although surrogate keys are supposed to be business-process agnostic, there was a need to uniquely identify entities which had no natural key. As such, when setting up new items, even though the value of the surrogate key was supposed to be irrelevant, those identifiers made their way into descriptive models and processes. There was distinct logic written around EntityID:1 vs. EntityID:2. This means that when moving data from environment to environment, updating values and correcting data, these identifiers have to be maintained: MyFirstEntity must always be ID:1, and MySecondEntity must always be ID:2, not just randomly assigned numbers. Perhaps there's a better way to uniquely identify elements with no natural key, but short of fuzzy logic, I'm not aware of one.
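One way to keep those "meaningful" surrogate values stable when moving reference rows between environments (an assumption on my part, not necessarily what was done on the project) is to insert them with explicit identity values rather than letting each environment assign its own:

    -- Hypothetical entity table; the surrogate key is an IDENTITY column.
    CREATE TABLE dbo.Entity
    (
        EntityId   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        EntityName varchar(100) NOT NULL
    );

    -- Pin the EntityID values so logic written against ID 1 and ID 2 keeps working.
    SET IDENTITY_INSERT dbo.Entity ON;

    INSERT INTO dbo.Entity (EntityId, EntityName)
    VALUES (1, 'MyFirstEntity'),
           (2, 'MySecondEntity');

    SET IDENTITY_INSERT dbo.Entity OFF;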

Additionally, foreign keys, while central to database integrity, tie your hands a bit in terms of how you can modify an immature database. For instance, during early development cycles, I was changing table structures every few days or so. Foreign keys did their job admirably and prevented me from taking shortcuts to restructure the database and data; objectively a good thing, but it takes some getting used to.
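When a restructure genuinely requires it, the usual pattern (sketched here with the hypothetical tables from above) is to drop the constraint, reshape, and re-create it so the data is re-validated:

    -- Temporarily remove the constraint...
    ALTER TABLE dbo.Position DROP CONSTRAINT FK_Position_RefStatus;

    -- ...restructure dbo.Position / dbo.RefStatus as needed, then re-add it.
    ALTER TABLE dbo.Position WITH CHECK
        ADD CONSTRAINT FK_Position_RefStatus
        FOREIGN KEY (StatusId) REFERENCES dbo.RefStatus (StatusId);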

Now the good. As with the first project I mentioned, I had to build several consolidated tables in order for front-end queries to perform optimally, but having everything structured logically made building the new structures embarrassingly simple once the foundation was configured. Highly normalized (maybe 3NF) databases that must support a wide variety of complex queries do present problems for systems with low latency tolerance, not all of which can be ameliorated through indexing and query design alone. But again, building de-normalized structures off of the normalized data, over which I had full control, was simple compared to trying to maintain 20 de-normalized tables and having to make sure the name of an entity never got out of sync between them.

I've worked with several highly normalized systems in the past, and while the learning curve to understand the data can be steeper than if you were looking at a de-normalized table, report or spreadsheet, once you understand the data, it makes getting what you need a much more sane process.

Foreign keys were the unsung heroes of this database. They're not as obviously helpful as normalization or other structural advantages, but they prevent things from going wrong: every day that I don't have orphaned records, they're doing their job. Having the assurance that the schema I've spent so long tinkering with and refining will work correctly is well worth the effort. Sure, it takes time to ensure I delete from tables in the correct order and don't add sub-entities before their parents, but I doubt anyone would argue that skipping the cost of learning to do things in the right order is worth having an unstable database.
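"The correct order" just means children before parents; for instance, with the hypothetical tables above:

    DECLARE @SecurityId int = 42;  -- illustrative value

    DELETE FROM dbo.Position    WHERE SecurityId = @SecurityId;  -- child rows first
    DELETE FROM dbo.DimSecurity WHERE SecurityId = @SecurityId;  -- then the parent row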

There were certain instances where I made considered decisions to avoid foreign keys (such as on tables with very heavy traffic, and/or where an orphaned record would have no impact on the system), and I stand by those decisions.

I'm guessing that normalization fans have been thinking "we told you so" since the start of this article, if not earlier. These are things I had no idea about when I started working with SQL. I was educated in a certain way, and the methods used by my mentors just became ingrained. I really had no significant exposure to highly normalized databases or foreign keys when I started, and nobody took the time to explain their implications (positive and negative); I wish someone had. There are still many cases where I forgo these database practices and probably will again in the future, and maybe some of the time they'll come back to haunt me, but if I can give anyone who doesn't use these tools some advice, it would be to strongly consider them. These were tools I dismissed because others dismissed them, and I assumed the problems that arose from them were just the way things were. I found excuses not to break form, but fortunately over time my curiosity won out, and I'm glad it did. As Robert Frost said, "… And I, I took the [road] less traveled by, and that has made all the difference."


Wednesday, February 5, 2014

For those of you who may not have known, there's a cool trick you can do in Visual Studio or SSMS: the alt-drag marquee. Many of these functions can be handled with find and replace, with or without regular expressions, but sometimes those are more trouble than they're worth.

The basics of alt-drag are this. Say I have a block of text, such as the comment block of a procedure, which I want to replace all at once. Hold Alt and drag to the bottom right, and you get a marquee area which respects x/y coordinates rather than just lines and indentation.


Now the cool part. If you've ever used copy-paste with these marquees, you'll find they work… oddly. However, like any good Jedi, you can turn your enemy's weakness into your strength.

Say, for example, I forgot to type SELECT statements for a derived table like this:
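A query of roughly this shape, with the SELECTs deliberately missing after each UNION ALL, illustrates the situation:

    SELECT v.Val
    FROM
    (
              SELECT 1
    UNION ALL 2
    UNION ALL 3
    UNION ALL 4
    ) AS v (Val);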


You could use find and replace to change "all" to "all select", you could use a series of copy/paste commands and down-arrow keys, or you could do this:

Create a zero-width vertical marquee and start typing. Every text line crossed by the marquee receives the typed text at that position (it works with Ctrl+V as well).
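With a query of the shape sketched above, dropping a zero-width marquee just before the 2, 3 and 4 and typing "SELECT " fixes every line at once:

    SELECT v.Val
    FROM
    (
              SELECT 1
    UNION ALL SELECT 2
    UNION ALL SELECT 3
    UNION ALL SELECT 4
    ) AS v (Val);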