
Friday, February 7, 2014

Normalization Musings

Musings on Database Normalization

Much of the work I do has to do with formatting the presentation layer of data. The typical data flow is (a) vendor sends data, (b) we transform and store it in a format somewhere between what they send us and what is required by the display layer (website), and (c) procedures are written against this data.

Having the luxury of placing the burden of data integrity on the vendor, in combination with an asynchronous replication model, means that most of the time we cannot and do not use surrogate keys or foreign keys. Some of the time this works fine; actually, most of the time it works fine. Of course, if the data consumption architecture deviates from a simple truncate/insert, correcting data can be vexing.

I had my first taste of self-inflicted bad database design on a recent project. Hindsight being what it is, I would have done many things differently (not the least of which would have been more planning before developing), but as it is, it was a good learning experience.

I was given a data set which I didn't understand well and a limited amount of time to develop against it. The data sets were effectively fact tables from a data warehouse, some with provided dimension files and some without. (By the way, for reasons I won't get into, SSAS was not an option.) The data sets were also quite large (fact tables around 100 million records with all the different dimensions factored in). To add insult to injury, the default format the data came in was name-value pairs rather than columns (i.e. {Name: Price, Value: 3.14} as opposed to {Price: 3.14}).

Without surrogate keys, a table might easily have 8 columns in the primary key, including a generic "Name" column catapulting the number of records by a factor of however many members there were in the table.

There were quite a few revisions I undertook to optimize both retrieval and storage. The first was to do away with the NVP structure and use dedicated columns. Sure, new values required a little bit of dev work to hook up, but even if it were a schema-less NVP format, some amount of work would still need to be done to hook up the new data points. So that cut table sizes dramatically. In combination with data compression, the data set was now looking quite manageable.
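
To make that concrete, here is a minimal sketch of the kind of NVP-to-columns rewrite I mean. All object and column names (VendorFactNVP, SecurityId, TradeDate, and so on) are invented for illustration; the real tables looked different.

    -- Collapse a name-value-pair layout into one row per entity/date with dedicated columns.
    -- Each fact name becomes its own typed column instead of a generic Name/Value pair.
    select
        f.SecurityId,
        f.TradeDate,
        Price  = max(case when f.[Name] = 'Price'  then cast(f.[Value] as decimal(18, 4)) end),
        Volume = max(case when f.[Name] = 'Volume' then cast(f.[Value] as bigint) end)
    into #FactWide
    from dbo.VendorFactNVP f
    group by f.SecurityId, f.TradeDate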

But that’s not the main point I want to make (although it is a good example of why NVP models suck).

The next step I took (and what I should have done from the beginning) was to really get to know the data and understand the natural hierarchies and unique entities. Since not all of this data was provided as reference data by the vendor, I built a process to build inferred dimensions from the incoming files and maintain surrogate identifiers as necessary.
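
An inferred-dimension build along those lines might look something like the sketch below. Again, the names (DimSecurity, StageFact, SecurityCode) are hypothetical stand-ins, not the actual schema.

    -- Any member that shows up in the incoming file but not yet in the dimension
    -- gets a surrogate key assigned automatically via the identity column.
    create table dbo.DimSecurity
    (
        SecurityKey  int identity(1, 1) primary key,
        SecurityCode varchar(50) not null unique
    )

    insert into dbo.DimSecurity (SecurityCode)
    select distinct s.SecurityCode
    from dbo.StageFact s
    where not exists
    (
        select 1
        from dbo.DimSecurity d
        where d.SecurityCode = s.SecurityCode
    )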

What I ended up with was two conceptual layers in the database: a decomposed and normalized version of the fact tables the vendor sent, and a set of custom pre-aggregated tables to surface data quickly for specific uses (where indexing alone was insufficient). The main advantage of the normalized structure was the amount of work it removed from building and maintaining these "second-level" aggregate data structures. Being able to reference master records of data ahead of time, and logically know where to look for everything, made the process of pre-populating tables significantly easier. Plus it put much more control over the integrity of the data into my hands.
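
A "second-level" rebuild off the normalized layer tends to reduce to something like this sketch (all object names hypothetical). Names and keys only have to be right in one place; the aggregate is just a projection of them.

    -- Rebuild a pre-aggregated presentation table entirely from the normalized layer.
    truncate table dbo.SecurityDailySummary

    insert into dbo.SecurityDailySummary (SecurityKey, SecurityCode, TradeDate, TotalVolume, ClosePrice)
    select
        d.SecurityKey,
        d.SecurityCode,
        f.TradeDate,
        sum(f.Volume),
        max(f.ClosePrice)
    from dbo.FactTrade f
    inner join dbo.DimSecurity d
        on f.SecurityKey = d.SecurityKey
    group by d.SecurityKey, d.SecurityCode, f.TradeDate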

The next project I worked on was a departure from the model I described in the first paragraph (vendor -> store -> present). Instead I built a database drawing upon both internal and external data sets. I used many lessons from the project I described earlier. First and foremost, I took (had?) the time to plan out the database ahead of time. Wherever possible, I normalized things out; in a few cases even so far as to have reference tables with one or two id/value rows in them. I created surrogate identifiers where disparate data sources needed to be standardized, I documented the database, and I made use of foreign keys for reasonably static reference data.

First the bad. Although surrogate keys are, to an extent, supposed to be business-process agnostic, there was a need to uniquely identify entities which had no natural key. As such, when setting up new items, while the value of the surrogate key was irrelevant, those identifiers made their way into descriptive models and processes. There was distinct logic written around EntityID:1 vs EntityID:2. This means that when moving data from environment to environment, updating values and correcting data, these identifiers have to be maintained. MyFirstEntity must always be ID:1, and MySecondEntity must always be ID:2, not just randomly assigned numbers. Perhaps there’s a better way to uniquely identify elements with no natural key, but short of fuzzy logic, I’m not aware of one.
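
One hedged way to keep those "well-known" surrogate values identical across environments is to insert the reference rows with explicit IDs rather than letting the identity pick them. dbo.Entity and its columns are made-up names for the sketch.

    -- Pin the surrogate values that downstream logic depends on (EntityID 1 and 2).
    set identity_insert dbo.Entity on

    insert into dbo.Entity (EntityID, EntityName)
    select v.EntityID, v.EntityName
    from (values (1, 'MyFirstEntity'), (2, 'MySecondEntity')) v (EntityID, EntityName)
    where not exists (select 1 from dbo.Entity e where e.EntityID = v.EntityID)

    set identity_insert dbo.Entity off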

Additionally, foreign keys, while central to database integrity, tie your hands a bit in terms of how you can modify an immature database. For instance, during early development cycles, I was changing table structures every few days or so. Foreign keys did their job admirably and prevented me from taking shortcuts to restructure the database and data; objectively a good thing, but it takes some getting used to.

Now the good. As with the first project I mentioned, I had to build several consolidated tables in order for front-end queries to perform optimally, but having everything structured logically made building the new structures embarrassingly simple once the foundation was configured. Highly normalized (3NF or so) databases requiring a wide variety of complex queries do present problems for systems with low latency tolerance, not all of which can be ameliorated through indexing and query design alone. But again, building de-normalized structures off of the normalized data, over which I had full control, was simple compared to trying to maintain 20 de-normalized tables and having to make sure the name of an entity never got out of sync between them.

I've worked with several highly normalized systems in the past, and while the learning curve to understand the data can be steeper than if you were looking at a de-normalized table, report or spreadsheet, once you understand the data, it makes getting what you need a much more sane process.

Foreign keys were the unsung heroes of this database. They’re not as obviously helpful as normalization or other structural advantages, but they prevent things from going wrong. Every day that I don’t have orphaned records, they’re doing their job. Having the assurance that the schema I've spent so long tinkering with and refining will work correctly is well worth the effort. Sure, it takes time to ensure I delete from tables in the correct order and don’t add sub-entities before their parent, but I doubt anyone would say that avoiding the cost of learning to do things in the right order is worth having an unstable database.
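
For context, the kind of constraint doing that quiet work, and the child-first delete order it enforces, looks roughly like this (table names hypothetical, continuing the Entity example above):

    -- Sub-entities must reference an existing parent entity.
    alter table dbo.SubEntity
        add constraint FK_SubEntity_Entity
        foreign key (EntityID) references dbo.Entity (EntityID)

    -- Deletes have to run child-first, or the constraint (rightly) complains.
    delete from dbo.SubEntity where EntityID = 2
    delete from dbo.Entity where EntityID = 2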

There were certain instances where I made weighed decisions to avoid foreign keys (such as on tables with very heavy traffic, and/or when an orphaned record would have no impact on the system), and I stand by those decisions.

I’m guessing at this point (if not earlier) normalization fans have been thinking “we told you so” since the start of the article. These are things I had no idea about when I started working with SQL. I was educated in a certain way and the methods used by my mentors just became ingrained. I really had no significant exposure to highly normalized databases or foreign keys when I started, and nobody took the time to explain their implications (positive and negative), and I wish someone had. There are still many cases where I forgo these database practices and probably will again in the future, and maybe some of the time they’ll come back to haunt me. But if I can give anyone who doesn’t use these tools some advice, it would be to strongly consider them. These were tools I dismissed because others dismissed them, and I assumed the problems that arose from them were just the way things were. I found excuses not to break form, but fortunately over time my curiosity won out, and I’m glad it did. As Robert Frost said, “… And I, I took the [road] less traveled by, and that has made all the difference.”


Tuesday, August 13, 2013

Find What Filegroup a Given Index is on.

If you use the SQL Server system stored procedure sp_helpindex, you'll get back table information as well as index information, which includes a concatenated string telling you what filegroup the index is on. I have an issue right now where we're doing cleanup of tables which had indexes created on the wrong filegroups, so I wanted a way to programmatically determine this information without having to run sp_helpindex on each table. To my surprise, I wasn't really able to find an article online for how to do this. Plenty of articles on moving the indexes to a new filegroup, which were helpful since I have to do that, but not for this initial research stage. With that in mind, I'd like to show you what I came up with.

    -- Map each index to the data space it lives on via sys.data_spaces.
    -- Note: for partitioned indexes, ds.name is the partition scheme rather than a filegroup.
    select 
        FilegroupName = ds.name, 
        IndexName = i.Name,
        IndexType = i.type_desc,
        TableName = t.Name
    from sys.indexes i
    inner join sys.data_spaces ds
        on i.data_space_id = ds.data_space_id
    inner join sys.tables t
        on i.object_id = t.object_id
    where i.type in (1, 2, 5, 6) -- clustered, nonclustered, and columnstore indexes

I got this basically by using sp_helptext on sp_helpindex.
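
For the cleanup step itself (moving a misplaced index), one approach is to recreate the index on the correct filegroup with DROP_EXISTING so it happens as a single operation. The index, table, column, and filegroup names below are placeholders.

    -- Rebuild the index on the intended filegroup in one statement.
    create nonclustered index IX_MyTable_MyColumn
        on dbo.MyTable (MyColumn)
        with (drop_existing = on)
        on [CorrectFilegroup]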

Monday, June 10, 2013

When SQL Output Makes Me Cry

I'm a big proponent of the SQL OUTPUT clause (introduced back in SQL Server 2005). Having the ability to output the rows of a given DML statement has a lot of great uses. You can log DML actions without the need for triggers or costly features like CDC, you can capture data moving around inside a procedure, or you can debug a one-off script you're running. I did, however, just run into a situation which makes me want to cry.
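
The trigger-free logging use is the one I lean on most. A minimal sketch (dbo.Price and #PriceLog are hypothetical): capture the before and after values of an UPDATE straight into a logging table.

    -- Log what an update changed, without a trigger or CDC.
    create table #PriceLog
    (
        LoggedAt   datetime      not null,
        SecurityId int           not null,
        OldPrice   decimal(18,4) null,
        NewPrice   decimal(18,4) null
    )

    update p
    set p.Price = p.Price * 1.01
    output getdate(), inserted.SecurityId, deleted.Price, inserted.Price
        into #PriceLog (LoggedAt, SecurityId, OldPrice, NewPrice)
    from dbo.Price p
    where p.TradeDate = '20130610'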

Typically with the DML procedures I work with, we like to put all the data manipulation, massaging, and preparation at the start, and then at the end persist all this data to the presentation tables. While I always realized why, it became especially apparent here. If it's not clear, the benefit is that you have all your staging done in temporary objects, and if something goes wrong, you can break out and not have to worry about anything live being affected. You also avoid convoluted schemes of transactions and try/catch blocks, and you know where to look for the DML statements modifying presentation tables. Now this doesn't always work, but I think for the most part a small amount of effort can fit this mold.

One big problem comes with debugging a procedure. You need to see data flow through the procedure in order to test individual parts and see where things are going wrong. If you have a presentation table DML action early in the procedure and you want to run it in any environment but a development environment, you likely won't have the permissions to affect the tables (and if you do, you still probably don't want to). This is easy to get around by changing the
  insert into MyPresentationTable (ColumnA, ColumnB) select ColumnA, ColumnB from #TempTable   
to something like
 select ColumnA, ColumnB into #MyTempPresentationTable from #TempTable  
Now you have a temp object which holds the temporary data you need, and you can move on with debugging. However, with the advent of the OUTPUT clause, you can now make selections from things like MERGE statements, OUTPUT INTO statements, and so forth. The case I have now is that a DML action is performed early in the procedure, via a MERGE statement, and the OUTPUT of that is then used to feed a temp table used in just about every subsequent statement in the procedure. In the procedure, the $action attribute is used in conjunction with deleted.* and inserted.*. Trying to re-create the contents of a MERGE statement is difficult enough to do by hand, but when the situational outputs of a MERGE statement are then used for everything else, running this in an environment where you can't just run it and fill up objects to debug is a nightmare.

I'm a big fan of the OUTPUT clause, and actually a pretty big fan of the MERGE statement as well, but please, consider how it's being used. New features are wonderful, but they don't have the time-tested trial and error of longer standing methods. You have to give more thought to how you use them, because dealing with them and debugging them is going to raise issues which don't typically come up with more core features.
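
For reference, the shape of the pattern I'm describing is roughly the following (all object names invented, and the staging/logging temp tables are assumed to already exist). Everything downstream hangs off #MergeActions, which is exactly what makes it hard to simulate.

    -- A MERGE whose OUTPUT feeds a temp table that later statements depend on.
    merge dbo.Price as tgt
    using #StagedPrice as src
        on tgt.SecurityId = src.SecurityId
    when matched then
        update set tgt.Price = src.Price
    when not matched then
        insert (SecurityId, Price) values (src.SecurityId, src.Price)
    output $action, src.SecurityId, deleted.Price, inserted.Price
        into #MergeActions (RowAction, SecurityId, OldPrice, NewPrice);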

Tuesday, March 12, 2013

Non-deplorable use of triggers

So I had a rare case today to actually use a trigger without invoking the wrath of all the DBAs, developers and gnomes that live under my desk.

For starters, let me give my little myopic view of why I (and most SQL developers I know) avoid triggers like The Plague. Up front I admit that I probably don't know all there is to know about triggers, and how best to implement them to keep from pulling your hair out. That said, the trigger-laden structures I've had to work with tend to have one of the two following problems.

First, triggers used to enforce data integrity involve restricting data in at least one way. There's an older system I have to work on occasionally at work which has a procedure that updates identifiers in about 20 tables, each table containing triggers referencing other tables (some the same ones which are in the procedure, but some not). When the procedure fails, tracking down where the error took place in the procedure is just the beginning. You then end up traversing a network of obscure DML-enforcing statements across dozens of objects. The end result is that most people who work on the system go to extra lengths to circumvent the triggers if something goes wrong with them, rather than dig through them to fix the problem.

The next problem with triggers is that regardless of the use, there is a "hidden" execution cost on every transaction against the table to which the trigger is bound. Imagine you have a table which, when built, had very low traffic; maybe 50 items added a day. Each time one of those is logged, a trigger fires to update an audit logging table, and additionally update a reference table to make sure that any identifiers which came in are historically corrected as well. Now imagine a few years go by and the developer who wrote the system has retired to become a pet psychiatrist. The company grows by leaps and bounds, and now that table is receiving 500,000 DML transactions a day, or 5 million. While there are certainly ways to remedy this situation, it might take a long time to realize that a trigger on the table is part of the problem.
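
Incidentally, a quick way to check whether a table is quietly paying that cost is to look at the catalog views; something like this lists every DML trigger in the database and which table it hangs off of.

    -- Find DML triggers and the tables they are bound to.
    select
        TableName   = object_name(t.parent_id),
        TriggerName = t.name,
        t.is_disabled
    from sys.triggers t
    where t.parent_class_desc = 'OBJECT_OR_COLUMN'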

So again, maybe this is just the way I've grown accustomed to doing things, but integrity enforced by table constraints or through procedural logic is the way I prefer to maintain data integrity.

That said, here's the situation I had today. A client was trying to upload some new data to one of our test environments, and throughout the day the data would (according to the client) appear, and later disappear. I'll spare you the hour-long heated conversation we had between several teams, but in the end, I undertook the task of monitoring the tables which housed the data to see what they did throughout the day. Initially I set up an SSIS package which would periodically dump the data into a logging table via an Agent job. But on my bus ride home, the annoying fact that this still would not necessarily catch the culprit statements persisted. Thinking back on a presentation I'd given on logging features (Change Tracking, Change Data Capture, etc.), it suddenly occurred to me that a trigger would be a perfect solution to this*.

I set up an AFTER INSERT, UPDATE, DELETE trigger on the two tables with the identifiers I was concerned with and had them dump any DML statements into a logging table. The logging table would auto-increment to prevent any PK violations. The trigger additionally filtered the INSERTED and DELETED tables by the 4 identifiers the client said were popping in and out, and I set up another Agent job to keep the table from growing larger than one month of history. Additionally, I added a clause to the trigger deployment script so it would only be created in development and test environments. There's certainly no reason it could not run in production as well, but the idea here was maximum visibility into the table's modifications with a minimal footprint on the database. So with a small, non-intrusive trigger, I was able to log the actions on the table and identify when the records popped in and out of existence.
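
A hedged reconstruction of that kind of trigger is below; it is not the exact one from this incident, and the object names and identifier values are placeholders.

    -- Audit trigger: log filtered INSERTED/DELETED rows for later review.
    create trigger trg_AuditClientData
    on dbo.ClientData
    after insert, update, delete
    as
    begin
        set nocount on

        insert into dbo.ClientDataAudit (AuditDate, RowAction, Identifier, SourceValue)
        select getdate(), 'INSERTED', i.Identifier, i.SourceValue
        from inserted i
        where i.Identifier in ('ID1', 'ID2', 'ID3', 'ID4')
        union all
        select getdate(), 'DELETED', d.Identifier, d.SourceValue
        from deleted d
        where d.Identifier in ('ID1', 'ID2', 'ID3', 'ID4')
        -- An UPDATE shows up as one 'DELETED' row (before) and one 'INSERTED' row (after).
    end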

There are still a few drawbacks to this approach. First of all, maintaining the list of tracked tickers is a very manual process. While this is a rare situation that I'll probably only have to monitor for a few weeks, if it happened again I'd almost have to re-build the trigger from scratch. Second, ideally I would have the trigger "retire itself" after, say, a month, so that if I forgot about it when I moved on to become a pet psychiatrist, the trigger wouldn't get lost in the tangle of 0s and 1s. Also, and this is not really a drawback of a trigger but rather a limitation, I would have liked a way to pass the calling procedure's information into the trigger in order to further trace the source (something like object_name(@@PROCID), but unfortunately the execution context of @@PROCID changes to that of the trigger upon its call).

In the end, however, this seemed like one of those times where it was the right tool for the job. It's refreshing to break out of the tunnel vision which often inadvertently affects one's programming style and find a legitimate use for something you'd historically written off.



* Change Tracking and CDC were not options due to database restrictions we have.

Tuesday, November 6, 2012

SQL Server Templates

SQL Server Management Studio comes with a bunch of pre-built templates for scripting out procedures, functions, tables, administrative tasks and more. Essentially, all they are is a .sql file with some special syntax in the script where values can be pasted into place. The special syntax is pretty easy to get a handle on, but at first glance can seem a little strange. Here's a sample of what you might see if you were looking at a SQL Server template (minus all the red lines I added).



There are several parts to this. Let's start with the window that's currently being displayed (in the picture, it's the window with the header "Specify Values for Template Parameters"). To get to this window, you must have a query window open in SSMS, and it wouldn't hurt to have a few tags in it (more on that later). In this query window, press ctrl+shift+M. I have no idea what the M is a mnemonic for... para(M)eters? Whatever. Once you're in this view, you'll be able to replace the special notation inside the script with values you input on this screen.

So how do you set that up? You put in tags with the notation
<[Parameter], [Type], [Value]>

[Parameter] is like the identifier for your replacements. Anywhere in your script, any time you have, say, a <Database, sysname, MyDatabase> tag in the script, all instances of that Database tag will be replaced with the value you put in that popup window. This is the only part of the tag which is NOT optional. Think of it as your primary key.

[Type] is really just a hint to yourself about what datatype the script is expecting. Oftentimes it's irrelevant. If you're building a new proc, you might have a tag which says something like <ProcedureName, Sysname, >, but you could just as easily have put "Int" instead of "Sysname" and it wouldn't care. It's not a data type validator; it's just there so that if you have a script which requires data of a certain type, you know what to enter. It's worth noting that this is entirely optional. Also, if you change this data type throughout the script, it won't care. Only the first instance of the tag will be used to populate the [Type] column in the popup window.

[Value] is an optional default value for the field. In the example I showed above, more often than not I'm looking for information_schema.columns, so I set the default of the SubType parameter to "Columns". If I wanted to make it Routines or Tables, I'd just replace the "Columns" value with the value I want.
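
To tie it together, a template along those lines (my guess at the shape of the script in the screenshot, with a hypothetical TableName parameter thrown in) might look like this; pressing Ctrl+Shift+M prompts for SubType and TableName and substitutes them everywhere they appear.

    select *
    from information_schema.<SubType, sysname, Columns>
    where table_name = '<TableName, sysname, MyTable>'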



When you then click OK, all the values you have are propagated into those replaced tags throughout your script. It's worth mentioning that SSMS treats each replacement of a value as its own undoable action, so if you've got a huge parameter list and you accidentally hit OK too soon, all the values you did (or probably didn't) submit will have to be undone one by one until you get back to your original script.



My final note on this is that if you don't have SQL 2012, or do and can't figure out how to use the snippets feature, get something like the SSMS Tools Pack. It's free, and comes with a bunch of helpful tools, my favorite of which is a really simple code snippet interface. I've built little 2 - 3 character shortcuts for some of my most commonly used scripts, such as the one I listed here. I type "info"+tab and it populates my query window with that code. I've got others as well: "tab"+tab for tables, "IX" for indexes and so on. When you combine this template format with the ability to call these templates up in a fraction of a second, you can really cut down extra typing time a LOT.

Monday, July 30, 2012

Reset SQL Server 2008 SSMS Shortcuts

Recently I've been playing with gVim, trying to get into the lightning-fast editing that is said to come with using vi as an editor. Problem is, unless I'm mistaken (I'd be happy to be proven wrong here), it's difficult or impossible to replicate the common tasks I use in SSMS: F5 to execute, remove the results pane, new/next window, etc.

An addon I was trying out called ViEmu added a lot of great vi functionality to SSMS. It's something that I think users adept at vi would love; maybe even an SSMS user could learn to love it. But I'm new, and like anyone starting out with vi, it's frustrating, so with the hundreds of looming projects which need to get done, I guess learning vi will have to wait.

What does this have to do with my posting? This addon (as do many other SSMS addons) remaps many commonly used shortcuts: Ctrl-R to hide the results pane, Alt-W to access window switching, even Ctrl-N for a new query window. Cue a Google search for resetting keyboard shortcuts in SSMS.

Turns out for whatever reason, SSMS really doesn't like you being able to easily change your shortcut settings, and a surprising dearth of information was available on the interwebs on this topic. I got it to work, so let me distill what I can for you.

  1. SSMS doesn't store keyboard shortcut information in the registry but rather in local files, so uninstallation won't necessarily help.
  2. The files are user specific, so they are stored in AppData as opposed to somewhere in the install directory.
  3. SSMS stores keyboard shortcut data in .vsk files (two of which seem to be responsible for the settings concerning SSMS).


You have to go to your AppData directory. Depending on your version and how your machine(s) is/are set up, your path may look something like this:

C:\Documents and Settings\ShayShay\Application Data\Microsoft\Microsoft SQL Server\100\Tools\Shell

In there, you're looking for two .vsk files: User.vsk and/or Current.vsk. I found them both in the directory mentioned above, but it sounds as though they may be in slightly different paths within the SSMS app data directory.

Point is, close SSMS, delete those two files (or, as I did, cut and paste them to the desktop to be safe), then start up SSMS. Try something simple like a select 1, then F5 to execute and Ctrl-R to toggle the results pane, or really whichever previously existing shortcut was broken. If it's still borked, time to go hunting through your app data directories for other offending .vsk files.

NOTE: I'm not sure what the result of deleting .vsk files at will is. So far, SSMS looks like it auto-remakes the necessary files with default settings (what I wanted), but always be careful snipping out random files. I'd really recommend cutting and pasting them to the desktop or something so that if things go horribly wrong, you don't end up in a worse situation than you started in.

Monday, December 12, 2011

Replace tags in a string

I ran into a situation today where a client wants to be able to give us a string for a disclaimer, but they want to be able to add tags to the string which, after we work our magic, will be replaced with a value from a table, making the string somewhat dynamic in nature. Here's how I went about doing it.

First, I made a function to house the logic around the tags. I realize there are a few shortcomings here; currently, it means you can't have a '<' in the text without it being the start of a full tag ending in '>', but there are a few adaptations that I feel would be easy enough to implement if this became problematic. The code is a modified version of a comma-delimited string parsing function I use a lot. Basically, what this part does is get the charindex of the opening and closing brackets and the length between the two. It cuts out the substring between them based on that length and pops it into a table. After it does that, it cuts off everything to the left of that closing bracket (or, to be more accurate, it keeps the right side). This repeats until you have a table filled with the contents of all the bracketed terms in the string.



IF OBJECT_ID (N'dbo.TagParse') IS NOT NULL
DROP FUNCTION dbo.TagParse
GO

CREATE FUNCTION dbo.TagParse(@String varchar(max))
RETURNS @TagValues TABLE
(
    ident int identity(1,1) primary key clustered,
    TagValue varchar(100)
)
AS
begin

    declare
        @FragText varchar(255),
        @OpenIndex int,
        @CloseIndex int,
        @FragLength int,
        @StringLength int

    --Get some relevant information about the fragment length and position
    select
        @OpenIndex = CHARINDEX('<', @String),
        @CloseIndex = CHARINDEX('>', @String),
        @StringLength = datalength(@String)
    select
        @FragLength = @CloseIndex - @OpenIndex

    while @CloseIndex > 0
    begin

        --Pull out the text between the brackets and keep everything to the right of the closing bracket
        select @FragText = SUBSTRING(@String, @OpenIndex + 1, @FragLength - 1)
        select @String = RIGHT(@String, @StringLength - @CloseIndex)

        insert into @TagValues (TagValue)
        select @FragText

        --Recalculate the bracket positions against the shortened string
        select
            @OpenIndex = CHARINDEX('<', @String),
            @CloseIndex = CHARINDEX('>', @String),
            @StringLength = datalength(@String)
        select
            @FragLength = @CloseIndex - @OpenIndex

    end

    return
end


Once we've got the tags parsed out, we can go get those values from a presentation table. For the purposes of what I had to do, it will always come from the same table, but I suppose if one were so inclined, you could do some funky dynamic SQL to be able to specify a table name too, but I won't go there (right now). I've built up some basic sample data so you can see this in action, but the idea behind it is to:

1) parse out the string using the above function
2) set a variable = the fact name parsed from the string
3) retrieve the value from the presentation table based on the fact name from step 2

Then, much to my dismay, using a loop, I go through the parsed table and replace instances of the tag with the values I just set. If there's a set based way to do this, I'd love to know it, but I couldn't think of a way that didn't involve dynamic sql and lots of nested replace statements. And given that these tagged values are probably not ever going to be more than a few tags here and there, the strain of RBAR is not too bad.


if OBJECT_ID('tempdb.dbo.#FactSet') is not null drop table #FactSet
create table #FactSet
(
FactName varchar(100),
FactValue varchar(100)
)

insert into #FactSet (FactName, FactValue)
select 'FactName1', 'Test Value 1' union all
select 'FactName2', '3.14'

declare @Tags table
(
ident tinyint primary key clustered,
TagValue varchar(100)
)

declare
@String varchar(max),
@ident tinyint,
@MaxIdent tinyint,
@FactName varchar(255),
@FactValue varchar(255)

--This is the string that will constitute the disclaimer text they want to chop up
select @String = 'This is a test string. Field 1: <FactName1>, Field 2: <FactName2>'

--Parse the string into its constituent tags
insert into @Tags (ident, TagValue)
select ident, TagValue
from dbo.TagParse(@String)

--Get the max ident through @@rowcount so I don't have to actually hit the @Tags table
select
@Ident = 1,
@MaxIdent = @@rowcount

while @ident <= @MaxIdent
begin
--Get the name of the fact name we parsed and assign it to @FactName
select @FactName = TagValue
from @Tags
where ident = @ident

--Get the value from the Fact Set table which corresponds to @FactName
select @FactValue = FactValue
from #FactSet
where FactName = @FactName

--We then replace any instances of the original tag with the value we just got.
--Enclosing the @FactName in <> to simulate the original tag syntax.
select @String = REPLACE(@String, '<' + @FactName + '>', @FactValue)

select @ident = @ident + 1

end

select @string


If anyone who reads this understands what the hell I just did, I'd be happy to hear any ways you think this could be done better.