Av rating:

Total votes: 12
Total comments: 8

send to a friend

printer friendly version

State Transition Constraints

08 October 2010

by Joe Celko

Data Validation in a database is a lot more complex than seeing if a string parameter really is an integer. A commercial world is full of complex rules for sequences of procedures, of fixed or variable lifespans, Warranties, commercial offers and bids. All this requires considerable subtlety to prevent bad data getting in, and if it does, locating and fixing the problem. Joe Celko shows how useful a State transition graph can be, and how essential it can become with the time aspect added.

I did an article called Constraint Yourself! back in October 2008 on how to use DDL constraints to assure data integrity. One of the topics in that piece was a look at state transition constraints via an auxiliary table. I did not go into much detail since this was one of several topics I was covering. I got the feedback that I needed in order to to cover this technique in more depth, so here I am.

I introduced the topic of transition constraints to show how such constraints could be modeled as a state-transition diagram in order to enforce the rules when an entity can be updated only in certain ways. There is an initial state, flow lines that show what are the next legal states, and one or more termination states. The original example was a simple state change diagram of possible marital states which looked like this:

This state transition diagram was deliberately simplified, but it is good enough to explain principles. To keep the discussion as simple as possible, my table is for only one person's marital status over his life. Here is a skeleton DDL with the needed FOREIGN KEY reference to valid state changes and the date that the current state started.

CREATE TABLE MyLife

(previous_state VARCHAR(10) NOT NULL,

current_state VARCHAR(10) NOT NULL,

CONSTRAINT Improper_State_Change

FOREIGN KEY (previous_state, current_state)

REFERENCES StateChanges (previous_state, current_state),

start_date DATE NOT NULL PRIMARY KEY, --DateTime for SQL Server 2005

--etc.

);

What is not shown on it are which nodes are initial states (in this case "Born") and which are terminal or final states (in this case "Dead", a very terminal state of being). A terminal node can be the current state of a middle node, but not a prior state. Likewise, an initial node can be the prior state of a middle node, but not the current state. I did not write any CHECK() constraints for those conditions. It is easy enough to write a quick query with an EXISTS() predicate to do this and I will leave that as an exercise for the reader. Let's load the diagram into an auxiliary table with some more constraints.

CREATE TABLE StateChanges

(previous_state VARCHAR(10) NOT NULL,

current_state VARCHAR(10) NOT NULL,

PRIMARY KEY (previous_state, current_state),

state_type CHAR(1) DEFAULT 'M' NOT NULL

CHECK (state_type IN ('I', 'T', 'M')), /*initial, terminal, middle*/

CONSTRAINT Node_type_violations

CHECK (CASE WHEN state_type IN ('I', 'T')

AND previous_state = current_state

THEN 'T'

WHEN state_type = 'M'

AND previous_state <> current_state

THEN 'T' ELSE 'F' END = 'T')

);

INSERT INTO StateChanges

VALUES ('Born', 'Born', 'I'), -- initial state

('Born', 'Married', 'M'),

('Born', 'Dead', 'M'),

('Married', 'Divorced', 'M'),

('Married', 'Dead', 'M'),

('Divorced', 'Married', 'M'),

('Divorced', 'Dead', 'M'),

('Dead', 'Dead', 'T'); -- terminal state

An aspect of this problem that I did not consider in the first article is the time dimension. We want to see a temporal path from an initial state to a terminal state. State changes do not happen all at once, but are spread over time. An acorn becomes an oak tree before it becomes lumber and finally my chest of drawers. The acorn does not jump immediately to being a chest of drawers. Some of the changes are controlled by time. I cannot get married immediately after being born, but have to wait to be of legal age. A business offer can expire in a set number of days. You can fill in any number of examples of your own.

For a production system, you would need a more complete set of temporal columns to guarantee that we have no gaps in the history, but this will do for now. We now need a stored procedure to add data to the MyLife table. Here is one solution which is deliberately broken into clear steps for clarity.

CREATE PROCEDURE Change_State

(@in_change_date DATE,

@in_change_state VARCHAR(10))

BEGIN

DECLARE @most_recent_state VARCHAR(10);

SET @most_recent_state

= (SELECT current_state

FROM MyLife

WHERE start_date

= (SELECT MAX(start_date) FROM MyLife));

/* insert initial state if empty */

IF NOT EXISTS (SELECT * FROM MyLife)

AND @in_change_state

IN (SELECT previous_state

FROM StateChanges

WHERE state_type = 'I')

BEGIN

INSERT INTO MyLife (previous_state, current_state, start_date)

VALUES (@in_change_state, @in_change_state, @in_change_date);

RETURN;

END;

/* must be a real state change */

IF @in_change_state = @most_recent_state

BEGIN

RAISERROR ('This does not change the state.', 16, 1);

RETURN;

END;

/* must move forward in time */

IF @in_change_date <= (SELECT MAX(start_date) FROM MyLife)

BEGIN

RAISERROR ('Violates time sequence.', 16, 1);

RETURN;

END;

INSERT INTO MyLife (previous_state, current_state, start_date)

VALUES (@most_recent_state, @in_change_state, @in_change_date);

END;

The first block of code locates the most recent state of my life, based on the date. The second block of code will insert an initial state if the table is empty. This is a safety feature but there probably ought to be a separate procedure to create the set of initial states. The new state has to be an actual change, so there is a block of code to be sure. The changes have to move forward in time. Finally, we build a row using the most recent state as the new previous state, the input change state and the date. If the state change is illegal, the FOREIGN KEY is violated and we get an error.

If you had other business rules, you could also add them to the code in the same way. You should have noticed that if someone makes changes directly to the MyLife Table, they are pretty much free to screw the data. It is a good ideas to have a procedure that checks to see that MyLife is in order. Let's load the table with bad data:

INSERT INTO MyLife (previous_state, current_state, start_date)

VALUES ('Born', 'Married', '1990-09-05'),

('Married', 'Divorced', '1999-09-05'),

('Married', 'Dead', '2010-09-05'),

('Dead', 'Dead', '2011-05-10'),

('Dead', 'Dead', '2012-05-10');

This poor guy popped into existence without being properly born , committed bigamy and died twice. And you think your life is tough! Here is a simple validation procedure to catch those errors.

CREATE PROCEDURE ValidateLife

WITH Sequenced_State_History

(SELECT previous_state, current_state,

ROW_NUMBER() OVER (ORDER BY start_date)

AS change_seq

FROM MyLife)

/* There is chain of links from the initial state to the current state */

SELECT 'Missing link(s) in History'

FROM Sequenced_State_History AS H1, Sequenced_State_History AS H2

WHERE H1.change_seq + 1 = H2.change_seq

AND H1.current_state <> H2.previous_state

UNION ALL

/* has one and only one initial state */

SELECT 'No unique initial state.'

FROM MyLife AS M, StateChanges AS C

WHERE C.state_type = 'I'

AND M.previous_state = C.previous_state

AND M.current_state = C.previous_state

HAVING COUNT(*) <> 1

UNION ALL

/* has zero or one terminal state */

SELECT 'Too many terminal states.'

FROM MyLife AS M, StateChanges AS C

WHERE C.state_type = 'T'

AND M.previous_state = C.previous_state

AND M.current_state = C.previous_state

HAVING COUNT(*) > 1;

The CTE numbers the steps of the temporal path from an initial node to a middle or terminal node. This chain has to be unbroken which means going from step (n) to step(n+1) has to be a legal change in the StateChanges table. This chain can have only one initial node, so let's check for that next. Finally, the chain is either still in progress or it has reached a single terminal node.

A little note on the programming technique used. The union of separate queries to do one validation at a time can often be made faster by combining some of the queries. But there are trade-offs; this code is easy to read and to maintain and (hopefully) it will not be run often. It is also hard to get error messages from a single statement. Look back at the ChangeState() procedure; the two IF and RAISERROR() blocks of code could have been converted into CASE expressions that will generate NULLs, folded into the INSERT INTO statement and cause the insertion to fail.

INSERT INTO MyLife (previous_state, current_state, start_date)

VALUES

(NULLIF (@in_change_state, @most_recent_state),

@in_change_state,

CASE WHEN @in_change_date

<= (SELECT MAX(start_date) FROM MyLife)

THEN NULL ELSE @in_change_date END);

This is not easy to read or to get error messages that tell you if the @in_change_date is invalid in that it violates the time sequence.

The Temporal Side of Changes

What is still missing is the temporal aspect of state changes. In this example, the ('Born', 'Married') change would have to deal with the minimum age of consent. The (Married', 'Divorced') change often has a legal waiting period. While technically a business rule, you know that no human being has lived over 150 years, so a gap that size is a data error. The terminal and initial states are instantaneous, however. Let's add more flesh to the skeleton table:

CREATE TABLE StateChanges

(previous_state VARCHAR(10) NOT NULL,

current_state VARCHAR(10) NOT NULL,

PRIMARY KEY (previous_state, current_state),

state_type CHAR(1) DEFAULT 'M' NOT NULL

CHECK (state_type IN ('I', 'T', 'M')), /*initial, terminal, middle*/

state_duration INTEGER NOT NULL -- unit of measure is months

CHECK (state_duration >= 0),

CONSTRAINT Node_type_violations

CHECK (CASE WHEN state_type IN ('I', 'T')

AND previous_state = current_state

THEN 'T'

WHEN state_type = 'M'

AND previous_state <> current_state

THEN 'T' ELSE 'F' END = 'T')

);

To make up some data, let's assume that the age of consent is 18 (12 months * 18 years = 216), that you have to wait 3 months into your marriage before getting a divorce, and that you have to be divorced 2 months before you can re-marry. Of course, you can die instantly.

INSERT INTO StateChanges

VALUES ('Born', 'Born', 'I', 0), -- initial state

('Born', 'Married', 'M', 216),

('Born', 'Dead', 'M', 0),

('Married', 'Divorced', 'M', 3),

('Married', 'Dead', 'M', 0),

('Divorced', 'Married', 'M', 2),

('Divorced', 'Dead', 'M', 0),

('Dead', 'Dead', 'T', 0); -- terminal state

The first question is where to check for temporal violations; during insertion or with validation procedures? My answer is both. Whenever possible, do not knowingly put bad data into a schema so this should be done in the ChangeState() procedure. But someone or something will subvert the schema and you have to be able to find and repair the damage.

Here is a procedure which will tell you what state change in the chain has an improper duration and what the disagreement is.

WITH Sequenced_State_History

(SELECT previous_state, current_state, start_date,

ROW_NUMBER() OVER (ORDER BY start_date) AS change_seq

FROM MyLife)

/* There is chain of links from the initial state to the current state */

SELECT H2.change_seq, H2.previous_state, H2.current_state,

DATEDIFF (MM, H1.start_date, H2.start_date) AS actual_state_duration,

C.state_duration AS expected_state_duration

FROM Sequenced_State_History AS H1,

Sequenced_State_History AS H2,

StateChanges AS C

WHERE H1.change_seq + 1 = H2.change_seq

AND DATEDIFF (MM, H1.start_date, H2.start_date) <= C.state_duration

AND C.previous_state = H2.previous_state

AND C.current_state = H2.current_state;

Inserting a new life change is not a simple matter of putting a (previous_state, current_state, start_date) row into the table. To do it right, you can put conditions into the INSERT INTO statement to cause errors when there is bad data.

CREATE PROCEDURE Life_Status_Change

(@in_change_state VARCHAR(10),

@in_most_recent_state VARCHAR(10),

@in_change_date DATE)

INSERT INTO MyLife (previous_state, current_state, start_date)

VALUES

(NULLIF (@in_change_state, @in_most_recent_state),

@in_change_state,

CASE WHEN @in_change_date

<= (SELECT MAX(start_date) FROM MyLife)

THEN NULL ELSE @in_change_date END);

A slightly different model will keep a (start_date, expiry_date) pair in the history table. In the case of the MyLife example, the durations were minimums for certain changes. You can get married when you are older than 18 years of age and probably should. But a lot of commercial situations have a fixed lifespan. Warranties, commercial offers and bids expire in a known number of days. This means adding another column to the StateChanges table that tells the insertion program if the expiration date is optional (shown with a NULL) or mandatory (computed from the duration).

Here is some skeleton DDL for a bid application to explain this better.

CREATE TABLE MyBids

(bid_nbr INTEGER NOT NULL,

previous_state VARCHAR(10) NOT NULL,

current_state VARCHAR(10) NOT NULL,

CONSTRAINT Improper_State_Change

FOREIGN KEY (previous_state, current_state)

REFERENCES StateChanges (previous_state, current_state),

start_date DATE NOT NULL PRIMARY KEY,

expiry_date DATE, -- null means still open.

CHECK (start_date <= expiry_date),

PRIMARY KEY (bid_nbr, start_date),

etc.

);

The DDL has a bid number as the primary key and a new column for the expiration date. Obviously the bid has to exist for a while, so add a constraint to keep the date order right.

CREATE TABLE StateChanges

(previous_state VARCHAR(10) NOT NULL,

current_state VARCHAR(10) NOT NULL,

PRIMARY KEY (previous_state, current_state),

state_duration INTEGER NOT NULL,

duration_type CHAR(1) DEFAULT 'O' NOT NULL

CHECK ('O', 'M')), -- optional, mandatory

etc.

);

The DDL for the state changes gets a new column to tell us if the duration is optional or mandatory. The insertion procedure is a bit trickier. The VALUES clause has more power than most programmers use. The list can be more than just constants or simple scalar variables. But using CASE expressions lets you avoid if-then-else procedural logic in the procedure body.

All it needs is the bid number and what state you want to use. If you don't give me a previous state, I assume that this is an initial row and repeat the current state you just gave me. If you don't give me a start date, I assume you want the current date. If you don't give me an expiration date, I construct one from the State Changes table with a scalar subquery. Here is the skeleton DDL for an insertion procedure.

CREATE PROCEDURE Bid_Status_Change

(@in_bid_nbr INTEGER,

@in_previous_state VARCHAR(10),

@in_current_state VARCHAR(10),

@in_start_date DATE,

@in_expiry_date DATE)

INSERT INTO MyBids (bid_nbr, previous_state, current_state, start_date, expiry_date)

VALUES (@in_bid_nbr, -- required

COALESCE (@in_previous_state, @in_current_state),

@in_current_state, -- required

COALESCE (@in_start_date, CAST (CURRENT_TIMESTAMP AS DATE),

(SELECT COALESCE (@in_expiry_date,

DATEADD(MM, @in_start_date, S.state_duration))

FROM StateChanges AS S

WHERE S.previous_state = COALESCE (@in_previous_state, @in_current_state)

AND S.current_state = @in_current_state

AND S.duration_type = 'M'))

);

There are other tricks to assure that there are no gaps between the rows using DDL, but that it another article.

This article has been viewed 6634 times.

Subject:	integrity of state_duration
Posted by:	AlexK (view profile)
Posted on:	Friday, October 15, 2010 at 12:16 PM
Message:	Hi Joe, This is a very interesting article, thanks a lot! I was wondering how can we guarantee the integrity of state_duration column - I believe it can be updated to an invalid value. Maybe state_duration could be a computed column calculated off of StateChangeDate and PreviousStateChangeDate? What do you think?

Subject:	I like YOUR temporal integrity model
Posted by:	Celko (view profile)
Posted on:	Friday, October 15, 2010 at 3:10 PM
Message:	I like YOUR temporal integrity model for contiguous time periods. If other readers have not seen it, you need to Google it. It has a start and ending time, with a third previous end time column (previous_end_time + 1 unit = current start time) and self-referenced rows. The only confusing part is that the first event in the chain violates the self-reference (it has no predecessor). You have to defer the constraint, get the first row into the table, then restore the self-reference. This is all DRI and it is beautiful. In the actual bid system in this article, Our unit is in hours (the company is global and uses UTC times). A request-for-bid is inspected for a time determined by a human reviewer, then it goes live. When the bid is alive, it changes state in (n1) hours to "send email warning to bidder, no responses" and will expire in (n2) hours without any human action. At this point, the requester can recede (expires now), maintain (expires as scheduled) or extend (expires (n) hours later) his offer's duration. Some state durations are computed from the state's start_time (computed columns, VIEW). Some durations are provided by a bidder ("My offer expires on 2010-12-31"). Either way, we get a expiry times for both the bid and the request. They are only good if they overlap. This can be modeled with only start times. But it is easy to do statistics if you have both a start and end time.

Subject:	please post link
Posted by:	Anonymous (not signed in)
Posted on:	Tuesday, October 19, 2010 at 6:11 AM
Message:	Googled temporal integrity model and contigous time periods... don't see what you are referring to.

Subject:	Why no triggers?
Posted by:	Max Stayner (not signed in)
Posted on:	Wednesday, October 20, 2010 at 8:35 AM
Message:	It seems that it would be (relatively) easy to define insert/update/delete triggers to go quite a long way to enforcing this type of constraint, but there's no mention of this. Is it because they are unreliable, or what? Perhaps it would be worth at least a discussion!

Subject:	No triggers because ..
Posted by:	Celko (view profile)
Posted on:	Monday, October 25, 2010 at 12:33 PM
Message:	1) Trigger are proprietary. In spite of the SQL/PSM standards, each vendor has a "code museum" syntax from the early days. SQL was a weak language put on top of existing file systems. You needed something to make up the gap and the solution was a simple proprietary language, like T-SQL. Most the time, we wrote trigger code to do what we now do with DRI actions. 2) Triggers do not talk to the optimizer A trigger is procedural code and cannot give the optimizer any help. DRI does. Furthermore, since DRI is Standard, it will be improved and kept from one release to the next. 3) Triggers apply to one and only one table. In the actual system that my MyBids example came from, the old system uses BIT flags and triggers. This means that determining if a bid is in a valid state requires some messy queries. The business rules are spread all over the schema. As an example, the only way to see if a bid had been paid and shipped is to look at the Invoices, the Accounts Receivable and the Shipments. Oh, did you also check the Returns? Was the bid retracted? Gathering the BIT flags showed us impossible states of being. A shipping program updates the flags in the Shipments table, but does not go back to the Bid to see that the "is_retracted" flag was set. Consider two triggers that enforce a rule about the time between event -- "this bid is open for 30 days" can mean that the bid expires on (bid_date + 30 DAYS) or (bid_date + 31 DAYS). It is no surprise that you have it both ways in the system. 4) Triggers require complicated logic that is hard to maintain. A state change table and the accompanying bid history table put everything in one place, one way, one time. Looking at the state change diagram for a simple bid, I can count tens states. A simple shipment has 17 states. It is not uncommon for a manufacturing process to have many more states for the "work in process" steps for a complex product. Trying to put these stuff into IF-THEN-ELSE logic is hard. You have to be sure that history moves forward in valid steps. That means the trigger is always having to make sure of the ordering (bake the bread before you slice it; slice it before you package it; etc) Adding a new state to all triggers that are effected can be really hard. But adding a few more rows to a single table is easy. For example, we originally had ("shipment completed => "Archived") as a transition. We added an optional "feedback for TSP" state, which required ("shipment completed" => "feedback for TSP") and ("feedback for TSP" => "Archived") rows in the state transition table.

Subject:	temporal integrity model is Storing intervals of time with no overlaps
Posted by:	Anonymous (not signed in)
Posted on:	Monday, October 25, 2010 at 9:53 PM
Message:	The temporal integrity model Joe is speaking about is named "Storing intervals of time with no overlaps". http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/03/08/storing-intervals-of-time-with-no-overlaps.aspx

Subject:	No Triggers Because... we now have SSD
Posted by:	BuggyFunBunny (view profile)
Posted on:	Tuesday, October 26, 2010 at 8:38 AM
Message:	Back when I first spent serious time with databases, it was with *nix/database ERP systems. There was lots o procedural code. The reason for this "denormalized" approach was that high normal form would be toooooooooooo sloooooooooooow (allegedly). Well, now that SSD are making significant headway into systems, 5NF is no longer "obviously" slower than the procedural alternative. Remember, the total cost of computation includes client data manipulation. Put it all on the server, and you get not only one fact, one place, one time; you also get much easier maintenance and fewer knucklehead client coders mewling. I'd say that's a win-win.

Subject:	Kuznetsov's temporal model
Posted by:	Celko (view profile)
Posted on:	Wednesday, October 27, 2010 at 9:17 AM
Message:	Last night (2010-10-26, I submitted a follow-up article on history tables and contiguous temporal periods. Hopefully, it will run soon.