Take the 2-minute tour ×
Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

I've been around a bunch of projects where statuses are stored in the db as a integer, then mapped in the model back to a human readable string:

class ExampleObject < ActiveRecord::Base
  STATUSES = {
    cool:     0,
    not_cool: 1,
    bad:      2
  }
  STATUSES_LOOKUP = STATUSES.invert
end

This approach comes with a lot of messy, error prone code. Removing, renaming and migrating objects from a status to another when the business model changes becomes a nightmare.

The question is, is there a real performance issue with just using a string instead of an integer?:

class ExampleObject < ActiveRecord::Base
  STATUSES = %w|cool not_cool bad|
end

I'm actually trying to convince myself there's enough reason to refactor the ball of spaghetti code I'm dealing with. We have no just one of this pick an option like fields, but three of them an they are strongly connected as in a option tree.

share|improve this question

closed as off-topic by RubberDuck, rolfl Oct 15 '14 at 10:35

This question appears to be off-topic. The users who voted to close gave this specific reason:

  • "Questions must involve real code that you own or maintain. Questions seeking an explanation of someone else's code are off-topic. Pseudocode, hypothetical code, or stub code should be replaced by a concrete example." – rolfl
If this question can be reworded to fit the rules in the help center, please edit the question.

2  
If you are using Rails >=4.1, you may want to take a look at ActiveRecord::Enum. –  toro2k Oct 14 '14 at 18:15
3  
Back in the olden days, we did this to save memory/disk, not for CPU efficiency. –  Barry Carter Oct 14 '14 at 18:53

1 Answer 1

up vote 3 down vote accepted

The options

Before picking, you should look at all the options.

Enums

For highly static definitions it's generally optimial to use an enumeration type (CREATE TYPE ... AS ENUM (...)), with the caveat that PostgreSQL does not currently support removing values from enumerations. You can add/insert new values though.

Enums are great if you're defining something you don't expect to ever change.

Apps can easily look up the allowed values for an enum at runtime, but it's more typical to just code the enum into the application.

Look-up tables

Integer-keyed look-up tables are generally fine. If the string keys are short, there isn't a big advantage to using integers instead.

It's often both more efficient and more readable to use a "char" key, which is PostgreSQL's extension type for a single character. The double-quotes are significant and required when naming this type. (I don't know if Rails supports "char" though). There's no detectable performance difference over integer. Note that this is different to char, (no quotes) which is a horrible legacy space-padded character type you should never use.

String-keyed lookup tables

You get a form of validation by using a key into a lookup table, but nothing forces that key to be a number rather than a short string. (Except possibly Rails, which is not noted for dealing well with tables that don't follow its convention of having a synthetic integer primary key.

It's perfectly valid to have a one-column lookup table with a string key.

Domains

With a string key you don't strictly even need the lookup table. You can also use a DOMAIN type, e.g.

CREATE DOMAIN mytype AS varchar(10) 
CONSTRAINT mytype_legal_values 
  CHECK (value IS NOT NULL AND value IN ('fred','bob','joe'));

Unlike an enum, you can re-define the constraint of a domain to remove currently-legal values and PostgreSQL will re-check all uses of the domain to make sure the change is OK before applying it.

It's difficult for an application to look up the list of allowed values for a domain at runtime.

What to choose?

Which option you choose depends on many factors:

  • How well the tools/frameworks you use work with each feature;
  • How often you expect the lookup/enum to change;
  • What you're comfortable with / personal preference

For example, many ORMs deal very poorly with domains and enums, often requiring custom type-handling code because they don't understand the general concept of "domain type" or "enumerated type". I can't speak for Rails specifically in this regard.

Performance

In pure performance terms ... it depends.

  • An integer (e.g. a key to a lookup table) is stored as a fixed width field in PostgreSQL, with no TOAST table. It's fast to compare. It doesn't need any encoding-aware collation handling, etc. It's fast and simple.

    However, for lookup table use you're incurring the cost of foreign key constraint lookups on inserts, and your selects have to do an additional join against the lookup table to get the "real" value, so it's far from free for write-heavy tables. Removing values is possible, but expensive due to the need to re-validate foreign key constraints.

  • An enum is just an integer with validation. You're not firing foreign key constraint triggers, so it's quite a lot faster for insert/update/delete. For SELECT there's just a cheap lookup of a cached system catalog table to get the name, and the extra data used to transfer the enum name instead of integer value. In practice it's the fastest option in almost all cases. However, PostgreSQL doesn't currently support dropping values from enums. (Patches welcome)

  • A string is stored as a varlena and may require a TOAST table. However, if it's always stored inline (which you can force by limiting the string's length and/or setting the column's storage type to "MAIN") then there's no extra lookup to get the value. PostgreSQL still has to do correct text-encoding collations on sorts, though, which can be significantly slower than sorting integers.

    You can mitigate the impact of collation by defining the column as COLLATE "C", forcing byte-oriented POSIX collations. I don't know whether Rails supports it natively, but you can tack the clause on as extra SQL, and it won't care about the column collation in day-to-day operation. You'll want a CHECK constraint to restrict the column to allowable values, e.g.

    CREATE TABLE mytable (
       ... blah ....,
       thekey varchar(5) 
         COLLATE "C" 
         CONSTRAINT thekey_allowed_value 
           CHECK (thekey IN ('fred','bob','joe')) 
         NOT NULL,
       ... blah ...
    );
    
    ALTER TABLE mytable ALTER COLUMN thekey SET STORAGE MAIN;
    

    so efficient string keys are a bit fiddlier, but not really problematic.

    If you use a foreign key to a string-keyed lookup table instead of the CHECK constraint you get a single central value store, but you then have all the disadvantages of a lookup table for insert/update foreign key check costs.

  • Domains over strings are just a convenience, and are identical in performance terms to using a string with a check constraint. They give you the advantage of central control over the allowed values list, without the need for a foreign key lookup. However, applications can't easily determine the allowed values list from a DOMAIN at runtime, unlike an enum or lookup table.

  • String keys in the default collation with no constraints can be inefficient. If the key is big it might get stored out-of-line compressed in a TOAST table, where an extra seek may be required to fetch the actual value. Additional CPU time is required to perform locale-aware collation calculations if you ORDER BY the key, but not if you simply test for equality.

    In practice the difference is generally minimal.

share|improve this answer
    
That's hell of an answer, thanks for your time. –  nicooga Oct 16 '14 at 20:23

Not the answer you're looking for? Browse other questions tagged or ask your own question.