Why is “Select * from table” considered bad practice

Question

Yesterday I was discussing with a "hobby" programmer (I myself am a professional programmer). We came across some of his work, and he said he always queries all columns in his database (even on/in production server/code).

I tried to convince him not to do so, but wasn't so successful yet. In my opinion a programmer should only query what is actually needed for the sake of "prettiness", efficiency and traffic. Am I mistaken with my view?

possible duplicate of Confusion about proper use of * wildcard in SQL — gnat, 2 days ago
Good reasons at SO - stackoverflow.com/questions/3180375/select-vs-select-column — Bratch, 2 days ago

gbjbaanb · Accepted Answer · 2014-04-03 12:45:45Z

up vote 44 down vote accepted

Think about what you're getting back, and how you bind those to variables in your code.

Now think what happens when someone updates the table schema to add (or remove) a column, even one you're not directly using.

Using select * when you're typing queries by hand is fine, not when you're writing queries for code.

edited 2 days ago

answered 2 days ago

gbjbaanb
16.3k12961

4

Performance, network load, etc. etc. are far more important than convenience of getting the columns back in the order and with the name you want. – jwenting 2 days ago

13

@jwenting really? performance matters more than correctness? Anyway, I don't see that "select *" performs better than selecting only the columns you want. – gbjbaanb 2 days ago

4

@Bratch, in real life production environments, you might have hundreds of applications using the same tables and there's no possible way all those applications can be maintained properly. You are correct in sentiment, but practically, the argument fails just due to the realities of working in copmanies. Schema changes to active tables happens all the time. – user1068 2 days ago

9

I don't understand the point in this answer. If you add a column to a table, both the SELECT * and the SELECT [Columns] will work, the only difference is that if the code needs to bind to the new column, the SELECT [Columns] will need to be modified whereas the SELECT * will not. If a column is removed from a table, the SELECT * will break at the point of binding, whereas the SELECT [Columns] will break when the query is executed. It seems to me that the SELECT * is the more flexible option as any changes to the table would only require changes to the binding. Am I missing something? – TallGuy 2 days ago

6

@gbjbaanb then access the columns by name. Anything else would be obviously stupid unless you specified the column order in the query. – immibis 2 days ago

show 12 more comments

MichaelT · Answer 2 · 2014-04-04 01:50:20Z

Schema Changes

Fetch by order --- If the code is fetching column # as the way to get the data, a change in the schema will cause the column numbers to readjust. This will mess up the application and bad things will happen.
Fetch by name --- If the code is fetching column by name such as foo, and another table in the query adds a column foo, the way this is handled may cause problems when trying to get the right foo column.

Either way, a schema change can cause problems with the extraction of the data.

Further consider if a column that was being used is removed from the table. The select * from ... still works but errors out when trying to pull the data out of the result set. If the column is specified in the query, the query will error out instead giving a clear indiciation as to what and where the problem is.

Data overhead

Some columns can have a significant amount of data associated with them. Selecting back * will pull all the data. Yep, here's that varchar(4096) thats on 1000 rows that you've selected back giving you an additional possible 4 megabytes of data that you're not needing, but is sent across the wire anyways.

Related to the schema change, that varchar might not exist there when you first created the table, but now its there.

Failure to convey intent

When you select back * and get 20 columns but only need 2 of them, you are not conveying the intent of the code. When looking at the query that does a select * one doesn't know what the important parts of it are. Can I change the query to use this other plan instead to make it faster by not including these columns? I don't know because the intent of what the query returns isn't clear.

Lets look at some SQL fiddles that explore those schema changes a bit more.

First, the initial database: http://sqlfiddle.com/#!2/a67dd/1

DDL:

create table one (oneid int, data int, twoid int);
create table two (twoid int, other int);

insert into one values (1, 42, 2);
insert into two values (2, 43);

SQL:

select * from one join two on (one.twoid = two.twoid);

And the columns you get back are oneid=1, data=42, twoiid=2, and other=43.

Now, what happens if I add a column to table one? http://sqlfiddle.com/#!2/cd0b0/1

alter table one add column other text;

update one set other = 'foo';

And my results from the same query as before are oneid=1, data=42, twoiid=2, and other=foo.

A change in one of the tables disrupts the values of a select * and suddenly your binding of 'other' to an int is going to throw an error and you don't know why.

If instead your SQL statement was

select 
    one.oneid, one.data, two.twoid, two.other
from one join two on (one.twoid = two.twoid);

The change to table one would not have disrupted your data. That query runs the same before the change and after the change.

This should have been the selected answer, since it's more complete than the selected one. — Mark Freedman, 2 days ago
+1 I agree this is a better answer. It EXPLAINS why in stead of just HINTING at why. — Tonny, 2 days ago
Breaking everything by adding a column is also a good reason why code should always access columns in a datareader by name not by hard-coded ordinal... — Julia Hayward, yesterday
@gbjbaanb It is to me. But a lot of people come to write SQL queries without a formal background/training. To them is may not be obvious. — Tonny, yesterday

Konrad Morawski · Answer 3 · 2014-04-03 13:08:25Z

up vote 32 down vote

Another concern: if it's a JOIN query and you're retrieving query results into an associative array (as could be the case in PHP), it's bug-prone.

The thing is that

if table foo has columns id and name
if table bar has columns id and address,
and in your code you are using SELECT * FROM foo JOIN bar ON foo.id = bar.id

guess what happens when someone adds a column name to the bar table.

The code will suddenly stop working properly, because now the name column appears in the results twice and if you're storing the results into an array, data from second name (bar.name) will overwrite the first name (foo.name)!

It's quite a nasty bug because it's very non-obvious. It can take a while to figure out, and there's no way the person adding another column to the table could have anticipated such undesirable side effect.

(True story).

So, don't use *, be in control of what columns you are retrieving and use aliases where appropriate.

edited 2 days ago

answered 2 days ago

Konrad Morawski
4,45031438

okay in this case (which i consider sort of rare) it could be a major issue. But you could still avoid(and most people probably will) it by querying with the wildcard and just add an alias for the identical column names. – the baconing 2 days ago

3

In theory, but if you use a wildcard for convenience you rely on it to automatically give you all columns in existence and never bother to update the query as the tables grow. If you are specifying each and every column, you are forced to go to the query to add another one to your SELECT clause and this is when you hopefully spot the name is not unique. BTW I don't think it's so rare in systems with large databases. As I said, I once spent a couple of hours hunting this bug in a big mudball of PHP code. And I found another case just now: stackoverflow.com/q/17715049/168719 – Konrad Morawski 2 days ago

2

I spend an hour last week trying to get this through a a consultants head. He is supposed to be a SQL guru... Sigh... – Tonny 2 days ago

add comment

user116777 · Answer 4 · 2014-04-03 12:27:18Z

Querying every column might be perfectly legitimate, in many cases.

Always querying every column isn't.

It's more work for your database engine, which has to go off and rummage around its internal metadata to work out which columns it needs to deal with before it can get on with the real business of actually getting the data and sending it back to you. OK, it's not the biggest overhead in the world, but system catalogs can be an appreciable bottleneck.

It's more work for your network, because you're pulling back any number of fields when you might only want one or two of them. If somebody [else] goes and adds a couple of dozen extra fields, all of which contains big chunks of text, you're throughput suddenly goes through the floor - for no readily apparent reason. This is made worse if your "where" clause isn't particularly good and you're pulling back lots of rows as well - that's potentially a lot of data tromping its way across the network to you (i.e. it's going to be slow).

It's more work for your application, having to pull back and store all of this extra data that it quite probably doesn't care about.

You run the risk of columns changing their order. OK, you shouldn't have to worry about this (and you won't if you select only the columns you need) but, if you go get them all at once and somebody [else] decides to rearrange the column order within the table, that carefully crafted, CSV export that you give to accounts down the hall suddenly goes all to pot - again, for no readily apparent reason.

BTW, I've said "someone [else]" a couple of times, above. Remember that databases are inherently multi-user; you may not have the control over them that you think you do.

I would think that always querying every column may be legitimate for things like schema-agnostic table viewing facilities. Not a terribly common situation, but in the context of internal-use-only tools such things can be handy. — supercat, 2 days ago
@supercat That is just about the ONLY valid use-case for a "SELECT *" that I can think of. And even then I would prefer to limit the query to "SELECT TOP 10 *" (in MS SQL) or add "LIMIT 10" (mySQL) or add "WHERE ROWNUM <=10" (Oracle). Usually in that case it's more about "what columns are there and some sample data" than the complete content. — Tonny, 2 days ago
@Tonny: SQL Server changed their default scripts to add the TOP limitation; I'm not sure how important that is if code reads as many as it cares to display and then disposes the query. I think query responses are processed somewhat lazily, though I don't know the details. In any case, I think that rather than saying it "isn't legitimate", it would be better to say "...is legitimate in far fewer"; basically, I'd summarize the legitimate cases as those where the user would have a better idea what's meaningful than the programmer. — supercat, 2 days ago
@supercat I can agree to that. And I really like the way you put it in your last sentence. I have to remember that one. — Tonny, 2 days ago

lorenzog · Answer 5 · 2014-04-03 12:08:45Z

The short answer is: it depends on what database they use. Relational databases are optimized for extracting the data you need in a fast, reliable and atomic way. On large datasets and complex queries it's much faster and probablly safer than SELECTing * and do the equivalent of joins on the 'code' side. Key-value stores might not have such functionalities implemented, or might not be mature enough to use in production.

That said, you can still populate whatever data structure you're using with SELECT * and work out the rest in code but you'll find performance bottlenecks if you want to scale.

The closest comparison is sorting data: you can use quicksort or bubblesort and the result will be correct. But won't be optimized, and definitely will have issues when you introduce concurrency and need to sort atomically.

Of course, it's cheaper to add RAM and CPUs than investing in a programmer that can do SQL queries and has even a vague understanding of what a JOIN is.

Learn SQL! It isn't that hard. It is the "native" language of databases far and wide. It's powerful. It's elegant. It has stood the test of time. And there's no way you're going to write a join on the "code" side that's more efficient than the join in the database, unless you are really inept at doing SQL joins. Consider that in order to do a "code join," you have to pull all the data from both tables in even a simple 2-table join. Or are you pulling index stats and using those to decide which table data to pull before you join? Didn't think so... Learn to use the database correctly, people. — Craig, 10 hours ago

ratchet freak · Answer 6 · 2014-04-03 12:02:36Z

because if the table gets new columns then you get all those even when you don't need them. with varchars this can become a lot of extra data that needs to travel from the DB

some DB optimizations may also extract the non fixed length records to a separate file to speed up access to the fixed length parts, using select* defeats the purpose of that

zkent · Answer 7 · 2014-04-03 18:44:14Z

IMO, its about being explicit vs implicit. When I write code, I want it to work because I made it work, not just because all of the parts just happen to be there. If you query all records and your code works, then you'll have the tendency to move on. Later on if something changes and now your code doesn't work, its a royal pain to debug lots of queries and functions looking for a value that should be there and the only values reference are *.

Also in an N-tiered approach, its still best to isolate database schema disruptions to the data tier. If your data tier is passing * to the business logic and most likely on the the presentation tier, you are expanding your debugging scope exponentially.

This is probably one of the most important reasons here, and it's got just a tiny fraction of the votes. The maintainability of a codebase littered with select * is much worse! — Eamon Nerbonne, 14 hours ago

dj bazzie wazzie · Answer 8 · 2014-04-03 12:27:48Z

up vote 1 down vote

Apart from overhead, something you want to avoid in the first place, I would say that as an programmer you don't depend on column order defined by the database administrator. You select each column even if you need them all.

answered 2 days ago

dj bazzie wazzie
1723

3

Agree, though I'd also recommend pulling out values from a result set by column name in any case. – Rory Hunter 2 days ago

Seconded, carried. Use the column names, do not depend on the column order. The column order is a brittle dependency. The names should have (you hope) been derived from some actual design effort, or you explicitly alias composite columns or computations or conflicting column names in your query, and reference the explicit alias that you specified. But relying on the order is pretty much just duct tape and prayer... – Craig 10 hours ago

add comment

m3th0dman · Answer 9 · 2014-04-03 13:43:48Z

I don't see any reason why you shouldn't use for the purpose it's build - retrieve all the columns from a database. I see three cases:

A column is added in the database and you want it in code also. a) With * will fail with a proper message. b) Without * will work, but won't do what you expect which is pretty bad.
A column is added in database and you do not want it in code. a) With * will fail; this means that * does no longer applies since it's semantics means "retrieve all". b) Without * will work.
A column is removed Code will fail either way.

Now the most common case is case 1 (since you used * which means all you most probably want all); without * you can have code that works fine but doesn't do what expected which is much, much worst that code that fails with a proper error message.

I'm not taking into consideration the code which retrieves the column data based on column index which is error-prone in my opinion. It's much more logic to retrieve it based on column name.

MichaelT · Answer 10 · 2014-04-03 19:31:52Z

Think of it this way... if you query all columns from a table that has just a few small string or numeric fields, that total 100k of data. Bad practice, but it will perform. Now add a single field that holds, say, an image or a 10mb word document. now your fast performing query immediately and mysteriously start performing poorly, just because a field was added to the table... you may not need that huge data element, but because you've done Select * from Table you get it anyway.

this seems to merely repeat point already made few hours ago in a first answer and in couple of other answers — gnat, 2 days ago

asked	2 days ago
viewed	6132 times
active	yesterday

current community

your communities

more stack exchange communities

Why is “Select * from table” considered bad practice

10 Answers

Schema Changes

Data overhead

Failure to convey intent

protected by Thomas Owens♦ yesterday

Not the answer you're looking for? Browse other questions tagged web-development database sql mysql bad-code or ask your own question.

Linked

Hot Network Questions