I would have thought that databases would know enough about what they encounter often, and would be able to respond to the demands they're placed under, that they could decide to add indexes to highly requested data.
Index design is more of an art than a science. The RDBMS isn't smart enough to take common workloads and design a smart indexing strategy from them; it is up to human intervention (read: a DBA) to analyze the workload and determine the best approach. If there were no penalty for having indexes, you could take a shotgun approach and just add an unlimited number of them. But because data modifications (INSERTs, UPDATEs, and DELETEs) have to maintain every enabled index on a table, each index carries a variable overhead. It takes human design and strategy to create indexes that maximize read performance while incurring the least data-modification overhead.
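To make that trade-off concrete, here is a minimal T-SQL sketch (the table, columns, and index names are made up for illustration): the index makes one read pattern cheaper, but every write against the table must now maintain it as well.

```sql
-- Hypothetical table: every index added to it must be maintained by
-- every INSERT/UPDATE/DELETE against it.
CREATE TABLE dbo.Orders (
    OrderId     int IDENTITY PRIMARY KEY,
    CustomerId  int           NOT NULL,
    OrderDate   datetime2     NOT NULL,
    Total       decimal(10,2) NOT NULL
);

-- Helps queries that filter by customer...
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId) INCLUDE (OrderDate, Total);

-- ...so this read gets cheaper:
SELECT OrderDate, Total
FROM   dbo.Orders
WHERE  CustomerId = 42;

-- ...but this write now has to update the base table AND the index,
-- which is exactly the data-modification overhead described above.
INSERT INTO dbo.Orders (CustomerId, OrderDate, Total)
VALUES (42, SYSDATETIME(), 99.95);
```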
Some databases do already (kind of) create indexes automatically. In SQL Server the execution plan can sometimes include an Index Spool operator, where the RDBMS dynamically creates an indexed copy of the data. However, this spool is not a persistent part of the database kept in sync with the source data, and it cannot be shared between query executions, meaning execution of such plans may end up repeatedly creating and dropping temporary indexes on the same data.

Perhaps in the future RDBMSs will have the capacity to dynamically drop and create persistent indexes according to workload. The process of index optimisation is, in the end, just a cost-benefit analysis. Whilst it is true that humans may have more information about the relative importance of queries in a workload, in principle there is no reason why this information could not be made available to the optimiser. SQL Server already has a Resource Governor that allows sessions to be classified into different workload groups with different resource allocations according to priority.

The missing index DMVs mentioned by Kenneth are not intended to be implemented blindly, as they only consider the benefit to a specific query and make no attempt to take account of the cost of the potential index to other queries. Nor do they consolidate similar missing indexes: e.g. the output of these DMVs may report several overlapping missing indexes on the same table rather than one consolidated recommendation. (A sketch of a query over these DMVs appears below.)

Some current issues with the idea are:
It is probably reasonable to expect the accuracy of costing models to improve over time, but point 2 looks trickier to solve and point 3 is inherently insoluble. Nevertheless, probably the vast majority of installs are not in this idealised situation, with skilled staff who continuously monitor, diagnose, and anticipate (or at least react to) changes in workloads.
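For reference, this is roughly how those missing index DMVs can be inspected in SQL Server. It is only a sketch, and (per the caveats above) its output is not meant to be implemented blindly:

```sql
-- Missing index suggestions recorded by the optimiser since the last restart.
-- The benefit figures only reflect the queries that triggered each suggestion;
-- they say nothing about the cost of the index to other statements.
SELECT  d.statement             AS table_name,
        d.equality_columns,
        d.inequality_columns,
        d.included_columns,
        s.user_seeks,
        s.avg_total_user_cost,
        s.avg_user_impact       -- estimated % improvement for the triggering queries
FROM    sys.dm_db_missing_index_details     AS d
JOIN    sys.dm_db_missing_index_groups      AS g ON g.index_handle = d.index_handle
JOIN    sys.dm_db_missing_index_group_stats AS s ON s.group_handle = g.index_group_handle
ORDER BY s.avg_total_user_cost * s.avg_user_impact * s.user_seeks DESC;
```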
The AutoAdmin project at Microsoft Research has been running since 1996. The project home page lists several intriguing projects; one is particularly relevant to the question here:
The authors state:
The paper introduces an algorithm:
The implementation of the algorithm allows for throttling in response to changes in server load, and it can also abort index creation if, during creation, the workload changes and the expected benefit falls below the point at which it is deemed worthwhile. The authors' conclusion on the topic of online versus traditional physical tuning:
The conclusions here are similar to those in another paper, Autonomous Query-driven Index Tuning.
In fact, there are some databases that do this. For example, Google's BigTable and Amazon's SimpleDB automatically create indices (though neither is an RDBMS). There is also at least one MySQL engine that does this. SQL Server also keeps track of indices it thinks you should create, though it doesn't go so far as actually creating them.

The problem is surprisingly difficult to get right, so it's no wonder that most databases don't create indices automatically (BigTable/SimpleDB get away with it because they don't allow arbitrary joins, which makes things significantly easier). Also, creating indices on the fly is a time-consuming process that requires exclusive access to the entire table - definitely not something you want happening while the table is online.

However, given the number of LAMP web applications out there that were written by amateurs who don't even know what an index is, I still think this feature could be beneficial, at least for MySQL.
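To illustrate that locking cost, a small T-SQL sketch (the table and index names are made up; the ONLINE option is only available in certain SQL Server editions):

```sql
-- Default (offline) build: the table is locked for the duration of the build,
-- blocking writes - and, for a clustered index, reads as well.
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate);

-- Online build (an Enterprise/Azure feature): concurrent reads and writes are
-- allowed, at the cost of extra overhead and a longer build time.
CREATE NONCLUSTERED INDEX IX_Orders_Total
    ON dbo.Orders (Total)
    WITH (ONLINE = ON);
```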
While there are some extensive answers already, they seem to skirt around the real answer: indexes aren't always desirable. To use the car analogy mentioned in the comments, you'd be better off asking: why aren't all cars fitted with extreme sports packages? Partly it's expense, but it's also down to the fact that a lot of people don't need or want low-profile tires and rock-hard suspension; it's unnecessarily uncomfortable.

So maybe you have 1,000 reads for every insert - why not have an auto-created index? If the table is wide and the queries are varied, why not have several? Maybe the commit is time-critical and the reads aren't; in those circumstances it might be unacceptable to slow down your insert. Maybe you're working with limited disk space and you can't afford to have additional indexes eating into the space you've got.

The point is, indexes aren't created automatically because they aren't the answer to everything. Designing indexes isn't simply a case of saying "hey, this will speed up my reads"; there are other factors to consider.
They can analyse past queries and suggest or create indexes; however, this doesn't work optimally, because indexes strike a balance (speeding up what you want optimised, at a cost) and the server cannot know your intentions.
UNIQUE constraints. – dan04 Jun 4 at 16:20