11

Is it possible to create an index on a column of type text array? I tried using GIN indexes, but queries do not seem to be using them.

-- Example:
CREATE TABLE users (
   name VARCHAR(100),
   groups TEXT[]
);

-- Query: 
SELECT name FROM users WHERE 'Engineering' = ANY(groups);

Also, what is the most efficient way to perform a GROUP BY on the groups column, so that it returns each group together with its count?

2
  • 3
    What is preventing you from normalization? Feb 9, 2011 at 15:19
The actual table has multiple multi-valued columns. The DB is relatively normalized except for these columns, where I wanted to avoid excessive joins because they were slowing down performance. The 'users' table I used is just an example.
    – Anoop
    Feb 9, 2011 at 15:41

3 Answers

7

A GIN index can be used:

CREATE TABLE users (
 name VARCHAR(100),
 groups text[]
);

CREATE INDEX idx_users ON users USING GIN(groups);

-- disable sequential scan in this test:
SET enable_seqscan TO off;

EXPLAIN ANALYZE
SELECT name FROM users WHERE  groups @> (ARRAY['Engineering']);

Result:

"Bitmap Heap Scan on users  (cost=4.26..8.27 rows=1 width=218) (actual time=0.021..0.021 rows=0 loops=1)"
"  Recheck Cond: (groups @> '{Engineering}'::text[])"
"  ->  Bitmap Index Scan on idx_users  (cost=0.00..4.26 rows=1 width=0) (actual time=0.016..0.016 rows=0 loops=1)"
"        Index Cond: (groups @> '{Engineering}'::text[])"
"Total runtime: 0.074 ms"

Using aggregate functions on an array is another problem; the function unnest() might help.
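
As a rough sketch (not part of the original answer), unnest() can expand each array into one row per element, after which an ordinary GROUP BY gives the per-group counts the question asks for:

-- Expand each groups array into rows, then count members per group.
-- The alias group_name is just for illustration.
SELECT group_name, count(*) AS member_count
FROM (SELECT unnest(groups) AS group_name FROM users) AS g
GROUP BY group_name;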

Why don't you normalize your data? That will fix all problems, including many problems you haven't encountered yet.

4
I think I didn't use ARRAY['Engineering'] in the query, and as a result the GIN index never got used. Regarding normalization, the actual table and use case are different from what I mentioned. The actual table has multiple multi-valued columns represented as arrays of strings. I was trying to avoid multiple joins, as the expected number of records per table is in the millions.
    – Anoop
    Feb 9, 2011 at 15:35
  • 1
I think the operator @> makes all the difference; it looks like ANY() can't use the index (see the sketch after these comments). Feb 9, 2011 at 15:39
  • Specifically when you expect large intermediate and/or result tables you should make your data as accessible to the DBMS as you can. Normalization allows it to work with a reference in several cases where otherwise you'd work with values. Feb 9, 2011 at 15:46
Using EXPLAIN I see no difference between the operator @> and any(); both give an identical plan: explain select count(*) from kwetsbaar.kwo where bijztype @> (array['WON'])::varchar[]; Finalize Aggregate (cost=505166.56..505166.57 rows=1 width=8) -> Gather (cost=505166.14..505166.55 rows=4 width=8) Workers Planned: 4 -> Partial Aggregate (cost=504166.14..504166.15 rows=1 width=8) -> Parallel Seq Scan on kwo (cost=0.00..500662.60 rows=1401414 width=0) Filter: (bijztype @> '{WON}'::character varying[]) with a GIN index and 8 million rows.
    – Jan
    Feb 3, 2022 at 15:53
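
To make the distinction from these comments concrete, here is a sketch (not part of the original thread) of the two query forms against the example table; only the containment form can use the GIN index:

-- Can use the GIN index (array containment operator):
SELECT name FROM users WHERE groups @> ARRAY['Engineering'];

-- Cannot use the GIN index (per-element equality):
SELECT name FROM users WHERE 'Engineering' = ANY(groups);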
1

I think the best way to handle this would be to normalize your model. The following will probably contain errors as I didn't try it, but the idea should be clear:

CREATE TABLE users (id INTEGER PRIMARY KEY, name VARCHAR(100) UNIQUE);
CREATE TABLE groups (id INTEGER PRIMARY KEY, name VARCHAR(100) UNIQUE);
CREATE TABLE user_group (
    -- "user" and "group" are reserved words, so the columns are named user_id/group_id
    user_id INTEGER NOT NULL REFERENCES users,
    group_id INTEGER NOT NULL REFERENCES groups);
CREATE UNIQUE INDEX user_group_unique ON user_group (user_id, group_id);

SELECT users.name
    FROM user_group
    INNER JOIN users ON user_group.user_id = users.id
    INNER JOIN groups ON user_group.group_id = groups.id
    WHERE groups.name = 'Engineering';

The resulting execution plan should be fairly efficient already; you can optimize further by indexing user_group (group_id), which allows an index scan rather than a sequential scan to find the members of a particular group.
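
For example, a sketch of that index (the index name is arbitrary; column names follow the corrected schema above):

CREATE INDEX user_group_group_idx ON user_group (group_id);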

1
I wanted to avoid normalization for this particular use case. That is the reason why I was forced to use the array-of-strings data type.
    – Anoop
    Feb 9, 2011 at 15:32
0

There is an array_to_tsvector() function as of PostgreSQL 9.6.

https://pgpedia.info/a/array_to_tsvector.html

So it may be quite efficient to create tsvectors and use a GIN index on them.
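
A minimal sketch of that idea, assuming PostgreSQL 9.6+ and the users table from the question (the index name is hypothetical; note that array_to_tsvector() keeps the array elements as unstemmed lexemes, so the search term must match exactly):

-- Expression index over the array converted to a tsvector:
CREATE INDEX idx_users_groups_tsv ON users USING GIN (array_to_tsvector(groups));

-- Query with the same expression so the index can be used:
SELECT name FROM users WHERE array_to_tsvector(groups) @@ 'Engineering'::tsquery;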

