Postgres regular expressions and regexp_split_to_array

Question

In postgresql, I need to extract the first two words in the value for a given column. So if the value is "hello world moon and stars" or "hello world moon" or even just "hello world", I need "hello world".

I was hoping to use regexp_split_to_array but it doesn't seem that I can use this and access the elements returned in the same query?

Do I need to create a function for what I'm trying to do?

VGe0rge · Accepted Answer · 2016-12-21 10:56:13Z

27

I can't believe that 5 years ago and no one noticed that you can access elements from regexp_split_to_array function if you surround them with parenthesis.

I saw many people tried to access the elements of the table like this:

select regexp_split_to_array(my_field, E'my_pattern')[1] from my_table

The previous will return an error, but the following will not :

select (regexp_split_to_array(my_field, E'my_pattern'))[1] from my_table

answered Dec 21, 2016 at 10:56

VGe0rge

1,1103 gold badges16 silver badges19 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

schweik Over a year ago

Very important note! I spent some hours with googling this. The problem is that both regexp_split_to_array() and regexp_matches() gives the result as an array even the regexp returns a single string only.

mu is too short · Accepted Answer · 2011-03-21 08:37:23Z

10

You can use POSIX regular expressions with PostgreSQL's substring():

select substring('hello world moon' from E'^\\w+\\s+\\w+');

Or with a very liberal interpretation of what a word is:

select substring('it''s a nice day' from E'^\\S+\\s+\\S+');

Note the \S (non-whitespace) instead of \w ("word" character, essentially alphanumeric plus underscore).

Don't forget all the extra quoting nonsense though:

The E'' to tell PostgreSQL that you're using extending escaping.
And then double backslashes to get single backslashes past the string parser and in to the regular expression parser.

If you really want to use regexp_split_to_array, then you can but the above quoting issues apply and I think you'd want to slice off just the first two elements of the array:

select (regexp_split_to_array('hello world moon', E'\\s+'))[1:2];

I'd guess that the escaping was causing some confusion; I usually end up adding backslashes until it works and then I pick it apart until I understand why I needed the number of backslashes that I ended up using. Or maybe the extra parentheses and array slicing syntax was an issue (it was for me but a bit of experimentation sorted it out).

edited Mar 21, 2011 at 8:37

answered Mar 21, 2011 at 0:54

mu is too short

436k71 gold badges862 silver badges821 bronze badges

7 Comments

Joey Adams Over a year ago

You can use the dollar-quoted string syntax to avoid the double backslashes: $$^\w+\s+\w+$$

mu is too short Over a year ago

@Joey Adams: But then you run into issues with regular expressions that are anchored at the end. I usually just use E'' syntax so that I don't have to remember yet another set of quoting rules.

user533832 Over a year ago

@mu yes, it's ugly and almost ridiculous, but still far easier IMO than escaping which is IMO a minefield - for anyone else reading these comments, just choose a pattern that doesn't clash with the end, eg: $re$^\w+\s+\w+$$re$

user533832 Over a year ago

@mu note that select substring(E'it\'s a nice day' from E'^\\w+\\s+\\w+'); probably does not give the desired result

mu is too short Over a year ago

@JackPDouglas: I think we're dangerously close to fighting over vi versus emacs now (for the record: ed(1) is the standard text editor; ed man! man ed!). I find it harder to find the regex in a twisted pile of dollar signs than to mentally undouble the backslashes. In the end, use whatever works and move on to more interesting problems :)

|

scriptThis · Accepted Answer · 2011-03-20 22:52:03Z

2

found one answer:

select split_part('hello world moon', ' ', 1) || ' ' || split_part('hello world moon', ' ', 2);

answered Mar 20, 2011 at 22:52

scriptThis

1811 gold badge1 silver badge6 bronze badges

Comments

score 1 · Accepted Answer · 2011-03-21 15:12:25Z

1

select substring(my_text from $$^\S+\s+\S+$$) from v;

  substring
-------------
 hello world
 hello world
 hello world
(3 rows)

where for the purpose of demonstration, v is:

create view v as select 'hello world moon and stars' as my_text union all 
                 select 'hello world mood' union all 
                 select 'hello world';

if you want to ignore whitespace at the beginning:

select substring(my_text from $$^\s*\S+\s+\S+$$) from v;

edited Mar 21, 2011 at 15:12

answered Mar 21, 2011 at 8:18

user533832

3 Comments

mu is too short Over a year ago

I still prefer the regexp_split_to_array(text, E'\\s+'))[1:2] approach, that seems to match the intent better than messing around with substring and what a "word" is.

user533832 Over a year ago

@mu I'd be concerned about how heavy that would be on my system when the strings are long. There is no reason to suppose from the question that there aren't tens of thousands of words in the field

mu is too short Over a year ago

True enough. Hard to say what the best approach would be without more details and, most importantly, experimentation on real data. I tend to start with clarity and a direct expression of my intent; if that isn't good enough, then it is time for cleverness and cycle bumming.

Collectives™ on Stack Overflow

Postgres regular expressions and regexp_split_to_array

4 Answers 4

1 Comment

7 Comments

Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

7 Comments

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related