PATINDEX Workbench

Av rating:
Total votes: 30
Total comments: 13

Downloads

PATINDEX Workbench

12 May 2011

/*The PATINDEX function of SQL Server packs powerful magic, but it is easy to get it wrong. Phil Factor returns to the Workbench format to give a tutorial of examples, samples and cookbook ideas to demonstrate the ways that this underrated function can be of practical use. It is intended to be pasted into SSMS and used as a basis for experiment.*/

/* The PATINDEX function is a treasure, and it is the obvious way in SQL Server to locate text in strings. It uses the same wildcard pattern as the LIKE operator, works as fast, but gives back more information. It is closest in functionality to CHARINDEX. People worry that it does not allow a full RegEx, but that is missing the point. It will do useful string searching, as I'll show you, and it is optimized for performance in a database. A RegEx search performs a different service and is slower at doing those routine jobs that PATINDEX is intended for.

It is great for quick checks on your data. Here, we do an elementary check on a stored IP address */

SELECT PATINDEX ('%.%.%.%','23.245.6.49');--returns with the index of the first dot, or 0 if there aren't three in the string

SELECT PATINDEX ('%.%.%.%','23.245.6,49');--returns 0

/* So here, the % 'wildcard' character has a special meaning, which is 'any number of any character'. BOL puts this as 'Any string of zero or more characters.' For some reason, most developers seem to think that there is a rule that you can only use them at the start and/or end of a pattern. Not true. You can use them anywhere, as many as you like, but not adjacent as all but the first are ignored. The meaning 'One of any character' is denoted by the underscore wildcard character, '_'. So what does the IP check we have shown you comprise? Nothing more than counting that there are three dots in the string. We can do more but it gets rather more complicated.

note that, because we are just testing for existence, and aren't going to extract it, we could also use LIKE */

SELECT case WHEN '23.245.6.49' LIKE '%.%.%.%' THEN 1 ELSE 0 END;

/*What if you want to specify the % character instead of using it in its wildcard meaning? Simple. Do this [%] like here.*/

SELECT PATINDEX ('%[%]%','You have scored 90% in your exam');

/*and the same applies to the other wildcard characters*/

SELECT PATINDEX (

'%[_][%][[]%',

'You have scored 90% in your exam and shouted "D_%[]!"');

Here are a few other more practical examples of using the '%' wildcard.

Select a list of all objects in the current database whose name begins with 'sp' and ends with 'table' (case insensitive)*/

SELECT name FROM sys.objects

WHERE PATINDEX('sp%Table',NAME COLLATE Latin1_General_CI_AI)>0;

/*Select all objects (not tables!) in the current database which have the following three words FROM WHERE and ORDER in that order with or without gaps in between*/

SELECT name FROM sys.Objects

WHERE PATINDEX ('%FROM%WHERE%AND%',object_definition(object_ID))>0;

/*List all LATIN collations that are case and accent insensitive (ignoring case) */

SELECT name,description FROM fn_helpcollations()

WHERE PATINDEX ('latin%case-insensitive%accent-insensitive%',

[description] COLLATE Latin1_General_CI_AI)>0;

/* you'll notice that we need to use the COLLATE keyword to enforce the type of search. It may not make a difference if your database is already set to a suitable collation, but then database collation can change!

This becomes more important if we specify the range of allowable characters.

We can specify the range of characters to look for by listing them between angle-brackets. For example, in our IP search, we can do an obvious improvement by making sure there are at least one number before the dots! */

SELECT PATINDEX ('%[0-9].%[0-9].%[0-9].%','278.2.6.49');--returns with the index of the number before the first dot or 0 if there aren't three in the string

SELECT PATINDEX ('%[0-9].%[0-9].%[0-9].%','278.A.6.49');--returns 0

/*Nice, but you can do it with the LIKE command. Where PATINDEX scores is where you need to extract the information.

We can easily pick up a number embedded in other characters

if we were lucky to be given the task of picking up three-digit numbers from a string, (or zero if there aren't any) that is ridiculously easy and can be done inline within a SQL query */

DECLARE @SampleString VARCHAR(255)

SELECT @SampleString=' the current valve weight is not 56 mg as before, but 067 milligrams';

SELECT SUBSTRING(

@SampleString+'000 ',--put the default on the end

PATINDEX('%[^0-9][0-9][0-9][0-9][^0-9]%',@SampleString+'000 ')+1,

3); --three characters

-- 067

/* See what we've done? We've added a default at the end so that we don't have to cope with passing back a 0 from the PATINDEX when it hits a string without the correctly formatted number in it. We look for the transition between a character that isn't numeric, to one that is. Then we look for three valid consecutive numbers followed by a character that isn't a number.

The same technique can be used where you want to trim off whitespace before or after a string. You might think that RTRIM and LTRIM do this but they are slightly broken, in that they only trim off the space character. What about linebreaks or tabs? */

Declare @PaddedString VARCHAR(255), @MatchPattern VARCHAR(20)

SELECT @PaddedString='

Basically we just want this

@Matchpattern='%[^' + CHAR(0)+'- ]%';

/* This match pattern looks for the first occurrence in the string of a character that isn't a control character. You'll need to specify a binary sort-order to be certain that this works, so we use the COLLATE clause to specify that we want a binary collation that understands that control characters range from 0 to 32 (space character) We'd normally want to add all the other space characters such as non-break space.*/

--now this will find the index of the start of the string

SELECT PATINDEX(@Matchpattern,@PaddedString collate SQL_Latin1_General_CP850_Bin);

And we can easily then use this to create a function that really trims a string.

IF OBJECT_ID(N'Trim') IS NOT NULL

DROP FUNCTION Trim

CREATE FUNCTION Trim

/**

summary: >

This procedure returns a string with all leading and trailing blank space removed. It is similar to the TRIM functions in most current computer languages. You can change the value of the string assigned to @BlankRange, which is then used by the PATINDEX function. The string can be a rangee.g. a-g or a list of characters such as abcdefg.

Author: Phil Factor

Revision: 1.1 changed list of control character to neater range.

Revision: 1.2 added explicit collation.

date: 28 Jan 2011

example:

- code: dbo.Trim(' 678ABC ')

- code: dbo.Trim(' This has leading and trailing spaces ')

- code: dbo.Trim(' left-Trim This')

- code: dbo.Trim('Right-Trim This ')

returns: >

Input string without trailing or leading blank characters, however these characters are defined in @BlankRange

**/ (@String VARCHAR(MAX))

RETURNS VARCHAR(MAX)

BEGIN

DECLARE @BlankRange CHAR(255),

@FirstNonBlank INT,

@LastNonBlank INT

IF @String IS NULL

RETURN NULL--filter out null strings

SELECT @BlankRange = CHAR(0)+'- '+CHAR(160)

/* here is where you set your definition of what constitutes a blank character. We've just chosen every 'control' character, the space character and the non-breaking space. Your requirements could be different!*/

SELECT @FirstNonBlank = PATINDEX(

'%[^' + @BlankRange + ']%',

@String collate SQL_Latin1_General_CP850_Bin)

SELECT @lastNonBlank = 1

+ LEN(@String + '|')

- (PATINDEX(

'%[^' + @BlankRange + ']%',

REVERSE(@String) collate SQL_Latin1_General_CP850_Bin))

IF @FirstNonBlank > 0

RETURN SUBSTRING(@String,@FirstNonBlank, @LastNonBlank-@firstNonBlank)

RETURN '' --nothing would be left

END;

/* PATINDEX allows you to do some subtle things such as chopping strings into words. We'll start by doing something really simple such as chopping up a serialised string of numbers into a table of numbers */

DECLARE @string VARCHAR(255),@start INT, @LenString INT, @End int

DECLARE @numbers table (number NUMERIC (9,4))

SET NOCOUNT on

SELECT @String=' 23 455 5.789 45.0 67 06978 000 ',

@Start=PATINDEX( '%[^0-9.][0-9.]%',@string),

@LenString=LEN(@string+'|')-1

WHILE @start<@lenString

begin

SELECT @end=PATINDEX(

'%[0-9.][^0-9.]%',

SUBSTRING(@String,@Start,@LenString)+' ')

insert INTO @numbers (number) SELECT SUBSTRING(@String,@start,@end)

SELECT @Start=@start

+@End

+PATINDEX(

'%[^0-9.][0-9.]%',

SUBSTRING(@String,@start+@end,@LenString)+'0')

end;

SELECT * FROM @numbers;

/* ALL we're doing here is defining the range of valid character within a number, anything from 0 to 9, or a dot in this instance, and looking for the transitions between 'number and 'not-number' */

/* Let's try something a bit trickier, and closer to a real chore. Let's find a UK postcode. (apologies to all other nations who are reading this)

The validation rules are that the length must be between 6 and 8 characters of which one is a space. This divides the three-character local code to the right of the space from the sorting-office code to the left of the space. The local characters are always a numeric character followed by two alphabetic characters. The Sorting Office code the left of the gap, can be between 2 and 4 characters and the first character must be alpha.

Before you get too excited, I must point out the the postcode validation is more complex. We can't use it because PATINDEX uses only wildcards and hasn't the OR expression or the iterators. We can do quite well though...*/

SELECT PATINDEX('[A-Z][A-Z0-9]% [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]','CB4 0WZ');

--0 if invalid. >0, 1 if valid

/*This needs some explanation. the '[' and ']' brackets enclose a list of characters. you can list them, without commas, or specify a range. Here, in the last two expressions, we have done both. [ABD-HJLNP-UW-Z] is a shorthand for [ABDEFGHJLNPQRSTUWXYZ]. This looks laborious, but works speedily, since SQL Server works hard to optimize LIKE and PATINDEX

This does everything but validate that the sorting office code is between two and four characters. If you were determined to do this, you'd need to run three checks*/

-- '[A-Z][A-Z0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'

-- '[A-Z][A-Z0-9]_ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'

-- '[A-Z][A-Z0-9]__ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'

--Which, if we just wanted to check the validity, we can do in a number of ways

Select case WHEN 'CB4 0WZ' like '[A-Z][A-Z0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'

or 'CB4 0WZ' like '[A-Z][A-Z0-9]_ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'

or 'CB4 0WZ' like '[A-Z][A-Z0-9]__ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]' then 1 else 0 end;

--With 2008, we can do this

SELECT MAX(PATINDEX([matched],'CB4 0WZ'))

FROM (VALUES

('[A-Z][A-Z0-9] [0-9][A-Z][A-Z]')

,('[A-Z][A-Z0-9]_ [0-9][A-Z][A-Z]')

,('[A-Z][A-Z0-9]__ [0-9][A-Z][A-Z]')) AS f([Matched]);

--0 if invalid. >0, 1 if valid

/* What if you wanted to do the more common chore of extracting the postcode from an address-line and putting it in its own field? This is where you have to stop using LIKE as it won't cut the mustard. If you are of a nervous disposition in your SQL-writing please turn away now.*/

Select stuff([address],start+1,length-start-fromend,''),

Substring([address],start,length-start-fromend)

from

(--we have a derived table with the results we need for the chopping

SELECT MAX(PATINDEX([matched],[address])) as start,

MAX(PATINDEX([ReverseMatch],reverse([address]+' ')))-1 as fromEnd,

len([address]+'|') as [length],

[Address]

FROM (VALUES--first the forward match, then the reverse match

('% [A-Z][A-Z0-9] [0-9][A-Z][A-Z]%', '%[A-Z][A-Z][0-9] [A-Z0-9][A-Z] %' )

,('% [A-Z][A-Z0-9][A-Z0-9] [0-9][A-Z][A-Z]%', '%[A-Z][A-Z][0-9] [A-Z0-9][A-Z0-9][A-Z] %')

,('% [A-Z][A-Z0-9][A-Z0-9][A-Z0-9] [0-9][A-Z][A-Z]%', '%[A-Z][A-Z][0-9] [A-Z0-9][A-Z0-9][A-Z0-9][A-Z] %'))

AS f([Matched],ReverseMatch)

cross join --normally this would be a big table, of course

(select 'Simple Talk Publications, Newnham House, Cambridge Business Park,

Cambridge, CB10 7EC' union all Select '20 Milton Street, Inverness SWB7 7EC' ) as g([Address])

group by [address]

having MAX(PATINDEX([matched],[address]))>0) work;

--and this technique allows you to process a huge table. It is surprisingly fast.

--first we create our sample table...

Create table [Addresses]

(Address_ID int identity primary key,

[Address] varchar(200) not null,

[alteredAddress] Varchar(200) null,

Postcode varchar(12));

/* ...and then we'll populate it with 100,000 rows via SQL Data Generator (we'll use a RegEx to fill the address column in).

Hopefully, I'll remember to put it in the speech bubble at the head of the article for the other SQLDG freaks). Then we are going to pull out the postcode information, place the modified address without the postcode in a second column, and put the extracted postcode into its own column so we can subsequently do lightning searches based on postcode. This whole messy process runs in five seconds on my test machine. If you did a neat cursor-based process, it would take minutes.*/

Update Addresses

Set alteredAddress=[modified],

Postcode=[extracted]

from Addresses inner join

(Select

Address_ID,--the address ID

--the modified address without the postcode (if there was one!)

stuff([address],start,length-start-fromend+2,'') as [modified],

--the postcode itself

Substring([address],start,length-start-fromend+2) as [extracted]

from

(--we have a derived table with the results we need for the chopping

--process to save having to calculate it more than once

SELECT MAX(PATINDEX([matched],[address])) as start,

MAX(PATINDEX([ReverseMatch],reverse([address]+' '))-1) as fromEnd,

len([address]+'|')-1 as [length],

[Address] as [address],

min(Address_ID) as address_ID

FROM (VALUES--first the forward match, then the reverse match

('% [A-Z][A-Z0-9] [0-9][A-Z][A-Z]%', '%[A-Z][A-Z][0-9] [A-Z0-9][A-Z] %' )

,('% [A-Z][A-Z0-9][A-Z0-9] [0-9][A-Z][A-Z]%', '%[A-Z][A-Z][0-9] [A-Z0-9][A-Z0-9][A-Z] %')

,('% [A-Z][A-Z0-9][A-Z0-9][A-Z0-9] [0-9][A-Z][A-Z]%', '%[A-Z][A-Z][0-9] [A-Z0-9][A-Z0-9][A-Z0-9][A-Z] %'))

AS f([Matched],ReverseMatch)

cross join addresses

group by [address]

having MAX(PATINDEX([matched],[address]))>0)

work)

alteredData

on AlteredData.Address_ID=Addresses.Address_ID;

/* so here is a puzzle to end off with. You have a field with an email address somewhere in it, and you need to extract it. Here's one way of pulling it out. It looks a bit complicated, but it is fast */

SELECT CASE WHEN AtIndex=0 THEN '' --no email found

ELSE RIGHT(head, PATINDEX('% %', REVERSE(head) + ' ') - 1)

+ LEFT(tail + ' ', PATINDEX('% %', tail + ' '))

END EmailAddress

FROM (SELECT RIGHT(EmbeddedEmail, [len] - AtIndex) AS tail,

LEFT(EmbeddedEmail, AtIndex) AS head, AtIndex

FROM (SELECT PATINDEX('%[A-Z0-9-]@[A-Z0-9-]%', EmbeddedEmail+' ') AS AtIndex,

LEN(EmbeddedEmail+'|')-1 AS [len],

embeddedEmail

FROM (

SELECT 'The Imperial Oil Company [email protected] 123 Main St'

) AS ListOfCompanies (EmbeddedEmail)

)g;

EmailAddress

-------------------------

[email protected]

/*What are we doing here? The principle is simple. We look for the embedded '@' sign, and then run a check forwards to get the end of the string containing the '@' character. Then we reverse the start of the string and look for the beginning. When we have these, it is just a simple matter of assembling the email address. The SQL looks laborious, but looks can deceive, so it always pays to test it out. Let's test it out by recreating our addresses table, and stocking the address column with an additional email record. */

Create table [EmailAddresses]

(Address_ID int identity primary key,

[Address] varchar(200) not null,

email Varchar(50))

/* now we stock it with 100,000 records with an address column with an embedded email address. Next we extract the email address. How long does it take? 6 seconds for 100,000 rows.

Update EmailAddresses

Set email=emailAddress

from EmailAddresses inner join

(SELECT CASE WHEN AtIndex=0 THEN '' --no email found

ELSE RIGHT(head, PATINDEX('% %', REVERSE(head) + ' ') - 1)

+ LEFT(tail + ' ', PATINDEX('% %', tail + ' '))

END AS emailAddress, Address_ID

FROM (SELECT RIGHT(Address, [len] - AtIndex) AS tail,

LEFT(Address, AtIndex) AS head, AtIndex,Address_ID

FROM (SELECT PATINDEX('%[A-Z0-9-]@[A-Z0-9-]%', Address+' ') AS AtIndex,

LEN(Address+'|')-1 AS [len],

Address, address_ID

FROM EmailAddresses

)emails

ON emails.address_ID=EmailAddresses.Address_ID;

So there we have it. In summary

When using PATINDEX,

specify the collation if you use character ranges
If a problem seems tricky, see if you can detect transitions between character types
use the angle brackets to 'escape' wildcard characters.
Think of the % wildcard as meaning 'any number of any character', or 'Any string of zero or more characters.'
remember that you can specify the wildcard parameter as a column as well as a literal or variable.
you can use cross joins to do multiple searches to simulate the OR condition of RegExes.
experiment wildly when you get a spare minute. Occasionally, you'll be surprised.

This article has been viewed 18431 times.

Subject:	did not work
Posted by:	afonso freitas - [email protected] (not signed in)
Posted on:	Monday, May 16, 2011 at 6:16 AM
Message:	I tried this example but it did not work, this theory does not work in practice

Subject:	Thanks
Posted by:	afonso freitas - [email protected] (not signed in)
Posted on:	Monday, May 16, 2011 at 8:31 AM
Message:	EXCELLENT ARTICLE. CONGRATULATIONS

Subject:	Correction
Posted by:	Anonymous (not signed in)
Posted on:	Monday, May 23, 2011 at 4:17 AM
Message:	/Select all objects (not tables!) in the current database which have the following three words FROM WHERE and ORDER in that order with gaps in between/ SELECT name FROM sys.Objects WHERE PATINDEX ('%FROM%WHERE%AND%',object_definition(object_ID))>0; Will also find the string 'FROMWHEREAND'. Great article.

Subject:	about the article
Posted by:	needa_flats (not signed in)
Posted on:	Monday, May 23, 2011 at 5:07 AM
Message:	did not work to me either... http://www.lettingsmall.com/

Subject:	Expected 3, got a 4
Posted by:	mjbmjbmjbmjb (not signed in)
Posted on:	Monday, May 23, 2011 at 6:50 AM
Message:	For me, this (below) is returning 3 and I was expecting, probably wrongly, 4... SELECT PATINDEX ('%[0-9].%[0-9].%[0-9].%','278.2.6.49');--returns with the index of the first dot or 0 if there aren't three in the string

SQL Categories

Custom RSS feeds

Related Articles

Phil Factor

Top Rated

Most Viewed

Why Join

Subject:	Re: Expecting 3, got a 4
Posted by:	Phil Factor (view profile)
Posted on:	Monday, May 23, 2011 at 7:29 AM
Message:	It is telling you the index where it got the match. This will have been the number before the dot rather than the dot. I've amended the article as I rashly copied the previous comment!

Subject:	Re: 'did not work for me either'
Posted by:	Phil Factor (view profile)
Posted on:	Monday, May 23, 2011 at 7:32 AM
Message:	I contacted the first person who said it didn't work. He then managed to get it working with some help. If you can tell me what the problem is I'll do my best to fix it. Please remember that PatIndex doesn't work with TEXT datatype. I should have mentioned that!

Subject:	Re: FromWhereAnd
Posted by:	Phil Factor (view profile)
Posted on:	Monday, May 23, 2011 at 7:38 AM
Message:	You're right. I've update the comments!

Subject:	Amazing
Posted by:	Lyndon (view profile)
Posted on:	Monday, May 23, 2011 at 8:08 AM
Message:	I've just arranged a meeting for tomorrow at CB4 0WE. It must be very close.

Subject:	FromWhereAnd
Posted by:	mctillett (view profile)
Posted on:	Monday, May 23, 2011 at 9:50 AM
Message:	That's better, I had been signed out when I posted anonymously earlier. As an alternative to changing the comments Phil, how about using the pattern '%FROM_%WHERE_%AND%' :-D

Subject:	One question...
Posted by:	Hugh Yeman (view profile)
Posted on:	Wednesday, September 26, 2012 at 7:14 AM
Message:	Thank you for this. It's helpful to me as I learn T-SQL. One thing I haven't been able to puzzle out: what is the purpose of concatenating a pipe symbol onto a string before calling the LEN function on it? You do things like the following several times. @LenString=LEN(@string+'\|')-1 There must be an obvious reason, but I haven't been able to suss it. @string is already declared as a VARCHAR so it can't be to guarantee type.

Subject:	selecting text between two letters or numbers
Posted by:	caruncles (view profile)
Posted on:	Tuesday, November 20, 2012 at 11:33 AM
Message:	I guess I'm a simpleton but didn't see the answer to my question on your page. Is it possible to select just the text between two letters or numbers? for example, everthing between '{' and '}', without selecting the braces.