I know there are a bunch of other regex questions, but I was hoping someone could point out what is wrong with my regex. I have done some research and it looks like it should work. I tested it with Rubular; I know that is a Ruby regex tester, but judging by the Python docs the same rules should apply to Python.

Currently I have

a = ["SDFSD_SFSDF234234","SDFSDF_SDFSDF_234324","TSFSD_SDF_213123"]
c = [re.sub(r'[A-Z]+', "", x) for x in a]

which returns

['SDFSD_SFSDF', 'SDFSDF_SDFSDF_', 'TSFSD_SDF_']

But I want it to return

['SDFSD_SFSDF', 'SDFSDF_SDFSDF', 'TSFSD_SDF']

I tried to use this regex

c = [re.sub(r'$?_[^A-Z_]+', "", x) for x in a]

but I am getting this error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
  File "/usr/lib64/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression

Can anyone help me figure out what I am doing wrong?

That's not what your code returns: c should be ['_234234', '__234324', '__213123']. –  arshajii Jul 17 '13 at 21:59
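The comment is easy to check by running the snippet from the question exactly as posted. A minimal sketch that reuses the question's list a and its first pattern, with nothing else assumed:

```python
import re

a = ["SDFSD_SFSDF234234", "SDFSDF_SDFSDF_234324", "TSFSD_SDF_213123"]

# [A-Z]+ matches every run of uppercase letters, so re.sub deletes the
# letters and leaves the underscores and digits behind.
c = [re.sub(r'[A-Z]+', "", x) for x in a]
print(c)  # ['_234234', '__234324', '__213123']
```

So the first pattern removes the part the asker wants to keep, which is the opposite of the output shown in the question.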

5 Answers

up vote 0 down vote accepted

The error in:

c = [re.sub(r'$?_[^A-Z_]+', "", x) for x in a]

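is caused by the ?. It directly follows $, which is an anchor: it matches a position rather than a character, so there is nothing for ? to match 0 or 1 times. If you move the $ to the end and make the underscore optional instead: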

>>> [re.sub(r'_?[^A-Z_]+$', "", x) for x in a]
['SDFSD_SFSDF', 'SDFSDF_SDFSDF', 'TSFSD_SDF']

It works as you expect.

Another thing: $ denotes the end of the string (or of a line, in multiline mode), so it belongs at the end of the pattern, not at the start.
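Both points can be checked in a few lines. This is only a sketch: the helper name strip_suffix and the variable compile_error are made up for illustration, and the exact error text may differ between Python versions, so it is printed rather than assumed.

```python
import re

a = ["SDFSD_SFSDF234234", "SDFSDF_SDFSDF_234324", "TSFSD_SDF_213123"]

# The original pattern fails at compile time because ? is applied to the
# anchor $, which gives the quantifier nothing to repeat.
try:
    re.compile(r'$?_[^A-Z_]+')
    compile_error = None
except re.error as exc:
    compile_error = str(exc)

def strip_suffix(s):
    """Remove a trailing run of non-uppercase, non-underscore characters,
    plus the underscore right before it if there is one."""
    return re.sub(r'_?[^A-Z_]+$', "", s)

c = [strip_suffix(x) for x in a]
print(compile_error)
print(c)  # ['SDFSD_SFSDF', 'SDFSDF_SDFSDF', 'TSFSD_SDF']
```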

import re

a = ["SDFSD_SFSDF234234","SDFSDF_SDFSDF_234324","TSFSD_SDF_213123"]
c = [re.match(r'[A-Z_]+[A-Z]', x).group() for x in a]

print(c)

Results:

['SDFSD_SFSDF', 'SDFSDF_SDFSDF', 'TSFSD_SDF']

Please note that re.sub, which you use in your example, is a regex replace, not a search. Your pattern matches the part you want to keep rather than the part you want to remove, so re.sub deletes exactly the text you are after.
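To make the distinction concrete, here is a small sketch that reaches the same result both ways: re.match to extract the part to keep, and re.sub to delete the part to drop. The deletion pattern r'_?[0-9]+$' is an assumption based on the sample data (a trailing run of digits with an optional underscore before it); it is not taken from the question.

```python
import re

a = ["SDFSD_SFSDF234234", "SDFSDF_SDFSDF_234324", "TSFSD_SDF_213123"]

# Search approach: describe what to KEEP and extract it.
# Note that re.match returns None when nothing matches, so .group()
# would raise AttributeError on input that does not start with letters.
kept = [re.match(r'[A-Z_]+[A-Z]', x).group() for x in a]

# Replace approach: describe what to REMOVE and substitute it away.
removed = [re.sub(r'_?[0-9]+$', "", x) for x in a]

print(kept)     # ['SDFSD_SFSDF', 'SDFSDF_SDFSDF', 'TSFSD_SDF']
print(removed)  # ['SDFSD_SFSDF', 'SDFSDF_SDFSDF', 'TSFSD_SDF']
```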


You could add a lookahead to your regexp. Written as (?=...), it makes the preceding part match only when it is followed by whatever you put in place of the dots. So in your case you could choose to ignore the underscore unless it is followed by [A-Z]. Your regexp would look like this: r'[A-Z]+_(?=[A-Z])', so an underscore not followed by a letter is ignored.
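One way to apply the lookahead idea to the question's data is sketched below. The pattern is a variation for illustration, not the exact expression from this answer: it consumes either an uppercase letter or an underscore that is immediately followed by an uppercase letter, and stops at the first character that is neither.

```python
import re

a = ["SDFSD_SFSDF234234", "SDFSDF_SDFSDF_234324", "TSFSD_SDF_213123"]

# (?=[A-Z]) is a lookahead: it checks that the next character is an
# uppercase letter without consuming it. An underscore followed by a
# digit therefore ends the match instead of being included.
pattern = re.compile(r'(?:[A-Z]|_(?=[A-Z]))+')

c = [pattern.match(x).group() for x in a]
print(c)  # ['SDFSD_SFSDF', 'SDFSDF_SDFSDF', 'TSFSD_SDF']
```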


Without regex using rstrip:

a = ["ends_with_underscore_", "does_not", "multiple_____"]
b = [ x.rstrip("_") for x in a]
print(b)
# ['ends_with_underscore', 'does_not', 'multiple']
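The example above only strips underscores. For the strings in the question, the trailing digits have to go as well. A sketch of one way to do that, assuming the unwanted suffix only ever consists of digits and underscores:

```python
import string

a = ["SDFSD_SFSDF234234", "SDFSDF_SDFSDF_234324", "TSFSD_SDF_213123"]

# rstrip treats its argument as a set of characters, not as a suffix:
# it keeps removing characters from the right while they are in the set.
chars = string.digits + "_"
c = [x.rstrip(chars) for x in a]
print(c)  # ['SDFSD_SFSDF', 'SDFSDF_SDFSDF', 'TSFSD_SDF']
```

Because the argument is a character set, this also removes digits that sit before a trailing underscore: "AB12_34" becomes "AB", not "AB12".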
>>> import re
>>> a = ["SDFSD_SFSDF234234","SDFSDF_SDFSDF_234324","TSFSD_SDF_213123"]
>>> c = [re.sub(r'_?\d+','',x) for x in a]
>>> c
['SDFSD_SFSDF', 'SDFSDF_SDFSDF', 'TSFSD_SDF']
>>>

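It's short and simple. Basically, it's saying "remove every run of digits, together with the underscore right before it if there is one". Note that the pattern is not anchored, so it removes digits anywhere in the string, not only at the end.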
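The pattern has no $ anchor, which makes no difference for the question's data but matters once digits appear in the middle of a string. A short sketch with a hypothetical input (sample is made up for illustration) that compares the pattern with and without the anchor:

```python
import re

sample = "AB12_CD_34"  # hypothetical input with digits in the middle

# Without an anchor, every run of digits is removed, wherever it occurs.
unanchored = re.sub(r'_?\d+', '', sample)

# With $ at the end, only the trailing run (and its underscore) is removed.
anchored = re.sub(r'_?\d+$', '', sample)

print(unanchored)  # AB_CD
print(anchored)    # AB12_CD
```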
