You can do it like this:
import re
s = r'''\begin{theorem}[Weierstrass Approximation] \label{wapprox}
but not match
\begin{theorem}[Weierstrass Approximation]
\label{wapprox}'''
p = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')
print(p.sub(r'\1\n', s))
pattern details:
( # capture group 1
\\
(?:begin|end)
# trick to emulate an atomic group
(?=( # the subpattern is enclosed in a lookahead and a capture group (2)
(?:{[^}]*}|\[[^]]*])*
)) # the lookahead is naturally atomic
\2 # backreference to the capture group 2
)
[^\S\n]* # optional horizontal whitespace (any whitespace except a newline)
(?=\S) # followed by a non-whitespace character
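The breakdown above can also be kept inside the code itself. The following is a minimal sketch, assuming the same pattern compiled with the re.VERBOSE flag; the names PATTERN, sample and result are only illustrative and do not come from the original snippet.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments live in the regex.
PATTERN = re.compile(r'''
    (                                  # capture group 1
        \\ (?:begin|end)               # a literal backslash, then begin or end
        (?=(                           # lookahead + capture group 2 (emulated atomic group)
            (?: {[^}]*} | \[[^]]*] )*  # any run of brace or bracket arguments
        ))
        \2                             # consume exactly what group 2 captured
    )
    [^\S\n]*                           # optional horizontal whitespace
    (?=\S)                             # followed by a non-whitespace character
''', re.VERBOSE)

sample = r'\begin{theorem}[Weierstrass Approximation] \label{wapprox}'
result = PATTERN.sub(r'\1\n', sample)
print(result)
```

In verbose mode, whitespace outside character classes is ignored, so the layout does not change what the pattern matches.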
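Explanation: if you write a simpler pattern such as (\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S), you can't rule out the cases where a newline comes before the next token. Consider the following scenario with this input:
\begin{theorem}[Weierstrass Approximation]
\label{wapprox}
First, everything except the final lookahead, that is (\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*, matches the whole first line:
\begin{theorem}[Weierstrass Approximation]
But (?=\S) then fails, because the next character is a newline, so the backtracking mechanism kicks in: the * quantifier gives back its last repetition, and (\\(?:begin|end)(?:{[^}]*}|\[[^]]*])* now matches only:
\begin{theorem}
The rest of the pattern, )[^\S\n]*(?=\S), is retried from that position: [^\S\n]* matches the empty string and (?=\S) now succeeds on the [ character. The substitution would therefore wrongly insert a newline between \begin{theorem} and [Weierstrass Approximation].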
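The scenario above can be reproduced directly. This is a small sketch comparing the simpler pattern with the one from this answer on the already well-formed input; the variable names (text, naive, atomic and so on) are only illustrative.

```python
import re

text = '\\begin{theorem}[Weierstrass Approximation]\n\\label{wapprox}'

# Simpler pattern: the * quantifier may give back the bracket argument on backtracking.
naive = re.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S)')
# Pattern from this answer: the lookahead + backreference blocks that backtracking.
atomic = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

naive_match = naive.search(text)
atomic_match = atomic.search(text)

print(naive_match.group(1))  # the unwanted partial match
print(atomic_match)          # no match: the input is left alone
```

With the simpler pattern, group 1 stops after the braces, so a substitution would split the line in the wrong place. The second pattern finds nothing to change in this input.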
An atomic group is a non-capturing group that forbids backtracking into its subpattern once that subpattern has matched. The notation is (?>subpattern). The re module did not support this syntax before Python 3.11, but you can emulate it with the trick (?=(subpattern))\1, where the backreference number must be that of the capture group inside the lookahead (\2 in the pattern above).
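To see the emulation in isolation, here is a deliberately tiny sketch on a toy input, unrelated to LaTeX; the names greedy and emulated are only illustrative.

```python
import re

# Plain greedy quantifier: a+ can give back one "a" so that the trailing "ab" matches.
greedy = re.compile(r'a+ab')
# Emulated atomic group: the lookahead captures the full run of "a" into group 1,
# the backreference consumes it, and the engine cannot shorten it afterwards.
emulated = re.compile(r'(?=(a+))\1ab')

greedy_match = greedy.search('aaab')
emulated_match = emulated.search('aaab')

print(greedy_match)    # a match covering the whole string
print(emulated_match)  # None
```

The second pattern fails because, once the lookahead has captured every "a" available at a given position, nothing is left for the literal "a" that follows, and the engine is not allowed to shorten the capture.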
Note that you can also use the third-party regex module (which has this feature) instead of re:
import regex
p = regex.compile(r'(\\(?:begin|end)(?>(?:{[^}]*}|\[[^]]*])*))[^\S\n]*(?=\S)')
or, with possessive quantifiers:
p = regex.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*+)[^\S\n]*+(?=\S)')