I am trying to write a Python script to tidy up my LaTeX code. I would like to find instances in which an environment is started, but there are non-whitespace characters after the declaration before the next newline. For example, I would like to match

\begin{theorem}[Weierstrass Approximation] \label{wapprox}

but not match

\begin{theorem}[Weierstrass Approximation] 
\label{wapprox}

My goal is to insert (using re.sub) a newline character between the end of the declaration and the first non-whitespace character. Sloppily stated, I want to find something like

(\begin{env}) ({text}|[text]) ({text2}|[text2]) ... ({textn}|[textn]) (\S)

to do a replacement. I've tried

expr = re.compile(r'\\(begin|end){1}({[^}]+}|\[[^\]]+\])+[^{\[]+$',re.M)

but this isn't quite working. The last group matches only the last pair of {,} or [,].

  • A less complex solution would probably be to write a tokenizer/lexer for LaTeX that splits the input into tokens and copies them one by one into a second buffer. As you copy them, you can decide whether to insert extra spaces or a newline. As you loop through the tokens, if you encounter a '\begin{(\w+)}' token, enter a state that ensures a newline is inserted before the next non-whitespace token is copied. Attempting full-document analysis of a LaTeX document with regular expressions is liable to be fragile.
    Commented Aug 25, 2015 at 11:22
  • As always, don't use regex to parse structured languages.
    – tripleee
    Commented Aug 25, 2015 at 11:55
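
The tokenizer idea from the first comment could look roughly like the sketch below. It is only an illustration under simplifying assumptions: the token classes, the TOKEN pattern and the tidy function are hypothetical names invented here, arguments are assumed not to nest, and LaTeX comments and verbatim environments are ignored. It also treats \end like \begin, as the question does.

```python
import re

# Hypothetical token classes for this sketch; a real LaTeX lexer needs more
# (comments, verbatim environments, nested braces, escaped characters).
TOKEN = re.compile(r'''
    (?P<env>\\(?:begin|end)\{[^}]*\})   # environment start or end
  | (?P<arg>\{[^}]*\}|\[[^\]]*\])       # one braced or bracketed argument
  | (?P<newline>\n)
  | (?P<space>[^\S\n]+)                 # horizontal whitespace
  | (?P<other>\\[A-Za-z]+|.)            # any other control word or character
''', re.VERBOSE)


def tidy(source):
    out = []
    after_decl = False  # True while we are right behind a declaration
    pending = ''        # horizontal whitespace seen since the declaration
    for m in TOKEN.finditer(source):
        kind, text = m.lastgroup, m.group()
        if after_decl:
            if kind == 'arg' and not pending:
                out.append(text)     # argument still belongs to the declaration
                continue
            if kind == 'space':
                pending += text      # hold back until we know what follows
                continue
            if kind == 'newline':
                out.append(pending)  # line already ends here: keep it as is
            else:
                out.append('\n')     # content on the same line: break the line
            pending = ''
            after_decl = False
        out.append(text)
        if kind == 'env':
            after_decl = True
    out.append(pending)              # whitespace left over at end of input
    return ''.join(out)


print(tidy('\\begin{theorem}[Weierstrass Approximation] \\label{wapprox}'))
```

Because each token is copied exactly once, adding further rules later (for example, skipping comments) means adding a token class and a branch rather than growing one large regular expression.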

1 Answer

You can do it like this:

import re

s = r'''\begin{theorem}[Weierstrass Approximation] \label{wapprox}

but not match

\begin{theorem}[Weierstrass Approximation] 
\label{wapprox}'''

p = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

print(p.sub(r'\1\n', s))

Pattern details:

(   # capture group 1
    \\
    (?:begin|end)
    # trick to emulate an atomic group
    (?=(  # the subpattern is enclosed in a lookahead and a capture group (2)
        (?:{[^}]*}|\[[^]]*])*
    ))  # the lookahead is naturally atomic
    \2  # backreference to the capture group 2
)
[^\S\n]* # optional horizontal whitespace
(?=\S) # followed by a non-whitespace character

Explanation: if you write a pattern like (\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S), you can't exclude cases where a newline follows the declaration directly. Consider this input:

\begin{theorem}[Weierstrass Approximation]
\label{wapprox}

At first, group 1 of the pattern matches the whole declaration:

\begin{theorem}[Weierstrass Approximation]

But since (?=\S) fails (because the next character is a newline), the regex engine backtracks: the repeated group gives up its last iteration, so group 1 now matches only:

\begin{theorem}

and (?=\S) now succeeds, because the next character is [.
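
The backtracking scenario can be reproduced directly. This is a minimal sketch using the two patterns from this answer on an input whose declaration is already followed by a newline; the variable names naive, atomic and text are only for illustration.

```python
import re

# Naive pattern: the repeated argument group is free to backtrack.
naive = re.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S)')

# Pattern from the answer: lookahead + backreference emulates an atomic group.
atomic = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

# The declaration is already followed by a newline, so nothing should change.
text = '\\begin{theorem}[Weierstrass Approximation]\n\\label{wapprox}'

naive_result = naive.sub(r'\1\n', text)
atomic_result = atomic.sub(r'\1\n', text)

# The naive pattern wrongly splits the declaration before the '[':
print(repr(naive_result))
# The atomic emulation leaves the text untouched:
print(repr(atomic_result))
```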

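An atomic group is a non-capturing group that forbids backtracking into the subpattern it encloses once that subpattern has matched. The notation is (?>subpattern). Unfortunately, the re module doesn't have this feature before Python 3.11, but you can emulate it with the trick (?=(subpattern))\1, where \1 stands for whichever capture group wraps the subpattern (group 2 in the pattern above).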

Note that you can use the third-party regex module (which has this feature) instead of re:

import regex

p = regex.compile(r'(\\(?:begin|end)(?>(?:{[^}]*}|\[[^]]*])*))[^\S\n]*(?=\S)')

or, with possessive quantifiers:

p = regex.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*+)[^\S\n]*+(?=\S)')
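
On Python 3.11 or newer, the standard re module accepts the atomic-group syntax as well, so no third-party package is needed. The sketch below assumes that version check is acceptable in your script and falls back to the lookahead/backreference emulation on older interpreters; the sample string s is the one used earlier, written with explicit escapes.

```python
import re
import sys

if sys.version_info >= (3, 11):
    # Native atomic group, same syntax as in the regex module.
    p = re.compile(r'(\\(?:begin|end)(?>(?:{[^}]*}|\[[^]]*])*))[^\S\n]*(?=\S)')
else:
    # Older interpreters: emulate the atomic group with (?=(...))\2.
    p = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

s = ('\\begin{theorem}[Weierstrass Approximation] \\label{wapprox}\n'
     '\\begin{theorem}[Weierstrass Approximation] \n'
     '\\label{wapprox}')

# Only the first declaration has trailing content on its line.
result = p.sub(r'\1\n', s)
print(result)
```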
