Using regular expressions in Python to parse LaTeX code

Question

I am trying to write a Python script to tidy up my LaTeX code. I would like to find instances in which an environment is started, but there are non-whitespace characters after the declaration before the next newline. For example, I would like to match

\begin{theorem}[Weierstrass Approximation] \label{wapprox}

but not match

\begin{theorem}[Weierstrass Approximation] 
\label{wapprox}

My goal is to insert (using re.sub) a newline character between the end of the declaration and the first non-whitespace character. Sloppily stated, I want to find something like

(\begin{env}) ({text} | [text]) ({text2}|[text2]) ... ({textn}|[textn]) (\S)

to do a replacement. I've tried

expr = re.compile(r'\\(begin|end){1}({[^}]+}|\[[^\]]+\])+[^{\[]+$',re.M)

but this isn't quite working: the last group ends up matching only the last pair of {,} or [,].

A less complex solution would probably be to write a tokenizer/lexer for LaTeX that splits the input into tokens and copies them one-by-one into a second buffer. As you copy them you can determine whether you want to insert extra spaces or newline. As you loop through each token, if you encounter a '\begin{(\w+)}' token, then enter a state that ensures a newline is inserted before the next non-whitespace token is copied. Attempting to do full-document analysis on a LaTeX document using regular expression is liable to be fragile. — Simon Broadhead, Commented Aug 25, 2015 at 11:22
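The tokenizer idea from the comment above could be sketched roughly as follows. This is only an illustration under loud assumptions: the token grammar `TOKEN` and the helper `tidy` are hypothetical names, and the grammar knows nothing about LaTeX beyond \begin/\end declarations, whitespace, and "everything else".

```python
import re

# Hypothetical, deliberately tiny token grammar (not a real LaTeX lexer).
TOKEN = re.compile(r'''
    (?P<decl>\\(?:begin|end)(?:\{[^}]*\}|\[[^\]]*\])*)  # declaration + arguments
  | (?P<newline>\n)
  | (?P<space>[^\S\n]+)                                 # horizontal whitespace
  | (?P<other>\\?[^\\\s]+|\\)                           # anything else
''', re.VERBOSE)


def tidy(source):
    """Copy tokens to a second buffer, forcing a newline after declarations."""
    out = []
    pending = False   # True right after a declaration token was copied
    held_space = ''   # horizontal whitespace seen while pending
    for m in TOKEN.finditer(source):
        kind, text = m.lastgroup, m.group()
        if pending:
            if kind == 'space':
                held_space += text      # decide once the next token is known
                continue
            if kind == 'newline':
                out.append(held_space)  # a newline already follows: keep as is
            else:
                out.append('\n')        # replace the held space with a newline
            pending = False
            held_space = ''
        out.append(text)
        if kind == 'decl':
            pending = True
    out.append(held_space)
    return ''.join(out)


same_line = '\\begin{theorem}[Weierstrass Approximation] \\label{wapprox}'
next_line = '\\begin{theorem}[Weierstrass Approximation] \n\\label{wapprox}'
print(tidy(same_line))
print(tidy(next_line) == next_line)
```

The state lives in two plain variables rather than in the pattern, so there is no backtracking to reason about; the price is that the grammar has to be extended by hand for comments, verbatim blocks, and nested braces.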

Casimir et Hippolyte · Accepted Answer · 2015-08-25 12:14:03Z

You can do it like this:

import re

s = r'''\begin{theorem}[Weierstrass Approximation] \label{wapprox}

but not match

\begin{theorem}[Weierstrass Approximation] 
\label{wapprox}'''

p = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

print(p.sub(r'\1\n', s))

pattern details:

(   # capture group 1
    \\
    (?:begin|end)
    # trick to emulate an atomic group
    (?=(  # the subpattern is enclosed in a lookahead and a capture group (2)
        (?:{[^}]*}|\[[^]]*])*
    ))  # the lookahead is naturally atomic
    \2  # backreference to the capture group 2
)
[^\S\n]* # optional horizontal whitespace
(?=\S) # followed by a non whitespace character
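For reference, the annotated pattern can be compiled directly with re.VERBOSE, which lets the comments stay inside the pattern. This is a sketch of the same expression, with the braces and brackets escaped for readability; the names `PATTERN`, `same_line`, and `next_line` are just for illustration.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so the comments stay inline.
PATTERN = re.compile(r'''
    (                                  # capture group 1: the whole declaration
        \\(?:begin|end)
        (?=(                           # lookahead + group 2 emulate an atomic group
            (?:\{[^}]*\}|\[[^\]]*\])*
        ))
        \2                             # consume exactly what group 2 captured
    )
    [^\S\n]*                           # optional horizontal whitespace
    (?=\S)                             # a non-whitespace character must follow
''', re.VERBOSE)

same_line = '\\begin{theorem}[Weierstrass Approximation] \\label{wapprox}'
next_line = '\\begin{theorem}[Weierstrass Approximation] \n\\label{wapprox}'

fixed_same = PATTERN.sub(r'\1\n', same_line)   # newline inserted before \label
fixed_next = PATTERN.sub(r'\1\n', next_line)   # already tidy, left untouched
print(fixed_same)
print(fixed_next == next_line)
```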

Explanation: if you write a simpler pattern like (\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S), you can't exclude the cases that have a newline character before the next token. Consider this input:

\begin{theorem}[Weierstrass Approximation] 
\label{wapprox}

At first, the capture group matches \begin{theorem}[Weierstrass Approximation] and [^\S\n]* matches any trailing horizontal whitespace.

But since (?=\S) then fails (because the next character is a newline), the backtracking mechanism kicks in: [^\S\n]* gives back the whitespace, and then the repeated group gives back [Weierstrass Approximation], so the capture group matches only \begin{theorem}.

Now (?=\S) succeeds, because the next character is [. The pattern therefore matches in the middle of a declaration that was already followed by a newline.
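The backtracking scenario described above can be reproduced with a short script. This is a minimal sketch; the variable names `naive` and `atomic` are only labels for the two patterns already shown in this answer.

```python
import re

# A declaration that is already followed by a newline (after a trailing space).
text = '\\begin{theorem}[Weierstrass Approximation] \n\\label{wapprox}'

# Naive version: the repeated group may give back "[...]" when (?=\S) fails.
naive = re.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S)')

# Emulated atomic group: the lookahead captures once and is never re-entered.
atomic = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

naive_match = naive.search(text)
atomic_match = atomic.search(text)

print(naive_match.group(1))   # the declaration, cut short by backtracking
print(atomic_match)           # nothing to fix, so no match is reported
```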

An atomic group is a non-capturing group that forbids backtracking into the subpattern it encloses. The notation is (?>subpattern). The re module did not support this feature before Python 3.11, but you can emulate it with the trick (?=(subpattern))\1.

Note that you can use the regex module (which has this feature) instead of re:

import regex

p = regex.compile(r'(\\(?:begin|end)(?>(?:{[^}]*}|\[[^]]*])*))[^\S\n]*(?=\S)')

or

p = regex.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*+)[^\S\n]*+(?=\S)')
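As a side note, the standard re module accepts atomic groups and possessive quantifiers since Python 3.11, so the third-party module is optional on recent interpreters. The sketch below picks the native syntax when available and falls back to the lookahead emulation otherwise; `ARGS` is a hypothetical helper name introduced only to avoid repeating the sub-pattern.

```python
import re
import sys

# Shared sub-pattern: any run of {...} or [...] arguments.
ARGS = r'(?:{[^}]*}|\[[^]]*])*'

if sys.version_info >= (3, 11):
    # Native atomic group in the standard library (Python 3.11+).
    p = re.compile(r'(\\(?:begin|end)(?>' + ARGS + r'))[^\S\n]*(?=\S)')
else:
    # Older interpreters: lookahead + backreference emulation.
    p = re.compile(r'(\\(?:begin|end)(?=(' + ARGS + r'))\2)[^\S\n]*(?=\S)')

s = '\\begin{theorem}[Weierstrass Approximation] \\label{wapprox}'
already_tidy = '\\begin{theorem}[Weierstrass Approximation] \n\\label{wapprox}'

result = p.sub(r'\1\n', s)
print(result)
print(p.sub(r'\1\n', already_tidy) == already_tidy)
```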
