I have a function that parses PHP array declarations from files. It returns a Python dictionary whose keys are the keys of the PHP array and whose values are the corresponding PHP array values.

Example file:

$lang['identifier_a'] = 'Welcome message';
$lang['identifier_b'] = 'Welcome message.
You can do things a,b, and c here.

Please be patient.';
$lang['identifier_c'] = 'Welcome message2.
You can do things a,b, and c here.
Please be patient.';
$lang['identifier_d'] = 'Long General Terms and Conditions with more text';
$lang['identifier_e'] = 'General Terms and Conditions';
$lang['identifier_f'] = 'Text e';

Python function:

def fetch_lang_keys(filename):
    ''' fetches all the language keys for filename '''
    from re import search

    with open(filename) as fi:
        lines = fi.readlines()

    data = {}
    for line in lines:
        obj = search(r"\$lang\[[\'|\"](.{1,})[\'|\"]\] = [\'|\"](.{1,})[\'|\"];", line)
#        re.match(r'''\$lang\[[\'|\"](.{1,})[\'|\"]\] = [\'|\"](.{1,})[\'|\"];''', re.MULTILINE | re.VERBOSE);

        if obj:
            data[obj.group(1)] = obj.group(2)

    return data

The function should return a dictionary that looks like this:

data['identifier_a'] = 'Welcome message'
data['identifier_b'] = 'Welcome message.
You can do things a,b, and c here.

Please be patient.'
# and so on

The regexp used in the function works for everything except identifier_b and identifier_c, because it does not match blank lines or lines that do not end with ;. Using a wildcard with ; at the end did not work either, because it matched too much.

Do you have any idea of how to solve this? I looked into lookahead assertions, but failed to use them properly. Thanks.


3 Answers

Accepted answer (1 vote)

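This regex seems to work. The marked part lets the value span newlines, and the lazy +? stops at the first quote followed by ;: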

\$lang\[[\'|\"](.{1,})[\'|\"]\] = [\'|\"]((?:.|\n)+?)[\'|\"];
                                          ^^^^^^^^^^
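
A minimal sketch of how this pattern might be applied, assuming the whole file is read into one string instead of being searched line by line (otherwise a multi-line value is never seen in one piece). The names LANG_RE and parse_lang_text and the sample text are illustrative and not part of the original code:

```python
import re

# Pattern from this answer: (?:.|\n)+? lets the value cross newlines and,
# being lazy, stops at the first quote that is followed by a semicolon.
LANG_RE = re.compile(
    r"""\$lang\[[\'|\"](.{1,})[\'|\"]\] = [\'|\"]((?:.|\n)+?)[\'|\"];"""
)


def parse_lang_text(text):
    """Return {key: value} for every $lang[...] assignment found in text."""
    return {m.group(1): m.group(2) for m in LANG_RE.finditer(text)}


# Hypothetical sample mirroring the example file in the question.
sample = (
    "$lang['identifier_a'] = 'Welcome message';\n"
    "$lang['identifier_b'] = 'Welcome message.\n"
    "You can do things a,b, and c here.\n"
    "\n"
    "Please be patient.';\n"
)

data = parse_lang_text(sample)
print(data["identifier_b"])
```

For a real file, the same function could be called on the result of fi.read() inside the with block, in place of the loop over fi.readlines().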


How would I allow the character '|' in the values too (without the quotation marks)? –  Philipp Feb 22 at 12:07
    
Another special case occurred which I am not too sure about: the value of the PHP array contained '...to Member\'s value...', which was not matched. Any idea? –  Philipp Feb 22 at 12:14
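
One possible way to handle both cases from these comments, sketched under the assumption that values are single- or double-quoted strings in which a backslash escapes the next character. Inside a character class | is a literal pipe, so [\'|\"] also accepts | as a delimiter and a value containing |; is cut short; the class ['"] avoids that. The names LANG_ESC_RE and parse_lang_escaped are illustrative, and only the escaped quote itself is unescaped here (other PHP escape sequences are left as written):

```python
import re

LANG_ESC_RE = re.compile(r"""
    \$lang\[ (['"]) ([^'"\]]+) \1 \]    # key, closed by the quote that opened it
    \s*=\s*
    (['"])                              # opening quote of the value
    ( (?: \\. | (?!\3)[^\\] )* )        # an escaped char, or any char except quote/backslash
    \3 ;                                # the same quote, then the semicolon
""", re.VERBOSE | re.DOTALL)


def parse_lang_escaped(text):
    """Return {key: value}, tolerating escaped quotes and | inside values."""
    data = {}
    for m in LANG_ESC_RE.finditer(text):
        quote, raw = m.group(3), m.group(4)
        # Turn \' (or \") back into the bare quote character.
        data[m.group(2)] = raw.replace("\\" + quote, quote)
    return data


# Hypothetical sample covering an escaped quote, a pipe, and a multi-line value.
sample = (
    "$lang['a'] = 'to Member\\'s value';\n"
    "$lang['b'] = 'x|; y';\n"
    "$lang[\"c\"] = \"line one\nline two\";\n"
)

data = parse_lang_escaped(sample)
print(data)
```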

Well, my answer is not a solution to your regexp problem, but nevertheless: why not use a "real PHP parser" instead of home-brewed regexps? It could be much more reliable, might even be faster, and would certainly be more maintainable.

A quick search turned up https://github.com/ramen/phply . I also found this: Parse PHP file variables from Python script . Hope this helps.

phply unfortunately does not work, as it does not parse the multi-line PHP strings correctly. –  Philipp Feb 22 at 12:27
    
Well, its unit tests have test cases for multi-line strings, but they probably don't cover your case. I'll check. –  user3159253 Feb 22 at 12:50

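It doesn't work because the dot doesn't match newlines. You must use the single-line modifier (re.DOTALL) instead of the multiline modifier, and search the whole file contents (for example the result of fi.read(), called text here) rather than one line at a time, since no single line contains a complete multi-line value. Example, which returns a list of (key, value) pairs:

matches = re.findall(r'\$lang\[[\'"](.+?)[\'"]\] = [\'"](.+?)[\'"];', text, re.DOTALL)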
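
To illustrate the difference, a small sketch (the sample string is made up) comparing the default behaviour of the dot with re.DOTALL. It also shows that the flag only helps when the whole text is searched, not one line at a time:

```python
import re

# Hypothetical multi-line entry, similar to identifier_b in the question.
text = "$lang['identifier_b'] = 'Welcome message.\n\nPlease be patient.';\n"
pattern = r"""\$lang\[['"](.+?)['"]\] = ['"](.+?)['"];"""

# Default: "." does not match "\n", so the value cannot reach the closing quote.
no_flag = re.search(pattern, text)

# re.DOTALL: "." matches any character, newlines included.
m = re.search(pattern, text, re.DOTALL)

# Line by line, the flag makes no difference: no single line holds both
# the opening and the closing quote of the value.
per_line = [
    re.search(pattern, line, re.DOTALL)
    for line in text.splitlines(True)
]

print(no_flag)
print(repr(m.group(2)))
print(per_line)
```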