Code Review Stack Exchange is a question and answer site for peer programmer code reviews. Join them; it only takes a minute:

Sign up
Here's how it works:
  1. Anybody can ask a question
  2. Anybody can answer
  3. The best answers are voted up and rise to the top

I'm processing a list of files, and want to check the filenames against a list of regex, like:

IGNORE_FILES = [
    re.compile('^./Vendor.*'),
    re.compile('^./Pods.*'), 
    …
]

def in_blacklist(file):
    return len(list(filter(lambda r: r.match(file) != None, IGNORE_FILES))) > 0

def in_whitelist(file):
    return SWIFT_FILE_REGEX.match(file) != None

def files():
    valid_files = []
    for root, dirs, files in os.walk('.'):
        for file in files:
            if in_whitelist(file) and not in_blacklist(root):
                valid_files.append("%s/%s" % (root, file))
    return valid_files

I'm looking for a smoother way to write:

len(list(filter(lambda r: r.match(root) != None, IGNORE_FILES))) == 0

I feel that it's not so easy to read, and especially with the list(..)

share|improve this question
    
Are there any reasons you're not using string comparisons ? – Dex' ter yesterday
    
You mean as opposed to regex? – netigger yesterday
    
A "smoother way to write" as in, writing it into a well-named function and calling that instead? I hope so... that one-liner is pretty dreadful! – Phrancis yesterday
    
@Phrancis Putting 'return len(list(filter(lambda r: r.match(root) != None, IGNORE_FILES))) == 0' into a function doesn't make much of a difference, surley, the initial code-snippet will look better, but my question was specifically about that part of the line – netigger yesterday
1  
Thanks that looks better =) – Phrancis yesterday
up vote 4 down vote accepted

You can build one regex out of the list of regexes by

'|'.join(IGNORE_FILES) 

(Well, before running re.compile on them...)

Then your in_blacklist function becomes as easy to write and read as in_whitelist.

I further assume that you don't want to ignore paths that start with any letter, but those starting with a literal dot (which is what os.walk('.') yields). So you need to escape those leading dots.

Also, in your files() function, you should move your in_blacklist check out of the inner loop. No need to check the same root directory multiple times.

So:

IGNORE_FILES = re.compile('|'.join([
    r'^\./Vendor.*',
    r'^\./Pods.*', 
]))

def in_blacklist(file):
    return IGNORE_FILES.match(file) != None

def in_whitelist(file):
    return SWIFT_FILE_REGEX.match(file) != None

def files():
    valid_files = []
    for root, dirs, files in os.walk('.'):
        if in_blacklist(root):
            continue
        for file in files:
            if in_whitelist(file):
                valid_files.append("%s/%s" % (root, file))
    return valid_files

And for more portability, you could replace "%s/%s" % (root, file) by os.path.join(root, file), but I wouldn't insist on this too hard...

share|improve this answer
    
That's a good idea. I will go with that change :) Though I wonder if t might impact the regex in any way? I mean to limit the possibilities with the regexp, (for example groups would be strange with this, though Im fine with it since I dont plan to use groups here..) – netigger yesterday
    
It makes the regex more computationally expensive, but should not change semantics. And as long as all your regexes start with '^', the regex engine can detect a non-match early on and move on to the next one. I'm not sure whether there might be some worst-case scenarios for the regex engine where its computation time exceeds the time you need for individual checks, but if so, they should be rare. – pixelbrei yesterday

You can use a comprehension. Comprehensions are like easier to write maps and filters. Your in_blacklist can actually be a generator comprehension. One that yields rather than array.appends. Wrapped in an any. This is if there is at least one match it's True. And you don't really want to check the rest.

You can also change your files function to use one, all you need to do is move the discouraged printf string format to the beginning and keep the rest the same. You can also move if not in_blacklist(root) up so that if the root is not allowed then you're not spending time on the children.

IGNORE_FILES = [
    re.compile('^./Vendor.*'),
    re.compile('^./Pods.*'), 
    …
]

def in_blacklist(string):
    return any(r.match(string) is not None for r in IGNORE_FILES)

def in_whitelist(file):
    return SWIFT_FILE_REGEX.match(file) is not None

def files():
    return [
        '{}/{}'.format(root, file)
        for root, dirs, files in os.walk('.')
        if not in_blacklist(root)
        for file_name in files
        if in_whitelist(file_name)
    ]
share|improve this answer
    
Hmm, I actually tried with a list comprehension, but didn't get it correctly. In in_blacklist you dont need [ and ] for the list comprehension? – netigger yesterday
    
@netigger As I said it's not a list comprehension, it's a generator comprehension. Which allows you to remove the () if it's the sole argument to a function as any((i for i in range(10))) would look ridiculous. – Joe Wallis yesterday
    
Ah, missed that. Thank you for clarifying! – netigger yesterday

I find your use of "blacklist" and "whitelist" problematic.

What is a "whitelist"? Which takes precedence — a whitelist or a blacklist? (Adding an e-mail address to a whitelist should let it bypass a spam filter.)

As @JoeWallis has already pointed out, if the directory is in the blacklist, then there is no need to iterate through its contents. To add to the confusion, the blacklist works on directories, but the whitelist works on filenames. It would be clearer to name them exclude_dirs and filename_filter. If, as I suspect, you are using the filename filter to look for a .swift extension, consider using fnmatch as a more natural alternative to a regex.

share|improve this answer

Type hints

Since you are using Python 3.x you could take advantage of the new type hints. According to PEP 484:

This PEP aims to provide a standard syntax for type annotations, opening up Python code to easier static analysis and refactoring, potential runtime type checking, and (perhaps, in some contexts) code generation utilizing type information.

Of these goals, static analysis is the most important. This includes support for off-line type checkers such as mypy, as well as providing a standard notation that can be used by IDEs for code completion and refactoring.

Even if you don't use static code analysis at the moment, type hints still have the advantage of making the code easier to read and understand.


In your case, adding type hints (along with adding a verb like is_something instead of something) to functions would make the code more clear. Since this is about file names, I'd also suggest using file_name instead of file, so it's clear we're dealing with a string and not an actual file. For example:

def is_in_whitelist(file_name: str) -> bool:
    return SWIFT_FILE_REGEX.match(file_name) != None

In the same vein, IGNORE_FILES may be better named as IGNORED_ROOT_NAMES since that is what they actually are. pixelbrei's answer addressed a great way to simplify it.

share|improve this answer
1  
Great hints as well. Though Atom certainly freaks out when I add '-> bool'. It will lose the syntax coloring for the next function. – netigger yesterday
    
Strange, maybe Atom has some fix for it... I've used that syntax in IDLE as well as a few online REPL with no problems... – Phrancis yesterday

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.