I have a directory with 'sets' of files whose names start with a state name followed by 4 or 5 digits (typically indicating the year). Each 'file set' contains three files: a .txt, a .pdf, and a .jpg.
Example of files in directory:
California1998_reform_paper.txt
California1998_reform_paper.pdf
California1998_reform_paper.jpg
California2002_waterdensity_paper.txt
California2002_waterdensity_paper.pdf
California2002_waterdensity_paper.jpg
Based on a user's input, I am trying to write some code that puts each of these file sets into a list of lists. Ultimately I would like to iterate over the list of lists. That said, I am not married to any one data type if a dictionary or something else would be more efficient.
I would like the user to be able to enter either:
- The state name, e.g. 'California', to get all files from California
- OR the state name + year, e.g. 'California1998', to get all files from California 1998
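To make the two accepted inputs concrete, here is a minimal sketch of how I picture the parsing step. It uses the standard-library re module rather than the regex package in my code, and parse_query is a hypothetical name made up for illustration.

```python
import re

# Hypothetical helper: a state name, optionally followed by 4 or 5 digits.
_QUERY = re.compile(r'([A-Za-z]+)([0-9]{4,5})?')

def parse_query(text):
    """Split user input into (state, year); year is None for a bare state name."""
    m = _QUERY.fullmatch(text)
    if m is None:
        raise ValueError(f'Unrecognized input: {text!r}')
    return m.group(1), m.group(2)

print(parse_query('California'))      # ('California', None)
print(parse_query('California1998'))  # ('California', '1998')
```

Anything else, such as 'California19' or '1998', raises ValueError in this sketch.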
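import os
import regex

directory = 'path/to/directory'  # placeholder for the real path
user_input = 'California1998'    # not named input, to avoid shadowing the built-in

def get_file_sets(directory, user_input):
    # Does input match proper format? If not error.
    # [A-Za-z] so that capitalized names such as 'California' match.
    mm = regex.match('^([A-Za-z]+)([0-9]{4,5})|^([A-Za-z]+)', user_input)
    if mm is None:
        print('Put some error message here')
        return []
    # One long string holding every file name in the directory.
    dir_listing = str(os.listdir(directory))
    if mm.group(1):
        # Input was state + year.
        state = mm.group(1)
        number = mm.group(2)
        state_num = state + number
        fileset = regex.findall(state_num, dir_listing)
    else:
        # Input was the state name only.
        state = mm.group(3)
        fileset = regex.findall(state + r'[0-9]{4,5}', dir_listing)
    # Does input exist? If not error.
    if len(fileset) > 0:
        # Deduplicate first, then sort; a set does not keep sorted order.
        fileset = tuple(sorted(set(fileset)))
    else:
        print('Put some error message here')
        return []
    # Get list of lists
    state_num_files = [[file.path
                        for file in os.scandir(directory)
                        if file.name.startswith(state_num)]
                       for state_num in fileset]
    return state_num_files

state_num_files = get_file_sets(directory, user_input)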
Above is the code I have thus far. It first uses regex.match to check the input, then regex.findall to find all matching state + years. I then create a sorted() set() from this list, which is converted into a tuple() called fileset. The last bit of code is a nested list comprehension that produces the list of lists by iterating through all files in the directory and iterating through all the state + years in fileset.
It certainly works, but seems repetitive and slower than it needs to be. My goal is to increase efficiency and remove any unnecessary iteration.
Thoughts on improvements:
- Possibly replace each regex.findall with a nested list comprehension, and thus remove the state_num_files nested comprehension at the end of the script?
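As a rough sketch of the kind of single-pass version I have in mind, assuming a dictionary keyed by state + year is acceptable: the directory is scanned once and each file is appended to its group as it is seen. group_file_sets and _NAME are hypothetical names, the sketch uses re instead of regex, and it assumes every relevant file name starts with letters followed by 4 or 5 digits.

```python
import os
import re
from collections import defaultdict

# Assumed naming scheme: <State><4 or 5 digits><rest of the file name>
_NAME = re.compile(r'([A-Za-z]+)([0-9]{4,5})')

def group_file_sets(directory, state, year=None):
    """Return {state+year: [paths]} using a single pass over the directory."""
    groups = defaultdict(list)
    with os.scandir(directory) as entries:
        for entry in entries:
            m = _NAME.match(entry.name)
            if m is None or m.group(1) != state:
                continue  # different state, or name does not fit the scheme
            if year is not None and m.group(2) != year:
                continue  # a year was requested and this file has another one
            groups[m.group(1) + m.group(2)].append(entry.path)
    # Sort keys and paths so the result is deterministic.
    return {key: sorted(groups[key]) for key in sorted(groups)}
```

If a list of lists is still wanted, list(group_file_sets(directory, 'California').values()) would give one inner list per state + year.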
Any thoughts are greatly appreciated!