I've been working on a Python script that fetches student data from an API using multiple threads for concurrency. The script retrieves both student information and prospectus data and saves them to JSON files.
I would appreciate feedback on the following aspects of the code:
- Exception Handling: The current exception handling is quite broad (bare except:). Are there better practices for handling exceptions here, especially network errors? A rough sketch of the direction I'm considering follows this list.
- Code Duplication: I've noticed some duplication, especially when re-fetching data after encountering an invalid token. How can I refactor the code to make it more concise and maintainable? (A rough refactor idea is sketched after the code.)
- Thread Safety: Given the concurrent nature of the script, I'm particularly interested in potential race conditions around the TokenChecker list used for inter-thread communication. Are there more robust ways to ensure thread safety here? (The lock-based alternative I've been weighing is sketched just before the code.)
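For the exception handling point, this is roughly the narrower direction I've been considering; it's only a sketch, and fetch_json and its URL handling are placeholders rather than part of the script below:

import requests

def fetch_json(url, headers):
    try:
        response = requests.get(url, headers=headers, timeout=(3.05, 5))
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return response.json()
    except requests.exceptions.Timeout:
        print(f"Timed out fetching {url}")
    except requests.exceptions.RequestException as exc:
        print(f"Request failed for {url}: {exc}")
    return None

I'm not sure whether catching requests.exceptions.RequestException is specific enough, or whether ConnectionError and HTTPError deserve separate handling.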
Note: The API vulnerability exploited here has already been reported to the dev team and has been patched. I'm just here to get feedback on the code itself. Thanks for understanding!
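For the thread safety point, the alternative I've been weighing is simply guarding the shared list with a lock (again just a sketch; record_responses is a placeholder name):

import threading

token_checker = []
token_checker_lock = threading.Lock()

def record_responses(student_response, prospectus_response):
    # Serialize appends so concurrent workers can't interleave on the shared list.
    with token_checker_lock:
        token_checker.append({"responses": [student_response, prospectus_response]})

I understand CPython's list.append is effectively atomic under the GIL, but I don't know whether relying on that is considered acceptable, or whether something like queue.Queue would be the more idiomatic choice.

Here is the script: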
import requests
from get_token import get_token
import json
import threading
from datetime import datetime
def write_file(filename, data):
    # Write the raw response body to disk; 'with' guarantees the file is closed.
    with open(filename, 'w') as f:
        f.write(data)
def main():
    token = get_token()
    headers = {
        "Authorization": token
    }
    def fetch_data(idnum, TokenChecker):
        print(f"Fetching Student: {idnum}")
        # Retry until the API gives a definitive answer (200) or a server error (500).
        while True:
            try:
                getStudentResponse = requests.get(f'https://apiname/get_info?studid={idnum}', headers=headers, timeout=(3.05, 5))
                getProspectusResponse = requests.get(f'https://apiname/prospectus?studid={idnum}', headers=headers, timeout=(3.05, 5))
                student_status_code = getStudentResponse.status_code
                prospectus_status_code = getProspectusResponse.status_code
                if (student_status_code == 200 and prospectus_status_code == 200) or (student_status_code == 500 or prospectus_status_code == 500):
                    break
            except:
                print(f"Fetch timed out on ID: {idnum}")
        write_file(f'prospectus/{idnum.split("-")[0]}/{idnum}.json', getProspectusResponse.text)
        write_file(f'student_info/{idnum.split("-")[0]}/{idnum}.json', getStudentResponse.text)
        TokenChecker.append({"responses": [getStudentResponse, getProspectusResponse]})
        return [getStudentResponse, getProspectusResponse]
    for i in range(2019, 2025):
        max_id = 10000
        num_threads = 50
        for j in range(0, max_id, num_threads):
            TokenChecker = []
            threads = []
            for k in range(num_threads):
                idnum = f"{str(i).zfill(4)}-{str(j + k).zfill(4)}"
                t = threading.Thread(target=fetch_data, args=(idnum, TokenChecker,))
                t.daemon = True
                threads.append(t)
            for k in range(num_threads):
                threads[k].start()
            for k in range(num_threads):
                threads[k].join()
            getStudentResponse, getProspectusResponse = TokenChecker[len(TokenChecker) - 1]['responses']
            student_info = json.loads(getStudentResponse.text)
            prospectus = json.loads(getProspectusResponse.text)
            try:
                if student_info['message'] == 'Token is invalid' or prospectus['message'] == 'Token is invalid':
                    print(f'Change Token at: {datetime.now().strftime("%d/%m/%Y %H:%M:%S")}')
                    headers = {
                        "Authorization": get_token()
                    }
                    # Re-run the same batch with the refreshed token.
                    threads = []
                    for k in range(num_threads):
                        idnum = f"{str(i).zfill(4)}-{str(j + k).zfill(4)}"
                        t = threading.Thread(target=fetch_data, args=(idnum, TokenChecker,))
                        t.daemon = True
                        threads.append(t)
                    for k in range(num_threads):
                        threads[k].start()
                    for k in range(num_threads):
                        threads[k].join()
            except:
                pass
print("Fetching Done.")
if __name__ == "__main__":
main()
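Regarding the duplication point, the rough direction I've considered is pulling the batch logic into a single helper built on concurrent.futures, so the token-refresh retry reuses it instead of repeating the thread-spawning loops. This is only a sketch; fetch_one is a placeholder for the real per-student fetch and isn't part of the script above:

from concurrent.futures import ThreadPoolExecutor

def fetch_one(idnum, headers):
    # Placeholder for the real per-student fetch (the two GETs plus the file writes).
    ...

def fetch_batch(id_numbers, headers, num_threads=50):
    # One place that runs a whole batch; the token-refresh path can call it again
    # with refreshed headers instead of duplicating the threading boilerplate.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(fetch_one, idnum, headers) for idnum in id_numbers]
        return [future.result() for future in futures]

I'd also welcome opinions on whether a thread pool is the cleaner way to remove the duplicated loops or overkill for this script.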