This method in my code uses a producer-consumer design to read many files and process them for use in an NLP algorithm. The completion service collects documents in whatever order they finish processing, which means that docId != index. Since I need to retrieve probabilities elsewhere where docId == index, I've implemented a priority queue to sort them. However, I do this by sequentially retrieving documents from a list of futures and then putting them into the queue.
I know this implementation isn't great. I keep staring at ConcurrentSkipListSet, thinking I should be able to use it for better performance and avoid these issues. I'm brand new to concurrency and Java, so I'm sure there's a lot to critique.
How can I make this multithreading more efficient?
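To make the ordering problem concrete, here is a self-contained sketch (not the code from the question; `Doc` is a stand-in for the real `Document` class) of how a `CompletionService` hands results back in completion order rather than submission order, and how a `PriorityQueue` restores docId order afterward:

```java
import java.util.PriorityQueue;
import java.util.concurrent.*;

public class CompletionOrderDemo {
    // Stand-in for Document: just an id with a natural ordering.
    static class Doc implements Comparable<Doc> {
        final int docId;
        Doc(int docId) { this.docId = docId; }
        @Override public int compareTo(Doc o) { return Integer.compare(docId, o.docId); }
    }

    static String run() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<Doc> cs = new ExecutorCompletionService<>(pool);
        int n = 8;
        for (int i = 0; i < n; i++) {
            final int id = i;
            // Later ids sleep less, so completion order tends to reverse submission order.
            cs.submit(() -> { Thread.sleep((n - id) * 20L); return new Doc(id); });
        }
        PriorityQueue<Doc> sorted = new PriorityQueue<>();
        for (int i = 0; i < n; i++) {
            sorted.add(cs.take().get()); // take() yields futures in COMPLETION order
        }
        pool.shutdown();
        // Polling the heap yields docId order regardless of completion order.
        StringBuilder sb = new StringBuilder();
        while (!sorted.isEmpty()) sb.append(sorted.poll().docId).append(' ');
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

Draining with `take()` as each task finishes, rather than iterating the futures list in submission order, is the pattern `ExecutorCompletionService` is designed for.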
Vectorizer:
public class Vectorizer {

    public Vectorizer() {
    }

    public PriorityQueue<Document> readAll(File fileDir)
            throws InterruptedException, ExecutionException, IOException {
        // Read each file. When each file is vectorized, put it in a minibatch.
        // Producer-consumer threading structure.
        int NUM_THREADS = Runtime.getRuntime().availableProcessors();
        BlockingQueue<LinkedList<String>> queue = new ArrayBlockingQueue<>(50);
        ExecutorService service = Executors.newFixedThreadPool(NUM_THREADS);
        CompletionService<Document> completionService =
                new ExecutorCompletionService<Document>(service);
        // TODO: is list the most efficient structure to handle future?
        List<Future<Document>> docs = new ArrayList<Future<Document>>();
        for (int i = 0; i < (NUM_THREADS - 1); i++) {
            docs.add(completionService.submit(new DocumentConsumer(queue)));
        }
        // Wait for ReadFile to complete
        service.submit(new ReadFile(queue, fileDir)).get();
        service.shutdownNow(); // interrupt CPUTasks
        // Wait for DocumentConsumer to complete
        service.awaitTermination(365, TimeUnit.DAYS);
        // Do things with processed docs.
        // Should I be doing this, though?
        PriorityQueue<Document> Documents = new PriorityQueue<Document>();
        for (Future<Document> d : docs) {
            try {
                Document doc = d.get();
                Documents.add(doc);
                System.out.println(Integer.toString(doc.Cj));
            } catch (ExecutionException e) {
                e.getCause();
                e.printStackTrace();
            }
        }
        return Documents;
    }
}
Document object (with the compare override and other overloads omitted):
public class Document implements Comparator<Document> {

    int docId, Cj;
    HashMap<String, Integer> termTable;

    public Document(int id, int Cj, HashMap<String, Integer> termMap) {
        this.docId = id;
        this.Cj = Cj;
        this.termTable = termMap;
    }
    // compare override omitted
}
docId is set during file reading: I read in a file and append the count to the first line, which is then set during processing.
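Since the compare override is omitted above, here is one hypothetical way it could look, ordering by docId (an assumption based on the stated goal of docId == index). Note that `new PriorityQueue<Document>()` uses natural ordering (`Comparable`), not a `Comparator` that the element class happens to implement, so `Document` would also need `Comparable` (or the queue needs a `Comparator` passed to its constructor) for the queue in `readAll` to work:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class DocumentOrderDemo {
    // Stripped-down Document: only the docId field matters for ordering;
    // the real class also carries Cj and the termTable.
    static class Document implements Comparator<Document>, Comparable<Document> {
        final int docId;
        Document(int docId) { this.docId = docId; }

        // Hypothetical compare override: ascending by docId.
        @Override public int compare(Document a, Document b) {
            return Integer.compare(a.docId, b.docId);
        }

        // new PriorityQueue<Document>() falls back to natural ordering,
        // so this is the method the queue actually calls.
        @Override public int compareTo(Document other) {
            return Integer.compare(this.docId, other.docId);
        }
    }

    static String drain() {
        PriorityQueue<Document> q = new PriorityQueue<>();
        q.add(new Document(3));
        q.add(new Document(1));
        q.add(new Document(2));
        StringBuilder sb = new StringBuilder();
        while (!q.isEmpty()) sb.append(q.poll().docId); // poll() in docId order
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(drain());
    }
}
```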
The rest of the code is available here.