This method in my code uses a producer-consumer design to read many files and process them for use in an NLP algorithm. The completion service collects documents in whatever order they finish processing, which means that docId != index. Since I need to retrieve probabilities elsewhere where docId == index, I've implemented a priority queue to sort them. However, I do this by sequentially retrieving documents from a list of futures and then putting them into the queue.
I know this implementation isn't great. I keep staring at ConcurrentSkipListSet, thinking I should be able to use it for better performance and avoid these issues. I'm brand new to concurrency and Java, so I'm sure there's a lot to critique.
How can I make this multithreading more efficient?
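To make the ordering problem concrete, here is a self-contained sketch (not the code from the question; `Doc` is a stand-in for the real `Document` class) of how a `CompletionService` hands results back in completion order rather than submission order, and how a `PriorityQueue` restores docId order afterward:

```java
import java.util.PriorityQueue;
import java.util.concurrent.*;

public class CompletionOrderDemo {
    // Stand-in for Document: just an id with a natural ordering.
    static class Doc implements Comparable<Doc> {
        final int docId;
        Doc(int docId) { this.docId = docId; }
        @Override public int compareTo(Doc o) { return Integer.compare(docId, o.docId); }
    }

    static String run() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<Doc> cs = new ExecutorCompletionService<>(pool);
        int n = 8;
        for (int i = 0; i < n; i++) {
            final int id = i;
            // Later ids sleep less, so completion order tends to reverse submission order.
            cs.submit(() -> { Thread.sleep((n - id) * 20L); return new Doc(id); });
        }
        PriorityQueue<Doc> sorted = new PriorityQueue<>();
        for (int i = 0; i < n; i++) {
            sorted.add(cs.take().get()); // take() yields futures in COMPLETION order
        }
        pool.shutdown();
        // Polling the heap yields docId order regardless of completion order.
        StringBuilder sb = new StringBuilder();
        while (!sorted.isEmpty()) sb.append(sorted.poll().docId).append(' ');
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

Draining with `take()` as each task finishes, rather than iterating the futures list in submission order, is the pattern `ExecutorCompletionService` is designed for.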
Vectorizer:
public class Vectorizer {

    public Vectorizer() {
    }

    public PriorityQueue<Document> readAll(File fileDir)
            throws InterruptedException, ExecutionException, IOException {
        // Read each file. When each file is vectorized, put it in a minibatch.
        // Producer-consumer threading structure.
        int NUM_THREADS = Runtime.getRuntime().availableProcessors();
        BlockingQueue<LinkedList<String>> queue = new ArrayBlockingQueue<>(50);
        ExecutorService service = Executors.newFixedThreadPool(NUM_THREADS);
        CompletionService<Document> completionService =
                new ExecutorCompletionService<Document>(service);
        // TODO: is list the most efficient structure to handle future?
        List<Future<Document>> docs = new ArrayList<Future<Document>>();
        for (int i = 0; i < (NUM_THREADS - 1); i++) {
            docs.add(completionService.submit(new DocumentConsumer(queue)));
        }
        // Wait for ReadFile to complete
        service.submit(new ReadFile(queue, fileDir)).get();
        service.shutdownNow(); // interrupt CPUTasks
        // Wait for DocumentConsumer to complete
        service.awaitTermination(365, TimeUnit.DAYS);
        // Do things with processed docs.
        // Should I be doing this, though?
        PriorityQueue<Document> Documents = new PriorityQueue<Document>();
        for (Future<Document> d : docs) {
            try {
                Document doc = d.get();
                Documents.add(doc);
                System.out.println(Integer.toString(doc.Cj));
            } catch (ExecutionException e) {
                e.getCause();
                e.printStackTrace();
            }
        }
        return Documents;
    }
}
Document object (with the compare override and other overloads omitted):
public class Document implements Comparator<Document> {

    int docId, Cj;
    HashMap<String, Integer> termTable;

    public Document(int id, int Cj, HashMap<String, Integer> termMap) {
        this.docId = id;
        this.Cj = Cj;
        this.termTable = termMap;
    }
    // compare override omitted
}
docId is set during file reading: I read in a file and append the count to the first line, which is then set during processing.
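Since the compare override is omitted above, here is one hypothetical way it could look, ordering by docId (an assumption based on the stated goal of docId == index). Note that `new PriorityQueue<Document>()` uses natural ordering (`Comparable`), not a `Comparator` that the element class happens to implement, so `Document` would also need `Comparable` (or the queue needs a `Comparator` passed to its constructor) for the queue in `readAll` to work:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class DocumentOrderDemo {
    // Stripped-down Document: only the docId field matters for ordering;
    // the real class also carries Cj and the termTable.
    static class Document implements Comparator<Document>, Comparable<Document> {
        final int docId;
        Document(int docId) { this.docId = docId; }

        // Hypothetical compare override: ascending by docId.
        @Override public int compare(Document a, Document b) {
            return Integer.compare(a.docId, b.docId);
        }

        // new PriorityQueue<Document>() falls back to natural ordering,
        // so this is the method the queue actually calls.
        @Override public int compareTo(Document other) {
            return Integer.compare(this.docId, other.docId);
        }
    }

    static String drain() {
        PriorityQueue<Document> q = new PriorityQueue<>();
        q.add(new Document(3));
        q.add(new Document(1));
        q.add(new Document(2));
        StringBuilder sb = new StringBuilder();
        while (!q.isEmpty()) sb.append(q.poll().docId); // poll() in docId order
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(drain());
    }
}
```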
The rest of the code is available here.