Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing.
2
votes
1answer
110 views
CSMR for large-scale text-prcessing
I'm working on a project for large-scale text-processing, which is a first implementation of the basic idea of CSMR. CSMR is an algorithm that measures the similarity between documents by calculating ...
5
votes
2answers
165 views
Map reduce tester ported from bash to Python
My MapReduce tester is clearly ported from Shell, short of args=None for line in args or read_input(), what's a better way of ...
2
votes
1answer
47 views
Mockable adapters
The inspiration for this question comes from a talk by IanCooper; in particular, this remark:
Never mock somebody else's interface... you want to mock your abstraction of how you call the third ...
6
votes
1answer
383 views
Shortest Path in a Unit Cost DAG BFS in Hadoop
Just got done implementing a shortest path algo of a unit cost DAG BFS in Hadoop and wanting to get anyone's opinion on how I could have done it better. The main issue I see is that I lose the ...
7
votes
1answer
10k views
Calculate min, max, average, and variance on a large dataset
I got a piece of Java code using Hadoop to calculate min, max, average and variance on a large dataset made of (index value) couples separated by a newline:
...
4
votes
1answer
144 views
Efficient way to copy unordered String into ordered String
I'm doing this in Hadoop Java where I'm reading a String. The string is huge that has been tokenized and put in an array. It has key-value pairs but they are not in any order. I want this order to be ...