Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing.

learn more… | top users | synonyms

2
votes
1answer
110 views

CSMR for large-scale text-prcessing

I'm working on a project for large-scale text-processing, which is a first implementation of the basic idea of CSMR. CSMR is an algorithm that measures the similarity between documents by calculating ...
5
votes
2answers
165 views

Map reduce tester ported from bash to Python

My MapReduce tester is clearly ported from Shell, short of args=None for line in args or read_input(), what's a better way of ...
2
votes
1answer
47 views

Mockable adapters

The inspiration for this question comes from a talk by IanCooper; in particular, this remark: Never mock somebody else's interface... you want to mock your abstraction of how you call the third ...
6
votes
1answer
383 views

Shortest Path in a Unit Cost DAG BFS in Hadoop

Just got done implementing a shortest path algo of a unit cost DAG BFS in Hadoop and wanting to get anyone's opinion on how I could have done it better. The main issue I see is that I lose the ...
7
votes
1answer
10k views

Calculate min, max, average, and variance on a large dataset

I got a piece of Java code using Hadoop to calculate min, max, average and variance on a large dataset made of (index value) couples separated by a newline: ...
4
votes
1answer
144 views

Efficient way to copy unordered String into ordered String

I'm doing this in Hadoop Java where I'm reading a String. The string is huge that has been tokenized and put in an array. It has key-value pairs but they are not in any order. I want this order to be ...