The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.


0
votes
0 answers
81 views

Non-HBase solution for huge data that has updates and deletes in sequential manner

I have to design an application where there are around 5K structured base text files (file.txt) with data and format as below: Primary key is OrgId + ItemId OgId|^|ItemId|^|segmentId|^|Sequence|^|...
-1
votes
1 answer
437 views

Is this Big Data architecture good enough to handle many requests per second?

I want to ask for a review of my big data app plan. I don't have much experience in this field, so any advice would be appreciated. Here is a link to a diagram of the architecture: My ...
3
votes
4 answers
361 views

Can someone explain the technicalities of MapReduce in layman's terms?

When people talk about MapReduce you think about Google and Hadoop. But what is MapReduce itself? How does it work? I came across this blog post that tries to explain just MapReduce without Hadoop, ...
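The core idea the question asks about can be sketched without Hadoop at all. The toy sketch below (plain Python, no framework) shows the three phases: map emits (key, value) pairs, shuffle groups values by key, and reduce combines each group into one result:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word, e.g. ("fox", 1)
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result
    return (key, sum(values))

docs = ["the quick brown fox", "the lazy dog"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["the"] == 2, counts["fox"] == 1
```

The point of the split is that map calls are independent (so they can run on different machines over different chunks of input) and each reduce call only ever sees the values for one key, so it too can run anywhere.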
0
votes
1 answer
76 views

#Apache-flink: Stream processing or Batch processing using Flink

I am tasked with redesigning an existing catalog processor, and the requirement goes as below: I have 5 to 10 vendors (each vendor can have multiple stores) who would provide me with 'XML' ...
2
votes
2 answers
100 views

SRP in the “big data” setting

We have a codebase at work that: Ingests (low) thousands of small files. Each of these input files contains about 50k “micro-items”. These “micro-items” are then clustered together to find “macro-...
0
votes
0 answers
487 views

How to use Hadoop HBase with Spring Boot without knowing the schema of the database ahead of time

I have created a basic application with spring boot and HSQL which connects an in-memory HSQL database with an angularjs front end using spring-boot and spring JPA with Hibernate. I am now trying to ...
-3
votes
1 answer
154 views

Should I use NoSQL or HDFS for storage?

I have millions of tweets currently stored in HDFS and I plan to analyze them from Spark (Data mining, text mining, Frequent Term-Based Text Clustering, Social Network Analysis); however, I do not know ...
2
votes
0 answers
417 views

Best practices for dashboard of near real-time analytics

I’m currently building a dashboard to view some analytics about the data generated by my company's product. We use MySQL as our database. The SQL queries to generate the analytics from the raw live ...
1
vote
0 answers
80 views

Improve communication between controller and trackers in a Twitter fetcher tool using RabbitMQ or Apache Flume

I've been working for some time with some researchers developing a tool to fetch tweets from Twitter and process them in some way. The first prototype "worked" but became a pain as we used sockets to ...
0
votes
1 answer
313 views

Is hadoop designed only for “simple” data processing jobs, where communications between the distributed nodes are sparse?

I am not a professional coder, but rather an engineer/mathematician that uses computer to solve numerical problems. So far most of my problems are math-related, such as solving large scale linear ...
3
votes
1 answer
830 views

Hadoop and Object Reuse, Why?

In Hadoop, objects passed to reducers are reused. This is extremely surprising and hard to track down if you're not expecting it. Furthermore, the original tracker for this "feature" doesn't offer any ...
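The recycled-object behavior the question describes can be mimicked in plain Python. This is only an analogue (no Hadoop here, just a generator that reuses one mutable `Record`), but it shows exactly why references taken during iteration go stale and why values must be copied before being stored:

```python
import copy

class Record:
    def __init__(self):
        self.value = None

def reusing_iterator(values):
    # Analogue of Hadoop's reducer value iterator: a single Record
    # instance is recycled for every element, so every reference
    # handed out points at the same (last-written) object.
    record = Record()
    for v in values:
        record.value = v
        yield record

# Buggy: storing the yielded objects keeps only the final value
seen = [r for r in reusing_iterator([1, 2, 3])]
assert [r.value for r in seen] == [3, 3, 3]

# Correct: copy (or extract) the data before the next iteration
# overwrites the shared instance
seen = [copy.copy(r) for r in reusing_iterator([1, 2, 3])]
assert [r.value for r in seen] == [1, 2, 3]
```

In Hadoop itself the usual rationale is allocation pressure: reusing one object per key/value avoids creating millions of short-lived objects per task, at the cost of this surprising aliasing behavior.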
2
votes
1 answer
7k views

How best to implement a Dashboard from data in HDFS/Hadoop [closed]

We have a bunch of data (several TB) in Hadoop HDFS and it's growing. We want to create a dashboard that reports on the contents in there, e.g. counts of different types of objects, trends over time, etc....
3
votes
2 answers
1k views

Text search - big data problem

I have a problem I was hoping I could get some advice on! I have a LOT of text as input (about 20GB worth, not MASSIVE but big enough). This is just free text, unstructured. I have a 'category list'...
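The excerpt is truncated, so the exact shape of the 'category list' is an assumption; but if it maps category names to keyword sets, the per-document matching step (the part that would run inside each map task over a chunk of the 20GB) can be sketched as a simple set intersection. Category names and keywords below are hypothetical:

```python
# Hypothetical category list: name -> keywords that signal the category
categories = {
    "sports": {"football", "goal", "league"},
    "finance": {"stock", "market", "dividend"},
}

def categorize(text):
    # Tokenize crudely and return every category whose keyword set
    # intersects the document's words
    words = set(text.lower().split())
    return sorted(name for name, kws in categories.items() if words & kws)

matches = categorize("The stock market rallied after the football league final")
assert matches == ["finance", "sports"]
```

For real inputs a trie or Aho-Corasick automaton over the keyword list would handle multi-word phrases and scale better than per-word set lookups, but the parallelization story is the same: documents are independent, so the corpus can be split freely across workers.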
5
votes
2 answers
9k views

Optimal way to store 18 billion key, value pairs [closed]

I have around 200 million new objects coming in, and a 90 day retention policy, so that leaves me with 18 billion records to be stored in the form of key-value pairs. Key and value both will be a ...
2
votes
1 answer
1k views

How best to merge/sort/page through tons of JSON arrays?

Here's the scenario: Say you have millions of JSON documents stored as text files. Each JSON document is an array of "activity" objects, each of which contain a "created_datetime" attribute. What is ...
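One common answer, assuming each document's activities are (or can first be) sorted by `created_datetime`: a lazy k-way merge, which keeps only one "head" element per file in memory rather than loading everything. The sketch below uses Python's `heapq.merge` on two hypothetical in-memory "files":

```python
import heapq
import json

# Hypothetical input: each "file" is a JSON array of activity objects
# with a "created_datetime" attribute (ISO 8601 strings sort lexically)
file_a = json.dumps([{"created_datetime": "2023-01-01", "id": 1},
                     {"created_datetime": "2023-01-03", "id": 3}])
file_b = json.dumps([{"created_datetime": "2023-01-02", "id": 2}])

def activities(raw):
    # In a real system this would stream from disk; each array must
    # already be sorted by created_datetime for the merge to be correct
    yield from json.loads(raw)

# heapq.merge lazily merges any number of sorted streams, so memory
# use is proportional to the number of files, not the total activities
merged = heapq.merge(activities(file_a), activities(file_b),
                     key=lambda a: a["created_datetime"])

page = [a["id"] for a in merged][:3]  # first "page" of results
assert page == [1, 2, 3]
```

Paging deep into the merged stream still means consuming everything before the requested offset, which is why at scale this is usually paired with pre-sorted, partitioned storage (sort by datetime at write time, then read only the relevant partitions).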
1
vote
1 answer
333 views

Is it smart to design a command and control server, that will monitor system resources and spin up/spin down servers at times of peak?

I am building an application that will be modular, in a way that it will be a set of separate systems communicating with each other. It uses Hadoop on all systems, and HBase on 3 of the 4. Scaling ...
4
votes
3 answers
2k views

Why do HDFS clusters have only a single NameNode?

I'm trying to understand better how Hadoop works, and I'm reading The NameNode is a Single Point of Failure for the HDFS Cluster. HDFS is not currently a High Availability system. When the NameNode ...
4
votes
2 answers
976 views

Asynchronous Java

I'm wondering, if I wanted to implement a web service based on Java that does web analytics, what sort of architecture should I use. The actual processing of the Big Data would be done by Hadoop. ...
3
votes
3 answers
33k views

Is Cloudera Hadoop certification worth the investment [duplicate]

I am considering investing time to learn Hadoop and its related technologies. The problem is that my current day job will not be using Hadoop any time soon, and even if I learn from books, blogs ...
1
vote
1 answer
122 views

How do you control nodes in a server farm?

I've been reading about hadoop and multi-node setups, and it says in the documentation that you must have a JVM and hadoop software already running on those nodes. My question is, do people install ...
5
votes
2 answers
1k views

Can map-reduce say “Hello World”?

Gathering that map-reduce is being used to process huge amounts of data, I set out to understand it. My queries were: What class of problems does it aim to solve? How does it help breaking down of ...
4
votes
2 answers
1k views

How to convince others we should move to Hadoop?

Everything I've read about Hadoop seems like exactly the technology we need to make our enterprise more scalable. We have terabytes of raw data that is in non-relational form (text files of some kind)....