The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
0
votes
0 answers
81 views
Non HBase solution for huge data that has update and delete in sequential manner
I have to design an application where there are around 5K structured base text files (file.txt) with data and format as below:
Primary key is OrgId + ItemId
OrgId|^|ItemId|^|segmentId|^|Sequence|^|...
-1
votes
1 answer
437 views
Is this Big Data architecture good enough to handle many requests per second?
I would like a review of my big data app plan. I don't have much experience in this field, so any advice would be appreciated.
Here is a link to a diagram of the architecture: My ...
3
votes
4 answers
361 views
Can someone explain the technicalities of MapReduce in layman's terms?
When people talk about MapReduce you think about Google and Hadoop. But what is MapReduce itself? How does it work? I came across this blog post that tries to explain just MapReduce without Hadoop, ...
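To make the question concrete, here is a minimal single-process sketch of the MapReduce idea in plain Python, with no Hadoop involved. It is only an illustration under stated assumptions: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and not part of any framework, and word counting is used as the example job. A real framework runs the map and reduce steps on many machines and performs the grouping step over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for a single key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hello world", "hello hadoop"]))
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both steps can in principle be spread across machines; only the grouping step needs data to move between them.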
0
votes
1 answer
76 views
#Apache-flink: Stream processing or Batch processing using Flink
I am tasked with redesigning an existing catalog processor, and the requirements are as follows.
Requirement
I have 5 to 10 vendors (each vendor can have multiple stores) who would provide me with 'XML' ...
2
votes
2 answers
100 views
SRP in the “big data” setting
We have a codebase at work that:
Ingests (low) thousands of small files. Each of these input files contains about 50k “micro-items”
These “micro-items” are then clustered together to find “macro-...
0
votes
0 answers
487 views
How to use Hadoop HBase with Spring Boot without knowing the schema of the database ahead of time
I have created a basic application with Spring Boot and HSQL which connects an in-memory HSQL database with an AngularJS front end using Spring Boot and Spring JPA with Hibernate. I am now trying to ...
-3
votes
1 answer
154 views
Should I use NoSQL or HDFS for storage?
I have millions of tweets currently stored in HDFS and I plan to analyze them from Spark (data mining, text mining, frequent term-based text clustering, social network analysis); however, I do not know ...
2
votes
0 answers
417 views
Best practices for dashboard of near real-time analytics
I’m currently building a dashboard to view some analytics about the data generated by my company's product.
We use MySQL as our database. The SQL queries to generate the analytics from the raw live ...
1
vote
0 answers
80 views
Improve communication between controller and trackers in a Twitter fetcher tool using RabbitMQ or Apache Flume
I've been working for some time with some researchers developing a tool to fetch tweets from Twitter and process them in some way. The first prototype "worked" but became a pain as we used sockets to ...
0
votes
1 answer
313 views
Is Hadoop designed only for “simple” data processing jobs, where communications between the distributed nodes are sparse?
I am not a professional coder, but rather an engineer/mathematician who uses computers to solve numerical problems. So far most of my problems are math-related, such as solving large scale linear ...
3
votes
1 answer
830 views
Hadoop and Object Reuse, Why?
In Hadoop, objects passed to reducers are reused. This is extremely surprising and hard to track down if you're not expecting it. Furthermore, the original tracker for this "feature" doesn't offer any ...
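The surprise described here can be reproduced without Hadoop. The following is a rough Python analogy, not Hadoop code: `Text` is a hypothetical stand-in for a mutable value holder (it is not `org.apache.hadoop.io.Text`), and `reusing_iterator` is an assumed helper that imitates an iterator handing out the same object on every step.

```python
class Text:
    """Hypothetical mutable value holder, standing in for a reused object."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


def reusing_iterator(raw_values):
    """Yield the SAME holder each time, overwriting its contents in place."""
    holder = Text()
    for raw in raw_values:
        holder.set(raw)
        yield holder


def collect_naive(values):
    """Keep references to the yielded objects (this is the trap)."""
    return [v for v in values]


def collect_with_copy(values):
    """Copy the payload out of the holder before moving to the next item."""
    return [v.value for v in values]


if __name__ == "__main__":
    naive = collect_naive(reusing_iterator(["a", "b", "c"]))
    print([t.value for t in naive])  # every entry shows the last value
    print(collect_with_copy(reusing_iterator(["a", "b", "c"])))
```

Storing references keeps three pointers to one object, so every entry ends up showing the last value written. Copying the payload before advancing avoids this, at the cost of the extra allocations that reuse was presumably meant to save.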
2
votes
1 answer
7k views
How best to implement a Dashboard from data in HDFS/Hadoop [closed]
We have a bunch of data (several TB) in Hadoop HDFS and it's growing.
We want to create a dashboard that reports on the contents in there, e.g. counts of different types of objects, trends over time etc....
3
votes
2 answers
1k views
Text search - big data problem
I have a problem I was hoping I could get some advice on!
I have a LOT of text as input (about 20GB worth, not MASSIVE but big enough). This is just free text, unstructured.
I have a 'category list'...
5
votes
2 answers
9k views
Optimal way to store 18 billion key, value pairs [closed]
I have around 200 million new objects coming in per day, and a 90-day retention policy, so that leaves me with 18 billion records to be stored in the form of key-value pairs.
Key and value both will be a ...
2
votes
1 answer
1k views
How best to merge/sort/page through tons of JSON arrays?
Here's the scenario: Say you have millions of JSON documents stored as text files. Each JSON document is an array of "activity" objects, each of which contains a "created_datetime" attribute. What is ...
1
vote
1 answer
333 views
Is it smart to design a command and control server that will monitor system resources and spin servers up or down at peak times?
I am building an application that will be modular, in a way that it will be a set of separate systems communicating with each other. It uses Hadoop on all systems, and HBase on 3 of the 4.
Scaling ...
4
votes
3 answers
2k views
Why do HDFS clusters have only a single NameNode?
I'm trying to understand better how Hadoop works, and I'm reading
The NameNode is a Single Point of Failure for the HDFS Cluster. HDFS is not currently a High Availability system. When the NameNode ...
4
votes
2 answers
976 views
Asynchronous Java
If I wanted to implement a Java-based web service that does web analytics, what sort of architecture should I use? The actual processing of the Big Data would be done by Hadoop.
...
3
votes
3 answers
33k views
Is Cloudera Hadoop certification worth the investment? [duplicate]
I am considering investing time to learn Hadoop and its related technologies. The problem is that my current day job will not be using Hadoop any time soon and even if I learn from books, blogs ...
1
vote
1 answer
122 views
How do you control nodes in a server farm?
I've been reading about Hadoop and multi-node setups, and the documentation says that you must have a JVM and the Hadoop software already running on those nodes.
My question is, do people install ...
5
votes
2 answers
1k views
Can map-reduce say “Hello World”?
Gathering that map-reduce is being used to process huge amounts of data, I set out to understand it.
My queries were:
What class of problems does it aim to solve?
How does it help breaking down of ...
4
votes
2 answers
1k views
How to convince others we should move to Hadoop?
Everything I've read about Hadoop seems like exactly the technology we need to make our enterprise more scalable. We have terabytes of raw data that is in non-relational form (text files of some kind)....