Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

3 votes · 1 answer · 32 views

Classifying and counting database entries using Scala map and flatMap

I am new to Spark and Scala, and I have solved the following problem. I have a table in a database with the following structure: ...
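Classify-and-count questions like this one usually come down to a flatMap-then-reduceByKey pattern. A minimal plain-Python analogue of that Spark pipeline, using hypothetical rows where each database entry carries one or more category tags:

```python
from collections import Counter
from itertools import chain

# Hypothetical rows: (entry_id, list of category tags).
rows = [("a", ["x", "y"]), ("b", ["y"]), ("c", ["x"])]

# flatMap analogue: each input row expands to zero or more (tag, 1) pairs.
pairs = chain.from_iterable(((tag, 1) for tag in tags) for _, tags in rows)

# reduceByKey analogue: sum the counts per tag.
counts = Counter()
for tag, n in pairs:
    counts[tag] += n
# counts == {"x": 2, "y": 2}
```

In Spark the same shape would be `rdd.flatMap(...).reduceByKey(lambda a, b: a + b)`; the local version just makes the two stages explicit.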
0 votes · 1 answer · 112 views

Unit testing Spark transformation on DataFrame

Looking for suggestions on how to unit test a Spark transformation with ScalaTest. The test class generates a DataFrame from static data, passes it to a transformation, then makes assertions on the ...
0 votes · 0 answers · 31 views

Spark code for doing SQL operations on one or many JSON files

The input is one or many JSON files on which I run a query and print the result. I have tried both collect and saveAsTextFile, but both are slow, and I would appreciate suggestions to help speed things up. Here is ...
3 votes · 0 answers · 109 views

PySpark DataFrames program to process huge amounts of server data from a Parquet file

I'm new to Spark and DataFrames, and I'm looking for feedback on any bad or inefficient practices in my code so I can improve and learn. My program reads in a Parquet file that contains ...
0 votes · 1 answer · 117 views

Python + Spark to parse and save logs

I need to parse logs and have the following code. I can see two problems: map().filter() may incur a performance penalty, and there is a copy-pasted block in parser.py: ...
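On the map().filter() concern: both in Spark (where narrow transformations are pipelined within a stage) and in plain Python (where map is lazy), chaining the two rarely costs an extra pass, and they can be fused into one comprehension anyway. A sketch with a hypothetical parse function that returns None on failure:

```python
# Hypothetical log lines and parser; a failed parse yields None.
lines = ["GET /a 200", "garbage", "GET /b 404"]

def parse(line):
    parts = line.split()
    return tuple(parts) if len(parts) == 3 else None

# Chained style, mirroring rdd.map(parse).filter(lambda p: p is not None):
chained = [p for p in map(parse, lines) if p is not None]

# Fused style: parse and filter in a single comprehension.
fused = [p for p in (parse(l) for l in lines) if p is not None]

assert chained == fused  # identical results either way
```

The real win for readability is usually factoring the copy-pasted block into a function like `parse` above, not avoiding the filter.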
2 votes · 1 answer · 201 views

Implementing an inner product using pyspark

I'm trying to implement a dot product using pyspark in order to learn pyspark's syntax. I've currently implemented the dot product like so: ...
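A dot product over two RDDs is typically zip → map(multiply) → reduce(add), assuming the two vectors are partitioned identically so zip is valid. A plain-Python sketch of that pipeline:

```python
from functools import reduce
from operator import add

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# zip the vectors, multiply pairwise, then reduce with addition,
# mirroring rdd1.zip(rdd2).map(lambda ab: ab[0] * ab[1]).reduce(add)
dot = reduce(add, (a * b for a, b in zip(x, y)))
assert dot == 32.0  # 1*4 + 2*5 + 3*6
```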
1 vote · 0 answers · 103 views

Increase performance of a Spark collaborative-recommendation job

This is my first Spark application. I am using ALS.train for training the model (matrix factorization). The total time the application takes is approximately 45 minutes. Note: I think takeOrdered is the ...
1 vote · 0 answers · 91 views

Performance of collect and parallelize in Spark

The code below is working fine, but as this is new to me, please help me improve its performance. I have not yet included my complex logic, but I am confused about collect and parallelize. The aim is to collect ...
10 votes · 1 answer · 9k views

Generic “reduceBy” or “groupBy + aggregate” functionality with Spark DataFrame

Maybe I totally reinvented the wheel, or maybe I've invented something new and useful. Can one of you tell me if there's a better way of doing this? Here's what I'm trying to do: I want a generic ...
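A generic reduce-by-key can be sketched in plain Python as a dictionary fold with a caller-supplied combine function, analogous in spirit to DataFrame `groupBy(...).agg(...)` or RDD `reduceByKey`; the function name here is illustrative, not part of any Spark API:

```python
def reduce_by_key(pairs, combine):
    """Fold the values of each key with a caller-supplied combine function."""
    out = {}
    for k, v in pairs:
        out[k] = combine(out[k], v) if k in out else v
    return out

data = [("a", 1), ("b", 2), ("a", 3)]
assert reduce_by_key(data, lambda x, y: x + y) == {"a": 4, "b": 2}
assert reduce_by_key(data, max) == {"a": 3, "b": 2}
```

Passing the aggregation in as a function is what makes the helper generic: sum, max, min, or any associative combiner all reuse the same grouping loop.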
4 votes · 0 answers · 147 views

RandomForest multi-class classification

Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code. I ...
7 votes · 2 answers · 402 views

Average movie rankings

Given a list of tuples of the form (a, b, c), is there a more direct or optimized way of calculating the average of all the c's ...
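The direct approach is a single pass that extracts the third element of each tuple and averages it; in Spark this would be a map to c followed by mean(). A plain-Python sketch with made-up tuples:

```python
# Hypothetical (a, b, c) triples; only c matters for the average.
triples = [(1, "x", 2.0), (2, "y", 4.0), (3, "z", 6.0)]

# map to c, then average: mirrors rdd.map(lambda t: t[2]).mean()
cs = [c for _, _, c in triples]
avg_c = sum(cs) / len(cs)
assert avg_c == 4.0
```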
0 votes · 1 answer · 325 views

Class for finding the median of a two-dimensional space

I have a simple static class whose purpose is, given an RDD of Point, to find the median of each dimension and return that as a new ...
8 votes · 2 answers · 19k views

Producing a sorted wordcount with Spark

I'm currently learning how to use Apache Spark. In order to do so, I implemented a simple wordcount (not really original, I know). There already exists an example in the documentation providing the ...
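The canonical Spark pipeline for a sorted wordcount is flatMap(split) → map to (word, 1) → reduceByKey(add) → sortBy(count, descending). A plain-Python analogue of those stages, with Counter standing in for reduceByKey:

```python
from collections import Counter

lines = ["to be or not to be", "to see or not to see"]

# flatMap + reduceByKey: split every line into words, count per word.
counts = Counter(word for line in lines for word in line.split())

# sortBy: descending count, alphabetical order to break ties.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
assert ranked[0] == ("to", 4)
```

The tie-break in the sort key is worth keeping even in the Spark version, so output order is deterministic across runs.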
5 votes · 1 answer · 1k views

Why does LR on Spark run so slowly?

Because MLlib does not support sparse input, I ran the following code, which supports the sparse input format, on a Spark cluster. The settings are: 5 nodes, each with 8 cores (all the ...