# apache-spark
Here are 933 public repositories matching this topic...
Coolplay Spark: Spark source code analysis, Spark libraries, and more
Updated May 26, 2019 - Scala
Interactive and Reactive Data Science using Scala and Spark.
Updated Jun 2, 2020 - JavaScript
Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
python
scala
apache-spark
pytorch
keras-tensorflow
bigdl
distributed-deep-learning
deep-neural-network
analytics-zoo
Updated Sep 11, 2020 - Jupyter Notebook
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
Updated Sep 7, 2020 - Java
imback82 commented Sep 6, 2020
The current azure-pipelines.yaml is highly duplicated, especially the Test stages (E2E Tests, E2E Backward Compatibility Tests, and E2E Forward Compatibility Tests).
This should be refactored to remove the duplication and make it easier to maintain (e.g., when adding a new Spark version to test against).
Apache Spark docker image
Updated Aug 15, 2020 - Dockerfile
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
kubernetes
spark
apache-spark
kubernetes-operator
kubernetes-controller
kubernetes-crd
google-cloud-dataproc
Updated Aug 31, 2020 - Go
PySpark + Scikit-learn = Sparkit-learn
Updated Oct 24, 2017 - Python
(Deprecated) Scikit-learn integration package for Apache Spark
Updated Dec 3, 2019 - Python
A curated list of awesome Apache Spark packages and resources.
Updated Jul 16, 2020
data-science
machine-learning
spark
apache-spark
bigdata
data-transformation
pyspark
data-extraction
data-analysis
data-wrangling
dask
data-exploration
data-preparation
data-profiling
data-cleansing
big-data-cleaning
data-cleaner
cudf
Updated Sep 9, 2020 - Jupyter Notebook
C# and F# language binding and extensions to Apache Spark
streaming
spark
apache-spark
csharp
fsharp
bigdata
dataset
spark-streaming
eventhubs
mapreduce
dataframe
rdd
dstream
mobius
kafka-streaming
near-real-time
Updated Nov 1, 2019 - C#
R interface for Apache Spark
Updated Sep 10, 2020 - R
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Updated Jan 24, 2017 - Scala
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
python
airflow
spark
apache-spark
scheduler
s3
data-engineering
data-lake
warehouse
redshift
data-migration
livy
etl-framework
apache-airflow
emr-cluster
etl-pipeline
etl-job
data-engineering-pipeline
airflow-dag
goodreads-data-pipeline
Updated Mar 9, 2020 - Python
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
data-science
machine-learning
spark
apache-spark
deep-learning
hadoop
tensorflow
keras
keras-models
optimization-algorithms
data-parallelism
distributed-optimizers
Updated Jul 25, 2018 - Python
Apache Spark enhanced with native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end is now on https://github.com/apache/spark/
Updated Jan 8, 2020 - Scala
A command-line tool for launching Apache Spark clusters.
Updated Aug 3, 2020 - Python
REST web service for the true real-time scoring (<1 ms) of R, Scikit-Learn and Apache Spark models
Updated Aug 5, 2020 - Java
Papers and reading material related to streaming systems
streaming
apache-spark
storm
stream-processing
spark-streaming
dataflow
flink
heron
drizzle
millwheel
s4
streaming-engine
spe
stream-processing-engine
Updated Mar 31, 2018
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
python
vagrant
data-science
data
machine-learning
airflow
kafka
spark
apache-spark
analytics
machine-learning-algorithms
python3
amazon-ec2
python-3
apache-kafka
amazon-web-services
predictive-analytics
agile-data
data-syndrome
agile-data-science
Updated Jul 29, 2020 - Jupyter Notebook
A list about Apache Kafka
infrastructure
kafka
apache-spark
stream-processing
apache-kafka
kafka-streams
data-processing
data-pipeline
streaming-data
Updated Dec 22, 2019
The Internals of Spark Structured Streaming
Updated Sep 11, 2020
Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
Updated Sep 14, 2015 - Shell
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.
Updated Sep 3, 2020 - Scala
A boilerplate for writing PySpark Jobs
Updated Jul 1, 2020 - Python
MLflow seems to have a length limit of 5000 when setting tags (see below).