# data-lake
Here are 144 public repositories matching this topic...
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Updated Aug 10, 2021 - Java
An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.
python
airflow
spark
apache-spark
scheduler
s3
data-engineering
data-lake
warehouse
redshift
data-migration
livy
etl-framework
apache-airflow
emr-cluster
etl-pipeline
etl-job
data-engineering-pipeline
airflow-dag
goodreads-data-pipeline
Updated Mar 9, 2020 - Python
Apache Kyuubi is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark
kubernetes
sql
spark
hive
hadoop
jdbc
thrift
data-lake
spark-sql
kyuubi-server
thrift-jdbc
odbc-server
Updated Sep 1, 2021 - Scala
A few projects related to data engineering, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.
infrastructure
aws
postgres
data
airflow
cloudformation
cassandra
cluster
aws-s3
aws-sdk
data-warehouse
data-engineering
data-lake
aws-ec2
postgresql-database
data-modeling
cassandra-database
etl-pipeline
data-engineering-pipeline
airflow-operators
Updated Mar 5, 2020 - Python
Generic Data Ingestion & Dispersal Library for Hadoop
Updated Jun 3, 2021 - Java
Use SQL to build ELT pipelines on a data lakehouse.
sql
apache-spark
etl
pipelines
data-engineering
data-lake
data-transfer
delta
data-integration
upsert
elt
data-pipeline
datalake
data-ingestion
spark-sql
zeppelin-notebook
apache-iceberg
lakehouse
incremental-updates
Updated Aug 20, 2021 - JavaScript
Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required
python
java
kubernetes
mqtt
cloud
kafka
mongodb
tensorflow
terraform
gcp
grpc
data-lake
confluent
hivemq
kafka-connect
kafka-streams
ksql
ksqldb
tiered-storage
tensorflow-io
Updated Nov 5, 2020 - Jupyter Notebook
Enterprise-grade, production-hardened, serverless data lake on AWS
Updated Aug 30, 2021 - Python
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
aws
data
privacy
big-data
s3
data-lake
parquet
gdpr
right-to-be-forgotten
amazon-s3
data-erasure
ccpa
Updated Aug 25, 2021 - Python
Reference Architectures for Datalakes on AWS
glue
amazon-emr
data-transformation
data-lake
data-catalog
data-analytics
hive-metastore
emr-cluster
ingest-data
Updated May 13, 2020 - HTML
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
Updated Jul 12, 2021 - Scala
Framework to quickly build and maintain Smart Data Lakes
scala
spark
hive
hadoop
transform-data
data-lake
data-pipelines
comprehensive
deltalake
smart-data-lake
Updated Aug 30, 2021 - Scala
Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution
azure
data-storage
resources
data-engineering
data-lake
azure-storage
batch-processing
data-engineer
azure-data-factory
microsoft-azure
azure-portal
azure-cosmos-db
azure-services
polybase
certification-prep
azure-databricks
exam-prep
azure-certification
dp-200
sql-dw
Updated Aug 5, 2020
Apache Spark Course Material
Updated Jul 26, 2020 - Scala
Query API for aggregated Zeebe data
Updated Aug 26, 2021 - Kotlin
Personal Data Engineering Projects
postgres
airflow
spark
cassandra
mongodb
data-warehouse
data-engineering
data-lake
scrapy
data-modeling
aws-redshift
star-schema
ingest-data
data-engineering-nanodegree
Updated Apr 1, 2021 - Jupyter Notebook
Learn how to use Kinesis Firehose, AWS Glue, S3, and Amazon Athena by streaming and analyzing Reddit comments in real time. 100-200 level tutorial.
real-time
reddit
sentiment-analysis
data-stream
tutorials
data-lake
kinesis-firehose
self-learning
sentiment-classification
amazon-athena
aws-glue
delivery-stream
Updated Apr 20, 2021 - Python
A K8s-based infrastructure for analytics
infrastructure
data-science
machine-learning
streaming
spark
analytics
data-lake
k8s
lambda-architecture
data-mill
Updated Jan 15, 2020 - Shell
Apache Spark 3 - Structured Streaming Course Material
Updated Sep 26, 2020 - Python
The JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
python
aws
airflow
sql
spark
analytics
s3
jobs
pyspark
data-engineering
data-lake
redshift
jobseeker
jobsearch
data-modeling
data-pipeline
jobscheduler
Updated Aug 30, 2021 - Python
Herd-MDL, a turnkey managed data lake in the cloud. See https://finraos.github.io/herd-mdl/ for more information.
Updated Aug 25, 2021 - Java
This repository holds the Python files and notebooks associated with the Udacity Data Engineering Nanodegree.
aws
airflow
cassandra
aws-s3
postgresql
data-engineering
data-lake
data-modeling
udacity-nanodegree
data-pipeline
data-warehousing
aws-redshift
Updated Aug 30, 2021 - PLpgSQL
The fs:ReadConfig permission on the resource "*" is required when viewing a specific repository from the UI.