lakeFS - Data version control for your data lake | Git for data
Updated Jun 19, 2023 - Go
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
A few projects related to Data Engineering, including Data Modeling, cloud infrastructure setup, Data Warehousing, and Data Lake development.
Personal Data Engineering Projects
Generic Data Ingestion & Dispersal Library for Hadoop
Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required
Enterprise-grade, production-hardened, serverless data lake on AWS
Use SQL to build ELT pipelines on a data lakehouse.
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Analytics APIs for Snowflake, BigQuery, DuckDB, PostgreSQL
Apache Spark 3 - Structured Streaming Course Material
data load tool (dlt) is an open source Python library that makes data loading easy
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Apache Spark Course Material