Updated Jan 10, 2022 - Go
#data-cleaning
Here are 1,070 public repositories matching this topic...
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
tsv
devops
json
statistics
csv
command-line
json-data
tabular-data
data-reduction
unix-toolkit
statistical-analysis
csv-format
devops-tools
data-regression
data-processing
command-line-tools
data-cleaning
streaming-algorithms
streaming-data
miller
The standard package for machine learning with noisy labels, finding mislabeled data, and uncertainty quantification. Works with most datasets and models.
machine-learning
machine-learning-algorithms
weak-supervision
semi-supervised-learning
unsupervised-learning
data-cleaning
clean-data
learning-with-confident-examples
noisy-data
data-centric
confident-learning
latent-estimation
robust-machine-learning
learning-with-noisy-labels
Updated Jan 10, 2022 - Python
Jupyter notebook and datasets from the pandas Q&A video series
Updated Oct 23, 2020 - Jupyter Notebook
General Assembly's 2015 Data Science course in Washington, DC
python
data-science
machine-learning
natural-language-processing
course
clustering
naive-bayes
linear-regression
scikit-learn
jupyter-notebook
pandas
data-visualization
web-scraping
data-analysis
ensemble-learning
logistic-regression
decision-trees
regular-expressions
data-cleaning
model-evaluation
Updated Apr 18, 2016 - Jupyter Notebook
data-science
machine-learning
spark
bigdata
data-transformation
pyspark
data-extraction
data-analysis
data-wrangling
dask
data-exploration
data-preparation
data-cleaning
data-profiling
data-cleansing
big-data-cleaning
data-cleaner
cudf
dask-cudf
Updated Jan 6, 2022 - Python
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
visualization
nodejs
javascript
linq
json
data
csv
pandas
data-visualization
data-analysis
data-wrangling
data-management
data-manipulation
data-cleaning
data-munging
data-cleansing
data-forge
Updated Nov 17, 2021 - TypeScript
jgirault-qs commented Jul 23, 2021
Describe the bug
pa.errors.SchemaErrors.failure_cases only returns the first 10 failure_cases
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandera (0.6.5).
- (optional) I have confirmed this bug exists on the master branch of pandera.
Note: Please read [this guide](https://matthewrocklin.c
Schema-Inspector is a simple JavaScript object sanitization and validation module.
Updated Oct 15, 2021 - JavaScript
Machine learning on dirty tabular data
Updated Dec 19, 2021 - Python
Open
Write tests
msamogh commented Mar 18, 2019
Write unit test coverage for SafeDataset and SafeDataLoader, along with the functions in utils.py.
visualization
security
data
machine-learning
server
voice
python3
voice-recognition
generation
transcription
voice-control
data-cleaning
voice-assistant
encryption-decryption
voice-recording
voice-activity-detection
wake-word-detection
featurization
voice-computing
Updated Dec 13, 2021 - Python
Easy to use Python library of customized functions for cleaning and analyzing data.
python
data-science
data-visualization
feature-selection
data-analysis
klib
data-preprocessing
data-cleaning
Updated Dec 31, 2021 - Python
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
probabilistic-programming
bayesian-inference
data-cleaning
probabilistic-graphical-models
data-cleansing
Updated Nov 23, 2021 - Julia
An R package for data screening
Updated Dec 18, 2020 - HTML
Exploratory data analysis 📊 in Python 🐍 of a used-car 🚘 database taken from Kaggle
data-science
exploratory-data-analysis
eda
data-visualization
kaggle-competition
data-analytics
data-analysis
data-wrangling
data-cleaning
kaggle-dataset
data-cleansing
data-science-python
data-analysis-python
kaggle-used-cars-dataset
Updated Jan 2, 2019 - Jupyter Notebook
Data Science Feature Engineering and Selection Tutorials
python
data-science
machine-learning
tutorial
jupyter
notebook
scikit-learn
exploratory-data-analysis
tutorials
pandas
feature-selection
xgboost
feature-engineering
features
data-cleaning
pandas-profiling
sweetviz
pyrasgo
Updated Nov 2, 2021 - Jupyter Notebook
python
gui
gpu
datasets
dask
optimus
data-preparation
data-cleaning
data-profiling
bumblebee
prepare-data
cudf
dask-cudf
Updated Jan 8, 2022 - Vue
Cluster and merge similar char values: an R implementation of Open Refine clustering algorithms
cran
r
openrefine
clustering
fuzzy-matching
rstats
ngram
approximate-string-matching
data-clustering
data-cleaning
Updated Nov 7, 2019 - C++
A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.
python
docker
airflow
sql
database
s3
s3-bucket
data-visualization
python3
data-warehouse
metabase
data-engineering
data-analytics
data-analysis
redshift
data-processing
data-cleaning
data-warehousing
data-orchestration
Updated Apr 18, 2020 - Python
machine-learning
deep-learning
data-transformation
data-visualization
machine-learning-library
machine-learning-api
datasets
data-cleaning
ludwig
data-augmentation
automl
tpot
machine-learning-models
model-compression
model-deployment
autokeras
voice-computing
data-cleaning-pipeline
autopytorch
Updated Dec 13, 2021 - Python
A Machine Learning System for Data Enrichment.
Updated Sep 15, 2018 - Python
Cleans Reddit Text Data 📜 🧹
Updated Apr 14, 2020 - Python
Kiarii commented Jun 16, 2017
Context
At the moment, the data type and the column header are both links. Although they are color-coded differently, this causes some confusion.
Idea
- Move the Data type switching menu into the Column Edit menu:
- which would mean there
Grateful Data isn't programming code, but an online tutorial about data acquisition, cleaning and enriching, using publicly accessible data on the band the Grateful Dead as examples. Read the Wiki to find out how to use the sample data.
data-acquisition
ace
digital-humanities
data-cleaning
grateful-dead
open-refine
grateful-data
ace-repertory-search
Updated Sep 3, 2019
Analyzes drug descriptions, conditions, and reviews, then recommends drugs for each patient's health condition using deep learning models.
machine-learning
data-mining
deep-learning
data-visualization
feature-selection
recommendation-system
data-analysis
boosting-algorithms
feature-engineering
data-cleaning
hybrid-model
bagging
Updated Apr 29, 2020 - Jupyter Notebook
A simple command line interface to the datamade/dedupe library.
Updated Nov 15, 2021 - Jupyter Notebook

A note from Uwe Ligges of CRAN:
I don't know about DOIs. Anyone have a thought on this? Is it only appropriate for packages associated with a research paper?