
data-profiling

Here are 32 public repositories matching this topic...

amraf1002 commented Oct 29, 2020

We are trying to use Great Expectations (GE) with GCP Dataproc clusters. During cluster creation we install great-expectations==0.12.4, which pulls in ruamel.yaml==0.15.35 as a dependency. After the cluster is created, importing great_expectations fails with this error:

Traceback (most recent call last):
File "", line 1, in
File "/opt/conda/default/lib/python3.6/site-packages/great_expectations/_

The program compares two files at a time and does the following:

1. Gathers metadata on the individual tables (column count, record count, list of columns with data types, etc.).
2. Identifies matching columns between tables based on names as well as data. Machine learning is used to handle syntactic and semantic variations of column names for accurate matching.
3. Finds duplicate columns in a single table, with the option to deduplicate if required.
4. Finds columns with missing data or null values.

  • Updated Feb 17, 2018
  • Python
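
The four steps in this description can be sketched with the standard library alone. This is an illustrative sketch, not the repository's code: the function names (`profile_table`, `match_columns`, `duplicate_columns`) are hypothetical, tables are assumed to be lists of dicts, and name matching uses plain string similarity from `difflib` as a stand-in for the machine-learning matcher the description mentions.

```python
from difflib import SequenceMatcher


def profile_table(rows):
    """Collect basic metadata for a table given as a list of dicts."""
    columns = list(rows[0].keys()) if rows else []
    return {
        "record_count": len(rows),
        "column_count": len(columns),
        "columns": columns,
        # Treat None and the empty string as missing values (an assumption).
        "null_counts": {
            c: sum(1 for r in rows if r.get(c) in (None, "")) for c in columns
        },
    }


def normalize(name):
    """Lowercase a column name and drop non-alphanumeric characters."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_columns(cols_a, cols_b, threshold=0.8):
    """Pair each column in cols_a with its most similar name in cols_b."""
    matches = {}
    for a in cols_a:
        best, best_score = None, 0.0
        for b in cols_b:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score > best_score:
                best, best_score = b, score
        if best is not None and best_score >= threshold:
            matches[a] = best
    return matches


def duplicate_columns(rows):
    """Return pairs of columns whose values are identical in every row."""
    columns = list(rows[0].keys()) if rows else []
    values = {c: tuple(r.get(c) for r in rows) for c in columns}
    return [
        (a, b)
        for i, a in enumerate(columns)
        for b in columns[i + 1:]
        if values[a] == values[b]
    ]
```

Matching on column names alone misses columns that hold the same data under unrelated names, which is presumably why the description also compares the data itself.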

Identified data types for each distinct column value across 1900 data sets. For each column, summarized the semantic types present using fuzzy logic and Levenshtein distance. Identified and drew inferences about the three most frequent 311 complaint types by borough.

  • Updated Apr 15, 2020
  • Python
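
As a rough illustration of how Levenshtein distance can map a noisy column value to a known label, here is a minimal sketch. It is not taken from the repository: `closest_label`, its `max_distance` cutoff, and the sample borough labels in the usage note are assumptions made for the example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def closest_label(value, labels, max_distance=2):
    """Return the label nearest to value, or None if none is close enough."""
    best = min(labels, key=lambda label: levenshtein(value.lower(), label.lower()))
    if levenshtein(value.lower(), best.lower()) <= max_distance:
        return best
    return None
```

With labels such as `["Bronx", "Brooklyn", "Queens"]`, a misspelled value like `"Brookyln"` resolves to `"Brooklyn"`, while an unrelated string falls outside the cutoff and returns `None`. The function expects a non-empty list of labels.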
