datacleaner

Here are 7 public repositories matching this topic...

If a CSV file contains CRLF or LF characters inside its field values and you build a table on it (for example, a Hive table), some columns will come back null. Hive's line terminator is \n, so any \n or \r that appears inside the data is treated as the end of the record: the row is cut short before the real line terminator, and the remaining columns are filled with null. I tried several options, including Spark and Hive SerDes, but got the best results with Perl. Here I am sharing my Perl script for removing all newline and special characters.

  • Updated Dec 10, 2017
  • Perl
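The repository's Perl script is not shown on this page, so the following is only a minimal Python sketch of the same idea, not the author's implementation. It assumes that fields containing embedded line breaks are enclosed in double quotes, so that the standard `csv` module can tell them apart from real record terminators. The function name `strip_embedded_newlines` and the `replacement` parameter are hypothetical.

```python
import csv
import io
import re


def strip_embedded_newlines(text, replacement=" "):
    """Return CSV text with CR/LF removed from inside quoted fields.

    Assumption: fields that contain line breaks are double-quoted.
    """
    # newline="" keeps \r and \n untouched so the csv module sees them as-is.
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Replace each embedded line break (CRLF, CR, or LF) in every field.
        cleaned = [re.sub(r"\r\n|\r|\n", replacement, field) for field in row]
        writer.writerow(cleaned)
    return out.getvalue()


if __name__ == "__main__":
    sample = 'id,note\r\n1,"line one\r\nline two"\r\n2,ok\r\n'
    print(strip_embedded_newlines(sample), end="")
```

After this pass every record occupies exactly one physical line, which is what a loader using \n as its line terminator expects. Unquoted stray line breaks cannot be recovered this way and would need a rule based on the expected column count.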

CSVParser is a tool that parses CSV files using the univocity and Commons CSV parsers. It cleans newline (\n) characters and special characters that appear inside the data. It also handles various kinds of garbage data, such as an odd number of quotes or delimiters inside quotes. It validates each record against a specified delimiter count and separates the records into a _GoodRecords.CSV file and a _BadRecords.CSV file. It is a data cleaner tool to run before ingestion into a data lake, and it makes sure the data is in the right CSV format to build a table on.

  • Updated Jan 19, 2019
  • Java
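The good/bad split described above can be sketched roughly as follows. This is an illustration in Python using the standard `csv` module, not the tool's Java code and not the univocity or Commons CSV APIs. It assumes one record per line, treats an odd number of double quotes as malformed, and compares the parsed field count with an expected value. The names `split_records` and `expected_fields` are hypothetical.

```python
import csv


def split_records(lines, expected_fields, delimiter=","):
    """Split raw CSV lines into (good, bad) lists of records.

    A record is bad if it has an odd number of double quotes or if its
    parsed field count differs from expected_fields.
    """
    good, bad = [], []
    for line in lines:
        record = line.rstrip("\r\n")
        if not record:
            continue  # ignore blank lines
        if record.count('"') % 2 != 0:
            bad.append(record)  # unbalanced quotes
            continue
        # Parse the single record; delimiters inside quotes do not split fields.
        fields = next(csv.reader([record], delimiter=delimiter))
        if len(fields) == expected_fields:
            good.append(record)
        else:
            bad.append(record)
    return good, bad


if __name__ == "__main__":
    rows = ['1,alice,"NY, USA"\n', '2,bob\n', '3,"carol,LA\n', '4,dave,SF\n']
    good, bad = split_records(rows, expected_fields=3)
    print("good:", good)
    print("bad:", bad)
```

In a real run the two lists would be written to separate files, mirroring the _GoodRecords.CSV and _BadRecords.CSV outputs, so that rejected rows can be inspected before ingestion.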
