The data-cleansing tag has no wiki summary.
1
vote
3answers
36 views
multi-column factorize in pandas
The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
I'd like to accomplish the equivalent of ...
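As a minimal sketch of the single-series behavior described above (the sample values are made up for illustration, and this does not attempt the multi-column variant the question goes on to ask about):

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
s = pd.Series(["a", "b", "a", "c", "b"])

# factorize returns an array of integer codes plus the unique values,
# in order of first appearance.
codes, uniques = pd.factorize(s)

print(list(codes))    # 0-based index of each entry's unique value
print(list(uniques))  # the unique values themselves
```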
1
vote
1answer
33 views
Fill in missing pandas data with previous non-missing value, grouped by key
I am dealing with pandas DataFrames like this:
   id    x
0   1   10
1   1   20
2   2  100
3   2  200
4   1  NaN
5   2  NaN
6   1  300
7   1  NaN
I would like to replace each NaN 'x' with the ...
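Going by the question title (fill with the previous non-missing value, grouped by key), one possible approach is a grouped forward fill. This is only a sketch that assumes "previous" means the previous row with the same `id`; the `x_filled` column name is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The DataFrame from the excerpt above.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 1, 2, 1, 1],
    "x": [10, 20, 100, 200, np.nan, np.nan, 300, np.nan],
})

# Forward-fill x within each id group; the original row order is kept.
# "x_filled" is a hypothetical column name, not from the question.
df["x_filled"] = df.groupby("id")["x"].ffill()

print(df)
```

A leading NaN within a group would stay NaN, since there is no earlier value to carry forward.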
1
vote
1answer
21 views
notepad++: keep regex (multi occurrence per line) and line structure, remove other characters
I have a 130k line text file with patent information and I just want to keep the dates (regex "[0-9]{4}-[0-9]{2}-[0-9]{2} ") for subsequent work in Excel. For this purpose I need to keep the line ...
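The question asks about Notepad++, but the same idea can be sketched outside the editor. The Python snippet below is an alternative, not the Notepad++ solution: it keeps every date match on a line and drops everything else, so each input line still maps to one output line. It assumes the trailing space in the quoted regex is not needed, and the sample lines are invented.

```python
import re

# Same date pattern as in the question, without the trailing space (an assumption).
DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def keep_dates(line):
    """Return only the dates found on this line, separated by single spaces."""
    return " ".join(DATE_RE.findall(line))

# Hypothetical sample lines, not taken from the question's file.
lines = [
    "US123 filed 2001-05-17 granted 2003-11-02 ",
    "no dates here",
    "EP9 2010-01-31 ",
]

# One output line per input line, so the line structure is preserved.
cleaned = [keep_dates(line) for line in lines]
print("\n".join(cleaned))
```

Lines without any date become empty rather than disappearing, which keeps row numbers aligned for later work in Excel.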
3
votes
3answers
80 views
Performing Operations on a Subset Using Data Table
I have a survey data set in wide form. For a particular question, a set of variables was created in the raw data to represent the fact that the survey question was asked on a particular ...
0
votes
2answers
36 views
Completely stripping certain HTML Tags in Django forms
I have a ModelForm that posts news items to a database, and it uses a javascript textarea to allow the authorized poster to insert certain pieces of HTML to style text, like bold and italics. However, ...
0
votes
2answers
68 views
How to remove hyperlinks, email ids, etc from a text document using regex?
I have some text documents which contain:
Different types of email addresses: I mean public domains such as gmail, yahoo,
etc., and private emails as well, such as [email protected]...
Different ...
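A rough sketch of regex-based removal, assuming Python and deliberately simplified patterns. Both regexes are illustrative assumptions and will miss unusual URLs or addresses; the function name is hypothetical.

```python
import re

# Simplified patterns (assumptions, not a complete URL or email grammar).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def strip_links_and_emails(text):
    """Remove URL-like and email-like tokens, then collapse leftover whitespace."""
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample text.
sample = "Mail jane.doe@example.com or visit https://example.org/page now"
result = strip_links_and_emails(sample)
print(result)
```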
0
votes
0answers
36 views
'cleaning' data for automated SQL insertion via php
I'm inserting data in an SQL table, via php, that is being pulled from a third party data source. Occasionally this third party source will contain some character like a single quote that will cause ...
2
votes
2answers
143 views
How can I subset rows in a data frame in R based on a vector of values?
I have two data sets that are supposed to be the same size but aren't. I need to trim the values from A that are not in B and vice versa in order to eliminate noise from a graph that's going into a ...
0
votes
0answers
37 views
Data-cleansing trigger help for MySQL, please
I am extremely new to the MySQL environment and databases in general. I am currently working on a project for work and am having a great deal of difficulty trying to create a MySQL trigger. I'm sure ...
0
votes
1answer
51 views
Data cleaning of dollar values and percentage in R
I've been searching for a number of packages in R to help me in converting dollar values to nice numerical values. I don't seem to be able to find one (in plyr package for example). The basic thing ...
0
votes
1answer
171 views
Looking for dictionary words in text file using dictionary in python
I read "how to check dictionary words".
And I got the idea to check my text file using dictionaries. I have read the pyenchant instructions, and I thought that if I use get_tokenizer to give me ...
1
vote
1answer
69 views
Google refine cross-reference between row and column
I'm not sure if this can be achieved in Google Refine at all. But basically, I have data like this.
The first table is the table of all the users. The second table shows all the friends. However, ...
0
votes
3answers
388 views
Word 2007 - Macro to clean text
I'm new to VBA and am trying to write a macro that will format some text for me.
I can't seem to figure it out.
This is what the original data looks like:
This is sentence one of paragraph one. ...
0
votes
1answer
102 views
Code for “finding and deleting” complete strings but not substrings in R?
I am trying to find a way of quickly cleaning large datasets based on the occurrence of certain strings. I have a data.frame that looks like this:
created_at actor_attributes_email type
3/11/12 ...
2
votes
1answer
113 views
fingerprinting entire iPhone music library with echoprint
I'm wondering how intensive it would be to fingerprint an iPhone 4+'s entire music library with echoprint. How long should I expect it to take to analyze 2-3k songs? Is this even reasonable?