Pandas is a Python data analysis library.
2
votes
0answers
17 views
Counting matches won by teams
This script counts how many matches were won by the team with the time-of-possession advantage in the first half, the second half, and the full game.
The data is stored in a pandas data frame:
...
-1
votes
0answers
45 views
Memory efficient alternative passing mostly static variables w/ 1 dynamic variable to an external function (COM) instead of a for loop in Python
Massive data problem, 7/8 variables are repeated 1000 times in a for loop, sending 8 input vectors * 50,000 values per loop for each function call. Only 1 input ...
1
vote
1answer
44 views
SKlearn automate data pre treatment
I want to make a simple wrapper for sklearn models. The idea is that the wrapper automatically takes care of factors (columns of type "object") replacing them with ...
4
votes
1answer
38 views
3
votes
0answers
44 views
Python Script to query OSRM (drive-times)
The below script takes two arguments
Path to the OSRM Map
Path to a .csv containing the columns ...
3
votes
1answer
61 views
Backing up files to remote servers
I have code that backs up files to remote servers.
There are 2 pandas data frames:
storedfiles: that points to the files already stored in previous runs
...
2
votes
1answer
47 views
Analysis of call center employee performance
This code basically calculates a few metrics based on numerators and denominators and then it bins the data into quintiles. This is a common operation that I perform for various data sets. My issue is ...
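The operation described here (compute a ratio metric, then bin it into quintiles) can be sketched with `pd.qcut`. This is only an illustration under assumptions: the column names `resolved`, `calls`, and `resolution_rate` and all of the data are hypothetical, not taken from the question.

```python
import pandas as pd

# Hypothetical per-employee numerators and denominators (illustrative data only).
df = pd.DataFrame({
    "employee": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
    "resolved": [8, 15, 22, 30, 41, 47, 56, 66, 72, 90],
    "calls": [100] * 10,
})

# Metric = numerator / denominator.
df["resolution_rate"] = df["resolved"] / df["calls"]

# qcut cuts on sample quantiles, so each of the 5 bins gets roughly
# the same number of rows; labels 1..5 name the quintile.
df["quintile"] = pd.qcut(df["resolution_rate"], q=5, labels=[1, 2, 3, 4, 5])

print(df[["employee", "resolution_rate", "quintile"]])
```

Wrapping the two steps in a function that takes the numerator and denominator column names would be one way to reuse this across data sets.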
3
votes
1answer
69 views
PANDAS spatial clustering
I'm writing a spatial clustering algorithm using pandas and scipy's kdtree. I profiled the code and the .loc part takes the most time for bigger datasets. I wonder ...
11
votes
3answers
542 views
Analyze frequency and content of political fundraising E-mails
Since I'm a big politics nerd, I wanted to write a little script that would analyze the frequency and content of political fundraising emails. I signed up for the e-mails of 6 campaigns, donated a ...
4
votes
1answer
62 views
Data analytics on static file of 50,000+ tweets
I'm trying to optimize the main loop portion of this code, as well as learn any "best practices" insights I can for all of the code. This script currently reads in one large file full of tweets (50MB ...
4
votes
1answer
42 views
Appending unique data to a postgres table
I receive daily files (filename_%m_%d_%Y.csv) from a client and I read those in pandas, process them, and store them in Postgres. Sometimes there are delays and we do not get the data for a few days. ...
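As a rough sketch of one piece of this workflow, the dates embedded in the file names can be parsed and compared against a full calendar range to spot the days that never arrived. The file names below are made up for illustration; only the `filename_%m_%d_%Y.csv` pattern comes from the question, and the Postgres side is not covered.

```python
import pandas as pd

# Hypothetical names of files already received (pattern: filename_%m_%d_%Y.csv).
received = [
    "filename_03_01_2015.csv",
    "filename_03_02_2015.csv",
    "filename_03_05_2015.csv",
]

# Strip the fixed prefix and suffix, then parse the remaining date part.
dates = pd.to_datetime(
    [name[len("filename_"):-len(".csv")] for name in received],
    format="%m_%d_%Y",
)

# Every calendar day between the first and last file, minus the days we have.
expected = pd.date_range(dates.min(), dates.max(), freq="D")
missing = expected.difference(dates)

print(missing)
```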
4
votes
3answers
83 views
Load recurring (but not strictly identical) sets of Key, Values into a DataFrame from text files
I am reading text files that contain data from observations. The format is not Fixed Width or Delimited, so I built a generator ...
4
votes
1answer
68 views
Using Pandas to parse adwords export
I did this exercise yesterday mostly as practice, but it has some utility as well in day to day. I was basically attempting to take a string that looked like the following:
...
5
votes
1answer
54 views
Slicing time spans into calendar months
I have apparently correct code that still runs for weeks on my data (tens of millions of rows). I show the entire code for reference (and maybe other gains to be made), but the key operation is in the ...
1
vote
1answer
34 views
Multi-linear regression with trivial model refinement algorithms
I wrote this code for an assignment. It was originally meant to be written in Maple, but I got so frustrated with some of Maple's idiosyncrasies that I decided to play around with Pandas instead. This ...
2
votes
2answers
76 views
Generating a PANDAS DataFrame of simulated coin tosses
It's taking my machine quite a long time to execute 1 billion (1st loop x 10, 2nd loop x 1000, 3rd loop x 100,000) instructions. Suggestions for performance enhancements? Sources of potential concern: ...
4
votes
1answer
60 views
Random Forest Code Optimization
I am new to Python. I have built a model with randomforest in python. But I think my code is not optimized. Please look into my code and suggest if I have deviated from best practices.
Overview about ...
6
votes
2answers
139 views
Project Euler #19: Counting Sundays in the 20th century using Pandas
Project Euler #19 asks:
How many Sundays fell on the first of the month during the twentieth century (1 Jan 1901 to 31 Dec 2000)?
I'm hoping I wasn't too off course from the spirit of the ...
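For reference, one compact way to express this count in pandas is sketched below. It is not the code from the question; it relies on the `"MS"` (month start) frequency alias of `pd.date_range` and on pandas numbering weekdays from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# "MS" (month start) yields the first day of every month in the range.
first_days = pd.date_range("1901-01-01", "2000-12-31", freq="MS")

# pandas numbers weekdays Monday=0 ... Sunday=6.
sunday_count = int((first_days.dayofweek == 6).sum())

print(sunday_count)  # 171
```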
1
vote
1answer
88 views
Operating multiple columns of one pandas DataFrame using data from another
I have a DataFrame of data from a survey that was repeated over several years, asking people about their income and how much money they had in savings. For simplicity, let's pretend it looks like ...
2
votes
1answer
77 views
Creating a time-course dependent, correlation-based directed graph with Networkx
I have a correlation matrix containing 4 time points, each with multiple samples. Each sample is identified with a time point with its name. What I am trying to accomplish here is to create a directed ...
4
votes
1answer
36 views
Reading groups of files and concatenating them
I have made some adjustments to some code that you can see on this thread:
Read daily files and concatenate them
I would like to make some further refinements and make sure I am on the right track ...
3
votes
2answers
129 views
Read daily files and concatenate them
Edit - here is my modified code: http://jsfiddle.net/#&togetherjs=GzytydCsRh
Can someone take a look and give me some feedback? It seems a bit long still but that is the first time I used ...
8
votes
1answer
168 views
A big “Game of Life”
Our quest: Create a big simulation for Conway's Game of Life, and record the entire simulation history.
Current Approach: Cython is used for an iterate method. The ...
5
votes
1answer
735 views
A custom Pandas dataframe to_string method
Oftentimes I find myself converting pandas.DataFrame objects to lists of formatted row strings, so I can print the rows into, e.g. a ...
3
votes
1answer
62 views
Reading and processing a file using Pandas
I am trying to read a file using pandas and then process it. For opening the file I use the following function:
...
7
votes
1answer
524 views
Tkinter GUI for making very simple edits to pandas DataFrames
It is part of a separate application that allows users to interact very loosely with different databases and check for possible errors and make corrections.
...
4
votes
0answers
48 views
Outputting scatter plots
I have written a python function that outputs scatter plots using Matplotlib after processing the data a little. It works but it's painfully slow. I was wondering if anybody had any suggestions as to ...
7
votes
4answers
1k views
Chi Square Independence Test for Two Pandas DF columns
I want to calculate the scipy.stats.chi2_contingency() for two columns of a pandas DataFrame. The data is categorical, like this:
...
3
votes
1answer
254 views
Imputing values with non-negative matrix factorization
X is a DataFrame w/ about 90% missing values and around 10% actual values. My goal is to use nmf in a successive imputation loop to predict the actual values I have ...
7
votes
1answer
66 views
Querying houses similar to a given house
I was given this task as an interview coding challenge and was wondering if the code is well structured and follows Python guidelines. I chose to sort the houses based on a similarity metric and then ...
4
votes
3answers
242 views
Rating tennis players in a database, taking days to run
I have this project in data analysis for creating a ranking of tennis players. Currently, it takes more than 6 days to run on my computer.
Can you review the code and see where the problem is?
...
3
votes
2answers
746 views
Extracting contents of dictionary contained in Pandas dataframe to make new dataframe columns
I created a Pandas dataframe from a MongoDB query.
c = db.runs.find().limit(limit)
df = pd.DataFrame(list(c))
Right now one column of the dataframe corresponds ...
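A minimal sketch of the general technique named in the title, expanding a column of dictionaries into separate columns, might look like this. The records stand in for the MongoDB query result, and the field names `run`, `params`, `alpha`, and `beta` are hypothetical.

```python
import pandas as pd

# Stand-in for documents returned by a MongoDB query; "params" is a
# hypothetical field holding a nested dictionary.
records = [
    {"run": 1, "params": {"alpha": 0.1, "beta": 2}},
    {"run": 2, "params": {"alpha": 0.5, "beta": 3}},
]
df = pd.DataFrame(records)

# Build one column per dictionary key, aligned on the original index.
expanded = pd.DataFrame(df["params"].tolist(), index=df.index)

# Replace the dictionary column with the expanded columns.
result = pd.concat([df.drop(columns="params"), expanded], axis=1)

print(result)
```

Building the new frame from a list of dicts in one call tends to be faster than `df["params"].apply(pd.Series)`, which constructs a Series per row.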
3
votes
1answer
106 views
Data cleansing and formatting script
This is a script that creates a base dataframe from a SQLite database, adds data to it (also from SQLite), cleanses it, and formats it all into an Excel file. I feel like it is incredibly verbose and my ...
3
votes
1answer
138 views
Fetching, processing, and storing Mixpanel analytics data to SQLite
I'm a self-taught Python programmer and I never really learned the fundamentals of programming, so I want to see how to improve upon this script and make it adhere to best practices.
The script has ...
1
vote
1answer
727 views
Split excel file with multiple sheets, manipulate the data and create final out file
I have an excel file with 20+ separate sheets containing tables of data. My script iterates through each sheet, manipulates the data into the format I want it and then saves it to a final output file. ...
2
votes
1answer
37 views
Excel Laboratory Data Entry from Python 2.7
I've written a script to automate the entry of laboratory instrument data into an Excel spreadsheet using pandas and win32com.
I've got the script functioning correctly, but it is painfully slow. In ...
5
votes
1answer
192 views
Reading an Excel file and comparing the amino acid sequence of each data pair
Since I am fairly new to Python I was wondering whether anyone can help me by making the code more efficient. I know the output stinks; I will be using Pandas to make this a little nicer.
...
5
votes
1answer
1k views
Efficient Pandas to MySQL “UPDATE… WHERE”
I have a pandas DataFrame and a (MySQL) database with the same columns. The database is not managed by me.
I want to update the values in the database in an "UPDATE... WHERE" style, updating only ...
4
votes
1answer
38 views
Identifying surface events happening at specific time intervals
Here is some code I wrote to surface Mint.com transactions that occur at monthly intervals, in order to identify subscriptions I may be paying for without realizing it.
I'd like to have some friends ...
0
votes
1answer
305 views
Simple k-means implementation using Python3 and Pandas
Is there anything I can improve? The distance function is Pearson correlation.
...
2
votes
1answer
111 views
SQL GROUPING SETS in Python using Pandas
The code below is intended to provide SQL's GROUPING SETS functionality in Python with the aid of Pandas.
Background on SQL GROUPING SETS
There are at least two advantages to doing this in Python:
...
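The idea behind GROUPING SETS can be sketched as one `groupby` per set, with the partial results stacked by `pd.concat`. This is an illustrative sketch, not the code from the question; the helper name `grouping_sets` and the sample columns `region`, `product`, and `amount` are assumptions.

```python
import pandas as pd


def grouping_sets(df, sets, value, aggfunc="sum"):
    """Aggregate `value` once per grouping set and stack the results.

    Columns that are not part of a given set come out as NaN, mirroring
    the NULLs SQL emits for them. An empty set gives the grand total.
    """
    # Collect grouping columns in first-seen order for a stable layout.
    all_keys = []
    for keys in sets:
        for key in keys:
            if key not in all_keys:
                all_keys.append(key)

    frames = []
    for keys in sets:
        if keys:
            part = df.groupby(list(keys), as_index=False)[value].agg(aggfunc)
        else:
            part = pd.DataFrame({value: [df[value].agg(aggfunc)]})
        frames.append(part)

    out = pd.concat(frames, ignore_index=True)
    return out.reindex(columns=all_keys + [value])


# Hypothetical sales data for illustration.
sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "product": ["x", "y", "x", "y"],
    "amount": [1, 2, 3, 4],
})

result = grouping_sets(sales, [("region",), ("product",), ()], "amount")
print(result)
```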
6
votes
2answers
89 views
Speed up script that calculates distribution of every character from user input
I have a data set with close to 6 million rows of user input. Specifically, users were supposed to type in their email addresses, but because there was no pattern validation put in place we have a ...
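One common way to speed up this kind of character count is to avoid a Python-level loop over rows: join all strings once and let `collections.Counter` tally the characters. The sketch below uses a few made-up inputs in place of the real data set.

```python
from collections import Counter

import pandas as pd

# Hypothetical user input, including a missing value and malformed entries.
emails = pd.Series(["a@b.com", "bob@@x.org", None, "c d@e.net"])

# Drop missing values, concatenate everything, and count in one pass.
counts = Counter("".join(emails.dropna()))

# Turn the tally into a sorted distribution for inspection.
dist = pd.Series(counts).sort_values(ascending=False)

print(dist)
```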
4
votes
0answers
232 views
Parsing URLs in Pandas DataFrame
My client needs their Google AdWords destination URL query parsed and the values spell checked to eliminate any typos ("use" instead of "us", etc).
I'm pulling the data using the AdWords API and ...
3
votes
2answers
433 views
Matplotlib-venn and keeping lists of the entries
Having come upon the wonderful little matplotlib-venn module and used it for a bit, I'm wondering if there's a nicer way of doing things than what I have done so far. I know that you can use the ...
1
vote
0answers
250 views
Speed up Pandas DataFrame expansion to include time-lagged information about events
Using pandas and Python 3, information about a simple timeseries data set is being processed. Within the span of .5 seconds, 3 names are being said. We record the onset of each utterance, the length ...
6
votes
2answers
169 views
Monte Carlo estimation of the Hypergeometric Function
I am trying to implement the algorithm described in the paper Statistical Test for the Comparison of Samples from Mutational Spectra (Adams & Skopek, 1986) DOI: 10.1016/0022-2836(87)90669-3:
$$p ...
1
vote
0answers
241 views
Speeding up filtering function in Pandas
I have a CSV file with 400 000 rows and the following headers:
...
1
vote
0answers
316 views
Speed up projection of a bipartite network for a big file using NetworkX and Pandas
I have a pretty big file (3 million lines) with each line being a person-to-event relationship. Ultimately, I want to project this bipartite network onto a single-mode, weighted network, and write it ...
10
votes
1answer
981 views
Simplifying Python Pandas code for selecting co-occurrences in a window of time
I am a beginner at programming. I was able to build the thing below, which achieves what I want with a small dataset. With larger datasets, my RAM gets swamped bringing the computer to a halt (2014 ...
1
vote
1answer
1k views
Parse Bloomberg Excel/CSV with Pandas DataFrame
I retrieved Bloomberg data using the Excel API. In the typical fashion, the first row contains tickers in every fourth column, and the second row has the labels Date, PX_LAST, [Empty Column], Date, ...