Pandas is a Python data analysis library.


1
vote
2answers
54 views

Reading from a .txt file to a pandas dataframe

Having a text file './inputs/dist.txt' as: ...
3
votes
1answer
44 views

Applying different equations to a Pandas DataFrame

I wrote a task using pandas and I'm wondering if the code can be optimized. The code pretty much goes as: All the dataframes are 919 * 919. socio is a dataframe ...
5
votes
2answers
128 views

Split latitude/longitude by degree to make file names and folder directory names

This code works and does exactly as I want, but I am wondering if/how it could be done faster/more efficiently? This runs quickly on a demo file, but bogs down when I introduce much larger files (3-...
1
vote
2answers
346 views

Finding the states with the three most populous counties

I just started to use Python and Pandas. My current solution to a problem looks ugly and inefficient. I would like to know how to improve it. The data file is Census 2010, which can be viewed here. Question: ...
2
votes
1answer
39 views

Simplify DataFrame operations in Python

I am working on learning how to do frequency analysis of Server Fault question tags to see if there is any useful data that I can glean from them. I'm storing the raw data in Bitbucket for global ...
4
votes
2answers
55 views

Select the n most frequent items from a pandas groupby dataframe

I'm working on trying to get the n most frequent items from a pandas dataframe similar to ...
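
The excerpt above is cut off before it shows the data, so the following is only a minimal sketch of one common way to get the n most frequent items per group. The function name `top_n_per_group` and the column names passed to it are hypothetical and do not come from the question.

```python
import pandas as pd


def top_n_per_group(df, group_col, item_col, n):
    """Return the n most frequent values of item_col within each group_col.

    Hypothetical helper: the column names are supplied by the caller
    because the original question's dataframe is not shown.
    """
    # Count how often each (group, item) pair occurs.
    counts = df.groupby([group_col, item_col]).size().reset_index(name="count")
    # Order each group's items from most to least frequent.
    counts = counts.sort_values([group_col, "count"], ascending=[True, False])
    # Keep the first n rows of every group.
    return counts.groupby(group_col).head(n).reset_index(drop=True)
```

Ties are kept in whatever order the sort leaves them, so a different tie-breaking rule would need an extra sort key.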
1
vote
1answer
32 views

Calls many stored dataframes, change them and combine them

I load many dataframes (using pickle), then I extract the interesting values of each one and finally combine them. Is there a better way to do it? For example, with a for loop to call them automatically ...
3
votes
1answer
68 views

Compressing time series by removing repeated samples

I work on a project with time series data. So there are samples (\$y\$), and each sample has a timestamp (\$x\$). The data will be visualized, but often there are time series which contain samples ...
4
votes
3answers
93 views

Plotting from a Pandas dataframe

I want to improve my code. Is it possible to get the plot without repeating the same instructions multiple lines? The data comes from a Pandas' dataframe, but I am only plotting the last column (...
0
votes
0answers
37 views

Cross-join time-series using asof criteria while limiting number of rows

I have a unique requirement around time-series analysis and have presented my requirements along with a simple working solution. I have also worked out sample example to help understand my ...
0
votes
0answers
15 views

Sending the accuracies of 4 cross-validations to a pandas dataframe

I want to improve my code. It performs 4 cross-validations: 10x20, 10x10, 10x5, 10x2. Then it stores the values in Pandas where I will get the average of each column and multiply them by 100. You ...
3
votes
0answers
55 views

Creating multiple observations from single observation depending on conditions

I have written a function in Python with three loops, which is time-consuming. Is it possible to do the same operation in less time some other way? Here are my code and sample data you can run at ...
5
votes
3answers
126 views

Matching values from html table for updating values in pandas dataframe

This is more of an exercise for me to get used to Pandas and its dataframes. For those who haven't heard of it: Pandas is a Python package providing fast, flexible, and expressive data structures ...
3
votes
0answers
184 views

Pandas calculation speed of stock beta on many dataframes

I have many (4000+) CSVs of stock data (Date, Open, High, Low, Close) which I import into individual Pandas dataframes to perform analysis. I am new to Python and want to calculate a rolling 12-month ...
1
vote
1answer
218 views

Using pandas and sklearn for forecasting stock market return

I have been using R for stock analysis and machine learning purposes, but read somewhere that Python is a lot faster than R, so I am trying to learn Python for that. I am using Yhat's rodeo IDE (Python ...
6
votes
1answer
151 views

Similarity research : K-Nearest Neighbour(KNN) using a linear regression to determine the weights

I have a set of houses with categorical and numerical data. Later I will have a new house and my goal will be to find the 20 closest houses. The code is working fine, and the results are not so bad but ...
2
votes
0answers
52 views

Using pandas.io.pytables.TableIterator for several iterations

I'm writing a method that takes a pandas iterator and then computes something based on that. The problem is that some of these computations require iterating twice over the data. For example the ...
4
votes
1answer
79 views

Extracting time duration in the session from 30 million rows

I am looking to make my code faster. I am working on the yoochoose recsys 2015 dataset [recsys2015] and trying to perform some transformations. It has 30 million plus rows of data. The goal of ...
3
votes
0answers
59 views

KNN pipeline w/ cross_validation_scores

Using the wine quality dataset, I'm attempting to perform a simple KNN classification (w/ a scaler, and the classifier in a pipeline). It works, but I've never used ...
3
votes
1answer
102 views

Calculate working minutes between two timestamps

I've created a function to calculate the working minutes between two timestamps. I class working minutes as those between 9am - 5pm and not on weekends or national holidays. The holidays are those ...
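
As a rough illustration of the rule described above (9am to 5pm, skipping weekends and holidays), here is a minimal sketch. It is not the asker's function: the name `working_minutes` and the `holidays` parameter are assumptions, and the holiday list is left to the caller because the excerpt is cut off before it says where the holidays come from.

```python
import pandas as pd


def working_minutes(start, end, holidays=()):
    """Count whole minutes between start and end that fall on a
    non-holiday weekday between 09:00 and 17:00.

    `holidays` is a hypothetical iterable of dates supplied by the caller.
    """
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    holiday_dates = {pd.Timestamp(h).date() for h in holidays}
    total = pd.Timedelta(0)

    # Walk over every calendar day touched by the interval.
    for day in pd.date_range(start.normalize(), end.normalize(), freq="D"):
        # dayofweek: Monday is 0, so 5 and 6 are Saturday and Sunday.
        if day.dayofweek >= 5 or day.date() in holiday_dates:
            continue
        window_open = day + pd.Timedelta(hours=9)
        window_close = day + pd.Timedelta(hours=17)
        # Overlap between [start, end] and that day's working window.
        overlap = min(end, window_close) - max(start, window_open)
        if overlap > pd.Timedelta(0):
            total += overlap

    return int(total.total_seconds() // 60)
```

The loop runs once per calendar day, which is simple to check by hand but may be slow for very long intervals or for millions of rows.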
3
votes
1answer
34 views

Filling in gym membership prices by joining PANDAS dataframes

I have two dataframes, df which contains a list of members in addition to the type of contract that they purchased on a given date, ...
3
votes
2answers
72 views

Optimize a simple and quick python script for transposing a .csv file

I need to transpose the following file output1.csv, which is a result from a quantum chemistry calculation, into a single column efficiently: ...
1
vote
0answers
32 views

Grouping logs and providing counts

The code is supposed to group start and end time logs and provide log counts and unique ID counts. The grouping will be variable 1 hour, 6 hours, 12 hours, 24 hours, etc. What is the better way to ...
1
vote
0answers
27 views

ChiSquare test Code in Python

I was trying to write code from scratch for a Chi-square test. This is the code that I had written in Python using pandas. I had a doubt whether the code can produce the desired output or can be written ...
3
votes
0answers
376 views

Rolling OLS algorithm in a dataframe

I want to be able to find a solution to run the following code in a much faster fashion (ideally something like dataframe.apply(func) which has the fastest speed, ...
1
vote
2answers
74 views

Permute and count between nested dictionaries

Goal: Permute inner values between two dictionaries with matching outer key, save permutations in counter, move to next outer key, update counter with new permutations. Problem: Many lookups on ...
3
votes
0answers
55 views

Analysis of TV station preferences for various demographic groups

I have a notebook here which details my code and may be more legible. I worked on a project that flattens a table that initially looked like this (this is table1): ...
1
vote
1answer
752 views

Python Pandas Apply with a Lambda Function

I have a table in pandas that has two columns, QuarterHourDimID and StartDateDimID ; these columns give me an ID for each date / ...
2
votes
1answer
65 views

Counting matches won by teams

This script counts how many matches were won by the team with a first half, second half, and full game time of possession differential advantage. The data is stored in a pandas data frame: ...
0
votes
0answers
71 views

Memory efficient alternative passing mostly static variables w/ 1 dynamic variable to an external function (COM) instead of a for loop in Python

Massive data problem, 7/8 variables are repeated 1000 times in a for loop, sending 8 input vectors * 50,000 values per loop for each function call. Only 1 input ...
1
vote
1answer
64 views

SKlearn automate data pre treatment

I want to make a simple wrapper for sklearn models. The idea is that the wrapper automatically takes care of factors (columns of type "object") replacing them with ...
4
votes
1answer
74 views

Find if items in a list of list matches to another list

We have two lists: ...
3
votes
0answers
161 views

Python Script to query OSRM (drive-times)

The script below takes two arguments: the path to the OSRM map, and the path to a .csv containing the columns ...
3
votes
1answer
71 views

Backing up files to remote servers

I have code that backs up files to remote servers. There are 2 pandas data frames: storedfiles, which points to the files already stored in previous runs ...
3
votes
1answer
111 views

Analysis of call center employee performance

This code basically calculates a few metrics based on numerators and denominators and then it bins the data into quintiles. This is a common operation that I perform for various data sets. My issue is ...
3
votes
1answer
260 views

PANDAS spatial clustering

I'm working on a spatial clustering algorithm using pandas and scipy's kdtree. I profiled the code and the .loc part takes the most time for bigger datasets. I wonder ...
13
votes
3answers
636 views

Analyze frequency and content of political fundraising E-mails

Since I'm a big politics nerd, I wanted to write a little script that would analyze the frequency and content of political fundraising emails. I signed up for the e-mails of 6 campaigns, donated a ...
4
votes
1answer
102 views

Data analytics on static file of 50,000+ tweets

I'm trying to optimize the main loop portion of this code, as well as learn any "best practices" insights I can for all of the code. This script currently reads in one large file full of tweets (50MB ...
4
votes
1answer
54 views

Appending unique data to a postgres table

I receive daily files (filename_%m_%d_%Y.csv) from a client and I read those in pandas, process them, and store them in Postgres. Sometimes there are delays and we do not get the data for a few days. ...
4
votes
3answers
99 views

Load recurring (but not strictly identical) sets of Key, Values into a DataFrame from text files

I am reading text files that contain data from observations. The format is not Fixed Width or Delimited, so I built a generator ...
4
votes
1answer
148 views

Using Pandas to parse adwords export

I did this exercise yesterday mostly as practice, but it has some utility as well in day to day. I was basically attempting to take a string that looked like the following: ...
5
votes
1answer
71 views

Slicing time spans into calendar months

I have apparently correct code that still runs for weeks on my data (tens of millions of rows). I show the entire code for reference (and maybe other gains to be made), but the key operation is in the ...
1
vote
1answer
42 views

Multi-linear regression with trivial model refinement algorithms

I wrote this code for an assignment. It was originally meant to be written in Maple, but I got so frustrated with some of Maple's idiosyncrasies that I decided to play around with Pandas instead. This ...
2
votes
2answers
132 views

Generating a PANDAS DataFrame of simulated coin tosses

It's taking my machine quite a long time to execute 1 billion (1st loop x 10, 2nd loop x 1000, 3rd loop x 100,000) instructions. Suggestions for performance enhancements? Sources of potential concern: ...
4
votes
1answer
102 views

Random Forest Code Optimization

I am new to Python. I have built a model with randomforest in python. But I think my code is not optimized. Please look into my code and suggest if I have deviated from best practices. Overview about ...
7
votes
2answers
293 views

Project Euler #19: Counting Sundays in the 20th century using Pandas

Project Euler #19 asks: How many Sundays fell on the first of the month during the twentieth century (1 Jan 1901 to 31 Dec 2000)? I'm hoping I wasn't too off course from the spirit of the exercise ...
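
For reference, the problem as stated above can be sketched in a few lines of pandas. This is one possible approach and not necessarily the one taken in the question; the function name `count_first_of_month_sundays` is made up for illustration.

```python
import pandas as pd


def count_first_of_month_sundays(start="1901-01-01", end="2000-12-31"):
    """Count the months in [start, end] whose first day is a Sunday."""
    # freq="MS" (month start) yields the first day of every month in range.
    first_days = pd.date_range(start=start, end=end, freq="MS")
    # dayofweek numbers Monday as 0, so Sunday is 6.
    return int((first_days.dayofweek == 6).sum())


print(count_first_of_month_sundays())  # prints 171
```

Generating only the month starts keeps the index at 1200 entries, instead of building every day of the century and filtering afterwards.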
1
vote
1answer
204 views

Operating multiple columns of one pandas DataFrame using data from another

I have a DataFrame of data from a survey that was repeated over several years, asking people about their income and how much money they had in savings. For simplicity, let's pretend it looks like this:...
2
votes
1answer
164 views

Creating a time-course dependent, correlation-based directed graph with Networkx

I have a correlation matrix containing 4 time points, each with multiple samples. Each sample is identified with a time point with its name. What I am trying to accomplish here is to create a directed ...
4
votes
1answer
42 views

Reading groups of files and concatenating them

I have made some adjustments to some code that you can see on this thread: Read daily files and concatenate them. I would like to make some further refinements and make sure I am on the right track ...