Pandas is a Python data analysis library.


5
votes
1answer
19 views

Terminal handling charges for different countries

I have the following problem that I solved using Python (Numpy/Pandas) and the code is provided later on. I mainly program in Java and wrote this program for a job interview as a Python developer. I ...
4
votes
1answer
23 views

Calculating speed from a Pandas Dataframe with Time, X, and Y columns

I'm trying to calculate speed between consecutive timepoints given data that is in a '.csv' file with the following columns: "Time Elapsed", "x", and "y". The ultimate goal is to get the data into a ...
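
A minimal sketch of the calculation this question describes, not the asker's code: speed between consecutive timepoints is the Euclidean distance between successive (x, y) positions divided by the elapsed time. Only the column names come from the excerpt; the sample values are hypothetical and stand in for the '.csv' file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Time Elapsed": [0.0, 1.0, 2.0, 4.0],
    "x": [0.0, 3.0, 3.0, 9.0],
    "y": [0.0, 4.0, 4.0, 12.0],
})

# Difference between each row and the previous one; the first row becomes NaN.
step = df[["Time Elapsed", "x", "y"]].diff()

# Distance covered in each step divided by the time the step took.
df["speed"] = np.hypot(step["x"], step["y"]) / step["Time Elapsed"]
```

The first row has no predecessor, so its speed is left as NaN rather than being filled with a guess.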
2
votes
0answers
22 views

Calculating frequencies of each obs in the data

I am currently attempting to make some code more maintainable for a research project I am working on. I am definitely looking to create some more functions, and potentially create a general class to ...
5
votes
0answers
70 views

Sanitize and standardize street addresses using Google Maps lookups

Take in incomplete user-entered street addresses, clean them, segregate them by word count, run them through the Google Maps Places API, and output completed & standardised addresses based on the JSON object ...
2
votes
1answer
28 views

Check value existence in two lists and gather the result in a csv

I have two lists of protein sequences, and I have to check every entry's existence in the two lists, say like ...
3
votes
0answers
50 views

Efficiently writing string comparison functions in Python

Let's say I work for a company that hands out different types of loans. We are getting our loan information from a big data mart from which I need to calculate some additional things to calculate ...
3
votes
1answer
34 views

Categorization algorithm for discrete variables

I am trying to categorize some data. For that I check the distribution of the data. Then I split based on the number of appearances of each value. The algorithm I have is working so far but really ...
1
vote
0answers
33 views

bullet-proof way to generate a list of object attributes (if object has them)

Background I wrote a couple of functions that make it possible to generate a list of attributes for any given object that has them. I use the list that is ...
1
vote
0answers
61 views

Adjust p-values using Benjamini-Hochberg FDR

Background I need to find the adjusted p-values for multiple hypothesis tests using Benjamini-Hochberg FDR. To do this I have adapted portions of Christopher Naugler's Bonferroni Calculator for use ...
6
votes
1answer
51 views

Expanding url from shortened url obtained from tweet

I have a twitter data set. I have extracted all the expanded urls from the json and now am trying to resolve the shortened ones. Also, I need to check which urls are still working and only keep those. ...
5
votes
0answers
25 views

Undoing corrections to a big dataframe

I have 2 dataframes. The first one (900 lines) contains corrections that have been applied to a deal. The second dataframe (140 000 lines) contains the list of deals with corrected values. What I am ...
5
votes
2answers
112 views

Code for creating combinations taking a long time to finish

I have the following code that I'm using to create combinations of elements in my dataset: ...
6
votes
0answers
54 views
5
votes
0answers
24 views

Coalesce consecutive failures in a DataFrame of hourly sensor readings

I have a PANDAS DataFrame that contains sensor data that is recorded every hour (sample included below). It is important to note that every hour is not necessarily in the dataframe, as sometimes the ...
3
votes
0answers
78 views

Subtract multiple columns in PANDAS DataFrame by a series (single column)

Background I have tons of very large pandas DataFrames that need to be normalized with the following operation; log2(data) - mean(log2(data)) Example Data The example DataFrame ...
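
One way the operation log2(data) - mean(log2(data)) could be written, as a sketch under stated assumptions rather than the asker's implementation. The excerpt does not say which axis the mean runs over; this assumes a per-row mean, which matches the title's "subtract multiple columns by a series". The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the real DataFrames are described as very large.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 4.0], "b": [4.0, 8.0, 16.0]},
    index=["r1", "r2", "r3"],
)

log_data = np.log2(df)

# Assumption: the mean is taken across the columns of each row,
# giving one value per row (a Series indexed like the rows).
row_means = log_data.mean(axis=1)

# axis=0 aligns the Series with the row index, so each row's mean is
# subtracted from every column in that row.
normalized = log_data.sub(row_means, axis=0)
```

If the mean is meant per column instead, `log_data - log_data.mean()` does the equivalent subtraction, because pandas aligns a Series with the column labels by default.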
8
votes
2answers
79 views

Extract unique terms from a PANDAS series

Background I have to process tons of DataFrames with shapes of ~230 columns x ~2000-50000+ rows. Here is an extremely simplified example; ...
6
votes
0answers
36 views

Pandas code for calculating distance and time between waypoints for large files

This is very similar to other code I've posted, however this is designed for very large csv files, for example 35gb. The files typically look like this: ...
8
votes
2answers
104 views

Curried function

From the question and since I'm currently learning functional programming I was inspired to write the following (curried) function: ...
5
votes
1answer
84 views

Clustering points on a sphere

I have written a short Python program which does the following: loads a large data file (\$10^9+\$ rows) where each row is a point on a sphere. The code then loads a pre-determined triangular grid on ...
4
votes
1answer
54 views

Optimize calculation for the amount of time between when a value changes in Pandas

I have some sensor data that contains timestamps from when a machine is turned on and an indicator variable showing whether or not the machine is actively running. The data is mostly recorded every ...
6
votes
0answers
110 views

Calculating T-Test within Large Pandas Dataframes

The below code runs a t-statistic within a large dataframe (rnadf) based on masked values from another dataframe (cnvdf_maked). ...
4
votes
1answer
50 views

PANDAS code for calculating distance between waypoints

I've written some python code designed to take a csv of waypoints for a series of trips, and calculate the distance of each trip by the sum of the distance between the waypoints. An example csv might ...
4
votes
1answer
87 views

Conditional Concatenation of a Pandas DataFrame

I am concatenating columns of a Python Pandas Dataframe and want to improve the speed of my code. My data has the following structure: ...
-1
votes
2answers
143 views

Interpret YYYYMMDD as the nth day of the year [closed]

I am provided a bunch of dates in the format (YYYYMMDD) such as: date='20170503' My goal is to convert that date into the ...
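
Going by the title, the conversion is from a YYYYMMDD string to the day number within its year. A minimal sketch using pandas, assuming the input is always a valid eight-digit date string; the helper name `day_of_year` is hypothetical.

```python
import pandas as pd


def day_of_year(date: str) -> int:
    """Return the 1-based day of the year for a 'YYYYMMDD' string."""
    # An explicit format avoids any guessing about how the digits are laid out.
    return int(pd.to_datetime(date, format="%Y%m%d").dayofyear)


date = "20170503"
nth_day = day_of_year(date)  # 3 May is the 123rd day of 2017
```

For a whole column of such strings, `pd.to_datetime(series, format="%Y%m%d").dt.dayofyear` applies the same conversion without a Python-level loop.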
0
votes
1answer
39 views

Calls one column of a dataframe, turns it into an array and plots it

My code calls 1 column of a dataframe, turns it into an array and plots it. I want to be able to do this for all the columns without having to repeat the code many times. How can I do this? ...
1
vote
2answers
531 views

Reading from a .txt file to a pandas dataframe

Having a text file './inputs/dist.txt' as: ...
4
votes
1answer
53 views

Applying different equations to a Pandas DataFrame

I wrote a task using pandas and I'm wondering if the code can be optimized. The code pretty much goes as: All the dataframes are 919 * 919. socio is a dataframe ...
5
votes
2answers
134 views

Split latitude/longitude by degree to make file names and folder directory names

This code works and does exactly as I want, but I am wondering if/how it could be done faster/more efficiently? This runs quickly on a demo file, but bogs down when I introduce much larger files (3-...
1
vote
2answers
1k views

Finding the states with the three most populous counties

I just started to use Python and Pandas. My current solution to a problem looks ugly and inefficient. I would like to know how to improve it. The data file is Census 2010 and can be viewed here. Question: ...
2
votes
1answer
75 views

Simplify DataFrame operations in Python

I am working on learning how to do frequency analysis of Server Fault question tags to see if there is any useful data that I can glean from them. I'm storing the raw data in Bitbucket for global ...
5
votes
2answers
106 views

Select the n most frequent items from a pandas groupby dataframe

I'm working on trying to get the n most frequent items from a pandas dataframe similar to ...
1
vote
1answer
35 views

Calls many stored dataframes, change them and combine them

I call many dataframes (using pickle), then I extract the interesting values of each one and finally combine them. Is there a better way to do it? For example with a for loop call them automatically ...
3
votes
1answer
83 views

Compressing time series by removing repeated samples

I work on a project with time series data. So there are samples (\$y\$), and each sample has a timestamp (\$x\$). The data will be visualized, but often there are time series which contain samples ...
4
votes
3answers
162 views

Plotting from a Pandas dataframe

I want to improve my code. Is it possible to get the plot without repeating the same instructions multiple lines? The data comes from a Pandas' dataframe, but I am only plotting the last column (...
0
votes
0answers
43 views

Cross-join time-series using asof criteria while limiting number of rows

I have a unique requirement around time-series analysis and have presented my requirements along with a simple working solution. I have also worked out sample example to help understand my ...
0
votes
0answers
15 views

Sending the accuracies of 4 cross-validations to a pandas dataframe

I want to improve my code. It performs 4 cross-validations: 10x20, 10x10, 10x5, and 10x2. Then it stores the values in Pandas where I will get the average of each column and multiply them by 100. You ...
3
votes
0answers
57 views

Creating multiple observations from single observation depending on conditions

I have written a function in Python with three loops, which is time-consuming. Is it possible to do the same operation in less time some other way? Here are my code and sample data, which you can run at ...
5
votes
3answers
134 views

Matching values from html table for updating values in pandas dataframe

This is more of an exercise for me to get used to Pandas and its dataframes. For those who haven't heard of it: Pandas is a Python package providing fast, flexible, and expressive data structures ...
3
votes
0answers
285 views

Pandas calculation speed of stock beta on many dataframes

I have many (4000+) CSVs of stock data (Date, Open, High, Low, Close) which I import into individual Pandas dataframes to perform analysis. I am new to Python and want to calculate a rolling 12month ...
1
vote
1answer
465 views

Using pandas and sklearn for forecasting stock market return

I have been using R for stock analysis and machine learning purposes but read somewhere that Python is a lot faster than R, so I am trying to learn Python for that. I am using Yhat's rodeo IDE (Python ...
6
votes
1answer
186 views

Similarity research : K-Nearest Neighbour(KNN) using a linear regression to determine the weights

I have a set of houses with categorical and numerical data. Later I will have a new house and my goal will be to find the 20 closest houses. The code is working fine, and the results are not so bad but ...
2
votes
0answers
58 views

Using pandas.io.pytables.TableIterator for several iterations

I'm writing a method that takes a pandas iterator and then computes something based on that. The problem is that some of these computations require iterating twice over the data. For example the ...
5
votes
1answer
83 views

Extracting time duration in the session from 30 million rows

I am looking to make my code faster. I am working on the yoochoose recsys 2015 dataset [recsys2015] and trying to perform some transformations. It has more than 30 million rows of data. The goal of ...
3
votes
0answers
78 views

KNN pipeline w/ cross_validation_scores

Using the wine quality dataset, I'm attempting to perform a simple KNN classification (w/ a scaler, and the classifier in a pipeline). It works, but I've never used ...
3
votes
1answer
130 views

Calculate working minutes between two timestamps

I've created a function to calculate the working minutes between two timestamps. I class working minutes as those between 9am - 5pm and not on weekends or national holidays. The holidays are those ...
3
votes
1answer
37 views

Filling in gym membership prices by joining PANDAS dataframes

I have two dataframes, df which contains a list of members in addition to the type of contract that they purchased on a given date, ...
3
votes
2answers
78 views

Optimize a simple and quick python script for transposing a .csv file

I need to transpose the following file output1.csv, which is a result from a quantum chemistry calculation, into a single column efficiently: ...
1
vote
0answers
32 views

Grouping logs and providing counts

The code is supposed to group start and end time logs and provide log counts and unique ID counts. The grouping will be variable: 1 hour, 6 hours, 12 hours, 24 hours, etc. What is the better way to ...
1
vote
0answers
33 views

ChiSquare test Code in Python

I was trying to write code from scratch for a chi-square test. This is the code that I wrote in Python using pandas. I was unsure whether the code produces the desired output or can be written ...