
All Questions

1 vote
1 answer
317 views

Better way to create a contingency table with pandas for film genres from a Film DataFrame

From a public dataset on film ratings I created a contingency table as follows. Honestly, I don't like all these for-loops; I think the quality of the code can definitely be improved ...
Andrea Ciufo
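A loop-free version is usually a one-liner with `pd.crosstab`. A minimal sketch, assuming a hypothetical DataFrame with `genre` and `rating` columns (the real dataset's column names may differ):

```python
import pandas as pd

# Hypothetical film data; substitute the real dataset's columns.
films = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Horror", "Comedy"],
    "rating": ["PG", "PG-13", "R", "R", "PG"],
})

# pd.crosstab builds the genre-by-rating contingency table in one call,
# replacing the nested for-loops.
table = pd.crosstab(films["genre"], films["rating"])
print(table)
```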
2 votes
0 answers
197 views

Producer-consumer pipeline implementation in asyncio

I wrote this code to make a non-blocking manager with pipeline operations using asyncio. My main concern is how to catch items received from the producer, and to know when the receive operation is complete. I want ...
etyzz • 21
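For reference, the standard shape of an asyncio producer-consumer pipeline uses a bounded `asyncio.Queue` and a sentinel to signal completion. A minimal sketch (the item source and the `None` sentinel are illustrative, not the poster's actual code):

```python
import asyncio

async def producer(queue: asyncio.Queue) -> None:
    for item in range(5):          # hypothetical work items
        await queue.put(item)
    await queue.put(None)          # sentinel: nothing more to produce

async def consumer(queue: asyncio.Queue) -> None:
    while True:
        item = await queue.get()
        if item is None:           # sentinel seen: the pipeline is drained
            break
        print(f"processed {item}")

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # bound applies backpressure
    await asyncio.gather(producer(queue), consumer(queue))

asyncio.run(main())
```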
2 votes
0 answers
748 views

How to avoid bottlenecks in JSON processing with Apache Beam?

I have an input with some transaction data in JSON (in this case a file) ...
Lin • 357
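One common way to keep Beam JSON processing parallel is to read the file as newline-delimited JSON and parse each line in a `Map`, so no single worker has to hold the whole document. A sketch under that assumption (the file name and the `account`/`amount` fields are hypothetical):

```python
import json
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("transactions.json")  # one JSON object per line
        | "Parse" >> beam.Map(json.loads)                       # each line parsed independently
        | "KeyByAccount" >> beam.Map(lambda tx: (tx["account"], tx["amount"]))
        | "SumPerAccount" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda acct, total: f"{acct},{total}")
        | "Write" >> beam.io.WriteToText("totals")
    )
```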
2 votes
1 answer
505 views

Analyzing patient treatment data using Pandas

I work in the population health industry and get contracts from commercial companies to conduct research on their products. This is the general code to identify target patient groups from a provincial ...
KubiK888 • 225
3 votes
1 answer
3k views

Finding word association strengths from an input text

I have written the following (crude) code to find the association strengths among the words in a given piece of text. ...
Kristada673
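A common baseline for association strength is co-occurrence counting within a sliding window. A minimal sketch of that idea (the window size and whitespace tokenization are simplifying assumptions, not the poster's approach):

```python
from collections import Counter

def association_strengths(text: str, window: int = 2) -> Counter:
    """Count how often word pairs co-occur within `window` positions."""
    words = text.lower().split()
    pairs: Counter = Counter()
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if word != other:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

print(association_strengths("the cat sat on the mat the cat slept").most_common(3))
```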
2 votes
2 answers
314 views

Python program to rank names based on the frequency with which they appear in text files

I've written a Python program to rank the names that appear in the file(s) based on their frequency. In other words, there are multiple files, and I want to rank the names by how often they appear ...
nsivakr • 163
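`collections.Counter` handles the whole rank-by-frequency step. A minimal sketch, assuming each file holds whitespace-separated names (the file names are hypothetical):

```python
from collections import Counter
from pathlib import Path

def rank_names(paths):
    """Rank names by how often they appear across several text files."""
    counts: Counter = Counter()
    for path in paths:
        counts.update(Path(path).read_text().split())
    return counts.most_common()   # sorted most-frequent first

print(rank_names(["names1.txt", "names2.txt"]))  # hypothetical files
```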
5 votes
1 answer
6k views

k-means using numpy

This is a k-means implementation using Python (numpy). I believe there is room for improvement when it comes to computing distances (given that I'm using a list comprehension, maybe I could also pack it in a ...
Adel Redjimi
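The list comprehension over points can indeed be packed into a single broadcasted numpy expression. A sketch of the vectorized assignment step (the random data stands in for real input):

```python
import numpy as np

def assign_clusters(points: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Vectorized nearest-centroid assignment via broadcasting,
    replacing a per-point list comprehension."""
    # (n, 1, d) - (1, k, d) -> (n, k, d); norm over the last axis gives (n, k)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    return distances.argmin(axis=1)

rng = np.random.default_rng(0)
points = rng.random((100, 2))
centroids = rng.random((3, 2))
print(assign_clusters(points, centroids)[:10])
```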
3 votes
0 answers
524 views

Pandas data extraction task taking too much memory. How to optimize for memory usage?

I need to process some data (one of its columns contains a JSON/dict with params; I need to extract those params into individual columns of their own; the catch: some rows have some parameters, others have ...
James Kumar
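`pd.json_normalize` expands a column of dicts into columns in one pass, filling missing keys with NaN, which usually beats building intermediate per-row structures. A sketch with a hypothetical `params` column:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "params": [{"a": 1}, {"a": 2, "b": 3}, {"b": 4}],  # keys vary per row
})

# Expand the dicts into columns; rows missing a key get NaN.
expanded = pd.json_normalize(df["params"].tolist())
result = df.drop(columns="params").join(expanded)
print(result)
```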
3 votes
1 answer
584 views

Large dataset with pyspark - optimizing join, sort, compare between rows and group by with aggregation

I have a csv file with more than 700,000,000 records in this structure: ...
phoebe • 33
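For row-to-row comparison at this scale, a window function partitioned by a key avoids a global sort. A sketch assuming a hypothetical schema with `id`, `ts`, and `value` columns (the file names are placeholders):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("sketch").getOrCreate()

# Hypothetical layout: one row per event with an id, a timestamp, and a value.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Compare each row with its predecessor within a partition instead of sorting globally.
w = Window.partitionBy("id").orderBy("ts")
with_delta = df.withColumn("delta", F.col("value") - F.lag("value").over(w))

result = with_delta.groupBy("id").agg(F.sum("delta").alias("total_delta"))
result.write.csv("output", header=True)
```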
5 votes
0 answers
213 views

Code for training machine learning linear regression and SVM

OK, for my final year project I've written this piece of code to train my machine learning model on this dataset; here is the code I used ...
Espoir Murhabazi
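The usual scikit-learn shape for this task is a shared train/test split with both estimators fit in a loop. A minimal sketch with stand-in data (the real project would load its own dataset in place of `make_regression`):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Stand-in data; the real project loads its own dataset here.
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), SVR()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))  # R^2 on held-out data
```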
6 votes
3 answers
10k views

Gradient descent for linear regression using numpy/pandas

I am currently following along with Andrew Ng's Machine Learning course on Coursera and wanted to implement the gradient descent algorithm in Python 3 using ...
Hericks • 351
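For comparison, the fully vectorized batch update from the course is one line of numpy per iteration. A sketch using the course's theta/alpha notation on synthetic data (the learning rate and iteration count are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Batch gradient descent for linear regression
    (theta: parameters, alpha: learning rate)."""
    m = len(y)
    X = np.column_stack([np.ones(m), X])   # prepend the intercept column
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        error = X @ theta - y
        theta -= alpha / m * (X.T @ error)  # vectorized gradient step
    return theta

rng = np.random.default_rng(0)
X = rng.random(100)
y = 3 + 2 * X + rng.normal(scale=0.1, size=100)
print(gradient_descent(X, y))  # should approach [3, 2]
```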
3 votes
1 answer
95 views

Calculating the frequency of each observation in the data

I am currently attempting to make some code more maintainable for a research project I am working on. I am definitely looking to create some more functions, and potentially create a general class to ...
zackymo21
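If the observations live in a pandas column, `value_counts` already returns the frequencies, absolute or relative, and can replace a hand-rolled counting function. A minimal sketch with a hypothetical column:

```python
import pandas as pd

obs = pd.Series(["a", "b", "a", "c", "a", "b"])  # hypothetical observations

print(obs.value_counts())                 # absolute counts per value
print(obs.value_counts(normalize=True))   # relative frequencies
```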
7 votes
1 answer
475 views

Pandas DataFrame operations to analyze top Server Fault tags [closed]

I am working on learning how to do frequency analysis of Server Fault question tags to see if there is any useful data that I can glean from them. I'm storing the raw data in Bitbucket for global ...
Sienna • 463
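A typical pandas pipeline for tag frequency is split → explode → value_counts. A sketch assuming the raw export stores space-separated tags per question (the column layout is hypothetical):

```python
import pandas as pd

questions = pd.DataFrame({
    "tags": ["linux nginx", "windows dns", "linux dns dns"],  # hypothetical rows
})

# One tag per row, then count occurrences across all questions.
tag_counts = questions["tags"].str.split().explode().value_counts()
print(tag_counts)
```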
3 votes
1 answer
52 views

Efficient implementation of aggregating test/train data

Here is a short Python snippet to ingest training data: ...
envy_intelligence
5 votes
1 answer
285 views

Data analytics on static file of 50,000+ tweets

I'm trying to optimize the main loop portion of this code, as well as learn any "best practices" insights I can for all of the code. This script currently reads in one large file full of tweets (50MB ...
Daniel Brown
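For a file that size, streaming line by line keeps memory flat compared with reading everything up front. A sketch assuming one JSON tweet per line with Twitter-style `entities.hashtags` fields (both assumptions, since the real format isn't shown):

```python
import json
from collections import Counter

hashtags: Counter = Counter()

# Stream the file instead of loading all 50 MB at once.
with open("tweets.json", encoding="utf-8") as fh:   # hypothetical file name
    for line in fh:
        tweet = json.loads(line)
        for tag in tweet.get("entities", {}).get("hashtags", []):
            hashtags[tag["text"].lower()] += 1

print(hashtags.most_common(10))
```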
