Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
2
votes
0answers
16 views
Closest points using Rabin randomizing approach
I was told to use the following Rabin algorithm to find the shortest distance between 2 points in 2D:
Randomly choose sqrt(n) points and brute-force to find the closest ...
0
votes
0answers
41 views
Find k nearest points
I'm working on a problem to select the k nearest points to a given point. Any advice on bugs or improvements is appreciated, including general advice on implementing a k-nearest-points search.
My major idea is ...
2
votes
1answer
52 views
Zip code reduce function
My task is to write a function that would take an array of zip codes and spit out only the zip codes that do not qualify. A non-qualifying zip code will not exist in the database and does NOT have ...
0
votes
0answers
81 views
Cosine similarity computation
I have a matrix of ~4.5 million vectors (shape [4.5mil, 300]) and I want to calculate the distance between a vector of length 300 and every entry in the matrix.
I got some great performance time ...
3
votes
1answer
70 views
OpenCV 3: Using k-Nearest Neighbors to analyse RGB image
I'm new to computer vision and numpy.
I wrote a simple script to separate red, green and blue colors from the original image by using the kNN algorithm.
After reading through some numpy tutorials, I'...
3
votes
1answer
247 views
Finding closest pair of 2D points, using divide-and-conquer
I'm learning C++ as well as algorithms. Here's my implementation of finding the closest pair problem. I tried to minimize memory by using only iterators.
And points are being read from ...
3
votes
0answers
53 views
Divide-and-conquer approach for finding the closest pair of points
This is an algorithm for finding the closest pair of points on a 2d plane by dividing the problem by half recursively, as illustrated here:
...
5
votes
2answers
109 views
Calculating cooccurrence probabilities for pairs of words in a document
This was a 1.5-hour coding test that started the moment the question was sent by email. I wrote my solution under that strict time limit. I was not told anything before the test.
The question is about ...
5
votes
1answer
202 views
K-means clustering implemented in Python 3
Here is the classic K-means clustering algorithm implemented in Python 3. My main concern is time/memory efficiency and if there are version specific idioms that I could use to address issues of the ...
3
votes
1answer
184 views
Clustering 16 million records in parallel
I have a dataset with 16 million rows, which may grow to upwards of 30 million. I am using parLapply to run across three cores in R. But it's taking two days to ...
2
votes
0answers
118 views
Solving the Mining algorithm from HackerRank
I was working on this problem for a few hours last night and finally came up with a brute-force solution. The task is to report the minimum work necessary (sum of weight × distance) to relocate gold ...
4
votes
1answer
72 views
“Similar Destinations” challenge
I am currently solving the Similar Destinations challenge on HackerRank and am in need of some assistance in the code optimization/performance department. The task is to take a list of up to 1000 ...
6
votes
1answer
141 views
Similarity research : K-Nearest Neighbour(KNN) using a linear regression to determine the weights
I have a set of houses with categorical and numerical data. Later I will have a new house and my goal will be to find the 20 closest houses.
The code is working fine, and the results are not so bad but ...
3
votes
2answers
42 views
Getting the smallest snippet from content containing all keywords
This returns the smallest snippet from the content containing all of the given keywords (in any order). This provides the correct solution but I would like to know if it can be made more efficient.
<...
3
votes
0answers
47 views
KNN pipeline w/ cross_validation_scores
Using the wine quality dataset, I'm attempting to perform a simple KNN classification (w/ a scaler, and the classifier in a pipeline). It works, but I've never used ...
5
votes
1answer
149 views
Clustering nodes with Hamming distance < 3
I want to speed up the following code, which is from an algorithm class.
I get a list of 200000 nodes where every node is a tuple of length 24 and every item is either a 1 or a 0.
These ...
2
votes
1answer
74 views
Predict new ratings for each user based on their pearson correlation with other users
I am new to R and programming. I have a set of ratings for 45000 users and 40 odd movies. I need to predict new ratings for each user based on their pearson correlation with other users. I also need ...
5
votes
2answers
92 views
Grouping rectangles horizontally and vertically
As you can see, the code below for each method is the same, except for the properties it uses. For example X vs Y and ...
0
votes
0answers
19 views
Applying kmodes on every “column wise subset” of a dataframe
I want to apply kmodes for 2 clusters on every possible combination of columns from a dataframe. Finally, I want to compare the clusters with another column that ...
2
votes
1answer
165 views
K-Means Clustering - F# Learning Challenge
Inspired by this blog, I went on to implement my own version as an F# learning challenge. It turned out to be quite different from the original (but somewhat faster for large samples).
The first code ...
14
votes
1answer
268 views
Dynamic Colour Binning: Grouping Similar Colours in Images
This is a piece of code that implements an image-processing algorithm I came up with. I call it Dynamic Colour Binning. It's a fairly academic exercise that was more about providing a learning ...
5
votes
1answer
680 views
k-means clustering algorithm implementation
Here is my personal implementation of the clustering k-means algorithm.
...
2
votes
1answer
186 views
Clustering similar tweets in a corpus
I am attempting to write a statistical program using an LDA model I've trained/created using Gensim. I am very new to Python and am a student level programmer. This current program is working and ...
5
votes
1answer
116 views
K-means clustering in Rust
I've implemented K-means clustering in Rust. It's my second Rust project (my first one is here: Randomly selecting an adjective and noun, combining them into a message)
I would like advice on ...
1
vote
1answer
46 views
Store and output hard-coded relationships among hosts
The following code has begun to smell, but I have not yet decided with what to replace it, other than, obviously, a database.
I made a very unsatisfactory workaround for my attempt to make ...
1
vote
1answer
83 views
DBSCAN in C++ for general and Android use
I've implemented a templated DBSCAN for general use. At the moment, it's going to be used on Android through the JNI. I used Wikipedia's pseudocode and a little bit of the DBSCAN paper for reference. ...
3
votes
1answer
224 views
PANDAS spatial clustering
I'm writing a spatial clustering algorithm using pandas and scipy's kdtree. I profiled the code and the .loc part takes the most time for bigger datasets. I wonder ...
2
votes
0answers
90 views
Cluster arrays according to similarity of key values
The script below will compare a set of arrays according to similarities between their keys' values. For example, if the first 4 key values of an array are equal to another array's first 4 key values,...
8
votes
1answer
809 views
Implementing a fast DBScan in C#
I tried to implement a DBScan in C# using kd-trees. I followed the implementation from here.
...
15
votes
4answers
576 views
N closest points to the reference point
Here is working code to get N closest points to some reference point.
Please help to improve it, specifically by commenting on my use of std algorithms and ...
1
vote
0answers
202 views
Depth First Search for percolation to find clusters in Go game
I have some questions about Depth First Search and whether I implemented it correctly. Below is a more thorough discussion. The graph in question is a randomly colored square grid (I use 3 colors). ...
1
vote
0answers
27 views
Collaborative filtering to group similar users and products
I'm building a product recommendation module based on collaborative_filtering.
The recommendation will be generated by users, ...
2
votes
0answers
145 views
C# port of data mining algorithm much slower than reference implementation
I was trying to implement the algorithm specified in this research paper (please ignore the math, since it's irrelevant to the question). This algorithm is very basic in formal concept analysis. The ...
4
votes
1answer
2k views
Implementation of KNN in R
I have implemented the K-Nearest Neighbor algorithm with Euclidean distance in R. It works fine but takes tremendously more time than the library function (get.knn). Please point out the possibility ...
3
votes
1answer
101 views
Simple string-root detection in a string-family
(This problem is related to Simple string-split by root and sufix algorithm)
There are many ways to find a "common root" of a list of similar strings that begin with the same substring... The ...
7
votes
2answers
218 views
Finding the maximum pairwise difference in a collection of colors
Note that this problem is equivalent to finding the longest line segment defined by any two points in a collection of 3D coordinates, which may be an easier way to visualize the problem, and is almost ...
6
votes
3answers
615 views
Finding clusters in a matrix
I was asked at an interview to write a program that, given an NxM matrix with zeros and ones, prints out the list of clusters of 1s. The clusters are defined as patches of 1s connected horizontally, ...
2
votes
0answers
586 views
Discretization of continuous attributes for automatic classification [closed]
Background
In machine learning, it's common to encounter the problem of making a decision as to which discrete category an object belongs to based on a set of continuous attributes. For example, we ...
5
votes
4answers
2k views
Aggregate array values into ranges
In five minutes I made a pretty ugly looking function. Can you help before I have to commit the code into history?
Requirements:
I would like a function that takes an array of numbers, and ...
4
votes
2answers
14k views
K-means clustering algorithm in python
Here is my implementation of the k-means algorithm in python. I would love to get any feedback on how it could be improved or any logical errors that you may see. I've left off a lot of the ...
8
votes
3answers
257 views
Breadth-first search for clusters of pixels in a given color range
I am a beginner in programming languages, so I apologise if my code is badly formatted or doesn't make any sense.
My program gets an image and an RGB color range as input and it counts how many pixels ...
3
votes
1answer
2k views
Finding the shortest substring containing keywords
Problem: Write a function that takes a String document and a String[] keywords and returns the smallest substring of ...
7
votes
2answers
161 views
What are my highest activity streaks?
I have written the following query to figure out activity streaks on a per-user basis. I find it... Ugly... And would love to improve it!
Limitations
Those are explained as commented text at the ...
1
vote
0answers
63 views
A scalable function to get boundary vertices in a graph
Given a community division I need a list of vertices that have edges in more than one community, i.e., boundary vertices.
I've tried this:
...
1
vote
0answers
356 views
Optimize QuadTree to find K Nearest Neighbors
I'm looking for a way to make my k nearest neighbors search more efficient. The context of the question is that I'm given a list of topics that have a unique ID (integer) and an (x,y) coordinate (floats) ...
8
votes
1answer
886 views
K-Means in Rust
I have implemented for learning purposes a simple K-Means clustering algorithm in Rust. For those who are not familiar: you are given N points, say in the plane, ...
9
votes
1answer
324 views
Bitmap problem performance boost
Problem statement:
Input is a rectangular bitmap like this:
0001
0011
0110
The task is to find for each black (0) "pixel", the distance to the
...