1

I have a function that use a generator to loop over large 2D python lists of float coordinates in order to create flat lists of integers that represent the distance between coordinates.

point_input = {"x": -8081441.0, "y": 5685214.0}
output = [-8081441, 5685214]

polyline_input = {"paths" : [[-8081441.0, 5685214.0], [-8081446.0, 5685216.0], [-8081442.0, 5685219.0], [-8081440.0, 5685211.0], [-8081441.0, 5685214.0]]}
output = [[-8081441, 5685214, 5, -2, -4, -3, -2, 8, 1, -3]]

polygon_input = {"rings" : [[-8081441.0, 5685214.0], [-8081446.0, 5685216.0], [-8081442.0, 5685219.0], [-8081440.0, 5685211.0], [-8081441.0, 5685214.0]]}
output = [[-8081441, 5685214, 5, -2, -4, -3, -2, 8, 1, -3]]

pure python:

def geometry_to_distance(geometry, geometry_type):
    def calculate_distance(coords):
        iterator = iter(coords)
        previous_x, previous_y = iterator.next()
        yield int(previous_x)
        yield int(previous_y)
        for current_x, current_y in iterator:
            yield int(previous_x - current_x)
            yield int(previous_y - current_y)
            previous_x, previous_y = current_x, current_y

    if geometry_type == "POINT":
        distance_array = [int(geometry["x"]), int(geometry["y"])]
    elif geometry_type == "POLYLINE":
        distance_array = [list(calculate_distance(path)) for path in geometry["paths"]]
    elif geometry_type == "POLYGON":
        distance_array = [list(calculate_distance(ring)) for ring in geometry["rings"]]
    else:
        raise Exception("{} geometry type not supported".format(geometry_type))

    return distance_array

For speed performance, I want to use cython implementation of the same function. I am using type declaration for integer variables in the calculate_distance function.

cython implementation:

def geometry_to_distance(geometry, geometry_type):
    def calculate_distance(coords):
        cdef int previous_x, previous_y, current_x, current_y
        iterator = iter(coords)
        previous_x, previous_y = iterator.next()
        yield previous_x
        yield previous_y
        for current_x, current_y in iterator:
            yield previous_x - current_x
            yield previous_y - current_y
            previous_x, previous_y = current_x, current_y 

    if geometry_type == "POINT":
        distance_array = [geometry["x"], geometry["y"]]
    elif geometry_type == "POLYLINE":
        distance_array = [list(calculate_distance(path)) for path in geometry["paths"]]
    elif geometry_type == "POLYGON":
        distance_array = [list(calculate_distance(ring)) for ring in geometry["rings"]]
    else:
        raise Exception("{} geometry type not supported".format(geometry_type))

    return distance_array

here a script that can be used to benchmark the function:

import time
from functools import wraps
import numpy as np
import geometry_converter as gc

def timethis(func):
    '''Decorator that reports the execution time.'''
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(func.__name__, end-start)
        return result
    return wrapper


def prepare_data(featCount, size):
    ''' Create arrays of polygon geometry (see polygon_input above)'''
    input = []
    for i in xrange(0, featCount):
        polygon = {"rings" : []}
        #random x,y coordinates inside a quadrant of the world bounding box in a spherical mercator (epsg:3857) projection
        ys = np.random.uniform(-20037507.0,0,size).tolist()
        xs = np.random.uniform(0,20037507.0,size).tolist()
        polygon["rings"].append(zip(xs,ys))
        input.append(polygon)
    return input

@timethis
def process_data(data):
    output = [gc.esriJson_to_CV(x, "POLYGON") for x in data]
    return output

data = prepare_data(100, 100000)
process_data(data)

Is there improvements that could increase performance in the cython implementation? maybe by using 2D cython arrays or carrays?

7
  • Why not just use numpy.diff to take the 1st difference of the X and Y coordinates?
    – pbreach
    Commented Dec 8, 2016 at 20:17
  • Because it appears that creating numpy.array from the huge 2D python lists is too slow. Commented Dec 8, 2016 at 20:23
  • You'd have the same problem going to cython or c arrays as well. Lists are not stored in contiguous memory while (homogenous) numpy, cython, and c arrays are. So the conversion will take some time regardless of these approaches. I'm surprised numpy.diff isn't faster than the cython implementation considering the use of generators and lists.
    – pbreach
    Commented Dec 8, 2016 at 20:31
  • you can see a numpy function that do the same, but even if its faster to calculate the distance, the conversion to numpy array make it a little bit slower overall Commented Dec 8, 2016 at 20:33
  • 1
    You could try pandas? It has fast io and would probably be much faster than using the json module
    – pbreach
    Commented Dec 9, 2016 at 14:29

1 Answer 1

1

The Python, rewritten without the generator, is

In [362]: polyline_input = {"paths" : [[-8081441.0, 5685214.0], [-8081446.0, 568
     ...: 5216.0], [-8081442.0, 5685219.0], [-8081440.0, 5685211.0], [-8081441.0
     ...: , 5685214.0]]}
In [363]: output=polyline_input['paths'][0][:] # copy
In [364]: i0,j0 = output
     ...: for i,j in polyline_input['paths'][1:]:
     ...:     output.extend([i0-i, j0-j][:])
     ...:     i0,j0 = i,j
     ...:     
In [365]: output
Out[365]: [-8081441.0, 5685214.0, 5.0, -2.0, -4.0, -3.0, -2.0, 8.0, 1.0, -3.0]

I'm just thinking though alternative ways of expressing the calculation. I could have used append to a list of pairs instead of the flat list.

An array equivalent:

In [375]: arr=np.array(polyline_input['paths'])
In [376]: arr[1:,:]=arr[:-1,:]-arr[1:,:]
In [377]: arr.ravel().tolist()
Out[377]: [-8081441.0, 5685214.0, 5.0, -2.0, -4.0, -3.0, -2.0, 8.0, 1.0, -3.0]

Ignoring the cost of converting a list to array, that looks like an efficient numpy operation. To improve on it in cython I expect you'd have to convert the array to memoryview, and iterate c style over pairs of values.

I forget why you are switching to this distance format. Are you trying to save some file space? Or speed up some downstream calculation?

1
  • Thank you for your answer! Yes the aim is to get a lightweight representation of the geometry to increase speed of sharing on network Commented Dec 9, 2016 at 13:03

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.