Linear regression with visualization

Question

I have created a small script that:

Creates a lot of random points.
Runs a small brute force search to find a rect that has a low error, that is a good fit for the data.
Runs a linear regression on the rect generated with brute force to further reduce the error.
Prints out information and plots the data.

Below is an example graph: the random points in blue, the brute-forced rect also in blue, and the 'linear regressed' rect in green.


from __future__ import division
import random
import matplotlib.pyplot as p
import numpy as np

def random_point():
    return (random.random(),random.random())

def list_of_random_points(n):
    return [random_point() for _ in range(n)]

def error_of_point(point,function):
    x,y = point[0],point[1]
    return abs(y - function(x))

def error_of_list_of_points(points,function):
    return sum((error_of_point(point,function) for point in points))


def get_starting_rect(points):
    min_error = 10**10
    for m1 in range(-10,10):
        for q1 in range(-10,10):
            m,q = m1/10,q1/10
            function = lambda x: m*x + q
            if error_of_list_of_points(points,function) < min_error:
                min_error = error_of_list_of_points(points,function)
                best_mq = m,q
    return best_mq

def get_approximate_m(m,q,points,sensibility):
    if error_of_list_of_points(points,lambda x: (m+sensibility)*x+q) < error_of_list_of_points(points,lambda x: m*x + q):
        m += sensibility
    elif error_of_list_of_points(points,lambda x: (m-sensibility)*x+q) < error_of_list_of_points(points,lambda x: m*x + q):
        m -= sensibility
    return m

def get_approximate_q(m,q,points,sensibility):
    if error_of_list_of_points(points,lambda x: (m)*x+q+sensibility) < error_of_list_of_points(points,lambda x: m*x + q):
        q += sensibility
    elif error_of_list_of_points(points,lambda x: (m)*x+q-sensibility) < error_of_list_of_points(points,lambda x: m*x + q):
        q -= sensibility
    return q

def approximate_better_m_and_q(m,q,points,sensibility):
    for _ in range(100):
        m = get_approximate_m(m,q,points,sensibility)
        q = get_approximate_q(m,q,points,sensibility)
    sensibility /= 10
    for _ in range(100):
        m = get_approximate_m(m,q,points,sensibility)
        q = get_approximate_q(m,q,points,sensibility)
    return m,q

def plot_rect(m,q):
    x = np.arange(0,1,0.1)
    y = [i*m + q for  i in x]
    p.plot(x,y)

def plot_rect_and_points(rect,points):
    p.scatter(*zip(*points))
    m,q = rect
    plot_rect(m,q)

def main():
    NUMBER_OF_POINTS = 1000
    points = list_of_random_points(NUMBER_OF_POINTS)


    m,q = get_starting_rect(points)
    plot_rect(m,q)

    print("""The rect generated with brute force of equation y = {}x + {} 
has an error of {}""".format(m,q,error_of_list_of_points(points,lambda x: m*x + q)))

    better_rect = approximate_better_m_and_q(m,q,points,0.1)
    m,q = better_rect
    print("""The rect generated with linear regression of equation y = {}x + {} 
starting from the rect generated with
brute force has an error of {}""".format(
    m,q,error_of_list_of_points(points,lambda x: m*x + q)))

    plot_rect_and_points((m,q),points)
    p.show()

if __name__ == "__main__":
    main()

@JanneKarila the review is about the code with rect changing it in line would invalidate the review. — Caridorc, Jan 7 '15 at 14:53

Veedrac · Accepted Answer · 2015-01-07 14:49:48Z

For some reason there's a trend among scientific Python users to import things with really short names, like

import numpy as np

This is somewhat bearable when it's np and you use it many times, but you use np once. Further, you

import matplotlib.pyplot as p

and only use this three times. It's not worth it; typing pyplot three times isn't going to make your code unduly long.

You should add more spacing in accordance with PEP 8, and cap line lengths to something less than what you currently have (100 characters should be enough).

random_point can just be

def random_point():
    return random.random(), random.random()

Namely, no brackets needed. You only use it once in

def list_of_random_points(n):
    return [random_point() for _ in range(n)]

so just write

def list_of_random_points(n):
    return [(random.random(), random.random()) for _ in range(n)]

instead. I'd also rename this to random_points. In fact, you only use this once, so I'd remove the function here too.

You only use error_of_point in error_of_list_of_points, so inline it, giving:

def error_of_list_of_points(points, func):
    return sum(abs(y - func(x)) for x, y in points)

You don't actually need a list, so call this error_in_points or similar.
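As a minimal sketch of that last point (the sample data and the helper `half` are made up for illustration), the inlined function accepts any iterable of (x, y) pairs, including a tuple or a generator:

```python
def error_in_points(points, func):
    # Sum of absolute vertical distances between each point and func(x).
    return sum(abs(y - func(x)) for x, y in points)


def half(x):
    # Hypothetical model used only for this demonstration: y = 0.5 * x.
    return 0.5 * x


# Illustrative data, chosen so the arithmetic is exact in floating point.
sample = ((0.0, 0.0), (1.0, 1.0), (2.0, 0.5))

# A tuple of pairs and a generator of pairs are handled the same way.
error_from_tuple = error_in_points(sample, half)
error_from_generator = error_in_points(((x, y) for x, y in sample), half)
```

Both calls give the same total because the function only iterates over its argument once and never indexes into it.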

In get_starting_rect, change

function = lambda x: m*x + q

to

def line(x): return m*x + q

I would also put it on two lines for readability. Since you use this so much, it makes sense to do

def line(m, q):
    def y(x):
        return m*x + q
    return y

So you can just do

error_in_points(points, line(m, q))

This also simplifies the loop:

for m in range(-10, 10):
    for q in range(-10, 10):
        if error_in_points(points, line(m / 10, q / 10)) < min_error:
            ...

You can then move to using min:

from itertools import product  # at the top of the module

mqs = product(range(-10, 10), range(-10, 10))
mqs = ((m/10, q/10) for m, q in mqs)

return min(mqs, key=lambda mq: error_in_points(points, line(*mq)))

or even

mqs = numpy.mgrid[-1:+1:0.1, -1:+1:0.1].reshape(2, -1).T
return min(mqs, key=lambda mq: error_in_points(points, line(*mq)))
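To show how these pieces fit together, here is a rough, self-contained sketch of the `product` variant. It assumes Python 3 (or `from __future__ import division`, as in the question) so that `m / 10` is true division; the three sample points are invented and lie on y = 0.5x + 0.2, which happens to be on the search grid:

```python
from itertools import product


def line(m, q):
    # Return the function x -> m*x + q for a fixed slope m and intercept q.
    def y(x):
        return m * x + q
    return y


def error_in_points(points, func):
    # Sum of absolute vertical distances between each point and func(x).
    return sum(abs(y - func(x)) for x, y in points)


def get_starting_rect(points):
    # Candidate (m, q) pairs on a 0.1 grid from -1.0 to 0.9 in both directions.
    mqs = ((m / 10, q / 10) for m, q in product(range(-10, 10), range(-10, 10)))
    # min() keeps the candidate whose line has the smallest total error.
    return min(mqs, key=lambda mq: error_in_points(points, line(*mq)))


# Made-up points lying on y = 0.5*x + 0.2.
points = [(0.0, 0.2), (0.5, 0.45), (1.0, 0.7)]
best_m, best_q = get_starting_rect(points)
```

Note that `range(-10, 10)` stops at 9, so the grid never tries a slope or intercept of exactly 1.0; the `mgrid` version has the same half-open behaviour.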

get_approximate_m now looks like

def get_approximate_m(m, q, points, sensibility):
    if error_in_points(points, line(m+sensibility, q)) < error_in_points(points, line(m, q)):
        m += sensibility
    elif error_in_points(points, line(m-sensibility, q)) < error_in_points(points, line(m, q)):
        m -= sensibility
    return m

You should cache error_in_points(points, line(m, q)):

def get_approximate_m(m, q, points, sensibility):
    current_err = error_in_points(points, line(m, q))

    if error_in_points(points, line(m+sensibility, q)) < current_err:
        m += sensibility
    elif error_in_points(points, line(m-sensibility, q)) < current_err:
        m -= sensibility
    return m

This would also be better named adjust_m or similar.
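To make the saving concrete, here is a small sketch that counts how often the error function runs. The call counter and the two sample points are added purely for illustration and are not part of the reviewed script:

```python
def line(m, q):
    # Return the function x -> m*x + q for a fixed slope m and intercept q.
    def y(x):
        return m * x + q
    return y


calls = {"count": 0}  # illustrative counter, not part of the original code


def error_in_points(points, func):
    calls["count"] += 1
    return sum(abs(y - func(x)) for x, y in points)


def adjust_m(m, q, points, sensibility):
    # The error of the current line is computed once and reused in both tests.
    current_err = error_in_points(points, line(m, q))

    if error_in_points(points, line(m + sensibility, q)) < current_err:
        m += sensibility
    elif error_in_points(points, line(m - sensibility, q)) < current_err:
        m -= sensibility
    return m


# Made-up points lying exactly on y = x, so m = 1.0 cannot be improved.
points = [(0.0, 0.0), (1.0, 1.0)]
new_m = adjust_m(1.0, 0.0, points, 0.1)
```

In this worst case, where neither direction helps, the cached version evaluates the error three times; the uncached version would evaluate it four times, because the current line's error is recomputed in the `elif` test.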

Your approximate_better_m_and_q does:

for _ in range(100):
    m = get_approximate_m(m, q, points, sensibility)
    q = get_approximate_q(m, q, points, sensibility)

Since this does not change sensibility as it progresses, it is likely that it quickly stops doing anything productive and spends most of its time bouncing around a few points. It will probably work better with a gradual slowdown:

def approximate_better_m_and_q(m, q, points, sensibility):
    for _ in range(50):
        m = get_approximate_m(m, q, points, sensibility)
        q = get_approximate_q(m, q, points, sensibility)
        sensibility *= 0.95
    return m, q

This tends to improve the results for me, even though it uses a quarter of the iterations.
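To get a feel for this schedule, the following sketch computes the step sizes the loop above would use, assuming the starting sensibility of 0.1 that main passes in:

```python
initial_sensibility = 0.1  # starting step passed in from main()
decay = 0.95               # per-iteration factor used in the loop above
iterations = 50

# Step size used by each of the 50 iterations, in order.
steps = [initial_sensibility * decay ** i for i in range(iterations)]

first_step = steps[0]
last_step = steps[-1]

# Largest total distance m (or q) could travel if every iteration moved it.
max_travel = sum(steps)
```

The step shrinks from 0.1 to roughly 0.008 by the final iteration, and the steps add up to about 1.85, so each parameter can still move a long way from the brute-force estimate while ending with much finer adjustments.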

plot_rect can trivially use NumPy's broadcasting:

def plot_rect(m, q):
    x = numpy.arange(0, 1, 0.1)
    y = x * m + q
    pyplot.plot(x, y)
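As a quick check that broadcasting gives the same values as the original list comprehension, here is a short sketch; the slope and intercept are arbitrary example values:

```python
import numpy

m, q = 0.5, 0.2  # arbitrary example slope and intercept

x = numpy.arange(0, 1, 0.1)

# Element-by-element version, as in the question.
y_loop = [i * m + q for i in x]

# Broadcast version: the scalars m and q are applied to every element of x.
y_broadcast = x * m + q

same = numpy.allclose(y_loop, y_broadcast)
```

The broadcast result is a NumPy array with the same shape as x rather than a list, which pyplot.plot accepts just as well.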

plot_rect_and_points can use unpacking (as can main):

plot_rect(*rect)

I hadn't realized that error_in_points is only used with line as the function; maybe you should move the line calculation into error_in_points to make calling it more convenient.

You should split your prints up:

print("The rect generated with brute force of equation y = {}x + {}".format(m, q))
print("has an error of {}".format(error_in_points(points, m, q)))

although the trailing space suggests maybe you didn't realize they were printing on separate lines.

All in all this gives

from __future__ import division

import numpy
import random

from matplotlib import pyplot

def error_in_points(points, m, q):
    return sum(abs(y - (m*x + q)) for x, y in points)

def get_starting_rect(points):
    mqs = numpy.mgrid[-1:+1:0.1, -1:+1:0.1].reshape(2, -1).T
    return min(mqs, key=lambda mq: error_in_points(points, *mq))

def adjust_m(m, q, points, sensibility):
    current_err = error_in_points(points, m, q)

    if error_in_points(points, m+sensibility, q) < current_err:
        m += sensibility
    elif error_in_points(points, m-sensibility, q) < current_err:
        m -= sensibility
    return m

def adjust_q(m, q, points, sensibility):
    current_err = error_in_points(points, m, q)

    if error_in_points(points, m, q+sensibility) < current_err:
        q += sensibility
    elif error_in_points(points, m, q-sensibility) < current_err:
        q -= sensibility
    return q

def approximate_better_m_and_q(m, q, points, sensibility):
    for _ in range(50):
        m = adjust_m(m, q, points, sensibility)
        q = adjust_q(m, q, points, sensibility)
        sensibility *= 0.95
    return m, q

def plot_rect(m, q):
    x = numpy.arange(0, 1, 0.1)
    y = x * m + q
    pyplot.plot(x, y)

def plot_rect_and_points(rect, points):
    pyplot.scatter(*zip(*points))
    plot_rect(*rect)

def main():
    NUMBER_OF_POINTS = 1000
    points = [(random.random(), random.random()) for _ in range(NUMBER_OF_POINTS)]

    m, q = get_starting_rect(points)
    plot_rect(m, q)

    print("The rect generated with brute force of equation y = {}x + {}".format(m, q))
    print("has an error of {}".format(error_in_points(points, m, q)))

    better_rect = approximate_better_m_and_q(m, q, points, 0.1)
    m, q = better_rect
    print("The rect generated with linear regression of equation y = {}x + {}".format(m, q))
    print("starting from the rect generated with brute force has an error of {}".format(
        error_in_points(points, m, q)
    ))

    plot_rect_and_points((m, q), points)
    pyplot.show()

if __name__ == "__main__":
    main()

There is a lot more you can do (particularly vectorization), but this is a good start.
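As one possible direction for the vectorization mentioned here, the sketch below compares the pure-Python error function with a NumPy version. The name `error_in_points_vectorized` and the choice to keep the coordinates in two separate arrays are assumptions made for this example, not something from the reviewed code:

```python
import numpy


def error_in_points(points, m, q):
    # Pure-Python version from the review, kept for comparison.
    return sum(abs(y - (m * x + q)) for x, y in points)


def error_in_points_vectorized(xs, ys, m, q):
    # Hypothetical vectorized variant: xs and ys are 1-D NumPy arrays.
    return numpy.abs(ys - (m * xs + q)).sum()


rng = numpy.random.RandomState(0)  # fixed seed so the comparison is repeatable
xs = rng.rand(1000)
ys = rng.rand(1000)
points = list(zip(xs, ys))

slow = error_in_points(points, 0.1, 0.4)
fast = error_in_points_vectorized(xs, ys, 0.1, 0.4)
```

The two results agree up to floating-point rounding. The vectorized form does the arithmetic for all points in one array expression instead of looping in Python, which matters because the search calls the error function several times per step.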

Excellent review, just as personal style I would like to keep my small functions, I like them. — Caridorc, Jan 7 '15 at 14:58
@Caridorc I mentioned merging these functions because they are below a logical unit of work; error_of_list_of_points ended up shorter once you removed the function indirection. I'm totally fine with something like plot_rect, though. — Veedrac, Jan 7 '15 at 15:03

