MIT 6.00 Intro Computer Science (OCW) 16 Online
OpenStudy (anonymous):

Hey folks. Bit stuck again. Working on problem set 10 from the 2011 course. I'm not sure if I'm understanding problem 1 correctly. I've generated a graph for the different k values, but I'm getting two fairly flat lines, indicating that the error values my code is calculating are fairly consistent. Could someone have a quick gander at my code please? It's nowhere near finished; I'm still very much in the figuring-out stage.

OpenStudy (anonymous):

Specifically the graphremovederr function http://pastebin.com/FeJUVVKE

OpenStudy (anonymous):

Syntax highlighting is your friend: http://pastebin.com/pmcqgjbq. Where do you suspect the problem might be? Have you used print statements to inspect the error-prone parts of the code? What type of graph did you expect to see?

OpenStudy (anonymous):

Thanks. Another thing learned; I've just updated the pastebin with Python syntax highlighting. I'm really unsure what I'm doing in this problem, so I'm not sure what my graph should look like. I think my graph makes sense: the points from my holdout set should not be included in a cluster and therefore should have a larger error value. This seems to be what my graph is showing, but I really don't know if it's correct or if I'm way off the mark. I'd love to see someone else's graphs or code just to get a better idea of what's supposed to be happening.

OpenStudy (anonymous):

This is the question:

Problem #1: Graphing Removed Error

In this question, you will need to graph the total error produced by the kmeans(...) function. In order to do this, you must do the following:

1. Iterate over k in increments of 25 from 25 <= k <= 150, and for each k do the following:
   1. Partition your data set into a training and holdout set, where the holdout set should be 20% of all the points.
   2. Using kmeans(...), find the total error of the training set, where the total error is the following equation: Σ (distance of point to the centroid of its encapsulating cluster)**2, where the sum is over all points in the training set. Hint: use the Point.distance(...) method.
   3. Given the holdout set, find the error by calculating the squared distance of each point in the holdout set to its nearest cluster.
2. Once you have these error values:
   1. Graph the error in the training set and the error in the holdout set against k (you can graph these two lines on the same plot).
   2. Also graph the ratio of the error of the holdout set over the total error of the training set. It will make sense to do this in a separate plot.
3. (Hint: It will be much faster if you save all of the error values first and then do your graphing afterwards, rather than recomputing the clusters each time.)
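For what it's worth, the two error quantities in steps 1.2 and 1.3 both boil down to sums of squared distances. Here's a minimal, library-free sketch of both, assuming points are plain coordinate tuples rather than the pset's Point/Cluster objects (so `squared_dist` stands in for Point.distance(...)**2, and the helper names are mine, not the pset's API):

```python
def squared_dist(p, q):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def training_error(clusters):
    """Step 1.2: sum, over all training points, of the squared distance
    from each point to the centroid of its own cluster.
    `clusters` is a list of (centroid, points) pairs, standing in for
    Cluster objects with computeCentroid()/getPoints()."""
    return sum(squared_dist(p, c) for c, pts in clusters for p in pts)

def holdout_error(holdout, centroids):
    """Step 1.3: a holdout point belongs to no cluster, so it
    contributes its squared distance to the *nearest* centroid."""
    return sum(min(squared_dist(p, c) for c in centroids) for p in holdout)
```

The key asymmetry: training points are measured against the centroid of the cluster they were assigned to, while holdout points are measured against whichever centroid happens to be closest.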

OpenStudy (anonymous):

It will take me a bit to get through your two functions; I'll post as I see stuff:
- line 236: I think you have holdout and training swapped. Which one has more points?
- Looks like you got the code for 1.2 correct (lines 241-247).
- 1.3: at line 252 you are using a clusterSet that was created with the trainingSet. Is this correct? The instructions are a bit vague and I haven't watched the lectures. Otherwise 1.3 looks correct (lines 250 to 258).
- And 1.1 looks correct.

OpenStudy (anonymous):

Thanks. I have no access to pastebin at the moment. I spotted my error with the training set: I was passing the wrong probability value to randomPartition. I think I have my code running fairly well now, but I have no idea whether I'm producing the correct results or returning what I think the answer should be. I have the values of k for 25 <= k <= 150 on the x axis and the error summations on the y axis. The larger set drops from 60 to 30, and the holdout set starts at 20 and ends up at around 0. Either I'm not gathering the error information correctly or I'm graphing it wrong.
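Regarding the randomPartition probability mix-up: in this pset it splits the data by flipping a biased coin per point, so the probability you pass is the fraction that lands in the *first* returned set. A sketch of that behaviour (the name matches the pset, but this version is my assumption, not the official code):

```python
import random

def random_partition(points, p):
    """Split `points` into (first, second): each point goes into the
    first list with probability p, otherwise into the second.
    Passing p = 0.8 makes the first list the ~80% training set and
    the second the ~20% holdout set."""
    first, second = [], []
    for point in points:
        if random.random() < p:
            first.append(point)
        else:
            second.append(point)
    return first, second
```

So swapping the probability (0.2 instead of 0.8) silently swaps which set is bigger, which matches the "holdout and training swapped" symptom noted above.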

OpenStudy (anonymous):

NEW GRAPH REMOVED ERROR CODE
***********************************

def graphRemovedErr(points, kvals = [25, 50, 75, 100, 125, 150], cutoff = 0.1):
    """
    Should produce graphs of the error in training and holdout point sets,
    and the ratio of the error of the points, after clustering for the
    given values of k. For details see Problem 1.
    """
    trainingSet, holdOutSet = randomPartition(points, .8)
    errorSet = []
    holdOutErrorSet = []
    for val in kvals:
        clusterSet, maxDist = kmeans(trainingSet, val, cutoff, County)
        # Training error: squared distance of each training point to the
        # centroid of its own cluster.
        errorSummation = 0.0
        for cluster in clusterSet:
            centroid = cluster.computeCentroid()
            # Loop variable renamed so it no longer shadows the `points` parameter.
            for point in cluster.getPoints():
                errorSummation += point.distance(centroid) ** 2
        print 'error', errorSummation
        errorSet.append(errorSummation)
        # Holdout error: squared distance of each holdout point to the
        # nearest centroid.
        holdOutError = 0.0
        for point in holdOutSet:
            minDist = float('inf')  # was 100.0; an arbitrary cap can miss the true minimum
            for cluster in clusterSet:
                centroid = cluster.computeCentroid()
                if point.distance(centroid) < minDist:
                    minDist = point.distance(centroid)
            holdOutError += minDist ** 2
        holdOutErrorSet.append(holdOutError)
    pylab.figure()
    pylab.plot(kvals, errorSet, label='training error')
    pylab.plot(kvals, holdOutErrorSet, label='holdout error')
    pylab.legend()
    pylab.show()

OpenStudy (anonymous):

I didn't watch the lectures, so I don't know the context, but it seems the exercise is a comparison of the analysis results of a subset with the analysis results of the rest of the set. What minimum k produces a sufficient result? Do you get valid results by analyzing a subset? Is the difference in errors acceptable? If so, you can speed up processing by analyzing a subset, at least to get an idea of whether there is anything interesting in the data. If I guessed correctly, I would expect the code to perform k-means clustering and error determination on the holdout set in the same manner as on the training set, but the instructions aren't clear; I bet the lectures contain the answer to that question. The k-means clustering probably takes a long time, so maybe the goal is to determine that producing clusters from a subset is sufficient to analyze the whole dataset. But then again, if you don't have all the data grouped in clusters...? I guess all you can do is confirm that your code matches the specifications; then whatever the data shows you is correct. You could write some tests based on the specs and a known (small) dataset and validate your code.
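One cheap sanity check along those lines: on a tiny hand-made dataset the error values are easy to predict. For instance, if every point is its own centroid, the summed squared error must be exactly 0, and removing centroids can only make the error grow. A sketch using a stand-alone error function (names here are illustrative, not the pset's API):

```python
def total_error(points, centroids):
    """Sum over points of the squared distance to the nearest centroid."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        for p in points
    )

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0)]

# k == number of points: every point sits on a centroid, so error is 0.
assert total_error(points, points) == 0.0

# Fewer centroids can only increase the error.
two = total_error(points, [(0.0, 0.0), (10.0, 10.0)])
one = total_error(points, [(0.0, 0.0)])
assert one > two > 0.0
```

Running the pset code on a toy set like this, with k ranging from 1 up to the number of points, should show the same monotone-decreasing training error; if it doesn't, the error accumulation is suspect.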
