MIT 6.00 Intro Computer Science (OCW) 7 Online
OpenStudy (anonymous):

I'm tackling ps10 now. Since I'm not very good at reading English, there are some passages that I can't interpret properly. What does this sentence in Problem 1 mean? "Σ (distance of point to the centroid of its encapsulating cluster)**2 where the sum is over all points in the training set." 1. The sum of what? 2. Does "the sum is over all points in the training set" mean that the sum of something is over the sum of all distances of points in the training set? I hoped to work out the meaning backwards from the solution, but the solution attached to this problem is the wrong one (it appears to be the solution to ps9).

OpenStudy (anonymous):

Each Point object has a distance attribute. This is the distance of the point from the centroid of its cluster. Sum/accumulate/add all of these distances, then square the sum. It sounds like the beginning of a statistics problem.

OpenStudy (anonymous):

@bwCA Thanks a lot for your reply. I understand the distance method of the Point object, but what confuses me is which values are to be summed. I know I need to accumulate the distances of the error points, but what should I compare with what to judge whether a point is an error or not? From my understanding, one of them may be the "sum of all distances of points in the training set". What is the other one?

OpenStudy (anonymous):

To make sure, I quote the part below:
"""
In this question, you will need to graph the total error produced by the kmeans(…) function. In order to do this, you must do the following:
1. Iterate over k in increments of 25 from 25 <= k <= 150 and for each k do the following:
   1. Partition your data set into a training and holdout set, where the holdout set should be 20% of all the points.
   2. Using kmeans(...), find the total error of the training set, where the total error is the following equation: Σ (distance of point to the centroid of its encapsulating cluster)**2 where the sum is over all points in the training set. Hint: use the Point.distance(...) method.
   3. Given the holdout set, find the error by calculating the squared distance of each point in the holdout set to its nearest cluster.
"""
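Read literally, with the `**2` applied to each distance inside the sum, the quoted formula can be sketched as below. This is only a sketch under assumptions: points and centroids are plain tuples, and `euclidean` and `total_squared_error` are hypothetical names, not functions from the problem set (the real code would presumably go through `Point.distance(...)`).

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def total_squared_error(clusters):
    """Sum, over every point, of (distance to its own cluster's centroid) ** 2.

    `clusters` is assumed to be a list of (centroid, points) pairs.
    """
    total = 0.0
    for centroid, points in clusters:
        for p in points:
            total += euclidean(p, centroid) ** 2
    return total

# Made-up data: two clusters with hand-picked centroids.
clusters = [
    ((0.0, 0.0), [(3.0, 4.0), (0.0, 1.0)]),   # distances 5 and 1
    ((10.0, 10.0), [(10.0, 12.0)]),           # distance 2
]
print(total_squared_error(clusters))  # 5**2 + 1**2 + 2**2 = 30.0
```

"The sum is over all points" here just means the loop visits every point in the training set once.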

OpenStudy (anonymous):

I suspect that there are answers to your questions in the lectures. I have not watched them yet. Do you understand k-means clustering? I scanned the Wikipedia article, http://en.wikipedia.org/wiki/K-means_clustering, and have some ideas. It sounds like they want you to 'clusterize' the data with different values for k (number of clusters):
- the first iteration will have 25 clusters
- each cluster has a list of points
- use randomPartition(l, p) to make the training and holdout sets
- for the training set, each point's 'error' IS its distance to its centroid
- the error of the training set is (the sum of the distances) squared
- the error of the holdout set is the sum of (each distance squared)

It is possible that this is incorrect. It seems likely that those two errors could instead be [the sum of (each distance minus the mean distance)] squared and the sum of [(each distance minus the mean distance) squared], or
\[\left(\sum (d - \mu)\right)^{2}\]
and
\[\sum \left(d - \mu\right)^{2}\]
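The two readings discussed above, (the sum of the distances) squared versus the sum of (each distance squared), give different numbers, which a quick check makes concrete. The distances below are made up for illustration; the formula as written in the problem statement, with `**2` inside the Σ, matches the second quantity.

```python
# Hypothetical distances from three points to their centroids.
distances = [5.0, 1.0, 2.0]

# Reading A: add the distances first, then square the total.
square_of_sum = sum(distances) ** 2

# Reading B: square each distance first, then add them up.
sum_of_squares = sum(d ** 2 for d in distances)

print(square_of_sum)   # (5 + 1 + 2) ** 2 = 64.0
print(sum_of_squares)  # 25 + 1 + 4 = 30.0
```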

OpenStudy (anonymous):

k-means clustering appears to produce clusters that minimize the error of the data in each cluster using a least-squares fit. Ideally each point in a cluster would reside on the mean, but in real life it won't, and the error is the difference between its actual distance and the mean. Those last two equations seem very familiar:
- the square of (the sum of the errors)
- the sum of (the error squared)

I suspect that as the clusters get smaller (k gets bigger) the error will go down.
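The suspicion that the error shrinks as k grows can at least be checked on a toy example. The sketch below does not run k-means; the groupings are chosen by hand for k = 1, 2 and 4, each centroid is taken as the mean of its cluster, and all names (`mean_point`, `squared_distance`, `total_error`) are hypothetical. It shows the trend on this data only, not a proof for every data set.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def squared_distance(p, q):
    """Squared Euclidean distance between two tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def total_error(clusters):
    """Sum of squared distances from each point to its cluster's mean."""
    return sum(
        squared_distance(p, mean_point(pts)) for pts in clusters for p in pts
    )

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

one_cluster = [points]                     # k = 1
two_clusters = [points[:2], points[2:]]    # k = 2, grouped by hand
four_clusters = [[p] for p in points]      # k = 4, one point per cluster

for name, clusters in [("k=1", one_cluster), ("k=2", two_clusters), ("k=4", four_clusters)]:
    print(name, total_error(clusters))
# k=1 104.0
# k=2 4.0
# k=4 0.0
```

When every point is its own cluster, each point sits exactly on its centroid, so the training error is zero.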

OpenStudy (anonymous):

Finally I have got the meaning. I was confused mainly by my interpretation of "error". In this situation, error does not mean [mistake] but [the gap between the target and the actual point]. What is more, your explanation helped me understand the aim of this problem. I greatly appreciate your kindness!

OpenStudy (anonymous):

Cool. After looking at the k-means function again, I went back to my original thought that the 'error' is the actual distance from the point to the centroid of its cluster. The ideal case would be every point in the cluster residing at the centroid. The k-means function creates clusters that minimize these distances.
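For the holdout set the points were never assigned to a cluster, so the problem statement measures each one against its nearest centroid. A minimal sketch of that step, assuming points and centroids are plain tuples and using the hypothetical names `squared_distance` and `holdout_error`:

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def holdout_error(holdout, centroids):
    """Sum over holdout points of the squared distance to the nearest centroid."""
    return sum(
        min(squared_distance(p, c) for c in centroids) for p in holdout
    )

# Made-up centroids (as if taken from clusters built on the training set)
# and made-up holdout points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
holdout = [(1.0, 1.0), (9.0, 10.0), (4.0, 3.0)]

print(holdout_error(holdout, centroids))  # 2 + 1 + 25 = 28.0
```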
