I have 350 document scores that, when I plot them, have this shape:
docScores = [(0, 68.62998962), (1, 60.21374512), (2, 54.72480392),
(3, 50.71389389), (4, 49.39723969), ...,
(345, 28.3756237), (346, 28.37126923),
(347, 28.36397934), (348, 28.35762787), (349, 28.34219933)]
I posted the complete array here on pastebin (it corresponds to the dataPoints list in the code below).

I originally needed to find the elbow point of this L-shaped curve, which I found thanks to this post.
On the following plot, the red vector p represents the elbow point. I would like to find the point x=(?,?) (the yellow star) on the vector b that corresponds to the orthogonal projection of p onto b.

The red point on the plot is the one I obtain (which is obviously wrong). I compute it as follows:
b_hat = b / np.linalg.norm(b) #unit vector of b
proj_p_onto_b = p.dot(b_hat)*b_hat
red_point = proj_p_onto_b + s
If the projection of p onto b is defined by its starting and ending points, namely s and x (the yellow star), it follows that proj_p_onto_b = x - s, and therefore x = proj_p_onto_b + s.
Did I make a mistake here?
EDIT : In answer to @cxw, here is the code for computing the elbow point :
import numpy as np
from operator import itemgetter

def findElbowPoint(self, rawDocScores):
    # list() so the pairs can be indexed and sliced on Python 3 as well
    dataPoints = list(zip(range(0, len(rawDocScores)), rawDocScores))
    s = np.array(dataPoints[0])   # first point of the curve
    l = np.array(dataPoints[-1])  # last point of the curve
    b_vect = l - s
    b_hat = b_vect / np.linalg.norm(b_vect)
    distances = []
    for scoreVec in dataPoints[1:]:
        p = np.array(scoreVec) - s
        proj = p.dot(b_hat) * b_hat
        d = abs(np.linalg.norm(p - proj))  # orthogonal distance between b and the L-curve
        distances.append((scoreVec[0], scoreVec[1], proj, d))
    # entry with the largest orthogonal distance is the elbow
    elbow_x, elbow_y, proj, max_distance = max(distances, key=itemgetter(3))
    red_point = proj + s
EDIT : Here is the code for the plot :
>>> l_curve_x_values = [x[0] for x in docScores]
>>> l_curve_y_values = [x[1] for x in docScores]
>>> b_line_x_values = [x[0] for x in docScores]
>>> b_line_y_values = np.linspace(s[1], l[1], len(docScores))
>>> p_line_x_values = l_curve_x_values[:elbow_x]
>>> p_line_y_values = np.linspace(s[1], elbow_y, elbow_x)
>>> plt.plot(l_curve_x_values, l_curve_y_values, b_line_x_values, b_line_y_values, p_line_x_values, p_line_y_values)
>>> red_point = proj + s
>>> plt.plot(red_point[0], red_point[1], 'ro')
>>> plt.show()
If you are using the plot to visually determine if the solution looks correct, you must plot the data using the same scale on each axis, i.e. use plt.axis('equal'). If the axes do not have equal scales, the angles between lines are distorted in the plot.
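One way to separate a plotting artifact from an actual bug is to check orthogonality numerically, since a dot product does not depend on how the axes are scaled. The following is only a sketch: `s` and `l` are the first and last entries of docScores from the question, while the elbow coordinates are a hypothetical value read roughly off the plot, and `plot_projection` is an illustrative helper name, not something from the original code.

```python
import numpy as np

# Endpoints of b, taken from the first and last entries of docScores.
s = np.array([0.0, 68.62998962])
l = np.array([349.0, 28.34219933])
# Hypothetical elbow point, read approximately off the plot (assumption).
elbow = np.array([50.0, 37.0])

b_hat = (l - s) / np.linalg.norm(l - s)
p = elbow - s
x = s + p.dot(b_hat) * b_hat  # foot of the perpendicular from the elbow onto b

# Orthogonality check that is independent of the plot's axis scaling:
# this should be zero up to floating-point error.
residual = (elbow - x).dot(l - s)


def plot_projection():
    """Draw b, p and the projection with equal axis scaling."""
    # Imported here so the numerical check above runs without matplotlib.
    import matplotlib.pyplot as plt

    plt.plot([s[0], l[0]], [s[1], l[1]], label="b")
    plt.plot([s[0], elbow[0]], [s[1], elbow[1]], label="p")
    plt.plot([elbow[0], x[0]], [elbow[1], x[1]], "--", label="p - proj")
    plt.plot(x[0], x[1], "ro")
    plt.axis("equal")  # same scale on both axes, so right angles look right
    plt.legend()
    plt.show()
```

If `residual` is essentially zero but the red point still looks off in the figure, the distortion most likely comes from the unequal axis scales rather than from the projection formula.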
First of all, is the point at ~(50, 37) p or s+p? If p, that might be your problem right there! If the Y component of your p variable is positive, you won't get the results you expect when you do the dot product.
Assuming that point is s+p, if a bit of Post-It scribbling is correct,
p_len = np.linalg.norm(p)
p_hat = p / p_len
red_len = p_hat.dot(b_hat) * p_len # red_len = |x-s|
# because p_hat . b_hat = 1 * 1 * cos(angle) = |x-s| / |p|
red_point = s + red_len * b_hat
Not tested! YMMV. Hope this helps.
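For what it is worth, `p_hat.dot(b_hat) * p_len` is algebraically the same as `p.dot(b_hat)`, so the snippet above should land on the same point as the `proj + s` computation in the question, provided p is measured relative to s. A small sketch to compare the two, using the real endpoints from docScores and a hypothetical elbow at roughly (50, 37):

```python
import numpy as np

s = np.array([0.0, 68.62998962])    # first data point
l = np.array([349.0, 28.34219933])  # last data point
p = np.array([50.0, 37.0]) - s      # hypothetical elbow, made relative to s

b_hat = (l - s) / np.linalg.norm(l - s)

# Formula used in the question.
x_question = s + p.dot(b_hat) * b_hat

# Formula from this answer.
p_len = np.linalg.norm(p)
p_hat = p / p_len
x_answer = s + p_hat.dot(b_hat) * p_len * b_hat

# True when both formulas give the same point up to rounding.
same_point = np.allclose(x_question, x_answer)
```

If `same_point` is true for your data as well, the two formulas are interchangeable and any remaining discrepancy has to come from how p is defined or from how the result is plotted.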