Moed B exam and solution published.

Grades for Moed B and the final course grade are available under the "exam" tab. Grades will be processed by the Mazkirut soon.

Thanks,

Regev

For those who took Moed A, final grades have been calculated and are available under the "exam" tab. They will be submitted to the Mazkirut soon.

Let me know if there are any issues or appeals.

Thanks!

I tried running the old hw3 code, which used to work fine, and now it also gives a "killed" message at different points.

Is there anything to be done? Is it because of the current overload on the servers?

Can we get some directions on how to implement the calculations and some extra time to do it?

Thanks,

Ofer.

It says:

"What is the log-likelihood of the parameters given the data x1, … , xn?"

But no prior is defined.

I can write

P(data | parameters)

Is that what you meant?
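For reference, the standard definition when no prior is given is just the log of the sampling density, here written under an i.i.d. assumption:

```latex
\ell(\theta) \;=\; \log P(x_1, \dots, x_n \mid \theta) \;=\; \sum_{i=1}^{n} \log P(x_i \mid \theta).
```

So P(data | parameters) is indeed the usual quantity; a prior only enters for MAP estimation.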

The exam period is busy, and it seems this could also make things easier for the grader.

Thanks!

When I took the derivative I got (mu - x)/sigma^2 (meaning I differentiated only the log part).

Did I differentiate incorrectly, or are the scribes wrong?
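For comparison, differentiating the log-density of a Gaussian with respect to the mean gives:

```latex
\log p(x;\mu,\sigma^2) = -\log\bigl(\sigma\sqrt{2\pi}\bigr) - \frac{(x-\mu)^2}{2\sigma^2},
\qquad
\frac{\partial}{\partial\mu}\,\log p(x;\mu,\sigma^2) = \frac{x-\mu}{\sigma^2}.
```

Differentiating with respect to x instead flips the sign to (mu - x)/sigma^2, which may explain the discrepancy.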

Note that this is only the exam grade.

The solution will be posted soon.

It seems that we can't access the file hw5.py as described in the exercise.

Thanks

Guy Oren

I mean, when I initialize the variance parameters I calculate them by computing the covariance matrix of the sample and then taking the average.

Later, when I update the covariances, they become about 1000 times larger than their initial values. Does that make sense? It seems to happen because the update rule depends on a dot product between two vectors (in this case with 784 attributes), which results in very large numbers…

It seems the formula doesn't account for this, but it also seems it will not give us the correct variance values.

can you explain?
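For illustration only (the function name and the spherical-variance assumption are mine, not from the homework): one common way to keep the M-step variance on the right scale is to normalize the squared-distance dot product by both the responsibility mass and the dimension d:

```python
import numpy as np

def m_step_variance(X, resp, mu):
    """Spherical-variance M-step for one mixture component (sketch).

    X    : (n, d) data matrix
    resp : (n,) responsibilities of this component
    mu   : (d,) current mean of this component
    """
    diff = X - mu                        # (n, d)
    sq_dist = np.sum(diff ** 2, axis=1)  # ||x_i - mu||^2 for each sample
    # Dividing by the responsibility mass AND by d keeps the estimate on
    # the scale of a per-coordinate variance; without the 1/d factor the
    # dot product over 784 pixels inflates the value by a factor of ~d.
    return float(resp @ sq_dist) / (float(np.sum(resp)) * X.shape[1])
```

With this normalization the estimate stays comparable to the per-coordinate variance of the data, rather than growing with the number of attributes.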

Can we have an estimate of a feasible run time for the EM program?

On page 4, the EM algorithm for GMM is described.

It is not clear what exactly happens with Q(theta, theta_t) - it seems we do not differentiate it when we evaluate the parameters at the M-step, but rather equation 12.7, which appears later. Is equation 12.7 equal to Q(theta, theta_t)? If it is - why? If not - then what exactly do we differentiate?
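For reference (without seeing equation 12.7, this is only the generic definition), the EM auxiliary function for a mixture with latent assignments z is the expected complete-data log-likelihood, with the responsibilities computed at theta_t held fixed:

```latex
Q(\theta, \theta_t)
= \mathbb{E}_{z \sim p(z \mid x, \theta_t)}\bigl[\log p(x, z \mid \theta)\bigr]
= \sum_{i=1}^{n} \sum_{k=1}^{K} p(z_i = k \mid x_i, \theta_t)\,\log p(x_i, z_i = k \mid \theta),
```

and the M-step maximizes this expression over theta.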

Is the "two bits" example of recitation 12 considered a Naive Bayes approach?

Is Naive Bayes anything done with conditional probabilities, or just the independence assumption on x|y?

What is the difference between Naive Bayes and Bayesian methods?

Thank you :)

In exam 2014/15 B, question 2 subsection d, they ask whether PCA will be affected if we multiply each vector by an orthonormal matrix U.

We thought it wouldn't be affected, since the correlation matrix X^T * X stays the same. However, the answers claim the opposite, stating that X*X^T will change.

Can anyone settle this discrepancy?

Thanks
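A quick numeric check (my own sketch, not from the exam solution) of what an orthonormal transform does: the covariance eigenvalues are preserved, but the covariance matrix itself, and hence the principal directions, rotate by U:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                 # rows are samples
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthonormal matrix
Xt = X @ U.T                                  # each sample x_i replaced by U x_i

evals = np.linalg.eigvalsh(X.T @ X)
evals_t = np.linalg.eigvalsh(Xt.T @ Xt)

# The eigenvalues (explained variances) are unchanged, because
# Xt^T Xt = U (X^T X) U^T is similar to X^T X ...
same_spectrum = np.allclose(evals, evals_t)
# ... but the covariance matrix itself is rotated, so the principal
# directions do change (by exactly U).
same_matrix = np.allclose(X.T @ X, Xt.T @ Xt)
```

So which answer is "right" depends on what "affected" means: the reconstruction error and explained variances are invariant, while the principal directions themselves rotate.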

Could they capture "several circles" in the same sample? (Or any other combination of the ellipses, circles, …)

Also, must the center of these shapes be the origin? If not, I would appreciate an example of a quadratic kernel that separates 2 rings wrapped one around the other, all in the first quadrant (x, y positive).

(The question follows past exams, in which the shapes to be separated always seem to be in the first quadrant, while a "nice separation" for a quadratic kernel is obtained on paper around the origin… and not all the algorithms we saw normalize the data.)

But we learned that the PCA components are the first k columns of the SVD of

Please help…

There are plenty of questions in the exam forum.

Can you please answer?

Thank you.

Could you please restore the original recitation 13 to the site? (It can be added alongside the file that is there now.)

And may we know the reason the file was taken down two days before the exam?

Thanks

1. Recitation 13 was updated. Can you upload the original recitation?

2. The new document's format is docx; can you please upload a PDF version for the Linux users?

Thanks.

I understand how we can find the eigenvectors; what I don't understand is how they help us correctly map the original points into the PCA world.

For example, in the question I mentioned in the title, we are asked to find the slope of the line returned by 1-d PCA.

It is clear in this example that a good vector to map each point onto is (1,5), and all we need is a real-valued line representing all possible coefficients - a line which could be the X axis or the Y axis.

But in the general case it won't be so easy to "see" the right vector to map with, and all we get from PCA is a vector telling us which eigenvectors to pick. How does this vector help us map?
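A sketch of the mapping step (the function name is mine): once the top-k eigenvectors are collected as columns of V, the "PCA world" coordinates are just the inner products of each centered point with those eigenvectors:

```python
import numpy as np

def pca_project(X, k):
    """Map each row of X to its coordinates in the top-k PCA basis."""
    Xc = X - X.mean(axis=0)                   # center the data
    evals, evecs = np.linalg.eigh(Xc.T @ Xc)  # eigh: ascending eigenvalues
    V = evecs[:, ::-1][:, :k]                 # top-k eigenvectors as columns
    return Xc @ V                             # (n, k) projected coordinates
```

For the example in the question, points on the line spanned by (1,5) project to coordinates along that single direction, so the 1-d projection preserves all of the variance.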

Is it possible for the decision tree algorithm we were taught to choose a dimension more than once?

For example, with 2-d points, the first predicate is x1>5?; if the answer is yes, the second is x2>3?, and after that x1>10?

Thanks
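Just to illustrate the structure being asked about (my own sketch; the labels are arbitrary): the tree described tests x1 twice along a single path:

```python
def tree_predict(x):
    """The tree from the example above; x = (x1, x2). Labels are arbitrary."""
    if x[0] > 5:                            # first predicate: x1 > 5 ?
        if x[1] > 3:                        # second predicate: x2 > 3 ?
            return 1 if x[0] > 10 else -1   # x1 tested again: x1 > 10 ?
        return -1
    return 1
```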

But shouldn't it be without the ^2:

Otherwise I don't get the result:

Thanks

I didn't understand, for example, section 2a (in this question) about the decision tree:

What does it mean that "now the best classifier is v+x0"?

If, for example, our classifier is "if a<0 then y = -1, else 1", we have points x1 = -1 and x2 = 1, and our tree has only 1 decision node, then x1 will get y1 = -1 and x2 will get y2 = 1.

Now let's choose x0 = 2.

For the same tree we will now get that T(x1) and T(x2) have the same labels - which means a different classification.

Where am I wrong?

In the last lecture's PDF there is a generalization bound for ERM. As I remember, the bounds we saw were for determining the needed sample size. How do we get that bound? Where is its proof/explanation?

Thanks!

What am I missing?

In question 6, we are given a prior distribution P over H, a distribution D over x, and a target function chosen from H using P.

We are requested to answer a few questions about h_ml, the hypothesis we get from using Maximum Likelihood.

I'm not quite sure what h_ml is. Is it argmax { sum_i log(P(x_i, h)) : h in H }?

Thanks

Since a_i = (V^T)x_i, then (A^T)A = (V^T)(X^T)XV. We know that the solution V is a matrix whose i'th column is the eigenvector u_i with eigenvalue lambda_i (of (X^T)X), and therefore (V^T)(X^T)XV = (V^T)Z, where Z is a matrix with lambda_i*u_i as its i'th column. Finally, (V^T)Z = Diag(lambda_1, … , lambda_r), since V is orthonormal - and not I_r as stated in the scribe.

Am I missing anything?

Thanks!
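Written out, the computation described above (in the same notation) is:

```latex
A^\top A
= V^\top X^\top X V
= V^\top \bigl[\lambda_1 u_1 \;\cdots\; \lambda_r u_r\bigr]
= \operatorname{Diag}(\lambda_1, \dots, \lambda_r),
```

since the columns of V are the orthonormal u_i, so V^\top u_i = e_i.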

Would the method of computing X^T X = Sigma and then working with the eigenvectors of Sigma (those with the largest eigenvalues) not work?

Thanks!

Shouldn't the answer to 2 be 15?

If I understand correctly, we need to estimate P(Y=0), and then (2^{3}-1)*2 more parameters, which gives 15.

(Where 2^{3}-1 comes from all 2^{3} input combinations, minus 1 since they sum to 1. We multiply by 2 since we have a conditional table for each of Y=0 and Y=1.)

Thanks
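The count being proposed, written out (assuming three binary features and a full, non-naive conditional table per class):

```latex
\underbrace{1}_{P(Y=0)}
+ \underbrace{(2^3 - 1)}_{P(x \mid Y=0)}
+ \underbrace{(2^3 - 1)}_{P(x \mid Y=1)}
= 1 + 7 + 7 = 15.
```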

]]>can you please explain it to me?

thx!

In the theory described in lecture 11, slides 33-34, it says that we need to maximize (1/n)*Sigma(log P(xi)).

Yet later on the slides, and in the recitation, it seems that we didn't include the 1/n in our calculations.

Why? Where did it go? For example, slide 37 says we maximize Sigma(log P(x; theta1, theta2)).
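One standard observation here: multiplying an objective by a positive constant does not move its maximizer, so the 1/n can be dropped without changing the result:

```latex
\arg\max_{\theta}\; \frac{1}{n} \sum_{i=1}^{n} \log P(x_i; \theta)
\;=\; \arg\max_{\theta}\; \sum_{i=1}^{n} \log P(x_i; \theta).
```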

On page 3 it is said that when calculating the derivative of the loss function with respect to w1, the first three expressions are exactly delta2. But delta2 is the derivative of the loss function with respect to w2, not with respect to z1 as in the first three expressions. Actually, we need to replace the final multiplication by z1 in delta2 with w2 to get the result of the first three expressions. Am I missing anything?

Thanks!

I have 3 main questions:

1) In 2013B I didn't understand the answer to 5a. Why is the answer yi/xi-a and not y-xa?

Because of that, I didn't know how to answer 5a in 2013C.

2) In 2013C I didn't understand the answer to 7.

How is the regression line computed? And how does PCA compute its line?

Did we learn it?

Because of that, I didn't know how to answer 7 in 2013C.

3) What is the answer to:

Thanks.

In the scribes, it says that if we have a hyperplane such that:

wx >= 1 ==> c(x) = 1

wx <= 1 - gamma ==> c(x) = 0

then the margin is gamma. Isn't it gamma/2? And what is the meaning of the bias 1 here? Why not just 0?

(I took w=(0,1), gamma=1/2 as an example.)
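A small computation that may help locate the discrepancy: the Euclidean distance between the two hyperplanes is

```latex
\operatorname{dist}\bigl(\{w x = 1\},\; \{w x = 1 - \gamma\}\bigr) = \frac{\gamma}{\lVert w \rVert},
```

so with w = (0,1) and gamma = 1/2, the band between x2 = 1 and x2 = 1/2 has width 1/2. Whether "margin" means this full band or half of it (the distance from a midpoint separator to the nearest point) depends on the definition the scribes use.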

We assumed the realizable case - how can we handle the unrealizable case?

If there is a question like that on the test - is the right answer to find an ERM algorithm that minimizes the observed error, and then use what we learned about ERM to say that it's PAC learnable?

Thanks!

1. I uploaded last year's exams' solutions. I also updated the 2014B solution; there was a mistake in the previous solution, in 2.2.b.

2. I uploaded two student solutions for HW4. As before, they are not thoroughly checked, so there might be errors. I might upload more soon. Note that some of the questions (kernel PCA and AdaBoost) appeared in last year's exam, so there is a formal solution there. Many thanks to the students who agreed!

3. HW5 had a mistake in Q2 (p and p-hat were reversed). I uploaded a new version with that correction.

4. There is no official scribe for the last recitation. I uploaded a student summary (thanks!) - if someone wants to send me another one, that would be nice.

Regev

How many questions do we need to solve?

I understand we can bring an A4 page with us - 2 sides? 1 side? (3 sides ;) )

I understand why proof 1 is wrong: you can't say anything about the probability of not "falling" between the min and max, because that's exactly what defines the min and max.

But once you define min' and max', which depend only on theta and epsilon (and not on the sample), why do you need to separate into two cases and then use the union bound, instead of asking about Prob[x1,…,xm not in [min', max']] and getting a better bound?

Thanks!

But shouldn't it be:

Or at least something similar that takes into account the rest of the max function, as defined in the loss function?

If not, why can we take only the max for the classification?

Thanks!

Above this inequality, didn't we show that event B happens with probability less than 2^(-em) anyway? What am I missing here?

1. Is it right that for d large enough, a polynomial kernel classifier will correctly classify all the samples it is given? (Even if they do not seem "separable".)

2. We've heard that the RBF kernel behaves like Nearest Neighbor - my question is, first, why and how? Second, how do we use the NN algorithm to create a hypothesis and classify a given sample?

Thanks!
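A toy sketch (function names are mine) of why a narrow RBF kernel mimics 1-NN: the prediction is a kernel-weighted vote over the training labels, and as sigma shrinks the nearest training point's weight dominates all the others:

```python
import numpy as np

def rbf_predict(X_train, y_train, x, sigma):
    """Kernel-weighted vote with an RBF kernel; the weights are
    stabilized by subtracting the minimal squared distance."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-(d2 - d2.min()) / (2 * sigma ** 2))
    return np.sign(w @ y_train)

def nn_predict(X_train, y_train, x):
    """Plain 1-nearest-neighbor label, for comparison."""
    return y_train[np.argmin(np.sum((X_train - x) ** 2, axis=1))]
```

For small sigma the two predictors agree on essentially every query point, which is the sense in which RBF "behaves like" nearest neighbor; the second function also shows how NN itself classifies a sample (copy the label of the closest training point).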

In previous exams it seems there are no 'heavy' proofs (with 7/12 epsilon, etc.).

Can we assume such proofs will not appear in this year's exam too? (The course is meant to be more theoretical this year.)

A little question that bothers me - in scribe 5, page 6, it says that if $\sum a_i y_i \neq 0$ then $L(a) \rightarrow -\infty$. Is it because we are "free to choose b"?

thanks!

Why does the multiplication equal the trace result? The result is a d*d matrix.

There are no answers for question 1 in 2014B, so I would like to check mine:

1. For decision trees - we easily can, by splitting each time according to one of the axes of the square/rectangle in the middle.

2. 3-NN - I think we still can, and it'll be like a circle in the middle, but I'm not sure.

3. Poly kernel with d=2 - We can, and it'll be a circle around the middle.

I published a student solution for HW3 (thanks!). In the upcoming couple of days we will hopefully publish another solution for HW3, solutions for HW4, and the solutions for last year's exam. We hope it will be helpful to you.

The HW solutions are informal - they are by students and may contain mistakes (this goes in particular for HW3-4, which we preferred to post as early as possible). So take them with a grain of salt and compare solutions if you are unsure.

Regev

What am I missing - should the label of all of the sample points be 1 instead of -1?

What happens in the case where the rectangle can label samples to be *either negative or positive*? Same as before, but this time we can choose the "sign" of the rectangle.

I could prove VC-dim >= 5, but could not find a way to prove it's not bigger (VC-dim <= 5).

Any hint or direction will be appreciated, thanks!

In the scribes of lecture 3, the last equation on page 11 (a bit after the Sauer-Shelah Lemma):

Why is it true to replace $log (m)$ with $log \frac{d}{\epsilon}$?

Thanks

1. In the scribes of lecture 3, Theorem 3.3 - I think the first S in the Radon Theorem should be S', right? (conv(S') ∩ conv(S \ S') …

It appears again that way in Claim 3.4.

2. Also, are the indices on the last line of the Radon Theorem proof correct?

3. The proofs in the scribes that have a line on the left are for additional knowledge, right? They were not in the lectures, so we are not expected to memorize them?

Today we received the HW2 grades.

Our grade dropped mainly because of the practical part's explanations.

We didn't know how much to elaborate, and we did the same on HW3, because we didn't get feedback on HW2.

Can you consider this in the HW3 grading?