Note on an item-item recommender system using matrix factorisation

Collaborative filtering for product recommendations tends to focus on user-item recommendations – given a user, try to find the products they might like based on the products they have already interacted with.  While personalisation like this is often preferable, sometimes we would just like to know what products to recommend if a user lands on a given product page.  There is far less written about this in the literature, so I would like to share some thoughts (and maths).

Assume we have u users and p products.  We can build a u \times p ratings matrix R, where r_{ij} is the rating given by user u_i to product p_j.  This can either be made up of ratings entered explicitly by a user, or inferred implicitly from the way they interact with the product.  This matrix will generally be large and sparse, since most users will only interact with a small proportion of the products available.
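
As a concrete sketch (with made-up interaction data, and using scipy purely for illustration), the ratings matrix can be assembled from (user, product, rating) triples along these lines:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical (user_index, product_index, rating) triples -- in practice these
# would come from explicit ratings or implicit interaction signals.
users    = np.array([0, 0, 1, 2, 2, 3])
products = np.array([1, 3, 0, 1, 2, 3])
ratings  = np.array([5.0, 3.0, 4.0, 2.0, 5.0, 1.0])

u, p = 4, 5  # number of users and products

# Sparse u x p ratings matrix R; unobserved entries are simply absent.
R = coo_matrix((ratings, (users, products)), shape=(u, p))
print(R.toarray())
```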

A standard approach to filling in the unknown entries of the matrix is matrix factorisation.  Briefly, this assumes that there are only a small number of latent factors for users and products, which we can use to infer an unknown rating via matrix multiplication.  That is, we assume

\begin{pmatrix} \, & \, & \, & \, & \, \\ \, & \, & \, & \, & \, \\ \, & \, & R & \, & \, \\ \, & \, & \, & \, & \, \\ \, & \, & \, & \, & \, \end{pmatrix} = \begin{pmatrix} \, & \, & \, \\ \, & \, & \, \\ \, & U & \, \\ \, & \, & \, \\ \, & \, & \, \end{pmatrix} \begin{pmatrix} \, & \, & \, & \, & \, \\ \, & \, & P & \, & \, \\ \, & \, & \, & \, & \, \end{pmatrix} ,

where U,\,P are u \times r and r \times p matrices (respectively), for some small rank r.  This gives us a latent vector for each product, namely the corresponding column of P.  It also allows us to estimate the unknown ratings, i.e. the user-product pairs that have not been rated.  This is a standard technique, and for large sparse R the factorisation can be computed approximately using the alternating least squares algorithm, e.g. in Spark.
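
As a rough sketch of how this looks in Spark (assuming an active Spark session and a DataFrame ratings_df with columns userId, productId and rating — the column names and parameter values here are illustrative):

```python
from pyspark.ml.recommendation import ALS

# ratings_df is assumed to be a DataFrame with columns "userId", "productId", "rating".
als = ALS(
    rank=10,              # the small rank r of the factorisation
    maxIter=10,
    regParam=0.1,
    userCol="userId",
    itemCol="productId",
    ratingCol="rating",
)
model = als.fit(ratings_df)

# model.userFactors holds one latent vector per user (the rows of U);
# model.itemFactors holds one latent vector per product (the columns of P).
item_factors = model.itemFactors  # DataFrame with columns "id" and "features"
```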

If our aim is item-item recommendations, another standard approach is to put the items in a latent space, and measure the cosine distance between them.  So if we have vectors p_i, p_j, normalised to have norm 1, then we take

\text{distance}(p_i, p_j) = 1 - p_i \cdot p_j.

So a natural thing to consider is just taking the cosine distance between the columns of P (this is what was suggested in a Stack Overflow answer).  This seems problematic: for example, the product UP is invariant under scaling a column of U by 100 and scaling the corresponding row of P by 0.01, but the dot products between the columns of P will generally change if we do this.  Really we would like to take the dot products between the columns of R.  Unfortunately, this matrix can be very large and dense – imagine if we had millions of products and millions of users.  Even in a distributed computation, calculating even a subset of the dot products between the columns will be expensive.
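
A small numpy sketch (with arbitrary random factors) makes this concrete: rescaling a column of U and the matching row of P leaves UP unchanged, but moves the cosine similarities between the columns of P.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 3))   # u x r
P = rng.normal(size=(3, 5))   # r x p

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rescale column 0 of U by 100 and row 0 of P by 0.01.
U2 = U.copy(); U2[:, 0] *= 100.0
P2 = P.copy(); P2[0, :] *= 0.01

print(np.allclose(U @ P, U2 @ P2))   # True: the ratings matrix R = UP is unchanged
print(cosine(P[:, 0], P[:, 1]))      # cosine between two columns of P...
print(cosine(P2[:, 0], P2[:, 1]))    # ...is generally different after rescaling
```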

Since U and P have low rank, we might hope to find some matrix with the same dimensions as P whose columns have the same pairwise dot products as the columns of R.  Fortunately, this can be done with a little linear algebra.

Suppose we want to take the dot product between column i and column j of R.  Writing p_i for the i-th column of P, column i of R is U p_i, so this is

\begin{aligned} (U p_i) \cdot (U p_j) &= (U p_i)^\intercal (U p_j) \\ &= p_i^\intercal ( U^\intercal U) p_j \end{aligned}.

U^\intercal U is an r \times r matrix, so it would be awesome if we could find an r \times r matrix M such that M^\intercal M = U^\intercal U, since then the dot product between column i and column j of R is the dot product between column i and column j of MP, which is a much smaller calculation.  We now turn to the singular value decomposition of U.  Assuming that U has rank r, we can write U = V_1 \Sigma V_2, where V_1 is a u \times u orthogonal matrix whose columns are eigenvectors of U U^\intercal, V_2 is an r \times r orthogonal matrix whose rows are eigenvectors of U^\intercal U, and \Sigma is the u \times r diagonal matrix (not square, but with zeros off the leading diagonal) whose entries are the square roots of the eigenvalues of U^\intercal U.  Exercise for the reader: show that \Sigma is a real matrix, i.e. that for any U, U^\intercal U has real non-negative eigenvalues.  N.B. this decomposition has an implementation in Spark.
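
A minimal numpy illustration of the decomposition (using the thin SVD, which keeps only the first r columns of V_1 and the r \times r block of \Sigma, with a random U standing in for the real user factors):

```python
import numpy as np

rng = np.random.default_rng(1)
u, r = 6, 3
U = rng.normal(size=(u, r))

# Thin SVD: V1 is u x r here, s holds the singular values, V2 is r x r.
V1, s, V2 = np.linalg.svd(U, full_matrices=False)

# The singular values are the (non-negative) square roots of the eigenvalues of U^T U.
eigvals = np.linalg.eigvalsh(U.T @ U)
print(np.allclose(np.sort(s**2), np.sort(eigvals)))   # True

print(np.allclose(U, V1 @ np.diag(s) @ V2))           # True: U = V1 Sigma V2
```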

Then note that

\begin{aligned} U^\intercal U &= (V_1 \Sigma V_2)^\intercal (V_1 \Sigma V_2) \\ &= V_2^\intercal \Sigma^\intercal V_1^\intercal V_1 \Sigma V_2 \\ &= V_2^\intercal \Sigma^\intercal \Sigma V_2 \\ &= M^\intercal M, \end{aligned}

where M = \Sigma V_2.  We aren’t quite there, because M is a u \times r matrix.  But the lower u - r rows of \Sigma are all zero, so if we take \Sigma' to be the upper r \times r block of \Sigma and set M' = \Sigma' V_2 (an r \times r matrix), it is easy to check that M'^\intercal M' = M^\intercal M, and we are done.  So to get our item-item similarities we can look at the dot products between the columns of the matrix \Sigma' V_2 P.
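
Putting the pieces together, a quick numpy check (again with random factors standing in for real ones) confirms that the columns of \Sigma' V_2 P have the same pairwise dot products as the columns of R = UP:

```python
import numpy as np

rng = np.random.default_rng(2)
u, r, p = 6, 3, 5
U = rng.normal(size=(u, r))
P = rng.normal(size=(r, p))
R = U @ P

# Thin SVD of U: Sigma' is the r x r diagonal block, V2 is r x r.
_, s, V2 = np.linalg.svd(U, full_matrices=False)
M_prime = np.diag(s) @ V2          # r x r

# Gram matrices: all pairwise dot products between columns.
gram_R  = R.T @ R                            # p x p, expensive when R is huge and dense
gram_MP = (M_prime @ P).T @ (M_prime @ P)    # same result from much smaller matrices

print(np.allclose(gram_R, gram_MP))   # True
```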

Thanks to Jeremy Mitchell, a colleague at Qubit, for an enlightening discussion on the content of this post.
