ListNet: A Listwise Approach of Learning to Rank (2007)
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2007-40.pdf
Problem description
Training set:
- A set of queries $Q = \{ q^1, q^2, \dots, q^m \}$.
- Each query $q^i$ is associated with a list of documents $d^i = \{ d_1^i, d_2^i, \dots, d_{n^i}^i \}$.
- Each list of documents is associated with a list of scores $y^i = \{ y_1^i, y_2^i, \dots, y_{n^i}^i \}$. For example, $y_j^i$ can be the number of clicks on $d_j^i$ for query $q^i$ at a search engine.
- A feature vector $x_j^i = X(q^i, d_j^i)$ is extracted from each query-document pair, giving $x^i = \{ x_1^i, x_2^i, \dots, x_{n^i}^i \}$.
- Training set $T = \{ (x^1, y^1), \dots, (x^m, y^m) \}$.
Objective:
- The ranking function $f$: for each feature vector $x_j^i$ it outputs a score $z_j^i = f(x_j^i)$, so $z^i = \{ f(x_1^i), \dots, f(x_{n^i}^i) \}$.
- Loss function: minimize $\sum_{i=1}^m L(y^i, z^i)$ over $f$, where $L$ is a listwise loss function.
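To make the data layout concrete, here is a minimal sketch in Python/NumPy of a training set and a scoring function; the array shapes, the random data, and the linear scorer are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_query(n_docs, n_features=5):
    # x^i: feature vectors x_j^i = X(q^i, d_j^i); y^i: e.g. click counts (illustrative).
    x = rng.normal(size=(n_docs, n_features))
    y = rng.integers(0, 10, size=n_docs).astype(float)
    return x, y

# Training set T = {(x^1, y^1), ..., (x^m, y^m)}; lists may have different lengths n^i.
T = [make_query(n) for n in (4, 6, 5)]

def f(x, w):
    """Ranking function: scores z_j^i = f(x_j^i) for every document in the list."""
    return x @ w

w = rng.normal(size=5)
x1, y1 = T[0]
z1 = f(x1, w)   # z^i = (f(x_1^i), ..., f(x_{n^i}^i))
print(z1)
```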
Listwise loss function: Probability Models
Issue with optimizing NDCG directly: it is discontinuous (hence non-differentiable) with respect to the model scores.
Solution from the listwise approach: define the loss function via probability distributions over permutations.
- Define a family of probability distributions over permutations, parameterized by a score list $s$: $P_s(\pi) = P(s, \pi)$, with $\sum_{\pi \in \Omega} P_s(\pi) = 1$, where $\Omega$ is the set of all $n!$ permutations.
- Let $\pi_{opt}$ be the permutation that sorts the scores $y$ in descending order; then $P_y(\pi_{opt})$ is the highest of all permutation probabilities, and $P_y(reverse(\pi_{opt}))$ is the lowest.
- $L(y, z) = - \sum_{\pi \in \Omega} P_y(\pi) \log P_z(\pi)$, the cross entropy between $P_y$ and $P_z$; it differs from $KL(P_y \| P_z)$ only by the entropy of $P_y$, which is constant in $z$, so minimizing one minimizes the other.
General Permutation Probability
A general permutation distribution: $P_s(\pi) = \prod_{j=1}^n \frac{\phi(s_{\pi(j)})}{ \sum_{k=j}^n \phi(s_{\pi(k)}) }$, where $\phi$ is any increasing and strictly positive function and $\pi(j)$ is the index of the object ranked at position $j$.
Example: $s = (s_1, s_2, s_3)$, $\pi = (2, 1, 3)$: $P_s(\pi) = \frac{\phi(s_2)}{\phi(s_2) + \phi(s_1) + \phi(s_3)} \times \frac{\phi(s_1)}{\phi(s_1) + \phi(s_3)} \times \frac{\phi(s_3)}{\phi(s_3)}$.
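A small sketch (assuming $\phi = \exp$, which the learning method below also uses) that computes $P_s(\pi)$ for the example above and checks that the probabilities over all $3! = 6$ permutations sum to 1:

```python
import itertools
import math

def permutation_prob(s, pi, phi=math.exp):
    """P_s(pi) = prod_j phi(s_{pi(j)}) / sum_{k >= j} phi(s_{pi(k)})."""
    prob = 1.0
    for j in range(len(pi)):
        num = phi(s[pi[j]])
        den = sum(phi(s[pi[k]]) for k in range(j, len(pi)))
        prob *= num / den
    return prob

s = [1.0, 3.0, 2.0]                      # s = (s_1, s_2, s_3)
print(permutation_prob(s, (1, 0, 2)))    # pi = (2, 1, 3) written with 0-based indices
total = sum(permutation_prob(s, pi) for pi in itertools.permutations(range(3)))
print(total)                             # ~1.0: the probabilities form a distribution
```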
Top One Probability
The problem is that computing the loss $L(y, z) = - \sum_{\pi \in \Omega} P_y(\pi) \log P_z(\pi)$ still requires the $n!$ permutation probabilities.
ListNet approach: Top One Probability
$P_s(j) = \sum_{\pi \in \Omega,\, \pi(1) = j} P_s(\pi) = \frac{\phi(s_j)}{\sum_{k=1}^n \phi(s_k)}$, the probability that object $j$ is ranked first.
For any score list $s = \{ s_1, \dots, s_n \}$:
- $\sum_{j=1}^n P_s(j) = 1$.
- If $s_j > s_k$, $P_s(j) > P_s(k)$.
Define the loss function based on the top one probabilities: $L(y, z) = - \sum_{j=1}^n P_y(j) \log P_z(j)$.
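A minimal sketch of this loss in Python/NumPy, assuming $\phi = \exp$ (so the top one probabilities form a softmax over the scores); the example scores are made up:

```python
import numpy as np

def top_one_probs(s, phi=np.exp):
    """Top one probability P_s(j) = phi(s_j) / sum_k phi(s_k) for each j."""
    v = phi(np.asarray(s, dtype=float))
    return v / v.sum()

def listwise_loss(y, z):
    """Cross entropy between the top one distributions of y (labels) and z (model scores)."""
    p_y = top_one_probs(y)
    p_z = top_one_probs(z)
    return -np.sum(p_y * np.log(p_z))

y = [3.0, 1.0, 0.0]          # ground-truth scores, e.g. click counts
z = [2.5, 0.8, 0.1]          # model scores z_j = f(x_j)
print(listwise_loss(y, z))
```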
Learning Method: ListNet
Let $\phi(s_j) = \exp(s_j)$; then $P_s(j) = \frac{\exp(s_j)}{\sum_{k=1}^n \exp(s_k)}$, i.e. a softmax over the scores.
For $z_j^i = f_\omega(x_j^i)$, let $f_\omega$ be a neural network with parameters $\omega$ (a linear model in the paper's experiments), trained by gradient descent on the listwise loss above.
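An end-to-end sketch of ListNet training under these choices, assuming a linear scoring function $f_\omega(x) = \omega^\top x$, synthetic data, and plain gradient descent; everything except the loss and its gradient is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    v = v - v.max()                      # for numerical stability
    e = np.exp(v)
    return e / e.sum()

# Synthetic training set T = {(x^i, y^i)}: feature matrices and relevance scores (illustrative).
def make_query(n_docs, n_features, w_true):
    x = rng.normal(size=(n_docs, n_features))
    y = x @ w_true + 0.1 * rng.normal(size=n_docs)   # noisy "true" relevance
    return x, y

n_features = 5
w_true = rng.normal(size=n_features)
T = [make_query(n, n_features, w_true) for n in (4, 6, 5, 7)]

w = np.zeros(n_features)                 # parameters of the linear scorer f_w(x) = w . x
lr = 0.1

for epoch in range(200):
    for x, y in T:
        z = x @ w                        # model scores z_j = f_w(x_j)
        p_y = softmax(y)                 # top one probabilities of the labels
        p_z = softmax(z)                 # top one probabilities of the scores
        # Gradient of the cross entropy L(y, z) = -sum_j P_y(j) log P_z(j) w.r.t. w.
        grad = x.T @ (p_z - p_y)
        w -= lr * grad

# After training, documents should be ranked roughly by their true relevance.
x, y = T[0]
print(np.argsort(-(x @ w)), np.argsort(-y))
```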