ListNet: A Listwise Approach to Learning to Rank (2007)

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2007-40.pdf

Problem description

Training set:

  • A set of queries $Q = \{ q^1, q^2, \dots, q^m \}$.
  • Each query $q^i$ is associated with a list of documents $d^i = \{ d_1^i, d_2^i, \dots, d_{n^i}^i \}$.
  • Each list of documents is associated with a list of scores $y^i = \{ y_1^i, y_2^i, \dots, y_{n^i}^i \}$, where $y_j^i$ can be, for example, the number of clicks on $d_j^i$ for query $q^i$ logged at a search engine.
  • Each query-document pair is represented by a feature vector $x_j^i = X(q^i, d_j^i)$, collected into $x^i = \{ x_1^i, x_2^i, \dots, x_{n^i}^i \}$.
  • Training set $T = \{ (x^1, y^1), \dots, (x^m, y^m) \}$ (see the sketch below).
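As a concrete, hypothetical illustration of this layout (feature values and click counts are made up):

```python
import numpy as np

# One query q^i with n^i = 3 candidate documents, each represented by a
# 4-dimensional feature vector x_j^i = X(q^i, d_j^i) (made-up values).
x_i = np.array([
    [0.2, 1.0, 0.0, 3.5],   # x_1^i
    [0.9, 0.1, 1.0, 2.0],   # x_2^i
    [0.4, 0.7, 0.0, 1.1],   # x_3^i
])

# Relevance scores y^i for the same documents, e.g. click counts.
y_i = np.array([12.0, 3.0, 7.0])

# The full training set pairs one such (x^i, y^i) per query.
T = [(x_i, y_i)]
```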

Objective:

  • The ranking function $f$: for each feature vector $x_j^i$ it outputs a score $z_j^i = f(x_j^i)$, giving $z^i = \{ f(x_1^i), \dots, f(x_{n^i}^i) \}$.
  • Loss function: learn $f$ by minimizing $\sum_{i=1}^m L(y^i, z^i)$, where $L$ is a listwise loss function.

Listwise loss function: Probability Models

Issue with NDCG: it is a discontinuous function of the model scores, so it cannot be optimized directly with gradient methods.

Solution from the listwise approach: a loss function based on probability distributions over permutations.

  • Define a family of distributions over permutations, parameterized by a list of scores $s$: $P_s(\pi) = P(s, \pi)$, with $\sum_{\pi \in \Omega} P_s(\pi) = 1$, where $\Omega$ is the set of all permutations.
  • Let $\pi_{opt}$ be the permutation that sorts the scores $y$ in decreasing order; then $P_y(\pi_{opt})$ is the highest probability and $P_y(\mathrm{reverse}(\pi_{opt}))$ is the lowest.
  • $L(y, z) = KL(P_y, P_z) = - \sum_{\pi \in \Omega} P_y(\pi) \log P_z(\pi)$. (Strictly, this is the cross entropy; it differs from the KL divergence only by the entropy of $P_y$, which is constant in $z$.)

General Permutation Probability

A general permutation distribution: $P_s(\pi) = \prod_{j=1}^n \frac{\phi(s_{\pi(j)})}{ \sum_{k=j}^n \phi(s_{\pi(k)}) }$, where $\phi$ is any increasing, strictly positive function.

Example: $s = (s_1, s_2, s_3)$, $\pi = (2, 1, 3)$ (i.e. object 2 is ranked first, then object 1, then object 3): $P_s(\pi) = \frac{\phi(s_2)}{\phi(s_2) + \phi(s_1) + \phi(s_3)} \times \frac{\phi(s_1)}{\phi(s_1) + \phi(s_3)} \times \frac{\phi(s_3)}{\phi(s_3)}$.
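A direct transcription of this product into code (a minimal sketch; `permutation_probability` is a hypothetical helper, and `phi` defaults to `exp`, the choice ListNet makes below):

```python
import numpy as np

def permutation_probability(s, pi, phi=np.exp):
    """P_s(pi) = prod_j phi(s_{pi(j)}) / sum_{k >= j} phi(s_{pi(k)})."""
    s = np.asarray(s, dtype=float)
    vals = phi(s[list(pi)])               # phi(s_{pi(1)}), ..., phi(s_{pi(n)})
    # Denominator at step j: sum over the objects not yet ranked.
    tails = np.cumsum(vals[::-1])[::-1]
    return float(np.prod(vals / tails))

# The example above, with pi = (2, 1, 3) written 0-based as (1, 0, 2).
print(permutation_probability([1.0, 2.0, 0.5], (1, 0, 2)))
```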

Top One Probability

The problem: computing $KL(P_y, P_z) = - \sum_{\pi \in \Omega} P_y(\pi) \log P_z(\pi)$ directly still requires evaluating all $n!$ permutation probabilities, which is intractable.

ListNet approach: Top One Probability

The top one probability of object $j$ is the probability that it is ranked first, and it has a closed form: $P_s(j) = \sum_{\pi \in \Omega,\, \pi(1) = j} P_s(\pi) = \frac{\phi(s_j)}{\sum_{k=1}^n \phi(s_k)}$.

Let $s = \{ s_1, \dots, s_n \}$; then (see the verification sketch after this list):

  • $\sum_{j=1}^n P_s(j) = 1$, so the top one probabilities form a distribution over the $n$ objects.
  • If $s_j > s_k$, then $P_s(j) > P_s(k)$.
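The closed form can be checked against the brute-force sum over all permutations, reusing the hypothetical `permutation_probability` sketch above:

```python
from itertools import permutations
import numpy as np

def top_one_probability(s, j, phi=np.exp):
    """Closed form: P_s(j) = phi(s_j) / sum_k phi(s_k)."""
    s = np.asarray(s, dtype=float)
    return float(phi(s[j]) / phi(s).sum())

s = [1.0, 2.0, 0.5]
for j in range(len(s)):
    # Brute force: sum P_s(pi) over all permutations that rank j first.
    brute = sum(permutation_probability(s, pi)
                for pi in permutations(range(len(s))) if pi[0] == j)
    assert abs(brute - top_one_probability(s, j)) < 1e-12
```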

Define the loss function based on top one probabilities: $L(y, z) = - \sum_{j=1}^n P_y(j) \log P_z(j)$. This is again a cross entropy, but over only $n$ terms instead of $n!$.
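A direct implementation of this loss (a sketch; the name `listnet_loss` is my own):

```python
import numpy as np

def listnet_loss(y, z, phi=np.exp):
    """L(y, z) = - sum_j P_y(j) log P_z(j) with top one probabilities."""
    p_y = phi(np.asarray(y, dtype=float))
    p_y /= p_y.sum()
    p_z = phi(np.asarray(z, dtype=float))
    p_z /= p_z.sum()
    return float(-(p_y * np.log(p_z)).sum())

# The loss is smaller when the model scores z order the documents like y.
print(listnet_loss(y=[2.0, 0.0, 1.0], z=[1.5, 0.2, 0.8]))   # agrees: small
print(listnet_loss(y=[2.0, 0.0, 1.0], z=[0.2, 1.5, 0.8]))   # disagrees: larger
```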

Learning Method: ListNet

Let $\phi(s_j) = \exp(s_j)$; then $P_s(j) = \frac{\exp(s_j)}{\sum_{k=1}^n \exp(s_k)}$, i.e. the softmax of the scores.

For $z_j^i = f(x_j^i)$, let $f$ be a neural network; ListNet trains it by gradient descent on the cross-entropy loss above.
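Putting the pieces together, a minimal end-to-end training sketch. A linear scoring function stands in for the neural network here, and the update uses the softmax cross-entropy gradient $\partial L / \partial z = P_z - P_y$; the names, data, and hyperparameters are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

def train_listnet(T, dim, lr=0.01, epochs=200):
    """Gradient descent on the top one cross-entropy loss, with f(x) = w . x."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in T:             # x: (n_i, dim) features, y: (n_i,) scores
            z = x @ w              # z_j^i = f(x_j^i)
            grad_z = softmax(z) - softmax(y)   # dL/dz = P_z - P_y
            w -= lr * (x.T @ grad_z)
    return w

# Tiny made-up example: one query with three documents.
x = np.array([[0.2, 1.0], [0.9, 0.1], [0.4, 0.7]])
y = np.array([2.0, 0.0, 1.0])
w = train_listnet([(x, y)], dim=2)
print(np.argsort(-(x @ w)))        # predicted ranking: [0 2 1]
```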