Long before being introduced to proper losses, I'd known about, and completely accepted, the basic premise as to why density ratio estimation was different to class-probability estimation: the latter only indirectly models the ratio, which is generally a sub-optimal strategy. Sometime in late 2015, I re-read the cool work on LSIF. This time, with link functions and partial losses fresh in my mind, I noticed that one could think of this as involving a particular proper loss plus link function. This was mildly interesting, but the point remained: using anything but the link that directly yields the density ratio must be a bad idea.
This was something of a defeat for my running theory at the time, that almost everything can be attacked with some version of logistic regression. To better understand this failure mode, I was looking at Sugiyama's elegant unified Bregman view of density ratios. I noticed that logistic regression was mentioned, but didn't pay much attention to it; after all, the Bregman view of regret under proper losses immediately led to the KL minimisation view of logistic regression.
It was only on a re-read that I realised there was a bit more to this innocuous result: it was a statement of Bregman minimisation not to the true class-probability, but to the true density ratio. Now that I didn't know was true for logistic regression. Studying the surprisingly simple proof, I saw it could be viewed as an unexpected equality between the KL divergence on probabilities and the KL divergence on ratios.
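To make that equality concrete, here is a sketch of the calculation in my own notation, which needn't match the paper's: write \(p, q\) for the positive and negative class-conditional densities, \(\eta = p/(p+q)\) for the class-probability under equal priors, and \(r = p/q\) for the density ratio, with estimates linked by \(\hat r = \hat\eta/(1-\hat\eta)\). Pointwise,

\[
\underbrace{r \log\frac{r}{\hat r} \;-\; (1+r)\log\frac{1+r}{1+\hat r}}_{B_f(r,\hat r),\;\; f(t)\,=\,t\log t - (1+t)\log(1+t)}
\;=\; (1+r)\left[\eta\log\frac{\eta}{\hat\eta} + (1-\eta)\log\frac{1-\eta}{1-\hat\eta}\right],
\]

and since \(1 + r = (p+q)/q\), the factor \((1+r)\) is exactly a change of measure: taking expectations over \(x \sim q\) on the left and over the mixture \(m = (p+q)/2\) on the right gives

\[
\mathbb{E}_{x\sim q}\big[B_f(r(x), \hat r(x))\big] \;=\; 2\,\mathbb{E}_{x\sim m}\big[\mathrm{KL}\big(\eta(x)\,\|\,\hat\eta(x)\big)\big],
\]

where \(\mathrm{KL}(\eta\|\hat\eta)\) is the bracketed binary KL term above. That is, KL regret on class-probabilities over the mixture equals the Bregman divergence on density ratios, with the logistic-type generator \(f\), over the negatives.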
I was sure this couldn't be true in general; but why? I worked through the case of a general divergence, trying to find the step at which the proof would break.
Half an hour later, after staring at my working several times over, I was still incredulous that everything seemed to go through in the general case, too. And so was Lemma 2 born.
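For the record, the general pattern my working followed is this, again in assumed notation and as a sketch rather than the paper's statement of Lemma 2: given a convex generator \(f\) on \([0,1]\) for the class-probability side, define its perspective-style transform \(\tilde f(t) = (1+t)\,f\!\big(t/(1+t)\big)\) on ratios. The same pointwise calculation as above then gives

\[
B_{\tilde f}(r, \hat r) \;=\; (1+r)\, B_f(\eta, \hat\eta),
\qquad\text{and hence}\qquad
\mathbb{E}_{x\sim q}\big[B_{\tilde f}(r, \hat r)\big] \;=\; 2\,\mathbb{E}_{x\sim m}\big[B_f(\eta, \hat\eta)\big];
\]

the key line is \((1+r)(\eta - \hat\eta) = (1-\hat\eta)(r - \hat r)\), which is what lets the linear term of the Bregman divergence transform cleanly. The KL case is recovered with \(f(u) = u\log u + (1-u)\log(1-u)\), whose transform is exactly the \(t\log t - (1+t)\log(1+t)\) generator above.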