# Massively scalable Sinkhorn distances via the Nyström method

Jason Altschuler∗

MIT jasonalt@mit.edu

Francis Bach†

INRIA - ENS - PSL francis.bach@inria.fr

Alessandro Rudi†

INRIA - ENS - PSL alessandro.rudi@inria.fr

Jonathan Niles-Weed‡

NYU jnw@cims.nyu.edu

Abstract

The Sinkhorn “distance,” a variant of the Wasserstein distance with entropic regularization, is an increasingly popular tool in machine learning and statistical inference. However, the time and memory requirements of standard algorithms for computing this distance grow quadratically with the size of the data, making them prohibitively expensive on massive data sets. In this work, we show that this challenge is surprisingly easy to circumvent: combining two simple techniques, the Nyström method and Sinkhorn scaling, provably yields an accurate approximation of the Sinkhorn distance with significantly lower time and memory requirements than other approaches. We prove our results via new, explicit analyses of the Nyström method and of the stability properties of Sinkhorn scaling. We validate our claims experimentally by showing that our approach easily computes Sinkhorn distances on data sets hundreds of times larger than can be handled by other techniques.

∗ Supported in part by NSF Graduate Research Fellowship 1122374.
† Supported in part by the European Research Council (grant SEQUOIA 724063).
‡ Supported in part by the Josephine de Kármán Fellowship.

arXiv:1812.05189v3 [stat.ML] 26 Oct 2019

Contents

1 Introduction
    1.1 Our contributions
    1.2 Prior work
    1.3 Outline of paper
2 Main result
    2.1 Preliminaries and notation
    2.2 Main result and proposed algorithm
3 Kernel approximation via the Nyström method
    3.1 Preliminaries: Nyström and error in terms of effective dimension
    3.2 Adaptive Nyström with doubling trick
    3.3 General results: data points lie in a ball
    3.4 Adaptivity: data points lie on a low dimensional manifold
4 Sinkhorn scaling an approximate kernel matrix
    4.1 Using an approximate kernel matrix
    4.2 Using an approximate Sinkhorn projection
    4.3 Proof of Theorem 5
5 Proof of Theorem 1
6 Experimental results
A Pseudocode for subroutines
    A.1 Pseudocode for Sinkhorn algorithm
    A.2 Pseudocode for rounding algorithm
B Omitted proofs
    B.1 Stability inequalities for Sinkhorn distances
    B.2 Bregman divergence of Sinkhorn distances
    B.3 Hausdorff distance between transport polytopes
    B.4 Miscellaneous helpful lemmas
    B.5 Supplemental results for Section 3
        B.5.1 Full proof of Theorem 3
        B.5.2 Full proof of Theorem 4
        B.5.3 Additional bounds
C Lipschitz properties of the Sinkhorn projection
References

1 Introduction

Optimal transport is a fundamental notion in probability theory and geometry (Villani, 2008), which has recently attracted a great deal of interest in the machine learning community as a tool for image recognition (Li et al., 2013; Rubner et al., 2000), domain adaptation (Courty et al., 2014, 2017), and generative modeling (Arjovsky et al., 2017; Bousquet et al., 2017; Genevay et al., 2016), among many other applications (see, e.g., Kolouri et al., 2017; Peyré and Cuturi, 2017).

The growth of this field has been fueled in part by computational advances, many of them stemming from an influential proposal of Cuturi (2013) to modify the definition of optimal transport to include an entropic penalty. The resulting quantity, which Cuturi (2013) called the Sinkhorn “distance”¹ after Sinkhorn (1967), is significantly faster to compute than its unregularized counterpart. Though originally attractive purely for computational reasons, the Sinkhorn distance has since become an object of study in its own right because it appears to possess better statistical properties than the unregularized distance both in theory and in practice (Genevay et al., 2018; Montavon et al., 2016; Peyré and Cuturi, 2017; Rigollet and Weed, 2018; Schiebinger et al., 2019). Computing this distance as quickly as possible has therefore become an area of active study.

We briefly recall the setting. Let p and q be probability distributions supported on at most n points in R^d. We denote by M(p,q) the set of all couplings between p and q, and for any P ∈ M(p,q), we denote by H(P) its Shannon entropy. (See Section 2.1 for full definitions.) The Sinkhorn distance between p and q is defined as

$$
W_\eta(p, q) := \min_{P \in \mathcal{M}(p, q)} \sum_{ij} P_{ij} \|x_i - x_j\|_2^2 - \eta^{-1} H(P), \tag{1}
$$

for a parameter η > 0. We stress that we use the squared Euclidean cost in our formulation of the Sinkhorn distance. This choice of cost—which in the unregularized case corresponds to what is called the 2-Wasserstein distance (Villani, 2008)—is essential to our results, and we do not consider other costs here. The squared Euclidean cost is among the most common in applications (Bousquet et al., 2017; Courty et al., 2017; Forrow et al., 2018; Genevay et al., 2018; Schiebinger et al., 2019).

Many algorithms to compute Wη(p,q) are known. Cuturi (2013) showed that a simple iterative procedure known as Sinkhorn’s algorithm had very fast performance in practice, and later experimental work has shown that greedy and stochastic versions of Sinkhorn’s algorithm perform even better in certain settings (Altschuler et al., 2017; Genevay et al., 2016). These algorithms are notable for their versatility: they provably succeed for any bounded, nonnegative cost. On the other hand, these algorithms are based on matrix manipulations involving the n × n cost matrix C, so their running times and memory requirements inevitably scale with n^2. In experiments, Cuturi (2013) and Genevay et al. (2016) showed that these algorithms could reliably be run on problems of size n ≈ 10^4.
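As a concrete reference point, a minimal NumPy sketch of the dense Sinkhorn iteration for (1) might look as follows. It assumes the standard Gibbs kernel form K = exp(−ηC) for the squared Euclidean cost C. The function names `sinkhorn_dense` and `sinkhorn_objective`, the fixed iteration count, and the toy data are illustrative assumptions, not something specified in the text. The n × n arrays `C` and `K` are what make time and memory scale with n^2.

```python
import numpy as np

def sinkhorn_dense(x, p, q, eta, n_iter=500):
    """Illustrative dense Sinkhorn scaling (hypothetical helper, not the paper's code)."""
    sq = np.sum(x**2, axis=1)
    # Squared Euclidean cost C_ij = ||x_i - x_j||^2: an n x n array.
    C = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    K = np.exp(-eta * C)              # Gibbs kernel, also n x n
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iter):           # fixed count for simplicity (assumption)
        u = p / (K @ v)               # rescale rows to match p
        v = q / (K.T @ u)             # rescale columns to match q
    P = u[:, None] * K * v[None, :]   # coupling diag(u) K diag(v)
    return P, C

def sinkhorn_objective(P, C, eta):
    """Objective of (1): <P, C> - H(P) / eta, with H the Shannon entropy."""
    mask = P > 0
    entropy = -np.sum(P[mask] * np.log(P[mask]))
    return np.sum(P * C) - entropy / eta

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(size=(n, 2))          # toy points in the unit square
p = np.full(n, 1.0 / n)
q = rng.uniform(size=n)
q /= q.sum()
P, C = sinkhorn_dense(x, p, q, eta=1.0)
value = sinkhorn_objective(P, C, eta=1.0)
```

Since the product coupling `np.outer(p, q)` is also feasible, comparing its objective with `value` gives a quick sanity check that the iteration has reduced the objective in (1).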

Another line of work has focused on obtaining better running times when the cost matrix has special structure. A preeminent example is due to Solomon et al. (2015), who focus on the Wasserstein distance on a compact Riemannian manifold, and show that an approximation to the entropic regularized Wasserstein distance can be obtained by repeated convolution with the heat kernel on the domain. Solomon et al. (2015) also establish that for data supported on a grid in R^d, significant speedups are possible by decomposing the cost matrix into “slices” along each dimension (see Peyré and Cuturi, 2017, Remark 4.17). While this approach allowed Sinkhorn distances to be computed on significantly larger problems (n ≈ 10^8), it does not extend to non-grid settings. Other proposals include using random sampling of auxiliary points to approximate semi-discrete costs (Tenetov et al., 2018) or performing a Taylor expansion of the kernel matrix in the case of the squared Euclidean cost (Altschuler et al., 2018). These approximations both focus on the η → ∞ regime, when the regularization term in (1) is very small, and do not apply to the moderately regularized case η = O(1) typically used in practice. Moreover, the running time of these algorithms scales exponentially in the ambient dimension, which can be very large in applications.

¹ We use quotations since it is not technically a distance; see (Cuturi, 2013, Section 3.2) for details. The quotes are dropped henceforth.
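To illustrate the grid speedup attributed to Solomon et al. (2015), the sketch below applies the Gaussian kernel on a two-dimensional grid one axis at a time instead of forming the full n × n matrix. It relies on the fact that exp(−η‖·‖²) factorizes across coordinates, so the kernel matrix on a grid is a Kronecker product of one-dimensional kernel matrices. The grid sizes, the value of η, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def kernel_1d(t, eta):
    """1-D Gaussian kernel matrix exp(-eta * (t_i - t_k)^2)."""
    diff = t[:, None] - t[None, :]
    return np.exp(-eta * diff**2)

def grid_kernel_matvec(K1, K2, z):
    """Apply the Kronecker product of K1 and K2 to z without forming it.

    z holds one value per grid point in row-major order.
    """
    Z = z.reshape(K1.shape[0], K2.shape[0])
    return (K1 @ Z @ K2.T).ravel()

eta = 2.0
s = np.linspace(0.0, 1.0, 12)      # grid coordinates along the first axis
t = np.linspace(0.0, 1.0, 9)       # grid coordinates along the second axis
K1, K2 = kernel_1d(s, eta), kernel_1d(t, eta)

rng = np.random.default_rng(0)
z = rng.uniform(size=s.size * t.size)
fast = grid_kernel_matvec(K1, K2, z)

# Dense check, feasible only because this toy grid has n = 108 points.
pts = np.stack(np.meshgrid(s, t, indexing="ij"), axis=-1).reshape(-1, 2)
sq = np.sum((pts[:, None, :] - pts[None, :, :])**2, axis=-1)
slow = np.exp(-eta * sq) @ z
```

Each Sinkhorn iteration needs only such kernel-vector products, which is why this structure helps on grids and why it is unavailable once the points leave the grid.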

1.1 Our contributions

We show that a simple algorithm can be used to approximate Wη(p,q) quickly on massive data sets. Our algorithm uses only known tools, but we give novel theoretical guarantees that allow us to show that the Nyström method combined with Sinkhorn scaling provably yields a valid approximation algorithm for the Sinkhorn distance at a fraction of the running time of other approaches.

We establish two theoretical results of independent interest: (i) New Nyström approximation results showing that instance-adaptive low-rank approximations to Gaussian kernel matrices can be found for data lying on a low-dimensional manifold (Section 3). (ii) New stability results about Sinkhorn projections, establishing that a sufficiently good approximation to the cost matrix can be used (Section 4).
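The combination of the two ingredients can be sketched roughly as follows: build a low-rank Nyström factorization of the Gaussian kernel matrix, then run Sinkhorn scaling using only products with the factors. This is a simplified illustration under stated assumptions, not the algorithm analyzed in the paper: landmarks are sampled uniformly, the rank `r` is fixed by hand, the pseudo-inverse cutoff `rcond` is arbitrary, and all function names are hypothetical. The adaptive Nyström procedure of Section 3.2 is not reproduced here.

```python
import numpy as np

def gaussian_kernel(a, b, eta):
    """K_ij = exp(-eta * ||a_i - b_j||^2)."""
    sq = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
          - 2.0 * (a @ b.T))
    return np.exp(-eta * np.maximum(sq, 0.0))

def nystrom_factors(x, eta, r, rng, rcond=1e-10):
    """Rank-r Nystrom factors with uniformly sampled landmarks (a simplification)."""
    idx = rng.choice(len(x), size=r, replace=False)
    V = gaussian_kernel(x, x[idx], eta)           # n x r block of K
    A_pinv = np.linalg.pinv(V[idx], rcond=rcond)  # pseudo-inverse of the r x r block
    return V, A_pinv                              # K is approximated by V A_pinv V^T

def sinkhorn_factored(V, A_pinv, p, q, n_iter=500):
    """Sinkhorn scaling that touches the kernel only through O(n r) products."""
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iter):
        u = p / (V @ (A_pinv @ (V.T @ v)))        # row rescaling
        v = q / (V @ (A_pinv.T @ (V.T @ u)))      # column rescaling
    return u, v

rng = np.random.default_rng(0)
n, r, eta = 400, 80, 1.0
x = rng.uniform(size=(n, 2))                      # toy points in the unit square
p = np.full(n, 1.0 / n)
q = rng.uniform(size=n)
q /= q.sum()

V, A_pinv = nystrom_factors(x, eta, r, rng)
u, v = sinkhorn_factored(V, A_pinv, p, q)

# Diagnostics below form n x n arrays and exist only for this toy check.
K_approx = V @ A_pinv @ V.T
kernel_error = np.max(np.abs(K_approx - gaussian_kernel(x, x, eta)))
P_approx = u[:, None] * K_approx * v[None, :]
```

In this sketch the scaling loop stores only the n × r factor `V` and the r × r matrix `A_pinv`, so its memory grows with nr and not with n^2. Whether the resulting scaling is a valid approximation depends on how accurate the low-rank kernel is, which is the kind of question the stability results in (ii) address.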

1.2 Prior work

Computing the Sinkhorn distance efficiently is a well-studied problem in a number of communities. The Sinkhorn distance is so named because, as was pointed out by Cuturi (2013), there is an extremely simple iterative algorithm due to Sinkhorn (1967) which converges quickly to a solution to (1). This algorithm, which we