Deep Unsupervised Learning

Notes from the Berkeley course on Deep Unsupervised Learning.


Lecture 1 (Autoregressive Models)

Likelihood-based models

  • Problems we’d like to solve: generating data, compressing data, anomaly detection
  • Likelihood-based models estimate the data distribution from samples drawn from it
  • The aim is to estimate the distribution of complex, high-dimensional data with computational and statistical efficiency.

Generative models

  • Maximum likelihood: given a dataset $x^{(1)}, \dots, x^{(n)}$, find $\theta$ by solving the optimization problem $\arg\max_\theta \frac{1}{n}\sum_i \log p_\theta(x^{(i)})$.

  • This is equivalent to minimizing the KL divergence between the empirical data distribution and the model.

    How do we solve this optimization problem?

    • Stochastic gradient descent (SGD)

    Why maximum likelihood + SGD? It works with large datasets and is compatible with neural networks.
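As a minimal sketch (assuming a hypothetical `model` that maps a batch of x to per-example log-likelihoods; any likelihood-based model fits this interface), maximum likelihood with SGD looks like:

```python
import torch

def train(model, dataloader, epochs=10, lr=1e-2):
    # Hypothetical interface: model(x) returns log p_theta(x) per example.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x in dataloader:
            loss = -model(x).mean()   # negative log-likelihood
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```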

Designing The Model

  • The key requirement for maximum likelihood + SGD: efficiently compute log p(x) and its gradient.
  • $p_\theta$ → a deep neural network

Autoregressive model

An expressive Bayes net structure with neural network conditional distributions yields an expressive model for p(x) with tractable maximum likelihood training.
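Concretely, the factorization is just the chain rule of probability, with each conditional parameterized by a neural network:

$$ \log p_\theta(x) = \sum_{i=1}^{d} \log p_\theta\!\left(x_i \mid x_{1:i-1}\right) $$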

RNN autoregressive models - char-rnn


Masking-based autoregressive models

  • Masked Autoencoder for Distribution Estimation (MADE); a mask-construction sketch follows below

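A minimal sketch of MADE-style mask construction (one hidden layer, numpy only; the degree assignment follows the usual MADE recipe, the function name is mine):

```python
import numpy as np

def made_masks(d_in, d_hidden, rng=np.random.default_rng(0)):
    # Each input gets a degree 1..d_in (its position in the ordering);
    # each hidden unit gets a random degree in 1..d_in-1.
    m_in = np.arange(1, d_in + 1)
    m_hid = rng.integers(1, d_in, size=d_hidden)

    # Hidden unit k may see inputs whose degree <= its own degree.
    mask_in_to_hid = (m_hid[:, None] >= m_in[None, :]).astype(float)   # (d_hidden, d_in)
    # Output i may see only hidden units with degree < i,
    # so output i never depends on input i or anything after it.
    mask_hid_to_out = (m_in[:, None] > m_hid[None, :]).astype(float)   # (d_in, d_hidden)
    return mask_in_to_hid, mask_hid_to_out
```

Multiplying these masks elementwise into the weight matrices turns an ordinary autoencoder into an autoregressive model.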

Masked Temporal (1D) Convolution (WaveNet)

Improved receptive field: dilated convolutions with exponentially increasing dilation (a causal dilated convolution is sketched below). Better expressivity: gated residual blocks and skip connections.

https://procedural-generation.isaackarth.com/tumblr_files/tumblr_od90sk1vkL1uo5d9jo1_640.gif

https://haydensansum.github.io/CS205-Waveforms/imgs/wavenet_gate.png
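A minimal sketch of the causal dilated 1-D convolution that WaveNet stacks (left-padding so that output t only sees inputs up to t; channel sizes are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that never looks into the future."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):              # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))    # pad on the left only
        return self.conv(x)

# Exponential dilation (1, 2, 4, ...) makes the receptive field grow exponentially with depth.
stack = nn.Sequential(*[CausalConv1d(32, dilation=2 ** i) for i in range(6)])
```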

Masked Spatial (2D) Convolution - PixelCNN

  • Images can be flattened into 1D vectors, but they are fundamentally 2D.
  • We can use a masked variant of a ConvNet to exploit this structure (sketched after this list).
  • First, we impose an autoregressive ordering on 2D images:

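A minimal sketch of PixelCNN-style weight masking (mask type 'A' hides the current pixel and is used for the first layer, type 'B' allows it; single channel group for simplicity):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == "B"):] = 0  # center row: block center (A) and right of center
        mask[kH // 2 + 1:, :] = 0                         # block every row below the center
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        self.weight.data *= self.mask   # enforce the raster-scan autoregressive ordering
        return super().forward(x)

# Example: first layer of a PixelCNN over 1-channel images (hyperparameters are illustrative).
layer = MaskedConv2d("A", 1, 64, kernel_size=7, padding=3)
```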

PixelCNN

PixelCNN-style masking has one problem: a blind spot in the receptive field.


Masked Attention + Convolution


Neural autoregressive models: the good

Best-in-class modeling performance:

  • expressivity: the autoregressive factorization is fully general.
  • generalization: meaningful parameter sharing gives a good inductive bias.

Masked autoregressive models: the bad

  • Sampling is slow: each pixel requires one full forward pass, so generating d pixels costs O(d) passes.

    Speedup by breaking the autoregressive pattern:

    • O(d) → O(log d) by parallelizing within groups {2, 3, 4}.
    • This cannot capture dependencies within each group, which is fine only if all pixels in a group are conditionally independent.
      • Most often they are not, so you trade expressivity for sampling speed.


Natural Image Manipulation for Autoregressive Models using Fisher Scores

  • Main challenge:
    • How do we get a latent representation out of a PixelCNN?
    • Why is this hard? The sampling randomness enters on a per-pixel basis, so there is no single latent vector to manipulate.
  • Proposed solution:
    • Use the Fisher score as the latent representation (a minimal sketch follows this list).
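The Fisher score of a data point is the gradient of its log-likelihood with respect to the model parameters, $\nabla_\theta \log p_\theta(x)$. A minimal autograd sketch (assuming a hypothetical `model` that returns per-example log-likelihoods):

```python
import torch

def fisher_score(model, x):
    # Flattened gradient of log p_theta(x) with respect to all parameters.
    log_px = model(x).sum()
    grads = torch.autograd.grad(log_px, list(model.parameters()))
    return torch.cat([g.flatten() for g in grads])
```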


Lecture 2 (Flow Models)

  • How to fit a density model $p_\theta(x)$ with continuous $x \in \mathbb{R}^n$?
  • What do we want from this model?
    • Good fit to the training data (really, the underlying distribution!)
    • For new x, the ability to evaluate $p_\theta (x)$
    • Ability to sample from $p_\theta (x)$
    • And, ideally, a latent representation that’s meaningful

How to fit a density model?

Option 1: Mixture of Gaussians

Parameters: means and variances of components, mixture weights

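For reference, the mixture-of-Gaussians density with these parameters (mixture weights $\pi_k$, means $\mu_k$, variances $\sigma_k^2$):

$$ p_\theta(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x;\ \mu_k, \sigma_k^2\right), \qquad \pi_k \ge 0,\ \sum_k \pi_k = 1 $$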

Option 2: General Density Model


  • How do we ensure it is a proper (normalized, non-negative) distribution?


  • How to sample?
  • Latent representation?

Flows: Main Idea


Flows: Training

Change of Variables


Note: this requires $f_\theta$ to be invertible and differentiable

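For completeness, writing $z = f_\theta(x)$ for the flow and $p_Z$ for the base density, the 1-D change-of-variables formula and the resulting maximum-likelihood objective are:

$$ p_\theta(x) = p_Z\!\left(f_\theta(x)\right)\left|\frac{\partial f_\theta(x)}{\partial x}\right|, \qquad \max_\theta \ \sum_i \left[ \log p_Z\!\left(f_\theta(x^{(i)})\right) + \log\left|\frac{\partial f_\theta}{\partial x}\!\left(x^{(i)}\right)\right| \right] $$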

Assuming we have an expression for $f_\theta$ and its derivative, this objective can be optimized with stochastic gradient descent.

Flows: Sampling

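Sampling runs the flow in reverse: draw from the base distribution and apply the inverse map (same notation as above):

$$ z \sim p_Z(z), \qquad x = f_\theta^{-1}(z) $$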

2-D Autoregressive Flow


2-D Autoregressive Flow: Face

Architecture:

  • Base distribution: Uniform[0,1]^2
  • x1: mixture of 5 Gaussians
  • x2: mixture of 5 Gaussians, conditioned on x1


Autoregressive flows

  • How to fit autoregressive flows?

    • Map x to z
    • This direction is fully parallelizable (a 2-D sketch follows this list)
  • Notice

    • x → z has the same structure as the log likelihood computation of an autoregressive model
    • z → x has the same structure as the sampling procedure of an autoregressive model

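A toy 2-D affine autoregressive flow (MAF-style; the conditioner network and all hyperparameters are illustrative). The forward pass x → z is parallel, the inverse z → x is sequential:

```python
import torch
import torch.nn as nn

class AffineAR2DFlow(nn.Module):
    """z1 depends only on x1; z2 depends on (x1, x2) through an affine map."""
    def __init__(self):
        super().__init__()
        self.mu1 = nn.Parameter(torch.zeros(1))
        self.log_s1 = nn.Parameter(torch.zeros(1))
        # Conditioner: x1 -> (mu2, log_s2)
        self.cond = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 2))

    def forward(self, x):                      # x -> z: same structure as AR log-likelihood
        x1, x2 = x[:, :1], x[:, 1:]
        z1 = (x1 - self.mu1) * torch.exp(-self.log_s1)
        mu2, log_s2 = self.cond(x1).chunk(2, dim=1)
        z2 = (x2 - mu2) * torch.exp(-log_s2)
        log_det = -(self.log_s1 + log_s2).squeeze(1)   # triangular Jacobian: sum of diagonal log-scales
        return torch.cat([z1, z2], dim=1), log_det

    def inverse(self, z):                      # z -> x: same structure as AR sampling
        z1, z2 = z[:, :1], z[:, 1:]
        x1 = z1 * torch.exp(self.log_s1) + self.mu1
        mu2, log_s2 = self.cond(x1).chunk(2, dim=1)
        x2 = z2 * torch.exp(log_s2) + mu2
        return torch.cat([x1, x2], dim=1)
```

Training maximizes the base-distribution log-density of z plus `log_det`, i.e. exactly the change-of-variables objective above.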

Inverse autoregressive flows

  • The inverse of an autoregressive flow is also a flow, called the inverse autoregressive flow (IAF)
  • x → z has the same structure as the sampling in an autoregressive model
  • z → x has the same structure as log likelihood computation of an autoregressive model. So, IAF sampling is fast


AF vs IAF

  • Autoregressive flow
    • Fast evaluation of p(x) for arbitrary x
    • Slow sampling
  • Inverse autoregressive flow
    • Slow evaluation of p(x) for arbitrary x, so training directly by maximum likelihood is slow.
    • Fast sampling
    • Fast evaluation of p(x) if x is a sample you generated yourself (the corresponding z is already known)
  • There are models (Parallel WaveNet, IAF-VAE) that exploit IAF’s fast sampling.

Change of MANY variables

For small $\Delta z$, the sampling process $f^{-1}$ linearly transforms a small cube $\Delta z$ into a small parallelepiped $\Delta x$. Probability is conserved:

$$ p_X(x)\,|\Delta x| = p_Z(z)\,|\Delta z| \quad\Longrightarrow\quad p_X(x) = p_Z(f(x))\,\left|\det \frac{\partial f(x)}{\partial x}\right| $$

Intuition: x is likely if it maps to a “large” region in z space

High-Dimensional Flow Models: Training

The change-of-variables formula lets us compute the density over x:

$$ p_\theta(x) = p_Z\!\left(f_\theta(x)\right)\,\left|\det \frac{\partial f_\theta(x)}{\partial x}\right| $$

Train with maximum likelihood:

$$ \max_\theta \ \sum_i \left[ \log p_Z\!\left(f_\theta(x^{(i)})\right) + \log \left|\det \frac{\partial f_\theta}{\partial x}\!\left(x^{(i)}\right)\right| \right] $$

New key requirement: the Jacobian determinant must be easy to calculate and differentiate!

NICE/RealNVP

  • Split variables in half:

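The affine coupling transform (RealNVP form; NICE is the additive special case with $s_\theta \equiv 0$), using the $s_\theta$, $t_\theta$ notation from the bullet below:

$$ z_{1:d/2} = x_{1:d/2}, \qquad z_{d/2+1:d} = x_{d/2+1:d} \odot \exp\!\big(s_\theta(x_{1:d/2})\big) + t_\theta(x_{1:d/2}) $$

The Jacobian is triangular, so $\log\left|\det \frac{\partial z}{\partial x}\right| = \sum_j s_\theta(x_{1:d/2})_j$, which is cheap to compute.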

  • Invertible! Note that $s_\theta$ and $t_\theta$ can be arbitrary neural nets with no restrictions.
    • Think of them as data-parameterized elementwise flows.

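A minimal sketch of one affine coupling layer in PyTorch (the conditioner `net` and hidden size are illustrative stand-ins for $s_\theta$, $t_\theta$):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style coupling: keep the first half, affinely transform the second."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),        # outputs (log-scale s, shift t) for the second half
        )

    def forward(self, x):                  # x -> z, with log|det J|
        x1, x2 = x.chunk(2, dim=1)
        s, t = self.net(x1).chunk(2, dim=1)
        z2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)             # triangular Jacobian with exp(s) on the diagonal
        return torch.cat([x1, z2], dim=1), log_det

    def inverse(self, z):                  # z -> x is exact, with no restriction on self.net
        z1, z2 = z.chunk(2, dim=1)
        s, t = self.net(z1).chunk(2, dim=1)
        x2 = (z2 - t) * torch.exp(-s)
        return torch.cat([z1, x2], dim=1)
```

In practice, coupling layers are stacked with the halves swapped (or permuted) between layers so every dimension eventually gets transformed.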

Onur Copur
MSc Data Science

Data scientist & Industrial Engineer