Stanford CS330


Introduction & Overview

Why multitask learning and meta-learning?

Challenges

  • not scalable to learning more tasks
  • not realistic to collect data all the time
  • needs to be supervised

Benefits for deep learning

Unstructured input

neural network for various tasks

Adaptation

Object classification & Machine translation

apply to different situations

Benefits for multi-task learning and meta learning

not enough data or paired data?

long-tail data (such as autonomous driving)

How does it work for humans?

Brief summary

Defining the task

It is the dataset $D$ and loss function $L$ that jointly determine the model $f_\theta$ (the task)

  • cross entropy loss & mean squared loss
  • MNIST & Fashion MNIST for classifying

Critical Assumption

Different tasks need to share some structure. If not, splitting into separate single-task learning from scratch may be more rational

Informal Problem Definitions

Q & A

Difference between meta-learning and transfer learning?

I think one aspect of meta-learning is that you want to be able to learn a new task more quickly, whereas in transfer learning you may also want to perform well on a new task zero-shot, where you just want to share representations. I actually view transfer learning as something that encapsulates both of these things, where you're thinking about how you can transfer information between different tasks, and that could correspond to the multi-task learning problem as well as the meta-learning problem.

Learning to learn?

Yes

More than one task for meta-learning?

Yes, or break down the original task to sub-tasks.

Meta-learning and domain adaptation?

Likely related; think of it as in-domain versus out-of-domain

Reduction?

Idea

$$ D=\bigcup_i D_i $$

$$ L=\sum_i L_i $$

Take the union of the datasets & add up the loss functions $\to$ reduces multi-task learning to single-task learning
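A minimal sketch of this reduction in PyTorch; the random datasets below are placeholders standing in for real per-task data:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Illustrative per-task datasets (random tensors stand in for real tasks).
D_1 = TensorDataset(torch.randn(100, 8), torch.randn(100, 1))
D_2 = TensorDataset(torch.randn(100, 8), torch.randn(100, 1))

merged = ConcatDataset([D_1, D_2])                 # D = D_1 ∪ D_2
loader = DataLoader(merged, batch_size=32, shuffle=True)
# Training on `loader` with a single summed loss turns the multi-task
# problem into ordinary single-task supervised learning.
```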

Why now?

These ideas are by no means new

Strong neural networks

Increasing role

Democratization

Multi-Task & Meta-learning Basics

Reference

  • Maximum Likelihood Estimation
  • Fully Connected Neural Network
  • Negative Transfer
  • Conditional Independence

Multi-Task Learning Basics

Some Notation

Single-task learning (supervised)

$$ D=\{(x,y)_k\} $$

Aimed at minimizing the loss function

$$\mathop{min}\limits_{\theta} L(\theta,D)$$

Typical Loss: negative log likelihood

$$L(\theta,D) = -E_{(x,y) \sim D}[\log f_{\theta}(y|x)]$$

A task ($p_i(x)$ is the distribution over inputs, $p_i(y|x)$ the distribution over labels given inputs):

$$T_i:=\{p_i(x),p_i(y|x),L_i\}$$

Corresponding datasets

$$D_{i}^{tr} \quad D_i^{test}$$ Going forward, we will use $D_i$ as shorthand for $D_i^{tr}$

Examples of Tasks

  • Multi-task classification example: Students and HRs receiving emails from Warwick
  • Multi-label learning example: detecting whether or not a person is wearing a hat / detecting hair color ($L_i$ and $p_i(x)$ are the same; however, $p_i(y|x)$ differs, since these are different binary classification tasks)
  • $L_i$ can also vary sometimes

Introduction of task descriptor

To distinguish different tasks

Objective

$$\mathop{min}\limits_{\theta} \sum_{i=1}^{T}L_i(\theta,D_i)$$ Two questions to answer:

  • How should we condition on the task descriptor $z_i$?
  • How to optimize our objective?

Conditioning on the task

Assume that inputs for different tasks have the same size and dimensions. For inputs of different sizes, use an RNN- or attention-based model that aggregates across the varying dimensions.

  1. Simplest way
  • separate into several neural networks with completely different weights

  • condition on the task descriptor by picking the corresponding network (multiplicative gating over networks)

  • no sharing of previously trained parameters

  2. Concatenate $z_i$
  • almost all of the parameters are shared

  • only the weights right after $z_i$ (a fully connected layer on its features) differ across tasks (with a one-hot $z_i$, for example); see the sketch after this list
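A minimal sketch of these two extremes in PyTorch; layer sizes and the one-hot descriptor are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_tasks, in_dim, hid, out_dim = 4, 16, 64, 1

# Option 1: completely separate networks, selected by the task index
# (no parameter sharing; the descriptor only gates which network runs).
separate = nn.ModuleList(
    [nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, out_dim))
     for _ in range(num_tasks)])

def forward_separate(x, task_id):
    return separate[task_id](x)

# Option 2: concatenate a one-hot z_i to the input
# (almost all parameters shared; only the weights touching z_i differ).
shared = nn.Sequential(nn.Linear(in_dim + num_tasks, hid), nn.ReLU(),
                       nn.Linear(hid, out_dim))

def forward_concat(x, task_id):
    z = nn.functional.one_hot(torch.tensor([task_id] * x.shape[0]),
                              num_tasks).float()
    return shared(torch.cat([x, z], dim=-1))
```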

An Alternative View on the Multi-Task Objective

Split $\theta$ into shared parameters $\theta^{sh}$ and task-specific parameters $\theta^i$; optimize $\theta^{sh}$ jointly across tasks and each $\theta^i$ separately
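Written out, this view of the objective is (a sketch in the lecture's notation):

$$\mathop{min}\limits_{\theta^{sh},\theta^{1},\dots,\theta^{T}} \sum\limits_{i=1}^{T}L_i\big(\{\theta^{sh},\theta^{i}\},D_i\big)$$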

Choosing how to condition on $z_i$ $\iff$ Choosing how & where to share parameters

Conditioning:

Some Common Choices

Concatenation-based and additive conditioning are actually the same (assuming $z_i$ is a one-hot vector)

If more information is given (like the degree to which two tasks are similar to each other), it can be fed into the network as $z_i$. However, determining what content should be shared is quite a big problem in multi-task learning (it may be figured out during the learning process rather than before training).

Multiplicative conditioning: multiply, rather than add

choose different parts of the network to use for various tasks → modulate features per task (completely turn off features / use only some heads for one task); a sketch follows
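A sketch of multiplicative conditioning as feature gating, assuming a learned per-task gate vector (an illustration, not a specific architecture from the lecture); a sigmoid gate can modulate or completely turn off features:

```python
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    """Multiplicative conditioning: features are scaled (possibly to zero)
    by a gate computed from the task descriptor, rather than added to it."""
    def __init__(self, num_tasks, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.task_gate = nn.Embedding(num_tasks, dim)  # one gate vector per task

    def forward(self, x, task_id):
        # task_id: integer index; a gate in (0, 1) can shut features off entirely
        gate = torch.sigmoid(self.task_gate(torch.tensor(task_id)))
        return torch.relu(self.linear(x)) * gate
```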

More Complex Choices

Conditioning Choices

Recall: the "Art, Science, Engineering" view of architecture design by Mu Li

Optimizing the objective

Basically similar to single-task learning; the relative importance of different tasks needs to be set manually. For regression problems, make sure the labels are on the same scale; otherwise labels with a greater magnitude → that loss term has a greater scale. A training-loop sketch follows.
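A sketch of a vanilla multi-task training step, assuming hypothetical task objects with `sample_batch`, `loss_fn`, and `id` attributes; the manual per-task weights are also an assumption for illustration:

```python
import random

def train_step(tasks, model, optimizer, task_batch=4, weights=None):
    """Sample a mini-batch of tasks, sum (optionally weighted) per-task
    losses on sampled data, and take one gradient step on the shared model."""
    optimizer.zero_grad()
    loss = 0.0
    for t in random.sample(tasks, task_batch):
        x, y, z = t.sample_batch()                      # data + task descriptor
        w = 1.0 if weights is None else weights[t.id]   # manual task importance
        loss = loss + w * t.loss_fn(model(x, z), y)     # keep labels on the same scale!
    loss.backward()
    optimizer.step()
```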

Challenges

Negative transfer

the gradient from one task may hurt another, & one task may learn much faster than another

Inverse question: why do we expect positive transfer? Because you don't have a lot of data per task & the tasks are related → features and representations learned for one task will be useful for another task

Combine two different neural networks with the same structure? Naively creating a task selector → amounts to trying to learn a single network

Soft parameter sharing
Idea: encourage the task-specific parameters to be similar to one another (see the penalized objective below)
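One common way to sketch this as a penalized objective (the penalty weight $\lambda$ and the choice of norm are design assumptions):

$$\mathop{min}\limits_{\theta^{sh},\theta^{1},\dots,\theta^{T}} \sum\limits_{i=1}^{T}L_i\big(\{\theta^{sh},\theta^{i}\},D_i\big)+\lambda\sum\limits_{i,i'}\big\|\theta^{i}-\theta^{i'}\big\|$$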

Overfitting

Case study

User Engagement & User Satisfaction

Framework Set-Up

The Ranking Problem

The Architecture

Baseline

A gating network decides which expert(s) are used
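A sketch in the spirit of the multi-gate mixture-of-experts architecture from this case study; expert/head sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiGateMoE(nn.Module):
    """Each task has its own softmax gate over a shared pool of experts,
    so the model can learn which expert(s) each task should use."""
    def __init__(self, num_experts, num_tasks, in_dim, hid):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
             for _ in range(num_experts)])
        self.gates = nn.ModuleList(
            [nn.Linear(in_dim, num_experts) for _ in range(num_tasks)])
        self.heads = nn.ModuleList(
            [nn.Linear(hid, 1) for _ in range(num_tasks)])

    def forward(self, x, task_id):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, H)
        weights = torch.softmax(self.gates[task_id](x), dim=-1)        # (B, E)
        mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=1)        # (B, H)
        return self.heads[task_id](mixed)
```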

Experiments

Experts tend to specialize in one or several tasks; polarization → the gate ends up using one expert, or no expert at all

Meta-Learning Basics

Two ways to view meta-learning algorithms

Problem definitions (probabilistic view)

Idea: can we incorporate additional data?

Tips: it don’t need to be image classification problem

The meta-learning problem

assume that $\phi$ and $D_{meta-train}$ are conditionally independent given $\theta$ (if the parameters learned from the meta-training dataset are given, the new task-specific parameters and the meta-training dataset are independent)

so that the meta-training problem can be understood as

$$ \theta^*=\mathop{\arg\max}_{\theta}\log p(\theta | D) $$

where $D$ is shorthand for $D_{meta-train}$

A Quick Example

meta-learning & adaptation

How do we train this thing?

The idea: match conditions between meta-training and meta-testing (adaptation) time

At meta-training time: how can we get access to a test dataset?

Reserve a test set for each task!

Training on the Test Set

Shift

  • training set → meta-train set
  • test set → meta-test set

The complete meta-learning optimization

$\theta$ is what is shared among different tasks
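In the notation above, the complete optimization can be sketched as (assuming adaptation produces $\phi_i$ from the training split of each task):

$$\theta^{\ast}=\mathop{\arg\max}_{\theta}\sum\limits_{i}\log p\big(\phi_i \mid D_i^{ts}\big),\quad \phi_i=f_{\theta}\big(D_i^{tr}\big)$$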

Some meta-learning terminology

k-shot learning means learning from k examples (per class, in classification)

Meta-Learning Recipe, Black-Box Adaptation, Optimization-Based Approaches

Reference

Black-Box Adaptation

Tips

| Symbol | Meaning |
| --- | --- |
| $D_i$ | meta-training dataset |
| $D$ | meta-testing dataset |
| $\theta$ | meta-parameters |
| $\phi$ | task-specific parameters |

Recap from Last Time

General Recipe

How to evaluate a meta-learning algorithm

few-shot dataset → both few-shot generative and discriminative problems are proposed

split other existing datasets for meta-training, adapting the approach to any ML problem

The Meta-Learning Problem: The Mechanistic View

The Meta-Learning Problem: The Probabilistic View

Assumption: the training task and new task are from the same distribution?

How to design a meta-learning algorithm

Black-Box Adaptation

Overview

$f_{\theta}$: processes $D_i^{tr}$ in a sequential fashion or as one batch. Train with standard supervised learning: maximize the probability of the labels under the distribution that $g_{\phi_i}$ produces

$$ \mathop{max}\limits_{\theta} \sum\limits_{T_{i}} \sum\limits_{(x,y) \sim D_i^{test}}\log g_{\phi_{i}}(y|x) = \mathop{max}\limits_{\theta} \sum\limits_{T_{i}}L(f_{\theta}(D_i^{tr}),D_i^{test}) $$

In the meta-learning process, $\theta$ is learned, while $\phi$ can be viewed as activations/a tensor rather than actual parameters

Meta-Training Part

randomly split each $D_i$ into $D_i^{tr}$ and $D_i^{test}$

repeat this iteratively using your favorite gradient descent optimizer (a sketch follows)
  1. all of $\theta$ are meta-parameters, and $\phi$ is considered the task-specific parameters
  2. we update $\theta$ and do not update $\phi$, since it is dynamically computed at every iteration
  3. the gradient $\nabla_{\theta}L(\phi_i,D_i^{test})$ for $\theta$ has to go through $\phi$
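A sketch of one such iteration in PyTorch; `task_sampler`, the random split, and `f_theta.predict` are hypothetical stand-ins for whatever black-box architecture is used:

```python
import torch

def meta_train_step(task_sampler, f_theta, loss_fn, meta_opt, num_tasks=4):
    """One black-box meta-training iteration (steps 1-3 above).
    task_sampler() -> ((x_tr, y_tr), (x_ts, y_ts)) is a hypothetical helper."""
    meta_opt.zero_grad()
    for _ in range(num_tasks):
        (x_tr, y_tr), (x_ts, y_ts) = task_sampler()  # random train/test split
        phi = f_theta(x_tr, y_tr)           # phi: activations, computed fresh each step
        preds = f_theta.predict(phi, x_ts)  # g_phi(y|x) on the held-out split
        loss_fn(preds, y_ts).backward()     # gradient reaches theta through phi
    meta_opt.step()                         # update theta only; phi is not a parameter
```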

Challenge

If $\phi$ literally represents all the parameters of another neural network, it may not be that scalable to actually output all of those parameters, because neural networks are very large.

Instead, output sufficient statistics $h_i$ (a lower-dimensional vector, similar to the hidden state of an LSTM): $$\phi_i=\{h_i,\theta_{g}\}$$

general form

$$y^{ts}=f_{\theta}(D_i^{tr},x^{ts})$$

Architecture

Pros and Cons

Optimization-Based Inference

Inference problem → optimization procedure $$\mathop{max}\limits_{\phi_{i}}\log p(D_i^{tr}|\phi_i)+\log p(\phi_{i}|\theta) $$ breaking the problem down into two terms:

  1. maximizing the likelihood of training data given task-specific parameters
  2. optimizing the likelihood of task-specific parameters under meta parameters

Fine-tuning

$$\phi \leftarrow \theta -\alpha \nabla L(\theta, D^{tr})$$
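A sketch of fine-tuning as this update applied for a few steps, assuming a pre-trained `model`; the step size and step count are illustrative:

```python
import itertools
import torch

def fine_tune(model, loss_fn, train_loader, alpha=1e-3, steps=100):
    """phi <- theta - alpha * grad L(theta, D_tr), applied for a few steps."""
    opt = torch.optim.SGD(model.parameters(), lr=alpha)
    for x, y in itertools.islice(itertools.cycle(train_loader), steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model
```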

Common practices for acquiring pre-trained parameters in different fields, using various methods, are listed above

Training from scratch versus using a pre-trained model; bigger datasets versus smaller datasets

Meta-learning

$$\mathop{min}\limits_{\theta} \sum\limits_{\text{task } i}L(\theta - \alpha \nabla L(\theta, D_i^{tr}),D_i^{ts}) $$
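A minimal sketch of this objective with one inner gradient step, on a toy linear-regression task family (the task distribution, step sizes, and model are illustrative assumptions, not the lecture's setup):

```python
import torch

def model(params, x):
    w, b = params
    return x @ w + b

def mse(params, x, y):
    return ((model(params, x) - y) ** 2).mean()

# meta-parameters theta = (w, b)
w = torch.zeros(1, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
meta_opt = torch.optim.SGD([w, b], lr=1e-2)
alpha = 0.01  # inner-loop step size

for step in range(1000):
    meta_opt.zero_grad()
    for _ in range(4):  # mini-batch of tasks: random slopes (an assumption)
        slope = torch.randn(1)
        x_tr, x_ts = torch.randn(10, 1), torch.randn(10, 1)
        y_tr, y_ts = slope * x_tr, slope * x_ts
        # inner step: phi = theta - alpha * grad L(theta, D_i^tr)
        grads = torch.autograd.grad(mse((w, b), x_tr, y_tr), (w, b),
                                    create_graph=True)  # keep second-order terms
        phi = tuple(p - alpha * g for p, g in zip((w, b), grads))
        # outer loss L(phi, D_i^ts); backprop through the inner update
        mse(phi, x_ts, y_ts).backward()
    meta_opt.step()
```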

Notation

  1. It’s not a two-dimensional problem
  2. It’s actually a whole space of optimums rather than single optimum

meta-learning + multitask learning

Formula

Single Inner Gradient Step

With $d$ as the total derivative, $\nabla$ as the partial derivative, $u$ as the update rule, and $D^{test}$ as shorthand for $D_i^{test}$:

$$\phi=u(\theta,D^{tr})$$

$$\mathop{min}\limits_{\theta}L(\phi,D^{test})=\mathop{min}\limits_{\theta}L(u(\theta,D^{tr}),D^{test})$$

By the chain rule,

$$\frac{d}{d\theta}L(\phi,D^{test})=\nabla_{\phi}L(\phi,D^{test})\, d_{\theta}u(\theta,D^{tr})$$

For a single inner gradient step, let $u(\theta,D^{tr})=\theta -\alpha\, d_{\theta}L(\theta,D^{tr})$, so that

$$ d_{\theta}u(\theta,D^{tr})=I-\alpha \underbrace{d_{\theta}^2L(\theta,D^{tr})}_{\text{Hessian matrix}} $$
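A small numerical check of this identity on a quadratic loss, where the Hessian is known in closed form (an illustrative assumption):

```python
import torch

alpha = 0.1
A = torch.tensor([[2.0, 0.5], [0.5, 1.0]])      # Hessian of L below
theta = torch.randn(2, requires_grad=True)

def L(t):
    return 0.5 * t @ A @ t                       # quadratic loss, grad = A t

# one inner gradient step: u = theta - alpha * dL/dtheta
u = theta - alpha * torch.autograd.grad(L(theta), theta, create_graph=True)[0]
# Jacobian d_theta u, assembled row by row via autograd
J = torch.stack([torch.autograd.grad(u[i], theta, retain_graph=True)[0]
                 for i in range(2)])
assert torch.allclose(J, torch.eye(2) - alpha * A, atol=1e-5)  # I - alpha * Hessian
```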

Multiple Inner Gradient Steps

Optimization vs. Black-Box Adaptation

classification of skewed and scaled digits
