Stanford CS330
Introduction & Overview
Why multi-task learning and meta-learning?
Challenges
 not scalable to learn each new task from scratch
 not realistic to collect data all the time
 needs to be supervised
Benefits for deep learning
Unstructured input
neural networks can be applied to various tasks
Adaptation
Object classification & Machine translation
apply to different situations
Benefits for multi-task learning and meta-learning
 not enough data or paired data?
 long-tail data (such as autonomous driving)
How does it work for human?
Brief summary
Defining the task
It is the dataset $D$ and the loss function $L$ that jointly determine the model $f_\theta$ (the task)
 cross-entropy loss & mean squared error loss
 MNIST & Fashion MNIST for classifying
Critical Assumption
If not, splitting into separate single-task learning problems trained from scratch may be more sensible
Informal Problem Definitions
Q & A
Difference between meta-learning and transfer learning?
I think one aspect of this problem is that you want to be able to learn a new task more quickly, whereas in transfer learning you may also just want to perform well on a new task in the zero-shot setting, where you simply want to share representations. I actually view transfer learning as something that encapsulates both of these things, where you're thinking about how you can transfer information between different tasks, and that could also correspond to the multi-task learning problem as well as the meta-learning problem.
Learning to learn?
Yes
More than one task for meta-learning?
Yes, or break down the original task to subtasks.
Meta-learning and domain adaptation?
Likely related: in-domain & out-of-domain
Reduction?
Idea
$$ D=\bigcup_i D_i $$
$$ L=\sum_i L_i $$
Take the union of the datasets & sum the loss functions $\to$ this reduces multi-task learning to single-task learning
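As a sanity check, the reduction can be sketched in a few lines; the datasets and the squared-error loss below are toy values, purely for illustration:

```python
import numpy as np

# Toy check of the reduction: pooling the task datasets and summing the
# per-task losses give the same objective (data and loss are made up).
task_datasets = [
    [(np.array([0.0]), 1.0), (np.array([1.0]), 0.0)],  # D_1
    [(np.array([2.0]), 1.0)],                          # D_2
]

D = [pair for D_i in task_datasets for pair in D_i]    # D = union of D_i

def task_loss(theta, D_i):
    # squared-error loss, purely illustrative
    return sum((float(theta @ x) - y) ** 2 for x, y in D_i)

theta = np.array([0.5])
summed = sum(task_loss(theta, D_i) for D_i in task_datasets)  # L = sum of L_i
pooled = task_loss(theta, D)                                  # loss on the union
```

Because the total loss is a plain sum over examples, the two quantities coincide exactly.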
Why now?
These ideas are by no means new
Strong neural networks
Increasing role
Democratization
Multi-Task & Meta-Learning Basics
Reference
Maximum Likelihood Estimation, Fully Connected Neural Network, Negative Transfer, Conditional Independence
Multi-Task Learning Basics
Some Notation
$$ D=\{(x,y)_k\} $$
$$\mathop{min}\limits_{\theta} L(\theta,D)$$
$$L(\theta,D) = -E_{(x,y) \sim D}[\log f_{\theta}(y \mid x)]$$
$$T_i:=\{p_i(x),p_i(y \mid x),L_i\}$$
$$D_{i}^{tr} \quad D_i^{test}$$ In the future, we will use $D_i$ as shorthand for $D_i^{tr}$
Examples of Tasks
 Multi-task classification example: students and HR staff receiving emails from Warwick
 Multi-label learning example: detecting whether or not a person is wearing a hat / detecting hair color ($L_i$ and $p_i(x)$ are the same; however, $p_i(y \mid x)$ differs, since these are different binary classification tasks)
 $L_i$ may also vary sometimes
Introduction of task descriptor
$$\mathop{min}\limits_{\theta} \sum_{i=1}^{T}L_i(\theta,D_i)$$ Two questions to answer:
 How should we condition on the task descriptor $z_i$?
 How to optimize our objective?
Conditioning on the task
Assume that different tasks have the same input size and dimensions. For differing sizes, use an RNN- or attention-based model that aggregates across the varying dimensions.
 Simplest way

separate into several neural networks with completely different weights

condition on the task descriptor by picking the corresponding network via multiplicative gating

no sharing of previously trained parameters
 Concat $z_i$

almost all of the parameters are shared

only the weights immediately after $z_i$ (a fully connected layer on its features) differ across tasks (e.g., when $z_i$ is one-hot)
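A minimal sketch of this concatenation-based conditioning; the layer sizes and hand-picked weights are assumptions for illustration, not a reference implementation:

```python
import numpy as np

# Conditioning by concatenating a one-hot task descriptor z_i with the
# input x: the weight columns multiplying z_i are effectively task-specific,
# while everything else is shared. Weights here are toy values.
in_dim, num_tasks = 2, 2

# The first layer sees [x, z]; its last two columns multiply the descriptor.
W1 = np.array([[1.0, 1.0, 1.0, 0.0],
               [1.0, 1.0, 0.0, 1.0]])
W2 = np.array([[1.0, 2.0]])  # fully shared output layer

def forward(x, task_id):
    z = np.eye(num_tasks)[task_id]                    # one-hot task descriptor
    h = np.maximum(0.0, W1 @ np.concatenate([x, z]))  # ReLU hidden layer
    return W2 @ h

x = np.ones(in_dim)
y0 = forward(x, task_id=0)
y1 = forward(x, task_id=1)  # same x, different task -> different output
```

Switching the one-hot descriptor changes which column of `W1` is active, so the same input is mapped differently per task while the rest of the network is shared.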
An Alternative View on the Multi-Task Objective
Split $\theta$, optimize $\theta^{sh}$ jointly and each $\theta^i$ separately
Choosing how to condition on $z_i$ $\iff$ Choosing how & where to share parameters
Conditioning:
Some Common Choices
If more information is given (such as the degree to which two tasks are similar to each other), it can be fed into $z_i$ of the neural network. However, determining how content is shared is quite a big problem in multi-task learning (it may be figured out during the learning process rather than before training).
Choose which parts of the network should be used for each task → modulate different features (completely turn off some features / use only some heads for a given task)
More Complex Choices
Conditioning Choices
Optimizing the objective
Basically similar to single-task learning; the relative importance of the different tasks may need to be tuned manually. For regression problems, make sure the labels are on the same scale; otherwise the task whose labels have a greater magnitude ends up with a loss of greater scale and dominates training.
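The basic loop implied here (sample a mini-batch of tasks, sum their losses, step on the shared parameters) might look like the following toy sketch; the linear model and random data are assumptions purely for illustration:

```python
import numpy as np

# Vanilla multi-task training sketch: sample a mini-batch of tasks, sum
# their losses, and take a gradient step on the shared parameters theta.
rng = np.random.default_rng(1)
tasks = [(rng.normal(size=(16, 3)), rng.normal(size=16)) for _ in range(4)]
theta = np.zeros(3)

def loss_and_grad(theta, X, y):
    err = X @ theta - y
    return np.mean(err ** 2), 2.0 * X.T @ err / len(y)

lr = 0.05
for step in range(50):
    # sample a mini-batch of 2 tasks per step
    idx = rng.choice(len(tasks), size=2, replace=False)
    grad = sum(loss_and_grad(theta, *tasks[i])[1] for i in idx)
    theta -= lr * grad

# total loss over all tasks after training
total = sum(loss_and_grad(theta, X, y)[0] for X, y in tasks)
```

Task-importance weights would enter as per-task multipliers on the summed gradients.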
Challenges
Negative transfer
Inverse question: why do we expect positive transfer? We expect that when you don't have a lot of data per task and the tasks are related, the features and representations learned for one task will be useful for another task.
Combine two different neural networks with the same structure? Naively create a task selector → or try to learn a single network
Overfitting
Case study
User Engagement & User Satisfaction
Framework Set-Up
The Ranking Problem
The Architecture
Experiments
Experts specialize in one or a few tasks (polarization) → a task ends up using one expert, or no expert at all
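For concreteness, a multi-gate mixture-of-experts forward pass of the kind discussed in this case study can be sketched as follows; the expert and gate weights are toy assumptions, not the production system:

```python
import numpy as np

# Illustrative multi-gate mixture-of-experts: each task has its own softmax
# gate over a shared pool of experts (all weights here are toy values).
def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def mmoe_forward(x, expert_ws, gate_ws):
    experts = np.stack([W @ x for W in expert_ws])  # one output per expert
    outs = []
    for G in gate_ws:                               # one softmax gate per task
        w = softmax(G @ x)                          # mixture weights over experts
        outs.append(w @ experts)                    # task-specific expert mix
    return outs

x = np.array([1.0, 2.0])
expert_ws = [np.eye(2), -np.eye(2)]                 # two toy experts
gate_ws = [np.array([[5.0, 0.0], [-5.0, 0.0]]),     # task 1 gate
           np.array([[-5.0, 0.0], [5.0, 0.0]])]     # task 2 gate
y1, y2 = mmoe_forward(x, expert_ws, gate_ws)
```

With these gates, each task's mixture collapses onto a single expert, which mirrors the polarization effect noted above.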
Meta-Learning Basics
Two ways to view meta-learning algorithms
Problem definitions (probabilistic view)
Tip: it doesn't need to be an image classification problem
The meta-learning problem
assume that $\phi$ and $D_{\text{meta-train}}$ are conditionally independent given $\theta$ (if the parameters learned from the meta-training dataset are given, the new task-specific parameters and the meta-training dataset are independent)
$$ \theta^*=\mathop{\arg\max}_{\theta}\log p(\theta \mid D_{\text{meta-train}}) $$
A Quick Example
How do we train this thing?
At meta-training time, how can we mimic the test-time setting without access to the meta-test data?
Reserve a test set for each task!
Shift
 training set → meta-training set
 test set → meta-test set
The complete meta-learning optimization
$\theta$ is what is shared among the different tasks
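Written out, the optimization referred to here takes the following form (reconstructed from the standard formulation rather than quoted verbatim; $n$ denotes the number of meta-training tasks):

$$ \theta^{*}=\mathop{\arg\max}\limits_{\theta}\sum_{i=1}^{n}\log p(\phi_{i} \mid D_i^{ts}), \quad \text{where} \quad \phi_{i}=f_{\theta}(D_i^{tr}) $$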
Some meta-learning terminology
k-shot learning means learning from k examples
Closely related problem settings
Meta-Learning Recipe, Black-Box Adaptation, Optimization-Based Approaches
Reference
Tips
| Symbol | Meaning |
| --- | --- |
| $D_i$ | meta-training dataset |
| $D$ | meta-testing dataset |
| $\theta$ | meta-parameters |
| $\phi$ | task-specific parameters |
Recap from Last Time
General Recipe
How to evaluate a metalearning algorithm
The Meta-Learning Problem: The Mechanistic View
The Meta-Learning Problem: The Probabilistic View
Assumption: the training tasks and the new task are drawn from the same distribution
How to design a metalearning algorithm
Black-Box Adaptation
Overview
$f_{\theta}$ processes $D_i^{tr}$ in a sequential fashion or as one batch. Train with standard supervised learning: maximize the probability of the labels under the distribution that $g_{\phi_i}$ produces.
$$ \begin{gather} \mathop{max}\limits_{\theta} \sum\limits_{T_{i}} \sum\limits_{(x,y) \sim D_i^{test}}\log g_{\phi_{i}}(y \mid x)=\mathop{max}\limits_{\theta} \sum\limits_{T_{i}}L(f_{\theta}(D_i^{tr}),D_i^{test}) \end{gather} $$
In the meta-learning process, $\theta$ is learned, while $\phi$ can be viewed as activations/tensors rather than actual parameters
Meta-Training Part
 all of $\theta$ are the meta-parameters, and $\phi$ is considered the task-specific parameters
 we update $\theta$ and do not update $\phi$, since $\phi$ is dynamically computed anew at every iteration
 computing the gradient $\nabla_{\theta}L(\phi_i,D_i^{test})$ for $\theta$ requires backpropagating through $\phi$
Challenge
If $\phi$ literally represents all the parameters of another neural network, it may not be that scalable to actually output all of those parameters, because neural networks are very large.
Instead, output sufficient statistics $h_i$ (a lower-dimensional vector, similar to the hidden state of an LSTM): $$\phi_i=\{h_i,\theta_{g}\}$$ $$y^{ts}=f_{\theta}(D_i^{tr},x^{ts})$$
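A toy sketch of this idea, with a deliberately simple (and entirely assumed) choice of aggregator and head; a real system would use an RNN or transformer rather than a mean:

```python
import numpy as np

# Black-box adaptation sketch: f_theta summarizes the task's training set
# into low-dimensional sufficient statistics h_i, and a shared head g uses
# [h_i, x] to predict. All functions and weights are toy assumptions.
def f_theta(D_tr):
    # h_i: mean of the embedded [x, y] pairs (toy choice of aggregator)
    return np.mean([np.append(x, y) for x, y in D_tr], axis=0)

def g(h_i, x, W):
    # phi_i = {h_i, theta_g}: only theta_g (here W) is a trained parameter;
    # h_i is computed fresh for every task
    return W @ np.concatenate([h_i, x])

D_tr = [(np.array([1.0, 0.0]), 2.0), (np.array([0.0, 1.0]), 4.0)]
h = f_theta(D_tr)
W = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
y_pred = g(h, np.array([1.0, 1.0]), W)
```

The key point the sketch illustrates: $h_i$ is an activation produced from $D_i^{tr}$, not a set of network weights, so its size stays fixed no matter how large the predictor is.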
Architecture
Pros and Cons
Optimization-Based Inference
Inference problem → optimization procedure $$\mathop{max}\limits_{\phi_{i}}\log p(D_i^{tr} \mid \phi_i)+\log p(\phi_{i} \mid \theta)$$ This breaks the meta-training problem down into two terms:
 maximizing the likelihood of the training data given the task-specific parameters
 maximizing the likelihood of the task-specific parameters under the meta-parameters
Fine-tuning
$$\phi \leftarrow \theta - \alpha \nabla_{\theta} L(\theta, D^{tr})$$
Ways of acquiring pre-trained parameters in different fields, and common fine-tuning practices, are listed above
Training from scratch versus using a pre-trained model; bigger datasets versus smaller datasets
Meta-learning
$$\mathop{min}\limits_{\theta} \sum\limits_{task \; i}L(\theta - \alpha \nabla_{\theta} L(\theta, D_i^{tr}),D_i^{ts}) $$
Notation
 It's not a two-dimensional problem
 It's actually a whole space of optima rather than a single optimum
meta-learning + multi-task learning
Formula
Single Inner Gradient Steps
Write $d$ for the total derivative, $\nabla$ for the partial derivative, $u$ for the update rule, and $D^{test}$ for $D_i^{test}$: $$\phi=u(\theta,D^{tr})$$ $$\mathop{min}\limits_{\theta}L(\phi,D^{test})=\mathop{min}\limits_{\theta}L(u(\theta,D^{tr}),D^{test})$$ $$\frac{d}{d\theta}L(\phi,D^{test})=\nabla_{\phi}L(\phi,D^{test})\,d_{\theta}u(\theta,D^{tr})$$ $$\text{Let} \quad u(\theta,D^{tr})=\theta - \alpha\, d_{\theta}L(\theta,D^{tr})$$
$$ d_{\theta}u(\theta,D^{tr})=I-\alpha \underbrace{d_{\theta}^2L(\theta,D^{tr})}_{\text{Hessian matrix}} $$
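The single-inner-step derivative can be checked numerically on a scalar example; the task data and step size below are made up for illustration, and the derivatives are analytic because the inner loss is quadratic:

```python
import numpy as np

# Single inner gradient step on a 1-D least-squares task (illustrative
# sketch). Inner loss: L(theta, D) = mean((theta*x - y)^2), so its second
# derivative is mean(2*x^2).
def loss(theta, D):
    x, y = D
    return np.mean((theta * x - y) ** 2)

def grad(theta, D):
    x, y = D
    return np.mean(2.0 * (theta * x - y) * x)

alpha = 0.1

def u(theta, D_tr):
    return theta - alpha * grad(theta, D_tr)      # phi = u(theta, D^tr)

def meta_grad(theta, D_tr, D_ts):
    # d/dtheta L(phi, D^ts) = dL/dphi * du/dtheta, with
    # du/dtheta = 1 - alpha * (second derivative of the inner loss)
    x_tr, _ = D_tr
    hess = np.mean(2.0 * x_tr ** 2)
    return grad(u(theta, D_tr), D_ts) * (1.0 - alpha * hess)

D_tr = (np.array([1.0, 2.0]), np.array([2.0, 3.0]))
D_ts = (np.array([3.0]), np.array([5.0]))
g = meta_grad(0.0, D_tr, D_ts)
```

Comparing `meta_grad` against a finite-difference derivative of the outer objective confirms the chain-rule expression, second-derivative term included.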
Multiple Inner Gradient Steps
Optimization vs. Black-Box Adaptation