Methods for data-driven modelling

Place: Université Paris Cité
Semester: Fall
Level: Master 2, 6 ECTS, taught in English
Instructor(s): Rémi MONASSON (ENS-PSL, Polytechnique, CNRS)
Teaching Assistants: Jorge FERNANDEZ DE COSSIO DIAZ (ENS-PSL), Simona COCCO (ENS-PSL, CNRS)

Modern physics is characterized by the increasing complexity of the systems under investigation, in domains as diverse as condensed matter, astrophysics and biophysics. Thanks to the growing availability of experimental data, data-driven modelling is emerging as a powerful way to study these systems. The objective of the course is to provide the theoretical concepts and practical tools necessary to understand and use these approaches.

Complex systems, characterized by pervasive, non-homogeneous and strong interactions as well as out-of-equilibrium dynamical effects, are extremely challenging to model. Determining the relevant degrees of freedom, how they interact, and how they shape the collective behaviour of these systems is often out of reach of first-principles approaches. Data-driven modelling is emerging as an alternative approach in many fields, and is at the origin of recent breakthroughs, for instance in protein folding. Its use raises many questions from the statistical point of view (the quality and quantity of data necessary to reach good results), the physical point of view (the interpretability of these models, and how they reveal relevant mechanisms), and the computational point of view (the efficiency and complexity of the algorithms).

The objectives of this course are twofold. First, we will provide statistical inference and machine learning tools to extract information and learn models from data. The lectures will start from the basics of Bayesian inference, and then present important concepts and tools in unsupervised and supervised learning. Emphasis will be put on the connections with statistical physics.

Second, each theoretical lecture will be followed by a tutorial illustrating the concepts with practical applications borrowed from various domains of physics or data science. We will focus on methods and on the interpretation of the results, not on programming and heavy numerics!

Syllabus

Week 1: What is Bayesian inference?

Bayes' rule, notions of prior, likelihood and posterior, two historical illustrations: the German Tank and the Boy/Girl Birth Rate problems

Tutorial: Diffusion coefficient from single-particle tracking
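
As a taste of Week 1's ingredients, Bayes' rule on the German Tank problem can be sketched in a few lines of Python (the serial numbers, function name and prior cutoff below are illustrative, not course material): with a flat prior on the total number N of tanks and serial numbers uniform on {1, …, N}, the posterior is proportional to N^(-k) for every N at least as large as the largest observed serial.

```python
import numpy as np

def tank_posterior(serials, n_max=500):
    """Posterior over the total number N of tanks, flat prior up to n_max."""
    m, k = max(serials), len(serials)
    ns = np.arange(1, n_max + 1)
    # Likelihood of the sample given N: (1/N)^k if N >= max observed, else 0
    like = np.where(ns >= m, 1.0 / ns.astype(float) ** k, 0.0)
    post = like / like.sum()          # flat prior => posterior ∝ likelihood
    return ns, post

ns, post = tank_posterior([37, 84, 19, 60], n_max=500)
print(ns[np.argmax(post)])            # posterior mode = largest observed serial (84)
```

Since N^(-k) is decreasing in N, the posterior mode always sits at the largest observed serial number, while the posterior mean accounts for the tanks one has not seen.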

Week 2: Asymptotic inference

Rate of convergence, Kullback-Leibler divergence, Fisher information, variational inference, illustration: mean field in stat. mech.

Tutorial: Counting photons in a QED cavity from quantum trajectories of atoms (1)
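
The link between the Kullback-Leibler divergence and the Fisher information can be checked numerically with a minimal sketch (a Bernoulli model; the parameter values are ours): for nearby parameters, D(p‖p+ε) ≈ ε² I(p)/2, with I(p) = 1/(p(1−p)).

```python
import numpy as np

def kl_bernoulli(p, q):
    # Kullback-Leibler divergence D(p||q) between Bernoulli(p) and Bernoulli(q)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, eps = 0.3, 1e-3
fisher = 1.0 / (p * (1 - p))          # Fisher information of a Bernoulli
approx = 0.5 * fisher * eps ** 2      # local quadratic expansion of the KL
exact = kl_bernoulli(p, p + eps)
print(abs(exact - approx) / exact)    # relative error of order eps
```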

Week 3: Entropy and information - Application to dimensional reduction

Shannon's entropy, principle of maximum entropy, mutual information, principal and independent component analysis

Tutorial: Counting photons in a QED cavity from quantum trajectories of atoms (2)
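
Shannon's entropy and the mutual information of Week 3 can be illustrated directly from a joint probability table (the 2×2 table below is made up), using the identity I(X;Y) = H(X) + H(Y) − H(X,Y):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))    # Shannon entropy in bits

# Illustrative joint distribution of two correlated binary variables
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
mi = entropy(pxy.sum(1)) + entropy(pxy.sum(0)) - entropy(pxy.ravel())
print(mi)                             # positive: X and Y share information
```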

Week 4: Phase transition in high-dimensional settings: principal component analysis

Spiked covariance model, large-dimensional setting and spectrum of random correlation matrices, the retarded-learning phase transition

Tutorial: Replay of neural activity during sleep following task learning
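
The phase transition of Week 4 can be previewed with a quick simulation (the sizes and spike strength below are illustrative): a planted spike stronger than the √(d/n) threshold pushes one eigenvalue of the sample covariance matrix out of the Marchenko-Pastur bulk.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, s = 2000, 500, 2.0                 # ratio d/n = 0.25, spike strength s
X = rng.normal(size=(n, d))
X[:, 0] *= np.sqrt(1.0 + s)              # population covariance I + s e1 e1^T
evals = np.linalg.eigvalsh(X.T @ X / n)  # ascending eigenvalues
mp_edge = (1.0 + np.sqrt(d / n)) ** 2    # Marchenko-Pastur upper edge
print(evals[-1], mp_edge)                # top eigenvalue pops out above the bulk
```

Here s = 2 is well above the threshold √(d/n) = 0.5, so the outlier is visible; below threshold, the spike is swallowed by the bulk and learning is "retarded".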

Week 5: Phase transition in high-dimensional settings: regression

Linear regression, L2 prior, cross-validation, harmful and benign overfitting in high-dimensional inference

Tutorial: Characterization of colliding supernovae from gravitational waves (1)
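
Week 5's linear regression with an L2 prior is ordinary ridge regression, which has the closed form w = (XᵀX + λI)⁻¹ Xᵀy. A self-contained sketch (the data, noise level and regularization strength are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy linear observations

lam = 1.0                                    # L2 prior strength
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.linalg.norm(w_hat - w_true))        # small at this noise level
```

The L2 prior shrinks the estimate toward zero: the ridge solution always has a smaller norm than the ordinary least-squares one.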

Week 6: Priors: sparsity and beyond

L1 prior, conjugate priors and pseudo-counts, shrinkage, universal priors

Tutorial: Characterization of colliding supernovae from gravitational waves (2)
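
The pseudo-count interpretation of conjugate priors can be shown in two lines (the prior and data values are illustrative): a Beta(a, b) prior for a Bernoulli likelihood acts exactly like a extra observed heads and b extra observed tails.

```python
# Conjugate Beta prior for a Bernoulli likelihood: the pseudo-counts a, b are
# simply added to the observed counts in the posterior mean (values made up).
a, b = 2.0, 2.0                 # Beta(2,2) prior = 2 pseudo-heads, 2 pseudo-tails
heads, tails = 7, 3             # observed data
post_mean = (a + heads) / (a + b + heads + tails)
print(post_mean)                # (2+7)/(4+10) = 9/14 ≈ 0.643
```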

Week 7: Graphical models: learning many interactions

Boltzmann machines (BM), Monte Carlo sampling, convexity of the log-likelihood, BM learning, mean-field inference, pseudo-likelihood method

Tutorial: Inferring structural contacts from protein sequences (1)
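
Monte Carlo sampling from a Boltzmann machine can be sketched on the smallest possible example (two ±1 spins with a single coupling J; the model size and parameters are ours): Gibbs sampling reproduces the exact pair correlation ⟨s₁s₂⟩ = tanh(J).

```python
import numpy as np

rng = np.random.default_rng(4)
J = 1.0
s = np.array([1, 1])
samples = []
for t in range(20000):
    for i in (0, 1):
        h = J * s[1 - i]                       # local field from the other spin
        p_up = 1.0 / (1.0 + np.exp(-2 * h))    # P(s_i = +1 | other spin)
        s[i] = 1 if rng.random() < p_up else -1
    if t > 1000:                               # discard burn-in sweeps
        samples.append(s[0] * s[1])
corr = np.mean(samples)
print(corr, np.tanh(J))                        # Gibbs estimate vs exact value
```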

Week 8: Unsupervised learning: representations and generation

Notion of representation, Autoencoders, restricted Boltzmann machines, Auto-regressive models

Tutorial: Inferring structural contacts from protein sequences (2)

Week 9: Supervised learning: support vector machines

Linear classifiers, enumeration of dichotomies, perceptron learning algorithm, Kernel methods

Tutorial: Interpretable representations of 2D disks by auto-encoders
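
The perceptron learning algorithm of Week 9 fits in a few lines (the teacher vector and data are a made-up toy setup, with an enforced margin so the classical convergence theorem applies): on each misclassified point, add y·x to the weights.

```python
import numpy as np

rng = np.random.default_rng(1)
teacher = np.array([1.0, -2.0])          # hypothetical target direction
X = rng.normal(size=(200, 2))
keep = np.abs(X @ teacher) > 0.5         # enforce a margin: convergence is guaranteed
X = X[keep]
y = np.sign(X @ teacher)

w = np.zeros(2)
for _ in range(1000):                    # epochs over the training set
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:           # misclassified (or on the boundary)
            w += yi * xi                 # perceptron update
            mistakes += 1
    if mistakes == 0:                    # converged: all points separated
        break

print(np.all(np.sign(X @ w) == y))       # → True
```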

Week 10: Supervised learning: learning curves and multilayer nets

Statistical mechanics of one- and two-layer neural nets

Tutorial: Classification of MNIST digits

Week 11: Learning from streaming data

On-line classification, on-line PCA (Oja's rule) and sparse PCA

Tutorial: TBA
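
Oja's rule for on-line PCA can be simulated on streaming Gaussian data (the covariance matrix and learning rate are illustrative): the update Δw = η y (x − y w), with y = w·x, drives w toward the unit-norm leading eigenvector of the data covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
C = np.array([[3.0, 1.0],
              [1.0, 1.0]])               # covariance with a dominant direction
L = np.linalg.cholesky(C)                # so that x = L g has covariance C
w = rng.normal(size=2)
eta = 0.01
for _ in range(20000):
    x = L @ rng.normal(size=2)           # one sample from the data stream
    y = w @ x
    w += eta * y * (x - y * w)           # Oja's rule

top = np.linalg.eigh(C)[1][:, -1]        # leading eigenvector of C
print(abs(w @ top))                      # close to 1: w aligns with the top PC
```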

Week 12: Time series analysis (1): hidden Markov models

Markov and hidden Markov processes, Transfer matrix calculations, Viterbi algorithm, Expectation-Maximization procedure

Tutorial: Identification of recombination events in SARS-CoV-2 
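
The Viterbi algorithm of Week 12 is a dynamic program over log-probabilities of state paths; here is a compact sketch for a two-state hidden Markov model (all the probabilities below are made up):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path given observations obs."""
    T, K = len(obs), len(pi)
    logd = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)      # [from-state, to-state]
        back[t] = scores.argmax(0)              # best predecessor per state
        logd = scores.max(0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]                 # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])          # sticky transition matrix
B = np.array([[0.8, 0.2], [0.2, 0.8]])          # emission probabilities
print(viterbi([0, 0, 1, 1, 1], pi, A, B))       # → [0, 0, 1, 1, 1]
```

With sticky transitions, the decoded path follows the observations while resisting isolated flips.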

Week 13: Time series analysis (2): recurrent neural nets

Universal approximation theorem, low-rank recurrent networks: justification and analysis, some applications

Tutorial: TBA

Prerequisites

Basic level in statistical physics. The programming language we use is Python 3, but no previous experience in programming is required.


From a practical point of view, make sure your computer is properly set up for the course. Here are brief instructions for what you should do:

1. Install Anaconda, following the instructions here: https://www.anaconda.com/download

2. Then install Pytorch. To do this, open a terminal and run:

conda install pytorch torchvision torchaudio cpuonly -c pytorch

Make sure you are able to load all the packages. To test this, start Python and run:

>>> import torch, numpy, scipy, matplotlib

This should not produce any errors.


 

Evaluation

Homework at the end of October

Written exam at the beginning of January, covering both the theoretical part of the course and the practical (computational and statistical inference) aspects.