Methods for data-driven modelling
Modern physics is characterized by the increasing complexity of the systems under investigation, in domains as diverse as condensed matter, astrophysics, and biophysics. Thanks to the growing availability of experimental data, data-driven modelling is emerging as a powerful way to study these systems. The objective of the course is to provide the theoretical concepts and practical tools needed to understand and use these approaches.
Complex systems, characterized by pervasive, non-homogeneous, strong interactions and by out-of-equilibrium dynamical effects, are extremely challenging to model. Determining the relevant degrees of freedom, and how they interact to shape the collective behaviour of these systems, is often out of reach of first-principles approaches. Data-driven modelling is emerging as an alternative in many fields, and is at the origin of recent breakthroughs in protein folding, for instance. Its use raises many questions, from the statistical (what quality and quantity of data are necessary to reach good results), to the physical (how interpretable are these models, and which relevant mechanisms do they reveal), and the computational (how efficient and complex are the algorithms) points of view.
The objectives of this course are twofold. First, we will provide statistical inference and machine learning tools to extract information and learn models from data. The lectures will start from the basics of Bayesian inference, and then present important concepts and tools in unsupervised and supervised learning. Emphasis will be placed on the connections with statistical physics.
Second, each theoretical lecture will be followed by a tutorial illustrating the concepts with practical applications borrowed from various domains of physics or data science. We will focus on methods and on the interpretation of the results, not on programming and heavy numerics!
Week 1: What is Bayesian inference?
Bayes' rule, notions of prior, likelihood and posterior, two historical illustrations: the German Tank and the Boy/Girl Birth Rate problems
Tutorial: Diffusion coefficient from single-particle tracking
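As a first taste of these ideas, here is a minimal sketch (not course material; the observations, the flat prior and its cap N_max are illustrative assumptions, and draws are taken with replacement for simplicity) of the Bayesian treatment of the German Tank problem: inferring the total number N of tanks from observed serial numbers.

    import numpy as np

    # German tank problem: serial numbers are drawn uniformly from 1..N;
    # infer the unknown total N. Flat prior on N, cut off at an arbitrary cap.
    observed = np.array([14, 47, 61])          # hypothetical observations
    m, k = observed.max(), len(observed)
    N_max = 1000                               # prior cap (assumption)

    N = np.arange(m, N_max + 1)
    likelihood = 1.0 / N.astype(float) ** k    # each draw has probability 1/N
    posterior = likelihood / likelihood.sum()  # flat prior: normalize the likelihood

    print("posterior mean of N:", (N * posterior).sum())
    print("posterior mode of N:", N[posterior.argmax()])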
Week 2: Asymptotic inference
Rate of convergence, Kullback-Leibler divergence, Fisher information, variational inference; illustration: mean-field approximation in statistical mechanics
Tutorial: Counting photons in a QED cavity from quantum trajectories of atoms (1)
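To connect the Week 2 notions with code, here is a minimal sketch (means, standard deviations and sample size are illustrative assumptions) comparing a Monte Carlo estimate of the Kullback-Leibler divergence between two Gaussians with its analytic value:

    import numpy as np

    # KL divergence between two Gaussians, estimated from samples and compared
    # to the analytic value KL(p||q) = log(s_q/s_p) + (s_p^2 + (m_p-m_q)^2)/(2 s_q^2) - 1/2.
    rng = np.random.default_rng(5)
    m_p, s_p, m_q, s_q = 0.0, 1.0, 1.0, 2.0

    x = rng.normal(m_p, s_p, size=100000)            # samples from p
    log_p = -0.5 * ((x - m_p) / s_p) ** 2 - np.log(s_p)   # log-densities up to a
    log_q = -0.5 * ((x - m_q) / s_q) ** 2 - np.log(s_q)   # common constant, which cancels
    print("Monte Carlo estimate:", np.mean(log_p - log_q))
    print("analytic value      :", np.log(s_q / s_p) + (s_p**2 + (m_p - m_q)**2) / (2 * s_q**2) - 0.5)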
Week 3: Entropy and information - Application to dimensional reduction
Shannon's entropy, principle of maximum entropy, mutual information, principal and independent component analysis
Tutorial: Counting photons in a QED cavity from quantum trajectories of atoms (2)
Week 4: Phase transition in high-dimensional settings: principal component analysis
Spiked covariance model, large-dimensional setting and spectrum of random correlation matrices, the phase transition: when is learning retarded?
Tutorial: Replay of neural activity during sleep following task learning
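The Week 4 phase transition can be observed numerically. Below is a minimal sketch (sizes, spike strength and the planted-direction construction are illustrative assumptions) of the spiked covariance model: above the BBP threshold, the top sample eigenvalue detaches from the Marchenko-Pastur bulk.

    import numpy as np

    # Spiked covariance toy: rows of X are standard Gaussian plus a planted
    # direction u with extra variance s, so Cov = I + s u u^T. In the limit
    # n, d -> inf with d/n = gamma, the top sample eigenvalue exits the
    # Marchenko-Pastur bulk edge (1 + sqrt(gamma))^2 only if s > sqrt(gamma).
    rng = np.random.default_rng(0)
    n, d, s = 2000, 1000, 2.0              # hypothetical sizes and spike strength
    gamma = d / n

    u = rng.standard_normal(d); u /= np.linalg.norm(u)    # planted direction
    X = rng.standard_normal((n, d)) + np.sqrt(s) * rng.standard_normal((n, 1)) * u

    evals = np.linalg.eigvalsh(X.T @ X / n)               # ascending order
    print("top sample eigenvalue:", evals[-1])
    print("MP bulk edge         :", (1 + np.sqrt(gamma)) ** 2)
    print("BBP prediction       :", (1 + s) * (1 + gamma / s) if s > np.sqrt(gamma) else "inside bulk")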
Week 5: Phase transition in high-dimensional settings: regression
Linear regression, L2 prior, cross-validation, harmful and benign overfitting in high-dimensional inference
Tutorial: Characterization of colliding supernovae from gravitational waves (1)
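As an illustration of the Week 5 material, here is a minimal sketch of ridge (L2-penalized) regression with validation on held-out data; the sizes, noise level and candidate penalties are illustrative assumptions.

    import numpy as np

    # Ridge regression: w_hat = argmin ||y - Xw||^2 + lam ||w||^2, with the
    # closed form w_hat = (X^T X + lam I)^{-1} X^T y. The L2 penalty corresponds
    # to a Gaussian prior on w; lam is selected on held-out data.
    rng = np.random.default_rng(1)
    n, d = 100, 200                        # hypothetical sizes, d > n: overparametrized
    w_true = rng.standard_normal(d) / np.sqrt(d)
    X = rng.standard_normal((n, d))
    y = X @ w_true + 0.1 * rng.standard_normal(n)

    X_tr, y_tr, X_va, y_va = X[:80], y[:80], X[80:], y[80:]
    for lam in [1e-3, 1e-1, 1e1]:
        w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
        print(f"lambda={lam:g}  validation MSE={np.mean((X_va @ w - y_va)**2):.4f}")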
Week 6: Priors: sparsity and beyond
L1 prior, conjugate priors and pseudo-counts, shrinkage, universal priors
Tutorial: Characterization of colliding supernovae from gravitational waves (2)
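One concrete effect of an L1 prior is shrinkage by soft-thresholding. The sketch below (signal, noise level and threshold are illustrative assumptions) shows the MAP estimate of a sparse signal observed in Gaussian noise, with small coefficients set exactly to zero.

    import numpy as np

    # Under a Laplace (L1) prior, the coordinate-wise MAP estimate of a signal
    # observed in Gaussian noise is soft-thresholding: small entries vanish.
    def soft_threshold(x, lam):
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

    rng = np.random.default_rng(6)
    w_true = np.zeros(50); w_true[:5] = 3.0 * rng.standard_normal(5)   # sparse signal
    y = w_true + 0.5 * rng.standard_normal(50)                          # noisy observation
    w_map = soft_threshold(y, lam=1.0)        # threshold absorbs the noise variance
    print("nonzero coefficients recovered:", np.flatnonzero(w_map))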
Week 7: Graphical models: learning many interactions
Boltzmann machines (BM), Monte Carlo sampling, convexity of the log-likelihood, BM learning, mean-field inference, pseudo-likelihood method
Tutorial: Inferring structural contacts from protein sequences (1)
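Sampling is the costly inner loop of Boltzmann machine learning. Here is a minimal Metropolis sketch (couplings, fields and run lengths are illustrative assumptions) for a pairwise model over +/-1 spins, producing the moment estimates that BM learning matches to data.

    import numpy as np

    # Metropolis sampling from P(s) ~ exp(sum_i h_i s_i + sum_{i<j} J_ij s_i s_j)
    # over spins s in {-1,+1}, with symmetric couplings and zero diagonal.
    rng = np.random.default_rng(2)
    d = 20
    J = rng.standard_normal((d, d)) / np.sqrt(d)
    J = (J + J.T) / 2; np.fill_diagonal(J, 0)
    h = 0.1 * rng.standard_normal(d)

    s = rng.choice([-1, 1], size=d)
    samples = []
    for t in range(20000):
        i = rng.integers(d)
        dE = 2 * s[i] * (h[i] + J[i] @ s)     # energy change of flipping spin i
        if dE < 0 or rng.random() < np.exp(-dE):
            s[i] = -s[i]
        if t > 5000 and t % 10 == 0:          # collect after burn-in, thinned
            samples.append(s.copy())

    print("estimated <s_i> (first 5):", np.mean(samples, axis=0)[:5])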
Week 8: Unsupervised learning: representations and generation
Notion of representation, autoencoders, restricted Boltzmann machines, autoregressive models
Tutorial: Inferring structural contacts from protein sequences (2)
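To make the notion of representation concrete, here is a minimal PyTorch autoencoder sketch (architecture and synthetic data are illustrative assumptions): the data lie on a 2-dimensional subspace of a 64-dimensional space, so a 2-dimensional latent layer suffices to reconstruct them.

    import torch
    from torch import nn

    # Autoencoder: compress 64-dim inputs to a 2-dim latent code and reconstruct.
    class AutoEncoder(nn.Module):
        def __init__(self, d_in=64, d_latent=2):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_latent))
            self.decoder = nn.Sequential(nn.Linear(d_latent, 32), nn.ReLU(), nn.Linear(32, d_in))
        def forward(self, x):
            return self.decoder(self.encoder(x))

    torch.manual_seed(0)
    x = torch.randn(256, 2) @ torch.randn(2, 64)   # data on a 2-dim subspace
    model = AutoEncoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(500):
        loss = nn.functional.mse_loss(model(x), x)
        opt.zero_grad(); loss.backward(); opt.step()
    print("final reconstruction MSE:", loss.item())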
Week 9: Supervised learning: support vector machines
Linear classifiers, enumeration of dichotomies, perceptron learning algorithm, kernel methods
Tutorial: Interpretable representations of 2D disks by auto-encoders
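The perceptron learning algorithm of Week 9 fits in a few lines. Below is a minimal sketch on linearly separable data generated by a hypothetical teacher vector (sizes and seed are illustrative):

    import numpy as np

    # Perceptron learning: cycle through the examples and update w whenever a
    # point is misclassified. Converges if the data are linearly separable.
    rng = np.random.default_rng(3)
    n, d = 200, 5
    w_teacher = rng.standard_normal(d)        # teacher defining separable labels
    X = rng.standard_normal((n, d))
    y = np.sign(X @ w_teacher)

    w = np.zeros(d)
    for epoch in range(100):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:          # misclassified (or on the boundary)
                w += y_i * x_i                # perceptron update
                errors += 1
        if errors == 0:
            print(f"converged after {epoch + 1} epochs")
            break
    print("training accuracy:", np.mean(np.sign(X @ w) == y))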
Week 10: Supervised learning: learning curves and multilayer nets
Statistical mechanics of one- and two-layer neural nets
Tutorial: Classification of MNIST digits
Week 11: Learning from streaming data
On-line classification, on-line PCA (Oja's rule) and sparse PCA
Tutorial: TBA
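Oja's rule is a one-line update. Here is a minimal sketch (dimension, learning rate and the planted leading direction are illustrative assumptions) estimating the top principal component from streaming samples:

    import numpy as np

    # Oja's rule: online estimate of the top principal component from a stream,
    # w <- w + eta * y * (x - y * w) with y = w.x; the (x - y w) term keeps ||w|| ~ 1.
    rng = np.random.default_rng(4)
    d = 10
    pc = np.zeros(d); pc[0] = 1.0             # planted leading direction
    w = rng.standard_normal(d); w /= np.linalg.norm(w)

    eta = 0.01
    for t in range(20000):
        x = rng.standard_normal(d) + 2.0 * rng.standard_normal() * pc  # streamed sample
        y = w @ x
        w += eta * y * (x - y * w)            # Oja update

    print("overlap with true PC:", abs(w @ pc))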
Week 12: Time series analysis (1): hidden Markov models
Markov and hidden Markov processes, transfer-matrix calculations, Viterbi algorithm, expectation-maximization (EM) procedure
Tutorial: Identification of recombination events in SARS-CoV-2
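The Viterbi algorithm of Week 12 is a short dynamic program. Below is a minimal sketch with hypothetical transition and emission matrices, decoding the most likely hidden path in log space:

    import numpy as np

    # Viterbi: most likely hidden path of an HMM by dynamic programming.
    A = np.log(np.array([[0.9, 0.1],      # state transition probabilities
                         [0.2, 0.8]]))
    B = np.log(np.array([[0.7, 0.3],      # emission probabilities
                         [0.1, 0.9]]))
    pi = np.log(np.array([0.5, 0.5]))     # initial state distribution
    obs = [0, 0, 1, 1, 0]                 # observed symbol sequence

    T, K = len(obs), 2
    delta = np.zeros((T, K)); psi = np.zeros((T, K), dtype=int)
    delta[0] = pi + B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + A    # scores[i, j]: best path ending i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + B[:, obs[t]]

    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):             # backtrack through the pointers
        path.append(int(psi[t][path[-1]]))
    print("most likely state path:", path[::-1])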
Week 13: Time series analysis (2): recurrent neural nets
Universal approximation theorem, low-rank nets: justification and analysis, some applications
Tutorial: TBA
Prerequisites: a basic level in statistical physics. The programming language we use is Python 3, but no previous programming experience is required.
From a practical point of view, make sure your computer is properly set up for the course. Here are brief instructions for what you should do:
1. Install Anaconda, following the instructions here: https://www.anaconda.com/download
2. Then install PyTorch. To do this, open a terminal and run:
conda install pytorch torchvision torchaudio cpuonly -c pytorch
Make sure you are able to load all packages. To test this, start Python and run:
>>> import torch, numpy, scipy, matplotlib
This should not produce any errors.
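If the import succeeds, a short script along the following lines (a minimal sketch; the version printout and the tensor operation are just suggested sanity checks) confirms that everything is functional:

    import torch, numpy, scipy, matplotlib

    # Print installed versions and run a small tensor operation.
    print("torch      :", torch.__version__)
    print("numpy      :", numpy.__version__)
    print("scipy      :", scipy.__version__)
    print("matplotlib :", matplotlib.__version__)
    print("tensor test:", torch.ones(2) @ torch.ones(2))   # should print tensor(2.)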
Homework assignment at the end of October
Written exam at the beginning of January, covering the theoretical part of the course as well as the practical (computational and statistical inference) aspects