I tend to work on one thing at a time, and I always get my hands dirty implementing, debugging, and scaling the models myself.

I started my journey with continual learning and did my PhD on it. In hindsight, I was focused on academic benchmarks (e.g., incrementally learning new classes on ImageNet) that were not realistic enough.

When I joined DeepMind in late 2022 to work on continual learning, I realized that current deep learning needs more modularity: to learn and unlearn specific knowledge and skills, but also to scale models massively without increasing their inference cost. To train massive modular systems, I developed DiLoCo, a new way to distribute the training of LLMs across the world with two orders of magnitude less bandwidth. Several startups are now built on that technology. Using DiLoCo, I made DiPaCo, a new kind of modular architecture whose weights are distributed worldwide and trained semi-independently.
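The low-communication idea, in a minimal sketch: each worker trains a local replica for many steps on its own data shard and only occasionally exchanges a parameter delta, so communication happens once per round instead of at every step. Everything below (the linear-regression toy, plain SGD inner steps, the specific hyperparameters) is illustrative, not the actual recipe.

```python
# Toy sketch of a low-communication training loop in the spirit of DiLoCo.
# Hypothetical setup: linear regression, plain SGD inner steps, heavy-ball
# outer update. The real recipe and hyperparameters differ.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, split into shards: one shard per "worker" (island of devices).
true_w = rng.normal(size=5)
X = rng.normal(size=(4000, 5))
y = X @ true_w + 0.1 * rng.normal(size=4000)
num_workers = 4
shards = np.array_split(np.arange(len(X)), num_workers)

def local_steps(w, idx, steps=100, lr=0.01, batch=32):
    """Run `steps` SGD steps on one worker's shard, with no communication."""
    w = w.copy()
    for _ in range(steps):
        b = rng.choice(idx, size=batch, replace=False)
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / batch
        w -= lr * grad
    return w

# Outer loop: workers synchronize once per round, exchanging a single
# parameter delta instead of a gradient at every step.
w_global = np.zeros(5)
momentum = np.zeros(5)
outer_lr, outer_beta = 0.4, 0.6  # illustrative values
for round_ in range(20):
    deltas = [local_steps(w_global, idx) - w_global for idx in shards]
    outer_grad = np.mean(deltas, axis=0)        # averaged "pseudo-gradient"
    momentum = outer_beta * momentum + outer_grad
    w_global = w_global + outer_lr * momentum   # outer (server-side) update
    mse = np.mean((X @ w_global - y) ** 2)
    print(f"round {round_:2d}  mse={mse:.4f}")
```

With hundreds of inner steps per synchronization round, the communication volume drops by roughly the number of inner steps, which is where the orders-of-magnitude bandwidth saving comes from.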

I’m still working on distributed training, fighting the tyranny of device colocation. My dream is to do compute arbitrage across all the GPUs and TPUs in the world: no device should ever be idle; everything must be used towards training better AIs.

Distributed Training

Continual Learning

Work done mostly during my PhD thesis (2019-2022).

Miscellaneous