Project HF
My name is Thomas Bouvier, and I am based in Paris. I obtained an engineering degree in electrical engineering & computer science in 2019. I have always enjoyed building computer projects, ranging from developing small video games from scratch (Herr Speck, Demo Video) to building an AI-powered robot that plays Flappy Bird (Floppy Bird, Demo Video).
After completing my engineering studies (INSA), I worked at Snips, a Parisian startup that pioneered privacy-preserving AI at the edge. This experience allowed me to contribute to open-source libraries, much as HF does today. Following Snips’ acquisition, I wanted to delve deeper into Machine Learning, which led me to join Inria as a research engineer. I really enjoyed creating and optimizing scalable, data-intensive pipelines powering streaming applications.
I then pursued a PhD at Inria, working at the intersection of Machine Learning and High-Performance Computing (HPC). I focused on training ML models at supercomputer scale, leveraging parallelization techniques for efficiency. I had the opportunity to work at Argonne National Laboratory in Chicago to benefit from its compute infrastructure, and I have maintained excellent connections with the people there, who are very open to collaboration. I defended my PhD two months ago.
To sum up, I am applying to Hugging Face with the goal of contributing to HPC topics such as parallelization and large-scale pre-training. I’m keen to contribute to open-source software and internal projects, and I would also like to establish collaborations with leading institutions in HPC.
Please detail the reasons why you are applying to work at Hugging Face and how you think you can make an impact on our team (self-authored only 🤗)
I am applying for the following reasons:
First, I believe that open-source software is a powerful tool for decentralizing knowledge. Reproducing research results in ML is hard: the code and hyper-parameters are not always provided, descriptions are vague, and results are noisy. HF does a great job of democratizing AI beyond research labs. Moreover, HF cares about ethics and legal aspects, which is important to me.
Second, the High-Performance Computing (HPC) community possesses skills that are complementary to those of ML researchers and practitioners. For instance, pre-training large models requires HPC expertise to scale effectively across thousands of GPUs. However, I have noticed that there is limited interaction between these two groups. Bridging this gap would be mutually beneficial: collaborations in the HPC space could help HF gain visibility and benefit from HPC best practices. Concretely, the JLESC community (https://jlesc.github.io) has access to top supercomputers and reports on national efforts around LLMs, while the TPC consortium (https://tpc.dev/participants/) leads an open-science effort in large-scale training.
Third, I have had the opportunity to work in very diverse environments, which has made me versatile. In 2018, I worked at a large aerospace company (Thales). In 2019, I joined an AI startup (Snips). From 2020 onward, I was employed by Inria, where I conducted research, lectured in engineering schools, and mentored two undergraduates. Additionally, while working at an American national laboratory (ANL), I collaborated with researchers specializing in X-ray science. Through these experiences, I have developed a strong ability to adapt to different teams. I believe this is crucial for open-source projects, which demand a range of skills: engaging with the community, communicating with domain researchers, writing documentation, developing efficient code, and benchmarking new features at scale.
I am interested in AI4Science, which I think will drive breakthroughs in 2025. For instance, Argonne National Lab is leading the AuroraGPT initiative, which will likely be the “BigScience” moment of LLMs for science. Why not get Hugging Face involved? Some of my former colleagues are involved in the large-scale pre-training and dataset curation efforts.
Since the “Wild Card” page mentioned brainstorming, I also want to add that I care deeply about climate issues. As such, I was briefly involved in the CodeCarbon initiative to track and reduce CO2 emissions from GPU computing, and I hope to contribute to it more concretely in the future. Another effort I find particularly promising is materials discovery using AI, with LeMaterial being one such initiative.
Please detail the project you would be most excited to work on in your first 3 months of joining
I identified Nanotron⚡️ 🤗 as the project that would best suit my interests and skills in the short term. I see a few HPC-oriented improvements that could strengthen the robustness and efficiency of large training runs. The following could realistically be integrated into Nanotron⚡️ in ~4 months:
An efficient checkpointing component. Training at scale can be slowed down by hardware failures. A state-of-the-art solution has been proposed by Bogdan Nicolae and his students, which would allow Nanotron⚡️ to checkpoint model states faster than DeepSpeed’s asynchronous checkpointing for LLMs. Such an improvement could be validated at supercomputer scale using Polaris or Aurora.
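To make the idea concrete, here is a minimal sketch of the two-phase principle behind asynchronous checkpointing, written in plain PyTorch rather than against Nanotron⚡️’s or DataStates’ actual APIs (the helper names are mine): a short device-to-pinned-host copy on the critical path, followed by a background flush to disk that overlaps with subsequent training iterations.

```python
# Minimal sketch of two-phase asynchronous checkpointing (illustrative only,
# not the DataStates/DeepSpeed implementation). Phase 1 briefly blocks training
# to snapshot tensors into pinned host memory; phase 2 flushes that snapshot
# to disk in a background thread while training continues.
import threading
import torch


def snapshot_to_host(model: torch.nn.Module) -> dict:
    """Phase 1: copy the model state into pinned CPU buffers (short critical section)."""
    host_state = {}
    for name, tensor in model.state_dict().items():
        pinned = torch.empty(tensor.shape, dtype=tensor.dtype, device="cpu", pin_memory=True)
        pinned.copy_(tensor, non_blocking=True)
        host_state[name] = pinned
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure all device-to-host copies have landed
    return host_state


def flush_async(host_state: dict, path: str) -> threading.Thread:
    """Phase 2: serialize the host snapshot to disk without blocking the training loop."""
    writer = threading.Thread(target=torch.save, args=(host_state, path), daemon=True)
    writer.start()
    return writer  # join() it before reusing the same snapshot buffers
```

In a training loop, `flush_async(snapshot_to_host(model), path)` keeps only the snapshot on the critical path; the approach proposed by Bogdan Nicolae’s group refines this overlap much further, which is what I would aim to bring into Nanotron⚡️.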
Interleaved offloading. Another optimization would be to dynamically schedule optimizer updates across both the CPU and the GPU, halving training iteration times compared with DeepSpeed, as proposed by Bogdan Nicolae in Deep Optimizer States. Again, Aurora could be used to benchmark and validate this computation-communication overlap improvement.
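As an illustration of the scheduling idea only (the real scheduler chooses the partition dynamically from profiling, which this static sketch omits), the snippet below keeps optimizer state for one partition of the parameters on the GPU and for the other in host memory, and overlaps the CPU-side update with the GPU-side update. All names are hypothetical.

```python
# Minimal sketch of interleaving optimizer updates across CPU and GPU
# (illustrative only, not the Deep Optimizer States implementation).
# One partition keeps its optimizer state on the GPU; the other partition
# is updated on the CPU in a background thread that overlaps with the
# GPU-side update.
import threading
import torch


def build_partitions(model: torch.nn.Module, gpu_fraction: float = 0.5, lr: float = 1e-4):
    params = [p for p in model.parameters() if p.requires_grad]
    split = int(len(params) * gpu_fraction)  # static split; a real scheduler would profile
    gpu_params, off_params = params[:split], params[split:]
    cpu_masters = [p.detach().to("cpu").requires_grad_(True) for p in off_params]
    gpu_opt = torch.optim.AdamW(gpu_params, lr=lr)   # optimizer state stays on the GPU
    cpu_opt = torch.optim.AdamW(cpu_masters, lr=lr)  # optimizer state lives in host memory
    return gpu_opt, cpu_opt, off_params, cpu_masters


def interleaved_step(gpu_opt, cpu_opt, off_params, cpu_masters):
    # Ship gradients of the offloaded partition to the CPU master copies.
    for p_gpu, p_cpu in zip(off_params, cpu_masters):
        p_cpu.grad = p_gpu.grad.to("cpu")
    # Run the CPU update in a background thread while the GPU update proceeds.
    cpu_thread = threading.Thread(target=cpu_opt.step)
    cpu_thread.start()
    gpu_opt.step()
    cpu_thread.join()
    # Copy the freshly updated offloaded parameters back to the device.
    with torch.no_grad():
        for p_gpu, p_cpu in zip(off_params, cpu_masters):
            p_gpu.copy_(p_cpu)
```

The point of the sketch is only the overlap between the two `step()` calls; the value of the published approach lies in pipelining the transfers and re-partitioning at run time.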
FSDP ZeRO-3. At the moment, the Nanotron⚡️ codebase doesn’t support offloading parameters during training to achieve ZeRO-3 memory optimizations. Adding this would decrease the memory requirements for training large models.
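For reference, this is roughly what the target behaviour looks like with PyTorch’s own FSDP wrapper (full sharding plus parameter offload to CPU); the interesting work for Nanotron⚡️ is offering the equivalent inside its existing parallelism engine rather than through this wrapper.

```python
# Minimal sketch of ZeRO-3-style training with PyTorch FSDP: parameters,
# gradients, and optimizer states are fully sharded across ranks, and
# parameters are additionally offloaded to CPU between uses. Shown only to
# illustrate the target memory behaviour, not as the Nanotron integration.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import CPUOffload, ShardingStrategy
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def wrap_zero3(model: torch.nn.Module) -> FSDP:
    assert dist.is_initialized(), "expects torchrun / init_process_group to have run"
    return FSDP(
        model.cuda(),
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
        cpu_offload=CPUOffload(offload_params=True),    # keep sharded params in host memory
    )
```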
With a few more months of work, zero-bubble pipeline parallelism could be integrated to maximize GPU utilization at scale, encouraging the adoption of Nanotron⚡️ by more teams running on tight training budgets. I suspect this feature will be more challenging to implement than those above.
Finally, I also have some research ideas for continual fine-tuning, involving techniques that could be integrated directly into Nanotron⚡️ to mitigate catastrophic forgetting. This aligns with the research I conducted during my PhD, where I developed open-source software, Neomem, to learn from evolving datasets. Published results are available at https://arxiv.org/abs/2406.03285.
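To give a flavour of what such an integration could look like (a generic rehearsal sketch, not Neomem’s actual API), past samples can be kept in a reservoir-sampled buffer and mixed into each fine-tuning batch, so the model keeps revisiting old data while learning from the new stream.

```python
# Minimal sketch of rehearsal-based continual fine-tuning (illustrative only,
# not Neomem's API). A reservoir-sampled buffer keeps a uniform subset of all
# samples seen so far; a few of them are replayed alongside every new batch
# to mitigate catastrophic forgetting.
import random


class ReservoirBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.samples = []
        self.seen = 0

    def add(self, sample) -> None:
        """Keep each sample seen so far with equal probability (reservoir sampling)."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.samples[idx] = sample

    def replay(self, k: int) -> list:
        """Draw up to k past samples to mix into the current batch."""
        return random.sample(self.samples, min(k, len(self.samples)))
```

In a training loop, each incoming batch would be concatenated with `buffer.replay(k)` before the forward pass and added to the buffer afterwards; the research question is how to do this efficiently at the scale Nanotron⚡️ targets.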
Best,
Thomas