Schedule
Discord community on AI & HPC - Join to connect with everyone excited about efficient neural network training, including WANT participants and organizers!
WANT poll - Tell us your insights and thoughts about efficient training of neural networks! Your vote does matter! (Poll results from the previous WANT@NeurIPS'23 iteration)
WANT page at OpenReview - Accepted papers (Orals & Posters) are here!
WANT page at Whova - Add to your ICML agenda!
The workshop will be held in a hybrid format:
- offline: at the venue of the ICML 2024 conference in Vienna, Austria,
- online: with streaming from the venue 🎥, and with the poster session and networking in Gather Town 🏰.
Time (Vienna) | Morning |
---|---|
08:30 - 09:00 | Coffee & Poster placement 🏰 |
09:00 - 09:10 | Welcome speech from Organizers 🎥 |
09:10 - 09:40 | Invited talk 🎥 Online Training from Numerical Simulations, Bruno Raffin [INRIA]. Traditionally, scientists and engineers rely on computationally intensive numerical solvers to address Partial Differential Equations (PDEs). However, deep learning is emerging as a promising alternative for obtaining rapid PDE solutions. Typically, deep surrogate models are trained on synthetic data generated by these solvers, which is stored on disk and subsequently retrieved for training. In this talk, we explore the challenges and benefits of enabling online training of deep models concurrently with data generation from running simulations. This approach offers several advantages: it circumvents I/O operations, often the bottleneck in supercomputing; it allows training on datasets larger than available storage capacities, potentially improving generalization; and it introduces the possibility of steering data generation for enhanced efficiency. However, online training is subject to specific biases that must be mitigated through adapted buffering techniques. Our presentation will draw upon research findings from the development of the Melissa framework, which is designed for large-scale online training. By addressing these topics, we aim to provide insights into the future of PDE solving using deep learning techniques and the potential for more efficient, scalable computational methods in scientific and engineering applications. (An illustrative buffering sketch follows the morning schedule below.) |
09:40 - 10:10 | Invited talk 🎥 Making device-agnostic ML training and inference easy at scale, Zachary Mueller [HuggingFace]. Using Hugging Face Accelerate as a case study, we will discuss an approach to creating a framework aimed at decentralizing and lowering the barrier to entry for machine learning model training given any hardware and/or accelerator configuration (FSDP, DeepSpeed, multi- or single-GPU, MPS/XLA/CUDA, etc.). Because such a framework needs to be as robust as possible, a number of challenges arise in keeping the code approachable, the API intuitive, and any configuration-related errors as clear as possible. In this talk, we will cover why Accelerate is designed the way it is and how certain API decisions were made, such as handling common configurations through config files, maintaining a zero-magic-code policy, and ensuring minimal code intrusion when using the framework. We will also discuss how building an open-source-first framework helps the community flourish by easily hacking on, adding to, and fixing different aspects of the codebase, in many cases enabling users to adopt new backends, kernels, and more on day 0. (A minimal Accelerate training loop is sketched below the morning schedule.) |
10:10 - 10:30 | Contributed talks 🎥 |
10:30 - 11:30 | Poster session 🏰 |
11:30 - 12:00 | Invited talk 🎥 Enabling extremely fast inference and training performance using dataflow and custom chip, Urmish Thakker [SambaNova]. As the pursuit of larger language models continues to push the boundaries of computational demands, the traditional silicon chip is facing a daunting memory/power wall. In this talk, we present a novel chip design, SN40L, to tackle this challenge. This chip combines a reconfigurable dataflow architecture with a tightly coupled three-tier memory hierarchy to enable efficient compute-intensive training and memory-bound inference workloads for a wide variety of neural network architectures. We discuss the various advantages of this chip via case studies. In the first case study, we discuss how the dataflow architecture, coupled with on-chip SRAM and HBM, empowers operation fusion capabilities enabling 1000+ tokens/second inference performance without sacrificing precision. In the second case study, we look at the training performance of various model architectures and compare it against traditional kernel-by-kernel execution architectures. We show how a dataflow architecture can help accelerate LLM training for traditional dense, sparse, and novel state-space models, while allowing one to train extremely large models on a smaller footprint. In the third case study, we discuss how the tightly coupled DRAM, HBM, and SRAM can be used to develop a new neural network architecture that scales efficiently to 100+ billion parameters. This new architecture uses a modular, coarse-grained approach to mixture of experts that allows incremental updates to models for new capabilities and knowledge, and smaller-footprint execution during inference. |
12:00 - 12:30 | Contributed talks 🎥 |
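
As a companion to the online-training abstract above, the sketch below illustrates one simple way to buffer a stream of simulation samples so that training does not overweight the most recent ones: a fixed-size reservoir-sampling buffer. It is only an illustration under stated assumptions; the `ReservoirBuffer`, `sample_stream`, and `train_step` names are hypothetical, and the Melissa framework implements its own, more elaborate buffering policies.

```python
import random

class ReservoirBuffer:
    """Fixed-size buffer filled by reservoir sampling: every sample offered so
    far has an equal probability of being retained, which limits the temporal
    bias of training on a live stream of simulation results."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0  # total number of samples offered so far

    def add(self, sample):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            # Keep the new sample with probability capacity / seen.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = sample

    def batch(self, size):
        return random.sample(self.data, min(size, len(self.data)))

def online_train(sample_stream, train_step, capacity=10_000, batch_size=64):
    """Hypothetical loop: samples arrive from running solvers instead of disk."""
    buffer = ReservoirBuffer(capacity)
    for sample in sample_stream:            # e.g. (solver inputs, solver outputs)
        buffer.add(sample)
        if len(buffer.data) >= batch_size:
            train_step(buffer.batch(batch_size))  # one training step on a buffered batch
```
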
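
To ground the Accelerate abstract above, here is a minimal device-agnostic training loop written against the public Hugging Face Accelerate API (Accelerator, prepare, backward). The model, data, and hyperparameters are placeholders rather than material from the talk; which hardware or distributed backend the loop runs on is chosen by the configuration created with `accelerate config` and by `accelerate launch`, not by the code itself.

```python
import torch
from accelerate import Accelerator  # pip install accelerate

accelerator = Accelerator()  # picks up the active CPU / GPU / DDP / FSDP / DeepSpeed / TPU setup

# Placeholder model and synthetic data, just to make the loop self-contained.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(1024, 128),
                                         torch.randint(0, 10, (1024,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

# prepare() wraps each object for whatever hardware/distributed setup is active.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for inputs, targets in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    accelerator.backward(loss)  # replaces loss.backward(); handles device placement details
    optimizer.step()
```
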
Time (Vienna) | Afternoon |
---|---|
12:30 - 13:30 | Lunch 🏰 |
13:30 - 14:00 | Poster session 🏰 |
14:00 - 14:30 | Invited talk 🎥 Structured matrices for memory-efficient training and finetuning, Beidi Chen [CMU & Meta]. Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing number of parameters, optimizer states, and context lengths. In today's talk, I will mainly introduce two approaches for reducing memory overhead in the pretraining and finetuning stages. I will start with GaLore (Gradient Low-Rank Projection), a pretraining strategy that maintains full-parameter learning accuracy and is more memory-efficient than common low-rank adaptation methods such as LoRA. It reduces total training memory usage by up to 63.3%, unlocking the possibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallelism, checkpointing, or offloading strategies. Then I will talk about S²FT (Structured Sparse Finetuning), which concurrently achieves SOTA fine-tuning performance, efficiency, and inference scalability by "selecting sparsely and computing densely". S²FT prevents overfitting and forgetting, delivers SOTA performance on established benchmarks with improvements of up to 4.1%, and outperforms full FT by 7.1% on generalization tasks. S²FT reduces fine-tuning memory by up to 3× and improves throughput by 1.5-2.7× compared to full FT. Finally, I will conclude my talk by briefly introducing our new line of work, MST (Mini-Seq Transformer) and MEGALODON, which tackles the activation and attention memory bottlenecks posed by long sequence inputs. (A toy sketch of the GaLore low-rank projection idea follows the schedule below.) |
14:30 - 15:00 | Invited talk 🎥 Architecting and deploying compute clusters for large language models, Adam DeConinck [NVIDIA]. As the size of large language models and their processing needs keep increasing, the compute infrastructure must adapt to handle them reliably. In particular, beyond providing a large number of processing units, the platform needs to offer guarantees on fabric and I/O, as well as software strategies to schedule jobs and cache data reliably. In this talk, we will show how strategic choices in reference design definitions, combined with versatile scheduling and checkpointing strategies, can help leverage the infrastructure for best performance. We will also review how scaling up to extreme scale impacts the hardware and software implementation choices for LLMs. |
15:00 - 15:20 | Contributed talks 🎥 |
15:20 - 15:30 | Best paper awards 🎥 |
15:30 - 16:00 | Coffee & Poster session 🏰 |
16:00 - 16:50 | Panel Discussion 🎥 |
16:50 - 17:00 | Closing remarks 🎥 |
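
To make the GaLore idea from the memory-efficient training abstract concrete, the toy sketch below shows gradient low-rank projection on a single weight matrix in plain PyTorch: the gradient is projected onto its top singular directions, so an Adam-style optimizer could keep its moments in an r x n space instead of m x n, and the update is projected back to full size. This is only an illustration under simplifying assumptions (one matrix, a plain scaled-gradient update, the projector recomputed every step); it is not the released GaLore optimizer.

```python
import torch

def lowrank_project(grad, rank):
    """Project a 2-D gradient onto its top-`rank` left singular subspace.
    Returns the projector P (m x r) and the projected gradient P^T @ grad (r x n)."""
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]
    return P, P.T @ grad

# Toy single-step example: optimizer state would cost r*n instead of m*n floats.
m, n, r = 1024, 1024, 64
W = torch.randn(m, n, requires_grad=True)
loss = (W * torch.randn(m, n)).sum()    # stand-in loss so W.grad is populated
loss.backward()

P, g_low = lowrank_project(W.grad, r)   # low-rank view of the gradient
update_low = -1e-3 * g_low              # placeholder for an Adam-style update in the low-rank space
with torch.no_grad():
    W += P @ update_low                 # project the update back to the full matrix
```
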