Description
This session will feature these talks:
Every Supercomputer is a Record of Decisions: Navigating New Tensions in HPC • Cristin Merritt
High-Performance Computing systems are often discussed in terms of architecture, performance, and scale. Yet every HPC system is ultimately shaped by a series of decisions: what hardware to buy, how resources are allocated, which users are prioritised, and how policies evolve over time.

This short, interactive talk explores HPC systems through the lens of decision-making. Using live polls and discussions, we will take the community's pulse at Durham HPC Days and examine the factors that most influence operational and infrastructure decisions across HPC centres.
Participants will be invited to consider common decision scenarios reflecting the tensions HPC centres face today, including capacity planning, scheduler policies, user growth, and strategic direction.
The session will also introduce Season 2 of Move the Needle: Decision Making in HPC, a community project exploring how HPC centres make decisions and how small improvements in decision processes can lead to better outcomes for researchers, operators, and institutions.
What's next? Genesis Mission, NAIRR, and the next phase of science in the US • Richard Knepper
The US Department of Energy has initiated the Genesis Mission to 'double the science productivity of the US', the National AI Research Resource is moving beyond its Pilot Phase, and the National Institutes of Health and National Science Foundation have undergone massive restructuring in the interim. Meanwhile, regional initiatives like New York's Empire AI and California Compute are beginning to take shape. This talk describes what new developments are being built, what directions are encouraged under funding initiatives, and where the US computational research community may be pointed as it moves forward. The discussion will cover new computational investments (and changes to how those investments take shape), new points of focus from the administration, and maybe even some discussion of those difficult supply chain issues.

A Frequency-Invariant Approach to Energy-Efficient Computing • Syed Ibtisam Tauhidi
Modern CPUs use Dynamic Voltage and Frequency Scaling (DVFS) to manage the explosive growth in the energy consumption of computation, including in High-Performance Computing (HPC) and AI inference workloads. However, standard governors rely on coarse utilisation heuristics and a 'race-to-idle' philosophy, assuming that high CPU utilisation correlates with productive computation. This assumption collapses during memory-bound phases, such as LLM token generation or sparse matrix operations, where cores are stalled on cache misses despite utilisation remaining near 100%. Such systems therefore waste energy maintaining peak frequencies while waiting for cache fills. Previous software solutions face adoption barriers, demanding intrusive source code modifications, offline training, or dense telemetry with heavy overhead.

In this talk, I present URJA, a purely online, user-space DVFS framework that eliminates this energy waste without disrupting workflows. URJA employs a minimalist design using a sparse set of hardware counters, specifically Last-Level Cache Misses Per Kilo Instruction (MPKI). We demonstrate that MPKI is a robust, frequency-invariant proxy for memory pressure, enabling accurate phase classification without dynamic recalibration. To prevent control thrashing, URJA replaces continuous scaling with a reactive ternary control model (minimum, maximum, and stable intermediate frequencies) combined with adaptive window-based smoothing and hysteresis. Furthermore, the control parameters of the framework are rigorously derived using data-driven methodologies. Comprehensive evaluations across Intel and ARM platforms using the NAS Parallel Benchmarks, SPEC CPU2017, and LLM inference tasks (Ministral 3, GPT-OSS) demonstrate URJA's efficacy: it drives execution toward the Energy-Delay Product (EDP) Pareto frontier, achieving a ~10% EDP improvement, and yields up to 29% energy savings without significant runtime degradation while maintaining a negligible ~1% overhead.
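To make the ternary control idea concrete, here is a minimal sketch of an MPKI-driven governor in the spirit of the abstract. It is an illustrative assumption, not URJA's actual implementation: the class name, frequency levels, thresholds, and window size are all invented for the example, and a real governor would read hardware counters and write frequencies via the OS rather than take them as function arguments.

```python
from collections import deque

# Illustrative frequency levels (Hz) and hysteresis thresholds; the real
# framework derives its control parameters from data, per the abstract.
F_MIN, F_MID, F_MAX = 1.2e9, 2.4e9, 3.6e9
MPKI_HIGH, MPKI_LOW = 20.0, 5.0
WINDOW = 8  # samples in the smoothing window

class TernaryGovernor:
    """Map smoothed LLC MPKI to one of three frequency levels.

    High MPKI signals a memory-bound phase (drop to F_MIN to save
    energy); low MPKI signals a compute-bound phase (run at F_MAX);
    anything in between holds a stable intermediate frequency.
    Separate high/low thresholds provide hysteresis, and window
    averaging smooths out noisy samples, preventing thrashing.
    """

    def __init__(self):
        self.samples = deque(maxlen=WINDOW)
        self.freq = F_MID

    def update(self, llc_misses, instructions):
        mpki = 1000.0 * llc_misses / max(instructions, 1)
        self.samples.append(mpki)
        smoothed = sum(self.samples) / len(self.samples)
        if smoothed >= MPKI_HIGH:
            self.freq = F_MIN   # stalled on memory: peak clocks are wasted
        elif smoothed <= MPKI_LOW:
            self.freq = F_MAX   # compute-bound: run fast
        else:
            self.freq = F_MID   # mixed phase: hold the stable middle setting
        return self.freq

gov = TernaryGovernor()
# A memory-bound sample (30 MPKI) pushes the governor to the minimum level.
print(gov.update(llc_misses=30_000, instructions=1_000_000))
```

The point of the three-level model is that it trades fine-grained frequency selection for stability: with only three targets and hysteresis between them, the governor cannot oscillate between adjacent frequency steps the way continuous scaling can.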