Tutorial T5

Title: MARS: A framework for runtime monitoring, modeling, and management of realtime systems

Abstract: From datacenters to embedded devices, modern realtime workloads demand exceptional computational capacity from state-of-the-art systems while meeting energy constraints, realtime deadlines, mixed-criticality requirements, and QoS targets. In response, researchers have proposed resource management policies to maximize system utilization and efficiency, e.g., power managers, dynamic voltage and frequency scaling (DVFS) governors, task mappers and schedulers, and offloading orchestrators. These policies can draw on techniques from various algorithmic domains, e.g., game theory, control theory, and machine learning. In this tutorial, we give an overview and demonstration of MARS (Middleware for Adaptive and Reflective Systems), a cross-layer, multi-platform framework developed by the Dutt Research Group at UC Irvine that allows system designers to easily create resource managers by composing system models and resource management policies in a flexible and coordinated manner. MARS consists of a generic user-level sensing/actuation interface that allows for portable policy design, and a reflective system model used to coordinate multiple policies. We demonstrate MARS’ ability to deploy a low-overhead realtime resource manager through a DVFS policy example that can run on any Linux-based heterogeneous multiprocessing (HMP) platform. We also demonstrate MARS’ ability to transparently collect, store, and analyze realtime application behavior at scale through an architectural monitor for (1) a rack server executing inference services and (2) an embedded developer board executing autonomous driving tasks.

The tutorial will start with an introductory session giving an overview of the technical content covered: realtime telemetry, system modeling, and resource management policies. A guest speaker from the Meta/Facebook Capacity Engineering and Analysis Group (e.g., Parth Malani or David Cisneros, to be confirmed) will present the challenges of telemetry and modeling at scale that the industry currently faces. The first session will cover the design and implementation of MARS 1.0 for realtime monitoring and control of embedded systems-on-chip, along with use cases and results. The second session will cover the design and implementation of MARS 2.0 and its ability to portably monitor and control distributed systems at scale. A demonstration of MARS 2.0 will show how to set up scalable monitoring telemetry, visualize live monitor data, and deploy an application-level runtime policy for two use cases: (1) a rack server executing inference services and (2) an embedded developer board executing autonomous driving tasks.

Presenters

Bryan Donyanavard (Email: bdonyanavard@sdsu.edu) is an Assistant Professor in the Department of Computer Science at San Diego State University. He received his Ph.D. in Computer Science from the University of California, Irvine, in 2019. Prof. Donyanavard spent 2020 as a researcher in the IoT/CPS group at Ericsson Research in Stockholm. His research focuses on self-aware autonomy.

Biswadip Maity (Email: maityb@uci.edu) is graduating in June 2023 with a PhD from the Department of Computer Science at the University of California, Irvine, and is joining the planning and control team at the autonomous vehicle company Zoox. His research interests are self-aware embedded systems, memory systems, and hyperscale datacenter systems.

Tiago Mück (Email: tiago.muck@arm.com) received his PhD from the University of California, Irvine (UCI) in 2018 and his M.Sc. in Computer Science from the Federal University of Santa Catarina (UFSC) in 2013. Since 2018, he has been a research engineer at Arm working on scalable system software and computer architecture. He is especially interested in cross-layer hardware/software co-design issues and memory system architecture for multi-chiplet many-core systems.

Parth Malani (Email: pmalani@meta.com) received his PhD (2010) and MS (2006) in Electrical and Computer Engineering from the State University of New York, Binghamton. He is currently an Engineering Manager at Meta Platforms Inc., leading infrastructure efficiency teams. His research interests include resource management, power efficiency, and performance observability.