Cross-pollination
I attended a LinuxONE event yesterday, mostly because I've always been interested in the architecture and in evaluating it for some specific edge-case uses, but also partly because I think it's valuable to visit events ostensibly from other areas of IT and see how they solve our common problems.
It's long been an annoyance to me that "academic computing" and "private sector computing" have weirdly siloed tech stacks, because it fragments knowledge and causes parallel evolution of similar but wildly incompatible solutions to identical problems. While applications remained in their silos this caused no issues, but as "data science" evolved as a field within industry (and later machine learning, and now generative models) and was pulled back from there into academia, we started to have problems.
A classic example, I think, is the Hadoop/Apache Spark problem. In academic HPC we have decades of experience running batch systems for data analysis; at UCL alone we have somewhere in the region of 60k cores and 100-odd GPUs running under batch systems spread across two datacentres. These software solutions are mature. They handle the difficult problem of security (users are often competitors and should not be able to mess with each other's code or data), and they handle very complex scheduler policy: different priorities for different groups (for example, we treat paying users as having higher priority than free users), scheduling across radically heterogeneous resources, and so on. None of this knowledge was retained in either Hadoop or Spark. Both have extremely basic batch queues, little meaningful security (indeed Spark seems to delight in having little and being obnoxious about enabling any of it) or resource management, and both make the mistake of tightly coupling the scheduler to applications – and yet researchers in "crossover" subjects like economics started wanting to use applications written for them.
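To make the "complex scheduler policy" point concrete, here is a minimal sketch of how a Slurm-managed cluster might express that kind of policy. The partition layout, node names, and QOS names are purely illustrative, not our actual configuration:

```
# slurm.conf fragment: multifactor priority, weighting QOS and fair-share
PriorityType=priority/multifactor
PriorityWeightQOS=10000
PriorityWeightFairshare=5000

# Heterogeneous resources split into partitions with their own limits
PartitionName=cpu  Nodes=node[001-500]  Default=YES  MaxTime=72:00:00
PartitionName=gpu  Nodes=gpu[01-10]     MaxTime=48:00:00

# QOS tiers managed via sacctmgr: paying users outrank free users, e.g.
#   sacctmgr add qos paying priority=100
#   sacctmgr add qos free   priority=10
```

Decades of this sort of policy machinery – priority weighting, fair-share accounting, per-partition limits – is exactly what the Hadoop/Spark generation of schedulers had to reinvent from scratch.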
This has only become more challenging with each iteration of data science. The love of interactive notebooks as a control tool – difficult to manage safely on a shared resource with a traditional batch workload manager – has necessitated tools like Open OnDemand to schedule them. ML workflows usually require containerisation, often with Kubernetes to manage them, and again, until relatively recently these things were extremely difficult to run safely in an environment where multiple competing user groups, and even students, share a single resource.
Don’t get me wrong: containers are awesome, and Kubernetes and friends are really great tools. One of the goals for the future “cloud-like” HPC software stack we are in the process of building is that it should be based around and support these technologies as well as traditional batch. We can do this because we are rebuilding with them as a core requirement, and there is a point in the future (within a year on present plans) where we can, for example, comfortably run interactive AI workloads and batch computational chemistry codes in production on the same hardware, automatically scheduled – both need big multi-GPU nodes with fast networks, but with very different tech stacks.
A longer-term solution, though, is to attend “industry” events and talk to our equivalents in banking, finance and elsewhere, because we all share common problems: datacentre space, power costs, the desire to be sustainable, and a strong need for security. Making sure there’s a cross-pollination of ideas lets us bring ideas over from the “other side” earlier in the process, and so that is what I’ve been trying to do.
The LinuxONE event was interesting, for example, because when talking about the TCO of systems, the “stacked bar graph” of costs for us looks radically different from the one for a commercial customer such as a bank. It’s quite a shock to see how expensive software is in their environment compared to ours, where hardware and power costs dwarf software, which is almost entirely either open source or homebrew.
This feeds forward into carbon costs as well. Because most universities are required to buy only “green” energy (and yes, I know there are considerable issues with this in terms of real carbon benefits, but I don’t make the rules) and HPC involves such large hardware deployments, the “embodied carbon” of our hardware is a much bigger part of the life-cycle carbon cost of the machine – to the degree that the only ethical thing for many deployments (particularly in places like Scotland, where power generation is greener by default) is to run them as hard and as long as we can before the hardware dies and we have to pay the embodied-carbon cost again.
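The shape of that trade-off is easy to sketch. The numbers below are made up purely to illustrate the arithmetic – they are not real figures for any machine – but the structure holds: when operational carbon is small because the supply is "green", the one-off embodied cost dominates, and running the hardware longer amortises it:

```python
# Illustrative figures only: invented to show the shape of the trade-off,
# not real data for any deployment.
EMBODIED_KGCO2 = 500_000            # one-off embodied carbon of the hardware
OPERATIONAL_KGCO2_PER_YEAR = 20_000 # small, because the grid supply is "green"

def average_annual_carbon(service_years: float) -> float:
    """Average annual carbon cost: embodied carbon amortised over the
    hardware's service life, plus the (comparatively small) operational cost."""
    return EMBODIED_KGCO2 / service_years + OPERATIONAL_KGCO2_PER_YEAR

# Doubling the service life substantially cuts the annual carbon cost:
print(average_annual_carbon(3))  # ≈ 186,667 kg CO2e/year
print(average_annual_carbon(6))  # ≈ 103,333 kg CO2e/year
```

The same logic is why a longer refresh cycle can be the more ethical choice on a green grid, even though it means running older, less power-efficient hardware.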
I have many takes, both positive and negative, about the “AI explosion”, but one thing it has done is get both churches of computing running some of the same software and some of the same hardware again. Back in the early years of computing, mainframes and HPC were the same computers – the latter often being the former with an optional floating-point capability, running FORTRAN rather than COBOL. I don’t know whether we’ll ever get back to that, because we diverged down different paths for very good reasons, but we should keep talking – we have much to learn from each other.
Many thanks for your comments and insights Owain! Looking forward to looking deeper under the #LinuxONE technology hood with you regarding #sustainability #AI (on-chip) #performance #security & most importantly learning from each other. 👍
When you mention "considerable issues in terms of carbon benefits", do you mean the dodgy offsets market, or is there something else, too?