Showing posts with label gpu.

Thursday, November 05, 2020

The real motivation behind matrix engine (GPU/TPU/...) adoption in HPC

There is a current backlash in the HPC community against GPU/TPU/... (aka matrix engine) accelerator adoption. Most of the arguments are driven by performance, efficiency, and real HPC workloads.

In a recent paper, Jens Domke et al., colleagues of Dr Matsuoka at RIKEN and AIST in Japan, explore whether the inclusion of specialized matrix engines in general-purpose processors is genuinely motivated and merited, or whether the silicon would be better invested in other parts of the chip.

I wrote before that a lot of new HPC systems overuse matrix engine hardware in their architecture. In this paper, the authors look at the broad usefulness of matrix engines. They found that only a small fraction of real-world applications use and benefit from accelerated dense matrix multiplication operations. Moreover, when combining HPC and ML, or when trying to accelerate traditional HPC applications with ML, the inference computation is very lightweight compared to the heavyweight HPC compute.


While I agree with the argument put forward, some other aspects that go beyond HPC need to be taken into consideration as to why there is such a push for matrix engine adoption. These aspects are mainly market-driven. If you compare the two markets, there is significantly more money in the "hyped", only a few years old AI market (training + inference) than in the 30-year-old "mature" HPC market.

In raw numbers, the HPC market is worth $39 billion. In comparison, the AI market is worth $256 billion in hardware alone. Even if you focus on AI semiconductors only, it is still worth $32 billion! And the growth projections are not in favour of HPC.

If you then look at the roughly N^4 computing complexity for AI versus at best N^3 for HPC, or at where those AI systems end up (institutions, companies, and individual systems such as cars, wearables, medical appliances, etc.) versus where HPC systems go, you quickly understand the significant difference in size and potential between the two markets.
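To make the complexity claim concrete, here is a hedged back-of-envelope sketch (my own illustration, not taken from the Domke et al. paper): a dense N×N matrix multiplication, the kernel a matrix engine accelerates, costs roughly 2N^3 FLOPs, and if training sweeps such a kernel over a number of steps that itself grows with N, the total scales like N^4. The linear step count is an assumption for illustration only:

```python
# Back-of-envelope FLOP comparison (illustrative assumptions only).

def gemm_flops(n: int) -> float:
    """Dense n x n matrix multiply: ~2*n^3 floating-point operations."""
    return 2.0 * n**3

def training_flops(n: int) -> float:
    """Assumed AI training cost: one GEMM per step, with the number of
    steps growing linearly in n => ~2*n^4 overall (hedged model)."""
    return n * gemm_flops(n)

for n in (1_000, 10_000):
    print(f"n={n:>6}: HPC GEMM ~{gemm_flops(n):.1e} FLOPs, "
          f"AI training ~{training_flops(n):.1e} FLOPs")
```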

If you take the ROI of AI-related business into consideration, it makes more sense why HPC institutes are investing in this type of hardware. Such an investment allows them to tap into a promising and fast-expanding market. The matrix engine movement is simply a market-driven investment to ensure the best ROI for HPC centres.

Monday, June 22, 2020

Stop throwing GPUs at HPC: not all scientific problems are compute dense

The current race to exascale has put a heavy emphasis on GPU-based acceleration to the detriment of other HPC architectures. However, the Crossroads and Fugaku supercomputers are demonstrating that it is not all about GPUs.

The vast majority of the (pre-)exascale machines rely heavily on GPU acceleration, targeting scientific problems that can be cast as dense matrix-matrix multiplication problems.

However, there is a large number of scientific problems that are not compute dense, and such GPU architectures are ill-equipped to accelerate them. Sadly, the current trend seems to have relegated these types of scientific challenges to second-class citizens in the HPC world. If you look at extreme-scale graph problems, for example, the Graph500 benchmark clearly shows that this type of problem has been orphaned: four out of the top ten systems are more than seven years old and nearing their end of life. Moreover, the newer systems show only marginal progress toward accelerating extreme-scale graph traversal.
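As a hedged illustration of why these workloads resist dense-matrix acceleration (my own sketch, not the Graph500 reference code), the Graph500 kernel is a breadth-first search: almost every operation is an irregular memory lookup, with essentially no floating-point work for a matrix engine to chew on:

```python
from collections import deque

def bfs(adj, root):
    """Breadth-first search over an adjacency-list graph.
    Dominated by irregular, pointer-chasing memory accesses;
    there is no dense arithmetic to hand to a matrix engine."""
    parent = {root: root}
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        for w in adj.get(v, []):   # random-access neighbour list
            if w not in parent:    # visited check: another irregular access
                parent[w] = v
                frontier.append(w)
    return parent

# Tiny hypothetical graph, for illustration only.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(bfs(graph, 0))  # {0: 0, 1: 0, 2: 0, 3: 1, 4: 3}
```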

I understand that the current machine learning hype heavily influences the HPC ecosystem. However, we have to remind ourselves that there is life beyond FLOPS. The Fugaku and Crossroads systems demonstrate that it is possible to achieve strategic compute leadership without sacrificing the architecture on the altar of the exaflop compute-dense gods.

The latest Japanese ARM-based Fugaku supercomputer demonstrates that it can address both the compute-dense, GPU-optimised problems and the ones that are not reducible to dense linear algebra and are therefore a poor fit for GPU technologies. The supercomputer, built around the ARM v8.2-A A64FX CPU, just picked up the number one spot in both the Top500 Green benchmark and the Graph500 BFS benchmark.

Hopefully, this will be a wake-up call for the HPC community to properly fund R&D efforts orthogonal to compute-dense, exaflops-benchmark-friendly architectures.


Update 22/06/2020 : after publishing this article, Fugaku was ranked number 1 on the Top500 Linpack benchmark with close to half an exaflop (415 petaflops)! Fugaku is now pretty much topping every single HPC ranking:

  • Top500 : Nb 1.
  • Top500-Green : Nb 1.
  • HPCG : Nb 1.
  • HPL-AI: Nb 1.
  • Graph500 : Nb 1.

Thursday, June 08, 2017

[Links of the Day] 08/06/2017 : Machine Learning Tuning DBMS, Direct SSD to GPU SQL and Large Graph DB processing

  • Tuning DBMS with Machine Learning : from the people behind Peloton, a demonstration of automatically tuning a DB using machine learning. This is rather interesting; however, a key element is missing from the approach: cost. Your DB system can become highly optimised, but your AWS bill can skyrocket too. What you really need is a system that automatically tunes the perf/cost tradeoff to maximise ROI; sometimes being a little bit slower can save $$ (see the sketch after this list).
  • MOSAIC : a more heterogeneous approach: a graph processing engine that exploits all the hardware resources available in a standard host: Xeon processor, Xeon Phi coprocessors, NVMe, and a fast interconnect. Because fast processing of your Facebook social network for fast advertisement targeting is worth it :) [slides]
  • PG-Strom : bypassing the CPU for SQL operations by allowing direct SSD-to-GPU communication for PostgreSQL processing. We are slowly entering the age of heterogeneous computing systems where the core CPU gets relegated to highly generic tasks. [slides]
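As a hedged sketch of the perf/cost tradeoff mentioned above (my own illustration, not part of the Peloton work), the idea is simply to maximise work per dollar instead of raw throughput; the configuration names and numbers below are hypothetical:

```python
# Hypothetical configurations: (name, queries/sec, hourly cost in $).
configs = [
    ("small",  1_000, 0.126),
    ("medium", 2_400, 0.252),
    ("large",  2_600, 0.504),
]

def queries_per_dollar(qps: float, dollars_per_hour: float) -> float:
    """Perf/cost objective: (queries/hour) / ($/hour) = queries per dollar."""
    return qps * 3600 / dollars_per_hour

best = max(configs, key=lambda c: queries_per_dollar(c[1], c[2]))
print("best perf/cost config:", best[0])
# The raw-fastest "large" config loses to "medium":
# slightly slower, but a much better ROI.
```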


Wednesday, April 19, 2017

[Links of the Day] 19/04/2017 : AMD ROCm GPU open platform, Weak Memory Models concurrency report, SSH server for distributed infrastructure

  • ROCm : this slide deck gives an overview of the AMD ROCm open platform for GPU computing exploration. They are really pushing to become the open-source standard for the GPU industry, battling against NVIDIA's supremacy in the domain. It looks like they are making really good progress, and I would be curious to see how this progresses when combined with their Ryzen CPUs.
  • Concurrency with Weak Memory Models : a really good report on the state of memory models in hardware and software. It provides a wide-spectrum overview of hardware and software concurrency models and approaches, as well as future directions in the domain.
  • Teleport 2 : a modern SSH server designed for teams managing distributed infrastructure. [github]


Friday, November 18, 2016

[Links of the Day] 18/11/2016 : Extreme Scale OS, GPU Stream Benchmark, Neural net that produces neural nets

  • Neural Architecture Search with Reinforcement Learning : a neural net that produces neural nets. The cool thing is that the authors are able to beat human-generated models for text processing and deliver equivalent performance for image processing models. Who needs humans anymore....
  • Extreme-Scale Operating Systems : a multi-OS research project at Intel aiming to be the node OS for HPC machines. Intel is trying to deliver a polymorphic OS that can quickly adapt to new software and hardware without the need for the specialized solutions that commonly exist on high-end HPC systems. To some extent, it looks like the Jailhouse system, where the hardware is physically partitioned: a few cores are dedicated to management, while the rest are partitioned and run a lightweight kernel (LWK) + application. Note that I really resent Intel for always trying to rename things that are commonly used. LWKs are unikernels, dammit... Anyway, it's Jailhouse + unikernel for HPC.
  • GPU-STREAM : a STREAM benchmark for GPUs, a much-needed benchmark to understand and quantify memory transfer rates to and from global device memory on GPUs (a sketch of the core kernel follows below).
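The heart of any STREAM-style benchmark is a handful of bandwidth-bound vector kernels. Here is a hedged, CPU-side NumPy sketch of the classic triad kernel to show what is being measured (my own illustration of the idea, not the GPU-STREAM code):

```python
import time
import numpy as np

# STREAM triad: a[i] = b[i] + scalar * c[i] -- bandwidth-bound, almost no compute.
n = 20_000_000
scalar = 3.0
b = np.ones(n)
c = np.ones(n)

start = time.perf_counter()
a = b + scalar * c
elapsed = time.perf_counter() - start

# The triad touches three arrays of 8-byte doubles: two reads and one write.
gbytes_moved = 3 * n * 8 / 1e9
print(f"triad bandwidth: ~{gbytes_moved / elapsed:.1f} GB/s")
```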


Wednesday, October 05, 2016

[Links of the day] 05/10/2016 : SQL Scan from NVMe to GPU using P2P DMA, Strange Loop 2016, Secure Time Service

  • SSD-to-GPU Direct DMA : interesting work where the author uses P2P DMA to load data from NVMe straight into the GPU, bypassing RAM altogether. The objective is to accelerate PostgreSQL scan operations. This is really neat, but I am not sure that SQL DBs are the best choice of use case. I would have thought that columnar or K/V systems would hold better speedup potential because of the way the data is organised and processed (see the layout sketch after this list).
  • Strange Loop : all the videos of the Strange Loop 2016 conference. Way too many good talks to mention them all; just check it out.
  • RoughTime : a secure time synchronization project. Everybody uses time, but nobody mentions clock attacks. This project aims at alleviating this potential threat, so you know all your data is accurately fresh.
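As a hedged illustration of the columnar intuition above (my sketch, unrelated to the PG-Strom code), scanning one attribute in a column-oriented layout reads a single contiguous array, which is exactly the streaming access pattern a GPU or a P2P DMA transfer favours, while a row-oriented layout drags every other attribute's bytes through memory as well:

```python
import numpy as np

n_rows = 1_000_000

# Row-oriented layout: fields of each record are interleaved in memory.
rows = np.zeros(n_rows, dtype=[("price", "f8"), ("qty", "i8"), ("flag", "i8")])

# Column-oriented layout: one contiguous array per attribute.
price_col = np.zeros(n_rows, dtype="f8")

# Scanning just "price":
row_scan = rows["price"].sum()  # strided access; qty/flag share the cache lines
col_scan = price_col.sum()      # contiguous, stream-friendly access
print(row_scan, col_scan)
```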

Monday, June 27, 2016

[Links of the day] 27/06/2016 : GPU programming, Persistent Memory Thesis

  • CS 179 GPU Programming : Caltech GPU programming course
  • Systems and Applications for Persistent Memory : Subramanya R. Dulloor's PhD thesis on persistent memory programming. A must-read for anybody interested in the next generation of storage class memory, as the author wrote PMFS as part of his thesis work.
  • WrAP: Hardware and Software Support for Atomic Persistence in Storage Class Memory : a master's thesis providing some insight into how complex and far-reaching the new storage class memory systems are, especially in their interaction with the CPU cache.

Wednesday, March 30, 2016