ISSN: 2643-6744
Cristóbal A. Navarro*
Received: December 19, 2023; Published: January 11, 2024
*Corresponding author: Cristóbal A. Navarro, Associate Professor of Computer Science, Austral University, Chile
DOI: 10.32474/CTCSA.2024.03.000159
Parallel processors have undergone a profound transformation in recent years, transitioning from homogeneous general-purpose units to a heterogeneous ecosystem comprising a mix of general and specific-purpose cores on a single chip. This shift, driven by the demands of Artificial Intelligence (AI) and computer graphics applications, has not only altered the architecture of processors but has also introduced novel challenges in optimizing algorithms for parallel execution. In this brief review, we delve into the evolution of parallel processors and explore the research challenges arising from this shift. We focus on the particular case of GPUs, where tensor cores and ray tracing cores have created new research opportunities: finding which other applications, beyond AI and graphics, can be reformulated as a series of tensor-core or ray-tracing-core operations to further accelerate their performance compared to their regular GPU implementations.
Parallel computing gained strong relevance with the introduction of the first dual-core CPUs in the early 2000s. From there, parallel architectures as well as research in parallel computing achieved significant milestones, such as the ability to pack dozens of CPU cores on a single chip, new parallel algorithms, and the development of parallel programming languages and tools [1]. Today we have a large ecosystem of parallel processors sitting in many of the devices we use every day, from laptops and cellphones to TVs and cars. In the last couple of years, with the surge of artificial intelligence and video games, parallel computing has become even more relevant, as it is the technological foundation for many state-of-the-art applications that require high performance. During the last five to six years, the computing community has witnessed parallel processors evolve from a homogeneous set of general-purpose cores to a heterogeneous set that now includes specific-purpose cores.

A notable case study in this technological transformation is the Graphics Processing Unit (GPU). Around 2006, when NVIDIA announced the CUDA programming platform [2], GPUs transitioned from specialized hardware for graphics rendering to general-purpose accelerators. From that moment, GPUs became attractive devices for very fast scientific computation. Nearly a decade later, with the surge of Artificial Intelligence (AI), the community realized that GPU performance was not high enough to properly handle the new deep learning models being developed. For this reason, around 2017, NVIDIA introduced tensor cores [3-12] inside the chip to further accelerate AI applications. GPU tensor cores are Application-Specific Integrated Circuits (ASICs), or simply specific-purpose cores, that perform fast matrix multiply-accumulate (MMA) operations. With tensor cores, AI applications can gain an extra order of magnitude in performance, allowing the training time of large models to drop from months to a few days. As of 2024, tensor cores are present in NVIDIA [13], AMD [6] and Intel [14] GPUs, and are slowly becoming part of CPUs as well.

The video game industry also had its revolution recently. In 2018, one year after the introduction of tensor cores, NVIDIA added the Ray Tracing (RT) core to its GPU chips. RT cores enable real-time processing of the ray tracing algorithm, bringing the possibility for interactive 3D applications to feature photorealistic lighting. Ray tracing [7] is one of the most computationally demanding tasks in 3D rendering, as it requires thousands of rays to be traced and checked in order to find which triangles they hit. The difficulty is that it is a search problem: for each ray, one needs to find which triangle it hits. A brute-force approach would check every triangle in the scene for every ray, which is very inefficient. Space-partitioning trees [9] and other tree variants have been implemented on the GPU [8], although trees introduce irregular memory accesses, an aspect in which the GPU architecture is limited. As a solution to this problem, an RT core offers a hardware-implemented Bounding Volume Hierarchy (BVH) tree data structure [15], allowing ray/triangle intersections (as well as intersections with custom primitives) to be found significantly faster than with software-implemented alternatives.
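To make the MMA primitive described above more concrete, the following minimal CUDA sketch shows how a single warp can issue one 16x16x16 matrix multiply-accumulate on tensor cores through the WMMA API (nvcuda::wmma). The kernel name, the half-precision/FP32 type combination and the single-tile scope are illustrative assumptions, not details taken from this article.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A*B + 0 on a single 16x16x16 tile using tensor cores.
// A and B are assumed to be 16x16 half-precision matrices stored row-major.
__global__ void mma_16x16x16(const half *A, const half *B, float *D) {
    // Fragments are register-resident and collectively owned by the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);        // start from a zero accumulator
    wmma::load_matrix_sync(a, A, 16);      // load the 16x16 tile of A (leading dimension 16)
    wmma::load_matrix_sync(b, B, 16);      // load the 16x16 tile of B
    wmma::mma_sync(acc, a, b, acc);        // tensor-core MMA: acc = A*B + acc
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

Launched with a single warp (for example mma_16x16x16<<<1, 32>>>(dA, dB, dD) on a GPU of compute capability 7.0 or higher), the whole tile multiply-accumulate is executed by the tensor core units; larger matrices are processed as a series of such tile operations.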
Due to the success of the RT core, as of 2024 all major GPU vendors include it in one form or another.
The recent inclusion of specific-purpose cores in parallel accelerators raised a research question: can other applications, beyond AI and graphics, also benefit from the new tensor and RT cores? The answer is yes, and this has opened a whole new research field in GPU computing: finding ways to reformulate common computational patterns, even ones already adapted for traditional GPU computing, as a series of tensor/ray-tracing operations in order to obtain an additional performance lift. When programming tensor or RT cores, a large part of the pipeline is a black box where the hardware-implemented functionality takes place. Therefore, adapting a computational pattern to tensor/RT cores largely involves restating the pattern as a series of tensor/RT operations.

Successful research has been done in recent years. In the case of tensor cores, new ways have been proposed to further accelerate arithmetic reductions [16, 13, 5-12, 17-21], prefix sums [4-12, 17-21, 22-29], the Fast Fourier Transform [22, 10, 23, 5], stencil computations for PDE simulations [11] and even fractals [14, 25-24]. In general, these works achieve significantly higher performance compared to traditional GPU implementations. Moreover, this performance benefit often comes with lower energy consumption as well, making the approach more energy efficient. In the case of RT cores, a significant body of work can also be found. One of the most relevant research topics has been computing the nearest neighbors of many particles in parallel, using the high search speed of RT cores [20, 26-28]. Other works include a fully RT-core approach for answering the Range Minimum Query (RMQ) problem [17], which consists of finding the minimum in a given interval [i,j] of an unordered array. In the case of geometry, point location has been solved with RT cores as well [18]. More recently, a clustering approach that leverages RT cores has been proposed [19].

There are still several open problems that are candidates for adaptation to tensor or RT cores. The key to bringing new ideas to tensor cores is to find ways to group the arithmetic operations of a process into a series of matrix multiply-accumulate (MMA) operations, as in the sketch below. If this can be done, and the matrices involved can be populated almost entirely with useful data, then there is a strong chance that the tensor cores can provide a performance boost. There are technical limitations, though; for example, the MMA operation offers several data types such as FP16, TF32, BF16 and INT8, among others. The lower the precision, the faster the performance, so one should be careful in choosing a data type, ensuring both correctness and speed. In the case of RT cores, the key is to realize that ray-triangle intersection is actually a search tool that, when properly used, can solve non-graphical problems. The two major challenges when adapting a computation to RT cores are i) to find a proper 3D geometric representation of the input data, and ii) to find a ray-launch scheme such that, when the rays collide with the input data (triangles), they answer the intended search query. If these two challenges can be overcome, then the problem may be computed with RT cores. Future parallel processors may keep bringing new specific-purpose cores to the table, creating new research challenges.
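As a concrete, deliberately simplified example of grouping arithmetic into MMAs, the sketch below reduces 256 half-precision values with a single tensor-core operation: the input is viewed as a 16x16 matrix V and multiplied by an all-ones matrix O, so that row 0 of C = O*V holds the 16 column sums, which are then added in a short final step. This is only a hedged illustration of the idea behind the cited reduction works, not their actual scheme; the kernel name reduce256_tc, the sizes and the single-warp launch are assumptions.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp reduces 256 half values: C = O*V (O = all-ones matrix) leaves the
// column sums of V in row 0 of C; adding those 16 values gives the total sum.
__global__ void reduce256_tc(const half *V, float *out) {
    __shared__ half ones[256];
    __shared__ float C[256];

    for (int i = threadIdx.x; i < 256; i += 32)   // build the all-ones matrix
        ones[i] = __float2half(1.0f);
    __syncwarp();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> o;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> v;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(o, ones, 16);          // O (all ones)
    wmma::load_matrix_sync(v, V, 16);             // V (the 256 input values)
    wmma::mma_sync(acc, o, v, acc);               // acc = O*V
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
    __syncwarp();

    if (threadIdx.x == 0) {                       // finish: add the 16 column sums
        float s = 0.0f;
        for (int j = 0; j < 16; ++j) s += C[j];
        *out = s;
    }
}

Launched as reduce256_tc<<<1, 32>>>(dV, dOut), one warp reduces 256 values with a single MMA; larger inputs would require chaining many such MMAs hierarchically and handling precision carefully, which is where the cited works make their contributions.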
Moreover, this research is not limited to GPUs; it extends to all current processors that are adding specific-purpose units to their chips, including embedded devices.