Researchers and developers today are confronted with increasingly complex problems, and the volume and complexity of data keep growing as well. Hardware performance is also improving and is available in the cloud at the push of a button. Getting the most out of this performance, however, is not trivial: it is not uncommon for calculation routines to run on high-performance hardware while using only a fraction of its power. For example, anyone who uses an EC2 P3 instance of type p3.8xlarge (which includes four GPUs) on demand in the AWS cloud pays US$ 12.24 per hour.
It would be wasteful to run a calculation routine there that uses neither JIT optimization nor the GPU: it takes unnecessarily long and accordingly costs more money. This article shows which factors come into play, how much faster a JIT- or GPU-optimized routine runs, and how developers can implement such optimizations with Python. Selected examples illustrate the pitfalls that can arise.
The necessary tools
The programming languages C++ and Python are very well suited to writing computationally intensive, performance-optimized routines; just last year, Nvidia introduced innovations for both languages. This article focuses on Python, but also looks beyond it at alternatives (Mathematica, C++). Developers therefore need the appropriate tools:
A development environment (IDE): The examples, programs, and code snippets presented here were developed in Visual Studio Code; other environments such as PyCharm are equally suitable.

A Python installation: The installation used here is the Anaconda Individual Edition 2021.11 released last November, which comes with Python 3.9.7.

Appropriate packages: To try out the routines developed here, you need Numba (JIT compiler that translates Python code into fast machine code), the CUDA Toolkit (GPU-accelerated libraries), Pyculib (Python bindings for CUDA), math (mathematical functions), pandas (processing, storage, and export of data), NumPy (numerical calculations), multiprocessing (process-based parallelization), and gmpy2 (high-performance multiple-precision arithmetic); time and sys for timing and system functions are part of Python's standard library.

A code assistant and quality checker: Pylint detects and flags problems, code errors, and violations of Python community conventions and recommendations.
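Before running the examples, it can be useful to check which of the third-party packages are actually installed. A minimal sketch using only the standard library (the package list is taken from above; math, time, sys, and multiprocessing ship with Python itself and need no check):

```python
import importlib.util

# Check which of the third-party packages used in this article are
# installed, without importing them.
status = {pkg: importlib.util.find_spec(pkg) is not None
          for pkg in ("numba", "pandas", "numpy", "gmpy2")}
for pkg, found in status.items():
    print(f"{pkg}: {'installed' if found else 'missing'}")
```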
Illustrative material: Mengoli’s six squares problem
Inspired by the French mathematician Jacques Ozanam, his Italian colleague Pietro Mengoli dealt with special Diophantine equations in the 1670s – equations that only allow integer solutions. A well-known problem involving such equations is the six-squares problem: Find three natural numbers x, y, z whose differences z−y, z−x, y−x are each square numbers, and where the differences z²−y², z²−x², y²−x² of the squares are also square numbers.
The solutions involve very large numbers. Ozanam had already presented one back then, namely x=1873432, y=2288168, and z=2399057. Such problems in higher number ranges are ideal as illustrative material, since JIT-optimized programming with very large numbers harbors pitfalls, as will be shown later.
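Ozanam's solution can be checked directly, since Python's integers have arbitrary precision. A minimal sketch using math.isqrt for an exact square test:

```python
from math import isqrt

def is_square(n: int) -> bool:
    """Exact square test via the integer square root."""
    r = isqrt(n)
    return r * r == n

# Ozanam's solution to the six-squares problem
x, y, z = 1873432, 2288168, 2399057

# All six differences must be perfect squares.
checks = [z - y, z - x, y - x, z*z - y*y, z*z - x*x, y*y - x*x]
print(all(is_square(d) for d in checks))  # prints True
```

For instance, z−y = 110889 = 333² and z−x = 525625 = 725².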
The importance of today’s computing power
It is remarkable that already in the 17th century Ozanam found such large numbers satisfying this equation. He had no technical aids such as computers or programming languages at hand. Today's tools are capable of solving even more complex problems. For example, in 2019, a team at the Massachusetts Institute of Technology (MIT) found a solution to the Diophantine equation x³+y³+z³=42 with x=-80538738812075974, y=80435758145817515, and z=12602123297335631. Among the numbers up to 100, 42 was the last open case that can be represented as the sum of three cube numbers. It is not surprising that fans of the science fiction classic "The Hitchhiker's Guide to the Galaxy" were also very interested in this puzzle. The team used the idle power of more than half a million home PCs for the calculation. Such equations belong to the "sum of three cubes" problem, which has been unsolved for 160 years. There are other questions that could not be solved even in hundreds of years without high computing power. These include above all tasks from the field of artificial intelligence (AI) such as speech recognition, image recognition, or pattern recognition in masses of numerical data.
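The MIT result can be verified directly in Python, whose built-in integers handle these 17-digit numbers exactly:

```python
# Checking the solution of x³ + y³ + z³ = 42 found at MIT in 2019;
# Python's arbitrary-precision integers make this a one-liner.
x = -80538738812075974
y = 80435758145817515
z = 12602123297335631
print(x**3 + y**3 + z**3)  # 42
```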
Algorithm for finding square numbers
For the sake of simplicity, the focus here is on the three square numbers x², y², and z², whose differences are again square numbers. There are also solutions that are not quite as large: x=153, y=185, z=697 or x=520, y=533, z=925. A small cross-check: 925²−533² gives 571536, which is indeed a square number, namely 756². It is left to the readership to check whether the other differences are also correct. A large number of such hits, with values up into the 12-million range, were determined using a just-in-time-optimized (JIT) routine and are available in a GitHub repository (see file pythagorean_12000000.csv). The algorithm responsible for this is described on StackExchange Mathematics, optimized for JIT, and also available in the GitHub repository (see the pythagorean_gendata_jit.py file).
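The cross-check from the text, extended to both small solutions, can be done in a few lines. For the simplified problem only the differences of the squares need to be square:

```python
from math import isqrt

def is_square(n: int) -> bool:
    """Exact square test via the integer square root."""
    r = isqrt(n)
    return r * r == n

# The two small solutions mentioned above; the second one contains
# the cross-check 925² − 533² = 571536 = 756².
for x, y, z in [(153, 185, 697), (520, 533, 925)]:
    diffs = (z*z - y*y, z*z - x*x, y*y - x*x)
    print(x, y, z, all(is_square(d) for d in diffs))  # both print True
```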
from math import sqrt

import numpy as np
from numba import jit


# The signature declares the return type (the row count) and the
# two-dimensional result array that the routine fills.
@jit('uint32(uint64, uint64[:,:])')
def generateData(limit: np.uint64, triples: np.ndarray) -> np.uint32:
    A = np.zeros(limit, dtype=np.uint64)
    B = np.zeros(limit, dtype=np.uint64)
    rows = np.uint32(0)
    for w in np.arange(1, limit + 1, dtype=np.uint64):
        count = np.uint32(0)
        # Collect all decompositions w² = a² + b² (1-based indexing)
        for a in np.arange(1, w + 1, dtype=np.uint64):
            sqr = np.uint64(sqrt(w * w - a * a))
            if sqr * sqr == w * w - a * a:
                count += 1
                A[count] = a
                B[count] = sqr
        if count > 1:
            # Combine two decompositions of the same w into a candidate
            for i in np.arange(1, count + 1, dtype=np.uint64):
                for j in np.arange(1, count + 1, dtype=np.uint64):
                    if i != j:
                        x = np.uint64(A[i])
                        t = np.uint64(B[i])
                        s = np.uint64(A[j])
                        z = np.uint64(B[j])
                        if z > x:
                            y = np.uint64(sqrt(z * z - x * x))
                            if y * y == z * z - x * x:
                                arr = np.array([x, y, z, s, t, w], dtype=np.uint64)
                                triples[rows] = arr
                                rows += 1
    return rows
Listing 1: Optimized algorithm for solving the six squares problem
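How the routine from Listing 1 is typically called can be sketched as follows. So that the example runs without Numba or NumPy installed, the loop body is repeated here undecorated with plain Python lists and an exact isqrt square test; the limit and the function name generate_data are illustrative. With Numba, the @jit-decorated version from Listing 1 does the same work far faster:

```python
from math import isqrt

def generate_data(limit):
    """Undecorated sketch of the search from Listing 1."""
    A = [0] * (limit + 1)
    B = [0] * (limit + 1)
    results = []
    for w in range(1, limit + 1):
        count = 0
        # Collect all decompositions w² = a² + b² (1-based indexing)
        for a in range(1, w + 1):
            rest = w * w - a * a
            b = isqrt(rest)
            if b * b == rest:
                count += 1
                A[count] = a
                B[count] = b
        if count > 1:
            # Combine two decompositions of the same w into a candidate
            for i in range(1, count + 1):
                for j in range(1, count + 1):
                    if i == j:
                        continue
                    x, t = A[i], B[i]
                    s, z = A[j], B[j]
                    if z > x:
                        d = z * z - x * x
                        y = isqrt(d)
                        if y * y == d:
                            results.append((x, y, z, s, t, w))
    return results

triples = generate_data(700)
# Each tuple holds (x, y, z, s, t, w); x, z, and w are the three numbers
# whose squares solve the simplified problem, so (153, 185, 697) from
# the text appears among the results.
print(len(triples))
```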
If you start the routine in Listing 1 to search for solutions with values of up to 50,000 (limit=50000), it takes 4.8 seconds on an HP ZBook (32 GB RAM, Intel Core i7-9750H, 2.60 GHz, 6 cores, 12 logical, Nvidia Quadro T1000) and finds 1074 triples for x, y, and z. Run without any optimization on the same computer, the routine takes 1082.8 seconds (18 minutes) to determine those 1074 triples. The JIT optimization thus yields a speedup by a factor of about 225.
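The measurement pattern behind such figures deserves a note: with a JIT compiler such as Numba, the first call triggers compilation, so the routine is called once for warm-up and only a subsequent call is timed. A sketch of the pattern with time.perf_counter; busy_work is a hypothetical stand-in so the example runs without Numba installed:

```python
import time

def busy_work(limit):
    """Stand-in for a computation routine such as generateData."""
    total = 0
    for w in range(1, limit + 1):
        total += w * w
    return total

busy_work(10)  # warm-up; with @jit this first call would include compile time

start = time.perf_counter()
result = busy_work(200_000)
elapsed = time.perf_counter() - start
print(f"busy_work: {elapsed:.4f} s")
```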