Environment Variables and Multiprocessing
Premise
I needed to use a codebase that had a mix of light numpy operations, coupled with a few heavy mathematical optimization problems that did not use numpy.
The API that I had access to proved to be too slow (even asynchronously), so I thought I'd run this locally on a faster system in parallel to speed
things up. Turns out, that was easier said than done, and I picked up the importance of environment variables along the way.
In detail
Demonstration code
Let’s say that the “black box” code looked something like this.
```python
import numpy as np
import sys

n = 400
acc = 0.0
for _ in range(30):
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    c = a @ b
    d = np.linalg.inv(c + np.eye(n))
    acc += np.linalg.svd(d, compute_uv=False).sum()

result = acc + float(sys.argv[1])
```

The actual codebase was a full blown package that I wasn't familiar with, except for the following points:
- The author(s) had not attempted to parallelize the codebase.
- It used several packages, but most notably numpy and a few niche optimization tools.
- The most important bit: numpy was used sparingly, and the compute-intensive part was in the optimization tools.
The problem
I needed to call this script multiple times, supplying different arguments each time. A single call doesn’t take much time on my system:
```
> time python test.py 0

real 0m0.861s
user 0m26.207s
sys 0m0.120s
```

I could, of course, call it sequentially, but that wouldn't be very helpful:
```
> time for i in {1..10}; do python test.py $i; done

real 0m8.100s
user 4m6.980s
sys 0m1.193s
```

This isn't that bad - it looks like it took slightly less than the expected 8.6s, but it's no speedup.
What about switching to good ol’ GNU parallel? I have 32 cores, so it should be fast, right?
```
> time seq 10 | parallel -j32 'python test.py {}'

real 16m50.053s
user 521m15.773s
sys 0m42.995s
```

Wow, that's actually around 125x SLOWER.
An experienced programmer at this point would likely scream one of two words: CONTENTION! OVERSUBSCRIPTION!
Let’s explore what that means.
But why?
numpy is a highly optimized package. Under the hood, it tries to use all available cores so that it can do what you asked
in the fastest possible time. This is a great idea when you have large matrices to operate on.
However, we launched 10 separate Python processes here, and each of them tried to spin up 32 threads of its own. That's
320 threads on 32 cores: significantly more threads than cores, all actively vying for CPU time!
The clearest signal of this is a naive attempt at parallelization via multiprocessing being SLOWER than a sequential set of calls to the same operation. That's why using a system with more cores and parallelizing blindly doesn't always speed things up; in the worst case, such as this one, it slows things down considerably.
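As a quick sanity check, you can inspect how many threads numpy's native libraries will grab by default. This uses the third-party threadpoolctl package, which is not something the original codebase used; a minimal sketch:

```python
# Inspect the thread pools behind numpy's BLAS/OpenMP libraries.
# Requires the third-party threadpoolctl package (pip install threadpoolctl).
import numpy as np  # importing numpy loads its native threading libraries
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    # One entry per native library found (OpenBLAS, MKL, OpenMP, ...)
    print(f"{pool['internal_api']}: {pool['num_threads']} threads")
```

On a 32-core machine, each process will typically report 32 threads per pool, which is exactly what multiplied out to the oversubscription above.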
The solution
numpy uses different libraries for multithreading based on the system. Environment variables can be used to control the
number of threads each process creates, a summary of which is provided below.
| Variable | Backend |
|---|---|
| OMP_NUM_THREADS | OpenMP (used by many BLAS builds) |
| OPENBLAS_NUM_THREADS | OpenBLAS |
| MKL_NUM_THREADS | Intel MKL |
| BLIS_NUM_THREADS | BLIS |
| VECLIB_MAXIMUM_THREADS | Apple Accelerate |
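A detail worth noting: these variables are read when the underlying threading libraries are loaded, so if you want to set them from inside Python rather than the shell, it has to happen before numpy is imported. A minimal sketch:

```python
# Pin the thread count from inside Python instead of the shell.
# The variables must be set BEFORE numpy (and its BLAS) is imported.
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # BLAS now starts with a single thread in this process
```

Setting them in the shell, as in the commands below, sidesteps the import-ordering issue entirely.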
To avoid oversubscription, we need to tell numpy to use fewer threads than the default (which is all of them), because
(1) we know that the numpy portions of the work are not the compute-intensive part, and (2) we will handle parallelism ourselves.
On my system, I was using OpenMP, so all that I needed to do was:
```
> time seq 10 | parallel -j32 'OMP_NUM_THREADS=1 python test.py {}'

real 0m0.612s
user 0m4.905s
sys 0m0.236s
```

to make the slowest job among a set of 10 run faster than a single call to the script!
The real-world impact on the actual codebase was a 15x improvement in response time and a 30x improvement in throughput. What used to take 120 machines now takes only 4!
Ending notes
Why did this work here? Let’s revisit the “most important” bit of information from the demonstration code:
The most important bit: numpy was used sparingly, and the compute-intensive part was in the optimization tools.
This technique of using environment variables to restrict the number of threads that a process can spawn, and then
running it in parallel (via parallel-based multiprocessing), will work in any case where multiple processes are
spawned per physical core and you want to maximize throughput. If the computations done by numpy are
more compute intensive, you could experiment with allowing more threads while still parallelizing:

```
seq 10 | parallel -j8 'OMP_NUM_THREADS=4 python test.py {}'
```

This allows each process to use up to 4 threads, while spawning 8 processes in parallel, in theory still using a total of 32 cores. You may need to adjust this split to find out what works best on your system.
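One way to find that sweet spot is to time a few splits and compare. A rough sketch in Python, reusing the demo script from above (the worker counts are illustrative, not recommendations):

```python
# Roughly time a few workers-vs-threads splits on a 32-core box.
import os
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

def run_batch(n_workers: int, threads_per_proc: int) -> float:
    """Run test.py for arguments 1..10 with n_workers concurrent processes."""
    env = dict(os.environ, OMP_NUM_THREADS=str(threads_per_proc))
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker thread just waits on one child process at a time.
        list(pool.map(
            lambda i: subprocess.run(["python", "test.py", str(i)], env=env),
            range(1, 11),
        ))
    return time.perf_counter() - start

for workers in (32, 16, 8, 4):
    elapsed = run_batch(workers, threads_per_proc=32 // workers)
    print(f"{workers} workers x {32 // workers} threads: {elapsed:.2f}s")
```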
In addition, to make this work less dependent on the libraries that numpy can use, just set more (or all) of the
environment variables:
```
seq 10 | \
parallel -j32 "OMP_NUM_THREADS=1 \
OPENBLAS_NUM_THREADS=1 \
MKL_NUM_THREADS=1 \
python test.py {}"
```

An interesting side note is that the parallel, single-thread-per-process run took 0.61 seconds, less than the time it took to run
a single call to the script on multiple cores (via numpy's internal threading). For matrices of this size, this also indicates
that the overhead associated with spawning 32 threads for a single task might actually be significant, and running on a
single core might just be inherently better!
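As an aside, if you'd rather not track which backend variable applies on a given machine, the threadpoolctl package (an extra dependency, not something used here or in the original codebase) can cap whatever thread pools numpy exposes at runtime. A rough sketch:

```python
# Backend-agnostic thread capping, whichever BLAS numpy was built against.
# Requires the third-party threadpoolctl package.
import numpy as np
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    # All BLAS/OpenMP pools are capped at one thread inside this block.
    a = np.random.rand(400, 400)
    b = np.random.rand(400, 400)
    _ = np.linalg.svd(a @ b, compute_uv=False)
```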
Footnotes
I’ve simplified a few things in this blog post:
- This demo uses heavy numpy ops to clearly show the oversubscription effect. The actual codebase had much lighter usage, but the principle is the same.
- I've used contention and oversubscription interchangeably here. The former is a case of threads competing for the same shared resource, and the latter is a higher thread count than CPU cores. Oversubscription here led to contention.
- Modern systems have hundreds or thousands of active threads for very few CPU cores. The difference is that most of those threads are usually "sleeping", and don't vie for attention like the numpy ones do.
#python #numpy #multiprocessing #parallel #environment #variables