Porting Python Packages to Support Free-Threading
Many Python packages, particularly packages relying on C extension modules, are not thread-safe in the free-threaded build as of mid-2024. Up until now, the GIL has added implicit locking around any operation in Python or C that holds the GIL, and many thread safety issues only become problematic once the GIL is explicitly dropped. Also, because of the GIL, attempting to parallelize many workflows using the Python threading module will not produce any speedups, so thread safety issues that are possible even with the GIL are rarely hit in practice: users do not reach for threading as often as other parallelization strategies. This means many codebases have threading bugs that up to now have been theoretical or confined to niche use cases. With free-threading, many more users will want to use Python threads.
This means we must analyze Python codebases to identify supported and unsupported multithreaded workflows and make changes to fix thread safety issues. This need is particularly acute for low-level code exposed to Python, including C, C++, Cython, and Rust, but even pure Python codebases can exhibit non-determinism and races in the free-threaded build that are either very unlikely or impossible in the default configuration of the GIL-enabled build.
Suggested Plan of Attack
Below, we outline a plan of attack for updating a Python project to support the free-threaded build. Since the changes required in native extensions are more substantial, we have split off the guide for porting extension modules into a subsequent section.
Thread Safety of Pure Python Code
The CPython interpreter protects you from low-level memory unsafety due to data races. It does not protect you from introducing thread safety issues due to race conditions. It is possible to write algorithms that depend on the precise timing of threads completing work. That means it is up to you as a user of multithreaded parallelism to ensure that any resources that need protection from multithreaded access or mutation are appropriately protected.
Below we describe various approaches for improving the determinism of multithreaded pure Python code. The correct approach will depend on exactly what you are doing.
General considerations for porting
Many projects assume the GIL serializes access to state shared between threads. Disabling the GIL invalidates that assumption, introducing the possibility of data races in native extensions and race conditions that are impossible when the GIL is enabled.
We suggest focusing on safety over single-threaded performance. For example, if adding a lock to a global cache would harm multithreaded scaling, and turning off the cache implies a small performance hit, consider doing the simpler thing and disabling the cache in the free-threaded build. Single-threaded performance can always be improved later, once you've established free-threaded support and hopefully improved test coverage for multithreaded workflows.
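For example, a cache could be disabled only on the free-threaded build. The following is a minimal sketch, assuming a hypothetical expensive function _expensive; sys._is_gil_enabled() was added in Python 3.13, so the getattr fallback treats older versions as always having the GIL:

import sys

# sys._is_gil_enabled() is new in Python 3.13; older versions always
# run with the GIL, so fall back to True if it is missing
GIL_ENABLED = getattr(sys, "_is_gil_enabled", lambda: True)()

_cache = {}

def cached_calculation(arg):
    if not GIL_ENABLED:
        # on the free-threaded build, skip the unlocked cache entirely
        return _expensive(arg)
    if arg not in _cache:
        _cache[arg] = _expensive(arg)
    return _cache[arg]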
NumPy, for example, decided not to add explicit locking to the ndarray object and does not support mutating shared ndarrays. This was a pragmatic choice given existing heavy multithreaded use of NumPy in the GIL-enabled build and a desire not to introduce scaling bottlenecks in existing workflows.
Eventually NumPy may need to offer explicitly thread-safe data structures, but it is a valid choice to initially support free-threading while still exposing operations that are unsafe if used concurrently.
For pure Python packages, a racey algorithm might result in unexpected exceptions or silently incorrect results. Projects shipping extension modules might additionally see crashes or trigger undefined behavior. See the section on supporting native code if you are curious about supporting compiled Python extensions in the free-threaded build.
For your libraries, we suggest focusing on thread safety issues that only occur with the GIL disabled. Any non-critical preexisting thread safety issues can be dealt with later, once the free-threaded build is used more widely. The goal for your initial porting effort should be to enable further refinement and experimentation by fixing issues that prevent using the library at all.
Multithreaded Python Programming
The Python standard library offers a rich API for multithreaded programming. This includes the threading module, which offers relatively low-level locking and synchronization primitives, as well as the queue module and the ThreadPoolExecutor high-level thread pool interface.
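For instance, here is a minimal worker-pool sketch built from these primitives; the process function is a hypothetical stand-in for real work:

import queue
import threading

def process(item):
    # hypothetical stand-in for real work
    return item * item

work = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        item = work.get()
        if item is None:  # sentinel meaning "no more work"
            return
        results.put(process(item))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for item in range(10):
    work.put(item)
for _ in threads:
    work.put(None)  # one sentinel per worker
for t in threads:
    t.join()

while not results.empty():
    print(results.get())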
If you'd like to learn more about multithreaded Python programming in the GIL-enabled build, Santiago Basulto's tutorial from PyCon 2020 is a good place to start.
For a pedagogical introduction to multithreaded programming in free-threaded Python, we suggest reading the ft_utils documentation, particularly the section on the impact of the global interpreter lock on multithreaded Python programs. Many pure Python operations are not atomic and are susceptible to race conditions, or only appear to be thread-safe in the GIL-enabled build because of details of how CPython releases the GIL in a round-robin fashion to allow threads to run.
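For instance, incrementing a shared counter compiles to separate load, add, and store bytecodes, so two threads can read the same old value and one update is lost. A minimal sketch; the final count is not guaranteed on either build:

import threading

counter = 0

def work():
    global counter
    for _ in range(100_000):
        counter += 1  # load, add, store: not atomic

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# frequently prints less than 400000 when updates race
print(counter)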
Both the ft_utils and cereggii libraries offer data structures that add enhanced atomicity to standard library primitives. We hope these sorts of tools to aid concurrent free-threaded programming continue to pop up and evolve, as they will be key to enabling scalable multithreaded workflows.
Dealing with mutable global state
The most common source of thread safety issues in Python packages is use of global mutable state. Many projects use module-level or class-level caches to speed up execution but do not envision filling the cache simultaneously from multiple threads. See the testing guide for strategies to add tests to detect problematic global state.
For example, the do_calculation function in the following module is not thread-safe:
from internals import _do_expensive_calculation

global_cache = {}

def do_calculation(arg):
    if arg not in global_cache:
        global_cache[arg] = _do_expensive_calculation(arg)
    return global_cache[arg]
If do_calculation is called simultaneously in multiple threads, then it is possible for at least two threads to see that global_cache doesn't have the cached key and both call _do_expensive_calculation. In some cases this is harmless, but depending on the nature of the cache, this could lead to unnecessary network access, resource leaks, or wasted compute.
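To see the duplicated work directly, a sketch like the following (with a hypothetical slow _do_expensive_calculation) will typically report more than one call for the same key when run on the free-threaded build:

import threading
import time

call_count = 0

def _do_expensive_calculation(arg):
    global call_count
    call_count += 1  # itself racy, but sufficient for a demonstration
    time.sleep(0.01)
    return arg * 2

global_cache = {}

def do_calculation(arg):
    if arg not in global_cache:
        global_cache[arg] = _do_expensive_calculation(arg)
    return global_cache[arg]

threads = [threading.Thread(target=do_calculation, args=(42,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(call_count)  # frequently prints more than 1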
Converting global state to thread-local state
One way of dealing with issues like this is to convert a shared global cache into a thread-local cache. In this approach, each thread sees its own private copy of the cache, making races between threads impossible. This approach makes sense if having extra copies of the cache in each thread is not prohibitively expensive and does not lead to excessive runtime network, CPU, or memory use.
In pure Python, you can create a thread-local cache using an instance of threading.local. Each thread will see independent versions of the thread-local object. You could rewrite the above example to use a thread-local cache like so:
import threading

from internals import _do_expensive_calculation

local = threading.local()

def do_calculation(arg):
    # attributes of a threading.local instance are visible only in the
    # thread that set them, so each thread initializes its own cache
    if not hasattr(local, "cache"):
        local.cache = {}
    if arg not in local.cache:
        local.cache[arg] = _do_expensive_calculation(arg)
    return local.cache[arg]
This wouldn't help a case where keeping a copy of the cache in each thread would be prohibitive, but it does fix possible resource leaks due to races while filling a cache.
Making mutable global caches thread-safe with locking
If a thread-local cache doesn't make sense, then you can serialize access to the cache with a lock. A lock provides exclusive access to some resource by forcing threads to acquire the lock before they can use the resource and release it when they are done. The lock ensures that only one thread at a time can proceed: all other threads block until the thread holding the lock releases it, at which point exactly one waiting thread is allowed to acquire it and continue.
You could rewrite the above thread-unsafe example to be thread-safe using a lock like this:
import threading

from internals import _do_expensive_calculation

cache_lock = threading.Lock()
global_cache = {}

def do_calculation(arg):
    if arg in global_cache:
        return global_cache[arg]
    # using the lock as a context manager guarantees it is released,
    # even if _do_expensive_calculation raises an exception
    with cache_lock:
        if arg not in global_cache:
            global_cache[arg] = _do_expensive_calculation(arg)
    return global_cache[arg]
Note that after acquiring the lock, we first check whether the requested key has already been filled by another thread, to prevent unnecessary calls to _do_expensive_calculation if another thread filled the cache while the current thread was blocked on acquiring the lock. Also note that we use the lock as a context manager, which guarantees that every acquisition is paired with a release even if an exception is raised while the lock is held; a bare call to Lock.acquire must always be followed by a call to Lock.release, and acquiring a Lock again from the thread that already holds it leads to a deadlock. More generally, it is possible to create a deadlock in any program with more than one lock. Care must be taken to ensure that operations done while the lock is held cannot lead to recursive acquisition or to a situation where a thread owning the lock is blocked on acquiring a different mutex. You do not need to worry about deadlocking with the GIL itself in pure Python code; the interpreter handles that for you.
There is also threading.RLock, which provides a reentrant lock that a thread already holding it can acquire again without deadlocking.
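For example, in this sketch (a hypothetical Inventory class), add_many calls add while already holding the lock; with a plain threading.Lock this would deadlock, but an RLock allows it:

import threading

class Inventory:
    def __init__(self):
        self._lock = threading.RLock()
        self._items = {}

    def add(self, name, count):
        with self._lock:
            self._items[name] = self._items.get(name, 0) + count

    def add_many(self, updates):
        with self._lock:
            for name, count in updates.items():
                # re-acquires the RLock held above; a plain Lock would deadlock
                self.add(name, count)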
Dealing with thread-unsafe objects
Mutability of objects is deeply embedded in the Python runtime, and many tools freely assign to or mutate data stored in a Python object.
In the GIL-enabled build, you can often get away with mutating a shared object, so long as the mutation is fast enough that a thread switch is unlikely to happen in the middle of it.
In the free-threaded build there is no GIL to protect against mutation of state living on a Python object that is shared between threads. Just like when we used a lock to protect a global cache, we can also use a per-object lock to serialize access to state stored in a Python object. Consider the following class:
import time
import random

class RaceyCounter:
    def __init__(self):
        self.value = 0

    def increment(self):
        current_value = self.value
        time.sleep(random.randint(0, 10) * 0.0001)
        self.value = current_value + 1
Here the sleep simulates an expensive operation in the middle of an in-place addition. A real example might have a method that looks something like this:
def increment(self):
    self.value += do_some_expensive_calculation()
If we run this example in a thread pool, the answer will vary randomly depending on the timing of the sleeps:
from concurrent.futures import ThreadPoolExecutor

counter = RaceyCounter()

def closure(counter):
    counter.increment()

with ThreadPoolExecutor(max_workers=8) as tpe:
    futures = [tpe.submit(closure, counter) for _ in range(1000)]
    for f in futures:
        f.result()

print(counter.value)
On both the free-threaded and GIL-enabled build, you will see the output of this script randomly vary.
We can ensure the above script produces deterministic results by adding a lock to our counter:
import random
import threading
import time

class SafeCounter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def increment(self):
        with self.lock:
            current_value = self.value
            time.sleep(random.randint(0, 10) * 0.0001)
            self.value = current_value + 1
If you replace RaceyCounter with SafeCounter in the script above, it will always output 1000.
Of course this introduces a scaling bottleneck when SafeCounter instances are updated concurrently. It is possible to implement more optimized locking strategies, but doing so requires knowledge of the specific problem.
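For example, one common pattern is to shard the counter so that concurrent updates usually touch different locks, combining the shards only when the total is read. A minimal sketch, using a hypothetical ShardedCounter with the same increment interface:

import threading

class ShardedCounter:
    def __init__(self, num_shards=16):
        self._shards = [0] * num_shards
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def increment(self, amount=1):
        # map each thread to a shard so concurrent threads
        # usually contend on different locks
        i = threading.get_ident() % len(self._shards)
        with self._locks[i]:
            self._shards[i] += amount

    @property
    def value(self):
        # sum the shards, locking each one while it is read
        total = 0
        for i, lock in enumerate(self._locks):
            with lock:
                total += self._shards[i]
        return total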