Porting Python Packages to Support Free-Threading

Many Python packages, particularly those relying on C extension modules, are not thread-safe in the free-threaded build as of mid-2024. Until now, the GIL has added implicit locking around any operation in Python or C that holds the GIL, and the GIL must be explicitly released before many thread safety issues become problematic. Additionally, because the GIL prevents the threading module from speeding up many workloads, users have tended to reach for other parallelization strategies, so thread safety issues that are possible even with the GIL are rarely hit in practice. This means many codebases have threading bugs that up until now have only been theoretical or present in niche use cases. With free-threading, many more users will want to use Python threads.

This means we must analyze Python codebases to identify supported and unsupported multithreaded workflows and make changes to fix thread safety issues. This need is particularly acute for low-level code exposed to Python, including C, C++, Cython, and Rust extensions, but even pure Python codebases can exhibit non-determinism and race conditions in the free-threaded build that are either very unlikely or impossible in the default configuration of the GIL-enabled build.

Suggested Plan of Attack

Below, we outline a plan of attack for updating a Python project to support the free-threaded build. Since the changes required in native extensions are more substantial, we have split off the guide for porting extension modules into a subsequent section.

Thread Safety of Pure Python Code

The CPython interpreter protects you from low-level memory unsafety due to data races. It does not protect you from introducing thread safety issues due to race conditions. It is possible to write algorithms that depend on the precise timing of threads completing work. That means it is up to you as a user of multithreaded parallelism to ensure that any resources that need protection from multithreaded access or mutation are appropriately protected.

Below we describe various approaches for improving the determinism of multithreaded pure Python code. The correct approach will depend on exactly what you are doing.

General considerations for porting

Many projects assume the GIL serializes access to state shared between threads. In the free-threaded build this assumption no longer holds, introducing the possibility of data races in native extensions and race conditions that are impossible when the GIL is enabled.

We suggest focusing on safety over single-threaded performance. For example, if adding a lock to a global cache would harm multithreaded scaling, and turning off the cache implies a small performance hit, consider doing the simpler thing and disabling the cache in the free-threaded build. Single-threaded performance can always be improved later, once you've established free-threaded support and hopefully improved test coverage for multithreaded workflows.
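If you take this route, you need a way to detect the free-threaded build. Below is a minimal sketch of the pattern, reusing the hypothetical _do_expensive_calculation helper from the examples later in this guide; it checks the Py_GIL_DISABLED build variable via sysconfig:

import sysconfig

from internals import _do_expensive_calculation

# Py_GIL_DISABLED is 1 on free-threaded builds of CPython 3.13+
FREE_THREADED_BUILD = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# on the free-threaded build, skip the shared cache rather than
# adding locking around it
global_cache = None if FREE_THREADED_BUILD else {}


def do_calculation(arg):
    if global_cache is None:
        return _do_expensive_calculation(arg)
    if arg not in global_cache:
        global_cache[arg] = _do_expensive_calculation(arg)
    return global_cache[arg]

Note that sysconfig reports how the interpreter was built; on Python 3.13+, sys._is_gil_enabled() reports whether the GIL is actually enabled at runtime, since the free-threaded build can also run with the GIL turned back on.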

NumPy, for example, decided not to add explicit locking to the ndarray object and does not support mutating shared ndarrays. This was a pragmatic choice given existing heavy multithreaded use of NumPy in the GIL-enabled build and a desire not to introduce scaling bottlenecks in existing workflows.

Eventually NumPy may need to offer explicitly thread-safe data structures, but it is a valid choice to initially support free-threading while still exposing operations that are unsafe under unsupported multithreaded use.

For pure Python packages, a racey algorithm might result in unexpected exceptions or silently incorrect results. Projects shipping extension modules might additionally see crashes or trigger undefined behavior. See the section on supporting native code if you are curious about supporting compiled Python extensions in the free-threaded build.

For your libraries, we suggest focusing on thread safety issues that only occur with the GIL disabled. Any non-critical preexisting thread safety issues can be dealt with later, once the free-threaded build sees more use. The goal of your initial porting effort should be to enable further refinement and experimentation by fixing issues that prevent using the library at all.

Multithreaded Python Programming

The Python standard library offers a rich API for multithreaded programming. This includes the threading module, which offers relatively low-level locking and synchronization primitives, as well as the queue module and the ThreadPoolExecutor high-level thread pool interface.
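For instance, a minimal sketch combining these pieces (the doubling workload is purely illustrative) might have worker threads drain a shared queue via a thread pool:

import queue
from concurrent.futures import ThreadPoolExecutor

work = queue.Queue()
for item in range(10):
    work.put(item)


def worker():
    # process items until the queue is drained
    results = []
    while True:
        try:
            item = work.get_nowait()
        except queue.Empty:
            return results
        results.append(item * 2)


with ThreadPoolExecutor(max_workers=4) as tpe:
    futures = [tpe.submit(worker) for _ in range(4)]
    print(sum(sum(f.result()) for f in futures))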

If you'd like to learn more about multithreaded Python programming in the GIL-enabled build, Santiago Basulto's tutorial from PyCon 2020 is a good place to start.

For a pedagogical introduction to multithreaded programming in free-threaded Python, we suggest reading the ft_utils documentation, particularly the section on the impact of the global interpreter lock on multithreaded Python programs. Many pure Python operations are not atomic and are susceptible to race conditions, or only appear to be thread-safe in the GIL-enabled build because of details of how CPython releases the GIL in a round-robin fashion to allow threads to run.

Both the ft_utils and cereggii libraries offer data structures that add enhanced atomicity to standard library primitives. We hope tools like these, which aid concurrent free-threaded programming, continue to appear and evolve, as they will be key to enabling scalable multithreaded workflows.

Dealing with mutable global state

The most common source of thread safety issues in Python packages is use of global mutable state. Many projects use module-level or class-level caches to speed up execution but do not envision filling the cache simultaneously from multiple threads. See the testing guide for strategies to add tests to detect problematic global state.

For example, the do_calculation function in the following module is not thread-safe:

from internals import _do_expensive_calculation

global_cache = {}


def do_calculation(arg):
    if arg not in global_cache:
        global_cache[arg] = _do_expensive_calculation(arg)
    return global_cache[arg]

If do_calculation is called simultaneously in multiple threads, then it is possible for two or more threads to see that global_cache doesn't have the cached key and call _do_expensive_calculation. In some cases this is harmless, but depending on the nature of the cache, this could lead to unnecessary network access, resource leaks, or wasted compute.
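You can observe the duplicated work directly. In this self-contained sketch, _do_expensive_calculation is replaced by a slow stand-in that counts how many times it runs; with the racy cache above, the count often exceeds the number of distinct keys:

import threading
import time
from concurrent.futures import ThreadPoolExecutor

call_count = 0
count_lock = threading.Lock()


def _do_expensive_calculation(arg):
    # count invocations so we can see the duplicated work
    global call_count
    with count_lock:
        call_count += 1
    time.sleep(0.01)  # simulate slow work
    return arg * 2


global_cache = {}


def do_calculation(arg):
    if arg not in global_cache:
        global_cache[arg] = _do_expensive_calculation(arg)
    return global_cache[arg]


with ThreadPoolExecutor(max_workers=8) as tpe:
    futures = [tpe.submit(do_calculation, 42) for _ in range(8)]
    for f in futures:
        f.result()

# one distinct key, but the expensive call likely ran several times
print(call_count)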

Converting global state to thread-local state

One way of dealing with issues like this is to convert a shared global cache into a thread-local cache. In this approach, each thread sees its own private copy of the cache, making races between threads impossible. This approach makes sense if having an extra copy of the cache in each thread is not prohibitively expensive and does not lead to excessive runtime network, CPU, or memory use.

In pure Python, you can create a thread-local cache using an instance of threading.local. Each thread sees an independent set of attributes on the thread-local object, so the cache must be initialized lazily in each thread rather than once at module import time. You could rewrite the above example to use a thread-local cache like so:

import threading

from internals import _do_expensive_calculation

local = threading.local()


def do_calculation(arg):
    # attributes set on a threading.local instance are visible only to
    # the thread that set them, so each thread lazily initializes its
    # own private cache on first use
    if not hasattr(local, "cache"):
        local.cache = {}
    if arg not in local.cache:
        local.cache[arg] = _do_expensive_calculation(arg)
    return local.cache[arg]

This wouldn't help a case where each thread having a copy of the cache would be prohibitive, but it does fix possible resource leaks due to races while filling a cache.
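Alternatively, you can subclass threading.local: the subclass's __init__ runs once per thread, the first time that thread touches the object, which avoids the explicit hasattr check. A minimal sketch, assuming the same hypothetical _do_expensive_calculation helper:

import threading

from internals import _do_expensive_calculation


class Cache(threading.local):
    def __init__(self):
        # runs separately in each thread that uses the object
        self.cache = {}


local = Cache()


def do_calculation(arg):
    if arg not in local.cache:
        local.cache[arg] = _do_expensive_calculation(arg)
    return local.cache[arg]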

Making mutable global caches thread-safe with locking

If a thread-local cache doesn't make sense, then you can serialize access to the cache with a lock. A lock provides exclusive access to a resource by forcing threads to acquire the lock before using the resource and to release it when done. The lock ensures that only one thread at a time can hold it: all other threads block until the thread holding the lock releases it, at which point exactly one waiting thread is allowed to proceed.

You could rewrite the above thread-unsafe example to be thread-safe using a lock like this:

import threading

from internals import _do_expensive_calculation

cache_lock = threading.Lock()
global_cache = {}


def do_calculation(arg):
    if arg in global_cache:
        return global_cache[arg]

    # using the lock as a context manager guarantees it is released,
    # even if _do_expensive_calculation raises an exception
    with cache_lock:
        if arg not in global_cache:
            global_cache[arg] = _do_expensive_calculation(arg)
    return global_cache[arg]

Note that after acquiring the lock, we check again whether the requested key has been filled, to prevent an unnecessary call to _do_expensive_calculation when another thread filled the cache while we were blocked on acquiring the lock. Also note that every successful Lock.acquire must be paired with a Lock.release; using the lock as a context manager, as above, ensures this even if an exception is raised while the lock is held. Acquiring the same Lock recursively deadlocks, and, in general, it is possible to create a deadlock in any program with more than one lock. Care must be taken to ensure that operations done while the lock is held cannot lead to recursive calls or to a situation where a thread owning the lock blocks on acquiring a different lock. You do not need to worry about deadlocking against the GIL in pure Python code; the interpreter handles that for you.

There is also threading.RLock, which provides a reentrant lock allowing threads to recursively acquire the same lock.
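For example, a reentrant lock lets a method that holds the lock call another method that also acquires it, which would deadlock with a plain Lock. A minimal illustrative sketch:

import threading


class Settings:
    def __init__(self):
        self._lock = threading.RLock()
        self._data = {}

    def set_many(self, updates):
        with self._lock:
            for key, value in updates.items():
                # set() re-acquires the lock this thread already holds;
                # with threading.Lock this inner acquire would deadlock
                self.set(key, value)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value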

Dealing with thread-unsafe objects

Mutability of objects is deeply embedded in the Python runtime, and many tools freely assign to or mutate data stored in a Python object.

In the GIL-enabled build, you can often get away with mutating a shared object safely, so long as the mutation is fast enough that a thread switch is very unlikely to happen while you are doing the work.

In the free-threaded build there is no GIL to protect against mutation of state living on a Python object that is shared between threads. Just like when we used a lock to protect a global cache, we can also use a per-object lock to serialize access to state stored in a Python object. Consider the following class:

import time
import random


class RaceyCounter:
    def __init__(self):
        self.value = 0

    def increment(self):
        current_value = self.value
        time.sleep(random.randint(0, 10) * 0.0001)
        self.value = current_value + 1

Here we're simulating an in-place addition that depends on an expensive computation. A real example might have a method that looks something like this:

def increment(self):
    self.value += do_some_expensive_calculation()

If you run this example in a thread pool, you'll see that the answer varies randomly depending on the timing of the sleeps:

from concurrent.futures import ThreadPoolExecutor

counter = RaceyCounter()


def closure(counter):
    counter.increment()


with ThreadPoolExecutor(max_workers=8) as tpe:
    futures = [tpe.submit(closure, counter) for _ in range(1000)]
    for f in futures:
        f.result()

print(counter.value)

On both the free-threaded and GIL-enabled builds, the output of this script varies randomly from run to run.

We can ensure the above script has deterministic results by adding a lock to our counter:

import random
import threading
import time


class SafeCounter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def increment(self):
        # the lock makes the read-sleep-write sequence atomic with
        # respect to other calls to increment
        with self.lock:
            current_value = self.value
            time.sleep(random.randint(0, 10) * 0.0001)
            self.value = current_value + 1

If you replace RaceyCounter with SafeCounter in the script above, it will always output 1000.

Of course this introduces a scaling bottleneck when SafeCounter instances are concurrently updated. It's possible to implement more optimized locking strategies, but doing so requires detailed knowledge of the problem and its access patterns.
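As one illustration of such a strategy (a sketch under the assumption that exact totals are only needed after the workers finish, not a recommendation for every workload), a counter can shard its state per thread so that increments are uncontended, taking the lock only to register a new shard or to aggregate a total:

import threading


class ShardedCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._local = threading.local()
        self._shards = []

    def increment(self, amount=1):
        shard = getattr(self._local, "shard", None)
        if shard is None:
            # first increment from this thread: register a private shard
            shard = self._local.shard = [0]
            with self._lock:
                self._shards.append(shard)
        # only this thread writes to its shard, so no lock is needed
        shard[0] += amount

    def total(self):
        with self._lock:
            return sum(shard[0] for shard in self._shards)

Here increments never contend on the lock after a thread's first call. Reading total() while writers are still running may observe a partially updated sum, but the result is exact once all threads have finished.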