Porting Python Packages to Support Free-Threading

Many packages already support free-threaded Python. Check the tracking table in this guide, the free-threaded wheels tracker, and the documentation and PyPI release pages for packages your project depends on to evaluate whether your project can run on the free-threaded build. In addition, you may need to update your code to support the free-threaded build.

Why do projects need updates?

Free-threaded Python can exploit the many cores present in modern CPUs in pure Python code. In every Python release before the free-threaded build, and in the current default build, only one thread at a time can execute Python code because of the global interpreter lock (the GIL).

Attempting to parallelize many workflows using the Python threading module will not produce any speedups on the GIL-enabled build. This means many codebases have threading bugs that up until now have only been theoretical or present in niche use cases. With free-threading, many more users will want to use Python threads, making it more important to fix existing thread safety issues. Additionally, free-threading makes new kinds of concurrent use possible, so situations where the GIL was providing safety will need new analysis to ensure they are safe under free-threaded Python.

Packages that have not yet been updated may exhibit behaviors such as:

Failing to produce deterministic results on the free-threaded build, and possibly not being deterministic with the GIL either.
Crashing the interpreter in multithreaded use if C extensions are involved, in ways that are impossible on the GIL-enabled build. Some extensions may even crash the interpreter under multithreaded use with the GIL enabled.

For a more in-depth look at the differences between the GIL-enabled and free-threaded builds, we suggest reading the ft_utils documentation on this topic. Also see the section of this porting guide on extensions to understand why compiled code needs special updates to support the free-threaded build.

Suggested Plan of Attack

Below, we outline a plan of attack for updating a Python project to support the free-threaded build. Since the changes required in native extensions are more substantial, we have split off the guide for porting extension modules into a subsequent section.

Define and document thread safety guarantees

Consider adding a section to your documentation clearly documenting the thread safety guarantees of your library. Note any use of global state as well as whether the mutable data structures exposed by your library support sequentially consistent shared concurrent use. You should document any locks that you expect might impact multithreaded scaling for realistic workflows. Encourage user feedback, particularly for reports of thread-unsafe behavior in code that is documented to be thread-safe, as well as reports of poor multithreaded scaling in code that you expect to scale well.

You can indicate the level of support for free-threading in your library by adding a trove classifier to the metadata of your package. The currently supported trove classifiers for this purpose are:

Programming Language :: Python :: Free Threading :: 1 - Unstable
Programming Language :: Python :: Free Threading :: 2 - Beta
Programming Language :: Python :: Free Threading :: 3 - Stable
Programming Language :: Python :: Free Threading :: 4 - Resilient

To give some guidance as to what each level means:

1 - Unstable: For experimentation and feedback only.
2 - Beta: Free-threaded usage is supported, but documentation of constraints and limitations may be incomplete.
3 - Stable: Supported for production use, multithreaded use is tested, and thread safety issues are clearly documented.
4 - Resilient: Fully supported and fully thread-safe.

As these levels show, supporting the free-threaded build is not all-or-nothing. It is a perfectly valid choice to, for example, only support running on the free-threaded build in effectively single-threaded contexts and not support shared use of objects. It is then up to the users of your library to add locking where needed. The advantage of this choice is that it does not force all consumers of your library to pay the cost of ensuring thread safety.

Thread Safety of Pure Python Code

The CPython interpreter protects you from low-level memory unsafety due to data races. It does not protect you from introducing thread safety issues due to race conditions. It is possible to write algorithms that depend on the precise timing of threads completing work. It is up to you as a user of multithreaded parallelism to ensure that simultaneous reads and writes to the same Python variable are impossible.

Below we describe various approaches for improving the determinism of multithreaded pure Python code. The correct approach will depend on exactly what you are doing.

General considerations for porting

Many projects assume the GIL serializes access to state shared between threads. Without the GIL, that assumption opens the door to data races in native extensions and to race conditions that are impossible when the GIL is enabled.

Ideally it should be possible to add safety without adding any performance cost. This may be impossible in the real world but is the ideal goal. You should benchmark to check that single-threaded performance is not seriously impacted by work to improve thread safety. It may be possible to set things up so that single-threaded users of your library can find ways to avoid paying the cost of synchronization.

If there is no way to add zero-cost thread-safety but the GIL is sufficient to prevent races on the GIL-enabled build, consider adding logic that only triggers if the GIL is disabled at runtime or only triggers on the free-threaded build:

import sys
import sysconfig

# sys._is_gil_enabled() was added in Python 3.13; if it is missing,
# assume the GIL is enabled.
if not getattr(sys, '_is_gil_enabled', lambda: True)():
    # logic that only happens if the GIL is disabled
    pass

if sysconfig.get_config_var("Py_GIL_DISABLED"):
    # logic that only happens on the free-threaded build
    pass

Here's an example of this approach. If adding a lock to a global cache would harm multithreaded scaling, and turning off the cache implies a small performance hit, consider doing the simpler thing and disabling the cache in the free-threaded build.
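
As a minimal sketch of that trade-off, the module below consults its cache only when the interpreter is not a free-threaded build. The names `_parse_pattern` and `parse_pattern` are hypothetical stand-ins for your own expensive function and its public wrapper.

```python
import sysconfig

# Py_GIL_DISABLED is 1 on the free-threaded build and 0 or None otherwise.
FREE_THREADED_BUILD = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

_pattern_cache = {}  # only used on the GIL-enabled build


def _parse_pattern(pattern):
    # Hypothetical stand-in for an expensive, side-effect-free computation.
    return tuple(pattern.split("."))


def parse_pattern(pattern):
    if FREE_THREADED_BUILD:
        # Skip the shared cache entirely: slower per call, but there is no
        # shared mutable state and no lock to contend on.
        return _parse_pattern(pattern)
    if pattern not in _pattern_cache:
        _pattern_cache[pattern] = _parse_pattern(pattern)
    return _pattern_cache[pattern]
```

Callers see the same results on both builds; only the caching behavior differs.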

Single-threaded performance can always be improved later, once you've established free-threaded support and hopefully improved test coverage for multithreaded workflows.

NumPy, for example, decided not to add explicit locking to the ndarray object and does not support mutating shared ndarrays. This was a pragmatic choice given existing heavy multithreaded use of NumPy in the GIL-enabled build and a desire to not introduce scaling bottlenecks in existing workflows.

Eventually NumPy may need to offer explicitly thread-safe data structures, but it is a valid choice to initially support free-threading while still exposing possibly unsafe operations if users use the library unsafely.

For your libraries, we suggest focusing on thread safety issues that only occur with the GIL disabled. Any non-critical pre-existing thread safety issues can be dealt with later, once the free-threaded build is used more. The goal for your initial porting effort should be to enable further refinement and experimentation by fixing issues that prevent using the library at all.

Multithreaded Python Programming

The Python standard library offers a rich API for multithreaded programming. This includes the threading module, which offers relatively low-level locking and synchronization primitives, as well as the queue module for safe communication between threads, and the ThreadPoolExecutor high-level thread pool runner.
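
The short sketch below shows these three pieces in use. The function names and the `None` sentinel convention are illustrative choices, not requirements of the standard library.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def square(value):
    return value * value


def square_all(values):
    # ThreadPoolExecutor manages the worker threads; map() returns results
    # in input order, regardless of which thread finishes first.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(square, values))


def double_from_queue(values):
    # queue.Queue does its own locking, so a producer and a consumer can
    # exchange items without any explicit synchronization.
    work = queue.Queue()
    results = []

    def consumer():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work
                break
            results.append(item * 2)

    worker = threading.Thread(target=consumer)
    worker.start()
    for value in values:
        work.put(value)
    work.put(None)
    worker.join()
    return results
```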

If you'd like to learn more about multithreaded Python programming in the GIL-enabled build, Santiago Basulto's tutorial from PyCon 2020 is a good place to start.

For a pedagogical introduction to multithreaded programming in free-threaded Python, we suggest reading the ft_utils documentation, particularly the section on the impact of the global interpreter lock on multithreaded Python programs. Many pure Python operations are not atomic and are susceptible to race conditions, or only appear to be thread-safe in the GIL-enabled build because of details of how CPython releases the GIL in a round-robin fashion to allow threads to run.
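
One way to see why an innocent-looking statement is not atomic is to inspect its bytecode. The sketch below uses the standard `dis` module on a hypothetical `bump` function. Exact opcode names differ between CPython versions, so it only checks that the load and the store of the global are separate instructions.

```python
import dis

counter = 0


def bump():
    global counter
    counter += 1


# Collect the opcode names CPython generated for bump().
opnames = [instruction.opname for instruction in dis.get_instructions(bump)]

# The read of `counter` and the write back to it are distinct steps, so
# another thread can update `counter` in between and have its update lost.
load_index = opnames.index("LOAD_GLOBAL")
store_index = opnames.index("STORE_GLOBAL")
```

Calling `dis.dis(bump)` prints the full listing for your interpreter.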

Dealing with mutable global state

The most common source of thread safety issues in Python packages is use of global mutable state. Many projects use module-level or class-level caches to speed up execution but do not envision filling the cache simultaneously from multiple threads. See the testing guide for strategies to add tests to detect problematic global state.

For example, the do_calculation function in the following module is not thread-safe:

from internals import _do_expensive_calculation

global_cache = {}


def do_calculation(arg):
    if arg not in global_cache:
        global_cache[arg] = _do_expensive_calculation(arg)
    return global_cache[arg]

If do_calculation is called simultaneously in multiple threads, then two or more threads can see that global_cache doesn't have the cached key and each call _do_expensive_calculation. In some cases this is harmless, but depending on the nature of the cache, this could lead to unnecessary network access, resource leaks, or wasted compute.
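
The duplicated work can be made visible deterministically. In this sketch, a `threading.Barrier` inside a stand-in `_do_expensive_calculation` (a hypothetical replacement for the real function) forces two threads to overlap, so both are guaranteed to miss the cache.

```python
import threading

global_cache = {}
calls = []  # one entry per real computation
both_computing = threading.Barrier(2, timeout=5)


def _do_expensive_calculation(arg):
    calls.append(arg)
    # Wait until the other thread is also inside this function. That can
    # only happen if both threads saw the key as missing from the cache.
    both_computing.wait()
    return arg * 2


def do_calculation(arg):
    if arg not in global_cache:
        global_cache[arg] = _do_expensive_calculation(arg)
    return global_cache[arg]


threads = [threading.Thread(target=do_calculation, args=(21,)) for _ in range(2)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

After both threads finish, `calls` holds two entries for the same argument even though the cache ends up with a single key.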

Converting global state to thread-local state

One way of dealing with issues like this is to convert a shared global cache into a thread-local cache. In this approach, each thread will see its own private copy of the cache, making races between threads impossible. This approach makes sense if having extra copies of the cache in each thread is not prohibitively expensive or does not lead to excessive runtime network, CPU, or memory use.

In pure Python, you can create a thread-local cache using an instance of threading.local. Each thread will see independent versions of the thread-local object. You could rewrite the above example to use a thread-local cache like so:

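import threading

from internals import _do_expensive_calculation

local = threading.local()


def _get_cache():
    # Attributes set on a threading.local instance exist only in the thread
    # that set them, so each thread's cache is created lazily on first use.
    try:
        return local.cache
    except AttributeError:
        local.cache = {}
        return local.cache


def do_calculation(arg):
    cache = _get_cache()
    if arg not in cache:
        cache[arg] = _do_expensive_calculation(arg)
    return cache[arg]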

This approach does not help when a per-thread copy of the cache would be prohibitively expensive, but it does fix possible resource leaks caused by races to fill the cache.
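
To check that threads really do get independent state, the sketch below subclasses `threading.local`, whose `__init__` runs once in each thread that touches the instance. The class name `_ThreadCache` is hypothetical.

```python
import threading


class _ThreadCache(threading.local):
    def __init__(self):
        # Called separately in every thread on first access to the instance.
        self.data = {}


local = _ThreadCache()
local.data["main"] = 1  # visible only in the thread that stored it

seen_in_worker = []


def worker():
    # The worker starts with its own empty dict, not the main thread's.
    seen_in_worker.append(dict(local.data))
    local.data["worker"] = 2


thread = threading.Thread(target=worker)
thread.start()
thread.join()
```

The worker observes an empty dict, and its own write never appears in the main thread's `local.data`.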

Copy-on-Write

Copy-on-Write (CoW) is a thread-safe pattern for lock-free sharing of data structures. It is useful when reads are much more frequent than writes, which is typical of caches.

Consider a library which generates the nth Fibonacci number. The library caches previously computed Fibonacci numbers.

cache = [0, 1]


def fib(nth: int) -> int:
    global cache
    if nth < 1:
        raise ValueError("nth must be a positive integer")

    # Atomically read shared reference to global cache
    local_cache = cache

    if nth >= len(local_cache):
        # Make a new un-shared list
        local_cache = local_cache.copy()

        # Mutating here is safe because the list local_cache refers
        # to is private to this thread
        while nth >= len(local_cache):
            local_cache.append(local_cache[-1] + local_cache[-2])

        # Atomically update global shared reference to point to the new list
        cache = local_cache

    # Must use a reference to the local_cache because another thread
    # may have updated the global reference
    return local_cache[nth]

This code is thread-safe because the shared global cache is never modified in-place. Instead, a new copy of the cache is created and updated, and then the reference to the cache is updated atomically. This ensures that readers always see a consistent view of the cache, even if a writer is updating it concurrently.

This does not rely on the thread-safety of the underlying list. Instead, it relies on the fact that shared references can be read from and modified atomically. This means you can use this technique to allow lock-free access to a shared global cache implemented using a thread-unsafe data structure.

Note that for this to work correctly, readers must not assume that the shared reference (the global cache variable) will be unchanged from one access to the next. For example, this is not thread-safe:

if nth < len(cache):
    # Another thread may replace cache with a shorter list
    # after len(cache) but before cache[nth] so that this fails:
    return cache[nth]

Instead, readers should atomically copy the shared reference to a local variable and then only access the local variable:

    local_cache = cache
    if nth < len(local_cache):
        # No other thread will reassign the local_cache variable
        # or mutate the object that it points to.
        return local_cache[nth]

Also keep in mind that readers may not necessarily see the most up-to-date version of the cache. The CPU cost to calculate some entries will be wasted if there are races to create a new cache. For memoization and other caching this is often fine but may be problematic for some use-cases.

Locking

If a thread-local cache doesn't make sense, then you can serialize access to the cache with a lock. A lock provides exclusive access to some resource by forcing threads to acquire a lock instance before they can use the resource and release the lock when they are done. Only one thread at a time can hold the lock. All other threads that try to acquire it block until the holder releases it, at which point exactly one of the waiting threads acquires the lock and is allowed to run.

You could rewrite the above thread-unsafe example to be thread-safe using a lock like this:

import threading

from internals import _do_expensive_calculation

cache_lock = threading.Lock()
global_cache = {}


def do_calculation(arg):
    if arg in global_cache:
        return global_cache[arg]

    with cache_lock:
        if arg not in global_cache:
            global_cache[arg] = _do_expensive_calculation(arg)
    return global_cache[arg]

Note that after acquiring the lock, we first check if the requested key has been filled by another thread, to prevent unnecessary calls to _do_expensive_calculation if another thread filled the cache while the thread currently holding the lock was blocked on acquiring the lock. Also note that we avoid using Lock.acquire and Lock.release and instead we use the lock as a context manager. The difference is subtle: the context manager calls Lock.release in a try ... finally clause, so if _do_expensive_calculation were to raise an exception, this ensures that the lock won't stay locked forever.

Note that acquiring the same threading.Lock recursively leads to a deadlock. More generally, it is possible to create a deadlock in any program with more than one lock. Care must be taken to ensure that operations done while the lock is held cannot lead to recursive calls or to a situation where a thread owning the lock is blocked on acquiring a different mutex. You do not need to worry about deadlocking with the GIL in pure Python code; the interpreter handles that for you.

There is also threading.RLock, which provides a reentrant lock allowing threads to recursively acquire the same lock, but is not quite as performant as a threading.Lock in single-threaded use.
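
The difference between the two lock types can be observed directly. This sketch uses only documented `threading` APIs; the variable names are illustrative.

```python
import threading

lock = threading.Lock()
with lock:
    # A plain Lock cannot be re-acquired while held, even by the thread
    # that holds it. A non-blocking attempt reports failure; a blocking
    # attempt here would deadlock.
    reacquired_plain = lock.acquire(blocking=False)

rlock = threading.RLock()
with rlock:
    # An RLock tracks its owning thread and a recursion count, so the
    # owner can acquire it again. Each acquire needs a matching release.
    reacquired_reentrant = rlock.acquire(blocking=False)
    if reacquired_reentrant:
        rlock.release()
```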

Finally, note how the above code will ensure that only a single call to _do_expensive_calculation will run at any given time, regardless of arg. This may not be desirable; one might want to allow calling the function in parallel for different arguments. This however would require a substantially more complex locking pattern.
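
One possible shape for such a pattern is a lock per key, with a second lock guarding the table of per-key locks. This is a sketch under stated assumptions rather than a recommendation: `_do_expensive_calculation` is a stand-in, the helper `_lock_for` is hypothetical, and the lock table grows without bound because entries are never removed.

```python
import threading

global_cache = {}
_key_locks = {}
_key_locks_guard = threading.Lock()
call_log = []  # records each real computation, for demonstration only


def _do_expensive_calculation(arg):
    call_log.append(arg)
    return arg * 2


def _lock_for(arg):
    # Held only briefly, to look up or create the lock for this key.
    with _key_locks_guard:
        lock = _key_locks.get(arg)
        if lock is None:
            lock = _key_locks[arg] = threading.Lock()
        return lock


def do_calculation(arg):
    if arg in global_cache:
        return global_cache[arg]
    # Threads asking for the same key serialize here; threads asking for
    # different keys hold different locks and can compute in parallel.
    with _lock_for(arg):
        if arg not in global_cache:
            global_cache[arg] = _do_expensive_calculation(arg)
    return global_cache[arg]


threads = [threading.Thread(target=do_calculation, args=(n % 2,)) for n in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Eight threads request two distinct keys, and each key is computed exactly once.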

Raising errors under shared concurrent use

Sometimes it's a programming error to share an object between threads. An example might be a wrapper for a low-level C compression library that does not support sharing compression contexts between threads. You could make it so users see an error at runtime when they try to share a compression context like this:

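import threading
from dataclasses import dataclass


# _LowLevelCompressionContext stands for the wrapped low-level context type,
# which is assumed to be defined elsewhere.
@dataclass
class CompressionContext:
    lock: threading.Lock
    state: _LowLevelCompressionContext

    def compress(self, data):
        # Fail fast instead of blocking if another thread is using the context.
        if not self.lock.acquire(blocking=False):
            raise RuntimeError("Concurrent use detected!")
        try:
            return self.state.compress(data)
        finally:
            self.lock.release()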

This does require paying the cost of acquiring and releasing a mutex, but because no thread ever blocks on acquiring the lock, this approach cannot introduce hidden multithreaded scaling issues.
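
To see the error path in action without a real C library, the sketch below swaps in a hypothetical `_SlowState` class whose `compress` method pauses until told to finish. That keeps one thread inside the context while the main thread attempts a second, overlapping call.

```python
import threading


class _SlowState:
    """Hypothetical stand-in for a low-level compression context."""

    def __init__(self):
        self.entered = threading.Event()
        self.may_finish = threading.Event()

    def compress(self, data):
        self.entered.set()
        self.may_finish.wait(timeout=5)
        return data[::-1]  # placeholder for real compression


class CompressionContext:
    def __init__(self, state):
        self.lock = threading.Lock()
        self.state = state

    def compress(self, data):
        if not self.lock.acquire(blocking=False):
            raise RuntimeError("Concurrent use detected!")
        try:
            return self.state.compress(data)
        finally:
            self.lock.release()


ctx = CompressionContext(_SlowState())
worker = threading.Thread(target=ctx.compress, args=(b"abc",))
worker.start()
ctx.state.entered.wait(timeout=5)  # the worker now holds the lock

try:
    ctx.compress(b"xyz")
    error = None
except RuntimeError as exc:
    error = str(exc)
finally:
    ctx.state.may_finish.set()
    worker.join()
```

The second call raises immediately rather than waiting for the worker to finish.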

Dealing with thread-unsafe objects

Mutability of objects is deeply embedded in the Python runtime, and many tools freely assign to or mutate data stored in a Python object.

In the GIL-enabled build, in many cases, you can get away with mutating a shared object safely. This is true so long as whatever mutation you are attempting to do is fast enough that a thread switch is very unlikely to happen while you are doing work.

In the free-threaded build there is no GIL to protect against mutation of state living on a Python object that is shared between threads. Just like when we used a lock to protect a global cache, we can also use a per-object lock to serialize access to state stored in a Python object. Consider the following class:

import time
import random


class RaceyCounter:
    def __init__(self):
        self.value = 0

    def increment(self):
        current_value = self.value
        time.sleep(random.randint(0, 10) * 0.0001)
        self.value = current_value + 1

Here we're simulating an in-place addition using an expensive function. A real example might have a method that looks something like this:

def increment(self):
    self.value = do_some_expensive_calculation(self.value)

If you run this example in a thread pool, you'll see that the result varies randomly depending on the timing of the sleeps:

from concurrent.futures import ThreadPoolExecutor

counter = RaceyCounter()


def closure(counter):
    counter.increment()


with ThreadPoolExecutor(max_workers=8) as tpe:
    futures = [tpe.submit(closure, counter) for _ in range(1000)]
    for f in futures:
        f.result()

print(counter.value)

On both the free-threaded and GIL-enabled build, you will see the output of this script randomly vary.

We can ensure the above script has deterministic answers by adding a lock to our counter:

import random
import threading
import time


class SafeCounter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def increment(self):
        with self.lock:
            current_value = self.value
            time.sleep(random.randint(0, 10) * 0.0001)
            self.value = current_value + 1

If you replace RaceyCounter with SafeCounter in the script above, it will always output 1000.

Of course this introduces a scaling bottleneck when SafeCounter instances are concurrently updated. It's possible to implement more optimized locking strategies, but doing so requires knowledge of how the object is actually used.
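
For instance, if the expensive part of an update does not depend on the current value, it can run outside the lock so that only the brief read-modify-write is serialized. The `ShortLockCounter` class below is a hypothetical variation on `SafeCounter` under that assumption; it does not apply when the new value must be computed from the old one.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ShortLockCounter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def increment(self):
        # Simulated expensive work that does not read self.value, so
        # several threads can do it at the same time.
        time.sleep(random.randint(0, 10) * 0.0001)
        # Only the short update of shared state holds the lock.
        with self.lock:
            self.value += 1


counter = ShortLockCounter()
with ThreadPoolExecutor(max_workers=8) as tpe:
    futures = [tpe.submit(counter.increment) for _ in range(200)]
    for f in futures:
        f.result()
```

The final count is still deterministic, but threads now wait on each other only for the duration of the increment itself.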

Third-party libraries

Both the ft_utils and cereggii libraries offer data structures that add enhanced atomicity or improved multithreaded scaling compared with standard library primitives.

Dependencies that don't support free-threading

If one of your package's dependencies does not support free-threading, you might be able to switch to a fork that does. Find more details in our guidance for handling dependencies that don't support free-threading.