Porting Python Packages to Support Free-Threading

Many Python packages, particularly packages relying on C extension modules, are not thread-safe in the free-threaded build as of mid-2024. Until now, the GIL has added implicit locking around any operation in Python or C that holds the GIL, and the GIL must be explicitly dropped before many thread safety issues become problematic. Also, because of the GIL, parallelizing many workflows with the Python threading module does not produce any speedup, so users rely on threading less than on other parallelization strategies, and thread safety issues that are possible even with the GIL are rarely hit. As a result, many codebases have threading bugs that until now have been only theoretical or confined to niche use cases. With free-threading, many more users will want to use Python threads.

This means we must analyze Python codebases, particularly in low-level extension modules, to identify thread safety issues and make changes to thread-unsafe low-level code, including C, C++, Cython, and Rust code exposed to Python.

Updating Extension Modules

Declaring free-threaded support

Extension modules need to explicitly indicate that they support running with the GIL disabled. Otherwise, importing the module prints a warning and the GIL is re-enabled at runtime.
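To confirm whether an import has re-enabled the GIL, you can inspect the interpreter from Python. The following is a minimal sketch, assuming Python 3.13 or newer for `sys._is_gil_enabled()`; the helper name `gil_status` is hypothetical and only used for illustration.

```python
import sys
import sysconfig


def gil_status():
    """Return (is_free_threaded_build, gil_currently_enabled)."""
    # Py_GIL_DISABLED is set to 1 in the build configuration of free-threaded builds
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in Python 3.13; on older versions
    # the GIL is always enabled, so fall back to True
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return free_threaded_build, gil_enabled


if __name__ == "__main__":
    build, enabled = gil_status()
    print(f"free-threaded build: {build}, GIL enabled: {enabled}")
```

Calling `gil_status()` before and after importing your extension in a free-threaded interpreter is one way to check that the module's declaration was picked up.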

C or C++ extension modules using multi-phase initialization can specify the Py_mod_gil module slot like this:

static PyModuleDef_Slot module_slots[] = {
    ...
#ifdef Py_GIL_DISABLED
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},
#endif
    {0, NULL}
};

The Py_mod_gil slot has no effect in the non-free-threaded build.

Extensions that use single-phase initialization need to call PyUnstable_Module_SetGIL() in the module's initialization function:

PyMODINIT_FUNC
PyInit__module(void)
{
    PyObject *mod = PyModule_Create(&module);
    if (mod == NULL) {
        return NULL;
    }

#ifdef Py_GIL_DISABLED
    PyUnstable_Module_SetGIL(mod, Py_MOD_GIL_NOT_USED);
#endif

    return mod;
}

C++ extension modules making use of pybind11 can declare support for running with the GIL disabled via the gil_not_used argument to create_extension_module, which the PYBIND11_MODULE macro accepts as the py::mod_gil_not_used() tag. Example:

#include <pybind11/pybind11.h>
namespace py = pybind11;

PYBIND11_MODULE(example, m, py::mod_gil_not_used()) {
    ...
}

Cython code can be thread-unsafe and just like C and C++ code can exhibit undefined behavior due to data races.

Code operating on Python objects should not exhibit any low-level data corruption or C undefined behavior due to Python-level semantics. If you find such a case, it may be a Cython or CPython bug and should be reported as such.

That said, as opposed to data races, race conditions that produce random results from a multithreaded algorithm are not undefined behavior and are allowed in Python and therefore in Cython as well. You will still need to add locking or synchronization where appropriate to ensure reproducible results when running a multithreaded algorithm on shared mutable data. See the suggested plan of attack below for more details about discovering and fixing thread safety issues for Python native extensions.

Starting with Cython 3.1.0 (available via the nightly wheels, a PyPI pre-release or the master branch at the time of writing), extension modules written in Cython can declare free-threaded support using the freethreading_compatible compiler directive.

You can do this in one of several ways, e.g., in a source file:

# cython: freethreading_compatible=True

Or by passing the directive when invoking the cython executable:

$ cython -X freethreading_compatible=True

Or via a build system specific way of passing directives to Cython.

Tip

Here are examples of how to globally enable the directive in a few popular build systems:

When using setuptools, you can pass the compiler_directives keyword argument to cythonize:

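from Cython.Build import cythonize
from Cython.Compiler.Version import version as cython_version
from packaging.version import Version
from setuptools import setup

# `extensions` below is assumed to be your existing list of Extension objects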

compiler_directives = {}
if Version(cython_version) >= Version("3.1.0a0"):
    compiler_directives["freethreading_compatible"] = True

setup(
    ext_modules=cythonize(
        extensions,
        compiler_directives=compiler_directives,
    )
)

When using Meson, you can add the directive to the cython_args you're passing to py.extension_module:

cy = meson.get_compiler('cython')

cython_args = []
if cy.version().version_compare('>=3.1.0')
    cython_args += ['-Xfreethreading_compatible=True']
endif

py.extension_module('modulename',
    'source.pyx',
    cython_args: cython_args,
    ...
)

You can also globally add the directive for all Cython extension modules:

cy = meson.get_compiler('cython')
if cy.version().version_compare('>=3.1.0')
    add_project_arguments('-Xfreethreading_compatible=true', language : 'cython')
endif

In CI, you will need to ensure a nightly build of Cython is installed for free-threaded builds. See the docs on setting up CI for advice on how to build projects that depend on Cython.

If you use the CPython C API via PyO3, then you can follow the PyO3 Guide section on supporting free-threaded Python. You must also update your extension to use PyO3 0.23 or newer.

You should write multithreaded tests of any code you expose to Python. See the details about testing in our suggested plan of attack below as well as the guidance for updating test suites. You should fix any thread safety issues you discover while running multithreaded tests.

As of PyO3 0.23, PyO3 enforces Rust's borrow checking rules at runtime and may produce runtime panics if you simultaneously mutably borrow data in more than one thread. You may want to consider storing state using atomic data structures, with mutexes or locks, or behind Arc pointers.

Once you are satisfied that the Python modules defined by your Rust crate are thread-safe, you can pass gil_used = false to the pymodule macro:

#[pymodule(gil_used = false)]
fn my_module(py: Python, m: &Bound<'_, PyModule>) -> PyResult<()> {
    ...
}

If you define any modules procedurally by manually creating a PyModule struct without using the pymodule macro, you can call PyModuleMethods::gil_used after instantiating the module.

If you use the pyo3-ffi crate and/or unsafe FFI calls to call directly into the C API, then see the section on porting C extensions in this guide as well as the PyO3 source code.

Starting with NumPy 2.1.0, extension modules containing f2py-wrapped Fortran code can declare they are thread-safe and support free-threading using the --freethreading-compatible command-line argument:

$ python -m numpy.f2py -c code.f -m my_module --freethreading-compatible

If you publish binaries and have downstream libraries that depend on your library, we suggest adding support as described above and uploading nightly wheels as soon as basic support for the free-threaded build is established in the development branch. This will ease the work of libraries that depend on yours to also add support for the free-threaded build.

Suggested Plan of Attack

Validating thread safety with testing

Put priority on thread safety issues surfaced by real-world testing. Run the test suite for your project and fix any failures that occur only with the GIL disabled. Some issues may be due to changes in Python 3.13 that are not specific to the free-threaded build.

Definitely run your existing test suite with the GIL disabled, but unless your tests make heavy use of the threading module, you will likely not hit many issues, so also consider constructing multithreaded tests to expose bugs based on workflows you want to support. Issues found in these tests are the issues your users will most likely hit first.

Multithreaded Python programs can exhibit race conditions which produce random results depending on the order of execution in a multithreaded context. This can happen even with the GIL providing locking, so long as the algorithm releases the GIL at some point, and many Python operations can lead to the GIL being released at some point. If your library was not designed with multithreading in mind, it is likely that some form of locking or synchronization is necessary to make mutable data structures defined by your library thread-safe. You should document the thread-safety guarantees of your library, both with and without the GIL.

You can look at pytest-run-parallel as well as pytest-freethreaded, which both offer pytest plugins to enable running tests in an existing pytest test suite simultaneously in many threads, with the goal of validating thread safety. unittest-ft offers similar functionality for running unittest-based tests in parallel. See the section below on global state in tests for more information about updating test suites to work with the free-threaded build.

These plugins are useful for discovering issues related to use of global state, but cannot discover issues from multithreaded use of data structures defined by your library.

If you would like to create your own testing utilities, the concurrent.futures.ThreadPoolExecutor class is a lightweight way to create multithreaded tests where many threads repeatedly call a function simultaneously. You can also use the threading module directly for more complicated multithreaded test workflows. Adding a threading.Barrier before a line of code that you suspect will trigger a race condition is a good way to synchronize workers and increase the chances that an infrequent test failure will trigger.
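The approach described above can be sketched as a small helper. This is a minimal sketch, not a definitive utility: the name `run_in_threads` and its parameters are hypothetical. It combines `ThreadPoolExecutor` with a `threading.Barrier` so that all workers reach the code under test at roughly the same moment.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(func, n_threads=8, n_calls=1):
    """Call func() n_calls times from each of n_threads threads.

    Returns one list of results per thread.
    """
    barrier = threading.Barrier(n_threads)

    def worker():
        # Line up all workers here so they enter func() together,
        # which makes an infrequent race more likely to show up.
        barrier.wait()
        return [func() for _ in range(n_calls)]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(worker) for _ in range(n_threads)]
        # .result() re-raises any exception raised inside a worker thread
        return [future.result() for future in futures]
```

A test can then call `run_in_threads(some_function)` and assert on the returned results or on the final state of a shared object. The pool size matches the barrier size on purpose: with fewer workers than barrier parties, the barrier would never release.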

General considerations for porting

Many extensions assume the GIL serializes access to state shared between threads. Without the GIL, that assumption no longer holds, which introduces the possibility of data races and race conditions that are impossible when the GIL is enabled.

The CPython C API exposes the Py_GIL_DISABLED macro in the free-threaded build. You can use it to enable low-level code that only runs under the free-threaded build, isolating possibly performance-impacting changes to the free-threaded build:

#ifdef Py_GIL_DISABLED
// free-threaded specific code goes here
#endif

#ifndef Py_GIL_DISABLED
// code for gil-enabled builds goes here
#endif

We suggest focusing on safety over single-threaded performance. For example, if adding a lock to a global cache would harm multithreaded scaling, and turning off the cache implies a small performance hit, consider doing the simpler thing and disabling the cache in the free-threaded build. Single-threaded performance can always be improved later, once you've established free-threaded support and hopefully improved test coverage for multithreaded workflows.

For NumPy, we are generally assuming users will not do pathological things like resizing an array while another thread is reading from or writing to it and do not explicitly account for this. Eventually we will need to add locking around data structures to avoid races caused by issues like this, but in this early stage of porting we are not planning to add locking on every operation exposed to users that mutates data. Locking will likely need to be added in the future, but that should be done carefully and with experience informed by real-world multithreaded scaling.

For your libraries, we suggest a similar approach for now. Focus on thread safety issues that only occur with the GIL disabled. Any non-critical preexisting thread safety issues can be dealt with later once the free-threaded build is used more. The goal for now should be to enable further refinement and experimentation by fixing issues that prevent using the library at all.

Locking and Synchronization Primitives

Native mutexes

If your extension is written in C++, Rust, or another modern language that exposes locking primitives in the standard library, you should consider using the locking primitives provided by your language or framework to add locks when needed.

If you need to call arbitrary Python code while the lock is held, care should be taken to avoid creating deadlocks with the GIL on the GIL-enabled build.

PyMutex

For C code or C-like C++ code, the CPython 3.13 C API exposes PyMutex, a high-performance locking primitive that supports static allocation. As of CPython 3.13, the mutex requires only one byte for storage, but future versions of CPython may change that, so you should not rely on the size of PyMutex in your code.

You can use PyMutex in both the free-threaded and GIL-enabled build of Python 3.13 or newer. PyMutex is hooked into the CPython runtime, so that if a thread tries to acquire the mutex and ends up blocked, garbage collection can still proceed and, in the GIL-enabled build, the blocked thread releases the GIL, allowing other threads to continue running. This implies that it is impossible to create a deadlock between a PyMutex and the GIL. For this reason, it is not necessary to add code for the GIL-enabled build to ensure the GIL is released before acquiring a PyMutex. If you do not call into the CPython C API while holding the lock, PyMutex has no special advantages over other mutexes, besides low-level details like performance or the size of the mutex object in memory.

See the section on dealing with thread-unsafe low-level libraries below for an example using PyMutex to lock around a thread-unsafe C library.

Critical Sections

Python 3.13 or newer also offers a critical section API that is useful for locking either a single object or a pair of objects during a low-level operation. The critical section API is intended to provide weaker, but still useful locking guarantees compared to directly locking access to an object using a mutex. This provides similar guarantees to the GIL and avoids the risk of deadlocks introduced by locking individual objects.

The main difference compared with using a per-object lock is that active critical sections are suspended if a thread calls PyEval_SaveThread (e.g. when the GIL is released on the GIL-enabled build), and then restored when the thread calls PyEval_RestoreThread (e.g. when the GIL is re-acquired on the GIL-enabled build). This means that while the critical sections are suspended, it's possible for any thread to re-acquire a thread state and mutate the locked object. This can also happen with the GIL, since the GIL is a re-entrant lock, and extensions are allowed to recursively release and acquire it in an interleaved manner.

Critical sections are most useful when implementing the low-level internals of a custom object that you fully control. You can apply critical sections around modification of internal state to effectively serialize access to that state.

See the section below on dealing with thread-unsafe objects for an example using the critical section API.

Dealing with global state

Many CPython C extensions make strong assumptions about the GIL. For example, before NumPy 2.1.0, the C code in NumPy made extensive use of C static global variables for storing settings, state, and caches. With the GIL, it is possible for Python threads to produce non-deterministic results from a calculation, but it is not possible for two C threads to simultaneously see the state of the C global variables, so no data races are possible.

In free-threaded Python, global state like this is no longer safe against data races and undefined behavior in C code. A cache of PyObject pointers stored in a C global array can be overwritten simultaneously by multiple Python threads, leading to memory corruption and segfaults.

Converting global state to thread local state

Often the easiest way to fix data races due to global state is to convert the global state to thread local state.

Python and Cython code can make use of threading.local to declare a thread-local Python object. C and C++ code can also use the Py_tss API to store thread-local Python object references. PEP 539 has more details about the Py_tss API.
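For Python-level code, converting a module-level global into thread-local state might look like the following. This is a minimal sketch; the names `_state` and `get_buffer` are hypothetical.

```python
import threading

# Each thread sees its own independent set of attributes on this object
_state = threading.local()


def get_buffer():
    """Return this thread's private scratch list, creating it on first use."""
    buf = getattr(_state, "buffer", None)
    if buf is None:
        # Attributes set here are visible only to the current thread
        buf = _state.buffer = []
    return buf
```

Because each thread lazily creates its own buffer, no locking is needed. The trade-off is that data stored this way is never shared between threads.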

Low-level C or C++ code can make use of the thread_local storage specified by recent standard versions. Note that standardization of thread-local storage in C has been slower than C++, so you may need to use platform-specific definitions to declare variables with thread-local storage. Also note that thread-local storage on MSVC has caveats, and you should not use thread-local storage for anything besides statically defined integers and pointers.

NumPy has a NPY_TLS macro in the numpy/npy_common.h header. While you can include the numpy header and use NPY_TLS directly on NumPy 2.1 or newer, you can also add the definition to your own codebase, along with some build configuration tests to test for the correct definition to use.

Caches

Global caches are also a common source of thread safety issues. For example, if a function requires an expensive intermediate result that only needs to be calculated once, many C extensions store the result in a global variable. This can lead to data races and memory corruption if more than one thread simultaneously tries to fill the cache.

If the cache is not critical for performance, consider simply disabling the cache in the free-threaded build:

static int *cache = NULL;

int my_function_with_a_cache(void) {
    int *my_cache = NULL;
#ifndef Py_GIL_DISABLED
    if (cache == NULL) {
        cache = get_expensive_result();
    }
    my_cache = cache;
#else
    my_cache = get_expensive_result();
#endif
    // use the cache
}

CPython holds a per-module lock during import. This lock can be released to avoid deadlocks in unusual cases, but in most situations module initialization happens exactly once per interpreter in one C thread. Modules using static single-phase initialization can therefore set up per-module state in the PyInit function without worrying about concurrent initialization of modules in different threads. For example, you might set up a global static cache that is read-only after module initialization like this:

static int *cache = NULL;

PyMODINIT_FUNC
PyInit__module(void)
{
    PyObject *mod = PyModule_Create(&module);
    if (mod == NULL) {
        return NULL;
    }

    // don't need to lock or do anything special
    cache = setup_cache();

    // do rest of initialization

    return mod;
}

You can then read from cache at runtime in a context where you know the module is initialized without worrying about whether or not the per-module static cache is initialized.

If the cache is critical for performance, cannot be generated at import time, but generally gets filled quickly after a program begins, then you will need to use a single-initialization API to ensure the cache is only ever initialized once. In C++, use std::call_once together with a std::once_flag.

C does not have an equivalent portable API for single initialization. If you need that, take a look at this NumPy PR for an example using atomic operations and a global mutex.

If the cache is in the form of a data container, then you can lock access to the container, like in the following example:

#ifdef Py_GIL_DISABLED
static PyMutex cache_lock = {0};
#define LOCK() PyMutex_Lock(&cache_lock)
#define UNLOCK() PyMutex_Unlock(&cache_lock)
#else
#define LOCK()
#define UNLOCK()
#endif

static int *cache = NULL;
static PyObject *global_table = NULL;

int initialize_table(void) {
    // called during module initialization
    global_table = PyDict_New();
    if (global_table == NULL) {
        return -1;
    }
    return 0;
}

int function_accessing_the_cache(void) {
    LOCK();
    // use the cache

    UNLOCK();
}

Note

Note that, while the NumPy PR linked above uses PyThread_type_lock, that is only because PyMutex was not part of the public Python C API at the time. We recommend always using PyMutex. For pointers on how to do that, check this NumPy PR that ports all PyThread_type_lock usages to PyMutex.

Fixing thread-unsafe tests

Many existing tests are written using global state. This is not a problem if the test only runs once, but if you would like to use your tests to check for possible thread safety issues by running existing tests on many threads, you will likely need to update the tests to eliminate use of global state.

Since tests using global state are inherently racy, test failures associated with these tests are also inherently flaky. If you see tests failing intermittently, do not rule out that a test is using global state, or that you are inadvertently using global state in pytest itself.

pytest is not thread-safe

See the pytest docs for more information about this. While tests can manage their own threads, you should not assume that functionality provided by pytest is thread-safe.

Functionality that is known not to be thread-safe includes:

Note that the pytest maintainers have explicitly ruled out making pytest thread-safe, so please do not open issues asking to fix thread safety issues in pytest itself.

The warnings module is not thread-safe

Many tests carefully ensure that warnings will be seen by the user in cases where the library author intends users to see them. These tests inevitably make use of the warnings module. As noted in the documentation for warnings.catch_warnings, the functionality provided by Python to track warnings is inherently thread-unsafe. This means tests that check for warnings should be marked as thread-unsafe and should be skipped when running tests on many threads simultaneously, since they will randomly pass or fail depending on thread timing.

Hopefully in the future it will be possible for Python to provide a scalable infrastructure for tracking warnings to fix this issue once and for all. See the CPython issue tracking this problem for more information.

File system thread safety

Many tests make use of the file system, either via a temporary file, or by simply directly writing to the folder running the test. If the filename used by the test is a constant or it is ever shared between instances of the test, the filesystem becomes shared global state, and the test will not be thread-safe.

The easiest way to fix this is to use tempfile, which automatically handles generating file handles in a thread-safe manner. If for some reason this isn't practical, consider forcing the filenames used in tests to be unique, for example by appending a UUID to the filename.
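Both options can be combined, as in the following minimal sketch. The helper name `write_and_read_back` and the file naming scheme are hypothetical; the point is that every call gets its own directory and its own filename, so concurrent calls never touch the same path.

```python
import os
import tempfile
import uuid


def write_and_read_back(payload: bytes) -> bytes:
    """Round-trip payload through a file that no other thread can collide with."""
    # TemporaryDirectory creates a fresh, uniquely named directory per call
    with tempfile.TemporaryDirectory() as tmpdir:
        # The UUID suffix keeps the name unique even if a directory is shared
        path = os.path.join(tmpdir, f"data-{uuid.uuid4().hex}.bin")
        with open(path, "wb") as f:
            f.write(payload)
        with open(path, "rb") as f:
            return f.read()
```

The temporary directory and its contents are removed when the `with` block exits, so tests written this way also clean up after themselves.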

Dealing with thread-unsafe libraries

Many C, C++, and Fortran libraries are not written in a thread-safe manner. It is still possible to call these libraries from free-threaded Python, but wrappers must add appropriate locks to prevent undefined behavior.

There are two kinds of thread-unsafe libraries: reentrant and non-reentrant. A reentrant library generally will expose state as a struct that must be passed to library functions. So long as the state struct is not shared between threads, functions in the library can be safely executed simultaneously.

Wrapping reentrant libraries requires adding locking whenever the state struct is accessed.

typedef struct lib_state_struct {
    low_level_library_state *state;
    PyMutex lock;
} lib_state_struct;

int call_library_function(lib_state_struct *lib_state) {
    PyMutex_Lock(&lib_state->lock);
    library_function(lib_state->state);
    PyMutex_Unlock(&lib_state->lock);
}

int call_another_library_function(lib_state_struct *lib_state) {
    PyMutex_Lock(&lib_state->lock);
    another_library_function(lib_state->state);
    PyMutex_Unlock(&lib_state->lock);
}

With this setup, if two threads call library_function and another_library_function simultaneously on the same lib_state, one thread will block until the other thread finishes, preventing concurrent access to lib_state->state.

Non-reentrant libraries provide an even weaker guarantee: threads cannot call library functions simultaneously without causing undefined behavior. Generally this is due to use of global static state in the library. This means that non-reentrant libraries require a global lock:

static PyMutex global_lock = {0};

int call_library_function(int *argument) {
    PyMutex_Lock(&global_lock);
    library_function(argument);
    PyMutex_Unlock(&global_lock);
}

Any other wrapped function needs similar locking around each call into the library.
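If the wrapper is written at the Python level (for example around bindings generated with ctypes or cffi), the same global-lock pattern can be sketched as a decorator. This is an illustrative Python analogue of the C example, not a replacement for it; the names `_library_lock` and `serialized` are hypothetical.

```python
import functools
import threading

# One lock shared by every wrapped entry point of the non-reentrant library
_library_lock = threading.Lock()


def serialized(func):
    """Allow only one thread at a time inside any function decorated with this."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _library_lock:
            return func(*args, **kwargs)

    return wrapper
```

Every function that calls into the library must be decorated for this to work. Because the lock is not re-entrant, a decorated function must not call another decorated function, or it will deadlock.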

Dealing with thread-unsafe objects

Similar to the section above, objects may need locking or atomics if they can be concurrently modified from multiple threads. CPython 3.13 exposes a public C API that allows users to use the built-in per-object locks.

For example the following code:

int do_modification(MyObject *obj) {
    return modification_on_obj(obj);
}

Should be transformed to:

int do_modification(MyObject *obj) {
    int res;
    Py_BEGIN_CRITICAL_SECTION(obj);
    res = modification_on_obj(obj);
    Py_END_CRITICAL_SECTION();
    return res;
}

A variant for locking two objects at once is also available. For more information about Py_BEGIN_CRITICAL_SECTION, please see the Python C API documentation on critical sections.

Cython thread safety

If your extension is written in Cython, you can generally assume that "Python-level" code that compiles to CPython C API operations on Python objects is thread-safe, but "C-level" code (e.g. code that will compile inside a with nogil block) may have thread safety issues. Note that not all code outside with nogil blocks is thread-safe. For example, a Python wrapper for a thread-unsafe C library is thread-unsafe if the GIL is disabled unless there is locking around uses of the thread-unsafe library. Another example: using thread-unsafe C-level constructs like a global variable is also thread-unsafe if the GIL is disabled.

CPython C API usage

In the free-threaded build it is possible for the reference count of an object to change "underneath" a running thread when it is mutated by another thread. This means that many APIs that assume reference counts cannot be updated by another thread while it is running are no longer thread-safe. In particular, C code returning "borrowed" references to Python objects in mutable containers like lists may introduce thread safety issues. A borrowed reference happens when a C API function does not increment the reference count of a Python object before returning the object to the caller. "New" references are safe to use until the owning thread releases the reference, as in non free-threaded code.

Most direct uses of the CPython C API are thread-safe. There is no need to add locking for scenarios that would be bugs in CPython. You can assume, for example, that the initializer for a Python object can only be called by one thread and the C-level implementation of a Python function can only be called on one thread. Accessing the arguments of a Python function is thread-safe no matter what C API constructs are used and no matter whether the reference is borrowed or owned, because two threads can't simultaneously call the same function with the same arguments from the same Python-level context. Of course it's possible to implement argument parsing in a thread-unsafe manner using thread-unsafe C or C++ constructs, but it's not possible to do so using the CPython C API.

Unsafe APIs returning borrowed references

The PyDict and PyList APIs contain many functions returning borrowed references to items in dicts and lists. Since these containers are mutable, it's possible for another thread to delete the item from the container, leading to the item being de-allocated while the borrowed reference is still "alive". Even code like this:

PyObject *item = Py_NewRef(PyList_GetItem(list_object, 0));

Is not thread-safe, because in principle it's possible for the list item to be de-allocated before Py_NewRef gets a chance to increment the reference count.

For that reason, you should inspect Python C API code to look for patterns where a borrowed reference to an item in a shared, mutable data structure is returned, and replace uses of APIs like PyList_GetItem with CPython C API functions that return strong references, like PyList_GetItemRef. Not all usages are problematic (see above), and we do not currently suggest converting all usages of possibly unsafe APIs returning borrowed references to APIs returning new references. This would introduce unnecessary reference count churn in situations that are thread-safe by construction and would also likely introduce new reference counting bugs in C or C++ code using the C API directly. However, many usages are unsafe, and maintaining a borrowed reference to an object that could be exposed to another thread is unsafe.

A good starting place to find instances of this would be to look for usages of the unsafe borrowed reference APIs mentioned in the free-threading compatibility docs.

Adopt pythoncapi-compat to use new C API functions

Rather than maintaining compatibility shims to use functions added to the C API for Python 3.13 like PyList_GetItemRef while maintaining compatibility with earlier Python versions, we suggest adopting the pythoncapi-compat project as a build-time dependency. This is a header-only library that can be vendored as e.g. a git submodule and included to expose shims for C API functions on older versions of Python that do not have implementations.

Some low-level APIs don't enforce locking

Some low-level functions like PyList_SET_ITEM and PyTuple_SET_ITEM do not do any internal locking and should only be used to build newly created values. Do not use them to modify existing containers in the free-threaded build.

Limited API support

The free-threaded build does not support the limited CPython C API. If you currently use the limited API to build wheels that do not depend on a specific Python version, you will not be able to use it while shipping binaries for the free-threaded build. In practice, the limited API is a subset of the full C API, so your extension will still build; you just cannot set Py_LIMITED_API at build time. This also means that code inside #ifdef Py_GIL_DISABLED checks can use C API constructs outside the limited API if you would like to do that, although these uses will need to be removed once the free-threaded build gains support for compiling with the limited API.