Multithreading Visualization: C vs Python

I wanted to see how threads actually behave on a real machine, not just in theory.
So I spun up an EC2 instance and ran both CPU-bound and I/O-mixed workloads in C and Python, with two variations:

  • Threads pinned to specific CPUs
  • Threads unpinned, letting the Linux scheduler decide

I used perf and its timechart subcommand to visualize thread activity. Here’s what I found.
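
For reference, the basic perf workflow looks roughly like this (a sketch; ./cpu_workload is a placeholder for whichever binary is being profiled):

    # Record scheduler and CPU events while the workload runs
    perf timechart record -- ./cpu_workload

    # Render the recorded perf.data into an SVG timeline
    perf timechart -i perf.data -o timechart.svg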

Note: The code used is available here and here.


1. The four experiments

C workload, CPU pinned

[Figure: perf timechart, C workload pinned]

  • Threads stayed fixed to their assigned cores.
  • Execution blocks were long and continuous, showing minimal migration.
  • Very predictable and efficient, but if some threads finished early, a few cores went underutilized.

Python workload, CPU pinned

[Figure: perf timechart, Python workload pinned]

  • Even with threads pinned, only one was truly active at a time.
  • Other threads waited because of the Global Interpreter Lock (GIL).
  • The chart looks sparse: lots of idle gaps.
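
For concreteness, here is a minimal sketch of how the pinned Python variant can be arranged on Linux (illustrative names, not the original benchmark code). os.sched_setaffinity with pid 0 pins the calling thread:

    import os
    import threading

    def worker(core):
        # Pin the calling thread to one core: on Linux, pid 0 means
        # "the current thread" for sched_setaffinity.
        os.sched_setaffinity(0, {core})
        # CPU-bound busy work; the GIL still serializes this across threads.
        total = 0
        for i in range(10_000_000):
            total += i

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()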

C workload, unpinned

[Figure: perf timechart, C workload unpinned]

  • Threads migrated between cores as the Linux scheduler balanced load.
  • More context switches, but overall CPU utilization was still high.
  • Linux did a good job keeping all cores busy.

Python workload, unpinned

[Figure: perf timechart, Python workload unpinned]

  • Same story as pinned: the GIL kept CPU-bound execution serial.
  • Threads migrated between cores, but only one was active at a time.
  • Unpinning gave no real improvement.

2. Why Python has a GIL

Python’s Global Interpreter Lock is a mutex that prevents multiple native threads from executing Python bytecode at the same time.
Why does it exist?

  • Memory safety: CPython’s memory management (reference counting) isn’t thread-safe. The GIL keeps it simple.
  • Simplicity: Removing the GIL would require pervasive locking or atomic operations throughout the interpreter, hurting single-thread performance.
  • Practicality: Many Python workloads are I/O-bound (network, disk). The GIL doesn’t block I/O concurrency, so for those use cases it’s “good enough.”

The tradeoff: Python threads are fine for I/O concurrency, but they don’t give you CPU parallelism.
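
A toy demonstration of that tradeoff (not the benchmark above): run the same CPU-bound countdown once sequentially, then split across two threads, and compare wall-clock times. On CPython the threaded version is no faster, and often slightly slower due to lock contention:

    import threading
    import time

    def countdown(n):
        # Pure-Python CPU work; the thread holds the GIL while running.
        while n > 0:
            n -= 1

    N = 20_000_000

    start = time.perf_counter()
    countdown(N)
    countdown(N)
    print(f"sequential:  {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    t1 = threading.Thread(target=countdown, args=(N,))
    t2 = threading.Thread(target=countdown, args=(N,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(f"two threads: {time.perf_counter() - start:.2f}s")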


3. Cache locality and pinning

In C:

  • Pinning a thread to a CPU core improves cache locality.
  • The data a thread frequently uses stays in that core’s L1/L2 caches.
  • This reduces cache misses, making execution faster and more deterministic.
  • Downside: if a pinned thread finishes early, that core may sit idle while others are still overloaded.

In Python:

  • Cache locality still matters for per-thread latency and determinism, even though the GIL lets only one thread run at a time; it matters most when threads drop into native C code or blocking I/O that releases the GIL.
  • For pure-Python CPU-bound code, though, threads may migrate freely and pinning buys little: there is no parallelism for affinity to protect.

4. Key takeaways

  • C + pinned: Best for determinism and cache affinity.
  • C + unpinned: Slightly noisier, but Linux balances well — all cores stay busy.
  • Python (pinned or not): Same bottleneck — the GIL. No true parallelism for CPU-bound tasks.
  • I/O workloads: Python threads are still useful for overlapping I/O waits, since the GIL is released during blocking I/O (see the sketch after the figure below).

[Figure: threading comparison summary]
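
A toy illustration of that I/O point, with time.sleep standing in for blocking I/O (a real version would make network or disk calls): four 1-second waits complete in roughly 1 second of wall time, because sleeping threads release the GIL:

    import threading
    import time

    def fake_io():
        # time.sleep releases the GIL, so waiting threads overlap.
        time.sleep(1.0)

    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"elapsed: {time.perf_counter() - start:.2f}s")  # ~1s, not ~4s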


5. Lessons

  • If you need CPU-bound parallelism in Python → use multiprocessing (see the sketch after this list), or drop into C/C++/Rust extensions (NumPy, Cython, etc.).
  • For I/O concurrency, Python threads are fine.
  • If you want to see the difference, perf timecharts make the contrast obvious:
    • C = full utilization, parallel blocks across cores
    • Python = serialized execution, idle gaps
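
A minimal sketch of the multiprocessing route from the first bullet (illustrative function names): each worker is a separate process with its own interpreter and its own GIL, so CPU-bound work genuinely runs in parallel:

    from multiprocessing import Pool

    def cpu_work(n):
        # Runs in a separate process with its own GIL, so four of
        # these can execute in parallel across cores.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            results = pool.map(cpu_work, [5_000_000] * 4)
        print(results)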

6. When this actually matters

Understanding threading behavior isn’t just academic—it directly impacts real applications:

Image/video processing: Libraries like OpenCV use C extensions under the hood. When you call cv2.resize() or apply filters, you’re getting true parallelism despite running from Python. Pure Python image manipulation would crawl.

Scientific computing: NumPy and SciPy operations release the GIL and use optimized C/Fortran code. Matrix multiplication with numpy.dot() can use all your cores, but a pure Python nested loop cannot.
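
A rough way to see this for yourself (timings vary by machine, and whether the matmul uses multiple cores depends on the BLAS build NumPy links against):

    import time
    import numpy as np

    n = 100
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    # Optimized BLAS path: releases the GIL and may use multiple cores.
    start = time.perf_counter()
    c = a @ b
    print(f"numpy matmul: {time.perf_counter() - start:.4f}s")

    # Equivalent pure-Python triple loop: orders of magnitude slower,
    # and threads cannot speed it up because of the GIL.
    start = time.perf_counter()
    slow = [[sum(a[i, k] * b[k, j] for k in range(n)) for j in range(n)]
            for i in range(n)]
    print(f"pure Python:  {time.perf_counter() - start:.4f}s")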

Web servers: Frameworks like FastAPI (built on asyncio) work well despite the GIL because web apps are I/O-bound. The GIL is released during database queries, file reads, and network requests, so threads spend most of their time waiting, not computing.
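
The same waiting-not-computing pattern, sketched with asyncio (asyncio.sleep stands in for an awaited database or network call): five overlapping "requests" finish in about one second total:

    import asyncio

    async def handle_request(i):
        # Stand-in for an awaited DB query or network call; while this
        # task waits, the event loop runs the other tasks.
        await asyncio.sleep(1.0)
        return f"response {i}"

    async def main():
        results = await asyncio.gather(*(handle_request(i) for i in range(5)))
        print(results)

    asyncio.run(main())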

Machine learning: Training loops in pure Python would be painfully slow. Libraries like PyTorch and TensorFlow drop into C++/CUDA for the heavy lifting, only using Python for orchestration.

The pattern is clear: Python excels at coordination and glue code, while performance-critical work gets delegated to compiled extensions.


7. A kitchen metaphor

Think of CPU cores as kitchen stations, and threads as chefs:

C threads: Four chefs, each with their own fully-equipped station. Everyone cooks simultaneously—onions sizzling, pasta boiling, sauces reducing. Pure parallel efficiency.

Python threads: Four chefs, but only one knife in the entire kitchen. No matter how many stations you have, only one chef can work at a time. The others stand around waiting for their turn with the knife.

The GIL is that single shared knife: a bottleneck that forces serialization no matter how much kitchen space (CPU cores) you have available.

In short:
C threads light up your CPUs like a busy restaurant kitchen.
Python threads, for CPU-bound work, just pass the knife around.

