Feb 04, 2024

Reducing Python Memory Usage by 97%

Hey there 👋 I'm building CodeTrail, which helps engineering teams document and share knowledge close to the codebase with no friction. If you're still using Notion, Confluence or Google Docs to document your engineering work, give it a try and let me know what you think!

I’ve recently had the honor to debug a Python script that was consuming more memory than it takes to run a full-sized LLM locally. The purpose of the script wasn’t anything out of the ordinary, it was reading a CSV file and performing some analytics on it. The CSV contained 10m lines of request logs, including a duration in seconds.

With some debugging, I quickly tracked down the issue to the loading and preprocessing step. Before running any data analysis, the duration should be converted to milliseconds, as that’s easier to work with and plot later on.

machine,duration
1,0.045
2,0.035
3,0.358

As you can see below, the original author used the built-in CSV reader to stream the data into Python. As a hint, the first mistake I made was not looking into how the data is parsed by Python, because the code does look pretty straightforward.

from csv import reader

rows = []
with open('./test.csv', 'r') as csv_file:
    csv_reader = reader(csv_file)
    next(csv_reader)  # Skip the header row
    for row in csv_reader:
        rows.append(row)

# convert duration to seconds
rows = [(row[0], row[1] * 1000) for row in rows]

Running the conversion step blew up the memory usage from around 5 GB to over 130 GB. Unfortunately, I had to terminate the script early because my Mac was swapping extensively and everything started to break down past 100 GB.

Obviously, something was wrong. I reduced the data set to 10 rows to debug every step without waiting forever. Inspecting the layout of the rows quickly yielded an interesting piece of information: Instead of floating-point data types, the duration (and all other rows for that matter) were read as strings. So rather than efficiently storing the values as 32-bit or 64-bit floats, Python’s CSV reader didn’t know how to parse the duration without any additional details and opted for strings which offer more possibilities but are the worst possible option for numerical values.

Multiplying a string with a number in Python does not (in classic JavaScript fashion) convert the value to a number and run with it like below. See for yourself below:

$ python3
>>> x = "0.045"
>>> x * 10
'0.0450.0450.0450.0450.0450.0450.0450.0450.0450.045'
>>> x = 0.045
>>> x * 10
0.44999999999999996
>>>

$ node
Welcome to Node.js v20.9.0.
> "0.5" * 10
5

Ah well, this is wrong on so many levels. Let’s start and fix it.

If we wanted to keep this script working as is and don’t introduce any additional libraries, we should convert the data types as early as possible to prevent running into these issues.

rows = []
with open('./test.csv', 'r') as csv_file:
    csv_reader = reader(csv_file)
    next(csv_reader)  # Skip the header row
    for row in csv_reader:
        row[1] = float(row[1]) # convert strings to floats
        rows.append(row)

Adding one line to the input reader will convert the duration to proper floats. This simple change brought memory usage down back to around 5 GB and allowed the script to run to completion.

Alternatively, we could have used a library like pandas to create a data frame from the CSV immediately:

import pandas as pd
df = pd.read_csv('./test.csv')
df['duration'] = df['duration'] * 1000

Not only is this much simpler, it’s way harder to screw up.


You might think “This will never happen to me” and “Oh boy, those were some seriously stupid mistakes”, but I wonder how many codebases end up in a situation like this. The root cause was manifold:

First, a lack of context blocked me from spotting the mistakes: I usually write JavaScript so the Python way of converting between data types wasn’t on my mind. I also used Pandas and other libraries in the past and expected the built-in CSV handling to parse input values similarly.

Second, the lack of a strict type system and validation layer made it harder to spot unexpected differences in data types. It might be helpful to require users to define their column types ahead of time and raise an exception if unexpected input values are supplied.

Finally, unit tests on smaller but representative data sets would have surfaced the invalid implementation at the time of writing the code.

Thanks for reading this post 🙌 I'm building CodeTrail, which helps engineering teams document and share knowledge close to the codebase with no friction. If you're still using Notion, Confluence or Google Docs to document your engineering work, give it a try and let me know what you think!

Bruno Scheufler

At the intersection of software engineering
and management.

On other platforms