Nov 17, 2019

Scaling Network Concurrency in Go

Hey there 👋 I'm building CodeTrail, which helps engineering teams document and share knowledge close to the codebase with no friction. If you're still using Notion, Confluence or Google Docs to document your engineering work, give it a try and let me know what you think!

I recently built a job that retrieves a list of files, processes and batches them, and then re-uploads those files to a different location. Soon after deploying the initial version and running the job on larger targets, it became evident that something was off, to the point where the service was essentially unusable.

The job ran into all sorts of strange timeout errors, and during debugging sessions on my local machine I saw host-not-found errors that I couldn't relate to the running code. It took a lot of time to trace these issues back to their root cause: networking in Go.

As usual, this wasn't a Go issue per se, but rather a consequence of how I had handled the HTTP client implementation in the first iteration.

Since the files had to be uploaded to common object storage solutions like S3 and Google Cloud Storage, I used the standard libraries available for the task, for example the AWS SDK for Go for uploading objects to S3. This will be important later on.

Re-using one, and only one, client

One of the conventions when using the net/http package is, as I learned, to always re-use the same client throughout your application. This is vital for all the performance tweaks we'll implement next. For my use case, I created a simple helper function to construct a client that gets passed to every place where requests are sent, including the configuration of third-party libraries like the AWS SDK.

// This doesn't do much yet, but you'll see
func CreateHttpClient() *http.Client {
	return &http.Client{}
}
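
To give you an idea of how the client gets passed around, here's a minimal sketch with a made-up upload helper that receives the shared client instead of constructing its own:

// A hypothetical helper that receives the shared client instead of creating its own
func uploadFile(httpClient *http.Client, url string, body io.Reader) error {
	req, err := http.NewRequest(http.MethodPut, url, body)
	if err != nil {
		return err
	}

	resp, err := httpClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	return nil
}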

To wire your newly-created client into, let's say, a session used for sending S3 requests, you can do the following:

config := aws.NewConfig()

// Add our http client to the mix
config.WithHTTPClient(httpClient)
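
From there, the config can be turned into a session and an S3 client, so every request the SDK sends runs through our shared client. A minimal sketch, assuming the v1 aws-sdk-go session and s3 packages:

// Build a session from the config carrying our shared HTTP client
sess, err := session.NewSession(config)
if err != nil {
	// handle session setup error
}

// Every request made through this S3 client now goes through our shared client
s3Client := s3.New(sess)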

Now that you're sure that all of your connections are made through this one client, we can move on to tweaking timeouts and adding safety guarantees around how many networking resources the service may consume.

Configuring in-depth HTTP lifecycle timeouts

Once you wrap your head around it, there are a lot of ways to customize how client requests are handled in Go: from connection and general-purpose timeouts over keep-alive settings to idle-connection handling and limiting your client to a maximum number of open connections per host, you can build services with absolute certainty about how traffic will be handled. Let's go over your options and a preset I've used successfully so far:

func CreateHttpClient() *http.Client {
	// Configure the dialer used for establishing connections
	dialer := &net.Dialer{
		// Fail connection attempts that take longer than 30s
		Timeout: 30 * time.Second,

		// Send TCP keep-alive probes every 20s on idle connections
		KeepAlive: 20 * time.Second,
	}

	transport := &http.Transport{
		// Use the dialer configured above for new connections
		DialContext: dialer.DialContext,

		// Allow a maximum of 100 idle keep-alive connections across all hosts
		MaxIdleConns: 100,

		// Allow at most 25 connections per host, all of which may be kept idle
		MaxIdleConnsPerHost: 25,
		MaxConnsPerHost:     25,

		// Configure other miscellaneous timeouts
		TLSHandshakeTimeout:   3 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
		IdleConnTimeout:       90 * time.Second,
	}

	return &http.Client{
		Transport: transport,

		// Cancel operations that take longer than 30 minutes in total
		Timeout: time.Minute * 30,
	}
}

With the configuration above, I can be sure that my client only uses a limited number of open connections, so the service doesn't run into connection limits imposed by the host.

Since we are handling requests that might take a long time depending on the payload size, we set a generous timeout of thirty minutes so even the most long-running operations should come to an end successfully. It's best to test this with real-world data to get the best setting for your specific case, though, so be sure to play around!
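
If individual call sites need a stricter deadline than the client-wide thirty minutes, you can scope a single request with a context instead. Here's a minimal sketch, assuming Go 1.13's http.NewRequestWithContext and a made-up two-minute limit:

// A hypothetical fetch helper with its own, tighter deadline
func fetchWithDeadline(httpClient *http.Client, url string) error {
	// Cancel this particular request after 2 minutes, independent of the client-wide timeout
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}

	resp, err := httpClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// consume resp.Body as needed
	return nil
}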

Fixing bugs in our SDKs

Once I had the configuration above figured out and had refactored all systems to use the single HTTP client constructed on startup, I started the service and observed my requests being handled as expected. That was, until I hit 25 requests to S3. You might think 25 is an oddly specific number, right? If you take another look at the configuration above, you can see that I set the maximum number of connections per host to 25. So then, what made our client block?

The answer is simple: A subset of operations in the AWS SDK for Go return a readable response body, which you have to close yourself. Even if you're not using it! I found a specific operation that I believed was causing the blocking behavior, and after refactoring and running the job again, it turned out I was right about it:

// Retrieve an object from S3
response, err := client.GetObject(...)
if err != nil {
	// handle the request error
}

// Close the response body even if we never read it, otherwise
// the connection is never released back into the pool
if response != nil && response.Body != nil {
	if err := response.Body.Close(); err != nil {
		// handle the close error
	}
}

Just by checking whether the S3 operation returned a body and closing it to release the connection, the blocking disappeared.
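
The same lesson applies to plain requests sent through the shared client: draining and closing the response body is what frees the connection for reuse. Here's a minimal sketch of that pattern, with a made-up ping helper using the io and io/ioutil packages:

// A hypothetical health check using the shared client
func ping(httpClient *http.Client, url string) error {
	resp, err := httpClient.Get(url)
	if err != nil {
		return err
	}

	// Drain and close the body so the keep-alive connection can return to the pool
	defer resp.Body.Close()
	if _, err := io.Copy(ioutil.Discard, resp.Body); err != nil {
		return err
	}

	return nil
}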


I hope this short guide helped to demonstrate how straightforward it is to keep connections and the requests made through them under control. If you have any questions, suggestions, or feedback in general, don't hesitate to reach out on Twitter or by mail 🚀

Thanks for reading this post 🙌 I'm building CodeTrail, which helps engineering teams document and share knowledge close to the codebase with no friction. If you're still using Notion, Confluence or Google Docs to document your engineering work, give it a try and let me know what you think!

Bruno Scheufler
