Some 502 errors in GCP HTTP Load Balancing


I had an unexplained issue with 502s after recreating a load balancer and backend config. Recreating my backend and the instance group for my unmanaged instances seemed to fix it. I wasn't able to identify any problem in my GCP configuration :(

But I had a lot more errors, about 1 in 10 requests. The load balancer logs will tell you what the cause is, and the docs explain each cause.

E.g. mine were: jsonPayload: { statusDetails: "failed_to_pick_backend", @type: "ancerLogEntry" }
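To see which causes dominate, the statusDetails values can be tallied across many log entries. This is only a minimal sketch: it assumes the load balancer logs have been exported as one JSON object per line, with the HTTP status under httpRequest.status and the cause under jsonPayload.statusDetails. The function name and the sample entries are made up for illustration.

```python
import json
from collections import Counter


def count_status_details(lines):
    """Tally jsonPayload.statusDetails for log entries with HTTP status 502."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        entry = json.loads(line)
        if entry.get("httpRequest", {}).get("status") != 502:
            continue  # only 502 responses are of interest here
        details = entry.get("jsonPayload", {}).get("statusDetails", "unknown")
        counts[details] += 1
    return counts


# Hypothetical sample entries; a real export carries many more fields.
sample = [
    '{"httpRequest": {"status": 502}, "jsonPayload": {"statusDetails": "failed_to_pick_backend"}}',
    '{"httpRequest": {"status": 200}, "jsonPayload": {"statusDetails": "response_sent_by_backend"}}',
    '{"httpRequest": {"status": 502}, "jsonPayload": {"statusDetails": "backend_connection_closed_before_data_sent_to_client"}}',
    '{"httpRequest": {"status": 502}, "jsonPayload": {"statusDetails": "failed_to_pick_backend"}}',
]

for details, n in count_status_details(sample).most_common():
    print(details, n)
```

With a real export you would pass the open file object instead of the sample list.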

If you're using nginx, the 502s happen on POST requests, and the error is reported as "backend_connection_closed_before_data_sent_to_client", it may be fixed by changing your nginx timeouts. See this excellent blog post:

Our load balancer is returning 502 errors for some requests. It is only a very low percentage of the total: we have around 36000 requests per hour and about 40 errors per hour, so roughly 0.1% of the requests return an error.
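As a quick check of the arithmetic, the error rate follows directly from the two hourly figures given in the question (the variable names are just for illustration):

```python
requests_per_hour = 36000
errors_per_hour = 40

# Share of requests that return a 502, as a percentage.
error_rate_percent = errors_per_hour / requests_per_hour * 100
print(f"{error_rate_percent:.2f}%")  # prints 0.11%
```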

The instances are healthy when the error occurs and we have added this forwarding rule to the firewall for the load balancer: tcp:1-5000 Apply to all targets

It is not a very serious problem because the application tolerates such errors, but I would like to know why they are given.

Any help will be appreciated.

It seems that there is no easy solution for this.

As Mike Fotinakis explains in this blog (thank you for this info JasonG :)):

It turns out that there is a race condition between the Google Cloud HTTP(S) Load Balancer and NGINX’s default keep-alive timeout of 65 seconds. The NGINX timeout might be reached at the same time the load balancer tries to re-use the connection for another HTTP request, which breaks the connection and results in a 502 Bad Gateway response from the load balancer.

In my case I'm using Apache with the mpm_prefork module. The proposed solution is to increase the connection keepalive timeout to 650s, but this is not practical here because each connection ties up its own process (so this would be a great waste of resources).

It seems that there is some new documentation about this problem on the official load balancer documentation page (search for "Timeouts and retries"):

They recommend setting the keepalive timeout to 620 seconds in both cases (KeepAliveTimeout in Apache, keepalive_timeout in Nginx).
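The reasoning behind that number can be sketched as a simple comparison: the backend must keep idle connections open longer than the load balancer does, so that the backend is never the side that closes a connection the load balancer is about to reuse. The 600-second load balancer keepalive below is an assumption taken from Google's documentation rather than from this thread, and the function name is hypothetical.

```python
# Assumption: the load balancer keeps idle backend connections for 600 s.
LB_KEEPALIVE_SECONDS = 600


def keepalive_is_safe(backend_timeout_seconds, lb_timeout_seconds=LB_KEEPALIVE_SECONDS):
    """Return True if the backend outlives the load balancer's idle timeout."""
    return backend_timeout_seconds > lb_timeout_seconds


# 65 s is the NGINX value quoted from the blog post; 620 s is the recommended one.
for label, timeout in [("quoted nginx timeout", 65), ("recommended timeout", 620)]:
    print(label, timeout, keepalive_is_safe(timeout))
```

A 65-second backend timeout fails the check, which matches the race condition described in the blog post, while 620 seconds passes.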