
Debugging Race Conditions in RPC Calls

While I was implementing a primary/backup system for a project in my distributed systems course, I kept running into these weird errors.

read unix @->/var/tmp/824-1000/pb-3123-xyz-1: read: connection reset by peer
unexpected EOF

write unix @->/var/tmp/824-1000/pb-3123-xyz-1: write: broken pipe
unexpected EOF

My tests were passing, so my logic supposedly wasn't wrong, but the system was, in a way... unstable? Where was this coming from?

I googled the error logs and found out that this was happening due to race conditions in RPC calls.

When I traced back the call stack (which was very difficult, as many goroutines were running in parallel), this was where the error originated:

ok := call(pb.latestView.Backup, "PBServer.Forward", &ForwardArgs, &ForwardReply)

While executing this call, I noticed that if I set a breakpoint at the puts, the test would pass. However, if I let it run without interruption, it would fail. A result that changes with timing is a classic sign of a race, so this odd behavior prompted me to dive deeper into the underlying mechanisms of RPC.

Understanding the Race Condition

Upon investigation, I discovered that the RPC call could lead to race conditions. Specifically, when I made the call to the backup server, I was waiting for a response. However, there was a risk that the backup server could die during this waiting period. If that happened, it would leave the RPC connection in a half-open state, which could cause gob decoder errors. Essentially, the connection would break mid-call because the server went down.
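To make that failure mode concrete, here is a minimal sketch in Go using the standard net/rpc package. It is not the lab code: it assumes an in-memory net.Pipe in place of the unix socket, and the helper name callAfterPeerDies is hypothetical. The "backup" side reads the request and then closes the connection without replying, so the client's Call comes back with an error instead of a reply.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// callAfterPeerDies (hypothetical helper) issues one RPC over an in-memory
// connection whose "backup" end reads the request and then closes without
// replying. It returns whatever error the client's Call reports.
func callAfterPeerDies() error {
	// Assumption: net.Pipe stands in for the unix socket used in the post.
	clientConn, backupConn := net.Pipe()

	go func() {
		buf := make([]byte, 4096)
		backupConn.Read(buf) // receive the request...
		backupConn.Close()   // ...then "die" before sending a reply
	}()

	client := rpc.NewClient(clientConn)
	defer client.Close()

	var reply int
	// "PBServer.Forward" is only a label here; no handler is registered.
	return client.Call("PBServer.Forward", 1, &reply)
}

func main() {
	fmt.Println("call error:", callAfterPeerDies())
}
```

The point is only that a peer disappearing mid-call surfaces on the caller's side as an error from Call, which the caller then has to decide how to handle.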

The Solution: Quick Fail on Connection Issues

To address the race condition, I restructured the way the RPC call handled responses. Instead of waiting for a response, I modified the call to fail quickly if the backup server was unavailable. This way, if the backup server died, the connection would fail immediately, preventing any hanging connections.

Here’s how I changed the code:

call(pb.latestView.Backup, "PBServer.Forward", &ForwardArgs, &ForwardReply)

By making this adjustment, I was able to pass the two tests related to concurrent operations. It was a simple yet effective solution that significantly improved the reliability of the RPC calls.

I was able to optimize my code a little more with the help of my genius partner, but yes, sometimes in distributed systems, the solution is just to move forward when a failure happens. Not in all situations, but for the sake of availability, yes.

Happy coding!

This post is licensed under CC BY 4.0 by the author.
