While I was implementing a primary/backup system for a project in my distributed systems course, I kept running into these weird errors.
1
2
3
4
5
read unix @->/var/tmp/824-1000/pb-3123-xyz-1: read: connection reset by peer
unexpected EOF
write unix @->/var/tmp/824-1000/pb-3123-xyz-1: write: broken pipe
unexpected EOF
My test code was passing, so my logic supposedly wasn’t wrong but in a way..unstable? Where was this coming from?
I googled the error logs and found out that this was happening due to race conditions in RPC calls.
When I traced back the call stack(which was very difficult as many go routines were running in parallel), this was where the error was rooting from:
1
ok := call(pb.latestView.Backup, "PBServer.Forward", &ForwardArgs, &ForwardReply)
While executing this call, I noticed that if I set a breakpoint at the puts, the test would pass. However, if I let it run without interruption, it would fail. This odd behavior prompted me to dive deeper into the underlying mechanisms of RPC.
Understanding the Race Condition
Upon investigation, I discovered that the RPC call could lead to race conditions. Specifically, when I made the call to the backup server, I was waiting for a response. However, there was a risk that the backup server could die during this waiting period. If that happened, it would leave the RPC connection in a half-open state, which could cause gob decoder errors. Essentially, the connection would break mid-call because the server went down.
The Solution: Quick Fail on Connection Issues
To address the race condition, I restructured the way the RPC call handled responses. Instead of waiting for a response, I modified the call to fail quickly if the backup server was unavailable. This way, if the backup server died, the connection would fail immediately, preventing any hanging connections.
Here’s how I changed the code:
1
call(pb.latestView.Backup, "PBServer.Forward", &ForwardArgs, &ForwardReply)
By making this adjustment, I was able to solve the two tests related to concurrent operations. It was a simple yet effective solution that significantly improved the reliability of the RPC calls.
I was able to optimize my code a little more with the help of my genius partner, but yes, sometimes in distributed systems, the solution is just move forward if a failure happens. Not in all situations but for the sake of availability, yes.
Happy coding!