The LTC4332 works flawlessly on the majority of our systems, but on a couple of them SPI communication suddenly stops after a couple of hours of normal operation. I investigated and found that the FAULT bit is set in the EVENT register (the EVENT register reads 0x05). On further investigation, I saw that the FAULT register reads 0x11, which means the RX_BUF_UNDERFLOW and SPI_WRITE_FAULT bits are set, pointing to an SPI communication error. Power cycling the system fixes the issue immediately, and the system then runs normally for the next few hours until the fault occurs again. We do not yet know what causes the communication error, and the investigation is ongoing. The wires and connectors between the local and remote boards look good, and replacing them does not make the issue go away. So far the issue has always been seen on the remote side, and the problem travels with the faulty remote boards. However, we cannot rule out that the same thing could happen on the local board.
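As a quick sanity check on the register values above, a small helper can list which bit positions are set in a register byte. This is only a sketch: `set_bits` is a hypothetical helper name, and the mapping from bit position to bit name is deliberately not encoded here because it has to come from the LTC4332 datasheet.

```c
#include <stdint.h>

/* Write the positions of the set bits in `value` into out[] (lowest first)
 * and return how many bits were set. Hypothetical helper, not driver code. */
int set_bits(uint8_t value, int out[8])
{
    int count = 0;
    for (int bit = 0; bit < 8; ++bit) {
        if (value & (1u << bit)) {
            out[count++] = bit;
        }
    }
    return count;
}
```

For the values reported here, 0x05 has bits 0 and 2 set and 0x11 has bits 0 and 4 set.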
Question:
While we work out the root cause, we would still like to be able to restore communication with a firmware or hardware workaround. So, can I restore communication without power cycling the system or the LTC4332? I clear the EVENT register by writing 0 to it, but this does not restore communication. I also pull the ON pin low for more than 180 ms (about 200 ms) to trigger a remote reset, but communication is still not restored and the EVENT register continues to show a communication fault. I do reinitialize the LTC4332 registers after the remote reset, but the fault comes back as soon as any attempt is made to communicate over the LTC4332. What else can I try? The circuit on the local side is shown below:

The remote connections are shown below:

Any pointers/ideas on this would be very helpful.
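For reference, the recovery sequence described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual firmware: the `ltc4332_hal_t` hooks, `ltc4332_try_recover`, and the `sim_*` functions are hypothetical names, and `fault_mask` must be taken from the EVENT register description in the datasheet. The simulated hooks only exist so the sequence can be exercised on a host PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical board-support hooks; none of these come from a vendor driver. */
typedef struct {
    void    (*set_on_pin)(bool high);   /* drive the LTC4332 ON pin */
    void    (*delay_ms)(uint32_t ms);   /* blocking delay */
    void    (*reinit_registers)(void);  /* rewrite the LTC4332 configuration */
    void    (*clear_event)(void);       /* write 0 to the EVENT register */
    uint8_t (*read_event)(void);        /* read back the EVENT register */
} ltc4332_hal_t;

#define ON_LOW_HOLD_MS 200u  /* longer than 180 ms, as described in the post */

/* One recovery attempt: ON-pin reset, reinitialize, clear EVENT, verify.
 * Returns true if no bit in fault_mask is set in EVENT afterwards. */
bool ltc4332_try_recover(const ltc4332_hal_t *hal, uint8_t fault_mask)
{
    assert(hal != NULL);

    hal->set_on_pin(false);          /* hold ON low to trigger a remote reset */
    hal->delay_ms(ON_LOW_HOLD_MS);
    hal->set_on_pin(true);

    hal->reinit_registers();         /* restore configuration after the reset */
    hal->clear_event();              /* write 0 to EVENT */

    return (hal->read_event() & fault_mask) == 0;
}

/* Host-side simulation of the hooks, only to exercise the sequence. */
static uint8_t  sim_event;           /* simulated EVENT register */
static bool     sim_fault_sticky;    /* true models a fault that will not clear */
static bool     sim_on_high = true;  /* simulated ON pin level */
static uint32_t sim_on_low_ms;       /* total time ON was held low */

static void    sim_set_on_pin(bool high) { sim_on_high = high; }
static void    sim_delay_ms(uint32_t ms) { if (!sim_on_high) sim_on_low_ms += ms; }
static void    sim_reinit(void)          { /* nothing to configure in simulation */ }
static void    sim_clear_event(void)     { if (!sim_fault_sticky) sim_event = 0; }
static uint8_t sim_read_event(void)      { return sim_event; }

static const ltc4332_hal_t sim_hal = {
    sim_set_on_pin, sim_delay_ms, sim_reinit, sim_clear_event, sim_read_event
};
```

Calling `ltc4332_try_recover(&sim_hal, mask)` with `sim_fault_sticky` set to true returns false, which matches what I see on the faulty boards: the sequence runs, but EVENT still reports the fault afterwards.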
Thanks