This is no longer a problem for me, after a week of pulling my hair out, but the documentation needs to be fixed.
The documentation in question is the CrossCore Embedded Studio Loader and Utilities Manual, Rev 1.2, April 2013.
Table 7.1 shows that the initial word in SPI Master mode is sent after the 8-bit command and 24-bit address. That doesn't work for me. What works is sending the initial word while the 8-bit command and 24-bit address are being sent. It also doesn't matter what you send during those first 32 bits, because it is discarded anyway. Sending the initial word after the 32 bits causes the kernel data to be misaligned.
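To make the alignment issue concrete, here is a rough sketch in C of what I'm describing. It is only a model of the byte stream, not driver code and not anything from the manual. The function names, the zero filler byte, and the idea that the receiving side simply drops the first 4 bytes are my own assumptions based on what I observed. The initial word length is a parameter because I don't want to assume a width here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 8-bit command + 24-bit address = 32 bits = 4 bytes clocked out
 * before any returned data is kept (assumption from my observations). */
#define HEADER_BYTES 4u

/* Hypothetical helper: build the bytes shifted back during the transfer.
 * The first HEADER_BYTES overlap the command/address phase and are
 * don't-care, so they are filled with zeros here.
 * init_word_after_header != 0 models the layout the table suggests
 * (initial word sent after the 32 bits); 0 models what worked for me.
 * Returns the number of bytes written to out. */
size_t build_reply_stream(uint8_t *out, size_t out_cap,
                          const uint8_t *init_word, size_t init_len,
                          const uint8_t *kernel, size_t kernel_len,
                          int init_word_after_header)
{
    size_t n = 0;

    assert(out_cap >= HEADER_BYTES + kernel_len +
                      (init_word_after_header ? init_len : 0));

    memset(out, 0x00, HEADER_BYTES);   /* discarded, any value works */
    n += HEADER_BYTES;

    if (init_word_after_header) {
        memcpy(out + n, init_word, init_len);
        n += init_len;
    }

    memcpy(out + n, kernel, kernel_len);
    n += kernel_len;
    return n;
}

/* Hypothetical model of the receiving side: everything clocked in
 * during the 32 command/address bits is thrown away, and whatever
 * follows is treated as boot data. */
const uint8_t *data_seen_by_loader(const uint8_t *stream, size_t stream_len,
                                   size_t *seen_len)
{
    assert(stream_len >= HEADER_BYTES);
    *seen_len = stream_len - HEADER_BYTES;
    return stream + HEADER_BYTES;
}
```

With `init_word_after_header` set to 0, the data seen after the discarded 4 bytes starts at the first kernel byte. With it set to 1, the data seen starts with the initial word and the kernel is pushed back by `init_len` bytes, which is the misalignment I ran into.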
I've attached a pic of the offending table.