Optimized SPI transfer#2747
Conversation
Optimized SPI transfer functions for less overhead to increase continuous transfer rate at 80 MHz SPI clock from 18.5 MBit/s to nearly to 40 Mbit/s.
Current coverage is 27.80% (diff: 100%)@@ master #2747 diff @@
==========================================
Files 20 20
Lines 3625 3625
Methods 335 335
Messages 0 0
Branches 656 656
==========================================
Hits 1008 1008
Misses 2441 2441
Partials 176 176
|
|
@KelrobGithub do you compose any example of your work? |
|
@mkeyno As you can see I've just copied the transfer and setdataBits functions from SPI library to my sketch and changed them to faster behavior. In main loop I've done two 64-byte-accesses using the default SPI library and 2 accesses of same size and content using the local functions. The result will look like the screenshot above. It's worth to mention that sending data at that speed is no problem as SPI CLK and SPI MOSI have a fixed timing relationship. The other direction is far more difficult as when using 80 MHz the SPI counterpart (a FPGA in my project) has to sample in the SPI CLK signal, generate the next bit internally and shift it out on MISO in less than 12.5ns. Currently I'm using a test pattern and a FPGA-internal adjustable delayline with a resolution of ~300ps to adjust the MISO timing for best sampling on ESP8266 side so it's independent from small chip differences, voltage supply and aging (at least I hope ;-). But if you only want to see the increase in speed (e.g. if only downstreaming data is needed) no SPI counterpart has to be connected to ESP8266 except the oscilloscope for probing. Best regards, /*
#define LED_PIN LED_BUILTIN #define ledOn() digitalWrite(LED_PIN, LOW) #include <SPI.h> char outBuffer[spiBufferSize] attribute ((aligned (4))); // buffer needs to be aligned to 32-bit-boundaries as needed by SPI class. void setup() uint32_t spiFreq = spiFreqMHz * 1000000; inline void spiSetDataBits(uint32_t bits) { void spiTransferBytes(uint8_t * out, uint8_t * in, uint32_t size) void loop() // SPI class implementation // Fast implementation |
igrr
left a comment
There was a problem hiding this comment.
Thanks for the contribution! I really like this performance optimization. I left a few comments on how I think the code can be improved.
| * @param size uint32_t | ||
| */ | ||
| void SPIClass::writeBytes(uint8_t * data, uint32_t size) { | ||
| uint32_t localSize; |
There was a problem hiding this comment.
Something wacky is going on with indentation here; possibly accidental tabs instead of spaces?
There was a problem hiding this comment.
@igrr
That's true, I'll beautify it. Seems that my editor indeed has used sometimes tabs, sometimes spaces. I've already removed the tabs in my local files, will submit them ASAP.
| #include <stdlib.h> | ||
|
|
||
| #define SPI_HAS_TRANSACTION | ||
| #define SPI_SKIP_INITIAL_BUSY_CHECK //comment this line for backward compatibility if really needed |
There was a problem hiding this comment.
I would say, backwards compatibility should be the default option. Making people modify library source code to achieve compatibility is not very nice (especially for a library bundled with the core, which is installed by the IDE who knows where...).
Actually, we only need to wait for SPI to become !busy if the current transfer uses different settings (frequency, data size, data order, etc) from the previous one. So I suggest to move the decision of waiting or not for the transfer to the run time.
So maybe add a member function like _waitForSPINotBusy which would compare new settings to the ones used during previous transfer and either return (if they are the same) or wait for SPI to be !busy (if they are different).
There was a problem hiding this comment.
@igrr
I do think for backwards compatibility in the same way as you do.
But in fact I think that the initial check for !busy is not needed in any of the SPI functions as every write- and transfer-function already checks that its own SPI transfer completes before leaving its context by using
while(SPI1CMD & SPIBUSY) {}
at the end of the function.
Therefore it is not possible to change frequency during a transfer.
I've only added the #define which can be commented as a service in case I've overlooked something. But currently I don't see any threat here, so maybe we can just delete the #define and the initial check completely.
Or is there an effect I currently don't know about that changing the frequency, data size or whatever activates SPI's busy flag? Then I would suggest to wait in the setFrequency or setDataBit functions but not in every transfer.
There was a problem hiding this comment.
If the define shouldn't be necessary, let's just remove it.
| } | ||
|
|
||
| if(data) { | ||
| memcpy((uint8_t*)&SPI1W0, data, size); |
There was a problem hiding this comment.
In general, accessing most peripheral registers other than with l32i/s32i isn't valid, but if you use memcpy with an unaligned source buffer this may very well happen.
memcpy is indeed better than a plain loop, because on Xtensa it interleaves multiple loads and stores, reducing pipeline stalls due to memory/APB access. I suggest to have a check whether data is 32-bit aligned first. If it is, then go via "fast path" using memcpy. If it's not, go via "slow path" of a loop with one memcpy per word. Unlike the old code this solution will gracefully handle unaligned source buffers, removing the alignment restriction (and for aligned buffers we will gain performance due to memcpy).
There was a problem hiding this comment.
@igrr
OK, a 32-bit alignment check can be done easily during run-time.
But, what exactly shall I do in case the data it is not aligned? The old code has just copied 32-bit of data in a loop which could cause an "unaligment exception".
I could go for byte-wise copying (I fear word-wise as you mentioned solves only half of the problem, doesn't it?), but as memcpy is defined for byte-wise operation by nature as it does not know anything about the data types transferred it should be safe from my point of view even for an unaligned source buffer.
Nearly as stated here: https://github.com/jcmvbkbc/u-boot-xtensa/blob/master/doc/README.unaligned-memory-access.txt
I've tested my implementation on non-aligned memory locations withount any errors by creating an aligned buffer and setting the pointer used for SPI access to the first 4 locations with identical results (see code below):
// Define buffer which is aligned
char spiIntfOutBuffer[SPI_BUFFER_SIZE] __attribute__ ((aligned (4)));
// Set start pointer somewhere in buffer to unalign it
uint8_t* spiIntfOutBufferPtr = (uint8_t*)&(spiIntfOutBuffer[3]);
// Use start pointer for SPI transfer
spiIntfTransferBytes(outBufferPtr, NULL, SPI_ADJ_SETUP_BUFFER_SIZE);So from my point of view there is no need for changes. Or did I miss something which is not covered by my tests?
There was a problem hiding this comment.
Memcpy will always use load/store instructions with correct alignment, so unaligned load/store exceptions are not the issue here. The issue may be that memcpy will try to write a smaller than 4-byte value into the peripheral register (i.e. perform a 2-byte or 1-byte write, aligned on 2-byte or 1-byte boundary, respectively). Doing so will not cause any exception, but the SPI FIFO was not designed to support 2-byte or 1-byte writes. So you may possibly end up with incorrect data being sent.
I'll try to come up with some test case which illustrates this.
There was a problem hiding this comment.
After some testing it appears that even though the hardware wasn't designed to accommodate 2-byte and 1-byte writes, it behaves okay for some reason.
@KelrobGithub If you can address the other two minor comments, I will happily merge this. Thanks!
|
Why did this didn't get merged? |
|
@Crypter changes were requested thst apparently weren't done, the implementation may have not been verified to be ok by someone else, the branch currently has conflicts that must be resolved. |
|
@devyte I might be able to. If anything gets done I'll try my best. |
|
Hi to all involved, |
|
I fixed the merge conflicts |
Hi ,
I've optimized the SPI transfer functions for less overhead to increase continuous transfer rate at 80 MHz SPI clock from 18.5 Mbit/s to nearly 40 Mbit/s. This was needed to transfer megabytes of data from an FPGA over ESP8266 and Wifi in reasonable time, so now the bottleneck is shifted to Wifi ;-).
The screenshot shows 2 x 64 byte frames using the current implementation on the left side of the screen and the result of my changes on the right side. As one can see the gap between these frames is quite reduced.

Modifications done:
(1) Changed counter variables and function parameters from 8bit to 32bit as these are handled faster by the Tensilica CPU.
(2) Replaced loops for buffer copying by memcpy and memset calls.
(3) Removed initial check (if SPI interface is busy) as this check is done inside every write and transfer function. These functions are only exited when the interface is not busy any more so the initial check when entering the function is not needed from my point of view.
(4) Restructured write and transfer functions to call _write and _transfer only once in their code so the inline function definitions are really "inlined" (and not only suggested and ignored by GCC compiler).
Code has been tested with various buffer sizes, the SPI class implementation is only 200ns slower than if copy-and-pasting the _transfer function directly into my sketch due to some more checks.
Best regards,
KelrobGithub