Optimized SPI transfer by KelrobGithub · Pull Request #2747 · esp8266/Arduino

KelrobGithub · 2016-12-08T17:49:28Z

Hi ,
I've optimized the SPI transfer functions for less overhead to increase continuous transfer rate at 80 MHz SPI clock from 18.5 Mbit/s to nearly 40 Mbit/s. This was needed to transfer megabytes of data from an FPGA over ESP8266 and Wifi in reasonable time, so now the bottleneck is shifted to Wifi ;-).

The screenshot shows 2 x 64 byte frames using the current implementation on the left side of the screen and the result of my changes on the right side. As one can see the gap between these frames is quite reduced.

Modifications done:
(1) Changed counter variables and function parameters from 8bit to 32bit as these are handled faster by the Tensilica CPU.
(2) Replaced loops for buffer copying by memcpy and memset calls.
(3) Removed initial check (if SPI interface is busy) as this check is done inside every write and transfer function. These functions are only exited when the interface is not busy any more so the initial check when entering the function is not needed from my point of view.
(4) Restructured write and transfer functions to call _write and _transfer only once in their code so the inline function definitions are really "inlined" (and not only suggested and ignored by GCC compiler).

Code has been tested with various buffer sizes, the SPI class implementation is only 200ns slower than if copy-and-pasting the _transfer function directly into my sketch due to some more checks.

Best regards,
KelrobGithub

Optimized SPI transfer functions for less overhead to increase continuous transfer rate at 80 MHz SPI clock from 18.5 MBit/s to nearly to 40 Mbit/s.

codecov-io · 2016-12-08T18:10:30Z

Current coverage is 27.80% (diff: 100%)

Merging #2747 into master will not change coverage

@@             master      #2747   diff @@
==========================================
  Files            20         20          
  Lines          3625       3625          
  Methods         335        335          
  Messages          0          0          
  Branches        656        656          
==========================================
  Hits           1008       1008          
  Misses         2441       2441          
  Partials        176        176

Powered by Codecov. Last update 7b32e6a...a2f2c6b

mkeyno · 2016-12-10T19:14:06Z

@KelrobGithub do you compose any example of your work?

KelrobGithub · 2016-12-12T07:35:25Z

@mkeyno
Yes, of course, I copy'n'pasted it below.

As you can see I've just copied the transfer and setdataBits functions from SPI library to my sketch and changed them to faster behavior. In main loop I've done two 64-byte-accesses using the default SPI library and 2 accesses of same size and content using the local functions. The result will look like the screenshot above.

It's worth to mention that sending data at that speed is no problem as SPI CLK and SPI MOSI have a fixed timing relationship. The other direction is far more difficult as when using 80 MHz the SPI counterpart (a FPGA in my project) has to sample in the SPI CLK signal, generate the next bit internally and shift it out on MISO in less than 12.5ns. Currently I'm using a test pattern and a FPGA-internal adjustable delayline with a resolution of ~300ps to adjust the MISO timing for best sampling on ESP8266 side so it's independent from small chip differences, voltage supply and aging (at least I hope ;-).

But if you only want to see the increase in speed (e.g. if only downstreaming data is needed) no SPI counterpart has to be connected to ESP8266 except the oscilloscope for probing.

Best regards,
KelrobGithub

/*

Fast SPI test on ESPDuino board
- GPIO14 is HSPI_CLK (clock output as we are the master)
- GPIO12 is HSPI_MISO (where data is transmitted by FPGA and read by ESP8266)
- GPIO13 is HSPI_MOSI (where data is transmitted by ESP8266 and read by FPGA)
- GPIO15 is HSPI_CS (if low activates the FPGA SPI interface)
  */

#define LED_PIN LED_BUILTIN

#define ledOn() digitalWrite(LED_PIN, LOW)
#define ledOff() digitalWrite(LED_PIN, HIGH)
#define ledToggle() digitalWrite(LED_PIN, !digitalRead(LED_PIN))
#define isLedOn() (digitalRead(LED_PIN) == LOW)

#include <SPI.h>
#define spiFreqMHz 80.0
#define spiBufferSize 64

char outBuffer[spiBufferSize] attribute ((aligned (4))); // buffer needs to be aligned to 32-bit-boundaries as needed by SPI class.
char inBuffer[spiBufferSize] attribute ((aligned (4)));

void setup()
{
pinMode(LED_PIN, OUTPUT);
ledOff();

uint32_t spiFreq = spiFreqMHz * 1000000;
SPI.begin();
SPI.setHwCs(true); // use default chip select output
SPI.setDataMode(SPI_MODE0);
SPI.setBitOrder(MSBFIRST);
SPI.setFrequency(spiFreq);
}

inline void spiSetDataBits(uint32_t bits) {
const uint32_t mask = ~((SPIMMOSI << SPILMOSI) | (SPIMMISO << SPILMISO));
bits--;
SPI1U1 = ((SPI1U1 & mask) | ((bits << SPILMOSI) | (bits << SPILMISO)));
}

void spiTransferBytes(uint8_t * out, uint8_t * in, uint32_t size)
{
spiSetDataBits(size * 8);
if(out) memcpy((uint8_t*)&SPI1W0, out, size);
else memset((uint8_t*)&SPI1W0, 0xFFFFFFFF, size);
SPI1CMD |= SPIBUSY;
while(SPI1CMD & SPIBUSY) {}
if(in) memcpy(in, (uint8_t*)&SPI1W0, size);
}

void loop()
{
delay(2000);
ledToggle();

// SPI class implementation
SPI.transferBytes((uint8_t*)outBuffer, (uint8_t*)inBuffer, spiBufferSize);
SPI.transferBytes((uint8_t*)outBuffer, (uint8_t*)inBuffer, spiBufferSize);

// Fast implementation
spiTransferBytes((uint8_t*)outBuffer, (uint8_t*)inBuffer, spiBufferSize);
spiTransferBytes((uint8_t*)outBuffer, (uint8_t*)inBuffer, spiBufferSize);
}

igrr

Thanks for the contribution! I really like this performance optimization. I left a few comments on how I think the code can be improved.

igrr · 2017-01-06T02:58:08Z

 * @param size uint32_t
 */
 void SPIClass::writeBytes(uint8_t * data, uint32_t size) {
+	uint32_t localSize;


Something wacky is going on with indentation here; possibly accidental tabs instead of spaces?

@igrr
That's true, I'll beautify it. Seems that my editor indeed has used sometimes tabs, sometimes spaces. I've already removed the tabs in my local files, will submit them ASAP.

igrr · 2017-01-06T03:05:44Z

 #include <stdlib.h>

 #define SPI_HAS_TRANSACTION
+#define SPI_SKIP_INITIAL_BUSY_CHECK //comment this line for backward compatibility if really needed


I would say, backwards compatibility should be the default option. Making people modify library source code to achieve compatibility is not very nice (especially for a library bundled with the core, which is installed by the IDE who knows where...).

Actually, we only need to wait for SPI to become !busy if the current transfer uses different settings (frequency, data size, data order, etc) from the previous one. So I suggest to move the decision of waiting or not for the transfer to the run time.
So maybe add a member function like _waitForSPINotBusy which would compare new settings to the ones used during previous transfer and either return (if they are the same) or wait for SPI to be !busy (if they are different).

@igrr
I do think for backwards compatibility in the same way as you do.
But in fact I think that the initial check for !busy is not needed in any of the SPI functions as every write- and transfer-function already checks that its own SPI transfer completes before leaving its context by using
while(SPI1CMD & SPIBUSY) {}
at the end of the function.

Therefore it is not possible to change frequency during a transfer.

I've only added the #define which can be commented as a service in case I've overlooked something. But currently I don't see any threat here, so maybe we can just delete the #define and the initial check completely.

Or is there an effect I currently don't know about that changing the frequency, data size or whatever activates SPI's busy flag? Then I would suggest to wait in the setFrequency or setDataBit functions but not in every transfer.

If the define shouldn't be necessary, let's just remove it.

igrr · 2017-01-06T03:10:31Z

-    }
-
+	if(data) {
+		memcpy((uint8_t*)&SPI1W0, data, size);


In general, accessing most peripheral registers other than with l32i/s32i isn't valid, but if you use memcpy with an unaligned source buffer this may very well happen.
memcpy is indeed better than a plain loop, because on Xtensa it interleaves multiple loads and stores, reducing pipeline stalls due to memory/APB access. I suggest to have a check whether data is 32-bit aligned first. If it is, then go via "fast path" using memcpy. If it's not, go via "slow path" of a loop with one memcpy per word. Unlike the old code this solution will gracefully handle unaligned source buffers, removing the alignment restriction (and for aligned buffers we will gain performance due to memcpy).

@igrr
OK, a 32-bit alignment check can be done easily during run-time.

But, what exactly shall I do in case the data it is not aligned? The old code has just copied 32-bit of data in a loop which could cause an "unaligment exception".
I could go for byte-wise copying (I fear word-wise as you mentioned solves only half of the problem, doesn't it?), but as memcpy is defined for byte-wise operation by nature as it does not know anything about the data types transferred it should be safe from my point of view even for an unaligned source buffer.

Nearly as stated here: https://github.com/jcmvbkbc/u-boot-xtensa/blob/master/doc/README.unaligned-memory-access.txt

I've tested my implementation on non-aligned memory locations withount any errors by creating an aligned buffer and setting the pointer used for SPI access to the first 4 locations with identical results (see code below):

// Define buffer which is aligned char spiIntfOutBuffer[SPI_BUFFER_SIZE] __attribute__ ((aligned (4))); // Set start pointer somewhere in buffer to unalign it uint8_t* spiIntfOutBufferPtr = (uint8_t*)&(spiIntfOutBuffer[3]); // Use start pointer for SPI transfer spiIntfTransferBytes(outBufferPtr, NULL, SPI_ADJ_SETUP_BUFFER_SIZE);

So from my point of view there is no need for changes. Or did I miss something which is not covered by my tests?

Memcpy will always use load/store instructions with correct alignment, so unaligned load/store exceptions are not the issue here. The issue may be that memcpy will try to write a smaller than 4-byte value into the peripheral register (i.e. perform a 2-byte or 1-byte write, aligned on 2-byte or 1-byte boundary, respectively). Doing so will not cause any exception, but the SPI FIFO was not designed to support 2-byte or 1-byte writes. So you may possibly end up with incorrect data being sent.
I'll try to come up with some test case which illustrates this.

After some testing it appears that even though the hardware wasn't designed to accommodate 2-byte and 1-byte writes, it behaves okay for some reason.
@KelrobGithub If you can address the other two minor comments, I will happily merge this. Thanks!

Crypter · 2018-01-16T15:50:07Z

Why did this didn't get merged?

devyte · 2018-01-16T19:23:43Z

@Crypter changes were requested thst apparently weren't done, the implementation may have not been verified to be ok by someone else, the branch currently has conflicts that must be resolved.
@KelrobGithub will you be finishing this?
@Crypter can you test this once the PR is fixed?

Crypter · 2018-01-16T19:33:09Z

@devyte I might be able to. If anything gets done I'll try my best.

KelrobGithub · 2018-01-28T20:16:40Z

Hi to all involved,
I'm sorry for being offline from Github for such a long time, but due to high workload and shifted topics I wasn't able to complete these tasks at work.
The situation hasn't relaxed yet, and I'm not so familiar with the git workflow to do it asap at home.
I would be happy if anyone is able to merge my changes by using the provided files, deleting the ifndef part incl. the while loop and beautifying the indentation.

d-a-v · 2020-11-09T17:17:52Z

I fixed the merge conflicts

Optimized SPI transfer

a2f2c6b

Optimized SPI transfer functions for less overhead to increase continuous transfer rate at 80 MHz SPI clock from 18.5 MBit/s to nearly to 40 Mbit/s.

igrr requested changes Jan 6, 2017

View reviewed changes

d-a-v self-assigned this Feb 7, 2018

earlephilhower added the merge-conflict PR has a merge conflict that needs manual correction label Nov 18, 2019

Merge branch 'master' into spi_highspeed

9281779

d-a-v removed the merge-conflict PR has a merge conflict that needs manual correction label Nov 9, 2020

d-a-v removed their assignment Apr 15, 2021

Conversation

KelrobGithub commented Dec 8, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-io commented Dec 8, 2016

Current coverage is 27.80% (diff: 100%)

Uh oh!

mkeyno commented Dec 10, 2016

Uh oh!

KelrobGithub commented Dec 12, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

igrr left a comment

Choose a reason for hiding this comment

Uh oh!

igrr Jan 6, 2017

Choose a reason for hiding this comment

Uh oh!

KelrobGithub Jan 17, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

igrr Jan 6, 2017

Choose a reason for hiding this comment

Uh oh!

KelrobGithub Jan 17, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

igrr Feb 3, 2017

Choose a reason for hiding this comment

Uh oh!

igrr Jan 6, 2017

Choose a reason for hiding this comment

Uh oh!

KelrobGithub Jan 17, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

igrr Feb 3, 2017

Choose a reason for hiding this comment

Uh oh!

igrr Feb 19, 2017

Choose a reason for hiding this comment

Uh oh!

Crypter commented Jan 16, 2018

Uh oh!

devyte commented Jan 16, 2018

Uh oh!

Crypter commented Jan 16, 2018

Uh oh!

KelrobGithub commented Jan 28, 2018

Uh oh!

d-a-v commented Nov 9, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

KelrobGithub commented Dec 8, 2016 •

edited

Loading

KelrobGithub commented Dec 12, 2016 •

edited

Loading

KelrobGithub Jan 17, 2017 •

edited

Loading

KelrobGithub Jan 17, 2017 •

edited

Loading

KelrobGithub Jan 17, 2017 •

edited

Loading