Skip to content

Optimized SPI transfer#2747

Open
KelrobGithub wants to merge 2 commits intoesp8266:masterfrom
KelrobGithub:spi_highspeed
Open

Optimized SPI transfer#2747
KelrobGithub wants to merge 2 commits intoesp8266:masterfrom
KelrobGithub:spi_highspeed

Conversation

@KelrobGithub
Copy link
Copy Markdown

@KelrobGithub KelrobGithub commented Dec 8, 2016

Hi ,
I've optimized the SPI transfer functions for less overhead to increase continuous transfer rate at 80 MHz SPI clock from 18.5 Mbit/s to nearly 40 Mbit/s. This was needed to transfer megabytes of data from an FPGA over ESP8266 and Wifi in reasonable time, so now the bottleneck is shifted to Wifi ;-).

The screenshot shows 2 x 64 byte frames using the current implementation on the left side of the screen and the result of my changes on the right side. As one can see the gap between these frames is quite reduced.
esp8266_spi_speed

Modifications done:
(1) Changed counter variables and function parameters from 8bit to 32bit as these are handled faster by the Tensilica CPU.
(2) Replaced loops for buffer copying by memcpy and memset calls.
(3) Removed initial check (if SPI interface is busy) as this check is done inside every write and transfer function. These functions are only exited when the interface is not busy any more so the initial check when entering the function is not needed from my point of view.
(4) Restructured write and transfer functions to call _write and _transfer only once in their code so the inline function definitions are really "inlined" (and not only suggested and ignored by GCC compiler).

Code has been tested with various buffer sizes, the SPI class implementation is only 200ns slower than if copy-and-pasting the _transfer function directly into my sketch due to some more checks.

Best regards,
KelrobGithub

Optimized SPI transfer functions for less overhead to increase
continuous transfer rate at 80 MHz SPI clock from 18.5 MBit/s to nearly
to 40 Mbit/s.
@codecov-io
Copy link
Copy Markdown

Current coverage is 27.80% (diff: 100%)

Merging #2747 into master will not change coverage

@@             master      #2747   diff @@
==========================================
  Files            20         20          
  Lines          3625       3625          
  Methods         335        335          
  Messages          0          0          
  Branches        656        656          
==========================================
  Hits           1008       1008          
  Misses         2441       2441          
  Partials        176        176          

Powered by Codecov. Last update 7b32e6a...a2f2c6b

@mkeyno
Copy link
Copy Markdown

mkeyno commented Dec 10, 2016

@KelrobGithub do you compose any example of your work?

@KelrobGithub
Copy link
Copy Markdown
Author

KelrobGithub commented Dec 12, 2016

@mkeyno
Yes, of course, I copy'n'pasted it below.

As you can see I've just copied the transfer and setdataBits functions from SPI library to my sketch and changed them to faster behavior. In main loop I've done two 64-byte-accesses using the default SPI library and 2 accesses of same size and content using the local functions. The result will look like the screenshot above.

It's worth to mention that sending data at that speed is no problem as SPI CLK and SPI MOSI have a fixed timing relationship. The other direction is far more difficult as when using 80 MHz the SPI counterpart (a FPGA in my project) has to sample in the SPI CLK signal, generate the next bit internally and shift it out on MISO in less than 12.5ns. Currently I'm using a test pattern and a FPGA-internal adjustable delayline with a resolution of ~300ps to adjust the MISO timing for best sampling on ESP8266 side so it's independent from small chip differences, voltage supply and aging (at least I hope ;-).

But if you only want to see the increase in speed (e.g. if only downstreaming data is needed) no SPI counterpart has to be connected to ESP8266 except the oscilloscope for probing.

Best regards,
KelrobGithub

/*

  • Fast SPI test on ESPDuino board
    • GPIO14 is HSPI_CLK (clock output as we are the master)
    • GPIO12 is HSPI_MISO (where data is transmitted by FPGA and read by ESP8266)
    • GPIO13 is HSPI_MOSI (where data is transmitted by ESP8266 and read by FPGA)
    • GPIO15 is HSPI_CS (if low activates the FPGA SPI interface)
      */

#define LED_PIN LED_BUILTIN

#define ledOn() digitalWrite(LED_PIN, LOW)
#define ledOff() digitalWrite(LED_PIN, HIGH)
#define ledToggle() digitalWrite(LED_PIN, !digitalRead(LED_PIN))
#define isLedOn() (digitalRead(LED_PIN) == LOW)

#include <SPI.h>
#define spiFreqMHz 80.0
#define spiBufferSize 64

char outBuffer[spiBufferSize] attribute ((aligned (4))); // buffer needs to be aligned to 32-bit-boundaries as needed by SPI class.
char inBuffer[spiBufferSize] attribute ((aligned (4)));

void setup()
{
pinMode(LED_PIN, OUTPUT);
ledOff();

uint32_t spiFreq = spiFreqMHz * 1000000;
SPI.begin();
SPI.setHwCs(true); // use default chip select output
SPI.setDataMode(SPI_MODE0);
SPI.setBitOrder(MSBFIRST);
SPI.setFrequency(spiFreq);
}

inline void spiSetDataBits(uint32_t bits) {
const uint32_t mask = ~((SPIMMOSI << SPILMOSI) | (SPIMMISO << SPILMISO));
bits--;
SPI1U1 = ((SPI1U1 & mask) | ((bits << SPILMOSI) | (bits << SPILMISO)));
}

void spiTransferBytes(uint8_t * out, uint8_t * in, uint32_t size)
{
spiSetDataBits(size * 8);
if(out) memcpy((uint8_t*)&SPI1W0, out, size);
else memset((uint8_t*)&SPI1W0, 0xFFFFFFFF, size);
SPI1CMD |= SPIBUSY;
while(SPI1CMD & SPIBUSY) {}
if(in) memcpy(in, (uint8_t*)&SPI1W0, size);
}

void loop()
{
delay(2000);
ledToggle();

// SPI class implementation
SPI.transferBytes((uint8_t*)outBuffer, (uint8_t*)inBuffer, spiBufferSize);
SPI.transferBytes((uint8_t*)outBuffer, (uint8_t*)inBuffer, spiBufferSize);

// Fast implementation
spiTransferBytes((uint8_t*)outBuffer, (uint8_t*)inBuffer, spiBufferSize);
spiTransferBytes((uint8_t*)outBuffer, (uint8_t*)inBuffer, spiBufferSize);
}

Copy link
Copy Markdown
Member

@igrr igrr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I really like this performance optimization. I left a few comments on how I think the code can be improved.

Comment thread libraries/SPI/SPI.cpp
* @param size uint32_t
*/
void SPIClass::writeBytes(uint8_t * data, uint32_t size) {
uint32_t localSize;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something wacky is going on with indentation here; possibly accidental tabs instead of spaces?

Copy link
Copy Markdown
Author

@KelrobGithub KelrobGithub Jan 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@igrr
That's true, I'll beautify it. Seems that my editor indeed has used sometimes tabs, sometimes spaces. I've already removed the tabs in my local files, will submit them ASAP.

Comment thread libraries/SPI/SPI.h
#include <stdlib.h>

#define SPI_HAS_TRANSACTION
#define SPI_SKIP_INITIAL_BUSY_CHECK //comment this line for backward compatibility if really needed
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say, backwards compatibility should be the default option. Making people modify library source code to achieve compatibility is not very nice (especially for a library bundled with the core, which is installed by the IDE who knows where...).

Actually, we only need to wait for SPI to become !busy if the current transfer uses different settings (frequency, data size, data order, etc) from the previous one. So I suggest to move the decision of waiting or not for the transfer to the run time.
So maybe add a member function like _waitForSPINotBusy which would compare new settings to the ones used during previous transfer and either return (if they are the same) or wait for SPI to be !busy (if they are different).

Copy link
Copy Markdown
Author

@KelrobGithub KelrobGithub Jan 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@igrr
I do think for backwards compatibility in the same way as you do.
But in fact I think that the initial check for !busy is not needed in any of the SPI functions as every write- and transfer-function already checks that its own SPI transfer completes before leaving its context by using
while(SPI1CMD & SPIBUSY) {}
at the end of the function.

Therefore it is not possible to change frequency during a transfer.

I've only added the #define which can be commented as a service in case I've overlooked something. But currently I don't see any threat here, so maybe we can just delete the #define and the initial check completely.

Or is there an effect I currently don't know about that changing the frequency, data size or whatever activates SPI's busy flag? Then I would suggest to wait in the setFrequency or setDataBit functions but not in every transfer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the define shouldn't be necessary, let's just remove it.

Comment thread libraries/SPI/SPI.cpp
}

if(data) {
memcpy((uint8_t*)&SPI1W0, data, size);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, accessing most peripheral registers other than with l32i/s32i isn't valid, but if you use memcpy with an unaligned source buffer this may very well happen.
memcpy is indeed better than a plain loop, because on Xtensa it interleaves multiple loads and stores, reducing pipeline stalls due to memory/APB access. I suggest to have a check whether data is 32-bit aligned first. If it is, then go via "fast path" using memcpy. If it's not, go via "slow path" of a loop with one memcpy per word. Unlike the old code this solution will gracefully handle unaligned source buffers, removing the alignment restriction (and for aligned buffers we will gain performance due to memcpy).

Copy link
Copy Markdown
Author

@KelrobGithub KelrobGithub Jan 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@igrr
OK, a 32-bit alignment check can be done easily during run-time.

But, what exactly shall I do in case the data it is not aligned? The old code has just copied 32-bit of data in a loop which could cause an "unaligment exception".
I could go for byte-wise copying (I fear word-wise as you mentioned solves only half of the problem, doesn't it?), but as memcpy is defined for byte-wise operation by nature as it does not know anything about the data types transferred it should be safe from my point of view even for an unaligned source buffer.

Nearly as stated here: https://github.com/jcmvbkbc/u-boot-xtensa/blob/master/doc/README.unaligned-memory-access.txt

I've tested my implementation on non-aligned memory locations withount any errors by creating an aligned buffer and setting the pointer used for SPI access to the first 4 locations with identical results (see code below):

// Define buffer which is aligned
char spiIntfOutBuffer[SPI_BUFFER_SIZE] __attribute__ ((aligned (4))); 
// Set start pointer somewhere in buffer to unalign it
uint8_t* spiIntfOutBufferPtr = (uint8_t*)&(spiIntfOutBuffer[3]);
// Use start pointer for SPI transfer
spiIntfTransferBytes(outBufferPtr, NULL, SPI_ADJ_SETUP_BUFFER_SIZE);

So from my point of view there is no need for changes. Or did I miss something which is not covered by my tests?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memcpy will always use load/store instructions with correct alignment, so unaligned load/store exceptions are not the issue here. The issue may be that memcpy will try to write a smaller than 4-byte value into the peripheral register (i.e. perform a 2-byte or 1-byte write, aligned on 2-byte or 1-byte boundary, respectively). Doing so will not cause any exception, but the SPI FIFO was not designed to support 2-byte or 1-byte writes. So you may possibly end up with incorrect data being sent.
I'll try to come up with some test case which illustrates this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some testing it appears that even though the hardware wasn't designed to accommodate 2-byte and 1-byte writes, it behaves okay for some reason.
@KelrobGithub If you can address the other two minor comments, I will happily merge this. Thanks!

@Crypter
Copy link
Copy Markdown
Contributor

Crypter commented Jan 16, 2018

Why did this didn't get merged?

@devyte
Copy link
Copy Markdown
Collaborator

devyte commented Jan 16, 2018

@Crypter changes were requested thst apparently weren't done, the implementation may have not been verified to be ok by someone else, the branch currently has conflicts that must be resolved.
@KelrobGithub will you be finishing this?
@Crypter can you test this once the PR is fixed?

@Crypter
Copy link
Copy Markdown
Contributor

Crypter commented Jan 16, 2018

@devyte I might be able to. If anything gets done I'll try my best.

@KelrobGithub
Copy link
Copy Markdown
Author

Hi to all involved,
I'm sorry for being offline from Github for such a long time, but due to high workload and shifted topics I wasn't able to complete these tasks at work.
The situation hasn't relaxed yet, and I'm not so familiar with the git workflow to do it asap at home.
I would be happy if anyone is able to merge my changes by using the provided files, deleting the ifndef part incl. the while loop and beautifying the indentation.

@d-a-v d-a-v self-assigned this Feb 7, 2018
@earlephilhower earlephilhower added the merge-conflict PR has a merge conflict that needs manual correction label Nov 18, 2019
@d-a-v
Copy link
Copy Markdown
Collaborator

d-a-v commented Nov 9, 2020

I fixed the merge conflicts

@d-a-v d-a-v removed the merge-conflict PR has a merge conflict that needs manual correction label Nov 9, 2020
@d-a-v d-a-v removed their assignment Apr 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants