Mode-specific memory allocation

Excellent. Did we agree on the following, at least conceptually, as a way to relieve RAM pressure from the various modes?

  1. Provide “mode buffer” allocator … with intent to replace static globals that would not be used outside the mode … effectively giving each mode 128k to play with, while reserving the other 128k for use by system bus pirate purposes.

  2. upon entering a mode, all previously allocated “mode buffers” are implicitly free()d … ensuring a known-good starting state

  3. a mode may allocate memory from the “mode buffer”, with intent that this would occur during init()

  4. Aligned memory can be requested, and if non-null, the allocation will be aligned to the requested alignment.

  5. No API to free individual allocations made from “mode buffer” memory … again, intent is to remove large static globals, not a full malloc()

  6. Instead of free, an API will allow the mode to reset the state of the mode buffer to its known-good starting state (all memory free and available to allocate).

There’s a similar set of features in another MIT-licensed project I use, so I will review what API it exposes, and propose something for your thoughts.

The proposal will include up to two mode function pointers (init() and cleanup()), some APIs for requesting mode-specific memory, and a list of guarantees for those mode-specific memory allocators. Examples:

A. upon entering init(), 128k of memory is generally available, aligned to at least a 4k boundary (32k in current implementation, but thought 4k should be sufficient for the API guarantee).

B. If requesting aligned memory allocation, the returned pointer is guaranteed to meet that alignment requirement. If not possible to align as requested, nullptr will be returned.

C. There are no guarantees as to the relative values of pointer for successive allocations. A later allocation may have a pointer prior to (value less than) or after (value greater than) a prior allocation. Similarly, there is not guarantee that successive allocations will be adjoining … the memory may include overhead and/or space set aside to help debug issues. Debug builds might have compile-time options that will change the allocator behavior (etc.).

In principal, yes, this all sounds great. If there’s an existing system for something like this all the better.

2 Likes

OK. This seems like a critical feature to remove RAM pressure. Given only 2% of FLASH is currently used, if the RAM pressure is removed, it should allow enabling all modes in a single binary.

Given this change is fundamental, I want to start with few features (e.g., no free() except to fully reset the allocator). Makes it easier to later add more advanced features without breaking any prior guarantees.

Hm… I need to check how the menu system deals with >9 options…

Please note that the logic analyzer code requires a 32K alignment for its 128K allocation. I am speculating that there may be a way to alleviate this constraint by clever use of DMA control blocks

Oh? Thank you for raising that requirement early!

Can you help me understand the underlying cause of the 32k alignment requirement?

edit: Also, can you point me to the corresponding code that sets up the hardware that is requiring the 32k alignment?

Thanks!

The logic analyzer memory looks like a 128K ring buffer. The LA once started does not need CPU intervention. The buffer is filled then overwritten just based on the DMA activity. The RP2040 has a notion of DMA ring buffer but it is cheaply implemented. The modulo operator of the destination address is implemented by masking the low bytes. Hence the requirement is to align the buffer to its size (128K)!! To reduce this alignment constraint, the LA code is devilishly clever. It divides the buffer into 4 32K buffers, each controlled by a 1 DMA unit. These units are chained so each transfer completion of 32K triggers the next DMA unit to start its own 32K transfer. The last one starts the first one to form a DMA transfer ring of 128K. Each DMA unit is configured as a buffer ring of 32K so when they arer re-started they start at the beginning of their buffer. logicanalyzer_setup(void) at logicanalyzer.c:548 and restart_dma() at logicanalyzer.c:430 does the setup for this.

Edit: the blocks of 32K need to be contiguous.

2 Likes

Those are some neat hacks!

Since this is using four (!) DMA units, there are likely ways to improve this.

Chaining two DMA units allows one to implement full scatter-gather (aka Vectored IO).

Basically, you have a first DMA unit that’s doing the bulk transfer, and a second DMA unit whose sole purpose is to DMA the control values to the first DMA unit’s controlling registers.

This is a very powerful technique, as it allows (essentially) arbitrary-sized, non-contiguous blocks of memory to be DMA’d without interruption. Of course, each individual block of memory must start / end at the hardware’s DMA alignment requirement, but that’s typically small (e.g., 16 bytes or less) on any hardware supporting scatter-gather lists, and I think 4 byte alignment on PICO?

I believe you could reduce to just two DMA units. Still allocate the 128k buffer … but alignment requirement is now 4 bytes (or whatever the DMA engines on the PICO are). Create an array of control blocks that are written via the second DMA engine to setup DMA on the first DMA engine. If I understand correctly, that DMA will write to “shadow” registers, so an existing in-progress DMA can continue to be processed while setting up (via DMA) the next DMA operation. When the first DMA engine is done, it can be configured to latch the shadow registers into its internal copies, which seamlessly starts the next DMA section (or ends, if a register contains a predetermined sentinel value such as NULL). Neat, eh?

For perpetual ring-buffer DMA, you can setup the second DMA to loop as you describe … at which point the control blocks have a slightly larger alignment requirement (but still reasonably small).

Look into it … not only will implementing scatter-gather on PICO grant you the DMA Wizard merit badge, it’s a really useful technique … especially when dealing with protocols that wrap the data (I’m looking at you, TCP/IP…), as you will no longer need to copy the data to a buffer with extra space for the CRC / headers / ECC / whatever else it’s wrapped with.

I implemented this on an nRF52840 a while back. Does the RP2040 have the similar “shadow register” concept, which allows setup of the next DMA transfer while the current one is active? If so, it’s a really clean solution…

1 Like

See 2.5.6.2. DMA Control Blocks in the RP2040 datasheet, including the CHAIN_TO and four sets of register aliases. :slight_smile:

So… no shadow registers, which just means you need to count clock cycles to ensure the next DMA occurs before any PIO buffers fill up and bits get dropped.

Option 1:
Always write all four registers, and choose any of the register aliases, since the last write will always be a trigger.

Option 2:
Determine one or more value(s) which remain static and don’t require updates. Then, select a register alias set where the static value(s) are prior to the values that need to be updated. Now the second DMA engine can write fewer registers each cycle, but still finish on a register alias that will trigger.

Look at alias #2 … I would guess that CTRL and TRANS_COUNT could be left alone by always transfering same-sized blocks (e.g., 2k or 32k) on the first DMA engine. That reduces the data that the second DMA engine writes to just two 32-bit values. If READ_ADDR is also static (which I think it may be), then you only need to have an array of the write addresses, and the second DMA engine writes a single 32-bit value to a single register … i.e., single-clock delay, hidden by the PIO program’s efficiency and/or buffer settings.

Oh, and I’m presuming you’ve set the PIO buffer to be 2x deep (output only). If not, it likely would not hurt, and give a few extra cycles for the DMA setup…

Can you also help me understand a few items? (so I’m working from the same understanding as you):

  1. How many (maximum) samples are shown on the OLED at a time?

  2. Am I reading correctly that 32k of data is captured prior to any updates of the OLED? (e.g., because it has to wait for a DMA operation to finish?)

  3. Is there any interest in having the LA store the samples directly to the NAND (for huge traces)?

Just to be clear. I didn’t write this LA code, I just cleaned one or two things. I also never looked at the LA UI. I was more interested in the OLS interface and fixing the sigrok code.
As I said earlier I know about the DMA control blocks and I suspect it could work but I don’t have much incentive to modify the code since it is a frustrating endeavor. Possible to test but impossible to debug due to the nature of the DMA engine.

1 Like

The only question I can answer is 2.
The OLS interface can provide very short samples like 100 samples. The number of samples is controlled by the PIO engine, not the DMA engine. The DMA never finishes, it just loops around forever, waiting for data coming from the PIO engine.
About 3. I can only express an opinion here. If the write speed of the NAND is greater than the USB throughput, then it would be interesting. Otherwise, I would much prefer using sigrok/Pulseview to gather longer traces.

1 Like

Ah! My mistake. Yes, debugging DMA configuratoin is definitely painful.

Thanks for helping me understand the limitations currently part of the LA architecture. While painful, it will inform the mode allocator.

On the “good news” side of things:
I figured out how to reserve a 32k-aligned buffer of arbitrary size via the ldscript. Thus, I can reserve 138k (instead of 128k), allowing for both large aligned and smaller, lesser-aligned allocations. Working through the logic now…

  • How many (maximum) samples are shown on the OLED at a time?

Currently none :slight_smile: There is a scope for the display, but not the LA. It would be cute, but I guess my general feeling is that software (or the VT100 LA I started as a proof of concept at one point) is a nicer way to interact with the data. Especially because we only have one button for a physical interface.

  • Is there any interest in having the LA store the samples directly to the NAND (for huge traces)?

I would guess minimal. There is also a sigrok compatible firmware for the Bus Pirate that does continuous sampling by shoving samples down USB as fast as possible. It’s ported from the pico-la-sigrok project on github. It can sustain ~3MSPS, triggers are done in sigrok.

This is one of the things that I hope to relocate into a large firmware and jump to based on a second stage bootloader. The overhead of the Bus Pirate firmware is too much for a project like pico-la-sigrok that tests the limits of the RP2040 hardware.