Mode-specific memory allocation

Excellent. Did we agree on the following, at least conceptually, as a way to relieve RAM pressure from the various modes?

  1. Provide a “mode buffer” allocator … with intent to replace static globals that would not be used outside the mode … effectively giving each mode 128k to play with, while reserving the other 128k for system-wide Bus Pirate purposes.

  2. Upon entering a mode, all previously allocated “mode buffers” are implicitly free()d … ensuring a known-good starting state

  3. A mode may allocate memory from the “mode buffer”, with the intent that this occurs during init()

  4. Aligned memory can be requested, and if non-null, the allocation will be aligned to the requested alignment.

  5. No API to free individual allocations made from “mode buffer” memory … again, intent is to remove large static globals, not a full malloc()

  6. Instead of free(), an API will allow the mode to reset the mode buffer to its known-good starting state (all memory free and available to allocate).
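To sketch what that might look like as a header (all names below are placeholders, nothing final):

```c
#include <stddef.h>

// Implicitly called on mode entry (point 2); also exposed so a mode can
// return to the known-good state itself (point 6).
void mode_buffer_reset(void);

// Allocate from the mode's 128k region, with intent that this happens
// during init() (point 3). Returns NULL when the region is exhausted.
void *mode_buffer_alloc(size_t size);

// Aligned variant (point 4): a non-NULL result meets `alignment`;
// NULL means the request could not be satisfied at that alignment.
void *mode_buffer_alloc_aligned(size_t size, size_t alignment);
```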

There’s a similar set of features in another MIT-licensed project I use, so I will review what API it exposes, and propose something for your thoughts.

The proposal will include up to two mode function pointers (init() and cleanup()), some APIs for requesting mode-specific memory, and a list of guarantees for those mode-specific memory allocators. Examples:

A. Upon entering init(), 128k of memory is generally available, aligned to at least a 4k boundary (32k in the current implementation, but 4k seemed sufficient for the API guarantee).

B. If requesting aligned memory allocation, the returned pointer is guaranteed to meet that alignment requirement. If not possible to align as requested, nullptr will be returned.

C. There are no guarantees as to the relative pointer values of successive allocations. A later allocation may have a pointer prior to (value less than) or after (value greater than) a prior allocation. Similarly, there is no guarantee that successive allocations will be adjoining … the memory may include overhead and/or space set aside to help debug issues. Debug builds might have compile-time options that change the allocator behavior (etc.).
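And a hypothetical usage example, continuing the placeholder names from the sketch above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Placeholder prototypes from the sketch above:
void *mode_buffer_alloc(size_t size);
void *mode_buffer_alloc_aligned(size_t size, size_t alignment);

static uint8_t *capture_buf;   // invented example names
static uint8_t *scratch;

static bool example_mode_init(void) {
    // Guarantee A: on entry to init(), the full region is free and at least
    // 4k-aligned, so a modest 4k-aligned request should succeed.
    capture_buf = mode_buffer_alloc_aligned(64 * 1024, 4096);
    scratch     = mode_buffer_alloc(2 * 1024);
    // Guarantee B: NULL means the request (or its alignment) was impossible.
    return capture_buf != NULL && scratch != NULL;
}
// Per point 5, there is no per-allocation free(); everything is reclaimed
// when the mode changes (point 2) or via the explicit reset (point 6).
```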

In principle, yes, this all sounds great. If there’s an existing system for something like this, all the better.

2 Likes

OK. This seems like a critical feature for removing RAM pressure. Given that only 2% of FLASH is currently used, removing the RAM pressure should allow enabling all modes in a single binary.

Given this change is fundamental, I want to start with a minimal feature set (e.g., no free() except to fully reset the allocator). That makes it easier to later add more advanced features without breaking any prior guarantees.

Hm… I need to check how the menu system deals with >9 options…

Please note that the logic analyzer code requires a 32K alignment for its 128K allocation. I am speculating that there may be a way to alleviate this constraint by clever use of DMA control blocks.

Oh? Thank you for raising that requirement early!

Can you help me understand the underlying cause of the 32k alignment requirement?

Edit: Also, can you point me to the corresponding code that sets up the hardware that requires the 32k alignment?

Thanks!

The logic analyzer memory looks like a 128K ring buffer. Once started, the LA does not need CPU intervention: the buffer is filled, then overwritten, purely based on the DMA activity.

The RP2040 has a notion of a DMA ring buffer, but it is cheaply implemented: the modulo of the destination address is computed by masking the low address bits. Hence the requirement to align the buffer to its size (128K)!!

To reduce this alignment constraint, the LA code is devilishly clever. It divides the buffer into four 32K buffers, each controlled by one DMA unit. These units are chained, so each completed 32K transfer triggers the next DMA unit to start its own 32K transfer. The last one re-triggers the first, forming a 128K DMA transfer ring. Each DMA unit is configured as a 32K buffer ring, so when it is restarted it starts at the beginning of its own buffer.

logicanalyzer_setup(void) at logicanalyzer.c:548 and restart_dma() at logicanalyzer.c:430 do the setup for this.

Edit: the blocks of 32K need to be contiguous.
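For illustration, here is roughly how such a chained ring is built with the Pico SDK (a sketch, not the actual logicanalyzer.c code):

```c
#include "hardware/dma.h"
#include "hardware/pio.h"

#define QUARTER (32 * 1024)

// `buf` must be 128K, contiguous, and 32K-aligned (see the Edit above).
static void la_ring_setup(PIO pio, uint sm, uint8_t *buf) {
    int ch[4];
    for (int i = 0; i < 4; i++) {
        ch[i] = dma_claim_unused_channel(true);
    }

    for (int i = 0; i < 4; i++) {
        dma_channel_config c = dma_channel_get_default_config(ch[i]);
        channel_config_set_transfer_data_size(&c, DMA_SIZE_8);
        channel_config_set_read_increment(&c, false);   // fixed: the PIO RX FIFO
        channel_config_set_write_increment(&c, true);
        channel_config_set_dreq(&c, pio_get_dreq(pio, sm, false));
        // Ring on the write address, wrapping every 2^15 = 32K bytes: this
        // is the "modulo by masking low bits", and why each 32K block must
        // be 32K-aligned.
        channel_config_set_ring(&c, true, 15);
        // Completion triggers the next unit; the last chains to the first.
        channel_config_set_chain_to(&c, ch[(i + 1) % 4]);
        dma_channel_configure(ch[i], &c,
                              buf + i * QUARTER,  // this unit's 32K block
                              &pio->rxf[sm],      // samples from the PIO
                              QUARTER,            // one 32K block per trigger
                              false);
    }
    dma_channel_start(ch[0]);  // the ring then runs without CPU intervention
}
```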

2 Likes

Those are some neat hacks!

Since this is using four (!) DMA units, there are likely ways to improve this.

Chaining two DMA units allows one to implement full scatter-gather (aka Vectored IO).

Basically, you have a first DMA unit that’s doing the bulk transfer, and a second DMA unit whose sole purpose is to DMA the control values to the first DMA unit’s controlling registers.

This is a very powerful technique, as it allows (essentially) arbitrary-sized, non-contiguous blocks of memory to be DMA’d without interruption. Of course, each individual block of memory must start/end on the hardware’s DMA alignment boundary, but that’s typically small (e.g., 16 bytes or less) on any hardware supporting scatter-gather lists … and I think it’s 4-byte alignment on the PICO?

I believe you could reduce to just two DMA units. Still allocate the 128k buffer … but the alignment requirement is now 4 bytes (or whatever the PICO’s DMA engines require). Create an array of control blocks that are written via the second DMA engine to set up DMA on the first DMA engine. If I understand correctly, that DMA will write to “shadow” registers, so an existing in-progress DMA can continue to be processed while setting up (via DMA) the next DMA operation. When the first DMA engine is done, it can be configured to latch the shadow registers into its internal copies, which seamlessly starts the next DMA section (or ends, if a register contains a predetermined sentinel value such as NULL). Neat, eh?

For perpetual ring-buffer DMA, you can set up the second DMA to loop as you describe … at which point the control blocks have a slightly larger alignment requirement (but still reasonably small).

Look into it … not only will implementing scatter-gather on PICO grant you the DMA Wizard merit badge, it’s a really useful technique … especially when dealing with protocols that wrap the data (I’m looking at you, TCP/IP…), as you will no longer need to copy the data to a buffer with extra space for the CRC / headers / ECC / whatever else it’s wrapped with.

I implemented this on an nRF52840 a while back. Does the RP2040 have a similar “shadow register” concept, which allows setup of the next DMA transfer while the current one is active? If so, it’s a really clean solution…

1 Like

See 2.5.6.2. DMA Control Blocks in the RP2040 datasheet, including the CHAIN_TO and four sets of register aliases. :slight_smile:

So… no shadow registers, which just means you need to count clock cycles to ensure the next DMA occurs before any PIO buffers fill up and bits get dropped.

Option 1:
Always write all four registers, and choose any of the register aliases, since the last write will always be a trigger.

Option 2:
Determine one or more value(s) which remain static and don’t require updates. Then, select a register alias set where the static value(s) are prior to the values that need to be updated. Now the second DMA engine can write fewer registers each cycle, but still finish on a register alias that will trigger.
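For example, with alias set #3 the final two registers are TRANS_COUNT and READ_ADDR_TRIG, so a two-word control block per transfer suffices. An untested sketch (buffer names invented), modeled on the datasheet’s control-block pattern:

```c
#include "hardware/dma.h"

// Each control block programs one bulk transfer; the write to READ_ADDR
// (via the alias-3 register set) is the trigger.
struct ctrl_blk {
    uint32_t        trans_count;  // length of this block, in transfers
    const uint32_t *read_addr;    // source of this block
};

static const uint32_t part_a[256], part_b[512];  // scattered source blocks

static const struct ctrl_blk blocks[] = {
    { 256, part_a },
    { 512, part_b },
    { 0, NULL },  // sentinel: a null trigger leaves the bulk channel stopped
};

static void sg_setup(int bulk_ch, int ctrl_ch, volatile void *dest) {
    // Control channel: copies one 2-word block per trigger. The 8-byte
    // write ring keeps it landing on AL3_TRANS_COUNT, then AL3_READ_ADDR_TRIG.
    dma_channel_config c = dma_channel_get_default_config(ctrl_ch);
    channel_config_set_write_increment(&c, true);
    channel_config_set_ring(&c, true, 3);  // write wrap every 2^3 = 8 bytes
    dma_channel_configure(ctrl_ch, &c,
                          &dma_hw->ch[bulk_ch].al3_transfer_count,
                          blocks, 2, false);

    // Bulk channel: CTRL and WRITE_ADDR are programmed once (add a DREQ for
    // a real peripheral destination); chaining back to the control channel
    // fetches the next block each time a block completes.
    dma_channel_config b = dma_channel_get_default_config(bulk_ch);
    channel_config_set_chain_to(&b, ctrl_ch);
    channel_config_set_irq_quiet(&b, true);
    dma_channel_configure(bulk_ch, &b, dest, NULL, 0, false);

    dma_channel_start(ctrl_ch);
}
```

For the perpetual ring case, replace the NULL sentinel with a read ring on the control channel, which is where the slightly larger control-block alignment comes from.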

Look at alias #2 … I would guess that CTRL and TRANS_COUNT could be left alone by always transferring same-sized blocks (e.g., 2k or 32k) on the first DMA engine. That reduces the data the second DMA engine writes to just two 32-bit values. If READ_ADDR is also static (which I think it may be), then you only need an array of the write addresses, and the second DMA engine writes a single 32-bit value to a single register … i.e., a single-clock delay, hidden by the PIO program’s efficiency and/or buffer settings.
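That single-register variant might look like this (again untested, names invented):

```c
#include "hardware/dma.h"

#define NBLOCKS 4

// One destination address per fixed-size block. Aligned so the control
// channel's read ring can loop over the list forever (4 entries * 4 bytes).
static uint32_t next_write_addr[NBLOCKS] __attribute__((aligned(NBLOCKS * 4)));

static void ctrl_setup(int ctrl_ch, int bulk_ch) {
    // CTRL, TRANS_COUNT, and READ_ADDR of the bulk channel never change;
    // each re-trigger is a single write to AL2_WRITE_ADDR_TRIG.
    dma_channel_config c = dma_channel_get_default_config(ctrl_ch);
    channel_config_set_read_increment(&c, true);    // walk the address list
    channel_config_set_write_increment(&c, false);  // always the same register
    channel_config_set_ring(&c, false, 4);          // wrap the 16-byte list
    dma_channel_configure(ctrl_ch, &c,
                          &dma_hw->ch[bulk_ch].al2_write_addr_trig,
                          next_write_addr,
                          1,   // a single 32-bit write per re-trigger
                          false);
    // The bulk channel chains back to ctrl_ch, so each completed block
    // costs exactly one register write: the single-clock delay above.
}
```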

Oh, and I’m presuming you’ve set the PIO buffer to be 2x deep (output only). If not, it likely would not hurt, and would give a few extra cycles for the DMA setup…
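On the RP2040 that’s the FIFO-join feature; a sketch:

```c
#include "hardware/pio.h"

// Joining the state machine's FIFOs gives the RX side 8 entries instead
// of 4, hiding the DMA re-trigger latency discussed above.
static void sm_setup(PIO pio, uint sm, uint offset) {
    pio_sm_config c = pio_get_default_sm_config();
    sm_config_set_fifo_join(&c, PIO_FIFO_JOIN_RX);
    pio_sm_init(pio, sm, offset, &c);
}
```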

Can you also help me understand a few items? (so I’m working from the same understanding as you):

  1. How many (maximum) samples are shown on the OLED at a time?

  2. Am I reading correctly that 32k of data is captured prior to any updates of the OLED? (e.g., because it has to wait for a DMA operation to finish?)

  3. Is there any interest in having the LA store the samples directly to the NAND (for huge traces)?

Just to be clear: I didn’t write this LA code, I just cleaned up one or two things. I also never looked at the LA UI. I was more interested in the OLS interface and fixing the sigrok code.
As I said earlier, I know about the DMA control blocks, and I suspect it could work, but I don’t have much incentive to modify the code since it is a frustrating endeavor: possible to test, but impossible to debug, due to the nature of the DMA engine.

2 Likes

The only question I can answer is #2.
The OLS interface can request very short captures, e.g., 100 samples. The number of samples is controlled by the PIO engine, not the DMA engine. The DMA never finishes; it just loops around forever, waiting for data coming from the PIO engine.
About #3, I can only express an opinion here. If the write speed of the NAND is greater than the USB throughput, then it would be interesting. Otherwise, I would much prefer using sigrok/PulseView to gather longer traces.

2 Likes

Ah! My mistake. Yes, debugging DMA configuration is definitely painful.

Thanks for helping me understand the limitations in the current LA architecture. While painful, they will inform the mode allocator.

On the “good news” side of things:
I figured out how to reserve a 32k-aligned buffer of arbitrary size via the ldscript. Thus, I can reserve 138k (instead of 128k), allowing for both large aligned and smaller, lesser-aligned allocations. Working through the logic now…
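(For comparison, the same reservation could be expressed directly in C with an alignment attribute; the ldscript route achieves the same effect while keeping placement under the linker’s control:)

```c
#include <stdint.h>

// Illustrative C-attribute equivalent of the ldscript reservation:
static uint8_t mode_memory[138 * 1024] __attribute__((aligned(32 * 1024)));
```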

  • How many (maximum) samples are shown on the OLED at a time?

Currently none :slight_smile: There is a scope for the display, but not the LA. It would be cute, but I guess my general feeling is that software (or the VT100 LA I started as a proof of concept at one point) is a nicer way to interact with the data. Especially because we only have one button for a physical interface.

  • Is there any interest in having the LA store the samples directly to the NAND (for huge traces)?

I would guess minimal. There is also a sigrok-compatible firmware for the Bus Pirate that does continuous sampling by shoving samples down USB as fast as possible. It’s ported from the pico-la-sigrok project on GitHub. It can sustain ~3 MSPS; triggers are done in sigrok.

This is one of the things I hope to relocate into a separate firmware and jump to via a second-stage bootloader. The overhead of the Bus Pirate firmware is too much for a project like pico-la-sigrok that tests the limits of the RP2040 hardware.

2 Likes

OK, I took the feedback from Alpha APIv1, and pushed a significant API update to PR78. This is still untested code … so don’t run it unless you want to find and fix bugs. Its purpose is to enable discussion of the new API.

What’s needed

Review of mem.h, which is where the API is defined.

  • Is this API usable?
  • Do you have better names than Temporary and LongLived to describe the two types of allocations?
  • Does its intended functionality cover the intended use cases?
  • Any other thoughts?
  • [optional] Did you review the implementation and notice any bugs?

Thanks!

The comments in mem.h make sense to me, but I do not have a natural talent for pointers.

The functions are prefixed with bigbuffer, but they aren’t related to the “big buffer” we use for the 128K ring buffer, are they? Just checking my understanding.

Temporary is a good word; long lived is a bit wordy :slight_smile: Short and long already have memory-related meanings. That’s a good question.

1 Like

The functions are a 100% replacement for allocating/freeing the 128k ring buffer.


As for the 128k contiguous / 32k-aligned ring buffer...

I’m not 100% clear on why the ring buffer DMA was architected to use 4x DMA channels that use CHAIN_TO amongst themselves, vs. using Vectored IO (aka scatter-gather).

In Vectored IO, one control DMA channel configures a second bulk DMA channel to do the actual data transfers… allowing the bulk buffers to be essentially arbitrarily placed. For perpetual transfer, the control DMA channel’s buffer could loop / autowrap.

Using Vectored IO DMA would avoid the need for the 128k buffer to be contiguous, and also remove the 32k alignment requirement (dropping it to 4 bytes, I believe?).

Long-term, I would recommend moving to a Vectored IO based architecture, to reduce the memory alignment and remove the contiguous-allocation requirements. It might also reduce RAM requirements by half: if the CPU can’t keep up anyway, there’s not much purpose in having more than two buffers outstanding…


1 Like

OK, I read back and understand a bit better now. If I have a usage example, I think I can implement it in all the modes fairly easily.

Hi Ian,

N.B. – this is entirely untested code (although each commit should compile)

If I have a usage example I think I can implement it in all the modes fairly easily.

Great! The Alpha APIv2 is staying substantially unchanged.
I’ve pushed more commits to this branch, some of which transition a file from the legacy “single allocation only” API to this Alpha APIv2. I think you’ll agree that the changes are straightforward.
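To give a flavor of such a transition (the names and signature below are hypothetical; the real prototypes live in mem.h):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical prototype for illustration; see mem.h for the real one.
void *BigBuffer_AllocTemporary(size_t size, size_t alignment);

// Before: RAM held for the firmware's entire lifetime --
//     static uint8_t scratch[16 * 1024];
// After: borrowed only while the mode is active --
static uint8_t *scratch;

static bool example_setup(void) {
    scratch = BigBuffer_AllocTemporary(16 * 1024, 4);
    return scratch != NULL;
}
```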

I’ve also fleshed out significant checks for invariants, which should catch many bugs early. These can be removed from non-debug builds easily.

Next up is to either add a mode whose sole purpose is to test the allocation subsystem, or use some sort of test wrapper / framework to unit-test the new BigBuffer_* API.

1 Like

I was studying the protocol buffer and it seems pretty memory-hungry. This will come in handy there.

1 Like

The current BigBuffer API implementation is dirt-simple … It only tracks a single “free” buffer area (between short-lived allocations from the bottom, and long-lived allocations from the top), and allocates from that one region. Thus, if the sequence is:

  • A = BigBuffer_AllocTemporary(1K…)
  • B = BigBuffer_AllocTemporary(1K…)
  • BigBuffer_FreeTemporary(A…)

Then the current implementation will not consider A as part of the allocatable space until B (and all other later temporary allocations) are also freed. This was my attempt to follow the mantra:
Everything should be made as simple as possible, and not simpler.
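In sketch form, the whole scheme is roughly this (illustrative only, not the PR78 code):

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t big_buf[128 * 1024];
static size_t  low;                    // temporaries grow up from here
static size_t  high = sizeof(big_buf); // long-lived grow down from here

static void *alloc_temporary_sketch(size_t n) {
    if (high - low < n) return NULL;   // only the middle region is tracked
    void *p = &big_buf[low];
    low += n;
    return p;
}

static void *alloc_long_lived_sketch(size_t n) {
    if (high - low < n) return NULL;
    high -= n;
    return &big_buf[high];
}

// Freeing A (below B) cannot move `low` back past B, so A's space only
// becomes allocatable again once B, and everything above it, is freed.
```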

Interestingly, C11 provides aligned_alloc(), whose allocations can simply be released via free(). This should “just work” if the current BigBuffer memory were released for use by the heap, at least until the heap becomes fragmented. Then again, the current BigBuffer API is too simple to handle fragmentation at all (see above), so it’s not clear anything is actually lost by doing this.
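For reference, the C11 call in question (note that C11 requires the size to be a multiple of the alignment, which 128K at 32K satisfies):

```c
#include <stdlib.h>

void example(void) {
    // 32K-aligned, 128K allocation: the LA's worst-case request.
    void *ring = aligned_alloc(32 * 1024, 128 * 1024);
    if (ring) {
        /* ... capture ... */
        free(ring);  // plain free() releases it; no special aligned-free
    }
}
```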

I may just look into this, if/when the current solution becomes insufficient.