Status Bar Corruption -- Cause Found?

henrygab · October 12, 2024, 9:10pm

I think I found two reasons for the status bar occasionally getting corrupted?

1. Both cores updating same buffer

If both cores are executing the ui_statusbar_update() function at the same time, this can lead to corruption of the status bar contents.

Here’s a copy of the details in github issue #126:

this is clearly possible in the code...

The function ui_statusbar_update() is called from four locations:

Core1: when lcd_update_request was set to true
Core0: when configuration file exists, at terminal connect
Core0: when user configures to use ANSI toolbar
Core0: when cls command is handled

The primary location where the status bar contents are
updated is from Core1, when the flag lcd_update_request
is set to true. This flag is set to true in all the
following scenarios:

Inter-core message BP_ICM_ENABLE_LCD_UPDATES processed
Inter-core message BP_ICM_FORCE_LCD_UPDATE processed
Every 500ms by the lcd_timer_callback() function

However, the function is ALSO called from Core0 in
the above three locations.

Therefore, if Core0 is handling the user command cls, and
Core1 simultaneously is looping just after the 500ms timer
fired, both cores could call into the function simultaneously.

2. Buffer can be modified while it’s in use

As a second issue, after preparing the status bar with ui_statusbar_update(), Core1 will eventually handle actually sending the updated status bar characters to the terminal. Nothing prevent Core0 from causing the statusbar to be updated AGAIN, even in the midst of Core1 sending the buffer across the USB port.

Early thoughts on a fix

The simplest fix is to have core1 be the only core which updates the buffer.

Perhaps, if the ui_statusbar_update() function is called from Core0, it should instead send inter-core message BP_ICM_FORCE_LCD_UPDATE?

void ui_statusbar_update(uint32_t update_flags) {
    // new lines for the function...
    uint core = get_core_num();
    if (core == 0) {
        icm_core0_send_message_synchronous(BP_ICM_FORCE_LCD_UPDATE);
        return;
    }
    // ... and then the existing functionality.

This would appear to cause UI_UPDATE_ALL, which seems functionally equivalent to the existing code, except that it also re-enables the LCD refresh IRQ. So, maybe adding a new ICM_ message specifically for ensuring the function is handled in Core1?

Can anyone check my logic on this being a potential cause of the status bar corruption?

ian · October 14, 2024, 11:32am

Untangling that is part of a couple other things.

When adding the splash screen I noted that the first system monitor update should probably be moved to core1 and happen after splash so the system can be responsive even while splash is displayed and before the normal backgound is drawn.
There is a relatively new folder called toolbars where the logic command VT100 graph thing lives. Eventually, I plan to move the status bar there as well. Currently the logic bar adjusts position based on the presence of the status bar. This is clunky and at the point of adding a third toolbar will become unusable. I envision some kind of centralized method to register and position toolbars, and coordinate their output (and eventually input, for tab between bars). In fact, the command line area itself could be treated as a toolbar, which swerves into the big input parsing cleanup coming soon.

Corruption is almost inevitable in the way the statusbar is currently handled. My best-effort hack is to pack the full status bar update in a buffer, write that all at once, and hope for the best. I haven’t had terminal corruption since this change, but it is not just possible it is inevitable. When other processes in the Bus Pirate (say, the progress indicator on the flash app) and the status bar are running, collisions will occasionally happen.

How did early terminal systems guard against collisions? I can see a few scenarios and none of them are particular appetizing at first glance:

Set a minimum “time since last character sent” before dumping the status bar. This is another close your eyes and hope approach with extra steps.
Use a spin lock or mutex when sending VT100. This would probably involve a centralized library for sending VT100, instead of the ad-hoc send-what-you-want way we do it now. I feel this is probably the best of the options, but is a level of professionalization I’m not sure I want get into any time soon. Maybe there is an existing small library (eg not ncursus).
A variation to the above where toolbars cooperatively multi-task and must complete all VT100 stuff before returning control to the next toolbar. This would I guess have some internal cursor tracking and use higher level positioning function, even if VT100 can still be freely written however developers want.

I hope I didn’t hurt anyone with the professionalism of my diagram.

henrygab · October 14, 2024, 4:57pm

Today, the only cause of corruption is because both cores might be updating the buffer at the same time.

This is easily fixed (e.g., ICM message, if needing to wait for completion, or just setting a variable to have Core1 update it next loop).

Do you have a preference?

Do not fix it.
Fix by setting a variable to update by Core1 without actually immediately updating UI)
Fix by using a new BP_ICM_ message, allowing Core0 to wait for Core1 to complete the UI update

ian · October 14, 2024, 6:10pm

The core 1 stuff is coop multitasking, but core 0 sometimes shoves vt100 stuff into the queue too. There’s no lock to prevent them from interrupting each other’s escape sequences. Which I assume is why we mostly don’t notice when it happens, because invalid vt100 is discarded.

Happy if it’s easier than that.

henrygab · October 14, 2024, 9:38pm

I’m glad we agree this is broken.

I’m thinking that option #2 is easiest, and resolves the issue fully, as only Core1 would ever be using that buffer at any time.

My dev box is unavailable for at least a couple days. I’ll loop back to this when it’s available again.