Storage architecture discussion

henrygab · October 5, 2024, 6:48pm

Current Design Target

Four NAND Volume States

The state of the NAND volume may exist in one of the following four states:

typedef enum _nand_volume_state_t {
    NAND_VOLUME_STATE_EJECTED = 0,
    NAND_VOLUME_STATE_SHARED_READONLY,
    NAND_VOLUME_STATE_FW_EXCLUSIVE,
    NAND_VOLUME_STATE_HOST_EXCLUSIVE,
} nand_volume_state_t;

In addition, there is a MediaChangeNotification flag, which stores whether the host must see a sequence of Sense/ASC/ASCQ errors when it tries to access the volume.

State Transition Table

from \ to	`ejected`	`shared R/O`	`FW Exc`	`Host Exc`
`ejected`	`Y`	`M`	`Y`	`M`
`shared R/O`	`Y`	`Y`	`Y`	`M`
`FW Exc`	`Y`	`M`	`Y`	`M`
`Host Exc`	`Y`	`X`	`Y`	`Y`

Where M indicates allowed, but must set MediaChangeNotification flag.
Where X may not be permitted … to be decided later.

Main Transition Events

Transition to NAND_VOLUME_STATE_FW_EXCLUSIVE, unless media is ejected (e.g., in process of formatting, mounting, …):

nand/nand/ftl_diskio.c / diskio_write()

When calling f_close(), check if the handle being closed is a write-capable handle. If so, and this is the final write-capable handle being closed:

Assert state is NAND_VOLUME_STATE_FW_EXCLUSIVE and transition to state NAND_VOLUME_STATE_SHARED_RO

Non-FileHandle Events

The following FW APIs do not use a file handle, but may change the media state. Therefore, they should set the MediaChangeNotification flag explicitly on success. Note that it’s safe to set this flag even if the host is not permitted to access the volume:

f_mkdir()
f_unlink()
f_rename()
f_chmod()
f_setlabel()

Events to be specially handled

The following change the volume state at a fundamental level. Transition to NAND_VOLUME_STATE_SHARED_RO AND set MediaChangeNotification:

f_mkfs()
f_mount()
f_fdisk()

Assertions to help validate correctness

The following APIs each require the use of a write-capable file handle. To help catch any missed edge cases, assert that state ends up as NAND_VOLUME_STATE_FW_EXCLUSIVE
for successful calls to:

f_write()
f_truncate()
f_sync()
f_forward()
f_expand()
f_putc()
f_puts()
f_printf()

It seems this is doable, and that it will require careful review to ensure all edge cases are caught. However, the user experience should be quite friendly.

thestumbler · October 8, 2024, 8:38am

Not following this closely, but I realize that MicroPython must face a similar problem. You can see the flash drive from your PC and your device scripts can write to the flash disk as well.

Has anyone looked at how they do it?

grymoire · October 8, 2024, 11:53am

I’m not sure. I was looking at MicroPython’s documentation and it looks like pybpard.py 's function is to copy the file onto the processor and then ruin the code. In other words, the host determines when the code is executed.

But then I read that if boot.py/main.py exists, the code will autorun. Is shared file access then not allowed?

[Edited] CircuitPython does have a shared file system, but for the processor to write to the file system, it has to be Read Only mounted on the host.

ian · October 8, 2024, 1:33pm

Looked at Henry’s draft pull. It’s quite a lot of changes. I’ll have to test it in real life to understand everything that is going on.

I have had occasional non-responsive start up with some of the most recent disk changes. When the dust settles I’ll have a good look at that.

henrygab · October 8, 2024, 4:11pm

In addition, although CircuitPython exposes the volume as R/W to the host, if the host writes any file, the board will reboot a moment later. This is how Adafruit prevents the currently running code from being corrupted. This method was rejected (and rightly so) for the BP5.

henrygab · October 8, 2024, 4:30pm

@ian … my draft PR is NOT intended to be merged. It won’t work as-is. However, here’s a summary of the changes I am attempting to work into this.

click to expand

MCN == Media Change Notification == sending a sequence of Sense/ASC/ASCQ errors to the host, so it flushes any cache and updates its view of the file system.

My review exposed a number of cases where I do not think the MCN was being properly handled. For example, when a media change was requested, READ and WRITE commands seem to still have been allowed?

I really need the RTT integration, so I can add lots of output on specific debug channels…

My general overview:

There is now an enum for the NAND volume state:
a. Ejected – Host and firmware both should get error codes indicating the media is ejected
b. Shared RO – Host sees the volume as read-only. Firmware sees the volume as read/write … but has made no changes to the volume since the last MCN was indicated
c. Firmware-Exclusive – Host sees media as ejected; Firmware sees the volume as read/write
d. Host-Exclusive – Host sees media as R/W; Firmware sees the volume as ejected (no current code paths to reach this state).
Asserting that all FastFAT operations are occuring ONLY from Core0. This will help prevent the need for additional synchronization primitives, and generally simplify the code.
Asserting that all USB mass storage requests are occuring ONLY
from Core1.
Only Core0 can modify the NAND volume state.
MCN state is via an enumeration.
a. Enumeration with value zero means no interference with command results due to MCN.
b. Other enumerations are listed in reverse order of intended reporting. Thus, the final MCN Sense/ASC/ASCQ before normal command processing resumes would have value 1, and higher valued enums would be the errors reported first.
c. Core1 (USB MS requests) can only reduce the enum by one (min: zero).
d. Core0 can only set the enum to the maximum value.
e. In combination, these ensure that all the necessary error codes get reported to the host (state machine of the Sense/ASC/ASCQ to force detection of the media change).
A common function is used for all USB mass storage commands (except INQUIRY), which sets error codes for media being ejected for the host, and also for handling the error code state machine for media change notifications.
A common function is used for setting the NAND volume state. This function allows forcing the MCN event to occur, but is primarily designed to hide the complexity of determining which NAND volume state changes require an MCN event occur.

My draft PR does NOT currently work. Here’s what I’ve tried to implement thus far:

click to expand

By default, volume should be in NAND_VOLUME_STATE_SHARED_RO.
Firmware should be able to write to the volume in this state.
As soon as the firmware writes to the volume, it also transitions the state to NAND_VOLUME_STATE_FW_EXCLUSIVE. In this state, the USB mass storage commands should report ejected media to the host.
When f_close() is called, the code checks is there are any remaining handles with write permission. If not, and the state is NAND_VOLUME_STATE_FW_EXCLUSIVE, the volume should be reverting to NAND_VOLUME_STATE_SHARED_RO (because the firmware isn’t writing to the media anymore).

The beauty of having the transition delayed until an actual write to the media is that it captures any modification of the NAND.

One downside of the FastFAT library is that the functions have many, many exit points. This makes it very difficult to perform any post-processing without wrapping the whole function.

One commit’s purpose was to just wrap the FastFAT API. In the wrapper, I added the assertions that FastFAT occurs only from Core0, and assertions that a function (if it inherently modifies the NAND) transitions to FW_EXCLUSIVE mode on successful return.

I’ve either added an infinite loop, or an assertion is being hit. I’ve only gotten the debug probe working once (uploading binary from command line), and never got the integrated VSCode debugger working. Getting the debugger setup and configured will delay my being productive on this effort.

ian · October 8, 2024, 4:50pm

I’m aware it’s a work in progress and just had a peek. I got to the bits changing dhara related stuff and decided best to wait for you to finish updates as I’ll need to actually use it to provide useful feedback.

henrygab · December 17, 2024, 6:14am

So… I am currently getting the terminal to work through RTT. As part of that, I am adding debug statements and assertions to verify I understand the current architecture … normal development stuff. But I keep hitting assertions in the spi_busy_wait() function.

So I start digging… Too soon to tell, but if there’s somewhere not properly using the SPI mutex, that could cause corruption of the NAND. I’m kind of hoping this pans out, and finds a really terrible bug…

henrygab · December 17, 2024, 6:14am

Promising:

!!!! SPI mutex release from core 0 (id 0), WRONG OWNER -1     (line  503  file src/nand/spi_nand.c)

What this means: Someone is releasing the mutex, when there is no owner. Typically this means some other code path already released the mutex … a double-release (e.g., similar to a double-free of memory). This suggests there is a ticking time bomb…

henrygab · December 17, 2024, 6:15am

Helper function is making it harder to pinpoint the real culprit.

csel_select() acquires the mutex
csel_deselect() releases the mutex

So looking where the release might be called, without a prior acquire…

`spi_nand_init()` starts off by unconditionally calling `csel_deselect()` (!!!)

This seems improper …

… called anytime disk_initialize() is called …
… called from mount_volume() …
… called from f_mount(), f_chdir(), f_getcwd(), f_opendir(), f_stat(), f_getfree(), f_unlink(), f_mkdir(), f_rename(), f_chmod(), f_utime(), f_getlabel(), f_setlabel(), …

henrygab · December 17, 2024, 6:16am

So, here’s one code path that could currently occur…

Core1: Core1 acquires the SPI mutex, because it’s going to be updating the LCD (e.g., status bar).
Core0: While Core1 has the mutex, Core0 (in the volume mount path) calls csel_deselect(), which calls spi_busy_wait(false)
Core0: spi_busy_wait() – the call to lock_owner_id_is_valid() returns true, because the mutex is currently owned by Core1.
Core0: spi_busy_wait() – calls mutex_exit(), which unconditionally releases the mutex, even though Core1 was the owner of the mutex.

As a result, the intended core-safe mutual exclusion method for arbitrating access to the SPI bus is rendered ineffective. I posit that this may be one of the causes of NAND corruption?

And the problem can continue to chain, so long as the Cores are actively trying to use the SPI bus. For example:

Core0: Core0 now acquires the mutex…
Core1: Eventually Core1 finishes, and releases the mutex … except it is now releasing the mutex that Core0 acquired (because its own lock was released earlier)…
etc.

Next steps is to try to confirm this with very targeted debug info … since the LCD calls this function many times a second, it quickly overloads the RTT buffer (at least when using a Pi Probe), so will need to only log when a non-owner is releasing the mutex…

henrygab · December 17, 2024, 6:25am

From spi_busy_wait(), when called with parameter false:

// the check is to protect against the first csel_deselect
// call not matched by a csel_select
if (lock_is_owner_id_valid(spi_mutex.owner)) {

Rephrasing that comment:
We have this mutual exclusion structure, and the code currently tries to release this lock, even though it’s not the owner. Rather than fixing the code to use a lock correctly, I’m going to add this check to avoid fixing the calling code.

The problem is… this breaks the basic mutual exclusion contract.

You might think this check could be made “better” by verifying the owner is the current execution unit (aka core in this case). But that would just hide corruption until later, because if the lock was held by something on the current core, that same code would expect that lock to remain, and would release that lock.

This is making me REALLY SCARED to remove this hack. It needs to happen, but … what will break in the file system as a result?

henrygab · December 17, 2024, 7:10am

OK, seems like the SPI bus arbitrarion issue might have been contained. This doesn’t mean that all the causes of storage corruption have been found, but this definitely could have been one of the causes.

If you have experienced NAND corruption, please consider trying a binary (BP5 / BP6 or BP5XL) built from PR #165.

Scenarios to test:

Boot
View file system from PC
Create / edit / modify / read files from the PC
Connect to the terminal
list files from the terminal
add/delete files from the terminal
disconnect the terminal
Create / edit / modify / read files from the PC again

If it doesn’t lock up the bus pirate … that’s a success! If you also previously experienced file system corruption, but now do not, that would also be great to know

ian · December 17, 2024, 1:06pm

It really seems like you’re on to something.

You could try slowing BP_LCD_REFRESH_RATE_MS to -2000 so the LCD related debug is less chatty.

I also noticed that void spi_busy_wait(bool enable) is defined in pirate.c as well as mutex.c.

As I understand the above, the initialization of the NAND Chip Select via diskio_initialize()->spi_nand_init()->csel_deselect() releases the mutex despite the fact we don’t own it with a csel_select() first.

I noticed there is a step after that, csel_setup() which currently does nothing. We could add the raw pin setup there instead of piggybacking on the deselect(), which takes the mutex out of the setup process.

Another thing I note is that there is a (&diskio_mutex) in nand_ftl_diskio.c. Is this still needed/warranted since we now have write lock when the terminal is open (and later, exclusive disk and terminal modes)?

void spi_busy_wait_internal(bool enable, const char *file, int line) {

    if (!enable) {

        BP_ASSERT(lock_get_caller_owner_id() == spi_mutex.owner);
        mutex_exit(&spi_mutex);

    } else {

        mutex_enter_blocking(&spi_mutex);
        BP_ASSERT(lock_get_caller_owner_id() == spi_mutex.owner);
    }
}

Is this just change just for debug output and testing the error condition? I’m trying to understand how it avoids the issue of wrong core mutex release, or perhaps that now the purpose.

henrygab · December 17, 2024, 4:28pm

Ian,

Thanks for the review! Answers inline…

On my machine, spi_busy_wait(bool enable) is declared in pirate.h. It used to be a function, and is now a mutex to allow tracing owners. I don’t see this defined in mutex.c. Can you point to the source you’re seeing this in on Github?

Correct! Exclusive access to the SPI bus has been constantly broken by the NAND code!

FLASH_STORAGE_CS is set to 1 via storage_init(). It must stay at 1 except during the time the mutex is acquired, as otherwise the SPI bus is potentially in use by another peripheral. I think the current code in the PR is doing the right thing?

Good question … but shouldn’t hold up this PR. I’ve opened issue #166 so we don’t lose this question / potential improvement. (That said, my gut says we still need it…)

Not just for debug output. If an exclusive access mechanism is being misused, this is a serious code issue. This deserves to hard_assert(). The BP_ASSERT() ensures both that the owner is set as expected (during acquire), and then relies on that behavior in the release to verify the same owner is releasing the mutex. This should remain in the main firmware, always-enabled, because finding these bugs is exceptionally taxing.

Let me know if there are additional questions?

ian · December 17, 2024, 4:57pm

I understand what you’re getting at, and it does make it idiot/bug proof. I don’t believe the second core is active during the pin configuration phase of the start up sequence though, so if abusing csdeselect is causing issues we can avoid it with careful startup sequencing.

I’m having some doubts though, so let me confirm my thinking with a review of the code.

mbrugman · December 17, 2024, 7:59pm

Ok, trying this one
(TL;DR - it didn’t lock up the BP, and worked more-or-less as I would have expected)

Edit: I haven’t had any filesystem issues since the last major work was done in that area

Results and kernel log at each step below:

1. Boot

[ 8768.328593] sd 1:0:0:0: Attached scsi generic sg1 type 0
[ 8768.336851] sd 1:0:0:0: [sdb] 47824 2048-byte logical blocks: (97.9 MB/93.4 MiB)
[ 8768.337507] sd 1:0:0:0: [sdb] Write Protect is off
[ 8768.337510] sd 1:0:0:0: [sdb] Mode Sense: 03 00 00 00
[ 8768.338208] sd 1:0:0:0: [sdb] No Caching mode page found
[ 8768.338220] sd 1:0:0:0: [sdb] Assuming drive cache: write through
[ 8768.364031]  sdb: sdb1
[ 8768.364323] sd 1:0:0:0: [sdb] Attached SCSI removable disk

BTW - the Xmas theme is cute

2. View file system from PC

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ ll
total 16432
-rw-r--r-- 1 matty matty      16 Dec 31  2019 BPBINMOD.BP
-rw-r--r-- 1 matty matty      82 Dec 31  2019 BPSPI.BP
-rw-r--r-- 1 matty matty      96 Dec 31  2019 BPUART.BP
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 DEV.BIN
-rw-r--r-- 1 matty matty     403 Oct  5 10:20 dir.txt
drwxr-xr-x 2 matty matty    8192 Oct  3 13:44 stuff
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 TEST.BIN
-rw-r--r-- 1 matty matty      90 Dec 31  2019 UGLITCH.BP

Also looked with GUI file explorer

3. Create/ edit / modify / read files from the PC

Checked directory:

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ ll
total 16432
-rw-r--r-- 1 matty matty      16 Dec 31  2019 BPBINMOD.BP
-rw-r--r-- 1 matty matty      82 Dec 31  2019 BPSPI.BP
-rw-r--r-- 1 matty matty      96 Dec 31  2019 BPUART.BP
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 DEV.BIN
-rw-r--r-- 1 matty matty     403 Oct  5 10:20 dir.txt
drwxr-xr-x 2 matty matty    8192 Oct  3 13:44 stuff
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 TEST.BIN
-rw-r--r-- 1 matty matty      90 Dec 31  2019 UGLITCH.BP

Reading a file:

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ cat UGLITCH.BP           
{
"trigger": 13,
"delay": 80,
"wander": 3,
"recycle": 10,
"failchar": 80,
"retries": 100
}

Create a file (read contents for sure):

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ ls -l > testdir.txt                      

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ cat testdir.txt  
total 16432
-rw-r--r-- 1 matty matty      16 Dec 31  2019 BPBINMOD.BP
-rw-r--r-- 1 matty matty      82 Dec 31  2019 BPSPI.BP
-rw-r--r-- 1 matty matty      96 Dec 31  2019 BPUART.BP
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 DEV.BIN
-rw-r--r-- 1 matty matty     403 Oct  5 10:20 dir.txt
drwxr-xr-x 2 matty matty    8192 Oct  3 13:44 stuff
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 TEST.BIN
-rw-r--r-- 1 matty matty       0 Dec 17 13:32 testdir.txt
-rw-r--r-- 1 matty matty      90 Dec 31  2019 UGLITCH.BP

Modify a file (and re-read contents):

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ vi testdir.txt   

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ cat testdir.txt
total 16432
-rw-r--r-- 1 matty matty      16 Dec 31  2019 BPBINMOD.BP
-rw-r--r-- 1 matty matty      82 Dec 31  2019 BPSPI.BP
-rw-r--r-- 1 matty matty      96 Dec 31  2019 BPUART.BP
<<< DELETED 3 LINES >>>
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 TEST.BIN
-rw-r--r-- 1 matty matty       0 Dec 17 13:32 testdir.txt
-rw-r--r-- 1 matty matty      90 Dec 31  2019 UGLITCH.BP

Delete a file and recheck directory:

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ rm BPSPI.BP 

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ ll                 
total 16432
-rw-r--r-- 1 matty matty      16 Dec 31  2019 BPBINMOD.BP
-rw-r--r-- 1 matty matty      96 Dec 31  2019 BPUART.BP
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 DEV.BIN
-rw-r--r-- 1 matty matty     403 Oct  5 10:20 dir.txt
drwxr-xr-x 2 matty matty    8192 Oct  3 13:44 stuff
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 TEST.BIN
-rw-r--r-- 1 matty matty     375 Dec 17 13:33 testdir.txt
-rw-r--r-- 1 matty matty      90 Dec 31  2019 UGLITCH.BP

4. Connect to the terminal

Kernel log:

[ 9363.860255] sdb: detected capacity change from 191296 to 0
[ 9363.882856] sd 1:0:0:0: [sdb] tag#0 access beyond end of device
[ 9363.882874] I/O error, dev sdb, sector 252 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 9363.882884] Buffer I/O error on dev sdb1, logical block 0, lost sync page write
[ 9365.978135] sd 1:0:0:0: [sdb] 47824 2048-byte logical blocks: (97.9 MB/93.4 MiB)
[ 9365.979134] sd 1:0:0:0: [sdb] Write Protect is on
[ 9365.979144] sd 1:0:0:0: [sdb] Mode Sense: 03 00 80 00
[ 9365.981677] sdb: detected capacity change from 0 to 191296
[ 9365.991361]  sdb: sdb1

5. list files from the terminal

Listing files (in minicom):

Welcome to minicom 2.8

OPTIONS: I18n 
Port /dev/ttyACM0, 13:28:31

Press CTRL-A Z for help on special keys



VT100 compatible color mode? (Y/n)> n
HiZ> ls
        16 bpbinmod.bp
        90 uglitch.bp
   8388608 dev.bin
   8388608 test.bin
   <DIR>   stuff
       403 dir.txt
        96 bpuart.bp
   <DIR>   trash-~1
       375 testdir.txt
2 dirs, 7 files

(Note the addition of trash directory from the host)

6. add/delete files from the terminal

Add file in minicom:

HiZ> mkdir testdir

Add file Kernel log:

[ 9754.413354] sd 1:0:0:0: [sdb] 47824 2048-byte logical blocks: (97.9 MB/93.4 MiB)
[ 9754.425540]  sdb: sdb1

Delete file in minicom:

HiZ> rm bpbinmod.bp

Delete file in Kernal log:

[ 9559.710165] sd 1:0:0:0: [sdb] 47824 2048-byte logical blocks: (97.9 MB/93.4 MiB)

7. disconnect the terminal

Closed minicom, nothing in kernel log. Volume has been unmounted in the host. Remounted:

┌──(matty💊s76)-[~]
└─$ sudo mount /dev/sdb1 ./tmp            
[sudo] password for matty: 
mount: /home/matty/tmp: WARNING: source write-protected, mounted read-only.

So, to continue on, I unmounted and ejected. Unplugged BP5 and remounted.

I did properly unmount and eject, but kernel log shows dirty bit is set:

[10103.706326] sd 1:0:0:0: Attached scsi generic sg1 type 0
[10103.714904] sd 1:0:0:0: [sdb] 47824 2048-byte logical blocks: (97.9 MB/93.4 MiB)
[10103.715704] sd 1:0:0:0: [sdb] Write Protect is off
[10103.715715] sd 1:0:0:0: [sdb] Mode Sense: 03 00 00 00
[10103.716546] sd 1:0:0:0: [sdb] No Caching mode page found
[10103.716557] sd 1:0:0:0: [sdb] Assuming drive cache: write through
[10103.737557]  sdb: sdb1
[10103.737696] sd 1:0:0:0: [sdb] Attached SCSI removable disk
[10107.099478] FAT-fs (sdb1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.

8. Create / edit / modify / read files from the PC again

Similar to step 3 above.
Checked directory:

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ ll   
total 16432
-rw-r--r-- 1 matty matty      96 Dec 31  2019 BPUART.BP
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 DEV.BIN
-rw-r--r-- 1 matty matty     403 Oct  5 10:20 dir.txt
drwxr-xr-x 2 matty matty    8192 Oct  3 13:44 stuff
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 TEST.BIN
drwxr-xr-x 2 matty matty    8192 Dec 31  2019 TESTDIR
-rw-r--r-- 1 matty matty     375 Dec 17 13:33 testdir.txt
-rw-r--r-- 1 matty matty      90 Dec 31  2019 UGLITCH.BP

Reading a file:

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ cat BPUART.BP            
{
"baudrate": 115200,
"data_bits": 8,
"stop_bits": 1,
"parity": 0,
"flow_ctrl": 0,
"invert": 0
}

Create a file and read contents:

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ echo "some file contents" >> somefile.txt              

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ cat somefile.txt 
some file contents

Modify file and read contents:

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ echo "Adding MOAR" >> somefile.txt

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ cat somefile.txt 
some file contents
Adding MOAR

Remove file (directory); before & after listings:

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ ls -l
total 16440
-rw-r--r-- 1 matty matty      96 Dec 31  2019 BPUART.BP
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 DEV.BIN
-rw-r--r-- 1 matty matty     403 Oct  5 10:20 dir.txt
-rw-r--r-- 1 matty matty      31 Dec 17 13:55 somefile.txt
drwxr-xr-x 2 matty matty    8192 Oct  3 13:44 stuff
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 TEST.BIN
drwxr-xr-x 2 matty matty    8192 Dec 31  2019 TESTDIR
-rw-r--r-- 1 matty matty     375 Dec 17 13:33 testdir.txt
-rw-r--r-- 1 matty matty      90 Dec 31  2019 UGLITCH.BP

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ rmdir TESTDIR

┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ ls -l
total 16432
-rw-r--r-- 1 matty matty      96 Dec 31  2019 BPUART.BP
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 DEV.BIN
-rw-r--r-- 1 matty matty     403 Oct  5 10:20 dir.txt
-rw-r--r-- 1 matty matty      31 Dec 17 13:55 somefile.txt
drwxr-xr-x 2 matty matty    8192 Oct  3 13:44 stuff
-rw-r--r-- 1 matty matty 8388608 Dec 31  2019 TEST.BIN
-rw-r--r-- 1 matty matty     375 Dec 17 13:33 testdir.txt
-rw-r--r-- 1 matty matty      90 Dec 31  2019 UGLITCH.BP

And that’s it

Let me know if there’s anything else you want me to try
Matt

ian · December 17, 2024, 8:15pm

Glad you like the holiday theme The Santa hat on VLC media player was my inspiration. But I also considered the vintage twinkle light effects Technology Connections has been on about for years, maybe next time.

It will disappear in a few days

henrygab · December 17, 2024, 10:09pm

Wow, thank you Matt! I really appreciate having a second view of this. It sounds like things all worked, except that the volume failed to cleanly unmount at step #4 (when the terminal was connected). Since you fsync/umount manually, it seems this is unlikely a firmware issue; More likely some other process on the host is accessing the volume?

logs of note

mbrugman:

[ 9363.860255] sdb: detected capacity change from 191296 to 0
[ 9363.882856] sd 1:0:0:0: [sdb] tag#0 access beyond end of device
[ 9363.882874] I/O error, dev sdb, sector 252 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 9363.882884] Buffer I/O error on dev sdb1, logical block 0, lost sync page write

The above logs and explanation make it clear that the host is not cleanly dismounting the volume. The kernel log is clear that a write to LBA 0 was lost. Later, in remounting at step 7, the “dirty” bit was noticed on the volume.

For FAT16, “dirty” could be set in two locations:

LBA 0, byte offset 0x25 (37) … officially, unused byte of the Boot Parameter Block (aka BPB)
FAT, the two most significant bits of FAT entry [1] (… ancient method … I think this was only on Win9x).

Given the kernel logged the loss of LBA 0, that’s would support the dirty bit being stored by Linux in (at least) the BPB, and explain the “dirty” flag existence when the volume was remounted at step 7.

Given the high use of the SPI bus for LCD updates, and thus the likelihood of its interfering with the NAND, this looks promising!

To be clear … The current architecture is still going to have scenarios where the host will lose information that it has written to the volme. This PR is simply trying to fix a bug that opened a (relatively) large timing window for data corruption.

[Edit: Also, to be clear, I do not see any issues remounting in Windows when running the above scenario. So, I think this is good to go. Just awaiting @Ian’s final OK]

mbrugman · December 17, 2024, 10:14pm

I guess I meant that I issued commands to unmount, but who knows what happened under the hood?

You’re welcome - I appreciate all the work you do; the least I can do is test and document if I can

Storage architecture discussion

Current Design Target

spi_nand_init() starts off by unconditionally calling csel_deselect() (!!!)

Scenarios to test:

`spi_nand_init()` starts off by unconditionally calling `csel_deselect()` (!!!)