In addition, there is a MediaChangeNotification flag, which stores whether the host must see a sequence of Sense/ASC/ASCQ errors when it tries to access the volume.
State Transition Table
from \ to
ejected
shared R/O
FW Exc
Host Exc
ejected
Y
M
Y
M
shared R/O
Y
Y
Y
M
FW Exc
Y
M
Y
M
Host Exc
Y
X
Y
Y
Where M indicates allowed, but must set MediaChangeNotification flag.
Where X may not be permitted … to be decided later.
Main Transition Events
Transition to NAND_VOLUME_STATE_FW_EXCLUSIVE, unless media is ejected (e.g., in process of formatting, mounting, …):
nand/nand/ftl_diskio.c / diskio_write()
When calling f_close(), check if the handle being closed is a write-capable handle. If so, and this is the final write-capable handle being closed:
Assert state is NAND_VOLUME_STATE_FW_EXCLUSIVE and transition to state NAND_VOLUME_STATE_SHARED_RO
Non-FileHandle Events
The following FW APIs do not use a file handle, but may change the media state. Therefore, they should set the MediaChangeNotification flag explicitly on success. Note that it’s safe to set this flag even if the host is not permitted to access the volume:
f_mkdir()
f_unlink()
f_rename()
f_chmod()
f_setlabel()
Events to be specially handled
The following change the volume state at a fundamental level. Transition to NAND_VOLUME_STATE_SHARED_RO AND set MediaChangeNotification:
f_mkfs()
f_mount()
f_fdisk()
Assertions to help validate correctness
The following APIs each require the use of a write-capable file handle. To help catch any missed edge cases, assert that state ends up as NAND_VOLUME_STATE_FW_EXCLUSIVE
for successful calls to:
f_write()
f_truncate()
f_sync()
f_forward()
f_expand()
f_putc()
f_puts()
f_printf()
It seems this is doable, and that it will require careful review to ensure all edge cases are caught. However, the user experience should be quite friendly.
Not following this closely, but I realize that MicroPython must face a similar problem. You can see the flash drive from your PC and your device scripts can write to the flash disk as well.
I’m not sure. I was looking at MicroPython’s documentation and it looks like pybpard.py 's function is to copy the file onto the processor and then ruin the code. In other words, the host determines when the code is executed.
But then I read that if boot.py/main.py exists, the code will autorun. Is shared file access then not allowed?
In addition, although CircuitPython exposes the volume as R/W to the host, if the host writes any file, the board will reboot a moment later. This is how Adafruit prevents the currently running code from being corrupted. This method was rejected (and rightly so) for the BP5.
@ian … my draft PR is NOT intended to be merged. It won’t work as-is. However, here’s a summary of the changes I am attempting to work into this.
click to expand
MCN == Media Change Notification == sending a sequence of Sense/ASC/ASCQ errors to the host, so it flushes any cache and updates its view of the file system.
My review exposed a number of cases where I do not think the MCN was being properly handled. For example, when a media change was requested, READ and WRITE commands seem to still have been allowed?
I really need the RTT integration, so I can add lots of output on specific debug channels…
My general overview:
There is now an enum for the NAND volume state:
a. Ejected – Host and firmware both should get error codes indicating the media is ejected
b. Shared RO – Host sees the volume as read-only. Firmware sees the volume as read/write … but has made no changes to the volume since the last MCN was indicated
c. Firmware-Exclusive – Host sees media as ejected; Firmware sees the volume as read/write
d. Host-Exclusive – Host sees media as R/W; Firmware sees the volume as ejected (no current code paths to reach this state).
Asserting that all FastFAT operations are occuring ONLY from Core0. This will help prevent the need for additional synchronization primitives, and generally simplify the code.
Asserting that all USB mass storage requests are occuring ONLY
from Core1.
Only Core0 can modify the NAND volume state.
MCN state is via an enumeration.
a. Enumeration with value zero means no interference with command results due to MCN.
b. Other enumerations are listed in reverse order of intended reporting. Thus, the final MCN Sense/ASC/ASCQ before normal command processing resumes would have value 1, and higher valued enums would be the errors reported first.
c. Core1 (USB MS requests) can only reduce the enum by one (min: zero).
d. Core0 can only set the enum to the maximum value.
e. In combination, these ensure that all the necessary error codes get reported to the host (state machine of the Sense/ASC/ASCQ to force detection of the media change).
A common function is used for all USB mass storage commands (except INQUIRY), which sets error codes for media being ejected for the host, and also for handling the error code state machine for media change notifications.
A common function is used for setting the NAND volume state. This function allows forcing the MCN event to occur, but is primarily designed to hide the complexity of determining which NAND volume state changes require an MCN event occur.
My draft PR does NOT currently work. Here’s what I’ve tried to implement thus far:
click to expand
By default, volume should be in NAND_VOLUME_STATE_SHARED_RO.
Firmware should be able to write to the volume in this state.
As soon as the firmware writes to the volume, it also transitions the state to NAND_VOLUME_STATE_FW_EXCLUSIVE. In this state, the USB mass storage commands should report ejected media to the host.
When f_close() is called, the code checks is there are any remaining handles with write permission. If not, and the state is NAND_VOLUME_STATE_FW_EXCLUSIVE, the volume should be reverting to NAND_VOLUME_STATE_SHARED_RO (because the firmware isn’t writing to the media anymore).
The beauty of having the transition delayed until an actual write to the media is that it captures any modification of the NAND.
One downside of the FastFAT library is that the functions have many, many exit points. This makes it very difficult to perform any post-processing without wrapping the whole function.
One commit’s purpose was to just wrap the FastFAT API. In the wrapper, I added the assertions that FastFAT occurs only from Core0, and assertions that a function (if it inherently modifies the NAND) transitions to FW_EXCLUSIVE mode on successful return.
I’ve either added an infinite loop, or an assertion is being hit. I’ve only gotten the debug probe working once (uploading binary from command line), and never got the integrated VSCode debugger working. Getting the debugger setup and configured will delay my being productive on this effort.
I’m aware it’s a work in progress and just had a peek. I got to the bits changing dhara related stuff and decided best to wait for you to finish updates as I’ll need to actually use it to provide useful feedback.
So… I am currently getting the terminal to work through RTT. As part of that, I am adding debug statements and assertions to verify I understand the current architecture … normal development stuff. But I keep hitting assertions in the spi_busy_wait() function.
So I start digging… Too soon to tell, but if there’s somewhere not properly using the SPI mutex, that could cause corruption of the NAND. I’m kind of hoping this pans out, and finds a really terrible bug…
!!!! SPI mutex release from core 0 (id 0), WRONG OWNER -1 (line 503 file src/nand/spi_nand.c)
What this means: Someone is releasing the mutex, when there is no owner. Typically this means some other code path already released the mutex … a double-release (e.g., similar to a double-free of memory). This suggests there is a ticking time bomb…
So, here’s one code path that could currently occur…
Core1: Core1 acquires the SPI mutex, because it’s going to be updating the LCD (e.g., status bar).
Core0: While Core1 has the mutex, Core0 (in the volume mount path) calls csel_deselect(), which calls spi_busy_wait(false)
Core0: spi_busy_wait() – the call to lock_owner_id_is_valid() returns true, because the mutex is currently owned by Core1.
Core0: spi_busy_wait() – calls mutex_exit(), which unconditionally releases the mutex, even though Core1 was the owner of the mutex.
As a result, the intended core-safe mutual exclusion method for arbitrating access to the SPI bus is rendered ineffective. I posit that this may be one of the causes of NAND corruption?
And the problem can continue to chain, so long as the Cores are actively trying to use the SPI bus. For example:
Core0: Core0 now acquires the mutex…
Core1: Eventually Core1 finishes, and releases the mutex … except it is now releasing the mutex that Core0 acquired (because its own lock was released earlier)…
etc.
Next steps is to try to confirm this with very targeted debug info … since the LCD calls this function many times a second, it quickly overloads the RTT buffer (at least when using a Pi Probe), so will need to only log when a non-owner is releasing the mutex…
From spi_busy_wait(), when called with parameter false:
// the check is to protect against the first csel_deselect
// call not matched by a csel_select
if (lock_is_owner_id_valid(spi_mutex.owner)) {
Rephrasing that comment:
We have this mutual exclusion structure, and the code currently tries to release this lock, even though it’s not the owner. Rather than fixing the code to use a lock correctly, I’m going to add this check to avoid fixing the calling code.
The problem is… this breaks the basic mutual exclusion contract.
You might think this check could be made “better” by verifying the owner is the current execution unit (aka core in this case). But that would just hide corruption until later, because if the lock was held by something on the current core, that same code would expect that lock to remain, and would release that lock.
This is making me REALLY SCARED to remove this hack. It needs to happen, but … what will break in the file system as a result?
OK, seems like the SPI bus arbitrarion issue might have been contained. This doesn’t mean that all the causes of storage corruption have been found, but this definitely could have been one of the causes.
If you have experienced NAND corruption, please consider trying a binary (BP5 / BP6 or BP5XL) built from PR #165.
Scenarios to test:
Boot
View file system from PC
Create / edit / modify / read files from the PC
Connect to the terminal
list files from the terminal
add/delete files from the terminal
disconnect the terminal
Create / edit / modify / read files from the PC again
If it doesn’t lock up the bus pirate … that’s a success! If you also previously experienced file system corruption, but now do not, that would also be great to know
You could try slowing BP_LCD_REFRESH_RATE_MS to -2000 so the LCD related debug is less chatty.
I also noticed that void spi_busy_wait(bool enable) is defined in pirate.c as well as mutex.c.
As I understand the above, the initialization of the NAND Chip Select via diskio_initialize()->spi_nand_init()->csel_deselect() releases the mutex despite the fact we don’t own it with a csel_select() first.
I noticed there is a step after that, csel_setup() which currently does nothing. We could add the raw pin setup there instead of piggybacking on the deselect(), which takes the mutex out of the setup process.
Another thing I note is that there is a (&diskio_mutex) in nand_ftl_diskio.c. Is this still needed/warranted since we now have write lock when the terminal is open (and later, exclusive disk and terminal modes)?
Is this just change just for debug output and testing the error condition? I’m trying to understand how it avoids the issue of wrong core mutex release, or perhaps that now the purpose.
On my machine, spi_busy_wait(bool enable) is declared in pirate.h. It used to be a function, and is now a mutex to allow tracing owners. I don’t see this defined in mutex.c. Can you point to the source you’re seeing this in on Github?
Correct! Exclusive access to the SPI bus has been constantly broken by the NAND code!
FLASH_STORAGE_CS is set to 1 via storage_init(). It must stay at 1 except during the time the mutex is acquired, as otherwise the SPI bus is potentially in use by another peripheral. I think the current code in the PR is doing the right thing?
Good question … but shouldn’t hold up this PR. I’ve opened issue #166 so we don’t lose this question / potential improvement. (That said, my gut says we still need it…)
Not just for debug output. If an exclusive access mechanism is being misused, this is a serious code issue. This deserves to hard_assert(). The BP_ASSERT() ensures both that the owner is set as expected (during acquire), and then relies on that behavior in the release to verify the same owner is releasing the mutex. This should remain in the main firmware, always-enabled, because finding these bugs is exceptionally taxing.
I understand what you’re getting at, and it does make it idiot/bug proof. I don’t believe the second core is active during the pin configuration phase of the start up sequence though, so if abusing csdeselect is causing issues we can avoid it with careful startup sequencing.
I’m having some doubts though, so let me confirm my thinking with a review of the code.
┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ ls -l > testdir.txt
┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ cat testdir.txt
total 16432
-rw-r--r-- 1 matty matty 16 Dec 31 2019 BPBINMOD.BP
-rw-r--r-- 1 matty matty 82 Dec 31 2019 BPSPI.BP
-rw-r--r-- 1 matty matty 96 Dec 31 2019 BPUART.BP
-rw-r--r-- 1 matty matty 8388608 Dec 31 2019 DEV.BIN
-rw-r--r-- 1 matty matty 403 Oct 5 10:20 dir.txt
drwxr-xr-x 2 matty matty 8192 Oct 3 13:44 stuff
-rw-r--r-- 1 matty matty 8388608 Dec 31 2019 TEST.BIN
-rw-r--r-- 1 matty matty 0 Dec 17 13:32 testdir.txt
-rw-r--r-- 1 matty matty 90 Dec 31 2019 UGLITCH.BP
Modify a file (and re-read contents):
┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ vi testdir.txt
┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ cat testdir.txt
total 16432
-rw-r--r-- 1 matty matty 16 Dec 31 2019 BPBINMOD.BP
-rw-r--r-- 1 matty matty 82 Dec 31 2019 BPSPI.BP
-rw-r--r-- 1 matty matty 96 Dec 31 2019 BPUART.BP
<<< DELETED 3 LINES >>>
-rw-r--r-- 1 matty matty 8388608 Dec 31 2019 TEST.BIN
-rw-r--r-- 1 matty matty 0 Dec 17 13:32 testdir.txt
-rw-r--r-- 1 matty matty 90 Dec 31 2019 UGLITCH.BP
Delete a file and recheck directory:
┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ rm BPSPI.BP
┌──(matty💊s76)-[/media/matty/5021-0000]
└─$ ll
total 16432
-rw-r--r-- 1 matty matty 16 Dec 31 2019 BPBINMOD.BP
-rw-r--r-- 1 matty matty 96 Dec 31 2019 BPUART.BP
-rw-r--r-- 1 matty matty 8388608 Dec 31 2019 DEV.BIN
-rw-r--r-- 1 matty matty 403 Oct 5 10:20 dir.txt
drwxr-xr-x 2 matty matty 8192 Oct 3 13:44 stuff
-rw-r--r-- 1 matty matty 8388608 Dec 31 2019 TEST.BIN
-rw-r--r-- 1 matty matty 375 Dec 17 13:33 testdir.txt
-rw-r--r-- 1 matty matty 90 Dec 31 2019 UGLITCH.BP
4. Connect to the terminal
Kernel log:
[ 9363.860255] sdb: detected capacity change from 191296 to 0
[ 9363.882856] sd 1:0:0:0: [sdb] tag#0 access beyond end of device
[ 9363.882874] I/O error, dev sdb, sector 252 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 9363.882884] Buffer I/O error on dev sdb1, logical block 0, lost sync page write
[ 9365.978135] sd 1:0:0:0: [sdb] 47824 2048-byte logical blocks: (97.9 MB/93.4 MiB)
[ 9365.979134] sd 1:0:0:0: [sdb] Write Protect is on
[ 9365.979144] sd 1:0:0:0: [sdb] Mode Sense: 03 00 80 00
[ 9365.981677] sdb: detected capacity change from 0 to 191296
[ 9365.991361] sdb: sdb1
5. list files from the terminal
Listing files (in minicom):
Welcome to minicom 2.8
OPTIONS: I18n
Port /dev/ttyACM0, 13:28:31
Press CTRL-A Z for help on special keys
VT100 compatible color mode? (Y/n)> n
HiZ> ls
16 bpbinmod.bp
90 uglitch.bp
8388608 dev.bin
8388608 test.bin
<DIR> stuff
403 dir.txt
96 bpuart.bp
<DIR> trash-~1
375 testdir.txt
2 dirs, 7 files
(Note the addition of trash directory from the host)
Glad you like the holiday theme The Santa hat on VLC media player was my inspiration. But I also considered the vintage twinkle light effects Technology Connections has been on about for years, maybe next time.
Wow, thank you Matt! I really appreciate having a second view of this. It sounds like things all worked, except that the volume failed to cleanly unmount at step #4 (when the terminal was connected). Since you fsync/umount manually, it seems this is unlikely a firmware issue; More likely some other process on the host is accessing the volume?
logs of note
The above logs and explanation make it clear that the host is not cleanly dismounting the volume. The kernel log is clear that a write to LBA 0 was lost. Later, in remounting at step 7, the “dirty” bit was noticed on the volume.
For FAT16, “dirty” could be set in two locations:
LBA 0, byte offset 0x25 (37) … officially, unused byte of the Boot Parameter Block (aka BPB)
FAT, the two most significant bits of FAT entry [1] (… ancient method … I think this was only on Win9x).
Given the kernel logged the loss of LBA 0, that’s would support the dirty bit being stored by Linux in (at least) the BPB, and explain the “dirty” flag existence when the volume was remounted at step 7.
Given the high use of the SPI bus for LCD updates, and thus the likelihood of its interfering with the NAND, this looks promising!
To be clear … The current architecture is still going to have scenarios where the host will lose information that it has written to the volme. This PR is simply trying to fix a bug that opened a (relatively) large timing window for data corruption.
[Edit: Also, to be clear, I do not see any issues remounting in Windows when running the above scenario. So, I think this is good to go. Just awaiting @Ian’s final OK]