USB Mass storage and local FATFS conflicts

ian · April 30, 2024, 12:01pm

I created a test branch that is tracked in this forum thread. It includes several major updates, and seems to address these issues. I’ve used it on my bench for a few days and I’m happy with it. It’s such a big change it would be nice to have some confirmation it works for everyone before I push to main.

Thank you every one for your input and help on this one, it made a huge difference in getting this blocking bug fixed.

henrygab · May 21, 2024, 10:47pm

Hi Ian,

While my job is no longer as a developer, I used to be a kernel-mode dev in the Windows storage stack (disk.sys, cdrom.sys, classpnp.sys), working closely with the file systems folks. More recently, I’ve also worked with hathach to make significant improvements to the tinyuf2 bootloader / ghostfat code. And I added a decoder for the 010 Editor for FAT/exFAT. So… I have some good history in this space.

If I understand correctly, the existing (problematic) BP5 firmware exposed the mass storage device to the host (PC), while simultaneously also exposing the mass storage device to the firmware. Both the host and the firmware read the data, and generate a file system view of the data stored thereon.

As you have discovered, having this simultaneously exposed to both is going to be problematic for many reasons. The simplest solution is to choose one device to “own” the file system that’s on the mass storage device … but there are still edge cases to be cautious about.

Can you confirm that your solution is as follows?

At boot, USB MSD is exposed as R/W (read/write) to the host.
When a serial connection attaches, USB MSD ownership moves from host to firmware
When serial connection closed, USB MSD ownership moves from firmware back to host

In addition, can you confirm a bit more detail on actions taken during those transitions?

Ownership transitions from firmware → host:

(E.g, when the serial connection is closed.)

Generally, the firmware unmounts the file system, and re-exposes the MSD to the host. This is done by allowing media commands to succeed (e.g., TEST_UNIT_READY, READ, WRITE, READ_CAPACITY, etc.). However, the exposure of the MSD to the host is deferred by at least one sense code error indicating that the media is not present (“latching the error”).

Ownership transitions from host → firmware

(E.g., when a serial connection is attached)

Generally, the firmware first notifies the host that the media was suddenly removed. This includes ensuring the host receives at least one error for a TEST_UNIT_READY, READ, WRITE, etc. that indicates that the media is not present (“latching the error”). This is intended to cause the host to invalidate any cached data (emulating ejection of the media).

Next post for additional thoughts

Presuming the above is substantially correct, see my next post for additional notes / questions.

Henry

henrygab · May 21, 2024, 11:55pm

Notes and Questions:

First, we can ignore host-side caching of data that is not modified by the host, as the media ejection emulation causes the information to require re-validation. Thus, the remainder of this discussion implicitly focuses on data that is modified.

Caching by the file system / cache manager

The file system / cache manager in Windows historically cached data for up to ~30 seconds, leading to the infamous messages to re-insert USB flash memory. Current versions of Windows’ file system and cache manager are configured to not cache writes for removable USB flash, because the device can be surprise-removed at any time.

Caching by applications

Applications can (and often do) cache data when editing a file. This is a layer above the file system / cache manager, and this problem applies to any/all operating systems. Therefore, it’s important to choose an editor that does not make this problem worse.

See, as an example, Adafruit’s guide on creating and editing code when using CircuitPython. They also provide a recommended editors page which lists some better-behaved editors.

User education required

There will ALWAYS be edge cases that cannot easily be handled, if both the host and the firmware are allowed to modify the file system. That said, your solution of giving “ownership” of the MSD based on whether the serial connection exists should generally work for 99% of the cases.

However, if an application (or the OS) has edits to open files, those edits could be lost when the user opens a serial connection. Most editors allow the file to be saved when the media is re-inserted (even if a “different” piece of media). Thus, you may find this acceptable.

If you want to handle more edge cases, there are some choices that firmware can make to further reduce risk. They are not particularly difficult, but require the concious application of those principles throughout the firmware. Let me know if discussion on that topic would be of value.

phdussud · May 22, 2024, 12:09am

Hi Henry.
Your understanding is correct.
You should read the code (file msc_disk.c and storage.c) if you want to delve in the details. I think it will make the discussion more productive.

henrygab · May 22, 2024, 12:24am

Well, the thread tracking the builds does not link directly back to the branch on github. Even if manually parsing the build artifact name, it implies the branch name of test and the commit’s short id.

However, comparison of test vs. main does not appear to capture the relevant changes?

Can you suggest which commit(s) have the relevant changes relevant to swapping who the current “owner” of the MSD is?

Thank you!

phdussud · May 22, 2024, 1:31am

I believe this is the relevant checkin on the main branch
Merge pull request #29 from phdussud/main · DangerousPrototypes/BusPirate5-firmware@35b7b11 (github.com)

ian · May 22, 2024, 1:51pm

@phdussud is correct I think, the most relevant changes are in this part of that commit:

henrygab · May 24, 2024, 4:27am

Looks like I’m finally out of the “new user” lockout … (I was prevented from replying anymore on this thread).

There do appear to be potential for data corruption … or at least I’ve not yet discovered how it would be prevented in these situations.

It appears the firmware can have the file system open, even when it’s exposed as read/write to the host. This is problematic, and given how various OS work, this might not be easily mitigated.

Example:
The host can change the purpose of any cluster. For example, a sector used for file DirA\File1.txt might become free because the file is deleted, and then get re-used for a different file. If the firmware has an open read-only file handle, it may not notice this file was deleted.

It appears the host can see read-only media (and thus load the file system), even when it’s exposed to the firmware as read/write. This is problematic for substantially the same reasons as given above, only the actors swap roles.
It appears the firmware causes the media to virtually be ejected, write-protected, and then re-inserted as soon as the COM port is connected … even if the host is actively writing to the media. This is problematic and can easily cause corruption, but likely can be mitigated.

Recommended changes

First, do not allow both host and firmware to “see” the volume at the same time. FAT32 has no safe way for this to work(*).

Second, try to avoid ejecting the media when the host is actively writing. For example, delay transition from host to firmware ownership until the the host has stopped writing for at least N seconds (e.g., 15 seconds).

Track time of last host-issued write command. (very easy)
When serial connection is noticed, first check if last host-issued write command was at least N seconds ago (e.g., 15 seconds). If not, loop to allow host to finish before causing the virtual eject of the media / mounting the file system for the firmware to use.

This then leaves one major decision … should the firmware pend (or block) file operations that occur during this new state of waiting for the host to stop writing? Pros and cons to both sides…

NOTE: Still heuristic…

The above will improve the situation, based on how modern OSs attempt to limit how the cache writes to hot-pluggable media. Applications might still have unwritten edits, and there’s no official standard value to choose for N … but it will remove the largest causes of data corruption … which is that there is no safe way for both the host and firmware to simultaneously view the data, if either one is allowed to edit the data.

Conclusion

MTP was created for a good reason. It’s unfortunate that there is no shim to expose the MTP files / containers as a file system so it’s more globally accessible.

(*) More powerful devices might keep “in sync” with host changes by tracking writes to their underlying files, but writes are not guaranteed to come “in order”, and the complexity of this solution is highly error-prone.

henrygab · May 24, 2024, 4:32am

There is one other option I’ve seen… the following is substantially how CircuitPython handles it, and it removes the complexities and edge cases above, at the cost of forcing additional reboots of the firmware after files are changed by the host.

Mount as read/write on both the host and firmware.
Anytime that the host writes, firmware is paused … and then when writes have stopped for N seconds, the firmware reboots.

Let me know if either of the above are directions that you’d like to pursue.

phdussud · May 24, 2024, 6:45pm

Hi Henry! Thanks for spending some time on this issue!

Yes, it can be a problem if the other (binary) connection provokes a write to the file system. So far to my knowledge this does not happen, but this is not a guarantee. On my firmware, I made the change to unmount the firmware file system when there is no terminal connection to the device. The firmware still works.
Since the host cannot change the underlying storage, I am not sure what harm be done. This behavior is desirable for people dumping a flash memory to the firmware file system and then picking it up on the host file system view to process it.
This is a great suggestion. At the user level it means that the connection with the device could be unresponsive for a time up to the timeout amount. @ian should opine on this (small) change in user model.

phdussud · May 24, 2024, 7:17pm

I spoke too soon. It is doable but more complicated than I thought.

henrygab · May 24, 2024, 8:46pm

For case #2 (r/o @ host, r/w @ firmware), it will not “harm” the device or corrupt what the firmware sees. However, the host will not recognize changes made by the firmware, and may end up with a corrupt (r/o) view … especially where the firmware makes many edits.

The model used by CircuitPython is cleaner

Given the early stages of development, it’s probably not too late to convert to using the CircuitPython model.

Both host and firmware see a writable volume.
Firmware can write anytime, without restriction.
If the host writes to the media, then firmware is paused, firmware view of media is closed, the host is allowed to finish writing (N seconds), and then the firmware is rebooted.

This does prioritize host changes … so if both host and firmware allocated a cluster … the host will overwrite the firmware’s changes.

This is conceptually very clean, and thus a very attractive model.

At boot, prior to exposing the media to the host, certain zero-byte files are ensured to exist on the file system. This minimizes host interference:

\.fseventsd\no_log
.metadata_never_index
.Trashes

For any files that the firmware requires to be in the root directory, it is recommended to also create a zero-byte file as a placeholder.

Creating these files at boot is important because of the particularly fragile nature of the root directory, and increased data loss if the root directory’s contents get corrupted.

General rules

Regardless of the model used, firmware should avoid any interactions with files in the root directory. As one option, firmware could define a subdirectory which the firmware declares is to be used for firmware-written data, and a separate subdirectory which the firmware declares is to be used for host-written data. (And to ensure those subdirectories exists in the initial mounting process, before exposing the media to the host).

Conclusion

If this is of interest, I am open to contributing. My BP5 is shipped, although not yet received.

henrygab · May 24, 2024, 8:51pm

FYI - I have a couple trivial PRs pending on github.

Out of curiousity, would quality-of-life PRs also be worth submitting?

Examples would include:

comments
static_assert() additions
improving parameter names to be more self-documenting (e.g., parameter 1 of rgb_assign_color() from l to index_mask)
reduction of hard-coded “magic numbers” through new #defines (e.g., the intercore messaging constants)
splitting large functions into smaller logical parts (e.g., reducing cyclomatic complexity)

I also noted fragility and inconsistency with how the intercore messaging is used. Nothing is currently functionally broken, but I’d like to clean that up also to improve debuggability going forward.

I’m happy to contribute, but don’t want to waste time on areas that are not of interest to the project owners. Thus, please do let me know if any of the above are areas where contributions would be welcome.

phdussud · May 24, 2024, 9:07pm

About the circuitpython model… What do you mean by rebooted? In other words, what firmware state changes are you targetting by rebooting? A general reboot of the device can have negative consequences on the experiment someone could be doing… like interacting with a I2C sensor.

henrygab · May 24, 2024, 9:27pm

The reboot simplifies the model. The firmware may have open handles to files and/or directories. Those files / directories might be deleted by the host, and thus the open handles are rendered invalid when the host writes to the same file system.

Simply unmounting / remounting the volume is not sufficient, unless the file system has logic that invalidates all existing file handles … in which case, all firmware that logs data now also has to handle a bunch of additional error cases when in the midst of using a file … which is arguably worse because complexity → edge cases → instability / heisenbugs.

In complex systems (such as a PC), when the same media is re-inserted, and the cache manager has cached all relevant metadata, it’s possible for file handles to persist. This is not realistic in a heavily-constrained IoT device, and attempting this could easily result in data corruption.

Brainstorming … perhaps a command that manually transitions ownership of the media from firmware to host (and back)? This does make it less user-friendly as the burden is on the end-user. Something along the lines of…

Firmware → Host:

Checks if any open files / directories … rejects command if true.
Dismounts firmware view of the file system
Exposes the storage device to the host

Host → Firmware:

Checks when host last wrote to the device
If written less than N seconds ago:
a. optionally wait for up to N seconds until true
b. if not waiting or doesn’t occur, reject the command
Eject the media from the host’s view
Mount the media for firmware access

henrygab · June 7, 2024, 7:57am

As best as I can tell, the mounting/unmounting of the file system based on COM port connectivity appears to add some unexpected edge cases.

In particular, it appears that it will interfere with any long-running logging of data, where the firmware would normally keep an open file handle and simply periodically write to the file via that file handle.

Why? Anytime the file system transitions from unmounted → mounted, all existing file handles are implicitly invalidated and future attempts to use those file handles will fail.

Technical details

FatFS includes a counter in the FATFS->id field. Each time the file system is unmounted, the value in this field in increased:

	fs->id = ++Fsid;		/* Volume mount ID */

When opening a file handle (FIL file), the file handle itself stores both the FATFS* and FATFS->id values in the file handle as FIL->fs and FILE->id, respectively. When attempting to use that file handle for any purpose, it is first validated that FIL->fs->id == FILE->id. If not, that means the file system was unmounted (and maybe mounted) since that file was opened, and prevents the file operation from occuring.

The good news is that it is possible to detect when the firmware has an open file handle, prior to initiating a dismount of the file system. The bad news is that the current FatFS does not expose this via any externally accessible API.

Scenario: A long-running command (e.g., dumping triggered trace data to a file periodically) is initiated and is running. The COM port connection is dropped (intentionally or otherwise), while the firmware continues happily. the user reconnects their terminal connection.

Q: Would be expect the above disconnect/reconnect of the serial port to invalidate the file handles being used by the long-running trace?

For myself, if that scenario is broken, then I don’t think I’d like that … at all.

The only solution short of PTP/MTP that I can quickly come up with is to split the flash media into two volumes.

First volume is R/O to firmware, R/W to host

Simple hack exposed from FatFS: Whenever a write is sent from the host to the LBA range for the first volume, new function in ff.c would increment `FATFS->id` and the global static variable `Fsid`. This would implicitly invalidate any open handles that the firmware might have open.

Second volume is R/O to host, R/W to firmware

Whenever the firmware has an open file handle with write access, this volume should appear as ejected media to the host. When the last writable file handle is closed, the volume can be exposed to the host again. This will ensure the host will see the changes that the firmware makes, once all files are closed.

FatFS already tracks the open file handles and if they are read-only or have write access, but only in a global static variable FILESEM[] in ff.c. Thus, this would require another new function or two exposed from that file.

The code changes seem manageable. Would this be a reasonable solution for the BP5?