Storage architecture discussion

henrygab · September 25, 2024, 11:43pm

Continuing discussion from related thread

Questions, corrections, clarifications, and discussion are welcome.

This first post will contain background information on the technology in use. The second will describe the current design. The third post will explain some problems with the current design.

Summary of NAND stack

The BP5 Rev10 (and BP5XL, BP6) have a NAND chip soldered onto the board.

This NAND chip does not have any wear-leveling algorithms, nor any remapping of pages that have errors. Therefore, a layer named Dhara is added above the raw NAND to provide some of these features. Presuming Dhara works, the NAND appears slightly smaller in capacity, and a single operation will sometimes transparently cause multiple writes to the NAND.

Summary of FAT File System

On top of the Dhara layer is the FAT file system. The FAT file system has a few major structures:

BPB, aka Boot Sector. The BPB identifies the format as FAT12, FAT16, FAT32, etc. It defines how large each FAT is, and for FAT32, defines the starting cluster of the root directory.
FAT, aka file allocation table. Each entry in the FAT is a link to the next “cluster” of a file. Some values indicate that there are no more clusters associated with a given file.
- Thus, if a file starts at cluster “X”, by following the clusters linked to in the FAT, all the sectors for that file are found. Put yet another way, it’s a singly-linked list of clusters for a file.
Root directory. The root directory (true for all FAT directories) is nothing more than a file, whose contents are interpreted to contain file-system-defined structures. A directory has a cluster chain in the FAT, just like any other file.
- For FAT12 and FAT16, the root directory is at a fixed location and size. It’s a special case… and why FAT12/FAT16 have a limit on number of files/directories in the root.
- For FAT32, the root directory can be moved to a different starting cluster, and has a cluster chain like any other directory.
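
The "singly-linked list of clusters" idea can be sketched in a few lines. This is a toy model, not code from the BP5 firmware: the FAT is a plain Python list indexed by cluster number, and `END_OF_CHAIN` and `cluster_chain` are names made up for illustration (real FAT16 treats any value from 0xFFF8 to 0xFFFF as end-of-chain).

```python
END_OF_CHAIN = 0xFFFF  # illustrative marker; FAT16 accepts 0xFFF8..0xFFFF
FREE = 0x0000          # a FAT entry of zero means the cluster is unallocated


def cluster_chain(fat, first_cluster):
    """Return every cluster of a file by following the links in the FAT."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if cluster in seen:
            # A corrupt FAT can contain a cycle; stop instead of looping forever.
            raise ValueError(f"loop in cluster chain at cluster {cluster}")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]  # the FAT entry is the link to the next cluster
    return chain


# Toy FAT: index = cluster number, value = next cluster in the file.
fat = [FREE] * 10
fat[2] = 5
fat[5] = 6
fat[6] = END_OF_CHAIN

print(cluster_chain(fat, 2))  # [2, 5, 6]
```

A directory is walked the same way, since its contents live in a cluster chain like any other file.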

Thus, to allocate space for a file, or to extend a file past its currently allocated list of clusters, the file system implementation would have to:

Find a free cluster. This may require traversing the file system, since (until exFAT) there was no on-media tracking of free clusters.
Modify the FAT so the currently-final entry in the cluster chain points to that free cluster.
Modify the directory entry for the file to indicate the larger size.
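
The three steps above can be sketched as follows. It is a simplified in-memory model under stated assumptions: the FAT is a Python list, the directory entry is a dict with hypothetical `first_cluster` and `size` keys, and `find_free_cluster` and `extend_file` are illustrative names, not functions from any real FAT driver.

```python
END_OF_CHAIN = 0xFFFF     # illustrative end-of-chain marker
FREE = 0x0000             # unallocated cluster
FIRST_DATA_CLUSTER = 2    # clusters 0 and 1 are reserved in FAT


def find_free_cluster(fat):
    """Step 1: scan the FAT for an unallocated cluster."""
    for cluster in range(FIRST_DATA_CLUSTER, len(fat)):
        if fat[cluster] == FREE:
            return cluster
    raise OSError("no free clusters left on the volume")


def extend_file(fat, dir_entry, cluster_size):
    """Append one cluster to a file and grow its recorded size."""
    new_cluster = find_free_cluster(fat)

    # Walk to the currently-final cluster of the file.
    last = dir_entry["first_cluster"]
    while fat[last] != END_OF_CHAIN:
        last = fat[last]

    # Step 2: terminate the new cluster first, then link it into the chain.
    fat[new_cluster] = END_OF_CHAIN
    fat[last] = new_cluster

    # Step 3: record the larger size in the directory entry.
    dir_entry["size"] += cluster_size
    return new_cluster


fat = [FREE] * 8
fat[2] = END_OF_CHAIN
entry = {"first_cluster": 2, "size": 100}
print(extend_file(fat, entry, 512), fat[2], entry["size"])  # 3 3 612
```

Each of the three steps touches a different on-media structure, and nothing ties them together as a transaction, which is what makes the later multi-initiator scenarios dangerous.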

As you can imagine, this is not efficient if reading and writing to the media for each step. Thus, most hosts / complex implementations will cache information from the media in RAM. The FAT is a likely candidate for this cache, as it needs to be modified for every file. The data for each directory are also likely to be cached. Often, a host will also convert the directory entries and FAT into an in-memory bitmap of which sectors on the media are used, to speed up the process of finding free sectors.

Download fatgen103.pdf from the bundle available at: https://www.microsoft.com/en-us/download/details.aspx?id=53426&msockid=24c9bdff5e2f6c792400ae0a5f956dd1

Storage device behavioral contract (as inferred by host)

If a media is mounted (even if read-only), only the host will initiate changes to the media contents.
If a sector of the media has content A, then unless the host sends a command that modifies the media, a future read of that sector while the media remains mounted will continue to have content A. (contents of mounted media won’t change between reads … allowing caching of the values.)
Media that is ejected and then reinserted may have had its values changed.

henrygab · September 25, 2024, 11:51pm

Summary of Current Design

The BP5 exposes the storage volume of the NAND only when it finds a valid FAT file system on the NAND.

When the BP5 firmware does not discover a valid FAT file system, a fake read-only 8k (16-sector) FAT volume is exposed with a hard-coded single text file.

By default, the BP5 firmware exposes the storage volume as R/W (readable and writable) to both the firmware and the host.

When a connection is made via the terminal, the media is surprise-removed, marked as read-only (for the host only… firmware can still write to the media), and then re-exposed to the host.

The host will receive at least one error indicating that the media had been ejected as a result of this transition from { Host: R/W, firmware: R/W} → { Host: R/O, firmware: R/W }.

When the terminal connection is closed, the host similarly receives a second error indicating that the media has been ejected, as the media transitions from { Host: R/O, firmware: R/W } to { Host: R/W, firmware: R/W }.

Thus, for a large percentage of the time, the host sees the media as writable while the media is simultaneously mounted by the firmware (and sometimes the firmware also sees it as writable at the same time).

henrygab · September 26, 2024, 12:05am

Unfortunately, having a FAT volume that is writable by one initiator (e.g., the firmware or the host), while simultaneously allowing access by another initiator, is inherently susceptible to data corruption. This post will attempt to explain some of the conditions under which this could occur.

At present, the following are the ONLY states that should be allowed for a FAT volume with two potential initiators:

Host	Firmware	Notes
`None`	`None`	Useful for intermediate states
`R/O`	`R/O`	Both firmware and host can only read
`R/W`	`None`	Only host can read or write the media
`None`	`R/W`	Only firmware can read or write the media

… TODO: Fill in some of the details showing multi-initiator corruption when either initiator has write access.

rough draft scenarios

… generally, cached data updated by one initiator might not be immediately visible by the other initiator …

Example:

Host wants to extend file \alpha\foo.bin, which currently uses only cluster F
Firmware wants to extend file \beta\bar.bin, which currently uses only cluster B
Host reads FAT / directory structures, and finds that cluster X is free, and will use that cluster to extend \alpha\foo.bin
Firmware reads FAT / directory structures, and finds that cluster X is free, and will use that cluster to extend \beta\bar.bin
Firmware starts to extend file \beta\bar.bin by updating the FAT entry for cluster B to link to cluster X, and ensuring the entry for cluster X indicates end-of-chain. Thus, file has chain of B -> X -> EOF. This is written to the FAT.
Host starts to extend file \alpha\foo.bin by updating the FAT entry for cluster F to link to cluster X, and ensuring the entry for cluster X indicates end-of-chain. Thus, file has chain of F -> X -> EOF. This is written to the FAT … the FAT chain is now cross-linked (corrupt).
Host and Firmware update the directory entries for their respective files to indicate the larger file size.
Either / both of them write data to the second cluster of the corresponding file. The data is stored at cluster X in both cases, with one overwriting the other. More corruption.

A similar situation is possible when the host caches the media’s information, and then the firmware does ANY update to the file system structures. The host doesn’t see those updates. The host could read invalid data (using the cached entries), write to the wrong sectors, etc.
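
The cross-link scenario can be reproduced with a toy simulation. It rests on stated assumptions: each initiator snapshots the FAT when it mounts and picks free clusters from that private snapshot, and `Initiator`, `CLUSTER_B`, and `CLUSTER_F` are hypothetical names chosen to match the example above.

```python
END_OF_CHAIN = 0xFFFF     # illustrative end-of-chain marker
FREE = 0x0000             # unallocated cluster
FIRST_DATA_CLUSTER = 2    # clusters 0 and 1 are reserved in FAT


class Initiator:
    """Models a host or firmware that caches the FAT at mount time."""

    def __init__(self, name, media_fat):
        self.name = name
        self.media_fat = media_fat          # the shared, on-media FAT
        self.cached_fat = list(media_fat)   # private snapshot, never refreshed

    def extend(self, last_cluster):
        # The free-cluster search uses the stale private cache.
        new_cluster = self.cached_fat.index(FREE, FIRST_DATA_CLUSTER)
        # The update goes to both the cache and the media.
        for fat in (self.cached_fat, self.media_fat):
            fat[new_cluster] = END_OF_CHAIN
            fat[last_cluster] = new_cluster
        return new_cluster


media_fat = [FREE] * 8
CLUSTER_B, CLUSTER_F = 2, 3               # \beta\bar.bin and \alpha\foo.bin
media_fat[CLUSTER_B] = END_OF_CHAIN
media_fat[CLUSTER_F] = END_OF_CHAIN

host = Initiator("host", media_fat)
firmware = Initiator("firmware", media_fat)

x_firmware = firmware.extend(CLUSTER_B)   # firmware claims cluster X
x_host = host.extend(CLUSTER_F)           # host claims the same cluster X
cross_linked = media_fat[CLUSTER_B] == media_fat[CLUSTER_F]
print(x_firmware, x_host, cross_linked)   # 4 4 True
```

Neither initiator did anything wrong by its own rules; the corruption comes purely from each one trusting a cache that the other invalidated.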

grymoire · September 26, 2024, 1:50pm

Let me summarize the problems I have seen on Linux.

When I connect to the BP interface with tio and simultaneously copy a file onto the BP, I should get an error on the command line, because while I am connected to the BP the file system should be read-only. No error is reported, but the system logs report problems with the file system. At times, I have tried this and Linux goes into a state where it will no longer mount the file system, even if I unplug and replug the BP. I have to reboot Linux to mount the BP file system again.

I created a syslog of the events with more details

https://github.com/user-attachments/files/17132085/syslog.txt

grymoire · September 26, 2024, 2:00pm

I tried to find references to Microsoft’s “Surprise removal sequence” WRT Linux systems, and they seem to refer to hot-pluggable NVMe drives, which are mentioned in the context of servers with virtual machines. So far I have not seen any discussion on how to handle the case when the attached device’s file system decides to switch to Read Only.
Searches for similar low-level errors like “Mode Sense: 03 00 80 00” turn up mentions of file system corruption.

The Linux OS eventually reports errors like

CDB: Write(10) 2a 00 00 00 00 88 00 00 01 00
2024-09-25T09:32:57.238777-04:00 laptop kernel: critical target error, dev sda, sector 544 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0

When I search for this error message - it seems to indicate a hardware error.
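
For reference, the CDB in that log line can be decoded by hand. The sketch below relies only on the standard WRITE(10) layout (opcode 0x2A, a 4-byte big-endian LBA at bytes 2 to 5, a 2-byte transfer length at bytes 7 to 8); `decode_write10` is an illustrative helper name, not part of any library.

```python
import struct


def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB given as a hex string such as '2a 00 ...'."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    # Big-endian: opcode, flags, LBA, group number, transfer length, control.
    _opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    return {"lba": lba, "blocks": blocks}


print(decode_write10("2a 00 00 00 00 88 00 00 01 00"))
# {'lba': 136, 'blocks': 1}
```

The kernel message reports sector 544, which it counts in 512-byte units. 544 is 136 × 4, so if this CDB belongs to the same failed write, that would be consistent with a 2048-byte logical block size. That is an inference from the numbers, not something the log states.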

henrygab · September 26, 2024, 4:52pm

It sounds like Linux is not noticing that the media has changed. It may be that the BP is not fully emulating a media change sequence, and thus some OS are “missing” the transition when it occurs.

I’m reaching back into my history for this, so errors are likely:

Status / SK / ASC / ASCQ

Historically, Windows OS was based on decoding SCSI status codes `Status / SK / ASC / ASCQ` (Status, Sense Key, Additional Sense Code, Additional Sense Code Qualifier). ATA devices would have their responses translated to the SCSI status, to work with the existing storage stack. USB devices wrapped SCSI commands, and similarly use the same error codes.

Status was typically either 0x00 (success) or 0x02 (check condition). Historically, the SK/ASC/ASCQ was only requested when status was non-zero. Nowadays, it’s often automatically provided at the same time.

Sense Key

SK meanings generally:

0x00 means no error
0x01 means a “soft” error, such as data that was automatically corrected by ECC
0x02 covers the “not ready” errors, such as no media present, device spinning up the media, etc.
0x03 covers medium errors, such as damaged sectors, failed writes, etc.
0x04 covers hardware errors. Rare to see, may be treated similar to 0x03
0x05 covers illegal requests … the command sent to the device had wrong bits set, an LBA out of range, or it couldn’t be processed due to current state (e.g., eject when device locked)
0x06 was UNIT ATTENTION … when the device went from not-ready (e.g., no media) to ready (e.g., media ready for access), or when mode parameters changed, the device was required to report this so the host could interrogate the device to see what’s changed / flush caches, etc.
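
As a reading aid, here is a minimal sketch of pulling SK / ASC / ASCQ out of fixed-format sense data (response code 0x70 or 0x71), where the sense key is the low nibble of byte 2 and ASC / ASCQ are bytes 12 and 13. The `SENSE_KEYS` table only covers the keys listed above, and `decode_fixed_sense` is an illustrative name.

```python
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}


def decode_fixed_sense(sense):
    """Return (sense_key, asc, ascq) from fixed-format sense data."""
    if len(sense) < 14:
        raise ValueError("sense data too short to contain ASC/ASCQ")
    response_code = sense[0] & 0x7F      # top bit is the VALID flag
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    sense_key = sense[2] & 0x0F          # low nibble of byte 2
    return sense_key, sense[12], sense[13]


# Hand-built 18-byte example: UNIT ATTENTION, ASC 0x28, ASCQ 0x00.
sense = bytearray(18)
sense[0] = 0x70
sense[2] = 0x06
sense[7] = 10                            # additional sense length
sense[12] = 0x28
sense[13] = 0x00

sk, asc, ascq = decode_fixed_sense(sense)
print(SENSE_KEYS[sk], f"{asc:02X}/{ascq:02X}")  # UNIT ATTENTION 28/00
```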

So, what would I expect a device to do, when transitioning from no media present, to having media?

SK	ASC	ASCQ	Notes
`02`	`3A`	`00`	Medium not present
`06`	`29`	xx	Device was reset
`06`	`28`	`00`	Not ready to ready (medium changed)
`00`	`00`	`00`	No error
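
One way to picture that sequence is as a queue of pending UNIT ATTENTION conditions that the device hands out one at a time before it reports "no error". The model below is purely illustrative: `FakeUnit` and its methods are hypothetical and are not the tinyusb or BP5 API, and ASCQ 0x00 is picked for the reset row, which the table leaves as xx.

```python
from collections import deque

NO_SENSE = (0x00, 0x00, 0x00)
MEDIUM_NOT_PRESENT = (0x02, 0x3A, 0x00)
DEVICE_WAS_RESET = (0x06, 0x29, 0x00)    # ASCQ is "xx" in the table; 0x00 assumed
NOT_READY_TO_READY = (0x06, 0x28, 0x00)  # medium may have changed


class FakeUnit:
    """Toy logical unit that reports queued unit attentions in order."""

    def __init__(self):
        self.media_present = False
        self.pending = deque()

    def reset(self):
        self.pending.append(DEVICE_WAS_RESET)

    def insert_media(self):
        self.media_present = True
        self.pending.append(NOT_READY_TO_READY)

    def poll(self):
        """Return the (SK, ASC, ASCQ) the next command would report."""
        if self.pending:
            return self.pending.popleft()    # unit attentions come first
        if not self.media_present:
            return MEDIUM_NOT_PRESENT
        return NO_SENSE


unit = FakeUnit()
sequence = [unit.poll()]                     # no media yet
unit.reset()
unit.insert_media()
sequence += [unit.poll(), unit.poll(), unit.poll()]
for sk, asc, ascq in sequence:
    print(f"{sk:02X} {asc:02X} {ascq:02X}")
```

The point of the queue is that each condition is reported exactly once, so a host that polls the device sees every transition even if several happened between polls.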

??? Could one or more of the above be required on Linux, but be missing from the BP responses ???

phdussud · September 26, 2024, 9:45pm

Thank you Henry! Super useful. The default implementation of the USB drive comes from the tinyusb library. It only has “No Error” and “Medium not present” as sense responses. I will add the 2 remaining states to our implementation and see if Linux behaves reasonably then.
PS If you find the document that describes this I would be happy to keep it as a reference.

phdussud · September 26, 2024, 9:55pm

Based on your syslog, it is obvious that after you connected to your BP with tio, Linux has noticed that the BP is read-only.
“Trying to write to read-only block-device sda1”
I’m not sure why the command line does not give you an error.

grymoire · September 26, 2024, 10:44pm

Linux didn’t know the file system was mounted read only. When I do “mount -l” to list the current mounts, it’s clearly RW.

/dev/sda1 on /media/grymoire/BUS_PIRATE5 type vfat (rw,nosuid,nodev,relatime,uid=1000,gid=1000,fmask=0022
,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,showexec,utf8,flush,errors=remount-ro,
uhelper=udisks2) [BUS_PIRATE5]

Note that the mount option says to remount the file system read only in the case of an error. But from what I understand, this only applies if an error occurs during the mount process.

I suspect at the file system level - it’s trying to write a file, yet at the driver level - it fails because the device no longer accepts write commands, and the system reports errors like:

critical target error, dev sda, sector 544 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0

That is - I suspect the OS wants to write a file and the hardware is unable to do so.

phdussud · September 26, 2024, 11:00pm

I see. That makes sense. Thanks for the explanation.

thestumbler · September 27, 2024, 2:56am

This discussion brings back a fuzzy memory from like 2010 or so, when I was using a USB mounted, simulated flash drive to download firmware updates to a set top box powered by an LPC1768 (Arm Cortex M3). I was using Windows and we expected 99% of consumers would be as well. But just for fun, one day I tried it using a Linux system and it failed.

I dug into it briefly, and there was a subtle mismatch between how Windows wrote to the USB mounted drive and how Linux did so. I think I even figured out how to fix it, but the client told me to pass on that task (we were backlogged with more serious problems).

As I recall, it had something to do with the sequence of writing the data? It’s real fuzzy 10+ years on. If this sounds potentially useful to investigate the issues here, let me know and I’ll dig up my notes and post a more detailed explanation.

grymoire · September 27, 2024, 4:44pm

I was just looking at a BusPirate where I did a “format” on it, but when I re-connected it, the file system did not mount. I had to reboot my Linux system to allow mounting.

The syslog said that the file system was dirty (not unmounted properly) and it wouldn’t re-mount it.

I think I read somewhere that the boot block (where the file system status is stored) should be writable even if the file system is read-only. Is it possible Linux is trying to mark the drive and fails?

grymoire · September 27, 2024, 4:55pm

I found this Linux kernel commit describing the dirty bit on FAT file systems.

github.com/torvalds/linux

fat: mark fs as dirty on mount and clean on umount

committed 03:10AM - 28 Feb 13 UTC

olerem

+70 -0

There is no documented methods to mark FAT as dirty. Unofficially MS started to… use reserved Byte in boot sector for this purpose, at least since Win 2000. With Win 7 user is warned if fs is dirty and asked to clean it. Different versions of Win, handle it in different ways, but always have same meaning:
- Win 2000 and XP, set it on write operations and remove it after operation was finnished
- Win 7, set dirty flag on first write and remove it on umount.

We will do it as follows:
- set dirty flag on mount. If fs was initially dirty, warn user, remember it and do not do any changes to boot sector.
- clean it on umount. If fs was initially dirty, leave it dirty.
- do not do any thing if fs mounted read-only.
- TODO: leave fs dirty if we found some error after mount.

Signed-off-by: Oleksij Rempel <bug-track@fisher-privat.net>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
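
To make the "reserved byte" concrete, here is a hedged sketch of testing and setting that flag on a raw boot sector image. The offsets are an assumption based on the BS_Reserved1 field in fatgen103 (offset 37 for FAT12/FAT16, offset 65 for FAT32), with bit 0 used as the dirty flag; the helper names are made up for illustration.

```python
FAT_STATE_DIRTY = 0x01     # bit 0 of the reserved byte (assumed)
STATE_OFFSET_FAT16 = 37    # BS_Reserved1 in a FAT12/FAT16 boot sector (assumed)
STATE_OFFSET_FAT32 = 65    # BS_Reserved1 in a FAT32 boot sector (assumed)


def _state_offset(fat32):
    return STATE_OFFSET_FAT32 if fat32 else STATE_OFFSET_FAT16


def is_dirty(boot_sector, fat32=False):
    """Return True if the dirty flag is set in the boot sector image."""
    return bool(boot_sector[_state_offset(fat32)] & FAT_STATE_DIRTY)


def set_dirty(boot_sector, dirty, fat32=False):
    """Set or clear the dirty flag in a mutable boot sector image."""
    offset = _state_offset(fat32)
    if dirty:
        boot_sector[offset] |= FAT_STATE_DIRTY
    else:
        boot_sector[offset] &= ~FAT_STATE_DIRTY & 0xFF


boot = bytearray(512)          # blank stand-in for a real boot sector
set_dirty(boot, True)          # what a read-write mount would do
print(is_dirty(boot))          # True
set_dirty(boot, False)         # what a clean unmount would do
print(is_dirty(boot))          # False
```

Setting and clearing the flag are both writes to the boot sector, so a host that mounted read-write and then lost write access could plausibly leave the flag set.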

henrygab · September 28, 2024, 1:21am

In the earlier posts, I posited that only a few states were fully safe.
Technically, a couple more exist (single-initiator, read-only).
Here’s a full table of the states, and quick notes about why they are safe or not safe.

PLEASE ASK FOR DETAILS IF YOU BELIEVE A STATE MARKED UNSAFE IS ACTUALLY SAFE.

I will then provide more specific timelines of which initiator does what actions. This, however, takes significant work to write (and triple-check), so I would prefer to do it only if necessary.

Host	Firmware	Safe?	Problems
`None`	`None`	Safe	Useful for intermediate states
`R/O`	`None`	Safe
`None`	`R/O`	Safe
`R/O`	`R/O`	Safe	Both firmware and host can read, cache the data, but neither one changes the data
`R/W`	`None`	Safe	Single-initiator
`None`	`R/W`	Safe	single-initiator
`R/W`	`R/W`	Not safe	e.g., host reads/caches the FAT; firmware allocates space; host later allocates that same space for a different file; writes to one file now forever corrupt the other file
`R/W`	`R/O`	Not safe	e.g., host caches some writes for performance, and/or has updated the FAT without updating the directory entry (or vice versa … the changes are not transactional). Result is that firmware reads and/or writes corrupt data.
`R/O`	`R/W`	Not safe	e.g., host caches most data; firmware later makes changes; host does not see changes made by firmware; host reads and/or writes corrupted data
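
The table reduces to one rule: a state is unsafe as soon as one initiator can write while the other has any access at all. A small sketch of that rule, with `Access` and `is_safe` as illustrative names:

```python
from enum import Enum


class Access(Enum):
    NONE = "None"
    RO = "R/O"
    RW = "R/W"


def is_safe(host, firmware):
    """Safe unless one side can write while the other has any access."""
    if host is Access.RW and firmware is not Access.NONE:
        return False
    if firmware is Access.RW and host is not Access.NONE:
        return False
    return True


# Print all nine combinations in the same layout as the table.
for host in Access:
    for firmware in Access:
        verdict = "Safe" if is_safe(host, firmware) else "Not safe"
        print(f"{host.value}\t{firmware.value}\t{verdict}")
```

Enforcing a check like this at the single point where access modes are changed would turn the table from a convention into an invariant.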

grymoire · September 28, 2024, 1:38pm

I’m not clear exactly what happens. I suspect this is part of the sequence:

1. Bus Pirate plugged in
2. Linux mounts /media/$USER/BUS_PIRATE5. Marks file system as dirty by writing to boot block
3. User connects to BP using serial link
4. Bus Pirate changes state. (I’m not sure what it does exactly)
5. User tries to copy file onto /media/$USER/BUS_PIRATE5. This does not complete.
6. Bus Pirate is unplugged.
7. Linux unmounts file system.
8. Bus Pirate is plugged in or reboots via serial interface.
9. Linux looks at boot block - perhaps sees the dirty bit. Refuses to mount file system.

However, if it’s simply the dirty bit, then step 5 isn’t necessary to cause Linux to refuse to mount the file system. So ???

Perhaps a simpler way to cause the problem is to connect to the BP and type # or $. I’m looking into this. I also see that the udisks daemon is involved. I am monitoring its status as I try to dig into the problem. I’ve never worked with udisksd before so I am learning new things here.

grymoire · September 28, 2024, 2:39pm

I’m getting deep in the weeds here. I’ve been trying to help diagnose the problem. I’ve had the case that if I connect to the BP, and type ‘#’, I have to reboot Linux to reconnect.

I don’t know if anyone else has this problem. I have a simple work-around I’m going to submit - a Linux script to connect to the BP. It simply remounts the file system read-only before connecting. I’ll share it in a separate posting.

henrygab · September 28, 2024, 4:00pm

This thread isn’t about the media change detection. That’s a separate problem (see PR #106 by @phdussud). This thread is about whether the current decision to allow two initiators to the same FAT-formatted media (with one or two of them allowed to write to that media) should be revisited.

It’s my strong opinion that, even if the media change detection is fixed, the current choice will cause data corruption that is extremely difficult to debug / track down.

To refresh, the current choice was:

Host: R/W, Firmware: R/W – When the terminal is not open
Host: R/O, Firmware: R/W – When the terminal is open

Both of the above choices are unsafe. The first is less safe; the second still risks the host at least reading corrupt data.

phdussud · September 30, 2024, 5:11pm

The current choice is based on the fact that no firmware Write-IO happens while the terminal is not connected. At least that’s what I believe happens. Am I wrong?

henrygab · October 1, 2024, 2:51am

You may be right. At the same time, there is nothing which prevents the firmware from writing. If such a restriction is being relied upon to keep data coherent, then it should be enforced. Otherwise, folks will break that restriction, have no idea they are doing so, and … DangerousPrototypes will have a new connotation. (If this is the chosen path, then the media should just flip-flop between the host or the firmware having full, exclusive access at any one time.)

For example, while I have not tested it, I believe it’s possible to configure the button to run a script, and that no terminal connection is required. I can easily foresee this being a common setup (just press the button … and BAM! … a log, trace, or dump gets written to the NAND).

In fact, I intend to set up my BP6 in this manner once things stabilize, arming a logic analyzer (and trigger) with a button press, and having the results saved to file.

henrygab · October 1, 2024, 8:33am

For the medium-term, I advocate for BP5 to practice “safe storage”. This means avoiding the above unsafe multi-initiator states, and (at least at first) requiring some manual interaction.

As a strawman, a first step could be:

1. Define four allowed modes for the storage volume:
a. Exposed only to the firmware (R/W)
b. Exposed only to the host (R/W)
c. Exposed to both firmware and host (R/O for both)
d. Exposed to neither firmware nor host (No Access or Unformatted)
2. Default to one of those modes at boot, likely: firmware (R/W)
3. Support the PREVENT_ALLOW_MEDIUM_REMOVAL command, allowing the media to be locked by the host (preventing firmware from switching modes when it may be written by host).
4. Add a terminal command to switch the current mode of the storage volume.
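
A rough sketch of how the strawman could hang together, as a model only: `StorageMode`, `StorageVolume`, and the method names are hypothetical and do not correspond to existing BP5 firmware code. The host lock stands in for the effect of PREVENT_ALLOW_MEDIUM_REMOVAL.

```python
from enum import Enum


class StorageMode(Enum):
    FIRMWARE_RW = ("None", "R/W")   # (host access, firmware access)
    HOST_RW = ("R/W", "None")
    SHARED_RO = ("R/O", "R/O")
    NO_ACCESS = ("None", "None")


class StorageVolume:
    def __init__(self):
        self.mode = StorageMode.FIRMWARE_RW   # proposed default at boot
        self.host_locked = False

    def prevent_allow_medium_removal(self, prevent):
        """Host locks (True) or unlocks (False) the media."""
        self.host_locked = prevent

    def switch_mode(self, new_mode):
        """Terminal command handler: refuse to switch while the host holds the lock."""
        if self.host_locked:
            raise PermissionError("media is locked by the host")
        self.mode = new_mode
        return new_mode


volume = StorageVolume()
volume.switch_mode(StorageMode.HOST_RW)
volume.prevent_allow_medium_removal(True)     # host is busy writing
try:
    volume.switch_mode(StorageMode.FIRMWARE_RW)
except PermissionError as err:
    print("refused:", err)
volume.prevent_allow_medium_removal(False)    # host is done
print(volume.switch_mode(StorageMode.FIRMWARE_RW).name)  # FIRMWARE_RW
```

Because the only reachable modes are the four safe ones, no sequence of terminal commands can put the volume into a multi-initiator write state. A real implementation would also have to run the media removal sequence toward the host on every switch.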

The longer term solution is to implement MTP, and remove the mode where the host has R/W access to the storage volume. Then, any writes by the firmware would need to go through the media removal sequence to have the host invalidate its cached view of the FAT FS (and some MTP state). MTP is non-trivial, but was designed to resolve this exact type of issue.