@henrygab I would like to come back to the discussion you started on this thread but did not finish, explaining the scenarios in which this situation:
firmware RW and host RO, can lead to corrupted state on the host,
and this situation:
firmware RO and host RW, can lead to corrupted state on the firmware.
I am super interested in the answer.
Here is what I think I know. Please correct me if I am wrong:
Dhara is the common access point to the storage; the NAND itself is abstracted underneath the Dhara API.
Dhara itself has a cache with delayed write, so durability may be compromised if the Dhara cache isn't flushed before a power-down event. Our hardware does not have the capability of mitigating this threat.
Dhara is now protected by a mutex that ensures concurrent reads and writes are serialized, to protect against torn reads and writes.
I am asserting that Dhara has the following properties:
- Repeatable reads of the same sector if there are no intervening writes.
- A sector read after a sector write will always return what was written.
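Not the actual firmware code, but a minimal sketch of the serialization just described, assuming the dhara_map_read()/dhara_map_write() entry points from the upstream dhara library and a pthread-style mutex for illustration (the firmware would use its RTOS's own primitive):

```c
#include <pthread.h>
#include "dhara/map.h"   /* dhara_map_read / dhara_map_write */

/* One lock guards every access to the map, so a read can never
 * observe a sector that is only partially written (no torn I/O). */
static pthread_mutex_t dhara_lock = PTHREAD_MUTEX_INITIALIZER;
static struct dhara_map map;     /* initialized elsewhere */

int storage_read_sector(dhara_sector_t s, uint8_t *buf)
{
    dhara_error_t err;
    pthread_mutex_lock(&dhara_lock);
    int rc = dhara_map_read(&map, s, buf, &err);
    pthread_mutex_unlock(&dhara_lock);
    return rc;
}

int storage_write_sector(dhara_sector_t s, const uint8_t *buf)
{
    dhara_error_t err;
    pthread_mutex_lock(&dhara_lock);
    int rc = dhara_map_write(&map, s, buf, &err);
    pthread_mutex_unlock(&dhara_lock);
    return rc;
}
```

With a single lock around every map operation, the two asserted properties follow directly: reads are repeatable absent writes, and a read after a write sees that write.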
On the firmware side, the next layer up is FatFs.
FatFs as configured has only one cache of 1 sector. This cache, when dirty, is written back to Dhara in the following situations:
- Before it needs to cache a different sector.
- At the end of file close() when the file was opened in write-access mode.
- At the end of a directory creation or deletion.
- At the end of SetVolumeLabel.
- At the end of a file rename (not supported in the firmware).
It should be noted that the Dhara cache is flushed when the FatFs cache is flushed, except in the first situation.
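To illustrate why: in a typical FatFs-to-Dhara glue layer (a hypothetical sketch, not the actual firmware source, 512-byte sectors assumed), FatFs only issues disk_ioctl(CTRL_SYNC) from f_sync()/f_close() and similar operations. A plain cache eviction only calls disk_write(), so the data lands in Dhara's cache without a dhara_map_sync():

```c
#include "ff.h"       /* FatFs */
#include "diskio.h"   /* FatFs media access interface */
#include "dhara/map.h"

extern struct dhara_map map;   /* hypothetical global map instance */

/* Called by FatFs when it evicts its one dirty sector (situation 1):
 * the data reaches Dhara's cache but is not yet durable on NAND. */
DRESULT disk_write(BYTE pdrv, const BYTE *buff, LBA_t sector, UINT count)
{
    (void)pdrv;
    dhara_error_t err;
    for (UINT i = 0; i < count; i++) {
        if (dhara_map_write(&map, sector + i, buff + i * 512, &err) < 0)
            return RES_ERROR;
    }
    return RES_OK;
}

/* Called by FatFs from f_sync()/f_close() etc.: only here is the
 * Dhara cache flushed, making the earlier writes durable. */
DRESULT disk_ioctl(BYTE pdrv, BYTE cmd, void *buff)
{
    (void)pdrv; (void)buff;
    dhara_error_t err;
    switch (cmd) {
    case CTRL_SYNC:
        return (dhara_map_sync(&map, &err) < 0) ? RES_ERROR : RES_OK;
    default:
        return RES_PARERR;
    }
}
```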
In all but the first situation, the firmware attempts to "reset" the view of the host by ejecting and re-inserting the media. This is done with a sequence of 3 SCSI media sense codes.
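For reference, a hedged sketch of how such an eject/re-insert sequence might be produced with TinyUSB's MSC callbacks (the firmware's actual state machine and exact sense sequence may differ):

```c
#include "tusb.h"   /* TinyUSB */

/* Hypothetical media-change state machine: report NOT PRESENT,
 * then MEDIUM MAY HAVE CHANGED, then ready again. */
typedef enum { MEDIA_PRESENT, MEDIA_EJECTED, MEDIA_CHANGED } media_state_t;
static media_state_t media_state = MEDIA_PRESENT;

bool tud_msc_test_unit_ready_cb(uint8_t lun)
{
    switch (media_state) {
    case MEDIA_EJECTED:
        /* 02h/3Ah/00h: NOT READY, MEDIUM NOT PRESENT */
        tud_msc_set_sense(lun, SCSI_SENSE_NOT_READY, 0x3A, 0x00);
        media_state = MEDIA_CHANGED;   /* advance on the next poll */
        return false;
    case MEDIA_CHANGED:
        /* 06h/28h/00h: UNIT ATTENTION, MEDIUM MAY HAVE CHANGED */
        tud_msc_set_sense(lun, SCSI_SENSE_UNIT_ATTENTION, 0x28, 0x00);
        media_state = MEDIA_PRESENT;
        return false;
    default:
        return true;                   /* medium present and ready */
    }
}
```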
I have tried to think of a scenario that can lead to corruption, but I can't.
OK. Let me see if I can try. This would be easier with a whiteboard and live interaction.
If the notes say \foo → 20...23, 40, this means that the file foo in the root directory has a first allocated cluster of 20, and the full cluster chain is 20→21→22→23→40→end-of-chain.
Example with firmware RW, host RO
The host presumes that sector X, once read, will continue to return the same data for future reads, unless/until the host itself causes the data to change. Read cache is enabled even for hot-plug devices.
| time | action | fw view | host view | Notes |
|------|--------|---------|-----------|-------|
| 0 | mount | consistent | consistent | Two files: \foo, with one allocated cluster @ cluster 20; \log, with various allocated clusters |
| 100 | host read | | | Host reads and caches FAT and root directory contents |
| 200 | fw deletes file \foo | consistent | corrupt | Host is not notified of the deleted file |
| 300 | fw extends file \log | consistent | corrupt | FW allocates clusters 20...25 |
| 350 | fw creates file \foo | consistent | corrupt | FW allocates cluster 90 |
| 400 | host reads file \foo | consistent | corrupt | Host uses cached copy of FAT and root directory, discovers the file uses cluster 20, and reads cluster 20 (data from the middle of the log file) |
| 500 | fw reads file \foo | consistent | corrupt | FW determines the file is at cluster 90 and reads the correct data |
The read cache on the host (which is perfectly legal and valid to have) results in a mismatch between what the host sees and what the firmware sees.
Example with firmware RO, host RW
| time | action | fw view | host view | Notes |
|------|--------|---------|-----------|-------|
| 0 | mount | consistent | consistent | \foo → 20...25 and \log → 40...44 |
| 100 | fw opens \foo | consistent | consistent | |
| 200 | fw reads up to cluster 24 | consistent | consistent | |
| 300 | host truncates \foo | corrupt | consistent | \foo → 20...23 |
| 400 | host adds sectors to \log | corrupt | consistent | \log → 40...44, 24...27 |
| 450 | optionally, host expands \foo | corrupt | consistent | \foo → 20...23, 28...31 |
| 500 | fw reads more data from \foo | corrupt | consistent | FW now gets data from the middle of \log, not from \foo |
The firmware has an open handle to a file. There is no way to indicate this to the host. The host properly does what it wants with the media. We do NOT want the firmware to have to deal with file handles that suddenly become invalid after they are opened (complexity like that would geometrically explode the test cases). When the firmware looks at the FAT to discover the next cluster, it looks at the FAT entry for cluster 24 (where it has read to). There is a valid entry there, pointing to cluster 25, so it takes that as the next cluster of the file ... except that the host has changed which file is using cluster 24, and thus the next cluster in the FAT also belongs to some other file.
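To see why the firmware cannot detect this, here is a minimal sketch of a FAT32 next-cluster lookup (hypothetical read_sector() helper and fat_start_lba variable; 512-byte sectors and a little-endian target assumed):

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define FAT32_EOC 0x0FFFFFF8u   /* >= this value means end-of-chain */

/* Hypothetical helper provided by the storage layer. */
extern bool read_sector(uint32_t lba, uint8_t *buf512);

extern uint32_t fat_start_lba;  /* first sector of the FAT region */

/* Follow the FAT to the next cluster of a file. Note there is no way
 * to check *which file* a FAT entry belongs to: if the host reassigned
 * cluster `cur` to another file, this happily walks the wrong chain. */
static uint32_t next_cluster(uint32_t cur)
{
    uint8_t sec[512];
    uint32_t byte_off = cur * 4;                 /* 4 bytes per FAT32 entry */
    if (!read_sector(fat_start_lba + byte_off / 512, sec))
        return FAT32_EOC;                        /* treat I/O error as EOC */

    uint32_t entry;
    memcpy(&entry, &sec[byte_off % 512], 4);     /* entries are little-endian */
    return entry & 0x0FFFFFFFu;                  /* top 4 bits are reserved */
}
```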
This same problem would occur with full SATA storage that manages its own bad-block remapping ... in other words, dhara does not appear to be relevant to this discussion, which is viewing the logically exposed space (above dhara).
@henrygab Thank you for taking the time to think of these scenarios.
About the firmware RW, host RO:
- 200: As soon as the file is deleted, the host is notified of the removal of the medium, followed by a medium change and insertion of the new medium. This makes the host aware of the deletion.
- 300: if \log is closed, the same thing as at 200 happens.
- 350: same as 300.
- 400: since the host view has been reset at 350, it should be in sync with the FW view.
This is a great discussion, thank you everyone for contributing.
It has gone beyond my ability to keep up at the moment, but if there is anything I can do please let me know.
@phdussud's two pull requests related to storage issues are in the latest auto-build firmware. Can anyone confirm whether this has addressed any of the issues under Linux?
About the firmware RO, host RW:
Right, good point. I understand that concurrent access to files will lead to trouble.
Either too little will be read by the firmware when a file is in the process of growing (because the host is writing to it), or garbage will be read if the host truncates a file while the firmware is still reading it.
Both of these situations are undesirable.
Note that at the moment, I don't think the firmware reads anything while the host can write, but as you appropriately noted earlier, this isn't enforced and could change with the next contributor check-in.
Example #2 with firmware RW, host RO

| time | action | fw view | host view | Notes |
|------|--------|---------|-----------|-------|
| 100 | host read | | | Host reads and caches FAT and root directory contents |
| 200 | host open \q.txt | | | logically only, no data read yet |
| 300 | fw opens \q.txt | | | logically only, no data read yet |
| 400 | host reads @ 20, 21 | | | |
| 500 | fw writes @ 21...28 | | | (will soon continue to write...) |
| 700 | host reads @ 22 | | corrupt | sector 21 = old, 22 = new (never saw new sector 21 data) |
| 800 | ... | | | ...time passes... |
| 900 | fw closes \q.txt | | | ... too late ... |
Example #3 with firmware RW, host RO
| time | action | fw view | host view | Notes |
|------|--------|---------|-----------|-------|
| 0 | mount | consistent | consistent | \q.txt → 20...29 |
| 100 | host read | | | Host reads and caches FAT and root directory contents |
| 200 | host open \q.txt | | | logically only, no data read yet |
| 300 | fw opens \q.txt | | | logically only, no data read yet |
| 400 | host reads @ 20 | | | |
| 500 | host reads @ 21 | | | |
| 600 | fw writes @ 21...28 | | | (will soon continue to write...) |
| 700 | host reads @ 21 | | corrupt | undefined behavior |
Because the device reported different data for the read of cluster 21 at step 500 vs. step 700, the device is violating the basic contract of read-only media: it must return identical contents for reads of the same sector.
The immutability of data from other sources is a basic premise for locally mounted storage. Break that behavioral contract, and you get undefined behavior and hard-to-track bugs on the host.
How do I know about the host's UB?
There were lots of really odd bugchecks that only occurred on machines with certain models of hard drives. In other words, LOTS of undefined behavior. I helped build a filter driver to detect when drives reported success for a write and then later reported different data for that sector. We were then able to bring this data to the drive manufacturers, helping them fix some edge-case bugs in their firmware. Consumer drives became measurably more reliable as a result.
Even where dhara has no race conditions, the underlying cause is still having multiple initiators, where one or more can modify the underlying volume.
I agree Dhara isn't interesting for this analysis. The root cause is concurrent access to the same file by 2 initiators with at least one writing to the file.
Concurrent access to a file is not the only scenario (e.g., writing additional data that extends the file size is not atomic: updating the FAT, writing the new data, updating the directory entry to reflect the new size, etc.). I give simple examples, but this does not mean those are the only examples.
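To make the non-atomicity concrete, here is a hedged sketch of the separate on-disk updates an append involves (all helper names hypothetical; real FatFs code differs in detail):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical low-level helpers. */
extern uint32_t fat_alloc_cluster(void);                     /* pick a free cluster */
extern void fat_set_entry(uint32_t cluster, uint32_t value); /* write one FAT entry */
extern void write_cluster(uint32_t cluster, const void *data, size_t len);
extern void dir_set_size(const char *path, uint32_t new_size);

/* Appending one cluster of data to a file is at least three separate
 * sector writes; there is no transaction tying them together. */
void fat_append_cluster(const char *path, uint32_t last_cluster,
                        const void *data, size_t len, uint32_t new_size)
{
    uint32_t c = fat_alloc_cluster();

    fat_set_entry(last_cluster, c);      /* 1. link new cluster into the chain */
    fat_set_entry(c, 0x0FFFFFFF);        /*    ...and mark it end-of-chain */
    write_cluster(c, data, len);         /* 2. write the file data itself */
    dir_set_size(path, new_size);        /* 3. update the directory entry size */
    /* A power cut or a concurrent reader between any two steps above
     * observes an inconsistent volume. */
}
```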
FYI, we're not the first to try...
I used to "own" part of the storage stack at Microsoft, including cdrom.sys and disk.sys. I've seen tons of attempts to "solve" this over the years ... including devices "sniffing" all writes and attempting to "interpret" what a given write to a FAT volume was intended to do. None of those attempted solutions survived.
MTP was created for precisely this reason. It allows the firmware to "own" the storage device, while allowing the host to both read and write files on the device. MTP is still complex ... yet it took off like wildfire in consumer electronics. That by itself should be telling.
All my examples have stayed on the simplistic side. I did not delve into the non-atomic updates that FAT requires.
There is one solution that (theoretically) could work to enable FW R/W, host R/O. However, there are no open-source implementations, and it's rare to see such volumes in the "wild" ... the use of Transactional FAT (or Transactional exFAT). There's enough information in the public patent documents to figure out how it works, but ... it'd still be a major undertaking to test all the edge cases ... and I'm not taking on something that large currently.
One special case exists ...
There is one implementation out there, and it's a special case.
Adafruit's CircuitPython exposes the storage volume (which is also R/W for the Python scripts running) to the host as R/W. However, as soon as a write occurs, they stop execution, and after a couple of seconds of not receiving additional writes, they reboot the device.
This mechanism was considered for BP5 but rejected, as rebooting whenever the host writes was seen as problematic (and rightly so for the BP5's purposes).
Where there are two writing initiators, there are thousands of interactions which are unsafe. I think we agree on this?
I'm going to ask the dumb-guy question. Please forgive me, Henry. But I'm struggling to comprehend the issue.
Yes, if both the host and BP were RW - that would be perfect.
If the host is RO and the BP is RW, and I read a file on the host while writing the same file on the BP, I would expect that reading the file would be unstable (for lack of a better word). I've used distributed file systems, and this is acceptable. I can live with this caveat until a perfect solution is implemented (using MTP?)
Normally the same person is doing both operations, so we are aware the file is unstable. Yes, if I am doing a `tail -f logfile` this will be screwy, and that's a problem. But what files will the BP create?
- SPI flash dump - I'd wait until the end before I'd assume it's a good dump.
- Config file - no big problem. I can only type in one terminal at a time.
- A sniff/trace/log file - this seems to be the only possible issue.
What other real-world scenarios do we have to deal with?
@ian @grymoire
I think you are asking the right question. What are the scenarios that users want to enable? What would be the best user model for accomplishing them?
I think it is time to think from the top down (user → system), because while we have a good understanding of the low-level situation and possible solutions, we don't have a proposal for a decent user interface.
@henrygab
I agree that it is a minefield. I don't know how to count the number of mines.
Incidentally, I found a way to make one initiator work in a fairly similar way to how the current FW RW, host RO works.
Let's call it firmware RO/RW, host RO/Off.
Instead of letting the FW file-write operations happen while the medium is still host-mounted and then ejecting/re-inserting the medium, we can eject the medium at the beginning of the file function that performs writes and insert the medium back at the end. It is truly a one-initiator scheme with a transparent transition from RO→RW on the firmware side and RO→ejected on the host side. For short file operations, it would be the same experience as now.
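A sketch of what that could look like, using hypothetical media_eject()/media_insert() helpers that drive the SCSI media-change sequence (error handling simplified):

```c
#include "ff.h"   /* FatFs */

/* Hypothetical helpers that drive the SCSI media-change sequence. */
extern void media_eject(void);    /* host sees MEDIUM NOT PRESENT  */
extern void media_insert(void);   /* host re-enumerates the medium */

/* Perform a firmware write with the host excluded for its duration:
 * the volume never has two initiators with write access at once. */
FRESULT fw_write_file(const char *path, const void *data, UINT len)
{
    FIL f;
    UINT written;

    media_eject();                             /* host: RO -> ejected   */

    FRESULT res = f_open(&f, path, FA_WRITE | FA_OPEN_APPEND);
    if (res == FR_OK) {
        res = f_write(&f, data, len, &written);
        FRESULT res2 = f_close(&f);            /* flushes FatFs + Dhara */
        if (res == FR_OK)
            res = res2;
    }

    media_insert();                            /* host: ejected -> RO   */
    return res;
}
```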
I don't have a clue how to make the other scenario (host in control) palatable to the user. I don't think I have a good idea of the use cases that need to be supported.
Yes, ejection of the host media when the BP5 firmware...
1. opens a file handle with write permission; -OR-
2. performs its first write operation.

The ejection of the media must persist until...
3. the last handle with write permission is closed; -OR-
4. the last file handle is closed.

Options (1, 3) are only usable when the permissions associated with the open file are known at open time / tracked for the lifetime of the handle.
The options can be intermixed. A solution with options (2 & 3) would be the least negative. At a minimum, a solution with options (2 & 4) seems possible with the current firmware.
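A minimal sketch of the bookkeeping for the (2 & 4) combination, reusing the hypothetical media_eject()/media_insert() helpers (names and hooks are illustrative only):

```c
#include <stdbool.h>

extern void media_eject(void);
extern void media_insert(void);

int         open_handles  = 0;     /* all open file handles        */
static bool media_ejected = false; /* host currently sees no media */

void on_file_open(void)  { open_handles++; }

/* Option 2: eject at the first write, not at open. */
void on_file_write(void)
{
    if (!media_ejected) {
        media_eject();
        media_ejected = true;
    }
}

/* Option 4: re-insert only when the last handle is closed. */
void on_file_close(void)
{
    if (--open_handles == 0 && media_ejected) {
        media_insert();
        media_ejected = false;
    }
}
```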
This state is where the host has full control of the volume, but the firmware cannot access the volume. This would be an explicit mode switch (e.g., a command via the terminal). Ian has suggested this could also be triggered by a button during power-up (in case the terminal is not easily accessible, I think).
The command to transition to host-only access should:
- FAIL if the firmware holds any open handles, and print a corresponding message as to why it was rejected.
A manual command to transition back to firmware-only mode could also exist (or simply require a reboot of the BP5 device).
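For illustration only, such a command could look roughly like this, reusing the open-handle counter from the previous sketch (all names hypothetical):

```c
#include <stdio.h>
#include <stdbool.h>

extern int  open_handles;           /* from the handle-tracking sketch */
extern void media_insert_rw(void);  /* hypothetical: expose volume R/W to host */

static bool host_only_mode = false;

/* Terminal command: switch the volume to host-only access. */
bool cmd_host_only(void)
{
    if (open_handles > 0) {
        printf("Cannot enter host-only mode: %d file handle(s) still open\n",
               open_handles);
        return false;               /* FAIL, with the reason printed */
    }
    host_only_mode = true;          /* firmware stops touching the volume */
    media_insert_rw();
    return true;
}
```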
Can you confirm agreement with the command-based (or power-on w/button pressed) transition to Host-only mode?
I would be glad to do so (I have started on ff.c, probably 90% done), but I won't be near a computer for the next 4 weeks or so.
There are several places in ff.c: mkdir, setlabel, f_open, f_close, f_unlink, f_rename.
If you wish, I can commit what I have done so far to my personal GitHub repo for reference.
If I was using the BP and wanted to write a file onto it from the host, currently I would type '#'. The BP resets, my serial connection disconnects, and the file system remounts RW on the host. I copy the file over and reconnect. I'm personally fine with this behavior.
Perhaps the only downside is when I have a specific setup on the BP that would be difficult to recreate. If that were the case, I'd create a file on the host that contains the commands I need to recreate the current state. Normally it's just a few commands for me. But I could see that it might be complex in some cases. Others should speak up.
Crazy idea time - perhaps the BP could have the shell equivalent of a history mechanism that logs commands to a file. Or perhaps there could be a command to "save state" when exiting, and "restore state" when restarting.
Finally, there could be a way - on the host - to capture the characters on the BP terminal and store them in a file, then save them, copy them over, and store them in a start-up file. I could write a shell script that did that - I think.
The storage is mounted read/write. I could manipulate the filesystem as expected (read, write, append, delete files, make and delete directories).
- Open the serial I/O of the BP in minicom.
- Storage is auto-remounted read-only on the host.
- Filesystem changes previously made from the host are reflected in the serial terminal.
- One caveat - if I make the serial connection too soon after manipulating the storage from the host, those changes may be lost (the cached writes didn't finish before the BP yanked the storage from the host). This isn't a showstopper for me, just need a little patience:
```
[280971.108455] sd 1:0:0:0: [sdb] tag#0 access beyond end of device
[280971.108468] I/O error, dev sdb, sector 260 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 0
[280971.108474] Buffer I/O error on dev sdb1, logical block 2, lost async page write
[280971.108497] sd 1:0:0:0: [sdb] tag#0 access beyond end of device
[280971.108500] I/O error, dev sdb, sector 480 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 0
[280971.108503] Buffer I/O error on dev sdb1, logical block 57, lost async page write
[280971.108528] sd 1:0:0:0: [sdb] tag#0 access beyond end of device
[280971.108530] I/O error, dev sdb, sector 16928 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 0
[280971.108533] Buffer I/O error on dev sdb1, logical block 4169, lost async page write
[280971.108536] Buffer I/O error on dev sdb1, logical block 4170, lost async page write
[280971.108538] Buffer I/O error on dev sdb1, logical block 4171, lost async page write
[280971.108541] Buffer I/O error on dev sdb1, logical block 4172, lost async page write
```
- Close the serial I/O terminal in minicom.
- Storage is still read-only on the host. Even if I manually unmount/remount from the cmdline, it is still read-only. Seems like the only way to get it back to read/write on the host is to reset the BP. Again, not a showstopper.

All in all, I would be happy with this functionality. It is predictable and logical. I don't write to the storage from the host very often; copying an updated binary to write to flash is the only thing right now.
Personally, I don't run into this very often. If I do have something like that, I've probably already documented it and can just copy/paste the commands back in. Usually after a time or two of that, I begin to remember it anyway.
Thanks again for all the work on this; I know there are no simple solutions for this. @henrygab @grymoire