Linux bug report
OS: Linux Ubuntu 24.04.1 LTS
Terminal: tio v2.8
BP 6 Version: latest git - Firmware main branch @ unknown (2024-09-23T23:28:51Z)
Bug1) - An easy one - When connecting to the BP, and being asked about the VT100 mode, typing “n” does not generate a new line: VT100 compatible color mode? (Y/n)> nHiZ>
Bug2) Plug in the BP and connect via tio. Type “#”. The BP restarts, but the OS does not mount the file system at /media/$USER/BUS_PIRATE5.
Fix: unplug the BP and plug it back in.
Bug3) With VT100 color mode enabled, if you exit the BP using the “#” or “$” command, it does not clear the display. The bottom of the terminal still shows the last measurements of the BP, and they do not go away when text scrolls by. It is necessary to type the “clear” command to fix this.
Bug4) You cannot copy files to the /media/$USER/BUS_PIRATE5 file system while connected to the BP via tio. Oh boy do things get messed up. I can provide system logs if you like. You can copy files to and from when not connected, or you can access them while connected - One or the other but not both. Perhaps a fix is for the BP to unmount the filesystem when the user is connecting to it.
Nice. Do you want me to log these on GitHub for you? (this forum isn’t a great place to track issues)
Good bug catch.
There are a few problems being reported with the storage at the moment. If you have time to git bisect to help track down when this started, that would be helpful.
If I understand correctly, you want # and ` to send the VT100 / ANSI codes to disable the non-scrolling region used by the statusbar. Nice catch.
There are a few problems being reported with the storage at the moment. The way Ian designed this, the exposed storage volume switches to read-only (for the host) when the terminal is connected (by emulating a surprise removal, then re-insertion of the media with write-protect status). Thus, this is currently an intentional design decision.
As for the last one, the proposed solution doesn’t work. I connected to the BP in one terminal, and in a second terminal I copied a file onto the BP. The copy itself reported no error, but the system log reported all sorts of errors.
@ian – See last issue; May be relevant to storage corruption issue?
Awesome, thanks!
As that directly contradicts the intended way this should work, this should be logged as a bug. This is a serious problem, and likely is very relevant to one of the causes of the storage corruption other folks have reported under Linux. I hit a dead-end remotely debugging it on that other system, and (due to real life) will not be able to look into things for at least a week.
Would you be able to git bisect to find the last commit in which you were prevented from writing to the media?
There are serious problems with how the host is seeing the exposed storage volume, and it would be really helpful to know if a certain commit made the problem more noticeable (or was the source of a bug) … anything to target the search and reduce the complexity of tracking this down would be helpful.
I am not sure I can do a “bisect” - because I think I have to have a good firmware version and a bad version. I first reported this problem in May. But I will submit a bug report.
For the record, I wrote the code that implements the policy that Ian and I agreed on. I feel that it is probably a bug dating from day one of this implementation.
I can repro this on Windows sometimes, but never under the debugger.
I can repro if I use a terminal that auto-connects to the BP5 and then reset the BP5 (unplug and plug). This points to a race condition somewhere.
I see a possible race condition in the file msc_disk.c:140 and msc_disk.c:192.
It turns out that core0 calls line 192 and core1 calls line 140.
I am not sure this is the root cause of the problem but this race condition is not good. It needs to get fixed.
I take that back. I cannot really repro this on Windows. I found out that the instances where the file system was writable and the terminal said it was connected were an error of the terminal. It says it is connected but it isn’t. It does not respond and if I attach a debugger at this point, the code does not go in the path where the terminal is connected.
However, I reproed something slightly different: the wrong file system is mounted after a reset. It is the RAM file system that is used when there is no NAND flash available. You can tell because it contains only README.TXT and has a small capacity. The interesting thing is that, at this point, the terminal shows the NAND file system.
The Linux syslog reported this when I connected to the BP
2024-09-25T09:32:31.056263-04:00 laptop kernel: sd 0:0:0:0: [sda] Write Protect is on
2024-09-25T09:32:31.056267-04:00 laptop kernel: sd 0:0:0:0: [sda] Mode Sense: 03 00 80 00
2024-09-25T09:32:31.057192-04:00 laptop kernel: sda: detected capacity change from 191296 to 16
2024-09-25T09:32:31.141142-04:00 laptop kernel: sd 0:0:0:0: [sda] 47824 2048-byte logical blocks: (97.9 MB/93.4 MiB)
2024-09-25T09:32:31.142140-04:00 laptop kernel: sda: detected capacity change from 16 to 191296
The capacity change seems strange to me. Also - I have never (last 40 years) ever seen a Unix file system change its read/write status to read-only. Normally one has to remount the file system - that is - the OS initiates the change, not the hardware. Having the hardware change underneath is very odd. I would normally assume when a write request is made, it would cause an error return - Similar to a disk being full.
Is this what is intended when changing the capacity to 16 bytes? But it doesn’t stay at 16 bytes.
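For reference, a small sketch of the arithmetic behind those log lines. It assumes the kernel's "detected capacity change" values are counts of 512-byte sectors rather than bytes (that unit is an assumption here, although the numbers then agree with the 2048-byte block count in the same log), and that the third "Mode Sense" byte is the device-specific parameter whose top bit signals write protect on direct-access devices. The function names are made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: unit of the "capacity change" numbers


def sectors_to_bytes(sectors: int) -> int:
    """Convert a count of 512-byte sectors to bytes."""
    return sectors * SECTOR_BYTES


def blocks_to_bytes(blocks: int, block_size: int) -> int:
    """Convert a count of logical blocks of a given size to bytes."""
    return blocks * block_size


def write_protected(mode_sense_header: bytes) -> bool:
    """Check the WP bit (bit 7) of the device-specific parameter byte."""
    return bool(mode_sense_header[2] & 0x80)


full_size = sectors_to_bytes(191296)      # the large capacity in the log
small_size = sectors_to_bytes(16)         # the transient small capacity
reported = blocks_to_bytes(47824, 2048)   # "47824 2048-byte logical blocks"
header = bytes.fromhex("03008000")        # "Mode Sense: 03 00 80 00"

print(full_size, small_size, reported, write_protected(header))
```

Under that assumption, the "16" would be 16 sectors (8 KiB), not 16 bytes, and the large value matches the 97.9 MB the kernel reports for the volume.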
First, removable disks can be removed (a surprise removal, in Windows terms). That’s what’s happening here, because there is no way to switch from an RO drive to an RW drive without removing it and inserting it again.
About the capacity. This is not necessarily intended but it works this way because to fully execute the hand-off of the ownership of the FS from the host to BP, we remove the RW drive, insert it as RO. But after that we need to synchronize the file system of the BP with the NAND storage. We unmount/remount the BP file system. Right after the unmount, a SCSI query comes to the BP from the host and it answers with the RAM toy storage because the BP file system is unmounted.
I believe it would be less confusing to remove the drive from the host, sync the changes with the BP file system and then insert the drive (RO) into the host.
@henrygab
I made 2 significant changes to the implementation of the storage ownership.
One is resolving the race condition. I believe it isn’t a big deal because it does not affect storage corruption. I believe that reordering the flag latch_ejected with the main side effect solved it.
I also marked all of the variables in the file msc_disk.c that are accessed by both cores as volatile. It is more for the code readers/writers than for the compiler, as it can’t falsely elide reads in the current version of the code.
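To illustrate the general idea of ordering a flag relative to its side effect, here is a minimal sketch. It is not the code from msc_disk.c: only the name latch_ejected comes from the post, the post does not say which order was chosen, and Python threads stand in for the two cores. The sketch assumes the safe order is "perform the side effect first, then raise the flag", so that anyone who observes the flag also observes the finished side effect.

```python
import threading


class SharedDisk:
    """Hypothetical stand-in for state shared between two cores."""

    def __init__(self) -> None:
        self.backing = "nand"                   # hypothetical side-effect state
        self.latch_ejected = threading.Event()  # flag the other side waits on

    def eject(self) -> None:
        # Side effect first, flag second: an observer that sees the flag
        # must never see the old backing state.
        self.backing = "ejected"
        self.latch_ejected.set()


def observer(disk: SharedDisk, seen: list) -> None:
    """Wait for the flag, then record what the shared state looks like."""
    disk.latch_ejected.wait()
    seen.append(disk.backing)


def demo() -> list:
    disk = SharedDisk()
    seen: list = []
    worker = threading.Thread(target=observer, args=(disk, seen))
    worker.start()
    disk.eject()
    worker.join()
    return seen
```

In C on two cores, the same ordering would also need the appropriate memory barriers or atomics; volatile alone does not order the two writes as seen from the other core.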
The second change addresses the RAM FS being temporarily exposed to the USB host.
My repo with the proposed code changes is here. I would be grateful if you could take a look.
The existing design cannot prevent corrupted data. I warned about this months ago. You and Ian determined it was “good enough”. You cannot have a FAT volume that is writable by one initiator (firmware or host), while simultaneously having it readable by the other initiator. There WILL be data corruption on one side or the other, the only question being how difficult it will be to reproduce. If you want full details, let’s start a new thread.
Until there is a decision to avoid that situation, I will not sign off on changes to the storage stack … my name will not be associated with support of the broken architecture.
The following are the ONLY states that should be allowed for the media:
| Host | Firmware | Notes |
|------|----------|-------|
| None | None | Useful for intermediate states |
| RO | RO | Both firmware and host can only read |
| RW | None | Only host can read or write the media |
| None | RW | Only firmware can read or write the media |
Transitions between those states would be similar to the current hacks that occur when the user opens or closes the terminal. A full state machine diagram (with all transitions) would be created, and would have a media state (from the above list) associated with each state. Intermediate states would be necessary to reflect the need to report multiple SCSI SenseKey/ASC/ASCQ sequences for some of those transitions in the media state.
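The four allowed media states above can be sketched as a small lookup, shown here only to make the rule concrete. Everything in it is hypothetical, including the names and the choice to route every change through the None/None state (an assumption based on the "useful for intermediate states" note). It is not the state machine from the firmware.

```python
from enum import Enum


class Access(Enum):
    NONE = "None"
    RO = "RO"
    RW = "RW"


# (host access, firmware access) pairs taken from the list above.
ALLOWED = {
    (Access.NONE, Access.NONE),  # intermediate state
    (Access.RO, Access.RO),      # both sides read-only
    (Access.RW, Access.NONE),    # host owns the media
    (Access.NONE, Access.RW),    # firmware owns the media
}

DETACHED = (Access.NONE, Access.NONE)


def is_allowed(host: Access, firmware: Access) -> bool:
    """True if this combination is one of the four permitted states."""
    return (host, firmware) in ALLOWED


def plan_transition(current: tuple, target: tuple) -> list:
    """Return the media states to pass through, detaching both sides first."""
    for state in (current, target):
        if state not in ALLOWED:
            raise ValueError(f"media state not allowed: {state}")
    if current == target:
        return [current]
    path = [current]
    if current != DETACHED:
        path.append(DETACHED)
    if target != DETACHED:
        path.append(target)
    return path
```

For example, handing the media from the host to the firmware would pass through (RW, None), (None, None), (None, RW), so there is never a moment where one side writes while the other can still read.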
Should you and Ian decide that this transition in functionality is appropriate, let me know and I’ll help.
It wouldn’t be a problem if the disk was unmounted, and then remounted RO. But AFAICT, this doesn’t happen on a Linux system. The disk remains mounted. On my system, the OS plays a note when a system unmounts and another when it remounts. This does not happen when I connect to the BP.
The problem lies with two initiators (e.g., firmware and host) having a view of the file system, while allowing one of those initiators to modify the file system. The other (read-only) initiator may read invalid data.
The problems are even greater if both initiators are allowed to simultaneously modify the file system. Neither is aware of the changes that the other is making, and on FAT, the changes are across multiple clusters, so even a “magic” firmware cannot really synchronize.
Yes, this is one reason MTP was created…
Note: I will likely be offline for the next couple days… not ignoring stuff, just … real life.
Thank you all for the additional information and investigation. I will take a look at the proposed changes in your repo.
@grymoire I didn’t realize that bug was not resolved for you.
It seems like this may not be a valid way of doing things, even if it is seemingly working on Windows.
Short of implementing a full stack MTP solution, maybe we just need to have a terminal mode and store mode selected by the button at start-up. It’s a pity, but could be the temporary solution that gets everyone going again.
I’ll add this to my to-do list and continue to the storage architecture thread.
What I am doing now is simply remounting the BP file system as read-only before I connect.
It’s a work-around for now. It’s not a crisis.
I still see some strange behaviors - the file system is marked “dirty” and it tells me to fix it using fsck. And sometimes I have to reboot my laptop to reconnect. But it takes time to document exactly what the problem is. I will try when I have some more free time.
As folks following the various threads likely know, the cause of many of the corruption issues was likely due to the firmware not properly emulating all the necessary error codes that occur when media is ejected / re-inserted. As a result, the host did not always detect that the media changed (and thus did not realize it changed from R/W to R/O, or was mixing the 8k fake-disk cached sectors with the real NAND’s sectors).
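As a rough illustration of what emulating an eject and re-insert involves at the SCSI level, the sketch below lists the standard sense data a host typically sees for removable media. The key/ASC/ASCQ values are the standard SCSI ones; the sequence itself and the function name are assumptions for illustration, since the thread does not show which codes the firmware actually returns or in what order.

```python
from typing import NamedTuple


class Sense(NamedTuple):
    key: int
    asc: int
    ascq: int
    meaning: str


# Standard SCSI sense key / ASC / ASCQ combinations.
NOT_READY_NO_MEDIUM = Sense(0x02, 0x3A, 0x00, "NOT READY: MEDIUM NOT PRESENT")
UNIT_ATTENTION_MEDIA_CHANGED = Sense(
    0x06, 0x28, 0x00,
    "UNIT ATTENTION: NOT READY TO READY CHANGE, MEDIUM MAY HAVE CHANGED",
)
DATA_PROTECT_WRITE_PROTECTED = Sense(0x07, 0x27, 0x00, "DATA PROTECT: WRITE PROTECTED")


def reinsert_sequence(polls_while_absent: int) -> list:
    """Hypothetical sense sequence for 'media removed, then inserted again'.

    While the media is absent, each poll gets NOT READY. The first command
    after re-insertion gets UNIT ATTENTION, so the host discards its cached
    sectors and re-reads the capacity and write-protect state.
    """
    if polls_while_absent < 1:
        raise ValueError("host must see the media as absent at least once")
    return [NOT_READY_NO_MEDIUM] * polls_while_absent + [UNIT_ATTENTION_MEDIA_CHANGED]
```

If the host never sees the UNIT ATTENTION step, it has no reason to drop its cache, which is the kind of mixing of stale and real sectors described above.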
Kudos and accolades are due for @phdussud. I merged their PR 106. While it won’t prevent all corruption (storage is still multi-initiator), hopefully it will reduce many of the common cases.
Thanks @henrygab, but don’t thank me until you merge PR 109, which resolves your issues but more importantly fixes a deadlock introduced in PR 106.
Sorry about this folks!