2Gbit / 4Gbit SPI NAND flash (upgrade) chips

ian · July 2, 2024, 9:44am

Oh, and here is a compile if anyone has a 2Gbit board but isn’t set up to build the firmware.

If you’re running under debug, set a breakpoint in spi_nand.c at line 446, just after the chip is identified. Then look at the value of rx_data; it should be 0x2c 0x24.

ian · July 2, 2024, 11:38am

A blast of hot air got the chip properly seated. It’s returning the right codes now, but fatFS is not able to format it.

// public function definitions
DSTATUS diskio_initialize(BYTE drv)
{
    if (drv) return STA_NOINIT;			/* Supports only drive 0 */

    // init flash management stack
    int ret = spi_nand_init();
    if (SPI_NAND_RET_OK != ret) {
        printf("spi_nand_init failed, status: %d.", ret);
        return STA_NOINIT;
    }
    // init flash translation layer
    dhara_map_init(&map, &nand, page_buffer, 4);
    dhara_error_t err = DHARA_E_NONE;
    ret = dhara_map_resume(&map, &err);
    printf("dhara resume return: %d, error: %d", ret, err);
    // map_resume will return a bad status in the case of an empty map, however this just
    // means that the file system is empty

    // TODO: Flag statuses from dhara that do not indicate an empty map
    initialized = true;
    return 0;
}

In nand_ftl_diskio.c:

spi_nand_init() returns success! Yeah!
dhara_map_resume(&map, &err) returns [DHARA_E_TOO_BAD] = “Too many bad blocks”,

The comment says to expect an error on a blank chip, though it doesn’t specify which error.

{
	static const char *const messages[DHARA_E_MAX] = {
		[DHARA_E_NONE] = "No error",
		[DHARA_E_BAD_BLOCK] = "Bad page/eraseblock",
		[DHARA_E_ECC] = "ECC failure",
		[DHARA_E_TOO_BAD] = "Too many bad blocks",
		[DHARA_E_RECOVER] = "Journal recovery is required",
		[DHARA_E_JOURNAL_FULL] = "Journal is full",
		[DHARA_E_NOT_FOUND] = "No such sector",
		[DHARA_E_MAP_FULL] = "Sector map is full",
		[DHARA_E_CORRUPT_MAP] = "Sector map is corrupted"
	};

The list of possible error codes doesn’t really solve that mystery.

Formatting...
dhara resume return: -1, error: 3dhara resume return: -1, error: 3Error: filesystem not foundError: Format failed...
Error: filesystem not found

I enabled the printf error status in the disk read and write functions; they seem to give the same error.
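For anyone reading the log output above: the `error: 3` lines up with "Too many bad blocks" only if the error enum is numbered from 0 in the order of the message table. That numbering is an assumption here, though it agrees with the DHARA_E_TOO_BAD result reported above. A minimal sketch of decoding the number, using a local copy of the table with hypothetical `demo_` names (not the library's own functions):

```c
#include <stddef.h>
#include <stdio.h>

/* Local copy of the codes, ASSUMED to be numbered from 0 in table order. */
typedef enum {
    DEMO_E_NONE = 0,
    DEMO_E_BAD_BLOCK,
    DEMO_E_ECC,
    DEMO_E_TOO_BAD,
    DEMO_E_RECOVER,
    DEMO_E_JOURNAL_FULL,
    DEMO_E_NOT_FOUND,
    DEMO_E_MAP_FULL,
    DEMO_E_CORRUPT_MAP,
    DEMO_E_MAX
} demo_error_t;

/* Hypothetical helper: map a numeric code to its message, with a bounds check. */
const char *demo_strerror(int err)
{
    static const char *const messages[DEMO_E_MAX] = {
        [DEMO_E_NONE] = "No error",
        [DEMO_E_BAD_BLOCK] = "Bad page/eraseblock",
        [DEMO_E_ECC] = "ECC failure",
        [DEMO_E_TOO_BAD] = "Too many bad blocks",
        [DEMO_E_RECOVER] = "Journal recovery is required",
        [DEMO_E_JOURNAL_FULL] = "Journal is full",
        [DEMO_E_NOT_FOUND] = "No such sector",
        [DEMO_E_MAP_FULL] = "Sector map is full",
        [DEMO_E_CORRUPT_MAP] = "Sector map is corrupted"
    };

    if (err < 0 || err >= DEMO_E_MAX) {
        return "Unknown error";
    }
    return messages[err];
}

/* Format one log line. The trailing newline keeps consecutive messages
 * from running together on the terminal. */
int demo_format_resume_log(char *buf, size_t len, int ret, int err)
{
    return snprintf(buf, len, "resume return: %d, error: %d (%s)\n",
                    ret, err, demo_strerror(err));
}
```

Printing the message next to the number would make debug builds easier to read for people testing without a debugger.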

bus_pirate5_rev10-2gbit-debug.zip (179.7 KB)

This firmware has the debug statements enabled. It is entirely possible that all my blocks are bad; I did a really nasty job with the hot air. It would be great to see if anyone with one of the professionally replaced 2Gbit chips gets the same error/debug codes.

ian · July 2, 2024, 12:14pm

Wow, you have sharp eyes. I was trying to learn a bit more about the cache over lunch; this is the only thing I can find that shows there are independent caches for each plane.

I put a summary of the updates and the cache issue on social media with a link to your pull request. A lot of times someone will chime in with a solution.

BusPirateV5 · July 2, 2024, 5:04pm

Another 2Gb in the wild. Expect more activity here towards the weekend from a few of us. Thanks for the work already put in!

henrygab · July 2, 2024, 7:27pm

Thank you for testing this out!

That is interesting. I did change the bad block detection, but it should have found fewer bad blocks, not more. Would have to dig in a bit more to understand what is actually occurring. Unfortunately, as my hardware is still on order, my role for now will be to answer questions and learn as others play with this more.

Looking forward to learning from what others experience here.

henrygab · July 3, 2024, 4:00pm

OK, I’ve updated the CMakeLists.txt, and even earned bonus points.

ian · July 3, 2024, 4:06pm

You do go the extra mile!

henrygab · July 5, 2024, 11:03pm

This is entirely my error and the reason my branch of code will not work.

For folks following this thread:

I misread the layout of the NAND. Here’s the hierarchical view:

1 page == 2048 bytes + extra (aka “sector”)
1 block == 64 pages (or other 2^N value, but typically 64)
1 plane == 1024 blocks (or other value, typically 2^N)
1 LUN == N planes

Importantly, the erase occurs at the per-block (aka erase block) boundary. I can’t explain why, but my brain was equating the term “block” with “sector”. As a result, the code is much more complex than it should / needs to be, and just does stuff wrong.
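To make the hierarchy concrete, here is a minimal sketch of the arithmetic, using the "typical" figures quoted above (2048-byte pages, 64 pages per block, 1024 blocks per plane). The function names are hypothetical and are not taken from the firmware.

```c
#include <stdint.h>

/* Typical values from the hierarchy above; other parts may differ. */
#define PAGE_DATA_BYTES   2048u
#define PAGES_PER_BLOCK   64u
#define BLOCKS_PER_PLANE  1024u

/* A row address counts pages from the start of the device. */
uint32_t block_of_row(uint32_t row)
{
    return row / PAGES_PER_BLOCK;   /* which erase block the page is in */
}

uint32_t page_in_block(uint32_t row)
{
    return row % PAGES_PER_BLOCK;   /* page offset inside that block */
}

/* Erase works on whole blocks, so this is the smallest erasable unit
 * (data area only, ignoring the extra bytes per page). */
uint32_t block_data_bytes(void)
{
    return PAGE_DATA_BYTES * PAGES_PER_BLOCK;
}

/* Data capacity of a device with the given number of planes. */
uint64_t device_data_bytes(uint32_t planes)
{
    return (uint64_t)planes * BLOCKS_PER_PLANE * block_data_bytes();
}
```

With these numbers an erase block holds 128 KiB of data, and two planes come to 256 MiB, which matches a 2Gbit part.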

Working to fix this now…

henrygab · July 5, 2024, 11:38pm

And now the code is “more correct”. Still no hardware to test on myself, so I’m still flying blind.

The change was quite minimal, mostly because I already had a helper function get_page(row_address_t row). The good news is that my horrible hack for erase can be fully removed, because I think Dhara will “just work” on the multi-plane device.

NOTE: This will require re-initialization of the file system (relative to prior commits), because the page size is no longer being lied about as being 2x as large as it really is.

BusPirateV5 · July 6, 2024, 3:49am

Thanks! If @ian or anyone else doesn’t get to it in the next 12 hours, I’ll load it in the morning and provide feedback to the best of my ability.

*Note: I have not set up any hardware debugging fixtures just yet, but will when necessary. I need to print and assemble a few. I have the filament drying tonight in order to print tomorrow.

Are there any gotchas or additional information that I should be aware of before running tests on the hardware in the morning? If so, please let me know if you see this in time. I have more to troubleshoot and debug on my desk than I can handle as it is, so I’d prefer not to chase a bug that is already known or strongly suspected.

Thanks @henrygab & @ian for all the time and effort you’ve already put into this. I wasn’t expecting a head start when we discussed evaluating the potential of supporting this chip.

henrygab · July 6, 2024, 4:49am

This code was written without my having any hardware to test it against (my 2Gbit unit is ordered, but not here yet). Therefore, there is still a high likelihood of bugs.

There’s really not much to supporting two planes. The odd erase blocks are on one of the planes, the even on the other plane. For commands that did not already include the erase block being accessed, one extra bit is added to tell the NAND device which plane is the target of the command (e.g., reading from the cache register, because there are now two cache registers).
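A rough sketch of the plane selection described above. Two things here are assumptions, not taken from this thread: that odd blocks map to plane 1 (the post only says odd and even blocks sit on different planes), and that the plane-select bit sits directly above a 12-bit column address in the cache read/write commands. Please check both against the datasheet. The names are hypothetical.

```c
#include <stdint.h>

#define PAGES_PER_BLOCK   64u
#define COLUMN_BITS       12u                  /* ASSUMPTION: 12-bit column address */
#define PLANE_SELECT_BIT  (1u << COLUMN_BITS)  /* ASSUMPTION: plane bit sits just above it */

/* ASSUMPTION: even erase blocks are on plane 0, odd erase blocks on plane 1. */
uint32_t plane_of_row(uint32_t row)
{
    uint32_t block = row / PAGES_PER_BLOCK;
    return block & 1u;
}

/* Build the 16-bit address field for a cache command that carries no row
 * address: the column offset plus one extra bit naming the target plane. */
uint16_t cache_column_address(uint32_t row, uint16_t column)
{
    uint16_t addr = (uint16_t)(column & (PLANE_SELECT_BIT - 1u));
    if (plane_of_row(row) != 0u) {
        addr = (uint16_t)(addr | PLANE_SELECT_BIT);
    }
    return addr;
}
```

Commands that already carry the row address (page read, program execute, block erase) need no change under this scheme, since the block number itself implies the plane.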

For initial smoke-testing, no debugger required:

Flash the UF2 to the 2Gbit BP5 Rev10.
Run format, including erase (y), and confirm (y).
If that worked, then robocopy a tree of files to the BP exposed device (from the host). Make sure some files are >6k.
Unplug the device, then plug it back in.
robocopy that tree of files back to the host.
Compare the old (source) tree vs. the newly copied tree … and hope the files match up

(Note: robocopy.exe is a windows built-in for recursively copying files; use your favorite utility as appropriate… even xcopy would work…)

In another 10 days, I should have the hardware in-hand (if all goes well). At that time, I should be able to do more.

ian · July 6, 2024, 10:10am

Well, look what @henrygab did! Congratulations! Even without the hardware, that is difficult.

It shows in the info screen, and data persists after restart.

Does it seem feasible to roll up the defined IDs and length/widths into a const struct array and choose the chip based on the SPI_NAND_DEVICE_ID?
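Something along these lines is what I have in mind; a minimal sketch only. The struct layout and names are hypothetical, the 0x2c 0x24 pair is the ID reported earlier in the thread, and the 1Gbit device ID (0x14) is an assumption that should be checked against the datasheet.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-chip parameter record. */
typedef struct {
    uint8_t     manufacturer_id;
    uint8_t     device_id;
    uint8_t     planes;
    uint16_t    blocks_per_plane;
    const char *label;
} nand_chip_info_t;

static const nand_chip_info_t known_chips[] = {
    { 0x2c, 0x14, 1, 1024, "1Gbit" },  /* device ID 0x14 is an ASSUMPTION */
    { 0x2c, 0x24, 2, 1024, "2Gbit" },  /* ID pair reported earlier in the thread */
};

/* Return the table entry matching the two ID bytes read from the chip,
 * or NULL if the chip is not in the table. */
const nand_chip_info_t *nand_lookup_chip(const uint8_t id[2])
{
    for (size_t i = 0; i < sizeof known_chips / sizeof known_chips[0]; i++) {
        if (known_chips[i].manufacturer_id == id[0] &&
            known_chips[i].device_id == id[1]) {
            return &known_chips[i];
        }
    }
    return NULL;
}

/* Total erase blocks on the device. */
uint32_t nand_total_blocks(const nand_chip_info_t *chip)
{
    return (uint32_t)chip->planes * chip->blocks_per_plane;
}
```

An unknown ID returns NULL, so init can fail cleanly instead of guessing at the geometry.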

henrygab · July 6, 2024, 12:47pm

Woo-hoo! That’s awesome! Not often that there are such successes doing remote work without the hardware.

Maybe… let me look into it.

Unfortunately, when I first looked into doing this dynamically (and incorrectly thought the planes swapped per sector), it would have taken significant changes. I also need to check how much of Dhara requires hard-coded values (#define’d values), vs. being dynamically configurable.

Give me a bit, and I’ll get back to you on this question.

BusPirateV5 · July 6, 2024, 7:41pm

I was just wanting to better understand any areas of high probability that came to mind immediately. Thanks for everything!

henrygab · July 8, 2024, 1:41am

I could use some help setting up debugging in VSCode.

I am using the Pico Debug Kit.
I’ve got OpenOCD v0.12.0 installed and running, and can connect with gdb-multiarch.

In VSCode, I have the following launch config:

{
    "configurations": [
    {
        "name": "(gdb) rev10",
        "type": "cppdbg",
        "request": "launch",
        "program": "${workspaceFolder}/build/bus_pirate5_rev10.elf",
        "MIMode": "gdb",
        "stopAtEntry": true,
        "cwd": "${workspaceFolder}",
        "logging": {
            "engineLogging": true,
        },
        "miDebuggerPath": "/usr/bin/gdb-multiarch",
        // "preLaunchTask": "build-debug-cmake",
        "setupCommands": [
            { "ignoreFailures": true,  "text": "-enable-pretty-printing", },
            { "ignoreFailures": false, "text": "-environment-cd ${workspaceRoot}/build", },
            { "ignoreFailures": false, "text": "-file-exec-and-symbols ${workspaceRoot}/build/bus_pirate5_rev10.elf", },
            { "ignoreFailures": false, "text": "-interpreter-exec console \"target remote localhost:3333\"", },
            { "ignoreFailures": false, "text": "-interpreter-exec console \"monitor reset init\"", },
            { "ignoreFailures": false, "text": "-interpreter-exec console \"monitor halt\"", },
            { "ignoreFailures": false, "text": "-interpreter-exec console \"monitor arm semihosting enable\"", },
            { "ignoreFailures": false, "text": "-target-download", },
            { "ignoreFailures": false, "text": "-environment-cd ${workspaceRoot}", },
        ],
    },
    ]
}

This looks promising, but … it’s not quite right. What I want to happen is for it to flash the binary, reset, and stop at an initial breakpoint. I would expect VSCode to show a call stack, local variables, etc.

The debug shortcut panel appears, and seems to think things are running.

However, attempting to break (pause button) doesn’t do anything, and the following error message keeps appearing in the debug console:
->@"keep_alive() was not invoked in the 1000 ms timelimit. GDB alive packet not sent! (2324 ms).

Would really prefer to have proper debugging, but will continue with printf style debugging for now.

henrygab · July 8, 2024, 3:27am

The use of hard-coded values (#define’d values) for NAND parameters has wound its way into multiple files, so there’s more unwinding required than might first be guessed.

Subdirectory `nand/attic/`?

What’s the purpose of this code?

More specifically, can I ignore the fact that I would break this code, to enable dynamic selection of NAND parameters?

I think I’ve a solution for most cases, but something is awry, so I need to debug. This is going to take longer because I’m currently using printf-style debugging.

ian · July 8, 2024, 8:24am

If you’re using Windows, I recommend the PICO Windows Installer. I managed to get it set up before they had the installer, but it was a bit of a nightmare. Now I use the installer and it’s so much easier.

vscode.zip (3.6 KB)

Here are the contents of my .vscode folder, if that is useful.

/attic/ is unused code. It can be deleted or ignored. I probably kept it around as a reference when integrating the project into the Bus Pirate firmware.

henrygab · July 9, 2024, 12:19am

Well, I figured out the problem, and why the problem was so frustrating to root cause.

Turns out, the struct dhara_nand passed to dhara … well, dhara doesn’t copy the values of this small structure. Instead, it uses the caller’s structure directly (even though a pointer is ~ same size). In my adjusting to dynamic init, I placed this structure on the stack, which of course meant the location was being overwritten quickly, resulting in “random” behavior … of course it was crashing! And any change I made “near” where it crashed last would change the behavior… Now things are better.
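For anyone who hits the same thing, here is a stripped-down illustration of the lifetime issue. The `demo_` types and functions are hypothetical stand-ins, not the real Dhara API; the only behavior they copy is that init stores the caller's pointer rather than the values.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the parameter struct and the map object. */
struct demo_nand {
    uint8_t      log2_page_size;
    uint8_t      log2_ppb;
    unsigned int num_blocks;
};

struct demo_map {
    const struct demo_nand *nand;   /* keeps the caller's pointer, no copy */
};

void demo_map_init(struct demo_map *m, const struct demo_nand *n)
{
    m->nand = n;
}

/* Static storage: these objects outlive storage_init(), so the pointer
 * held by the map stays valid for the life of the program. */
static struct demo_nand nand_params;
static struct demo_map  map;

void storage_init(unsigned int num_blocks)
{
    /* The bug pattern was the equivalent of:
     *     struct demo_nand local = { 11, 6, num_blocks };
     *     demo_map_init(&map, &local);
     * where `local` dies when this function returns and its stack slot is
     * then reused by later calls. */
    nand_params.log2_page_size = 11;   /* 2048-byte pages */
    nand_params.log2_ppb       = 6;    /* 64 pages per block */
    nand_params.num_blocks     = num_blocks;
    demo_map_init(&map, &nand_params);
}

unsigned int storage_num_blocks(void)
{
    return map.nand->num_blocks;       /* read through the stored pointer */
}
```

The values can still be chosen at runtime; only the storage has to be long-lived.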

If we accept that only two revisions of the board exist, and thus exactly two NAND chips are supported (identical except for number of planes and OOB bytes per sector), then I should have a single-firmware solution fairly soon.

[ Of course, I can still only test the 1Gbit Rev10 board until my 2Gbit Rev 10 board arrives. ]

Future Options for your consideration:

~~Support for 4k-per-page NAND devices~~ (larger change… too much for now)
Stronger validation of the chip parameters (e.g., via parameter page, which lists the full chip ID string)

BusPirateV5 · July 9, 2024, 12:58am

Depending on the time, I may be able to run some tests tonight. If I take some time to figure out push notifications for the forum, then I won’t always be late to the party.

henrygab · July 9, 2024, 2:52am

OK, new PR 72 appears to work on the 1Gbit boards, and should dynamically work for the 2Gbit boards.

Would love to hear it works before I get my 2Gbit board.
