BPIO2 binary mode

ian · July 29, 2025, 3:23pm

Eliminated the intermediate read and write buffers. The protocol (SPI at this point) reads and writes directly to the flat buffer. This gave us a ~3KB/s speed increase, and removed a bunch of big buffers from the stack.

I also reduced the verbosity of the debug output so there are fewer if(debug) switches in critical paths (not sure it really helped, but take what you can get).

[Send Packet] Length 560
[Send Packet] COBS encoded buffer length: 564
[BPIO] Packet processed in 2986 us

COBS encoding is indeed taking a significant chunk of time.

    time_start = time_us_32();
    cobs_ret_t cobs_result = cobs_encode(buf, len, cobs_buf, sizeof(cobs_buf), &cobs_len);
    time_end = time_us_32();

~3000us just for the single line.

I wonder if it will be similar on the RX side when the packets grow in size (e.g., when programming a flash chip). Should probably test that next.

ian · July 29, 2025, 4:33pm

Turns out we have a default 2K heap, which we should probably be using more. flatbuffers is working in that.

henrygab · July 29, 2025, 7:56pm

Something does not add up. Walking through what cobs_encode() does, a back-of-napkin check suggests that 3000us is at least 4x slower than expected for creating 564 encoded bytes.

Encoding is < 90,000 instructions

Relevant part of bus_pirate5_rev10.elf.map, with my commentary in lines starting with #. Note that there is only a single loop … in cobs_encode_inc.

# Code sizes
 .text.cobs_encode_tinyframe
                0x00000000       0x5c CMakeFiles/bus_pirate5_rev10.dir/lib/nanocobs/cobs.c.o
 .text.cobs_encode_inc_begin
                0x00000000       0x24 CMakeFiles/bus_pirate5_rev10.dir/lib/nanocobs/cobs.c.o
 .text.cobs_encode_inc
                0x00000000       0x14 CMakeFiles/bus_pirate5_rev10.dir/lib/nanocobs/cobs.c.o
 .text.cobs_encode_inc_end
                0x00000000       0x24 CMakeFiles/bus_pirate5_rev10.dir/lib/nanocobs/cobs.c.o

# Let's presume this 0xA0 bytes is the loop (executes once per byte to encode)
 .text.cobs_encode_inc.part.0
                0x10004f54       0xa0 CMakeFiles/bus_pirate5_rev10.dir/lib/nanocobs/cobs.c.o
 .text.cobs_encode
                0x10004ff4       0x50 CMakeFiles/bus_pirate5_rev10.dir/lib/nanocobs/cobs.c.o
                0x10004ff4                cobs_encode
# 0x50 above is code size; next function is 0x50 bytes later
 .text.cobs_decode
                0x10005044       0xe4 CMakeFiles/bus_pirate5_rev10.dir/lib/nanocobs/cobs.c.o
                0x10005044                cobs_decode

Excluding the loop, this gives:

  0x5c
+ 0x24
+ 0x14
+ 0x24
+ 0x50
======= 
 0x108 instructions == 264 instructions

The loop for a 560 byte buffer:
160 (0xA0) instructions * 560 bytes == 89,600 instructions

If I reviewed the above correctly, this gives an upper bound of 89,864 instructions to encode a 560 byte buffer.

Estimate one instruction per cycle

cobs_encode() has no floating point, and no division. All instructions are simple increments, memory set, compare, branch, etc.

Branches can take three cycles, conditional branches can take two cycles, but those typically are countered by not executing the jumped-over instructions. Literal loads might take two cycles … but for back-of-napkin checks, I’ll just average everything to one cycle per instruction…

back-of-napkin analysis

An RP2040 at 120MHz has 120 * 10^6 clock cycles per second
Execution of 90 * 10^3 instructions

Calculating the seconds to encode 560 bytes:
(90 * 10^3) / (120 * 10^6) == 0.000750 seconds (750us)
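
The same estimate can be re-checked in a few lines of Python. This is only a sketch of the arithmetic above: the sizes are the byte counts from the map excerpt, counted as one instruction per byte (which overestimates, since Thumb instructions are at least two bytes each), and 120MHz is the clock assumed in this post.

```python
# Sizes (in bytes) of the non-loop COBS functions, copied from the .map excerpt.
NON_LOOP_SIZES = [0x5C, 0x24, 0x14, 0x24, 0x50]
LOOP_SIZE = 0xA0          # .text.cobs_encode_inc.part.0, presumed to run once per byte
PAYLOAD_BYTES = 560
CLOCK_HZ = 120_000_000    # assumption from the post: RP2040 at 120MHz

non_loop = sum(NON_LOOP_SIZES)
loop_total = LOOP_SIZE * PAYLOAD_BYTES
worst_case_instructions = non_loop + loop_total

# One cycle per instruction, as in the back-of-napkin estimate.
estimate_us = worst_case_instructions / CLOCK_HZ * 1_000_000

print(non_loop, loop_total, worst_case_instructions)
print(f"{estimate_us:.0f} us")
```

That lands just under 750us, against the roughly 3000us that was measured.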

Since you’re measuring about 4x a worst-case back-of-napkin estimate, it at least suggests something is stalling (memory latency, bus contention, or similar). Proper profiling tools may be helpful to better understand the actual cause … maybe the encoder can be updated to output in native memory size (32-bits) at a time instead of byte-by-byte (e.g., by working in a local size_t variable, then writing it as a single value … with edge case for the last few bytes to be written).

Before going there, however … maybe try a build where the compiler is allowed to fully optimize the code?

ian · July 30, 2025, 12:52pm

[Data Request] Returning read 512 bytes
[Send Packet] Length 560
[Send Packet] COBS encoded buffer length: 564
[BPIO] Packet processed in 2815 us

An optimized release build is a tad faster on average.

[Send Packet] Length 560
[COBS] 7us 146us 1us
[COBS] Encoded in 247us
[Send Packet] COBS encoded buffer length: 564

Woah, hum. I copied the cobs code to the bpio.c file so I could do some quick timing on it. This is much more appropriate. However, it’s still stuck at 38KB/s.

[BPIO] RX packet 955us
...
[Send Packet] Sent 560 bytes in 3192us
[BPIO] Handler function executed in 5043us

According to the timings, which I no longer really believe, the USB RX TX loops have slowed to a crawl.

It showed the same timing for COBS previously
Previously packets took ~3.5ms (3000us) and now they take almost double that, which does account for the cut in speed.

It is now 10+ms between packets with 256byte reads.

Previously it was 3.5ms for the same read.

Something seriously strange.

I’m going to roll back before COBS and see if the speed issue persists.

ian · July 30, 2025, 2:19pm

Previous version without COBS still reaches 88KB/s.

[COBS] Encoded in 65us

Moving back to the COBS version. Seem to be getting accurate timing on COBS encoding now.

Eureka! A significant delay was in the Python client. The serial port handling wasn’t great.

            self.serial_port.write(encoded_data)
            self.serial_port.write(b'\x00')  # COBS delimiter

versus

            packet = cobs.encode(data) + b'\x00'
            self.serial_port.write(packet)

I don’t know what Python is doing under the hood, but doing this in two steps made a lot of delay.
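
For anyone who wants to poke at this without hardware, here is a minimal, self-contained sketch of the single-write version. `FakePort` is a hypothetical stand-in for the pyserial port that only records `write()` calls, and `cobs_encode` is a small pure-Python encoder written just for this example (the real client uses the `cobs` package). It does not explain the delay; it only shows the framed packet going out in one call.

```python
def cobs_encode(data: bytes) -> bytes:
    """Encode data so the result contains no 0x00 bytes (basic COBS)."""
    out = bytearray([0])      # placeholder for the first code byte
    code_idx = 0              # where the current block's code byte lives
    code = 1                  # 1 + number of non-zero bytes in the current block
    for byte in data:
        if byte == 0:
            out[code_idx] = code      # close the block at the zero
            code_idx = len(out)
            out.append(0)
            code = 1
        else:
            out.append(byte)
            code += 1
            if code == 0xFF:          # block full: 254 non-zero bytes
                out[code_idx] = code
                code_idx = len(out)
                out.append(0)
                code = 1
    out[code_idx] = code
    return bytes(out)


class FakePort:
    """Hypothetical stand-in for a pyserial port; records each write() call."""

    def __init__(self):
        self.writes = []

    def write(self, data: bytes) -> int:
        self.writes.append(bytes(data))
        return len(data)


def send_packet(port, payload: bytes) -> None:
    """Frame the payload and hand it to the port in a single write() call."""
    packet = cobs_encode(payload) + b"\x00"   # 0x00 is the COBS delimiter
    port.write(packet)


port = FakePort()
send_packet(port, b"\x11\x22\x00\x33")
print(len(port.writes), port.writes[0].hex())
```

Swapping `FakePort` for the real serial port object should be all that is needed, since only `write()` is used.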

                # Read all available data at once
                available = self.serial_port.in_waiting
                if available > 0:
                    chunk = self.serial_port.read(available)
                    resp_encoded.extend(chunk)
                    
                    # Check if we have the complete message (contains delimiter)
                    delimiter_pos = resp_encoded.find(b'\x00')

Similarly, the old code was reading single bytes, which was dead slow. Reading everything available and searching for 0x00 is much more efficient.
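
The read side can be sketched the same way. This is an illustration under assumptions, not the client's actual code: `FakePort` is a hypothetical stand-in that mimics pyserial's `in_waiting` and `read()`, and `read_frame` with its `max_polls` limit is a name invented here (a real client would use a timeout instead of a poll counter). Leftover bytes after the delimiter are kept for the next call.

```python
class FakePort:
    """Hypothetical stand-in for a pyserial port, fed with canned chunks."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    @property
    def in_waiting(self) -> int:
        return len(self._chunks[0]) if self._chunks else 0

    def read(self, size: int) -> bytes:
        chunk = self._chunks.pop(0)
        if size < len(chunk):                  # put back what was not consumed
            self._chunks.insert(0, chunk[size:])
        return chunk[:size]


def read_frame(port, pending: bytearray, max_polls: int = 1000):
    """Return one COBS frame (without the 0x00 delimiter), or None if none arrives.

    `pending` holds bytes received after a delimiter so the next call can use them.
    """
    for _ in range(max_polls):
        delimiter_pos = pending.find(b"\x00")
        if delimiter_pos >= 0:
            frame = bytes(pending[:delimiter_pos])
            del pending[:delimiter_pos + 1]    # drop the frame and its delimiter
            return frame
        available = port.in_waiting
        if available > 0:
            pending.extend(port.read(available))   # one read for everything queued
    return None


port = FakePort([b"\x03\x11", b"\x22\x02\x33\x00\x01", b"\x01\x00"])
pending = bytearray()
first = read_frame(port, pending)
second = read_frame(port, pending)
print(first.hex(), second.hex())
```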

Moving forward…

It seems like we’re taking a 16KB/s hit with the COBS encoded method compared to the 2 byte header, but some may also be due to the python script still being inefficient.

It seems like a substantial hit, especially with the extra optimizations I’d done since the previous version.
Don’t really know where that hit is
RX is slower because we have to search through bytes for 0x00
Next I’ll do a flash write script and see how long RX packets are impacted.

ian · July 30, 2025, 4:01pm

Writing a 16MB flash chip takes about 8:45, the full erase, write, verify took 13 minutes. Looking at the logic analyzer there’s not much we could do to improve on the writes. The first status poll is WIP, the second is finished. If we could poll faster it might shave off some time, but I’m not sure if that is a USB limit or a Python issue.

I’m using flash speed as a proxy for “high throughput data” in general. The goal isn’t really to max out speed on flash chips. If we wanted to do that, we should make a specific mode optimized on flash chips with QSPI read/write if available.

I’m kind of tempted to do a high speed flash mode now that we’ve gone this far. The question is - a new binmode, or a sub-mode of BPIO2? This will need some thought to avoid building a dumpster fire.

henrygab · July 30, 2025, 4:11pm

To be fair, that’s not exactly an apples-to-apples comparison. If there’s a more efficient framing option, I’d be really interested to learn more.

More likely overhead is in flatbuffers. For example, I’ve read threads comparing protobuf vs. flatbuffers, and protobuf had faster encoding, while flatbuffers had faster decoding.

A couple avenues to explore here:

USB Device Descriptors

IIRC, there’s something in the USB Device Descriptor that indicates the maximum polling rate, but I can’t recall the specifics right now. For example, some “high performance” gaming mice set this to small values to indicate their data changes very rapidly. Maybe it was only in a HID descriptor, or on each interface?

I think the default is 10ms … if that matches what you see, this might be a worthwhile area to investigate. Note, however, that this means there will be more overhead from the host asking for USB updates … a profiler could help understand how much overhead those extra pings from the host would add…

Baud rate … really!

Not as crazy as it may sound, have you tried opening the serial port with settings that indicate a much higher baud rate?

henrygab · July 30, 2025, 4:18pm

If you really want high-speed data throughput, there are better options than USB CDC. In fact, it might be relatively easy to change the underlying transport for flatbuffers to use one of those other options … and if limiting packet size to 64k or less, might not even need COBS. The downside would be on the client side … enumeration of USB bulk interfaces (for example) is sometimes a bit more opaque (and certainly less universal) than using CDC.

My understanding is that the bus pirate was first and foremost targeting hackers, so the client usability was the key factor (over raw performance).

If you’re really looking to improve performance, there’s nothing as powerful as hard data on where time is actually spent. Segger’s profiling suite is under $2k, IIRC. Might be worth a time/cost analysis, or evaluation of their (or similar) profiling tool?

robjwells · July 30, 2025, 8:14pm

Not had nearly as much time as I’d expected over the past week to work on the Rust client, sorry about that, but will be able to from tomorrow.

@ian is there anything that would be particularly helpful for you, on the host side, for the client to have implemented and exercised sooner?

ian · July 30, 2025, 8:48pm

No worries at all, this is all for fun.

I’d look for a COBS crate and try the read and write speed from Rust. If you have a flash adapter the time to read an SPI flash chip would be interesting. If not I can send you an adapter and chip.

@henrygab I have a reply in progress but won’t be able to finish testing until tomorrow.

AreYouLoco · July 30, 2025, 9:47pm

Someone mentioned this on local HS channel:

C++ but licensed MIT
Oww there is C version

Is this even related?

ian · July 30, 2025, 10:15pm

It is the project of the protobuffer developer. Flat buffer solves a lot of this, but there are a lot of good improvements. The issue is that the tooling isn’t expansive or actively maintained.

masterX244 · July 31, 2025, 6:58am

AFAIK the baud rate doesn’t matter on the RP2XX, since it only pretends to be a serial port, and once data is in the send buffer it’s already received.

ian · July 31, 2025, 2:24pm

USB device descriptors - I didn’t find anything about timing. I saw a USB high speed option which took me down a rabbit hole of realizing that’s actually USB high speed and not a software setting.

Port speed - I had previously tried 3000000 baud, but then set it back to 115200. I tested again, but unfortunately it doesn’t help.

I’m more theorizing on how a high speed SPI flash chip read/write might work. A tailored interface instead of a one-size-fits-all generic control protocol.

Bus Pirate is SPI flash aware (through the SFUD library we already use)
Instead of request-wait-response, stream data from host to Bus Pirate buffer (or vice versa) until buffer is full. Then send periodic requests for more data as needed. This removes all the protocol delay between read/writes which would improve speed significantly.
Implement QSPI interface supported by most modern flash. 4x more data, or the same data at a quarter the bus speed.
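
A rough way to see why streaming should help is a toy timing model. Everything below is hypothetical: the function names are invented for this sketch and the numbers are placeholders, not measurements from this thread. It only compares paying the protocol turnaround once per chunk against paying it once per transfer.

```python
def request_response_time(total_bytes, chunk_bytes, turnaround_s, bytes_per_s):
    """Every chunk pays one full host/device turnaround before it moves."""
    chunks = -(-total_bytes // chunk_bytes)          # ceiling division
    return chunks * turnaround_s + total_bytes / bytes_per_s


def streaming_time(total_bytes, turnaround_s, bytes_per_s):
    """Idealised streaming: one turnaround to start, then the link stays busy."""
    return turnaround_s + total_bytes / bytes_per_s


# Hypothetical numbers for illustration only.
TOTAL = 16 * 1024 * 1024      # 16MB chip
CHUNK = 512                   # bytes per request
TURNAROUND = 0.002            # 2ms of protocol delay per request
LINK = 500_000                # raw link rate in bytes per second

slow = request_response_time(TOTAL, CHUNK, TURNAROUND, LINK)
fast = streaming_time(TOTAL, TURNAROUND, LINK)
print(f"request-wait-response: {slow:.1f}s, streaming: {fast:.1f}s")
```

In this model the gap is just the per-request turnaround multiplied by the number of requests, so the smaller the chunks, the more streaming would matter.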

ian · July 31, 2025, 3:40pm

Currently there is a 500ns delay between write bytes. This is while the next byte is put into SPI peripheral.

Optimizing write to use the full 8 byte SPI buffer drops that down to 208ns.

However, that has absolutely no impact on the total write time. Not even a 1 second difference for the 16MB chip.

That’s because the real delay is the hit to the status register, write enable, then reloading bytes. If the Bus Pirate could check status register and handle write enable the write time would improve considerably.
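
The sequence in question is the usual SPI NOR page program flow: write enable, page program, then poll the status register until the write-in-progress (WIP) bit clears. A rough sketch, assuming the common opcodes (0x06 write enable, 0x02 page program, 0x05 read status) and a hypothetical `transfer(write_bytes, read_len)` callable standing in for whatever actually talks to the chip. `FakeFlash` is invented here only so the example runs.

```python
WREN, PAGE_PROGRAM, READ_STATUS = 0x06, 0x02, 0x05   # common SPI NOR opcodes
WIP_BIT = 0x01                                        # status bit 0: write in progress


def program_page(transfer, address: int, data: bytes, max_polls: int = 10_000) -> int:
    """Program one page (up to 256 bytes); return how many status polls were needed."""
    transfer(bytes([WREN]), 0)
    header = bytes([PAGE_PROGRAM]) + address.to_bytes(3, "big")
    transfer(header + data, 0)
    for polls in range(1, max_polls + 1):
        status = transfer(bytes([READ_STATUS]), 1)[0]
        if not status & WIP_BIT:
            return polls
    raise TimeoutError("flash stayed busy")


class FakeFlash:
    """Hypothetical chip model: reports busy for a fixed number of status reads."""

    def __init__(self, busy_polls: int):
        self.busy_polls = busy_polls
        self.memory = {}
        self._write_enabled = False
        self._busy_left = 0

    def transfer(self, write_bytes: bytes, read_len: int) -> bytes:
        opcode = write_bytes[0]
        if opcode == WREN:
            self._write_enabled = True
        elif opcode == PAGE_PROGRAM and self._write_enabled:
            address = int.from_bytes(write_bytes[1:4], "big")
            self.memory[address] = write_bytes[4:]
            self._write_enabled = False        # write enable clears after a program
            self._busy_left = self.busy_polls
        elif opcode == READ_STATUS:
            busy = self._busy_left > 0
            self._busy_left = max(0, self._busy_left - 1)
            return bytes([WIP_BIT if busy else 0])
        return bytes(read_len)


flash = FakeFlash(busy_polls=1)
polls = program_page(flash.transfer, 0x000100, b"\xAA" * 256)
print(polls)
```

With request-wait-response, each of those `transfer` calls is a separate trip between host and Bus Pirate. Running the whole loop on the Bus Pirate side would leave one trip per page.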

Putting the flash stuff on the back burner for now. Will finish clean up on the bpio and docs.

henrygab · July 31, 2025, 8:51pm

Digging into this more, I found a reference to the polling interval that I was searching for.

From STM Communities, they reference:

HID_Handle->poll = phost->device.CfgDesc.Itf_Desc[interface].Ep_Desc[0].bInterval

Decoding this slightly:

The device has a configuration descriptor.
The configuration descriptor has one or more interfaces
Each interface can have one or more endpoints (data in / out).
The endpoint descriptor can indicate the polling interval it supports (bInterval).
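
To make the layout concrete, here is a small sketch that unpacks a standard 7-byte endpoint descriptor and pulls out bInterval. The function name and the example bytes are made up for illustration; they are not taken from the Bus Pirate's descriptors.

```python
import struct

ENDPOINT_DESCRIPTOR_TYPE = 0x05


def parse_endpoint_descriptor(raw: bytes) -> dict:
    """Unpack a standard 7-byte USB endpoint descriptor."""
    (length, desc_type, address, attributes,
     max_packet_size, interval) = struct.unpack_from("<BBBBHB", raw)
    if length < 7 or desc_type != ENDPOINT_DESCRIPTOR_TYPE:
        raise ValueError("not an endpoint descriptor")
    transfer_types = ("control", "isochronous", "bulk", "interrupt")
    return {
        "number": address & 0x0F,
        "direction": "in" if address & 0x80 else "out",
        "transfer_type": transfer_types[attributes & 0x03],
        "max_packet_size": max_packet_size & 0x07FF,
        "bInterval": interval,
    }


# Made-up example bytes: interrupt IN endpoint 1, 8-byte packets, bInterval = 10.
example = bytes([7, 0x05, 0x81, 0x03, 0x08, 0x00, 10])
parsed = parse_endpoint_descriptor(example)
print(parsed)
```

One thing worth checking before tuning anything: bInterval is what interrupt and isochronous endpoints use to request a polling interval, while a CDC serial port normally moves its data over bulk endpoints, so the field may not govern the data path here.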

Note, however, that the operating system and the USB controller are also critical. If there are multiple devices on a USB 2 controller, and you have a device that’s set to poll every 1ms (maximum polling rate), that will impact the performance of every other peripheral on that USB2 controller. Plug into a USB2 controller with no other peripherals (or maybe a USB3 controller … verification required) for best results when using low polling times (fast polling).

Oh, and it will cause more processing on the firmware, too … maybe irrelevant, but I have no data on the CPU overhead for simple polling…

robjwells · August 5, 2025, 3:51pm

SPI is now implemented in the Rust client. Here are the numbers for dumping EEPROM and SPI flash, using 512-byte reads, averaged over 10 runs.

25CSM04 (4 Mbit EEPROM, 8 MHz): 6.653s ± 0.022s sd

W25Q32 (32 Mbit NOR flash, 62.5 MHz): 44.484s ± 0.340s sd

Edit: 79 kB/s and 94 kB/s respectively for the EEPROM and flash.
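
For reference, those rates follow directly from chip size and dump time. A quick sketch, assuming 1 Mbit = 1024 * 1024 bits and 1 kB = 1000 bytes (the function name is just for this example):

```python
def throughput_kb_per_s(size_mbit: int, seconds: float) -> float:
    """Chip size in Mbit and dump time in seconds -> throughput in kB/s."""
    size_bytes = size_mbit * 1024 * 1024 // 8
    return size_bytes / seconds / 1000


eeprom = throughput_kb_per_s(4, 6.653)     # 25CSM04
flash = throughput_kb_per_s(32, 44.484)    # W25Q32
print(f"{eeprom:.0f} kB/s, {flash:.0f} kB/s")
```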

Note this is with firmware 070b051, not the latest, as I need to put the Bus Pirate into HiZ mode on connection for important but boring reasons. And bear in mind the wiring setup is: Bus Pirate, probe cable, breadboard, breakout board, SOIC chip.

Boring reasons

The driver struct behaves in different ways depending on the active mode, for instance the I2C methods are only available (literally only implemented) when it is in I2C mode. This is achieved through a generic type parameter (BusPirate<I2c>).

This generic parameter has to be statically known, so on connection the driver puts the Bus Pirate into HiZ mode and the user always gets back a BusPirate<HiZ>.

Unfortunately the type-system requirement that this be statically known means I can’t query the Bus Pirate and just return the struct set to whatever the active mode is.

Second edit: I realise now that the commit message with the latest firmware is “cobs implemented, slow”. Was cobs not implemented before this one? I’ve been sending cobs encoded packets and reading the response bytes with a cobs decoder, and everything has worked as expected.

ian · August 5, 2025, 6:16pm

Depends on the branch you’re on. I merged cobs into main. If you’re on cobs it’s probably the same. I’ll push tweaks to cobs before merging into main.

I need to start a repo for tooling, but I’m a bit slow at the moment.

robjwells · August 5, 2025, 6:28pm

Just the last-but-one main build.

bz2 · August 14, 2025, 4:35pm

I’m trying out the Python BPIO interface and am struggling to get it to work.

Setup:

Connect Bus Pirate 5
Update Bus Pirate 5 to latest firmware
Confirm Bus Pirate firmware:

VT100 compatible color mode? (Y/n)> n
HiZ> i

This device complies with part 15 of the FCC Rules. Operation is subject to the following two conditions:
(1) this device may not cause harmful interference, and
(2) this device must accept any interference received, including interference that may cause undesired operation.

Bus Pirate 5 REV10
https://BusPirate.com/
Firmware main branch @ cfddcf8 (Aug 14 2025 12:26:04)
RP2040 with 264KB RAM, 128Mbit FLASH
S/N: <redacted, probably unnecessarily :D >
Storage:   0.10GB (FAT16 File System)

Configuration file: Loaded
Active binmode: SUMP logic analyzer
Available modes: HiZ 1WIRE UART HDUART I2C SPI 2WIRE 3WIRE DIO LED INFRARED JTAG
Active mode: HiZ

Display format: Auto

HiZ>

Clone BusPirate5-firmware in main branch (current commit cfddcf850c8539a636e686c50589fd5f0e710e64).
cd hacks/flatpy
Execute the following in an interactive python3 session.

> python3
Python 3.13.6 (main, Aug 14 2025, 10:05:59) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bpio_client import BPIOClient
>>> client = BPIOClient("/dev/serial/by-id/usb-Bus_Pirate_Bus_Pirate_5_5buspirate-if02")
>>> client.show_status()
Timeout waiting for response
Failed to get status information.
>>>

Expected result: Some status information

Actual result: Bus Pirate 5’s LEDs turn red, and above timeout message is printed.

Am I missing something arcane and/or fundamental?

Edited to add:
I’m working in a virtualenv, in which I ran:

pip3 install cobs
pip3 install pyserial