Issue #244: Simplify command line input parsing

Simplify command line input parsing:
Currently, there is support for writing multiple commands in a single input line.

Specifically, if there is a ; character anywhere in the command (See also #243), that indicates the end of the current command line. Similarly, if there is a && or || anywhere in the command, that indicates the end of the current command line.

The command line parser does not tokenize the inputs. Therefore, if a command ever needed to accept either the ; character as an input, or the && or || character sequences as inputs, there is currently no way for that to work.


Example scenario

Steps to reproduce:

  • A USB or serial plank for injecting / automating inputs
  • Commands include xplank type "<string>" to send the string as input keystrokes
    • e.g., xplank type "Hello, World!"
  • Consider a script using the above to recover a forgotten password (brute-force)
  • Initial testing with pattern [a-e0-3@#]{4,8} works
  • Host scripting starts sending commands using pattern [a-zA-Z0-9;:!@#$%^&*\(\)]{1,8}
    • xplank type "Hello, World!"
    • xplank type "abc;123"
    • xplank type "Foo; Bar; Baz"

Expected results:

Three strings are sent via the plank:

  • Hello, World!
  • abc;123
  • Foo; Bar; Baz

Actual results:

The strings sent are:

  • Hello, World! (good)
  • abc (bad … truncated!)
  • Foo (bad … truncated!)

Why these examples break

cmdln_find_next_command(), when parsing the input line, gets to the ; character, and decides that it is the end of the current command line.

In essence, the following three lines of input:

  • xplank type "Hello, World!"
  • xplank type "abc;123"
  • xplank type "Foo; Bar; Baz"

Get converted into the following six commands:

  • xplank type "Hello, World!" (valid, expected)
  • xplank type "abc (error - truncates output)
  • 123" (error - not a valid command)
  • xplank type "Foo (error - truncates output)
  • Bar (error - not a valid command)
  • Baz" (error - not a valid command)

Next post will be reserved for recommended action.

Issue opened by: henrygab

<to be filled with recommended action, after forum discussions>

Comment by: henrygab

Well… I was going to post the contents of this on forum to get folks’ thoughts. Looks like it was pulled in automagically.

So… I see a few options here, and would be interested to hear other options:

1. Limit input to one command per line

This would also add support for special command prefixes `and` and `or` to provide equivalent functionality to conditional chaining via `&&` and `||`.

Pros: Easy to implement. Avoids restrictions on characters in input.

Cons: If existing scripts relied on chaining, they would need to be updated?

2. Require prefixing of ;, &, | and \ characters with backslash \.

Pros: Somewhat easy to implement.

Cons: This type of escaping can multiply. For example, sending via netcat may require escaping certain sequences so the shell (e.g., bash or cmd.exe) doesn’t interpret them, and again need escaping at the buspirate parser … so a single \ might become \\\\ to work as expected. This option would also break existing scripts that include those characters.

3. Update command parser to fully tokenize input

This includes tracking opening and closing quotes, embedded quotes with leading backslashes, etc. … and then require commands to use quoted inputs (and escaped quote / backslash).

Pros: Most comprehensive solution

Cons: Complex to implement, lots more state when parsing command inputs.

4. Leave it as-is.

Pros: Simplest solution. :wink:

Cons: Likely just defers the pain until a later date.

My current leaning

I’d lean towards the “one command per line” option, as it’s super-easy to implement, and removes edge cases (rather than adding them). Adding support for the `and` and `or` command prefixes would also be relatively simple to implement.

1 Like

I lean towards applying a hack to ignore those characters in strings in the parser.

Pros

  • no documentation and demo updates
  • intuitive for Linux users
  • we already have hacky code doing it in the syntax parser
  • no change for end user
  • can stay in place until we are ready to overhaul the command line parser.

Cons:

  • an extra couple of hacky lines of code; it’ll be dirty, but it can hold things together for the time being.

During the scan of the command line input, we can just set a “string active” flag on the first `"` and clear it on the next. Ignore the operand switch while the flag is true.
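The flag-based scan could be sketched as follows (a minimal sketch; `find_next_separator` is a hypothetical name, not an actual firmware function, and only the `;` separator is handled here, not `&&`/`||`):

```c
#include <stdbool.h>

// Hypothetical helper: return the offset of the next command separator
// (';') that is NOT inside a quoted string, or -1 if none is found.
// A "string active" flag toggles on each '"' seen, and the separator
// check is skipped while the flag is true.
static int find_next_separator(const char *line)
{
    bool string_active = false;
    for (int i = 0; line[i] != '\0'; i++) {
        if (line[i] == '"') {
            string_active = !string_active;  // open/close a string
        } else if (!string_active && line[i] == ';') {
            return i;                        // separator outside quotes
        }
    }
    return -1;
}
```

With this, `xplank type "abc;123"` is no longer split at the embedded semicolon.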

Parser test strings

| input line | expected tokens | notes |
|---|---|---|
| `cmd1 foo` | cmd1, foo | |
| `cmd1 "foo and bar"` | cmd1, foo and bar | |
| `cmd1 "foo` | cmd1, foo | |
| `cmd1 \"foo bar` | cmd1, "foo, bar | |
| `cmd1 "\"foo" bar` | cmd1, "foo, bar | |
| `cmd1 "\"foo` | cmd1, "foo | |
| `cmd1 foo\ and\ bar"` | cmd1, foo and bar | |
| `cmd1 foo \ bar` | cmd1, foo, , bar | escaped space == token char |
| `cmd1 "foo ; and bar"` | cmd1, foo ; and bar | |
| `cmd1 "foo && bar"` | cmd1, foo && bar | |
| `cmd1 foo\ \;\ and\ bar` | cmd1, foo ; and bar | |
| `cmd1 foo\ \&\& bar` | cmd1, foo && bar | variant 1 |
| `cmd1 foo\ \&& bar` | cmd1, foo && bar | variant 2 |
| `cmd1 foo\ &\& bar` | cmd1, foo && bar | variant 3 |
| `cmd1 \;foo` | cmd1, ;foo | |
| `cmd1 foo; cmd2 bar && cmd3 baz` | cmd1, foo<br>cmd2, bar<br>cmd3, baz | chained |
| `cmd1 "fo;o"; cmd2 ba\;r; cmd3 "b\"az"` | cmd1, fo;o<br>cmd2, ba;r<br>cmd3, b"az | stress test 1 |
| `cmd1 "\"foo\" \"and\" \"bar\"" -- fun` | cmd1, "foo" "and" "bar" | stress test 2 |

Any example above using && should be replicated with ||. Not shown above because some markdown parsers have issues with | characters in tables.

1 Like

looks like there’s more state required when parsing:

  • Is the next character escaped? (set to true when the current character is not escaped and is \)
  • Within quotes? (toggled when the current character is … see options below)

Question: how should the following be parsed:

cmd1 foo"bar and"baz

Option 1: Quotes can start/end anywhere:
gives tokens: cmd1, foobar andbaz

Option 2: Quotes can only start at token boundary
gives tokens: cmd1, foo"bar and"baz

Which is preferred?

1 Like

One more edge case to understand…

Given the following command line:

cmd1 foo ; && cmd3 bar

This should parse into three commands:

  • cmd1, with single arg foo
  • an empty command
  • cmd3, with single arg bar

But, for purpose of determining (&&) whether the third command is executed…

  1. Is the prior command’s result based on cmd1, or is it based on the empty command?
  2. Is an empty command considered a successful result or a failed result?

My guesses:

  1. An empty command is still a command, and so the empty command’s result is used.
  2. An empty command is valid. Therefore, there is no error. Therefore, an empty command has a successful result.

???
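Under those two guesses, the chaining rules could be sketched like this (hypothetical names; the firmware has no such functions today):

```c
#include <stdbool.h>

typedef enum { CHAIN_SEMICOLON, CHAIN_AND, CHAIN_OR } chain_op_t;

// Guess #1: the && / || decision always uses the immediately prior
// command's result, even when that prior command was empty.
static bool should_run_next(chain_op_t op, bool prior_succeeded)
{
    switch (op) {
    case CHAIN_AND: return prior_succeeded;   // && runs only on success
    case CHAIN_OR:  return !prior_succeeded;  // || runs only on failure
    default:        return true;              // ';' runs unconditionally
    }
}

// Guess #2: a command consisting only of whitespace is valid; it runs
// as a no-op and its result is treated as success.
static bool is_empty_command(const char *cmd)
{
    while (*cmd == ' ' || *cmd == '\t')
        cmd++;
    return *cmd == '\0';
}
```

So for `cmd1 foo ; && cmd3 bar`, the empty middle command succeeds, and `cmd3 bar` runs.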

1 Like

According to historical practice and the docs (:melting_face:), non-numerical characters act as a de facto delimiter. So `foo"string"bar` is three tokens.

One missing thing is indeed the escaping of `"` within a string.

These goals seem manageable in the short term.

1 Like

Another option is to allow any arbitrary hex character to be inserted in a string.
A semicolon is hex 3B, so there would need to be a way to indicate a hex character in a string.

The first thought would be to use a backslash, e.g.:

`"Hello\0x3B Goodbye"`

But if the string already uses a backslash, then that’s a problem.

The advantage of course is that it could be used to allow any character in a string.

2 Likes

I think a first pass will be along the following lines:

  • Allow quoted strings
    • Quoted strings may contain any character
  • Special handling
    • \\ - treated as a single backslash character
    • \" - treated as a single quote character
    • \ followed by anything else (for now) … parsing error
  • Optional:
    • Allow the special handling only in quoted strings (minimize impact on existing scripts)
    • Allow any hex data via \x##, where each # means a hex digit [0-9A-Fa-f]. Requires exactly two hex digits per \x.
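That special handling could be sketched as a small unescape pass (a hypothetical helper, assuming the optional `\x##` form is included):

```c
#include <stdbool.h>
#include <stddef.h>

static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;  // not a hex digit
}

// Unescape src into dst: \\ -> '\', \" -> '"', \xHH -> byte 0xHH
// (exactly two hex digits). Any other \<char> is a parse error.
static bool unescape(const char *src, char *dst, size_t dst_len)
{
    size_t out = 0;
    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];
        if (c == '\\') {
            char next = src[++i];
            if (next == '\\' || next == '"') {
                c = next;
            } else if (next == 'x') {
                int hi = hex_digit(src[i + 1]);
                int lo = (hi >= 0) ? hex_digit(src[i + 2]) : -1;
                if (lo < 0) return false;   // \x needs two hex digits
                c = (char)((hi << 4) | lo);
                i += 2;
            } else {
                return false;               // unknown escape: error
            }
        }
        if (out + 1 >= dst_len) return false;  // leave room for NUL
        dst[out++] = c;
    }
    dst[out] = '\0';
    return true;
}
```

For example, `\x3B` decodes to a literal semicolon without ever exposing a raw `;` to the command splitter.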

The command parsing currently … appears to target a much more constrained environment, and is a place that can greatly benefit from a slightly different model…

1 Like

Maybe I should start a sticky thread in development with some milestones and guesses when they will happen? The parser is definitely in need of modernization.

The ability to chain commands appears to have many bugs.

The complexity of the current code adds a number of edge cases, and various code has assumptions that result in unexpected errors when commands are not the first command.

@ian, would you consider allowing the command chaining functionality to be disabled for now? I think the complexity of the current model is hiding even more issues that I’ve not yet found.

@ian, would you support the following minimal hack to greatly simplify the command parsing logic?


Minimal hack to improve API

Four changes:

  1. When adding a character to the command line buffer, check if the character being overwritten at the end is zero (null).
    • If so, overwrite the following contiguous non-zero characters (with zero/null), to avoid partially-overwritten history items.
  2. In pirate.c, right after calling ui_process_commands() (or automatically, by wrapping that existing function)…
    • “rotate” the command line buffer so that the read and write offsets are at zero again
    • Determine maximum length of all history items, starting after the first null in the buffer
    • Zero buff[0..(maximum_length-1)]
    • If buff[maximum_length] is not null, also zero until the next null character
  3. Fix prompts to use separate buffer. :slight_smile:
  4. Fix scripts to use separate buffer. :slight_smile:
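The “rotate” in step 2 could be sketched like this (a rough sketch; the buffer size and function name are hypothetical, and the temporary copy trades a little stack for simplicity):

```c
#include <stddef.h>
#include <string.h>

#define CMDLN_BUF_SIZE 512  // illustrative size, not the firmware's

// Shift the circular buffer contents so that the current read offset
// lands at index 0. Afterwards, the current command line (and the
// history behind it) is contiguous and starts at the front.
static void cmdln_rotate_to_zero(char *buf, size_t read_offset)
{
    char tmp[CMDLN_BUF_SIZE];
    size_t tail = CMDLN_BUF_SIZE - read_offset;
    memcpy(tmp, buf + read_offset, tail);  // from read offset to end
    memcpy(tmp + tail, buf, read_offset);  // wrapped head of the buffer
    memcpy(buf, tmp, CMDLN_BUF_SIZE);
}
```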

Why?

  • All strings in the buffer (both current line and historical) are guaranteed to be contiguous.
  • All command history items are “safe” to use to overwrite current line.
  • No command (history or current) will ever cycle from the end of the circular buffer to the start of the buffer.
  • The current command line is always at offset zero
  • When using up/down arrow for history, because each entry is safe, can simply zero the current line’s strlen(), then strcpy() from the history item … there will always be sufficient space.

In short, it greatly simplifies use, with relatively minimal processing cost when adding characters. In exchange, the history is always “safe” to access, and all strings (including for historical lines) are contiguous.


Sample of bugs already found

cmdln_try_discard()

  • This function allowed arbitrarily pushing the read offset past the write offset … e.g., if buffer had two characters remaining, and asked to discard 12, the function succeeds and points 10 characters past the write offset (effectively making the command 500 characters long, with embedded nulls)
  • This function always acted on the full command line, even though various callers presume it acts on the current command.

ui_process_commands()

If there was an empty command, such as two consecutive semicolons, (W 3.3 100;; w), it did not update the success / failure. A blank command is valid, so it should be success. This affects command chaining with || or &&.

ui_process_macro()

  • This fails if there is any leading whitespace (unlike all other commands): e.g., ` (0)` (with a leading space) fails, while `(0)` succeeds.
  • This fails for the same reason when chaining commands, where a leading whitespace is more readable: e.g., `w; (0)` fails, while `w;(0)` succeeds.
  • Macros inject into the command buffer, which likely breaks chaining and/or command history.

General

  • commands have no ability to parse custom formats, because the current command (not necessarily the entire command line) is not exposed. Only string, integer and float are possible.
  • Parsing of integers is fine, but parsing of floats is partial and may have errors for small values, especially when nearing precision boundaries. Would be better to use stdio functions… but that requires a contiguous string (not spanning circular buffer’s end)
  • Many variables and structure fields relating to command line parsing have names indicating they are pointers … but they are actually offsets into a circular buffer. This results in constant cognitive dissonance when reading the code, making the code harder to understand.

1 Like

Those are good fix suggestions, and I am of course interested in getting the corner cases fixed.

I am really really wary of removing the functionality, even temporarily. First, I use it multiple times a day. Second, several tutorials and demos would have to be rewritten. Third, users might find it odd/confusing to suddenly not have functionality they are used to.

OK … no disabling the chaining functionality. :+1:

I will try out the hack, see if it helps simplify stuff.

1 Like

The command line parsing code has clear indications it was originally targeting a much more resource-constrained microcontroller.

There are some neat hacks. It seems to have been originally built with minimal RAM as a main goal, while ROM was less constrained. As a result, the ring buffer has many helper functions (which go into ROM). The ring buffer might have been based on other structures that would have allowed the ring buffer to be filled asynchronously by a second CPU. However, that functionality is not used in RP2040 code.

The RP2040 changes the cost/benefit analysis of those hacks, primarily due to its (relatively) plentiful RAM.

bpscript parsing is ... complex.

  • Hex string parser gets pointer to hex string, but without the leading 0x.
  • Binary string parser gets pointer to the binary string, but without the leading 0b
  • It’s made more complex because the command line might wrap around the end of a circular buffer …
    • every character access requires going through special functions
    • none of the built-in string parsing functions are usable
    • even incrementing the current character position potentially requires modulo operations
  • No unit testing is easily possible
    • No formal language definition
    • No list of known-good inputs (and how they would parse)
    • No list of almost-good inputs (and why they should fail to parse)
  • Plus all the difficulties in overall UI buffer management
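As an illustration of the wrap-around cost (hypothetical names, assuming an illustrative 512-byte buffer): even a strlen() substitute has to go through a modulo on every character access.

```c
#include <stddef.h>

#define CMDLN_BUF_SIZE 512  // illustrative size

// Every character access needs the modulo, because the command line may
// wrap from the end of the array back to index 0.
static char cmdln_char_at(const char *buf, size_t start, size_t i)
{
    return buf[(start + i) % CMDLN_BUF_SIZE];
}

// Ring-aware strlen substitute; plain strlen() would run off the end of
// the array instead of wrapping around.
static size_t cmdln_strlen(const char *buf, size_t start)
{
    size_t n = 0;
    while (n < CMDLN_BUF_SIZE && cmdln_char_at(buf, start, n) != '\0')
        n++;
    return n;
}
```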
Overall UI buffer management...

  • Global variables are modified from many code paths.
  • Global variables essentially prevent unit-testing of the code in any meaningful way.
  • Assumptions (e.g., command completion buffers) are mostly honored … (except when they are not).
    • Example: command history works … except when it gets corrupted
    • When is the command line buffer supposed to be const vs. mutable?
  • Command line might wrap around end of circular buffer
    • Everything becomes more complex
    • All character access must go through special functions
    • Even a single strlen() is prohibited
    • Pointers to strings in the command line might be non-contiguous
  • Every other code that parses anything from command line (e.g., parameters) also has to use these helper functions … with no options to extend for custom data types …
    • c.f. getting pre-parsed data structure with integers and string extent for any string parameter
    • c.f. contractual guarantee that a string extent is contiguous

This is a non-trivial area of code, which makes it risky to change. I’m starting to make significant headway, and have potential ways forward. I think for a few builds, I will want to have each API call the existing version, then call a newer version … with validation that the results are the same.

Also considering splitting `bpscript` parsing vs. parsing of other commands.
  • bpscript is well-suited to being defined in BNF form (or similar). Defining bpscript via BNF would enable automatic discovery of ambiguities (lex/yacc enables this). Likely there are tools to automatically generate test cases, also.
  • Other commands are well-suited to being defined in terms of parameters. They can then be consistently parsed, allowing for consistent behavior between commands:
    • boolean indicating if a parameter is allowed to be positional
    • boolean indicating if a parameter is optional
    • type of the named parameter (e.g., flag, int, string, binary data…)
    • etc.
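The parameter-driven model could be sketched as a per-command descriptor table (all names here are hypothetical, not existing firmware types):

```c
#include <stdbool.h>

typedef enum { PARAM_FLAG, PARAM_INT, PARAM_STRING, PARAM_BINARY } param_type_t;

// One entry per parameter a command accepts; a shared parser walks the
// table, giving every command consistent parsing behavior.
typedef struct {
    const char  *name;        // named form of the parameter
    param_type_t type;        // how the value should be parsed
    bool         positional;  // may the value appear without its name?
    bool         optional;    // may the parameter be omitted entirely?
} param_desc_t;

// Example table for the `xplank type "<string>"` command from earlier:
static const param_desc_t xplank_type_params[] = {
    { "text", PARAM_STRING, true, false },  // required positional string
};
```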

Just updating some thoughts. If anyone has a formal (e.g., BNF) description of bpscript, let me know…

1 Like

That sounds like a path forward. It is an absolute hodgepodge of code, parsers, etc.

I am also very much open to using a single linear buffer for tokenizing input. The circular buffer is “proper” and “cute”, but working with it is quirky and buggy, with lots of things to track and a test for roll-over every time. All search functions have to be ring-buffer-aware, so standard C string functions are not helpful.

I would envision still keeping a circular ring buffer for command history. After the user presses <enter> the contents of the linear buffer would be copied to the circular history buffer, which is much simpler to track. It would also go a long way towards not blowing out the history when prompts are used, but there is a better solution for that as well.

@Ian … quick questions.

The documentation says that decimal values for the BusScript commands are prefixed with 0d.

  1. Is it acceptable to allow decimal values without the 0d prefix?
  2. Is it desired to support this 0d prefix for decimal values?
  3. Is it necessary to support this 0d prefix for decimal values?

(My guess is “Yes to all”… but just in case…)

1 Like

I think that is a historical artifact. I don’t think the current parser recognizes the 0d prefix anymore. I’ll test and clean up the docs tomorrow if it isn’t supported.

1 Like