Back to Basics
A Picture is Worth What Now?
If the diagram above looks like gibberish to you, you’re not alone…
If the diagram above looks like something you’ve gratefully forgotten from a dry, boring lecture in a theoretical computer science class, you’re not alone…
If it looks like it ought to be familiar, but you can’t quite put your finger on it, you’re not alone…
If you’ve used (or cursed at) regular expressions, but it still doesn’t make sense to you, you might want to keep reading. You might not…
State Machines
You may have heard the words “state machines,” “finite state machines (FSMs),” or “finite automata” at some point. In many contexts, they refer to the same concept.
The diagram above depicts a “deterministic” finite automaton in the form of a state-transition diagram. [Back to “deterministic” in a minute.] The circles are valid “states” in which the program may find itself, and the arrows are “transitions” between states.
Formally, we talk about the machine consuming “symbols” from some “universe” of such. Above, our universe of possible symbols is constrained to the decimal digits, 0 through 9, plus, minus, dot / decimal / period, and the upper- or lower-case letter “e”. Any other “symbol” is forbidden.
The “state” with the big arrow on the left is our “start state.” The “states” with double lines are “accepting” states. The ones with single lines are internal or intermediate states.
Depending on the context, an “accepting” state signals that the “machine” is done either at the end of the input or when a forbidden symbol is encountered. Either way, landing in an accepting state means the “word” consumed so far is part of the “language” … for the most part.
Deterministic
If you look closely, you’ll see that there are no un-labeled “transitions.” They’re all triggered by encountering a legal symbol while in a specific “state.” Similarly, no “state” has two out-going “transitions” triggered by the same symbol. As such, the behavior of this “machine” is literally deterministic: one valid input in a given state triggers exactly one valid transition.
In practice, that doesn’t matter as much as one might think, since non-deterministic finite automata can always be converted into deterministic versions. If you feel like getting into that, there are plenty of good books gathering dust on many shelves.
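To make that conversion concrete, here’s a minimal sketch of the classic “subset construction” in Python. The dictionary representation of the NFA and the toy example machine are my own assumptions for illustration — they have nothing to do with the diagram above.

```python
from itertools import chain

def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states.

    `nfa` maps (state, symbol) -> set of possible next states.
    A DFA state would be "accepting" if it contains any accepting NFA state.
    """
    start_set = frozenset([start])
    dfa, todo = {}, [start_set]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            # Union of everywhere the NFA could go from any state in `current`
            nxt = frozenset(
                chain.from_iterable(nfa.get((q, symbol), ()) for q in current)
            )
            dfa[current][symbol] = nxt
            if nxt not in dfa:
                todo.append(nxt)
    return dfa

# Toy NFA over {a, b} that accepts strings ending in "ab":
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa = nfa_to_dfa(nfa, 0, "ab")
```

Epsilon-transitions, acceptance checks, and dead-state pruning are left out to keep the sketch short; the dusty books cover those.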
An Aside
If anyone has spare time and tokens, I’d be fascinated to hear what the AIs think of this diagram. Do any of them recognize it immediately and tell you what “language” it recognizes? Here’s an un-clipped/cropped version of the diagram if you feel like uploading it somewhere…
Not the Diagram
If you’ve spent any time reading Internet standards, this small chunk of text is likely to look all too familiar.
In fact, it’s a textual description of what I hope I captured properly in the diagram. You may have seen references to BNF, EBNF, ABNF, etc., which are all variations on Backus-Naur form for specifying “grammars” for “languages” of certain types. If you’re interested in the types of “languages” they can represent, dusty books blah, blah, blah…
This one was cribbed from the middle of Internet RFC (request for comments – still a standard, go figure!) 8259, which can be found in text format at RFC 8259: The JavaScript Object Notation (JSON) Data Interchange Format.
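For reference, the grammar in question, quoted from Section 6 of RFC 8259, reads:

```
number = [ minus ] int [ frac ] [ exp ]
decimal-point = %x2E       ; .
digit1-9 = %x31-39         ; 1-9
e = %x65 / %x45            ; e E
exp = e [ minus / plus ] 1*DIGIT
frac = decimal-point 1*DIGIT
int = zero / ( digit1-9 *DIGIT )
minus = %x2D               ; -
plus = %x2B                ; +
zero = %x30                ; 0
```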
There are a few oddities to note. The “DIGIT” rule isn’t actually defined there; it comes from the core rules of the ABNF standard itself (RFC 5234), where it means one of our normal, decimal digits (0-9). The “digit1-9” rule is specifically called out because it doesn’t contain the zero. Additionally, the 1* versus the * map to “one or more” versus “zero or more”. That’s one of many things that can be specified differently in different formats.
The Language
In this case, the “language” the grammar and its associated automaton, state machine, and state-transition diagram all attempt to capture is “things that should be interpreted as legal ‘numbers’ in a JSON-format data stream.”
You can see that it must start with a digit from 0 through 9 or a “unary” minus sign (not the subtraction operator). The reason a leading zero (or negative zero) goes somewhere other than where the other digits go is that leading zeros are banned by the standard. Hence:
00012.34
would be an illegal “number” in a JSON context even though it has a clear meaning. Or does it? Depending on the context, a leading zero like that might denote that the following digits are not the decimal digits to which we’ve all become accustomed. They might be limited to the octal (0-7) digits (for base 8 arithmetic), and since the example doesn’t contain any digits outside the octal set, it would be tough to know for sure. Ambiguity is the great enemy!
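A quick way to see the ban in action is to hand both forms to Python’s standard json module, which follows the RFC on this point:

```python
import json

print(json.loads("12.34"))        # parses as a float just fine

try:
    json.loads("00012.34")        # leading zeros are not legal JSON
except json.JSONDecodeError as err:
    print("rejected:", err)
```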
Simple integers and floating-point values are probably obvious from the diagram now. Of course, we wouldn’t want to forget our old friend “scientific notation” which can introduce positive and negative exponents.
The special state (32 in the diagram) for a leading zero becomes necessary, because we have to be able to represent numbers with only a fractional part, like 0.125 or -0.333.
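One way to turn the diagram into code is a hand-rolled DFA in Python. The state names below are descriptive labels of my own invention; they don’t correspond to the numbers in the diagram, but the transitions should match it.

```python
def is_json_number(s: str) -> bool:
    """Run s through a DFA for the JSON 'number' grammar."""
    DIGITS = set("0123456789")
    state = "start"
    for ch in s:
        if state == "start":
            if ch == "-":        state = "sign"
            elif ch == "0":      state = "zero"   # leading zero is special
            elif ch in DIGITS:   state = "int"
            else: return False
        elif state == "sign":
            if ch == "0":        state = "zero"
            elif ch in DIGITS:   state = "int"
            else: return False
        elif state == "zero":                      # no more digits allowed here
            if ch == ".":        state = "dot"
            elif ch in "eE":     state = "e"
            else: return False
        elif state == "int":
            if ch in DIGITS:     pass              # stay put
            elif ch == ".":      state = "dot"
            elif ch in "eE":     state = "e"
            else: return False
        elif state == "dot":                       # need at least one frac digit
            if ch in DIGITS:     state = "frac"
            else: return False
        elif state == "frac":
            if ch in DIGITS:     pass
            elif ch in "eE":     state = "e"
            else: return False
        elif state == "e":                         # exponent sign or digit
            if ch in "+-":       state = "esign"
            elif ch in DIGITS:   state = "exp"
            else: return False
        elif state == "esign":
            if ch in DIGITS:     state = "exp"
            else: return False
        elif state == "exp":
            if ch in DIGITS:     pass
            else: return False
    return state in {"zero", "int", "frac", "exp"}  # the accepting states
```

Note how “zero” has no transition back into the integer digits — that’s the leading-zero ban made mechanical.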
But Why? Why? Why?
Why am I thinking about all this? As usual, “I heard something that got me thinking, and …” Famous last words!
Last week at PyCon 2025 (a really great conference with really great people!), I heard a talk about dealing with exceptionally large amounts of data in the JSON format. There was a spark, a flash, a smell of smoke, and some of my mental gears started turning … slowly. I’ve seen a fair number of JSON files / data streams that were effectively one giant “array” with “objects” inside it. Something like:
[
{ "message": "Go away!" },
{ "message": "You kids get off my lawn!" }
]
If that’s all it is, couldn’t we (in Python, of course!) ask a language library “yes, I know it’s one, giant list, but please parse, validate, and return each item in the list separately, so you use up much less memory?” In Python, we could write that function as a “generator” that doesn’t (necessarily!) “return” data so much as “yield” data over and over.
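A rough sketch of that idea, built on the standard library’s json.JSONDecoder.raw_decode (which parses one value and reports where it stopped). This version assumes the whole text is already in memory — a true streaming version that feeds a file in chunks is where the real work lives:

```python
import json

def iter_json_array(text):
    """Yield the items of a top-level JSON array one at a time,
    instead of materializing the whole list at once."""
    decoder = json.JSONDecoder()
    idx = text.index("[") + 1              # skip past the opening bracket
    while True:
        # Skip whitespace and the commas between items.
        while idx < len(text) and text[idx] in " \t\r\n,":
            idx += 1
        if idx >= len(text) or text[idx] == "]":
            return                         # end of the array
        item, idx = decoder.raw_decode(text, idx)
        yield item

doc = '[{"message": "Go away!"}, {"message": "You kids get off my lawn!"}]'
for obj in iter_json_array(doc):
    print(obj["message"])
```

Each call to the generator hands back one parsed, validated item and remembers where it left off — “yield” instead of “return,” exactly as described above.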
But Still Why?
Well, starting from the bottom lets one explore the data, interact with the Internet standards themselves, and see what others have done in their libraries in a new light. In this case, RFC-8259 for JSON explicitly references and relies on RFC-3629 (found at RFC 3629: UTF-8, a transformation format of ISO 10646) for UTF-8 as an encoding or transformation of Unicode / ISO/IEC 10646-1 character data. I guess they’re assuming that we’ve all read those other standards, since RFC-3629 never actually tells us what “UTF” stands for. [Hence, the example JSON above!] Google says it’s “Unicode Transformation Format.” Go figure!
By themselves, RFC-3629 and UTF-8 are interesting standards. They don’t define any printable character set or natural language subset. They only define how to encode or transform inter-/national language text and data.
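Python makes the “transformation, not character set” point easy to see: str.encode turns abstract Unicode code points into concrete UTF-8 bytes, with ASCII characters staying one byte each and everything else growing. The sample string here is mine, chosen only to mix the two:

```python
s = "¡Håpp y!".replace(" ", "")      # 7 code points, two of them non-ASCII
b = s.encode("utf-8")                # 9 bytes: "¡" and "å" take 2 bytes each
print(len(s), len(b))
print(b)

# The transformation is reversible, which is the whole point:
assert b.decode("utf-8") == s
```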
Of course, there are UTF-8, UTF-16, and UTF-32 (named for their code-unit bit lengths) and other encodings and transformations possible, but RFC-8259 has been revised to effectively outlaw the others for interchange. In fact, that’s New And Improved!
I tend to be a “there’s value in the journey” kind of person, so I’m writing some libraries from scratch to experiment and then compare. If they prove useful or interesting to people (or even spark discussion), they’ll get their own write-up and maybe packaging for a larger audience.
Feeling Pythonic?
If you’re feeling more Pythonic, you might have noticed a bunch of things that don’t feel quite right. The Internet standards referenced here don’t allow for NaN (not a number), positive or negative infinity (Inf, -Inf), using a comma where some folks would use a period as the fraction separator, using underscores to break up large numbers, and more.
For some of those, the relevant Python libraries may allow them. For others, they may not.
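A few quick experiments with the standard json module show where it is looser than the RFC — and where it isn’t:

```python
import json
import math

# Looser: the parser accepts these non-standard constants by default.
assert math.isnan(json.loads("NaN"))
assert json.loads("Infinity") == math.inf
assert json.loads("-Infinity") == -math.inf

# Not looser: underscores are a Python-literal nicety, not JSON.
try:
    json.loads("1_000")
except json.JSONDecodeError:
    print("1_000 rejected, as the RFC requires")
```

(The loose behavior is tunable: json.loads takes a parse_constant hook if you want strictness.)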
Interoper-what?
The whole point of standards is to allow for interoperability of programs and systems. That goes double for networking and data interchange standards like these. Sadly, things that start out ad hoc often stay that way and lead to problems down the road.
One fascinating reference for this particular issue is https://github.com/nst/JSONTestSuite, which contains a very interesting comparison chart of different JSON libraries and which tests they pass, fail, or simply handle differently. Having an (updated) Internet RFC that defines what is and isn’t JSON may be reducing the chaos. It may not. I haven’t done a lot of research yet, but the code I’m working on right now will try to implement the RFCs as written and pass at least the “y” and “n” test suites, doing the correct things with those inputs.
Conclusion
It’s a work in progress. There is no conclusion yet unless you’re drawing conclusions about my sanity. I would not blame you if you made it all the way here…
See y’all later!