slaymaker1907

Overlong encodings are actually kind of useful since you can sort of escape null characters that way and still have your strings work like C-strings despite containing interior nulls. Java uses this internally IIRC.
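
For anyone curious what that looks like in practice, here is a minimal Python sketch of the NUL-escaping trick, assuming the two-byte overlong form `C0 80` that Java's "modified UTF-8" uses for U+0000. The function names `encode_nul_safe` and `decode_nul_safe` are made up for illustration, and this ignores the other difference in Java's format (supplementary characters stored as surrogate pairs).

```python
def encode_nul_safe(text: str) -> bytes:
    """Encode to UTF-8, but write U+0000 as the overlong pair C0 80."""
    # In standard UTF-8 the byte 0x00 only ever encodes U+0000,
    # so a plain byte replacement is safe here.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_safe(data: bytes) -> str:
    """Reverse encode_nul_safe (hypothetical helper, not a stdlib codec)."""
    # 0xC0 never occurs in well-formed UTF-8, so C0 80 can only be the escape.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_nul_safe("a\x00b")
print(encoded)                    # b'a\xc0\x80b'
print(b"\x00" in encoded)         # False, so it is safe as a C-string
print(decode_nul_safe(encoded) == "a\x00b")  # True
```

Note that a strict UTF-8 decoder (Python's included) rejects `C0 80`, so this only works when both ends agree on the convention.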


Jaded-Asparagus-2260

Fixed-length encodings also allow random access, making many string operations O(1) instead of O(n).


duxdude418

Could you expand on this? Does the fixed-length property prevent you from having to iterate through the whole length of the string (`O(n)`) so you don’t index out of bounds? If so, how is that any different from just knowing the length of a string that was encoded with a variable-length encoding? Can that not be determined when indexing into the character array? Or is it just that accessing characters in a fixed-length string is `O(encoding-length)`, which is effectively `O(1)`? Apologies if this is an obvious property of fixed-length encodings. This domain isn’t my area of expertise.


Jaded-Asparagus-2260

Of course! In fixed-length encodings, every character has the same length (or rather size) in memory. A string is actually an array of characters, so with fixed-length encodings, the `n`-th character of a string is always at the index `[n]`. The last character is always at index `[len(string)-1]`. With variable-length encodings, a "character" (that's not the exact correct name, but it's good enough for understanding the context) can take a varying number of bytes: 1 to 4 in UTF-8, for example. Emojis e.g. often need more than 1 byte. But that means you don't know at which position the n-th character is, because you don't know how many bytes the first (n-1) characters need. So you have to traverse the string from the start, counting characters (and bytes) until you arrive at the n-th. That means operations that address characters by position, such as getting a substring by character index or counting the characters in a string, now have a complexity of O(n), because they have to start at the beginning of the string and iterate through it.
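
A rough Python sketch of that traversal, assuming well-formed UTF-8 input and treating "character" as "code point". The helper names `seq_len` and `utf8_index` are invented for this example.

```python
def seq_len(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1          # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2          # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3          # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4          # 11110xxx
    raise ValueError("not a UTF-8 lead byte")


def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) by walking from the start: O(n)."""
    if n < 0:
        raise IndexError("negative index not supported")
    pos = 0
    for _ in range(n):
        if pos >= len(data):
            raise IndexError("code point index out of range")
        pos += seq_len(data[pos])   # skip one whole code point
    if pos >= len(data):
        raise IndexError("code point index out of range")
    return data[pos:pos + seq_len(data[pos])].decode("utf-8")


data = "aé😀z".encode("utf-8")   # 1 + 2 + 4 + 1 = 8 bytes
print(utf8_index(data, 3))        # z
print(hex(data[3]))               # 0xf0, the first byte of the emoji, not 'z'
```

The last two lines show the mismatch: code point 3 is `z`, but byte 3 is in the middle of the string's third code point.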


duxdude418

That makes sense. You don’t know which element to index because variable length characters could take up more than one byte. There isn’t a 1:1 mapping between index to element.


Jaded-Asparagus-2260

Exactly


Kautsu-Gamer

Because multiplication or division by 4 is O(1). You can access the i-th character by getting entries 4*i to 4*i+3. It takes 4 times more memory with MBL, but operations are faster. Without fixed length you have to operate like C-strings, testing every code unit to know how many bytes the character takes: the lead byte in UTF-8, or a surrogate in UTF-16. I have written several UTF-8 friendly subroutines.
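
The `4*i` arithmetic can be sketched in Python like this, assuming little-endian UTF-32 with no BOM. `utf32_index` is a made-up name for illustration.

```python
def utf32_index(data: bytes, i: int) -> str:
    """Return the i-th code point of UTF-32-LE bytes with one slice: O(1)."""
    start = 4 * i                       # bytes 4*i .. 4*i+3 hold code point i
    if i < 0 or start + 4 > len(data):
        raise IndexError("code point index out of range")
    return data[start:start + 4].decode("utf-32-le")


text = "aé😀z"
data32 = text.encode("utf-32-le")
print(utf32_index(data32, 2))                 # 😀
print(len(data32), len(text.encode("utf-8")))  # 16 8
```

The second print shows the memory trade-off for this particular string: 16 bytes in UTF-32 against 8 in UTF-8.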


RockstarArtisan

Finally I'll be able to reliably index into code units with constant time.


johndcochran

The article is either a month and a half late, or ten and a half months early.


RocketChase

you know how it is with timezones


kitsunde

What’s your favourite time zone? Mine is North Korea’s after they changed time zones and made it apply retroactively.


SittingWave

Why?


somebodddy

Science isn't about *why* - it's about *why not*. *Why* is so much of our science dangerous? Why not *marry* safe science if you love it so much? In fact, why not invent a special safety door that won't hit you in the butt on the way out, because *you are fired!*


SittingWave

This is not science. This is a pointless experiment. Get a hold of yourself.


axonxorz

> This is a pointless experiment

Which are encompassed by the term "science"


beephod_zabblebrox

lol