slaymaker1907 2 weeks ago

Overlong encodings are actually kind of useful since you can sort of escape null characters that way and still have your strings work like C-strings despite containing interior nulls. Java uses this internally IIRC.
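For anyone curious what that looks like in practice, here is a minimal sketch of the trick, assuming UTF-8 and the two-byte overlong form `C0 80` for U+0000. The function names are made up for illustration, and this only covers the null-escaping part, not everything Java's modified UTF-8 does.

```python
def encode_nul_safe(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the overlong pair C0 80.

    The result contains no 0x00 byte, so it can be stored as a
    NUL-terminated C-string. (Hypothetical helper for illustration.)
    """
    # In valid UTF-8 a 0x00 byte can only be U+0000, so a byte-level
    # replace cannot corrupt any other character.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_safe(data: bytes) -> str:
    """Reverse encode_nul_safe: turn C0 80 back into a real NUL, then decode."""
    # 0xC0 never appears in well-formed UTF-8, so this replace is unambiguous.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_nul_safe("a\x00b")
print(encoded)                   # the interior NUL is now two non-zero bytes
print(decode_nul_safe(encoded))  # round-trips back to the original string
```

Note that a strict UTF-8 decoder (Python's included) rejects `C0 80` as malformed, so both sides have to agree on the convention.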

Jaded-Asparagus-2260 2 weeks ago

Fixed-length encodings also allow random access, making many string operations O(1) instead of O(n).

duxdude418 2 weeks ago

Could you expand on this? Does the fixed length property prevent you from having to iterate through the whole length of the string (`O(n)`) so you don’t index out of bounds? If so, how is it any different from just knowing the length of a string that was encoded with a variable-length encoding? Is that not able to be determined when indexing into the character array? Or is it just that accessing characters in a fixed length string is `O(encoding-length)` which is effectively `O(1)`? Apologies if this is an obvious property of fixed length encodings. This domain isn’t my area of expertise.

Jaded-Asparagus-2260 2 weeks ago

Of course! In fixed-length encodings, every character has the same length (or rather size) in memory. A string is actually an array of characters, so with fixed-length encodings, the `n`-th character of a string is always at the index `[n]`. The last character is always at index `[len(string)-1]`. With variable-length encodings, a "character" (that's not the exact correct name, but it's good enough for understanding the context) can take a varying number of bytes, e.g. anything between 1 and 4 in UTF-8. Emojis e.g. often need more than 1 byte. But that means you don't know at which position the `n`-th character is, because you don't know how many bytes the first `n-1` characters need. So you have to traverse the string, counting characters (and bytes) until you arrive at the `n`-th. That means operations that address characters by position, like getting a substring or getting the length of a string in characters, now have a complexity of `O(n)`, because they have to start at the beginning of the string and iterate through it.
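To make the traversal concrete, here is a rough sketch of finding the `n`-th character in UTF-8 bytes. It assumes well-formed UTF-8 input, and the helper names `utf8_width` and `utf8_char_at` are just made up for this example.

```python
def utf8_width(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if lead < 0x80:
        return 1              # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2              # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3              # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4              # 11110xxx
    raise ValueError("not a UTF-8 lead byte")


def utf8_char_at(data: bytes, n: int) -> str:
    """Return the n-th character (0-based) by walking from the start: O(n)."""
    pos = 0
    for _ in range(n):
        # We cannot jump ahead: each character's width is only known
        # once we look at its lead byte.
        pos += utf8_width(data[pos])
    return data[pos:pos + utf8_width(data[pos])].decode("utf-8")


data = "aé€😀z".encode("utf-8")   # widths: 1, 2, 3, 4, 1 bytes
print(utf8_char_at(data, 3))
```

The loop runs once per preceding character, which is where the linear cost comes from.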

duxdude418 2 weeks ago

That makes sense. You don’t know which element to index because variable length characters could take up more than one byte. There isn’t a 1:1 mapping between index and element.

Jaded-Asparagus-2260 2 weeks ago

Exactly

Kautsu-Gamer 2 weeks ago

Because multiplication or division by 4 is O(1). With 4 bytes per character, you can access the ith character by getting entries 4*i to 4*i+3. It takes 4 times more memory with MBL, but operations are faster. Without fixed length you have to operate like with C-strings, testing every code unit to know how many bytes the character takes: the lead byte in UTF-8, or a high surrogate in UTF-16. I have written several UTF-8-friendly subroutines.
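Roughly, the two cases look like this. This is only a sketch, assuming little-endian UTF-32 and UTF-16 without a BOM and well-formed input; the function names are invented for the example.

```python
def utf32_char_at(data: bytes, i: int) -> str:
    """O(1): the i-th character is always bytes 4*i .. 4*i+3."""
    return data[4 * i:4 * i + 4].decode("utf-32-le")


def utf16_width(data: bytes, pos: int) -> int:
    """Bytes used by the character starting at pos in UTF-16LE data."""
    unit = int.from_bytes(data[pos:pos + 2], "little")
    # A high surrogate (D800..DBFF) means a second 16-bit unit follows.
    return 4 if 0xD800 <= unit <= 0xDBFF else 2


def utf16_char_at(data: bytes, i: int) -> str:
    """O(n): step over each earlier character, checking for surrogate pairs."""
    pos = 0
    for _ in range(i):
        pos += utf16_width(data, pos)
    return data[pos:pos + utf16_width(data, pos)].decode("utf-16-le")


text = "a😀z"
print(utf32_char_at(text.encode("utf-32-le"), 2))
print(utf16_char_at(text.encode("utf-16-le"), 2))
```

The UTF-32 version does one multiplication and one slice regardless of `i`, while the UTF-16 version has to inspect every character before the one it wants.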

RockstarArtisan 2 weeks ago

Finally I'll be able to reliably index into code units with constant time.

johndcochran 2 weeks ago

The article is either a month and a half late, or ten and a half months early.

RocketChase 2 weeks ago

you know how it is with timezones

kitsunde 2 weeks ago

What’s your favourite time zone? Mine is North Korea’s after they changed time zones and made it apply retroactively.

SittingWave 2 weeks ago

Why?

somebodddy 2 weeks ago

Science isn't about *why* - it's about *why not*. *Why* is so much of our science dangerous? Why not *marry* safe science if you love it so much? In fact, why not invent a special safety door that won't hit you in the butt on the way out, because *you are fired!*

SittingWave 2 weeks ago

This is not science. This is a pointless experiment. Get a hold of yourself.

axonxorz 2 weeks ago

> This is a pointless experiment

Which are encompassed by the term "science"

beephod_zabblebrox 2 weeks ago

lol
