Mirror of https://github.com/BoredDevNL/BoredOS.git (synced 2026-05-15 10:48:38 +00:00)
doc: Add UTF-8 byte structure section and resources (#10)
Added a section on UTF-8 byte structure with a diagram and a recommended video for further understanding.
parent 7a480b44b9
commit 8d0e744991
1 changed file with 12 additions and 23 deletions
@@ -176,6 +176,15 @@ Invalid sequences are replaced with:
 - code point: `0xFFFD`
 - UTF-8 encoding: `0xEF 0xBF 0xBD`
 
+---
+### UTF-8 Byte Structure
+
+The following diagram illustrates how UTF-8 bytes are structured, including
+ASCII, continuation bytes, and multi-byte sequence headers:
+
+<img width="815" height="1003" alt="image" src="https://github.com/user-attachments/assets/0d289a94-6037-4039-87a3-125c0c0e83d0" />
+<sub>Source: <a href="https://www.youtube.com/watch?v=vpSkBV5vydg">Nic Barker — "UTF-8, Explained Simply"</a> (YouTube)</sub>
+
 ---
 
 ### Control Signals
@@ -233,30 +242,10 @@ Example:
 
 ---
 
-## Limitations
+## Also worth watching
 
-- No full UTF-8 validation (overlong, surrogates not fully rejected)
-- No grapheme cluster handling
-- No Unicode normalization
+If you want to dive deeper or simply get a better intuitive understanding of UTF-8, the video below is highly recommended:
 
----
+[Nic Barker — "UTF-8, Explained Simply"](https://www.youtube.com/watch?v=vpSkBV5vydg)
 
-## Best Practices
 
-- Never iterate UTF-8 strings byte-by-byte
-- Always use provided helpers for navigation
-- Separate byte length from character count
-- Handle invalid sequences safely
-
----
-
-## Summary
-
-This module provides essential UTF-8 primitives for userland applications.
-
-It should be used whenever an application needs to safely:
-
-- decode UTF-8
-- encode Unicode
-- traverse text
-- handle user input correctly
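The byte categories named in the added "UTF-8 Byte Structure" section (ASCII, continuation bytes, multi-byte sequence headers) are shown only as an image in the diff. As a rough text companion, here is a minimal C sketch that classifies a single byte by its leading bits. The names `utf8_byte_kind` and `utf8_classify` are hypothetical and are not taken from the BoredOS source. The check is pattern-only and does not amount to full validation.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not part of the BoredOS UTF-8 module. */
typedef enum {
    UTF8_ASCII,        /* 0xxxxxxx: complete one-byte code point (U+0000..U+007F) */
    UTF8_CONTINUATION, /* 10xxxxxx: trailing byte of a multi-byte sequence */
    UTF8_LEAD_2,       /* 110xxxxx: header of a two-byte sequence */
    UTF8_LEAD_3,       /* 1110xxxx: header of a three-byte sequence */
    UTF8_LEAD_4,       /* 11110xxx: header of a four-byte sequence */
    UTF8_INVALID       /* 11111xxx: never appears in UTF-8 */
} utf8_byte_kind;

/*
 * Classify one byte by its high bits only.
 * This does not reject bytes such as 0xC0, 0xC1, or 0xF5..0xF7, which match
 * a header pattern but cannot start a well-formed sequence.
 */
static utf8_byte_kind utf8_classify(uint8_t b)
{
    if ((b & 0x80) == 0x00) return UTF8_ASCII;
    if ((b & 0xC0) == 0x80) return UTF8_CONTINUATION;
    if ((b & 0xE0) == 0xC0) return UTF8_LEAD_2;
    if ((b & 0xF0) == 0xE0) return UTF8_LEAD_3;
    if ((b & 0xF8) == 0xF0) return UTF8_LEAD_4;
    return UTF8_INVALID;
}
```

Applied to the replacement sequence quoted in the first hunk, `0xEF` classifies as a three-byte header and `0xBF` and `0xBD` as continuation bytes, which matches U+FFFD being a three-byte code point.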