UTF-8 Library — Application Development Guide
Overview
The userland libc provides a lightweight UTF-8 utility module located in:
- src/userland/libc/utf-8.c
- src/userland/libc/utf-8.h
This module is designed for direct use in applications requiring UTF-8 handling. It provides basic primitives for decoding, encoding, and traversing UTF-8 strings safely.
It is intended for:
- text rendering
- terminal input/output
- cursor movement
- string processing at the character level
Synopsis
#include "utf-8.h"
uint32_t text_decode_utf8(const char *s, int *advance);
int text_encode_utf8(uint32_t cp, char *out);
const char* text_next_utf8(const char *s);
const char* text_prev_utf8(const char *start, const char *s);
int text_strlen_utf8(const char *s);
API Reference
text_decode_utf8
uint32_t text_decode_utf8(const char *s, int *advance);
Decodes a UTF-8 sequence into a Unicode code point.
- s: pointer to the current position in a UTF-8 string
- advance: receives the number of bytes consumed
Returns:
- decoded Unicode code point (uint32_t)
- 0 if input is null or empty
- 0xFFFD for invalid sequences
text_encode_utf8
int text_encode_utf8(uint32_t cp, char *out);
Encodes a Unicode code point into UTF-8.
- cp: Unicode code point
- out: buffer receiving the encoded bytes (up to 4)
Returns:
- number of bytes written (1–4)
- writes the replacement character (0xFFFD) if cp is invalid
text_next_utf8
const char* text_next_utf8(const char *s);
Advances to the next UTF-8 character.
Returns a pointer to the next character boundary.
text_prev_utf8
const char* text_prev_utf8(const char *start, const char *s);
Moves backward to the previous UTF-8 character.
- start: beginning of the buffer
- s: current position
Used for reverse traversal and cursor movement.
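Backward traversal relies on the fact that continuation bytes always have the bit pattern 10xxxxxx. The following is a minimal sketch of that idea, not the library's actual code; sketch_prev_boundary is a hypothetical name, and the input is assumed to be well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper (not part of utf-8.h): step back from s to the
 * start of the previous character, never moving before start. */
const char *sketch_prev_boundary(const char *start, const char *s)
{
    if (start == NULL || s == NULL || s <= start)
        return start;
    do {
        --s;  /* move back one byte */
    } while (s > start && ((unsigned char)*s & 0xC0) == 0x80);
    return s;  /* first byte that is not a continuation byte */
}
```

With the six-byte string "a", "é", "€", stepping back from the end lands on offset 3 (the start of "€"), then offset 1, then offset 0.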
text_strlen_utf8
int text_strlen_utf8(const char *s);
Counts UTF-8 characters (code points), not bytes.
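The difference between bytes and code points can be illustrated by counting every byte that is not a continuation byte. This is only a sketch under the assumption of well-formed, NUL-terminated input; sketch_count_code_points is a hypothetical name and may not match how text_strlen_utf8 is implemented.

```c
#include <stddef.h>

/* Hypothetical helper: count code points by skipping continuation
 * bytes (10xxxxxx). Assumes well-formed, NUL-terminated UTF-8. */
int sketch_count_code_points(const char *s)
{
    int count = 0;

    if (s == NULL)
        return 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;  /* ASCII byte or lead byte starts a new code point */
    }
    return count;
}
```

Note that a code point is not always what a user perceives as one character: "e" followed by a combining accent is two code points.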
Usage Examples
Iterating over UTF-8 characters
const char *p = text;
while (*p) {
    int adv;
    uint32_t cp = text_decode_utf8(p, &adv);
    if (adv <= 0)
        break; /* defensive guard (assumption): never loop without progress */
    /* process cp */
    p += adv;
}
Cursor movement
cursor = text_next_utf8(cursor);
cursor = text_prev_utf8(buffer_start, cursor);
Encoding a character
char out[4];
int len = text_encode_utf8(0x20AC, out);
Backspace handling
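/* Move the cursor back to the start of the previous character,
   so that a multi-byte character is removed as a whole. */
char *prev = (char*)text_prev_utf8(buffer, cursor);
cursor = prev;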
Implementation Notes
UTF-8 Encoding
The implementation supports:
- 1 byte: 0x00 – 0x7F
- 2 bytes: 0x80 – 0x7FF
- 3 bytes: 0x800 – 0xFFFF
- 4 bytes: 0x10000 – 0x10FFFF
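These ranges map directly onto an encoder. The sketch below shows the bit layout for each length; sketch_encode is a hypothetical name, it performs no validation of cp, and it is not necessarily how text_encode_utf8 is written.

```c
#include <stdint.h>

/* Hypothetical encoder sketch: writes 1 to 4 bytes to out and returns
 * the count. Assumes cp <= 0x10FFFF; no validation is done here. */
int sketch_encode(uint32_t cp, unsigned char *out)
{
    if (cp <= 0x7F) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {               /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | ((cp >> 18) & 0x07));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, 0x20AC falls in the 3-byte range and comes out as 0xE2 0x82 0xAC.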
Replacement Character
Invalid sequences are replaced with:
- code point: 0xFFFD
- UTF-8 encoding: 0xEF 0xBF 0xBD
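This guide does not spell out which code points the library treats as invalid. The sketch below assumes the Unicode definition of a scalar value (no surrogates, nothing above 0x10FFFF); both function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: "invalid" means not a Unicode scalar value, i.e. a
 * surrogate (0xD800-0xDFFF) or a value above 0x10FFFF. */
int sketch_is_scalar_value(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Write the UTF-8 form of the replacement character (0xFFFD). */
int sketch_write_replacement(unsigned char *out)
{
    out[0] = 0xEF;
    out[1] = 0xBF;
    out[2] = 0xBD;
    return 3;  /* bytes written */
}
```

A caller would test the code point first and fall back to the three replacement bytes when the test fails.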
UTF-8 Byte Structure
The following diagram illustrates how UTF-8 bytes are structured, including ASCII, continuation bytes, and multi-byte sequence headers:
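In code, this structure amounts to a handful of bit-pattern tests on a single byte. The sketch below is illustrative only; sketch_sequence_length is a hypothetical name and is not part of utf-8.h.

```c
/* Hypothetical classifier: returns the sequence length announced by a
 * byte, 0 for a continuation byte, -1 for a byte that cannot start or
 * continue a sequence. */
int sketch_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx: ASCII */
    if ((b & 0xC0) == 0x80) return 0;   /* 10xxxxxx: continuation byte */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx: 2-byte header */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx: 3-byte header */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx: 4-byte header */
    return -1;                          /* 11111xxx: never valid */
}
```

This sketch checks only the bit patterns. A strict decoder would also reject the headers 0xC0, 0xC1 and 0xF5 to 0xF7, which match a pattern but never occur in valid UTF-8.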
Control Signals
Some decoded code points correspond to control signals instead of printable characters.
ASCII control range:
0x00 – 0x1F
Examples:
- 0x08 → Backspace
- 0x09 → Tab
- 0x0A → Line Feed
- 0x0D → Carriage Return
- 0x1B → Escape
These are typically interpreted by:
- terminal logic
- shell input handling
- system interfaces
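A consumer of decoded code points might separate control codes from printable characters along these lines. This is a sketch only; both function names are hypothetical and cover just the examples listed above.

```c
#include <stddef.h>
#include <stdint.h>

/* True for the ASCII control range 0x00-0x1F. */
int sketch_is_ascii_control(uint32_t cp)
{
    return cp <= 0x1F;
}

/* Name of a few common control codes, or NULL if not in this short list. */
const char *sketch_control_name(uint32_t cp)
{
    switch (cp) {
    case 0x08: return "Backspace";
    case 0x09: return "Tab";
    case 0x0A: return "Line Feed";
    case 0x0D: return "Carriage Return";
    case 0x1B: return "Escape";
    default:   return NULL;
    }
}
```

A terminal or shell would typically branch on such a check before passing the code point to the text renderer.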
Non-ASCII Characters
Characters outside the ASCII range (0x00 – 0x7F) are encoded using multi-byte UTF-8 sequences.
Examples:
- 'é' → 0xC3 0xA9
- '€' → 0xE2 0x82 0xAC
Decoded values:
- 'é' → U+00E9
- '€' → U+20AC
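The step from bytes to code point is a matter of stripping the header bits and concatenating the 6-bit payloads. The sketch below reproduces both examples; sketch_decode is a hypothetical name, it assumes NUL-terminated input, and it does not reject overlong forms or surrogates as a strict decoder would.

```c
#include <stdint.h>

/* Hypothetical decoder sketch: returns the code point and stores the
 * number of bytes consumed in *advance; 0xFFFD on malformed input. */
uint32_t sketch_decode(const unsigned char *s, int *advance)
{
    uint32_t cp;
    int len, i;

    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else { *advance = 1; return 0xFFFD; }  /* stray continuation or bad byte */

    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) {       /* also stops at the NUL terminator */
            *advance = i;
            return 0xFFFD;
        }
        cp = (cp << 6) | (s[i] & 0x3F);    /* append 6 payload bits */
    }
    *advance = len;
    return cp;
}
```

For 'é', the header 0xC3 contributes 00011 and 0xA9 contributes 101001, giving 0xE9. For '€', the payloads 0010, 000010 and 101100 combine to 0x20AC.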
Modifiers and Layout
Character output depends on:
- keyboard layout
- modifier keys (Shift, Ctrl, AltGr)
Example:
- KEY_E → 'e'
- KEY_E + SHIFT → 'E'
- KEY_E + AltGr → '€'
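The mapping for this one key can be pictured as a small lookup keyed on the active modifiers. Everything in the sketch below is hypothetical: the modifier flags, the function name, and the assumption of a layout where AltGr + E produces '€'. Ctrl is left out because the guide does not say what it maps to.

```c
#include <stdint.h>

/* Hypothetical modifier flags, not taken from any real header. */
enum {
    SKETCH_MOD_NONE  = 0,
    SKETCH_MOD_SHIFT = 1,
    SKETCH_MOD_ALTGR = 2
};

/* Code point produced by the E key under an assumed layout. */
uint32_t sketch_key_e_to_code_point(int modifiers)
{
    if (modifiers & SKETCH_MOD_ALTGR)
        return 0x20AC;  /* '€' */
    if (modifiers & SKETCH_MOD_SHIFT)
        return 0x45;    /* 'E' */
    return 0x65;        /* 'e' */
}
```

The resulting code point is what an encoder would then turn into UTF-8 bytes for the application.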
Also worth watching
If you want to dive deeper or simply get a better intuitive understanding of UTF-8, the video below is highly recommended: