doc: Add UTF-8 Library Application Development Guide

This document provides an overview and API reference for a UTF-8 utility module in the userland libc, including functions for decoding, encoding, and traversing UTF-8 strings.
Lluciocc 2026-04-23 22:13:32 +02:00 committed by GitHub
parent 81ea21e746
commit c11d4a8a00

@ -0,0 +1,262 @@
# UTF-8 Library — Application Development Guide
## Overview
The userland libc provides a lightweight UTF-8 utility module located in:
- src/userland/libc/utf-8.c
- src/userland/libc/utf-8.h
This module is designed for **direct use in applications** requiring UTF-8 handling. It provides basic primitives for decoding, encoding, and traversing UTF-8 strings safely.
It is intended for:
- text rendering
- terminal input/output
- cursor movement
- string processing at the character level
---
## Synopsis
```c
#include "utf-8.h"
uint32_t text_decode_utf8(const char *s, int *advance);
int text_encode_utf8(uint32_t cp, char *out);
const char* text_next_utf8(const char *s);
const char* text_prev_utf8(const char *start, const char *s);
int text_strlen_utf8(const char *s);
```
---
## API Reference
### text_decode_utf8
```c
uint32_t text_decode_utf8(const char *s, int *advance);
```
Decodes a UTF-8 sequence into a Unicode code point.
- `s`: pointer to the current position in a UTF-8 string
- `advance`: receives the number of bytes consumed
Returns:
- decoded Unicode code point (`uint32_t`)
- `0` if input is null or empty
- `0xFFFD` for invalid sequences
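The contract above can be illustrated with a minimal sketch. This is not the code in `utf-8.c`: `sketch_decode_utf8` is a hypothetical stand-in, and setting `*advance` to `0` for null or empty input, and to the number of bytes examined for an invalid sequence, are assumptions the guide does not state.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in with the documented return contract:
 * 0 for null/empty input, 0xFFFD for invalid sequences. */
static uint32_t sketch_decode_utf8(const char *s, int *advance)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp = 0;
    int n = 0; /* number of continuation bytes expected */

    if (p == NULL || p[0] == 0) { *advance = 0; return 0; } /* assumption */

    if (p[0] < 0x80)                { *advance = 1; return p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { n = 1; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 2; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 3; cp = p[0] & 0x07; }
    else                            { *advance = 1; return 0xFFFD; }

    for (int i = 1; i <= n; i++) {
        /* A NUL terminator fails this test, so we never read past it. */
        if ((p[i] & 0xC0) != 0x80) { *advance = i; return 0xFFFD; }
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }
    *advance = n + 1;
    return cp;
}
```

Like the module described here, this sketch does not reject overlong encodings or surrogates.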
---
### text_encode_utf8
```c
int text_encode_utf8(uint32_t cp, char *out);
```
Encodes a Unicode code point into UTF-8.
- `cp`: Unicode code point
- `out`: buffer receiving encoded bytes
Returns:
- number of bytes written (1 to 4)
- if `cp` is invalid, the encoding of the replacement character (`0xFFFD`) is written instead
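As an illustration of the behaviour described above, here is a minimal encoder sketch. `sketch_encode_utf8` is a hypothetical name, and treating only code points above `0x10FFFF` as invalid is an assumption; the real function may apply different rules.

```c
#include <stdint.h>

/* Hypothetical stand-in: writes 1 to 4 bytes to out, returns the count.
 * The output is not NUL-terminated. */
static int sketch_encode_utf8(uint32_t cp, char *out)
{
    unsigned char *o = (unsigned char *)out;

    if (cp > 0x10FFFF)
        cp = 0xFFFD; /* assumption: out-of-range values become U+FFFD */

    if (cp < 0x80) {
        o[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        o[0] = (unsigned char)(0xC0 | (cp >> 6));
        o[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        o[0] = (unsigned char)(0xE0 | (cp >> 12));
        o[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        o[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    o[0] = (unsigned char)(0xF0 | (cp >> 18));
    o[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    o[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    o[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Callers that need a C string must append the terminator themselves, for example `out[len] = '\0'` with a buffer of at least five bytes.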
---
### text_next_utf8
```c
const char* text_next_utf8(const char *s);
```
Advances to the next UTF-8 character.
Returns a pointer to the next character boundary.
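One common way to find the next boundary is to step over one byte and then skip any continuation bytes (those matching the bit pattern `10xxxxxx`). The sketch below assumes that approach; `sketch_next_utf8` is a hypothetical name, and staying in place at the terminator is an assumption, not documented behaviour.

```c
#include <stddef.h>

/* Hypothetical stand-in: returns the start of the next character. */
static const char *sketch_next_utf8(const char *s)
{
    if (s == NULL || *s == '\0')
        return s; /* assumption: do not move past the terminator */

    s++; /* leave the current lead byte */
    while (((unsigned char)*s & 0xC0) == 0x80)
        s++; /* skip continuation bytes */
    return s;
}
```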
---
### text_prev_utf8
```c
const char* text_prev_utf8(const char *start, const char *s);
```
Moves backward to the previous UTF-8 character.
- `start`: beginning of the buffer
- `s`: current position
Used for reverse traversal and cursor movement.
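Reverse traversal can be sketched the same way: step back one byte, then keep stepping back while the byte is a continuation byte, never going before `start`. `sketch_prev_utf8` is a hypothetical name, and returning `start` when the cursor is already at the beginning is an assumption.

```c
/* Hypothetical stand-in: returns the start of the previous character,
 * clamped to the beginning of the buffer. */
static const char *sketch_prev_utf8(const char *start, const char *s)
{
    if (s <= start)
        return start; /* assumption: nothing before the buffer start */

    s--;
    while (s > start && ((unsigned char)*s & 0xC0) == 0x80)
        s--; /* back over continuation bytes to the lead byte */
    return s;
}
```

The `start` parameter is what makes the backward scan safe: without it, a buffer beginning with stray continuation bytes would cause reads before the allocation.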
---
### text_strlen_utf8
```c
int text_strlen_utf8(const char *s);
```
Counts UTF-8 characters (code points), not bytes.
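A simple way to count code points, assuming well-formed input, is to count every byte that is not a continuation byte. The sketch below uses that technique; `sketch_strlen_utf8` is a hypothetical name and the real function may count differently on malformed input.

```c
#include <stddef.h>

/* Hypothetical stand-in: counts lead bytes, i.e. code points. */
static int sketch_strlen_utf8(const char *s)
{
    int count = 0;

    if (s == NULL)
        return 0; /* assumption: null counts as empty */

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string `"h\xC3\xA9llo"` ("héllo"), `strlen` reports 6 bytes while this count is 5.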
---
## Usage Examples
### Iterating over UTF-8 characters
```c
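const char *p = text;   /* text: NUL-terminated UTF-8 string */
while (*p) {
    int adv = 0;
    uint32_t cp = text_decode_utf8(p, &adv);
    if (adv <= 0)
        break;          /* defensive: never loop without making progress */
    /* process cp; 0xFFFD marks an invalid sequence */
    p += adv;
}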
```
---
### Cursor movement
```c
cursor = text_next_utf8(cursor);
cursor = text_prev_utf8(buffer_start, cursor);
```
---
### Encoding a character
```c
char out[4];                              /* room for the longest (4-byte) encoding */
int len = text_encode_utf8(0x20AC, out);  /* U+20AC is '€' */
```
---
### Backspace handling
```c
char *prev = (char*)text_prev_utf8(buffer, cursor);
cursor = prev;
```
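The snippet above only moves the cursor. If the application also needs to delete the character, the bytes after the cursor have to be shifted down. The following sketch shows one way to do that for a NUL-terminated buffer; `sketch_backspace` is a hypothetical helper, and it inlines the backward scan so the example stands alone (in application code, `text_prev_utf8` would take its place).

```c
#include <string.h>

/* Hypothetical helper: deletes the character before cursor in a
 * NUL-terminated buffer and returns the new cursor position. */
static char *sketch_backspace(char *buffer, char *cursor)
{
    char *prev = cursor;

    if (cursor <= buffer)
        return buffer; /* nothing to delete */

    /* Back up to the lead byte of the previous character. */
    do {
        prev--;
    } while (prev > buffer && ((unsigned char)*prev & 0xC0) == 0x80);

    /* Close the gap, including the terminator. */
    memmove(prev, cursor, strlen(cursor) + 1);
    return prev;
}
```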
---
## Implementation Notes
### UTF-8 Encoding
The implementation supports:
- 1 byte: `0x00` to `0x7F`
- 2 bytes: `0x80` to `0x7FF`
- 3 bytes: `0x800` to `0xFFFF`
- 4 bytes: `0x10000` to `0x10FFFF`
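These ranges translate directly into a length lookup, which can be handy for sizing a buffer before encoding. The helper below is a hypothetical sketch, not part of the module; returning `0` for values above `0x10FFFF` is an assumption made for this example.

```c
#include <stdint.h>

/* Hypothetical helper: bytes needed to encode cp,
 * or 0 if cp lies above the Unicode range. */
static int sketch_utf8_encoded_len(uint32_t cp)
{
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}
```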
---
### Replacement Character
Invalid sequences are replaced with:
- code point: `0xFFFD`
- UTF-8 encoding: `0xEF 0xBF 0xBD`
---
### Control Signals
Some decoded code points correspond to control signals instead of printable characters.
ASCII control range:
- `0x00` to `0x1F`
Examples:
- `0x08` → Backspace
- `0x09` → Tab
- `0x0A` → Line Feed
- `0x0D` → Carriage Return
- `0x1B` → Escape
These are typically interpreted by:
- terminal logic
- shell input handling
- system interfaces
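An application that receives decoded code points might separate control signals from printable characters along these lines. Both function names are hypothetical, and the mapping covers only the examples listed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for the ASCII control range 0x00 to 0x1F. */
static int sketch_is_control(uint32_t cp)
{
    return cp <= 0x1F;
}

/* Hypothetical helper: a readable name for a control code point,
 * or NULL if cp is not in the control range. */
static const char *sketch_control_name(uint32_t cp)
{
    switch (cp) {
    case 0x08: return "Backspace";
    case 0x09: return "Tab";
    case 0x0A: return "Line Feed";
    case 0x0D: return "Carriage Return";
    case 0x1B: return "Escape";
    default:   return sketch_is_control(cp) ? "Other control" : NULL;
    }
}
```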
---
### Non-ASCII Characters
Characters outside the ASCII range (`0x00` to `0x7F`) are encoded using multi-byte UTF-8 sequences.
Examples:
- 'é' → `0xC3 0xA9`
- '€' → `0xE2 0x82 0xAC`
Decoded values:
- 'é' → `U+00E9`
- '€' → `U+20AC`
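The mapping from bytes to code points is plain bit arithmetic: the lead byte contributes its low 5 bits (2-byte form) or 4 bits (3-byte form), and each continuation byte contributes its low 6 bits. The helper names below are hypothetical and exist only to work through the two examples.

```c
#include <stdint.h>

/* 2-byte form: 110xxxxx 10xxxxxx
 * 0xC3 0xA9 -> (0x03 << 6) | 0x29 = 0xC0 | 0x29 = 0xE9 */
static uint32_t sketch_join2(unsigned char b0, unsigned char b1)
{
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* 3-byte form: 1110xxxx 10xxxxxx 10xxxxxx
 * 0xE2 0x82 0xAC -> (0x02 << 12) | (0x02 << 6) | 0x2C
 *                 = 0x2000 | 0x80 | 0x2C = 0x20AC */
static uint32_t sketch_join3(unsigned char b0, unsigned char b1,
                             unsigned char b2)
{
    return ((uint32_t)(b0 & 0x0F) << 12)
         | ((uint32_t)(b1 & 0x3F) << 6)
         |  (uint32_t)(b2 & 0x3F);
}
```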
---
### Modifiers and Layout
Character output depends on:
- keyboard layout
- modifier keys (Shift, Ctrl, AltGr)
Example:
- `KEY_E` → 'e'
- `KEY_E + SHIFT` → 'E'
- `KEY_E + AltGr` → '€'
---
## Limitations
- No full UTF-8 validation: overlong encodings and surrogate code points are not fully rejected
- No grapheme cluster handling
- No Unicode normalization
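Applications that need stricter input handling can add their own check after decoding. The sketch below assumes the caller knows the decoded code point and the number of bytes it occupied (the `advance` value); `sketch_is_strict_scalar` is a hypothetical helper, not part of the module.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if cp, decoded from nbytes bytes,
 * is a well-formed Unicode scalar value in its shortest encoding. */
static int sketch_is_strict_scalar(uint32_t cp, int nbytes)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_for_len[5] = { 0, 0x00, 0x80, 0x800, 0x10000 };

    if (nbytes < 1 || nbytes > 4)
        return 0;
    if (cp > 0x10FFFF)
        return 0;                       /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* UTF-16 surrogate */
    if (cp < min_for_len[nbytes])
        return 0;                       /* overlong encoding */
    return 1;
}
```

For example, the two-byte sequence `0xC0 0xAF` decodes to `0x2F` ('/'), which this check rejects because `0x2F` fits in one byte.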
---
## Best Practices
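- Never iterate over UTF-8 strings byte by byte when working with characters
- Always use the provided helpers for navigation
- Track byte length and character count separately
- Handle invalid sequences safely: expect `0xFFFD` from the decoder and decide how to display or discard it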
---
## Summary
This module provides essential UTF-8 primitives for userland applications.
It should be used whenever an application needs to safely:
- decode UTF-8
- encode Unicode
- traverse text
- handle user input correctly