doc: Add UTF-8 Library Application Development Guide

This document provides an overview and API reference for a UTF-8 utility module in the userland libc, including functions for decoding, encoding, and traversing UTF-8 strings.
2026-05-15 10:48:38 +00:00 · 2026-04-23 22:13:32 +02:00 · 2026-04-23 22:13:32 +02:00 · c11d4a8a00
commit c11d4a8a00
parent 81ea21e746
1 changed files with 262 additions and 0 deletions
--- a/docs/appdev/inputs_api_(utf8).md
+++ b/docs/appdev/inputs_api_(utf8).md
@@ -0,0 +1,262 @@
 # UTF-8 Library — Application Development Guide
 ## Overview
 The userland libc provides a lightweight UTF-8 utility module located in:
 - src/userland/libc/utf-8.c
 - src/userland/libc/utf-8.h
 This module is designed for **direct use in applications** requiring UTF-8 handling. It provides basic primitives for decoding, encoding, and traversing UTF-8 strings safely.
 It is intended for:
 - text rendering
 - terminal input/output
 - cursor movement
 - string processing at the character level
 ---
 ## Synopsis
 ```c
 #include <stdint.h>  /* uint32_t */
 #include "utf-8.h"
 uint32_t text_decode_utf8(const char *s, int *advance);
 int text_encode_utf8(uint32_t cp, char *out);
 const char* text_next_utf8(const char *s);
 const char* text_prev_utf8(const char *start, const char *s);
 int text_strlen_utf8(const char *s);
 ```
 ---
 ## API Reference
 ### text_decode_utf8
 ```c
 uint32_t text_decode_utf8(const char *s, int *advance);
 ```
 Decodes a UTF-8 sequence into a Unicode code point.
 - `s`: pointer to the current position in a UTF-8 string
 - `advance`: receives the number of bytes consumed
 Returns:
 - decoded Unicode code point (`uint32_t`)
 - `0` if input is null or empty
 - `0xFFFD` for invalid sequences
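To make the return contract above concrete, here is a minimal decoder sketch. `sketch_decode_utf8` is a hypothetical stand-in written for this guide, not the code in `utf-8.c`. It assumes `advance` is non-null and that at least one byte is consumed on invalid input; the reference above does not specify either point. Like the module, it does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for illustration; not the module's actual source. */
uint32_t sketch_decode_utf8(const char *s, int *advance)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp;
    int len, i;

    if (p == NULL || p[0] == 0) {           /* null or empty input */
        *advance = 0;
        return 0;
    }
    if (p[0] < 0x80)                { cp = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; }
    else {                                  /* stray continuation or bad lead byte */
        *advance = 1;                       /* assumption: skip one byte */
        return 0xFFFD;
    }
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) {        /* truncated sequence (also stops at NUL) */
            *advance = i;                   /* assumption: skip the bytes seen so far */
            return 0xFFFD;
        }
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }
    *advance = len;
    return cp;
}
```

Because the terminating NUL is never a continuation byte, the loop cannot read past the end of a truncated string.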
 ---
 ### text_encode_utf8
 ```c
 int text_encode_utf8(uint32_t cp, char *out);
 ```
 Encodes a Unicode code point into UTF-8.
 - `cp`: Unicode code point
 - `out`: buffer receiving encoded bytes
 Returns:
 - number of bytes written (1–4)
 If `cp` is invalid, the replacement character (`0xFFFD`) is written in its place.
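The encoding step can be sketched as follows. `sketch_encode_utf8` is a hypothetical illustration, not the module's source, and it assumes that "invalid" means a surrogate or a value above `0x10FFFF`; the module's own definition may differ. The caller is assumed to supply at least 4 writable bytes, and no terminator is written.

```c
#include <stdint.h>

/* Hypothetical stand-in for illustration; not the module's actual source. */
int sketch_encode_utf8(uint32_t cp, char *out)
{
    /* Assumption: surrogates and values above 0x10FFFF count as invalid. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}
```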
 ---
 ### text_next_utf8
 ```c
 const char* text_next_utf8(const char *s);
 ```
 Advances to the next UTF-8 character.
 Returns a pointer to the next character boundary.
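One common way to find the next boundary is to step past the lead byte and then skip continuation bytes (`10xxxxxx`). The sketch below shows that technique; `sketch_next_utf8` is a hypothetical name, and staying put on the terminating NUL is an assumption, not documented behavior of `text_next_utf8`.

```c
/* Hypothetical stand-in for illustration; not the module's actual source. */
const char *sketch_next_utf8(const char *s)
{
    if (*s == '\0')
        return s;                           /* assumption: never move past the terminator */
    s++;                                    /* step past the lead byte */
    while (((unsigned char)*s & 0xC0) == 0x80)
        s++;                                /* skip continuation bytes */
    return s;
}
```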
 ---
 ### text_prev_utf8
 ```c
 const char* text_prev_utf8(const char *start, const char *s);
 ```
 Moves backward to the previous UTF-8 character.
 - `start`: beginning of the buffer
 - `s`: current position
 Returns a pointer to the start of the character that precedes `s`. Used for reverse traversal and cursor movement.
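Reverse traversal can be sketched by stepping back one byte and then continuing while the byte is a continuation byte. `sketch_prev_utf8` is a hypothetical name, and clamping at `start` is an assumption about how the boundary case is handled.

```c
/* Hypothetical stand-in for illustration; not the module's actual source. */
const char *sketch_prev_utf8(const char *start, const char *s)
{
    if (s <= start)
        return start;                       /* assumption: clamp at the buffer start */
    s--;
    while (s > start && ((unsigned char)*s & 0xC0) == 0x80)
        s--;                                /* back up over continuation bytes */
    return s;
}
```

The `start` parameter is what keeps the backward scan from running off the front of the buffer when the text begins with malformed bytes.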
 ---
 ### text_strlen_utf8
 ```c
 int text_strlen_utf8(const char *s);
 ```
 Counts UTF-8 characters (code points), not bytes.
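A simple way to count code points is to count every byte that is not a continuation byte. The sketch below uses that technique under a hypothetical name; treating a null pointer as length 0 is an assumption.

```c
#include <stddef.h>

/* Hypothetical stand-in for illustration; not the module's actual source. */
int sketch_strlen_utf8(const char *s)
{
    int count = 0;

    if (s == NULL)
        return 0;                           /* assumption: null counts as empty */
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;                        /* lead byte or ASCII: one code point */
    }
    return count;
}
```

For the string "héllo" encoded as UTF-8, this yields 5 while `strlen` reports 6 bytes.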
 ---
 ## Usage Examples
 ### Iterating over UTF-8 characters
 ```c
 const char *p = text;
 while (*p) {
    int adv;
    uint32_t cp = text_decode_utf8(p, &adv);
    /* process cp; 0xFFFD marks an invalid sequence */
    p += adv;
 }
 ```
 ---
 ### Cursor movement
 ```c
 cursor = text_next_utf8(cursor);                /* move right by one character */
 cursor = text_prev_utf8(buffer_start, cursor);  /* move left by one character */
 ```
 ---
 ### Encoding a character
 ```c
 char out[4];                              /* room for the longest encoding (4 bytes) */
 int len = text_encode_utf8(0x20AC, out);  /* '€' encodes as 0xE2 0x82 0xAC */
 ```
 ---
 ### Backspace handling
 ```c
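 /* Step back one whole character, not one byte; the cast drops const so the buffer can be edited. */
 char *prev = (char *)text_prev_utf8(buffer, cursor);
 cursor = prev;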
 ```
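The snippet above only moves the cursor. A fuller backspace also removes the bytes of the deleted character. The sketch below shows one way to do that on a NUL-terminated, writable buffer; `sketch_backspace` and `prev_boundary` are hypothetical helpers, with `prev_boundary` standing in for `text_prev_utf8` so the example compiles on its own.

```c
#include <string.h>

/* Hypothetical stand-in for text_prev_utf8, so the sketch is self-contained. */
static char *prev_boundary(char *start, char *s)
{
    if (s <= start)
        return start;
    s--;
    while (s > start && ((unsigned char)*s & 0xC0) == 0x80)
        s--;
    return s;
}

/* Deletes the character before `cursor` and returns the new cursor position. */
char *sketch_backspace(char *buffer, char *cursor)
{
    char *prev = prev_boundary(buffer, cursor);

    /* Shift the tail (including the terminator) over the deleted bytes. */
    memmove(prev, cursor, strlen(cursor) + 1);
    return prev;
}
```

At the start of the buffer the call is a no-op, since `prev` and `cursor` are the same address.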
 ---
 ## Implementation Notes
 ### UTF-8 Encoding
 The implementation supports code points in these ranges, grouped by encoded length:
 - 1 byte: `0x00 – 0x7F`
 - 2 bytes: `0x80 – 0x7FF`
 - 3 bytes: `0x800 – 0xFFFF`
 - 4 bytes: `0x10000 – 0x10FFFF`
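The ranges above map directly to a length lookup. The helper below is a hypothetical illustration (it is not part of the module's API) that returns the encoded length for a code point, or 0 when the value lies outside the listed ranges.

```c
#include <stdint.h>

/* Hypothetical helper: bytes needed to encode `cp`, or 0 if out of range. */
int sketch_utf8_length(uint32_t cp)
{
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}
```

A helper like this is useful for sizing an output buffer before encoding a sequence of code points.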
 ---
 ### Replacement Character
 Invalid sequences are replaced with:
 - code point: `0xFFFD`
 - UTF-8 encoding: `0xEF 0xBF 0xBD`
 ---
 ### Control Signals
 Some decoded code points correspond to control signals instead of printable characters.
 ASCII control range:
 - `0x00 – 0x1F`
 Examples:
 - `0x08` → Backspace
 - `0x09` → Tab
 - `0x0A` → Line Feed
 - `0x0D` → Carriage Return
 - `0x1B` → Escape
 These are typically interpreted by:
 - terminal logic
 - shell input handling
 - system interfaces
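An application that decodes input typically checks for the control range before treating a code point as printable text. The sketch below shows one way to do that; both function names are hypothetical, and the name table covers only the examples listed above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for the ASCII control range 0x00 to 0x1F. */
bool sketch_is_control(uint32_t cp)
{
    return cp <= 0x1F;
}

/* Hypothetical helper: name of a listed control code, or NULL if not listed. */
const char *sketch_control_name(uint32_t cp)
{
    switch (cp) {
    case 0x08: return "Backspace";
    case 0x09: return "Tab";
    case 0x0A: return "Line Feed";
    case 0x0D: return "Carriage Return";
    case 0x1B: return "Escape";
    default:   return NULL;
    }
}
```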
 ---
 ### Non-ASCII Characters
 Characters outside the ASCII range (`0x00 – 0x7F`) are encoded using multi-byte UTF-8 sequences.
 Examples:
 - 'é' → `0xC3 0xA9`
 - '€' → `0xE2 0x82 0xAC`
 Decoded values:
 - 'é' → `U+00E9`
 - '€' → `U+20AC`
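The decoded values follow from masking off the marker bits of each byte and joining the remaining payload bits. The two helpers below are hypothetical and exist only to show that arithmetic for the 2-byte and 3-byte forms.

```c
#include <stdint.h>

/* 2-byte form: 110xxxxx 10xxxxxx (5 + 6 payload bits) */
uint32_t sketch_join2(uint8_t b0, uint8_t b1)
{
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* 3-byte form: 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 payload bits) */
uint32_t sketch_join3(uint8_t b0, uint8_t b1, uint8_t b2)
{
    return ((uint32_t)(b0 & 0x0F) << 12)
         | ((uint32_t)(b1 & 0x3F) << 6)
         |  (uint32_t)(b2 & 0x3F);
}
```

With the bytes listed above, `sketch_join2(0xC3, 0xA9)` gives `0xE9` and `sketch_join3(0xE2, 0x82, 0xAC)` gives `0x20AC`.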
 ---
 ### Modifiers and Layout
 Character output depends on:
 - keyboard layout
 - modifier keys (Shift, Ctrl, AltGr)
 Example:
 - `KEY_E` → 'e'
 - `KEY_E + SHIFT` → 'E'
 - `KEY_E + AltGr` → '€'
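The mapping in the example can be pictured as a lookup from a key plus modifier state to a code point, which is then encoded as UTF-8. Everything in the sketch below is hypothetical: the key code, the modifier flags, and the function name are invented for illustration and do not describe the actual input API or any real keyboard layout table.

```c
#include <stdint.h>

/* Hypothetical key code and modifier flags, for illustration only. */
enum { SKETCH_KEY_E = 1 };
enum { SKETCH_MOD_NONE = 0, SKETCH_MOD_SHIFT = 1, SKETCH_MOD_ALTGR = 2 };

/* Returns the code point for the example key, or 0 for an unmapped key. */
uint32_t sketch_key_to_codepoint(int key, int mods)
{
    if (key != SKETCH_KEY_E)
        return 0;
    if (mods & SKETCH_MOD_ALTGR)
        return 0x20AC;                      /* '€' */
    if (mods & SKETCH_MOD_SHIFT)
        return 'E';
    return 'e';
}
```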
 ---
 ## Limitations
 - No full UTF-8 validation: overlong encodings and surrogate code points are not fully rejected
 - No grapheme cluster handling
 - No Unicode normalization
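Applications that need stricter validation can add a check after decoding. The sketch below is a hypothetical helper, not part of the module; it takes the decoded code point and the number of bytes consumed, and rejects overlong forms, surrogates, and values above `0x10FFFF`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical post-decode check; `len` is the byte count reported via `advance`. */
bool sketch_is_strict_utf8(uint32_t cp, int len)
{
    /* Smallest code point that legitimately needs `len` bytes. */
    static const uint32_t min_cp[5] = { 0, 0x00, 0x80, 0x800, 0x10000 };

    if (len < 1 || len > 4)
        return false;
    if (cp < min_cp[len])
        return false;                       /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;                       /* UTF-16 surrogate */
    if (cp > 0x10FFFF)
        return false;                       /* beyond the Unicode range */
    return true;
}
```

For example, the two-byte sequence `0xC0 0xAF` decodes to `0x2F` ('/') in a lenient decoder, and this check flags it as overlong because `0x2F` fits in one byte.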
 ---
 ## Best Practices
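 - Never step through UTF-8 text one byte at a time when you mean to process characters
 - Always use the provided helpers for navigation
 - Track byte length and character count separately: size buffers in bytes, count characters with `text_strlen_utf8`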
 - Handle invalid sequences safely
 ---
 ## Summary
 This module provides essential UTF-8 primitives for userland applications.
 It should be used whenever an application needs to safely:
 - decode UTF-8
 - encode Unicode
 - traverse text
 - handle user input correctly