doc: Add UTF-8 Library Application Development Guide

This document provides an overview and API reference for a UTF-8 utility module in the userland libc, including functions for decoding, encoding, and traversing UTF-8 strings.
2026-05-15 10:48:38 +00:00 · 2026-04-23 22:13:32 +02:00 · 2026-04-23 22:13:32 +02:00 · c11d4a8a00
commit c11d4a8a00
parent 81ea21e746
1 changed file with 262 additions and 0 deletions
--- a/docs/appdev/inputs_api_(utf8).md
+++ b/docs/appdev/inputs_api_(utf8).md
@@ -0,0 +1,262 @@
+# UTF-8 Library — Application Development Guide
+
+## Overview
+
+The userland libc provides a lightweight UTF-8 utility module located in:
+
+- src/userland/libc/utf-8.c
+- src/userland/libc/utf-8.h
+
+This module is designed for **direct use in applications** requiring UTF-8 handling. It provides basic primitives for decoding, encoding, and traversing UTF-8 strings safely.
+
+It is intended for:
+
+- text rendering
+- terminal input/output
+- cursor movement
+- string processing at the character level
+
+---
+
+## Synopsis
+
+```c
+#include "utf-8.h"
+
+uint32_t text_decode_utf8(const char *s, int *advance);
+int text_encode_utf8(uint32_t cp, char *out);
+
+const char* text_next_utf8(const char *s);
+const char* text_prev_utf8(const char *start, const char *s);
+
+int text_strlen_utf8(const char *s);
+```
+
+---
+
+## API Reference
+
+### text_decode_utf8
+
+```c
+uint32_t text_decode_utf8(const char *s, int *advance);
+```
+
+Decodes a UTF-8 sequence into a Unicode code point.
+
+- `s`: pointer to current position in a UTF-8 string
+- `advance`: receives the number of bytes consumed by the decoded sequence
+
+Returns:
+
+- decoded Unicode code point (`uint32_t`)
+- `0` if input is null or empty
+- `0xFFFD` for invalid sequences
+
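The contract above can be illustrated with a minimal decoder. This is a sketch, not the code in `utf-8.c`: the name `sketch_decode_utf8` is hypothetical, and the values it stores in `advance` for empty or invalid input (0, or the number of bytes inspected) are assumptions that the guide does not state. Like the module described here, it does not reject overlong forms or surrogates.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for illustration; not the code in utf-8.c. */
static uint32_t sketch_decode_utf8(const char *s, int *advance)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp;
    int need, i;

    if (p == NULL || p[0] == 0) {      /* null or empty input */
        *advance = 0;                  /* assumption: nothing consumed */
        return 0;
    }

    /* The lead byte gives the payload bits and the continuation count. */
    if (p[0] < 0x80)                { cp = p[0];        need = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; need = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; need = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; need = 3; }
    else {
        *advance = 1;                  /* stray continuation or bad lead byte */
        return 0xFFFD;
    }

    for (i = 1; i <= need; i++) {
        if ((p[i] & 0xC0) != 0x80) {   /* also stops at the terminating NUL */
            *advance = i;              /* assumption: skip only the bytes read */
            return 0xFFFD;
        }
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    *advance = need + 1;
    return cp;
}
```

Because `advance` is at least 1 for any non-empty input in this sketch, a caller that adds it to its pointer always makes progress, even across invalid bytes.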
+---
+
+### text_encode_utf8
+
+```c
+int text_encode_utf8(uint32_t cp, char *out);
+```
+
+Encodes a Unicode code point into UTF-8.
+
+- `cp`: Unicode code point
+- `out`: buffer receiving the encoded bytes (must have room for at least 4 bytes)
+
+Returns:
+
+- number of bytes written (1–4)
+
+If `cp` is not a valid code point, the replacement character (`0xFFFD`) is written to `out` instead.
+
+---
+
+### text_next_utf8
+
+```c
+const char* text_next_utf8(const char *s);
+```
+
+Advances to the next UTF-8 character.
+
+Returns a pointer to the next character boundary.
+
+---
+
+### text_prev_utf8
+
+```c
+const char* text_prev_utf8(const char *start, const char *s);
+```
+
+Moves backward to the previous UTF-8 character.
+
+- `start`: beginning of the buffer
+- `s`: current position (must point into the buffer that begins at `start`)
+
+Used for reverse traversal and cursor movement.
+
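Both traversal helpers can be understood through one property of UTF-8: continuation bytes always match the bit pattern `10xxxxxx`, so a character boundary is any byte that does not. The sketch below shows that idea under stated assumptions. The `sketch_` names are hypothetical, and the behavior at the ends of the buffer (staying on the terminating NUL, clamping at `start`) is an assumption rather than something the guide specifies.

```c
#include <stddef.h>

/* True for bytes of the form 10xxxxxx, which never start a character. */
static int sketch_is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical stand-in: step forward to the next character boundary. */
static const char *sketch_next_utf8(const char *s)
{
    if (s == NULL || *s == '\0')
        return s;                      /* assumption: stay put at the end */
    s++;
    while (sketch_is_continuation((unsigned char)*s))
        s++;                           /* the NUL terminator ends the loop */
    return s;
}

/* Hypothetical stand-in: step back to the previous character boundary. */
static const char *sketch_prev_utf8(const char *start, const char *s)
{
    if (s == NULL || s <= start)
        return start;                  /* assumption: clamp at the buffer start */
    s--;
    while (s > start && sketch_is_continuation((unsigned char)*s))
        s--;
    return s;
}
```

The `start` parameter is what keeps the backward scan from running off the front of the buffer, which is presumably why `text_prev_utf8` takes it while `text_next_utf8` can rely on the NUL terminator.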
+---
+
+### text_strlen_utf8
+
+```c
+int text_strlen_utf8(const char *s);
+```
+
+Counts UTF-8 characters (code points), not bytes.
+
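One common way to count code points, shown here as a sketch and not as the library's implementation, is to count every byte that is not a continuation byte. The name `sketch_strlen_utf8` is hypothetical, and returning 0 for a null pointer is an assumption.

```c
#include <stddef.h>

/* Hypothetical stand-in: counts lead bytes, i.e. bytes not of the form 10xxxxxx. */
static int sketch_strlen_utf8(const char *s)
{
    int count = 0;

    if (s == NULL)
        return 0;                      /* assumption: null counts as empty */

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the 6-byte string `"a\xC3\xA9\xE2\x82\xAC"` (the characters a, é, €) this returns 3, while `strlen` would report 6. Buffer sizes and offsets remain byte quantities.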
+---
+
+## Usage Examples
+
+### Iterating over UTF-8 characters
+
+```c
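+const char *p = text;   /* text: NUL-terminated UTF-8 string */
+
+while (*p) {
+    int adv;
+    uint32_t cp = text_decode_utf8(p, &adv);
+
+    if (adv <= 0)
+        break;          /* defensive: never loop without making progress */
+
+    /* process cp; 0xFFFD marks an invalid sequence */
+
+    p += adv;
+}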
+```
+
+---
+
+### Cursor movement
+
+```c
+cursor = text_next_utf8(cursor);
+cursor = text_prev_utf8(buffer_start, cursor);
+```
+
+---
+
+### Encoding a character
+
+```c
+char out[4];                              /* 4 bytes is the longest UTF-8 sequence */
+int len = text_encode_utf8(0x20AC, out);  /* U+20AC '€' encodes as E2 82 AC (3 bytes) */
+```
+
+---
+
+### Backspace handling
+
+```c
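+/* Assumes buffer is a mutable, NUL-terminated char array (needs <string.h>). */
+char *prev = (char *)text_prev_utf8(buffer, cursor);
+memmove(prev, cursor, strlen(cursor) + 1);  /* remove the deleted character's bytes */
+cursor = prev;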
+```
+
+---
+
+## Implementation Notes
+
+### UTF-8 Encoding
+
+The implementation supports the four standard sequence lengths, chosen by code point range:
+
+- 1 byte: `U+0000 – U+007F`
+- 2 bytes: `U+0080 – U+07FF`
+- 3 bytes: `U+0800 – U+FFFF`
+- 4 bytes: `U+10000 – U+10FFFF`
+
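The ranges above map directly onto an encoder: pick the length from the code point's range, then split its bits into a lead byte and 6-bit continuation bytes. The following is a sketch under assumptions, not the code in `utf-8.c`. The name `sketch_encode_utf8` is hypothetical, and treating surrogates and values above `U+10FFFF` as invalid is an assumption that may be stricter than the library.

```c
#include <stdint.h>

/* Hypothetical stand-in for illustration; not the code in utf-8.c. */
static int sketch_encode_utf8(uint32_t cp, char *out)
{
    /* Assumption: "invalid" means a surrogate or a value above U+10FFFF. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    if (cp <= 0x7F) {                  /* 0xxxxxxx */
        out[0] = (char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                 /* 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}
```

The sketch writes no NUL terminator, so a caller that wants a C string would need a 5-byte buffer and would set `out[len] = '\0'` itself.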
+---
+
+### Replacement Character
+
+Invalid sequences decode to the Unicode replacement character:
+
+- code point: `0xFFFD`
+- UTF-8 encoding: `0xEF 0xBF 0xBD`
+
+---
+
+### Control Signals
+
+Some decoded code points are control characters rather than printable characters.
+
+ASCII control range:
+
+- `0x00 – 0x1F`
+- `0x7F` (Delete)
+
+Examples:
+
+- `0x08` → Backspace
+- `0x09` → Tab
+- `0x0A` → Line Feed
+- `0x0D` → Carriage Return
+- `0x1B` → Escape
+
+These are typically interpreted by:
+
+- terminal logic
+- shell input handling
+- system interfaces
+
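An input loop typically branches on these values right after decoding. The sketch below shows one way to do that. The `sketch_action` enum and `sketch_classify` function are hypothetical and not part of the library, and the policy of ignoring the remaining control characters (including `0x7F`, ASCII Delete) is an assumption made for the example.

```c
#include <stdint.h>

/* Hypothetical action codes for an input loop; not defined by the library. */
enum sketch_action {
    ACT_PRINT,
    ACT_BACKSPACE,
    ACT_TAB,
    ACT_NEWLINE,
    ACT_ESCAPE,
    ACT_IGNORE
};

/* Maps a decoded code point to the action an input handler might take. */
static enum sketch_action sketch_classify(uint32_t cp)
{
    switch (cp) {
    case 0x08: return ACT_BACKSPACE;
    case 0x09: return ACT_TAB;
    case 0x0A:                         /* Line Feed */
    case 0x0D: return ACT_NEWLINE;     /* Carriage Return */
    case 0x1B: return ACT_ESCAPE;
    default:
        /* Assumption: other control characters are dropped in this sketch. */
        return (cp < 0x20 || cp == 0x7F) ? ACT_IGNORE : ACT_PRINT;
    }
}
```

Classifying the decoded code point, not the raw byte, keeps the check correct for multi-byte input: continuation bytes are never mistaken for control characters because they never reach this function on their own.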
+---
+
+### Non-ASCII Characters
+
+Code points above `U+007F` (outside the ASCII range `0x00 – 0x7F`) are encoded as multi-byte UTF-8 sequences of 2 to 4 bytes.
+
+Examples:
+
+- 'é' → `0xC3 0xA9`
+- '€' → `0xE2 0x82 0xAC`
+
+Decoded values:
+
+- 'é' → `U+00E9`
+- '€' → `U+20AC`
+
+---
+
+### Modifiers and Layout
+
+The code point an application receives for a key press depends on:
+
+- keyboard layout
+- modifier keys (Shift, Ctrl, AltGr)
+
+Example (the AltGr mapping depends on the active layout):
+
+- `KEY_E` → 'e'
+- `KEY_E + SHIFT` → 'E'
+- `KEY_E + AltGr` → '€'
+
+---
+
+## Limitations
+
+- Validation is incomplete: overlong encodings and surrogate code points (`U+D800 – U+DFFF`) are not fully rejected
+- No grapheme cluster handling
+- No Unicode normalization
+
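Applications that need stricter checking than the module provides can validate the decoder's result themselves. A minimal sketch, assuming the caller has the decoded code point and the number of bytes consumed (for example `cp` and `advance` from `text_decode_utf8`); the name `sketch_is_strict` is hypothetical:

```c
#include <stdint.h>

/* Returns 1 if cp is a Unicode scalar value encoded in its shortest form. */
static int sketch_is_strict(uint32_t cp, int nbytes)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const uint32_t min_cp[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };

    if (nbytes < 1 || nbytes > 4)
        return 0;
    if (cp < min_cp[nbytes])
        return 0;                      /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* UTF-16 surrogate */
    if (cp > 0x10FFFF)
        return 0;                      /* beyond the Unicode range */
    return 1;
}
```

For instance, the two-byte sequence `0xC0 0xAF` decodes to `0x2F` ('/') in a lenient decoder; `sketch_is_strict(0x2F, 2)` returns 0 and flags it as overlong.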
+---
+
+## Best Practices
+
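+- Do not step through UTF-8 strings byte-by-byte when character boundaries matter
+- Always use the provided helpers (`text_next_utf8`, `text_prev_utf8`) for navigation
+- Keep byte length and character count (`text_strlen_utf8`) separate; buffer sizes are always in bytes
+- Check for `0xFFFD` after decoding and handle invalid sequences explicitly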
+
+---
+
+## Summary
+
+This module provides essential UTF-8 primitives for userland applications.
+
+It should be used whenever an application needs to safely:
+
+- decode UTF-8 sequences
+- encode Unicode code points
+- traverse text
+- handle user input correctly