diff --git a/docs/appdev/inputs_api_(utf8).md b/docs/appdev/inputs_api_(utf8).md new file mode 100644 index 0000000..0c267a7 --- /dev/null +++ b/docs/appdev/inputs_api_(utf8).md @@ -0,0 +1,262 @@ +# UTF-8 Library — Application Development Guide + +## Overview + +The userland libc provides a lightweight UTF-8 utility module located in: + +- src/userland/libc/utf-8.c +- src/userland/libc/utf-8.h + +This module is designed for **direct use in applications** requiring UTF-8 handling. It provides basic primitives for decoding, encoding, and traversing UTF-8 strings safely. + +It is intended for: + +- text rendering +- terminal input/output +- cursor movement +- string processing at the character level + +--- + +## Synopsis + +```c +#include "utf-8.h" + +uint32_t text_decode_utf8(const char *s, int *advance); +int text_encode_utf8(uint32_t cp, char *out); + +const char* text_next_utf8(const char *s); +const char* text_prev_utf8(const char *start, const char *s); + +int text_strlen_utf8(const char *s); +``` + +--- + +## API Reference + +### text_decode_utf8 + +```c +uint32_t text_decode_utf8(const char *s, int *advance); +``` + +Decodes a UTF-8 sequence into a Unicode code point. + +- `s`: pointer to current position in a UTF-8 string +- `advance`: receives number of bytes consumed + +Returns: + +- decoded Unicode code point (`uint32_t`) +- `0` if input is null or empty +- `0xFFFD` for invalid sequences + +--- + +### text_encode_utf8 + +```c +int text_encode_utf8(uint32_t cp, char *out); +``` + +Encodes a Unicode code point into UTF-8. + +- `cp`: Unicode code point +- `out`: buffer receiving encoded bytes + +Returns: + +- number of bytes written (1–4) +- writes replacement character if `cp` is invalid + +--- + +### text_next_utf8 + +```c +const char* text_next_utf8(const char *s); +``` + +Advances to the next UTF-8 character. + +Returns a pointer to the next character boundary. + +--- + +### text_prev_utf8 + +```c +const char* text_prev_utf8(const char *start, const char *s); +``` + +Moves backward to the previous UTF-8 character. + +- `start`: beginning of the buffer +- `s`: current position + +Used for reverse traversal and cursor movement. + +--- + +### text_strlen_utf8 + +```c +int text_strlen_utf8(const char *s); +``` + +Counts UTF-8 characters (code points), not bytes. + +--- + +## Usage Examples + +### Iterating over UTF-8 characters + +```c +const char *p = text; + +while (*p) { + int adv; + uint32_t cp = text_decode_utf8(p, &adv); + + /* process cp */ + + p += adv; +} +``` + +--- + +### Cursor movement + +```c +cursor = text_next_utf8(cursor); +cursor = text_prev_utf8(buffer_start, cursor); +``` + +--- + +### Encoding a character + +```c +char out[4]; +int len = text_encode_utf8(0x20AC, out); +``` + +--- + +### Backspace handling + +```c +char *prev = (char*)text_prev_utf8(buffer, cursor); +cursor = prev; +``` + +--- + +## Implementation Notes + +### UTF-8 Encoding + +The implementation supports: + +- 1 byte: `0x00 – 0x7F` +- 2 bytes: `0x80 – 0x7FF` +- 3 bytes: `0x800 – 0xFFFF` +- 4 bytes: `0x10000 – 0x10FFFF` + +--- + +### Replacement Character + +Invalid sequences are replaced with: + +- code point: `0xFFFD` +- UTF-8 encoding: `0xEF 0xBF 0xBD` + +--- + +### Control Signals + +Some decoded code points correspond to control signals instead of printable characters. + +ASCII control range: + +- `0x00 – 0x1F` + +Examples: + +- `0x08` → Backspace +- `0x09` → Tab +- `0x0A` → Line Feed +- `0x0D` → Carriage Return +- `0x1B` → Escape + +These are typically interpreted by: + +- terminal logic +- shell input handling +- system interfaces + +--- + +### Non-ASCII Characters + +Characters outside the ASCII range (`0x00 – 0x7F`) are encoded using multi-byte UTF-8 sequences. + +Examples: + +- 'é' → `0xC3 0xA9` +- '€' → `0xE2 0x82 0xAC` + +Decoded values: + +- 'é' → `U+00E9` +- '€' → `U+20AC` + +--- + +### Modifiers and Layout + +Character output depends on: + +- keyboard layout +- modifier keys (Shift, Ctrl, AltGr) + +Example: + +- `KEY_E` → 'e' +- `KEY_E + SHIFT` → 'E' +- `KEY_E + AltGr` → '€' + +--- + +## Limitations + +- No full UTF-8 validation (overlong, surrogates not fully rejected) +- No grapheme cluster handling +- No Unicode normalization + +--- + +## Best Practices + +- Never iterate UTF-8 strings byte-by-byte +- Always use provided helpers for navigation +- Separate byte length from character count +- Handle invalid sequences safely + +--- + +## Summary + +This module provides essential UTF-8 primitives for userland applications. + +It should be used whenever an application needs to safely: + +- decode UTF-8 +- encode Unicode +- traverse text +- handle user input correctly