mirror of
https://github.com/BoredDevNL/BoredOS.git
synced 2026-05-15 10:48:38 +00:00
doc: Add UTF-8 Library Application Development Guide
This document provides an overview and API reference for a UTF-8 utility module in the userland libc, including functions for decoding, encoding, and traversing UTF-8 strings.
parent 81ea21e746
commit c11d4a8a00
1 changed file with 262 additions and 0 deletions
262
docs/appdev/inputs_api_(utf8).md
Normal file
@ -0,0 +1,262 @@
# UTF-8 Library — Application Development Guide

## Overview

The userland libc provides a lightweight UTF-8 utility module in:

- `src/userland/libc/utf-8.c`
- `src/userland/libc/utf-8.h`

This module is designed for **direct use in applications** that need UTF-8 handling. It provides basic primitives for decoding, encoding, and traversing UTF-8 strings, and it reports invalid sequences as `0xFFFD` instead of failing.

It is intended for:

- text rendering
- terminal input/output
- cursor movement
- string processing at the character level

---

## Synopsis

```c
#include "utf-8.h"

uint32_t text_decode_utf8(const char *s, int *advance);
int text_encode_utf8(uint32_t cp, char *out);

const char* text_next_utf8(const char *s);
const char* text_prev_utf8(const char *start, const char *s);

int text_strlen_utf8(const char *s);
```

---

## API Reference

### text_decode_utf8

```c
uint32_t text_decode_utf8(const char *s, int *advance);
```

Decodes one UTF-8 sequence into a Unicode code point.

- `s`: pointer to the current position in a UTF-8 string
- `advance`: receives the number of bytes consumed

Returns:

- the decoded Unicode code point (`uint32_t`)
- `0` if `s` is null or points to an empty string
- `0xFFFD` (the replacement character) for invalid sequences
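
To make the return contract concrete, here is a minimal sketch of a decoder with the same signature. `sketch_decode_utf8` is a hypothetical stand-in written for this guide, not the code in `utf-8.c`. The number of bytes it reports for an invalid sequence, and its requirement that `advance` is non-null, are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, not the BoredOS implementation: decodes one
 * sequence at s and stores the number of bytes consumed in *advance. */
uint32_t sketch_decode_utf8(const char *s, int *advance)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp;
    int need, i;

    assert(advance != NULL);            /* assumption: advance is required */

    if (p == NULL || p[0] == 0) {       /* null or empty input */
        *advance = 0;
        return 0;
    }
    if (p[0] < 0x80)                { cp = p[0];        need = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; need = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; need = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; need = 3; }
    else {                              /* stray continuation or bad lead byte */
        *advance = 1;
        return 0xFFFD;
    }
    for (i = 1; i <= need; i++) {
        if ((p[i] & 0xC0) != 0x80) {    /* truncated sequence, stops at NUL */
            *advance = i;
            return 0xFFFD;
        }
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }
    *advance = need + 1;
    return cp;
}
```

Like the module described here, this sketch does not reject overlong encodings or surrogates. It only checks the byte structure.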

---

### text_encode_utf8

```c
int text_encode_utf8(uint32_t cp, char *out);
```

Encodes a Unicode code point into UTF-8.

- `cp`: Unicode code point
- `out`: buffer receiving the encoded bytes; it must have room for at least 4 bytes

Returns the number of bytes written (1–4). If `cp` is invalid, the replacement character (`0xFFFD`) is written instead.
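
The encoding step can be sketched as follows. `sketch_encode_utf8` is a hypothetical stand-in, not the code in `utf-8.c`. Which values count as invalid is an assumption here (surrogates and anything above `0x10FFFF`), as is the absence of a null terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, not the BoredOS implementation. Writes 1 to 4
 * bytes to out (no terminator) and returns the number of bytes written. */
int sketch_encode_utf8(uint32_t cp, char *out)
{
    assert(out != NULL);

    /* Assumption: surrogates and values above U+10FFFF are invalid. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}
```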

---

### text_next_utf8

```c
const char* text_next_utf8(const char *s);
```

Advances past the UTF-8 character at `s`.

- `s`: pointer to the current character

Returns a pointer to the next character boundary.
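
One common way to find the next boundary is to step over the lead byte and then skip every continuation byte (`10xxxxxx`). The sketch below uses that technique. `sketch_next_utf8` is a hypothetical name, and staying put at the terminator is an assumption about the desired behaviour, not a documented property of `text_next_utf8`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: skip one lead byte, then any continuation bytes. */
const char *sketch_next_utf8(const char *s)
{
    assert(s != NULL);

    if (*s == '\0')
        return s;                       /* assumption: never pass the NUL */

    s++;
    while (((unsigned char)*s & 0xC0) == 0x80)
        s++;                            /* 10xxxxxx: still inside the character */
    return s;
}
```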

---

### text_prev_utf8

```c
const char* text_prev_utf8(const char *start, const char *s);
```

Moves backward to the start of the previous UTF-8 character.

- `start`: beginning of the buffer, which bounds the backward search
- `s`: current position

Returns a pointer to the previous character boundary. Used for reverse traversal and cursor movement.
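
Backward traversal is usually the mirror image of forward traversal: step back one byte, then keep stepping while the byte is a continuation byte and the buffer start has not been reached. The following is a sketch of that idea under those assumptions. `sketch_prev_utf8` is a hypothetical name, not the function in `utf-8.c`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: walk back to the nearest lead byte, never
 * moving before start. */
const char *sketch_prev_utf8(const char *start, const char *s)
{
    assert(start != NULL && s != NULL && s >= start);

    if (s == start)
        return start;                   /* already at the beginning */

    s--;
    while (s > start && ((unsigned char)*s & 0xC0) == 0x80)
        s--;                            /* skip continuation bytes */
    return s;
}
```

The `start` parameter is what makes the loop safe: without it, a buffer that begins with continuation bytes would let the search run past the front of the array.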

---

### text_strlen_utf8

```c
int text_strlen_utf8(const char *s);
```

Counts UTF-8 characters (code points), not bytes.
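
The difference from `strlen` can be illustrated with a short sketch that counts every byte that is not a continuation byte. `sketch_strlen_utf8` is a hypothetical stand-in. How the real function treats a null pointer or malformed input is not covered by this guide, so the sketch simply requires a valid string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: one code point per non-continuation byte. */
int sketch_strlen_utf8(const char *s)
{
    int count = 0;

    assert(s != NULL);                  /* assumption: caller passes a string */

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;                    /* lead byte or ASCII byte */
    }
    return count;
}
```

For the 6-byte string `"a\xC3\xA9\xE2\x82\xAC"` ("aé€"), this returns 3 where `strlen` returns 6.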

---

## Usage Examples

### Iterating over UTF-8 characters

```c
const char *p = text;

while (*p) {
    int adv;
    uint32_t cp = text_decode_utf8(p, &adv);

    if (adv <= 0)
        break; /* defensive: never loop without making progress */

    /* process cp; 0xFFFD marks an invalid sequence */

    p += adv;
}
```

---

### Cursor movement

```c
cursor = text_next_utf8(cursor);               /* move right by one character */
cursor = text_prev_utf8(buffer_start, cursor); /* move left by one character */
```

---

### Encoding a character

```c
char out[5]; /* up to 4 encoded bytes plus room for a terminator */
int len = text_encode_utf8(0x20AC, out); /* U+20AC '€' encodes to 3 bytes */
out[len] = '\0'; /* terminate explicitly before using out as a C string */
```

---

### Backspace handling

```c
/* step back to the start of the character before the cursor */
char *prev = (char*)text_prev_utf8(buffer, cursor);
cursor = prev;
```
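
The snippet above only moves the cursor. An editor usually also has to remove the bytes of the deleted character from the buffer. The sketch below shows one way to do that with `memmove`. `sketch_backspace` is a hypothetical helper, and it performs the backward walk inline so that it compiles on its own; in an application that walk would be the call to `text_prev_utf8`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: delete the character before cursor from the
 * NUL-terminated buffer buf and return the new cursor position. */
char *sketch_backspace(char *buf, char *cursor)
{
    char *prev = cursor;

    assert(buf != NULL && cursor != NULL && cursor >= buf);

    if (cursor == buf)
        return cursor;                  /* nothing to delete */

    /* Walk back over continuation bytes to the previous lead byte. */
    do {
        prev--;
    } while (prev > buf && ((unsigned char)*prev & 0xC0) == 0x80);

    /* Close the gap; the +1 moves the terminator as well. */
    memmove(prev, cursor, strlen(cursor) + 1);
    return prev;
}
```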

---

## Implementation Notes

### UTF-8 Encoding

The implementation supports sequences of 1 to 4 bytes. The sequence length depends on the code point range:

- 1 byte: `0x00 – 0x7F`
- 2 bytes: `0x80 – 0x7FF`
- 3 bytes: `0x800 – 0xFFFF`
- 4 bytes: `0x10000 – 0x10FFFF`
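
The ranges above translate directly into a length lookup, which is handy for sizing a buffer before encoding. `sketch_utf8_length` is a hypothetical helper and is not part of the module's API as listed in the Synopsis.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of UTF-8 bytes needed for cp. */
int sketch_utf8_length(uint32_t cp)
{
    assert(cp <= 0x10FFFF);             /* outside this range is not Unicode */

    if (cp <= 0x7F)   return 1;
    if (cp <= 0x7FF)  return 2;
    if (cp <= 0xFFFF) return 3;
    return 4;
}
```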

---

### Replacement Character

Invalid sequences are replaced with the Unicode replacement character:

- code point: `0xFFFD`
- UTF-8 encoding: `0xEF 0xBF 0xBD`

---

### Control Signals

Some decoded code points are control signals, not printable characters. The library decodes them like any other code point; the application decides what they mean.

ASCII control range:

- `0x00 – 0x1F`

Examples:

- `0x08` → Backspace
- `0x09` → Tab
- `0x0A` → Line Feed
- `0x0D` → Carriage Return
- `0x1B` → Escape

These are typically interpreted by:

- terminal logic
- shell input handling
- system interfaces
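
As an illustration of how input handling might separate control signals from printable text, the sketch below classifies a decoded code point. The enum, the function name, and the policy (treating both `0x0A` and `0x0D` as a newline, ignoring other control codes) are assumptions made for this example, not behaviour defined by the module.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical action set for an input loop. */
enum sketch_action {
    SKETCH_INSERT,      /* printable: add to the text buffer */
    SKETCH_BACKSPACE,
    SKETCH_TAB,
    SKETCH_NEWLINE,
    SKETCH_ESCAPE,
    SKETCH_IGNORE       /* any other control code */
};

/* Hypothetical helper: map a decoded code point to an action. */
enum sketch_action sketch_classify(uint32_t cp)
{
    assert(cp <= 0x10FFFF);

    switch (cp) {
    case 0x08: return SKETCH_BACKSPACE;
    case 0x09: return SKETCH_TAB;
    case 0x0A:                          /* assumption: LF and CR both end a line */
    case 0x0D: return SKETCH_NEWLINE;
    case 0x1B: return SKETCH_ESCAPE;
    default:   break;
    }
    return cp <= 0x1F ? SKETCH_IGNORE : SKETCH_INSERT;
}
```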

---

### Non-ASCII Characters

Characters outside the ASCII range (`0x00 – 0x7F`) are encoded using multi-byte UTF-8 sequences.

Examples:

- 'é' → `0xC3 0xA9`
- '€' → `0xE2 0x82 0xAC`

Decoded values:

- 'é' → `U+00E9`
- '€' → `U+20AC`

---

### Modifiers and Layout

The character produced by a key press depends on:

- keyboard layout
- modifier keys (Shift, Ctrl, AltGr)

Example (the exact mapping depends on the active layout):

- `KEY_E` → 'e'
- `KEY_E + SHIFT` → 'E'
- `KEY_E + AltGr` → '€'

---

## Limitations

- No full UTF-8 validation: overlong encodings and surrogate code points are not fully rejected
- No grapheme cluster handling
- No Unicode normalization
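
Applications that need stricter validation can add a check on top of the decoder's result. The sketch below takes a decoded code point together with the number of bytes it was decoded from and rejects surrogates, out-of-range values, and overlong encodings. `sketch_is_strict` is a hypothetical helper, not part of the module.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if cp, decoded from len bytes, is a
 * Unicode scalar value in its shortest UTF-8 form, and 0 otherwise. */
int sketch_is_strict(uint32_t cp, int len)
{
    int shortest;

    assert(len >= 1 && len <= 4);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range or surrogate */

    shortest = cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
    return len == shortest;             /* a longer form is overlong */
}
```

For example, the 2-byte sequence `0xC0 0xAF` decodes to `0x2F` ('/') in a lenient decoder; `sketch_is_strict(0x2F, 2)` returns 0 because '/' fits in one byte.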

---

## Best Practices

- Never iterate UTF-8 strings byte-by-byte when working at the character level
- Always use the provided helpers for navigation
- Keep byte length and character count separate: use byte length for buffer sizes and character count for cursor positions
- Handle invalid sequences safely by checking for `0xFFFD`

---

## Summary

This module provides essential UTF-8 primitives for userland applications.

It should be used whenever an application needs to safely:

- decode UTF-8
- encode Unicode
- traverse text
- handle user input correctly