boredos_mirror/docs/appdev/inputs_api_(utf8).md
Lluciocc 8d0e744991
doc: Add UTF-8 byte structure section and resources (#10)
Added a section on UTF-8 byte structure with a diagram and a recommended video for further understanding.
2026-04-25 00:51:54 +02:00

4.3 KiB
Raw Blame History

UTF-8 Library — Application Development Guide

Overview

The userland libc provides a lightweight UTF-8 utility module located in:

  • src/userland/libc/utf-8.c
  • src/userland/libc/utf-8.h

This module is designed for direct use in applications requiring UTF-8 handling. It provides basic primitives for decoding, encoding, and traversing UTF-8 strings safely.

It is intended for:

  • text rendering
  • terminal input/output
  • cursor movement
  • string processing at the character level

Synopsis

#include "utf-8.h"

uint32_t text_decode_utf8(const char *s, int *advance);
int text_encode_utf8(uint32_t cp, char *out);

const char* text_next_utf8(const char *s);
const char* text_prev_utf8(const char *start, const char *s);

int text_strlen_utf8(const char *s);

API Reference

text_decode_utf8

uint32_t text_decode_utf8(const char *s, int *advance);

Decodes a UTF-8 sequence into a Unicode code point.

  • s: pointer to current position in a UTF-8 string
  • advance: receives number of bytes consumed

Returns:

  • decoded Unicode code point (uint32_t)
  • 0 if input is null or empty
  • 0xFFFD for invalid sequences

text_encode_utf8

int text_encode_utf8(uint32_t cp, char *out);

Encodes a Unicode code point into UTF-8.

  • cp: Unicode code point
  • out: buffer receiving encoded bytes

Returns:

  • number of bytes written (14)
  • writes replacement character if cp is invalid

text_next_utf8

const char* text_next_utf8(const char *s);

Advances to the next UTF-8 character.

Returns a pointer to the next character boundary.


text_prev_utf8

const char* text_prev_utf8(const char *start, const char *s);

Moves backward to the previous UTF-8 character.

  • start: beginning of the buffer
  • s: current position

Used for reverse traversal and cursor movement.


text_strlen_utf8

int text_strlen_utf8(const char *s);

Counts UTF-8 characters (code points), not bytes.


Usage Examples

Iterating over UTF-8 characters

const char *p = text;

while (*p) {
    int adv;
    uint32_t cp = text_decode_utf8(p, &adv);

    /* process cp */

    p += adv;
}

Cursor movement

cursor = text_next_utf8(cursor);
cursor = text_prev_utf8(buffer_start, cursor);

Encoding a character

char out[4];
int len = text_encode_utf8(0x20AC, out);

Backspace handling

char *prev = (char*)text_prev_utf8(buffer, cursor);
cursor = prev;

Implementation Notes

UTF-8 Encoding

The implementation supports:

  • 1 byte: 0x00 0x7F
  • 2 bytes: 0x80 0x7FF
  • 3 bytes: 0x800 0xFFFF
  • 4 bytes: 0x10000 0x10FFFF

Replacement Character

Invalid sequences are replaced with:

  • code point: 0xFFFD
  • UTF-8 encoding: 0xEF 0xBF 0xBD

UTF-8 Byte Structure

The following diagram illustrates how UTF-8 bytes are structured, including ASCII, continuation bytes, and multi-byte sequence headers:

image Source: Nic Barker — "UTF-8, Explained Simply" (YouTube)

Control Signals

Some decoded code points correspond to control signals instead of printable characters.

ASCII control range:

  • 0x00 0x1F

Examples:

  • 0x08 → Backspace
  • 0x09 → Tab
  • 0x0A → Line Feed
  • 0x0D → Carriage Return
  • 0x1B → Escape

These are typically interpreted by:

  • terminal logic
  • shell input handling
  • system interfaces

Non-ASCII Characters

Characters outside the ASCII range (0x00 0x7F) are encoded using multi-byte UTF-8 sequences.

Examples:

  • 'é' → 0xC3 0xA9
  • '€' → 0xE2 0x82 0xAC

Decoded values:

  • 'é' → U+00E9
  • '€' → U+20AC

Modifiers and Layout

Character output depends on:

  • keyboard layout
  • modifier keys (Shift, Ctrl, AltGr)

Example:

  • KEY_E → 'e'
  • KEY_E + SHIFT → 'E'
  • KEY_E + AltGr → '€'

Also worth watching

If you want to dive deeper or simply get a better intuitive understanding of UTF-8, the video below is highly recommended:

Nic Barker — "UTF-8, Explained Simply"