mirror of https://github.com/BoredDevNL/BoredOS.git synced 2026-05-15 10:48:38 +00:00

doc: Add UTF-8 byte structure section and resources (#10)

Added a section on UTF-8 byte structure with a diagram and a recommended video for further understanding.

2026-04-25 00:51:54 +02:00


UTF-8 Library — Application Development Guide

Overview

The userland libc provides a lightweight UTF-8 utility module located in:

src/userland/libc/utf-8.c
src/userland/libc/utf-8.h

This module is designed for direct use in applications that need UTF-8 handling. It provides basic primitives for decoding, encoding, and traversing UTF-8 strings; the decoder maps invalid sequences to the replacement character 0xFFFD.

It is intended for:

text rendering
terminal input/output
cursor movement
string processing at the character level

Synopsis

#include "utf-8.h"

uint32_t text_decode_utf8(const char *s, int *advance);
int text_encode_utf8(uint32_t cp, char *out);

const char* text_next_utf8(const char *s);
const char* text_prev_utf8(const char *start, const char *s);

int text_strlen_utf8(const char *s);

API Reference

text_decode_utf8

uint32_t text_decode_utf8(const char *s, int *advance);

Decodes a UTF-8 sequence into a Unicode code point.

s: pointer to the current position in a UTF-8 string
advance: receives the number of bytes consumed by the decoded sequence

Returns:

decoded Unicode code point (uint32_t)
0 if input is null or empty
0xFFFD for invalid sequences
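
The contract above can be illustrated with a minimal decoder sketch. This is not the code in utf-8.c: sketch_decode_utf8 is a hypothetical stand-in that only follows the documented return values. Treating overlong forms and surrogates as invalid, and consuming exactly one byte on an invalid sequence, are assumptions made for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for text_decode_utf8; not the code in utf-8.c. */
uint32_t sketch_decode_utf8(const char *s, int *advance)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp = 0, min = 0;
    int len = 0, i, ok;

    if (p == NULL || p[0] == 0) {            /* null or empty input */
        if (advance) *advance = 0;
        return 0;
    }

    /* The lead byte selects the sequence length and the payload bits. */
    if (p[0] < 0x80)                { len = 1; cp = p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; cp = p[0] & 0x07; min = 0x10000; }

    ok = (len != 0);                 /* len 0: stray continuation or bad lead byte */
    for (i = 1; ok && i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            ok = 0;                  /* not 10xxxxxx; also stops at the terminator */
        else
            cp = (cp << 6) | (p[i] & 0x3F);
    }

    /* Assumption: overlong forms, surrogates and values above 0x10FFFF are invalid. */
    if (!ok || cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        if (advance) *advance = 1;   /* assumption: consume one byte on error */
        return 0xFFFD;
    }
    if (advance) *advance = len;
    return cp;
}
```

Checking the minimum value for each length rejects overlong encodings such as 0xC0 0x80, which would otherwise decode to 0.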

text_encode_utf8

int text_encode_utf8(uint32_t cp, char *out);

Encodes a Unicode code point into UTF-8.

cp: Unicode code point to encode
out: buffer receiving the encoded bytes; it must have room for at least 4 bytes

Returns:

number of bytes written (1–4)
if cp is invalid, the replacement character (0xFFFD) is written instead
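
As a rough illustration of this behaviour, the following sketch encodes a code point by range. sketch_encode_utf8 is a hypothetical stand-in, not the function in utf-8.c, and its definition of "invalid" (surrogates and values above 0x10FFFF) is an assumption.

```c
#include <stdint.h>

/* Hypothetical stand-in for text_encode_utf8; not the code in utf-8.c. */
int sketch_encode_utf8(uint32_t cp, char *out)
{
    /* Assumption: surrogates and values above 0x10FFFF count as invalid. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;                              /* replacement character */

    if (cp < 0x80) {                              /* 0xxxxxxx */
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {                             /* 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                           /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));           /* 11110xxx + 3 continuation bytes */
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}
```

The sketch does not write a terminator, so a caller that wants a C string needs one extra byte and must add the '\0' itself.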

text_next_utf8

const char* text_next_utf8(const char *s);

Advances to the next UTF-8 character.

Returns a pointer to the next character boundary.
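
One common way to find the next boundary is to step over the lead byte and then skip any continuation bytes (10xxxxxx). The sketch below shows that idea; sketch_next_utf8 is a hypothetical name, and staying in place at the terminator is an assumption rather than documented behaviour.

```c
/* Hypothetical stand-in for text_next_utf8; not the code in utf-8.c. */
const char *sketch_next_utf8(const char *s)
{
    if (*s == '\0')
        return s;                        /* assumption: do not move past the terminator */
    s++;                                 /* step over the lead byte */
    while (((unsigned char)*s & 0xC0) == 0x80)
        s++;                             /* skip continuation bytes 10xxxxxx */
    return s;
}
```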

text_prev_utf8

const char* text_prev_utf8(const char *start, const char *s);

Moves backward to the previous UTF-8 character.

start: beginning of the buffer
s: current position

Used for reverse traversal and cursor movement.
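
Reverse traversal can be sketched the same way: step back one byte, then keep stepping back while the byte is a continuation byte, never going before start. sketch_prev_utf8 is a hypothetical stand-in, and returning start when the cursor is already at the beginning is an assumption.

```c
/* Hypothetical stand-in for text_prev_utf8; not the code in utf-8.c. */
const char *sketch_prev_utf8(const char *start, const char *s)
{
    if (s <= start)
        return start;                    /* assumption: clamp at the buffer start */
    s--;                                 /* last byte of the previous character */
    while (s > start && ((unsigned char)*s & 0xC0) == 0x80)
        s--;                             /* walk back to its lead byte */
    return s;
}
```

The start parameter is what makes this safe: without it, a buffer that begins with stray continuation bytes would be read out of bounds.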

text_strlen_utf8

int text_strlen_utf8(const char *s);

Counts UTF-8 characters (code points), not bytes.
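
For well-formed input, counting code points is equivalent to counting every byte that is not a continuation byte. The sketch below uses that shortcut; sketch_strlen_utf8 is a hypothetical name, and how the real function counts invalid sequences is not specified here.

```c
/* Hypothetical stand-in for text_strlen_utf8; assumes well-formed UTF-8. */
int sketch_strlen_utf8(const char *s)
{
    int count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;                     /* ASCII or lead byte starts a new character */
    }
    return count;
}
```

For example, a string holding one 2-byte character among four ASCII letters occupies 6 bytes but counts as 5 characters.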

Usage Examples

Iterating over UTF-8 characters

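const char *text = "h\xC3\xA9llo";  /* any NUL-terminated UTF-8 string */
const char *p = text;

while (*p) {
    int adv = 0;
    uint32_t cp = text_decode_utf8(p, &adv);

    /* process cp; invalid input decodes to 0xFFFD */

    if (adv <= 0)
        break;  /* defensive: never loop without making progress */
    p += adv;
}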

Cursor movement

cursor = text_next_utf8(cursor);
cursor = text_prev_utf8(buffer_start, cursor);

Encoding a character

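char out[5];                              /* up to 4 bytes plus a terminator */
int len = text_encode_utf8(0x20AC, out);  /* 0x20AC is the euro sign */
out[len] = '\0';                          /* terminate before using out as a C string */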

Backspace handling

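/* Step back one whole character, not one byte. */
char *prev = (char*)text_prev_utf8(buffer, cursor);
/* The bytes in [prev, cursor) belong to the character being erased. */
cursor = prev;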
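
Moving the cursor is only half of a backspace; the erased bytes usually have to be removed from the buffer as well. The sketch below shows one way to do that with memmove on a NUL-terminated buffer. Both sketch_prev and sketch_backspace are hypothetical helpers; sketch_prev is a local stand-in for text_prev_utf8 so the example compiles on its own.

```c
#include <string.h>

/* Local stand-in for text_prev_utf8 (hypothetical; not the code in utf-8.c). */
static const char *sketch_prev(const char *start, const char *s)
{
    if (s <= start)
        return start;
    s--;
    while (s > start && ((unsigned char)*s & 0xC0) == 0x80)
        s--;
    return s;
}

/* Deletes the character before cursor and returns the new cursor position. */
char *sketch_backspace(char *buffer, char *cursor)
{
    char *prev = (char *)sketch_prev(buffer, cursor);

    /* Close the gap: shift the tail, including the terminator, down to prev. */
    memmove(prev, cursor, strlen(cursor) + 1);
    return prev;
}
```

memmove is used instead of memcpy because the source and destination ranges overlap.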

Implementation Notes

UTF-8 Encoding

The implementation supports all four UTF-8 sequence lengths. The length is determined by the code point value:

1 byte: 0x00 – 0x7F
2 bytes: 0x80 – 0x7FF
3 bytes: 0x800 – 0xFFFF
4 bytes: 0x10000 – 0x10FFFF
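
The ranges above translate directly into a length lookup. sketch_utf8_len is a hypothetical helper, not part of utf-8.h; returning 0 for values above 0x10FFFF is a choice made for this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: bytes needed to encode cp, or 0 if cp is out of range. */
int sketch_utf8_len(uint32_t cp)
{
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;                            /* beyond the Unicode range */
}
```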

Replacement Character

Invalid sequences are replaced with:

code point: 0xFFFD
UTF-8 encoding: 0xEF 0xBF 0xBD

UTF-8 Byte Structure

The following diagram illustrates how UTF-8 bytes are structured, including ASCII, continuation bytes, and multi-byte sequence headers:

[Figure: UTF-8 byte structure diagram. Source: Nic Barker, "UTF-8, Explained Simply" (YouTube)]
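
The byte types mentioned above are distinguished by their leading bits, which can be sketched as a small classifier. The enum and function names are hypothetical and do not appear in utf-8.h.

```c
/* Hypothetical classification of a single UTF-8 byte by its leading bits. */
enum sketch_byte_kind {
    SK_ASCII,     /* 0xxxxxxx: single-byte character */
    SK_CONT,      /* 10xxxxxx: continuation byte */
    SK_LEAD2,     /* 110xxxxx: header of a 2-byte sequence */
    SK_LEAD3,     /* 1110xxxx: header of a 3-byte sequence */
    SK_LEAD4,     /* 11110xxx: header of a 4-byte sequence */
    SK_INVALID    /* 11111xxx: never valid in UTF-8 */
};

enum sketch_byte_kind sketch_classify(unsigned char b)
{
    if ((b & 0x80) == 0x00) return SK_ASCII;
    if ((b & 0xC0) == 0x80) return SK_CONT;
    if ((b & 0xE0) == 0xC0) return SK_LEAD2;
    if ((b & 0xF0) == 0xE0) return SK_LEAD3;
    if ((b & 0xF8) == 0xF0) return SK_LEAD4;
    return SK_INVALID;
}
```

Because continuation bytes have their own bit pattern, a reader that lands in the middle of a sequence can always tell and resynchronize at the next header or ASCII byte.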

Control Signals

Some decoded code points are control characters rather than printable characters.

ASCII control range:

0x00 – 0x1F (0x7F, Delete, is also a control character)

Examples:

0x08 → Backspace
0x09 → Tab
0x0A → Line Feed
0x0D → Carriage Return
0x1B → Escape

These are typically interpreted by:

terminal logic
shell input handling
system interfaces
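
A caller that receives decoded code points might separate control codes from printable ones before rendering. The following sketch dispatches on the examples listed above; the enum, its values, and sketch_classify_cp are hypothetical and only illustrate the idea.

```c
#include <stdint.h>

/* Hypothetical actions a terminal or shell might take for a code point. */
enum sketch_action {
    ACT_PRINT, ACT_BACKSPACE, ACT_TAB, ACT_LINE_FEED,
    ACT_CARRIAGE_RETURN, ACT_ESCAPE, ACT_OTHER_CONTROL
};

enum sketch_action sketch_classify_cp(uint32_t cp)
{
    switch (cp) {
    case 0x08: return ACT_BACKSPACE;
    case 0x09: return ACT_TAB;
    case 0x0A: return ACT_LINE_FEED;
    case 0x0D: return ACT_CARRIAGE_RETURN;
    case 0x1B: return ACT_ESCAPE;
    default:   break;
    }
    /* Remaining values in 0x00 - 0x1F are controls without a dedicated action here. */
    return cp <= 0x1F ? ACT_OTHER_CONTROL : ACT_PRINT;
}
```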

Non-ASCII Characters

Characters outside the ASCII range (0x00 – 0x7F) are encoded as multi-byte UTF-8 sequences of 2 to 4 bytes.