# SIQ — SIS Identifier and Qualifier in: [Protocols and Tech Stuff](protocols) ## Why SIQ? I needed unique, compact, decentralized, reproducible and sortable identifiers for my applications. Something I could reliably use as database key, as long as being fit for my purposes, in the context of a larger project, [a federated protocol](/protocols/sis.html). ## Why not… * **Serial numbers**? They are relative. If they needed to be absolute, they would have to be issued by a single central authority for everyone else. Unacceptable for a decentralized protocol. * **Username-domain identifiers**? Despite them being in use in other decentralized protocols (such as ActivityPub and Matrix), they are immutable and bound to a single domain. It means, the system sees different domains or usernames as different users. Users can't change their username after registration, therefore forcing them to carry an unpleasant or cringe handle for the rest of their life. * **UUID**'s? UUIDs are unreliable. Most services use UUIDv4's, which are just opaque sequences of random bytes, and definitely not optimal as database keys. Other versions exist (such as the timestamp-based [UUIDv7](https://uuidv7.org)), however they still miss something needed for cross-domain uniqueness. In any case, UUIDs need to waste some bits to specify their "protocol". * **Snowflake**s? Snowflakes would be a good choice, and are the inspiration for SIQ themselves. However, 64 bits are not enough for our use case, and Snowflake is *already making the necessary sacrifices* to ensure everything fits into 64 bits (i.e. the epoch got significantly moved forward). * **Content hashes**? They are based on content, therefore they require content to be immutable and undeletable. Also: collisions. * **PLC**'s (i.e. the ones in use at BlueSky)? [The implementation is cryptic](https://github.com/did-method-plc/did-method-plc). Moreover, it requires a central authority, and BlueSky is, as of now, holding the role of the sole authority. The resulting identifier as well is apparently random, therefore unorderable. * **ULID**'s? They are just UUIDv4's with a timestamp. Sortable? Yes. Predictable? No, random bits rely on the assumption of being generated on a single host — i.e. centralization. Think of them as yet another attempt to UUIDv7's. ## Anatomy of a SIQ SIQ's are **112 bit** binary strings. Why 112? Why not 128? Idk, felt like it. Maybe to save space. Maybe because I could fit it into UUID some day — UUID already reserves some bits for the protocol. Those 112 bits split up into: * 56 bits of **timestamp**; * 8 bits of process (“**shard**”) information; * 32 bits of **domain** hash; * 16 bits of **serial** and **qualifier**. Here is a graph of a typical SIQ layout: ``` 0: tttttttt tttttttt tttttttt tttttttt tttttttt 40: uuuuuuuu uuuuuuuu ssssssss dddddddd dddddddd 80: dddddddd dddddddd nnnnnnnn nnqqqqqq where: t : timestamp -- seconds u : timestamp -- fraction seconds s : shard d : domain hash n : progressive q : qualifier (variable width, in fact) ``` ## Timestamp SIQ uses 56 bits for storing timestamp: - **40 bits** for **seconds**; - **16 bits** for **fraction seconds**. There is no need to explain [why I need no less than 40 bits for seconds](https://en.wikipedia.org/wiki/Year_2038_problem). Most standards — including Snowflake and ULID — store timestamp in *milliseconds*. It means the system needs to make a division by 1000 to retrieve second value. But 1000 is almost 1024, right? So the last ten bits can safely be ignored and we easily obtain a UNIX timestamp by doing a right shi-  wait. It's more comfortable to assume that 1024 is nearly 1000. *Melius abundare quam deficere*. And injective mapping is there. But rounding? Truncation? Here comes the purpose of the 6 additional trailing bits: precision control. Bits from dividing milliseconds o'clock are different from those from rounding microseconds. Yes, most systems can't go beyond milliseconds for accuracy — standard Java is like that. But detecting platform accuracy is beyond my scope. There are other factors to ensure uniqueness: *domain* and *shard* bits. ## Domain, shard The temporal uniqueness is ensured by timestamp. However, in a distributed, federated system there is the chance for the same ID to get generated twice by two different subjects. Therefore, *spacial* uniqueness must be enforced in some way. Since SIQ's are going to be used the most in web applications, a way to differentiate *spacially* different applications is via the **domain name**. I decided to reserve **32 bits** for the domain hash. The algorithm of choice is **SHA-256** for its well-known diffusion and collision resistance. However, 256 bits are too much to fit into a SIQ! So, the last 4 bytes are taken. ... Development and testing environments may safely set all the domain bits to 0. ## Qualifiers The last 16 bits are special, in a way that makes those identifiers unique, and you can tell what is what just by looking at them. Inspired by programming language implementations, such as OCaml and early JavaScript, a distinguishing bit affix differentiates among types of heterogeneous entities: * terminal entities (leaves) end in `1`. This includes content blobs, array elements, and relationships; * non-leaves end in `0`. The full assigment scheme (managed by me) looks like this: | Suffix | Usage | |----------|-------| | `x00000` | user account | | `x10000` | application (e.g. API, client, bot, form) | | `x01000` | event, task | | `x11000` | product, subscription | | `x00100` | user group, membership, role | | `x10100` | collection, feed | | `x01100` | invite | | `x11100` | *unassigned* | | `x00010` | tag, category | | `x10010` | *unassigned* | | `x01010` | channel (guild, live chat, forum, wiki~) | | `x11010` | *unassigned* | | `xx0110` | thread, page | | `xx1110` | message, post, revision | | `xxx001` | 3+ fk relationship | | `xxx101` | many-to-many, hash array element | | `xxx011` | array element (one to many) | | `xxx111` | content | ... The leftover bits are used as progressive serials, incremented as generation continues, and usually reset when timestamp is incremented. Like with snowflakes and ULID's, if you happen to run out with serials, you need to wait till timestamp changes. Usually around 15 microseconds. ## Storage It is advised to store in databases as *16 byte binary strings*. - In MySQL/MariaDB, it's `VARBINARY(16)`. The two extra bytes are to ease alignment, and possible expansion of timestamp range — even though it would not be an issue until some years after 10,000 CE. It is possible to fit them into UUID's (specifically, UUIDv8's — custom ones), taking advantage from databases and libraries implementing a UUID type — e.g. PostgreSQL. Unfortunately, nobody wants to deal with storing arbitrarily long integers — lots of issues pop up by going beyond 64. 128 bit integers are not natively supported in most places. Let alone 112 bit ones. ## Format The recommended format for transmission is a brand new DID scheme — namely, `did:siq:`. SIQ's formatted this way are prefixed with a binary checksum — i.e. the Python CRC-32 implementation. The control digits are prefixed, instead of suffixed, to give an illusion of randomness. It also helps getting people more responsibility about typing by hand. The blob — checksum first, SIQ following — is eventually encoded in lowercase Base32, the one used in URL schemes. (Technically, SIQ's are URN's, since they don't *locate* a resource.) ... ## Implementation This is a sample implementation (the code snippet below is [released "as is" into public domain](https://unlicense.org)) in Python of a SIQ generator — Python 3.10+ required. SIQ's are generated as integers — Python allows them to be arbitrarily long. ```python from __future__ import annotations import enum import base64, binascii from functools import cached_property import hashlib import time import os from typing import Iterable, Callable class SiqType(enum.Enum): # - leaves - CONTENT = 15 # xxx111 MULTI = 11 # xxx011 MANYTOMANY = 13 # xxx101 TERNARY = 9 # xxx001 # - non-leaves, may have children TAG = 34 # x00010 CHANNEL = 42 # x01010 #??? = 50 # x10010 #??? = 58 # x11010 THREAD = 22 # xx0110 MESSAGE = 30 # xx1110 # - non-leaves, may NOT have children USER = 32 # x00000 GROUP = 36 # x00100 EVENT = 40 # x01000 INVITE = 44 # x01100 APPLICATION = 48 # x10000 COLLECTION = 52 # x10100 PREMIUM = 56 # x11000 #??? = 60 # x11100 # - helpers - value: int @cached_property def n_bits(self) -> int: val, lsh = self.value, 0 while val > 1: val, lsh = val >> 1, lsh + 1 return lsh def prepend(self, other: int) -> int: return (other << self.n_bits) | (self.value % (1 << self.n_bits)) def make_domain_hash(domain: str, local_id: int | None = None) -> int: """ Compute a domain hash for SIQ. If local_id ("datacenter/machine ID") is specified and nonzero, it replaces the leading byte with the passed value. Note: local_id must be nonzero and it may replace only exactly 8 bits. To get all of them zeroed out, pass a local_id of 256. """ if not domain or domain == '0': return 0 h = int.from_bytes(hashlib.sha256(domain.encode('ascii')).digest()[-4:], 'big') # a local_id of 256 is required to get 0. if local_id: h = ((local_id % 256) << 24) | (h % (1 << 24)) return h class SiqGen: __slots__ = ('domain_hash', 'last_gen_ts', 'counters', 'shard_id', '__weakref__') domain_hash: int last_gen_ts: int shard_id: int counters: dict[SiqType, int] def __init__(self, domain: str, last_siq: int = 0, local_id: int | None = None, shard_id: int | None = None): self.domain_hash = make_domain_hash(domain, local_id) self.last_gen_ts = min(last_siq >> 56, self.cur_timestamp()) self.counters = dict() self.shard_id = (shard_id or os.getpid()) % 256 def cur_timestamp(self) -> int: return int(time.time() * (1 << 16)) def generate(self, /, typ: SiqType, n: int = 1) -> Iterable[int]: """ Generate one or more SIQ's. The generated ids are returned as integers. Bulk generation is supported. Returns as an iterator, to allow generation “on the fly”. To get a scalar or a list, use .generate_one() or next(), or .generate_list() or list(.generate()), respectively. Warning: the function may block. """ now = self.cur_timestamp() if now < self.last_gen_ts: time.sleep((self.last_gen_ts - now) / (1 << 16)) elif now > self.last_gen_ts: self.counters[typ] = 0 while n: idseq = typ.prepend(self.counters[typ]) if idseq >= (1 << 16): while (now := self.cur_timestamp()) <= self.last_gen_ts: time.sleep(1 / (1 << 16)) with Lock(): self.counters[typ] %= 1 << (16 - typ.n_bits) # XXX the lock is here "just in case", MULTITHREADED GENERATION IS NOT ADVISED! with Lock(): siq = ( (now << 56) | ((self.shard_id % 256) << 48) | ((self.domain_hash % (1 << 32)) << 16) | (idseq % (1 << 16)) ) n -= 1 self.counters[typ] += 1 yield siq def generate_one(self, /, typ: SiqType) -> int: """ Generate one SIQ, and return it as an integer. """ return next(self.generate(typ, 1)) def generate_list(self, /, typ: SiqType, n: int = 1) -> list[int]: """ Syntactic sugar for list(.generate()). Return the generated SIQ's as a list. """ return list(self.generate(typ, n)) ``` © 2025 Sakuragasaki46 - __ask permission before use__