~iany/ Math

Power of Monoid, Beauty of Simplicity

me@iany.me (Ian Yang) — Fri, 20 Feb 2026 00:00:00 +0800

A monoid is one of the smallest useful abstractions in algebra: a set closed under an associative binary operation, with an identity element. That simplicity is exactly why it shows up everywhere—from summing numbers and concatenating strings to powering divide-and-conquer algorithms and elegant data structures like finger trees. This post walks through what monoids are, why they give you “compute power” for free when you can phrase a problem in terms of them, and how to think about choosing the right monoid and predicate when you do.

What is a monoid?

A monoid is a set $S$ equipped with a binary operator $\bullet$ and an identity element $e$ (Wikipedia).

The operator is closed on $S$ . For all $a, b \in S$ , the result $a \bullet b$ is also in $S$ .
The operator is associative: For all $a,b,c \in S$ , $(a \bullet b) \bullet c = a \bullet (b \bullet c)$
The identity element $e$ satisfies $e \bullet a = a \bullet e = a$ for all $a \in S$ .

For example, the integers with addition (+) form a monoid whose identity element is 0. The integers with multiplication (×) also form a monoid, with identity element 1.

The set of finite lists with the operator concatenation is a monoid since:

The operator is closed because concatenation of two finite lists is also a finite list.
The operator is associative, because both $(a \bullet b) \bullet c$ and $a \bullet (b \bullet c)$ result in a new list by placing elements of $a, b, c$ consecutively.
The identity element is the empty list.

The integer numbers with the operator max is a counterexample. The operator is closed and associative, but there is no identity element. Given any integer $e$ , there’s always a smaller integer $a$ such that $e \bullet a = e \ne a$ . However, max on the integer set with a lower bound is a monoid, such as the non-negative integers where the identity element is the lower bound 0.

Divide and conquer: why associativity matters

At first glance, associativity may seem too trivial to be useful in programming. However, associativity is what enables powerful divide-and-conquer strategies, where problems can be split into parts, solved independently, and then safely recombined.

Exponentiation by Squaring

Let’s begin with a simple application: repeatedly applying the binary operator to the same element.

 $\underbrace{a \bullet a \bullet \cdots \bullet a}_{a \text{ appears } n \text{ times}}$

Instead of applying the binary operator $n-1$ times sequentially, we exploit associativity to group every two instances of $a$ together recursively. This gives us a smaller problem when n is even:

 $\underbrace{(a \bullet a) \bullet \cdots \bullet (a \bullet a)}_{(a \bullet a) \text{ appears } \frac{n}{2} \text{ times}}$

When n is odd:

 $a \bullet \underbrace{(a \bullet a) \bullet \cdots \bullet (a \bullet a)}_{(a \bullet a) \text{ appears } \frac{n-1}{2} \text{ times}}$

We only need to compute $a \bullet a$ once to reduce the problem of size $n$ to one of size $\lfloor n/2 \rfloor$. Repeating the process on $(a \bullet a)$ gives the Exponentiation by Squaring algorithm, which requires at most $\displaystyle 2 \lfloor \log _{2}n\rfloor$ applications of the operator. That bound is below the $n-1$ applications of the sequential approach for every $n$ greater than 5, and the gap grows quickly from there.

For integers or real numbers under multiplication (×), exponentiation by squaring is an efficient algorithm to compute positive integer powers. Since the set of elliptic curve points under point addition forms a monoid, this same method can also be used to compute Elliptic Curve Scalar Multiplication.
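The grouping argument above can be sketched for an arbitrary monoid. This is a minimal illustration, not a library routine: the name `mpower` and its parameters (`op` for the operator, `identity` for $e$) are made up for this example, and it processes the bits of $n$ from the lowest upward, which relies only on associativity.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def mpower(op: Callable[[T, T], T], identity: T, a: T, n: int) -> T:
    """Combine n copies of `a` with the associative operator `op`."""
    if n < 0:
        raise ValueError("n must be non-negative")
    result = identity   # summary of the copies folded in so far
    square = a          # a combined with itself 1, 2, 4, 8, ... times
    while n > 0:
        if n & 1:       # odd: peel off one copy of the current square
            result = op(result, square)
        n >>= 1
        if n:           # skip the squaring that would never be used
            square = op(square, square)
    return result


power_of_three = mpower(lambda x, y: x * y, 1, 3, 13)      # integers under multiplication
repeated = mpower(lambda x, y: x + y, "", "ab", 3)        # strings under concatenation
```

Because the function only asks for an operator and an identity, the same code serves numbers, strings, matrices, or elliptic curve points, provided the operator passed in is associative.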

General Divide-and-Conquer Search Algorithm

We can generalize the divide-and-conquer method to search for an element in a sequence, relying solely on associativity.

The result of applying the monoid operator to a sequence from left to right serves as a summary of that sequence. If a predicate can determine whether a given element is in the sequence based solely on this summary, we can devise a general divide-and-conquer search algorithm.

Let’s say we want to search for a target element in the sequence $t_1, \ldots, t_n$ , where each $t_i$ belongs to a monoid $(S, \bullet, e)$ . Here, $S$ is the underlying set, $\bullet$ is the binary operation, and $e$ is the identity element.

We don’t yet know which kinds of predicate work for the search algorithm. As a first guess, let the predicate $p$ be a function of the monoid “summary” such that $p(t_1 \bullet \cdots \bullet t_n)$ is true if and only if the target element $t$ is in the sequence $t_1, \dots, t_n$.

Assume that $p(t_1 \bullet \cdots \bullet t_n)$ is true, thus the target element $t$ is in the sequence. We divide the sequence into two nonempty halves: $t_1,\ldots,t_k$ and $t_{k+1},\ldots,t_n$, where $1 \le k < n$. We then evaluate $p(t_1 \bullet \cdots \bullet t_k)$ to determine whether the target element lies in the first or second half, and continue the search.

Based on this observation, we can deduce the following property of the predicate: there exists an index $x$ such that

 $p(t_1 \bullet \cdots \bullet t_k) := \begin{cases} \text{false} & \text{if } k < x, \\ \text{true} & \text{if } k \ge x. \end{cases}$

$t_x$ is the target element if such $x$ exists; otherwise, the target element does not exist in the sequence.

Intuitively, the target element is the turning point at which the predicate on the running summary changes from false to true.

Note that $p$ makes sense only on the summary of any prefix of the sequence. If we need to continue the search in the second half, we must remember the summary of the scanned prefix.

Now we can define the search algorithm $\mathrm{Search}(p, s, \{t_i,\ldots,t_j\})$ where

$s$ is the summary of scanned prefix $t_1 \bullet \cdots \bullet t_{i-1}$ when $i > 1$ or the identity element $e$ otherwise.
$t_i, \ldots, t_j$ is the sub-range to search next.
$p$ is the predicate as defined above

The algorithm proceeds as follows:

If $p(s \bullet t_i \bullet \cdots \bullet t_j)$ is false, the target element does not exist. The algorithm aborts with an error.
Otherwise, if there’s only one element ($i = j$), $t_i$ is the target element. The algorithm returns it as the result.
Otherwise, choose a pivot index $m$ with $i \le m \lt j$ to split the sequence into two nonempty halves: $t_i, \ldots, t_m$ and $t_{m+1},\ldots,t_j$. Test $p(s \bullet t_i \bullet \cdots \bullet t_m)$:
- If it is true, continue the search in the first half: $\mathrm{Search}(p, s, \{t_i,\ldots,t_m\})$
- Otherwise, continue the search in the second half: $\mathrm{Search}(p, s \bullet (t_i \bullet \cdots \bullet t_m),\{t_{m+1},\ldots,t_j\})$

The algorithm starts with $\mathrm{Search}(p, e, \{t_1, \ldots, t_n\})$.
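The steps above can be sketched directly over a Python list. This is only an illustration under a few assumptions: indices are 0-based, the function and parameter names are invented here, and range summaries are recomputed by folding, so each step costs linear time. Precomputed summaries are what make the search fast in practice.

```python
from functools import reduce


def search(p, op, e, ts):
    """Return the index where p on the running summary first becomes true."""

    def summary(s, lo, hi):
        # s . t_lo . ... . t_(hi-1), folded from left to right
        return reduce(op, ts[lo:hi], s)

    def go(s, lo, hi):
        # Searches ts[lo:hi]; s is the summary of the scanned prefix ts[:lo].
        if lo >= hi or not p(summary(s, lo, hi)):
            raise LookupError("target not found")
        if hi - lo == 1:
            return lo
        mid = (lo + hi) // 2          # lo < mid < hi, so both halves are nonempty
        left = summary(s, lo, mid)
        if p(left):
            return go(s, lo, mid)     # the turning point is in the first half
        return go(left, mid, hi)      # remember the scanned prefix and go right

    return go(e, 0, len(ts))


# Sum monoid over a sequence of 1s: find the element at (0-based) position 3.
position = search(lambda s: s > 3, lambda x, y: x + y, 0, [1] * 10)
```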

Application: Random-Access Sequence

An application of the search algorithm is accessing the nth element in the sequence.

We initialize the sequence to all 1s and use the monoid of non-negative integers with addition $(\mathbb{N},+,0)$ :

 $\underbrace{1, \ldots, 1}_{n \text{ times}}$

The predicate to find the i-th (starting from 0) element is:

 $p_i(s) := s > i$

It may seem silly to search for the i-th 1 in a sequence of 1s, but we can store any data in the sequence and attach the monoid values as annotations to guide the search algorithm.

Application: Max-Priority Queue

Another application is finding the element with the max priority.

We use the monoid of non-negative integers with operator max $(\mathbb{N},\mathrm{max},0)$ and assume that the maximum value has the maximum priority.

The predicate to find the element with the max priority is

 $p(s) := s = m$

where $m$ is the monoid summary of the entire sequence, that is, the maximum value in the sequence. The predicate checks whether the summary equals $m$, so the search stops at the first element that attains the maximum.

Annotated Search Tree

A natural way to support the divide-and-conquer search is an annotated binary tree. Store the sequence elements at the leaves, and at each node store the monoid summary of the subtree—e.g. the sum of lengths or the maximum priority in that subtree. The predicate can then be evaluated on the left subtree’s annotation to decide whether to descend left or right, and the prefix summary is updated when going right by combining it with the left subtree’s summary.

A plain binary tree can degenerate to a list in the worst case, so operations may become linear. A more advanced structure, the finger tree¹, keeps the tree balanced and supports efficient access at both ends and in the middle; each node carries a monoidal “measure” of its subtree, and the same search strategy applies. In Haskell, Data.Sequence from the containers library implements sequences as finger trees with size (length) as the measure, giving $O(\log n)$ indexing, splitting, and concatenation.
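To make the annotated tree concrete, here is a minimal sketch of a static, balanced binary tree with the sequence at the leaves and a cached summary at every node. It is not a finger tree and supports no updates; `Node`, `build`, `search`, and `measure_of` are names chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Node:
    measure: Any                      # monoid summary of the whole subtree
    value: Any = None                 # payload, set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(items, measure_of, op):
    """Build a balanced tree over a nonempty list; leaves hold the items."""
    if len(items) == 1:
        return Node(measure_of(items[0]), value=items[0])
    mid = len(items) // 2
    l = build(items[:mid], measure_of, op)
    r = build(items[mid:], measure_of, op)
    return Node(op(l.measure, r.measure), left=l, right=r)


def search(tree, p, op, e):
    """Return the leaf value where p on the running summary first becomes true."""
    if not p(op(e, tree.measure)):
        raise LookupError("target not found")
    s, node = e, tree
    while node.left is not None:      # internal nodes always have two children
        with_left = op(s, node.left.measure)
        if p(with_left):
            node = node.left          # turning point is inside the left subtree
        else:
            s, node = with_left, node.right
    return node.value


def add(x, y):
    return x + y


# Random access: annotate every element with 1 and use (N, +, 0).
by_size = build(["a", "b", "c", "d", "e"], lambda _: 1, add)
fourth = search(by_size, lambda s: s > 3, add, 0)

# Max-priority queue: annotate with the priority and use (N, max, 0).
by_prio = build([("low", 2), ("urgent", 9), ("mid", 5)], lambda j: j[1], max)
top = search(by_prio, lambda s: s == by_prio.measure, max, 0)
```

Here `fourth` is `"d"` and `top` is `("urgent", 9)`. Each search visits one node per level and reads only cached summaries, so it takes time proportional to the height of the tree.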

Utility of the Identity Element

The general divide-and-conquer algorithm does not require a monoid—only a semigroup. A semigroup is a fancy word for a set equipped with a closed, associative binary operator but lacking an identity element. The presence of an identity element makes monoids convenient to work with.

The identity element serves as a natural default value or starting point for algorithms. For instance, in the search algorithm, the summary of the scanned prefix is initialized to $e$ . Without an identity element, we need an additional flag to indicate whether any prefix has been scanned, and the algorithm would have to branch conditionally based on that flag.
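A small sketch of the difference, with function names invented for this illustration: with a monoid the fold starts from $e$, while with a mere semigroup the "nothing scanned yet" state needs its own marker and an extra branch.

```python
from functools import reduce
from typing import Iterable, Optional


def summarize_monoid(xs: Iterable[int]) -> int:
    # (N, +, 0): the identity is the summary of the empty prefix.
    return reduce(lambda a, b: a + b, xs, 0)


def summarize_semigroup(xs: Iterable[int]) -> Optional[int]:
    # max over all integers has no identity, so None marks
    # "no prefix scanned yet" and every step has to branch on it.
    s = None
    for x in xs:
        s = x if s is None else max(s, x)
    return s
```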

The art of choosing monoid and predicate

In the random-access example we used $(\mathbb{N}, +, 0)$ and annotated each position with $1$ —the summary of a segment is its length, and the predicate $s > i$ tells us whether the $i$ -th element lies in the prefix we have so far. In the max-priority queue we used $(\mathbb{N}, \max, 0)$ (or a bounded variant): the summary is the maximum value in the segment, and the predicate $s = m$ identifies the segment that contains the global maximum. In both cases, the monoid was chosen so that the combined summary over a range is exactly what the predicate needs to decide where to go next.

The flip side is that finding both the right monoid and the right predicate can be tricky. At each step the search has access only to the monoid summary of the prefix (or segment) seen so far, so the predicate must be decided from that summary alone. The monoid must be rich enough to supply the information the predicate needs. Sometimes the natural summary (e.g. sum or max) suggests the predicate (e.g. $s > i$ or $s = m$ ). Sometimes you must try a different carrier or operation, or encode extra information into the monoid (e.g. pairs or custom types), so that the predicate can be expressed. There is no universal recipe—it is a matter of design and experimentation. Reframe the problem as: “What do I need to know about a segment to decide the next step?” Then choose a monoid that can represent that knowledge and a predicate that uses it.

Hinze, R., & Paterson, R. (2006). Finger trees: A simple general-purpose data structure. Journal of Functional Programming, 16(2), 197–217. Cambridge University Press. https://www.cs.ox.ac.uk/ralf.hinze/publications/FingerTrees.pdf

Study on Quotient Spaces

me@iany.me (Ian Yang) — Tue, 18 Nov 2025 21:23:25 +0800

I’m reading Linear Algebra Done Right by Axler and found the section on quotient spaces difficult to understand, so I researched and took these notes.

Definitions

3.95 notation: $v + U$

Suppose $v \in V$ and $U \subseteq V$ . Then $v + U$ is the subset of $V$ defined by

 $v + U = \{v + u : u \in U\}.$

The set $v + U$ is also called a translate. Note that a translate is a set.

3.97 definition: translate

Suppose $v \in V$ and $U \subseteq V$ , the set $v + U$ is said to be a translate of $U$ .

Quotient space is a set of all translates (set of sets):

3.99 definition: quotient space, $V/U$

Suppose $U$ is a subspace of $V$ . Then the quotient space $V/U$ is the set of all translates of $U$ . Thus

 $V/U = \{v + U : v \in V\}.$

A quotient space is a set of sets. Different vectors can name the same member: for some $v_1 \neq v_2$ in $V$, $v_1 + U$ and $v_2 + U$ are the identical set.

A quotient space $V/U$ is formed by “collapsing” a subspace $U$ to zero within a larger vector space $V$. This construction is based on an equivalence relation where two vectors $x, y \in V$ are considered equivalent if their difference lies in $U$, that is, $x \sim y$ if and only if $x - y \in U$ (Wikipedia).

Lemmas

3.101 two translates of a subspace are equal or disjoint

Suppose $U$ is a subspace of $V$ and $v, w \in V$ . Then

 $v - w \in U \iff v + U = w + U \iff (v + U) \cap (w + U) \neq \emptyset$

If two translates are not disjoint (their intersection is not empty), they must be equal. So they are equal or disjoint.

All distinct translates of a subspace are disjoint. Given any $v \in V$ , it belongs to only one translate.

Since the quotient space $V/U$ is the set of translates of a subspace, it partitions $V$ into disjoint subsets. Using the definition of the quotient map

3.104 definition: quotient map, $\pi$

Suppose $U$ is a subspace of $V$ . The quotient map $\pi : V \to V/U$ is the linear map defined by

 $\pi(v) = v + U$

for each $v \in V$ .

We can write that

 $\pi(v_1) = \pi(v_2) \iff v_1 - v_2 \in U$

The quotient map has two essential properties:

The null space of $\pi$ is exactly the subspace $U$ , because $v+U=0+U \iff v-0 \in U \iff v \in U$
The range of $\pi$ is the entire quotient space $V/U$

Quotient Space Is a Vector Space

First define the addition and scalar multiplication operations:

3.102 definition: addition and scalar multiplication on $V/U$

Suppose $U$ is a subspace of $V$ . Then addition and scalar multiplication are defined on $V/U$ by

 $\begin{align*} (v + U) + (w + U) &= (v + w) + U \\ \lambda(v + U) &= (\lambda v) + U \end{align*}$

for all $v, w \in V$ and $\lambda \in \mathbf{F}$ .

$v+U$ is not the unique way to represent a member of $V/U$, because there may exist $v'\ne v$ such that $v + U = v' + U$. The operations make sense only when the choice of $v$ to represent a translate makes no difference.

Specifically, suppose $v_1, v_2, w_1, w_2 \in V$ such that

 $v_1 + U = v_2 + U \quad\textrm{and}\quad w_1 + U = w_2 + U$

From the addition definition:

 $\begin{align*} (v_1+U) + (w_1+U) &= (v_1 + w_1) + U \\ (v_2+U) + (w_2+U) &= (v_2 + w_2) + U \end{align*}$

The left-hand sides of the two equations are different representations of the same sum, so we must show that the right-hand sides are equal: $(v_1 + w_1)+U=(v_2+w_2)+U$.

This applies to scalar multiplication as well:

 $\begin{align*} \lambda(v_1 + U) &= (\lambda v_1) + U \\ \lambda(v_2 + U) &= (\lambda v_2) + U \end{align*}$

We must show that $(\lambda v_1) + U = (\lambda v_2) + U$ .
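Both claims follow from 3.101 together with the fact that $U$ is closed under addition and scalar multiplication. A short sketch of the standard argument: the hypotheses give $v_1 - v_2 \in U$ and $w_1 - w_2 \in U$, so

 $\begin{align*} (v_1 + w_1) - (v_2 + w_2) &= (v_1 - v_2) + (w_1 - w_2) \in U \\ \lambda v_1 - \lambda v_2 &= \lambda (v_1 - v_2) \in U \end{align*}$

Applying 3.101 in the other direction turns each membership into the required equality of translates.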

Dimension

The dimension of the quotient space is given by a simple subtraction, relating the dimension of $V/U$ to the “lost” dimension of $U$ :

3.105 dimension of quotient space

Suppose $V$ is finite-dimensional and $U$ is a subspace of $V$. Then

 $\text{dim } V/U = \text{dim }V - \text{dim }U.$
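As a small worked example (an illustrative choice of spaces, not taken from the book), take $V = \mathbf{R}^3$ and $U = \{(x, y, 0) : x, y \in \mathbf{R}\}$. Each translate is a horizontal plane:

 $(a, b, c) + U = \{(x, y, c) : x, y \in \mathbf{R}\}$

A translate is therefore determined by the single number $c$, which agrees with 3.105: $\text{dim } V/U = 3 - 2 = 1$.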

Linear Map from V/(null T) to W

3.106 notation: $\widetilde{T}$

Suppose $T \in \mathcal{L}(V, W)$ . Define $\widetilde{T}: V/(\text{null } T) \to W$ by

 $\widetilde{T}(v + \text{null } T) = Tv.$

Think of merging inputs having the same output. These inputs will be the same input in the quotient space $V/(\text{null } T)$ .

For any $v_1, v_2 \in V$ such that $Tv_1 = Tv_2$, $v_1 + \mathrm{null}\, T$ and $v_2 + \mathrm{null}\, T$ are the same element of $V/(\mathrm{null}\, T)$. This makes $\widetilde{T}$ injective. Because $\mathrm{range}\,\widetilde{T}=\mathrm{range}\, T$, $\widetilde{T}$ is also surjective onto $\mathrm{range}\, T$.

3.63 invertibility $\iff$ injectivity and surjectivity

A linear map is invertible if and only if it is injective and surjective.

3.63 shows us that $\widetilde{T}$ is invertible, and according to the definition of isomorphic, $V/(\mathrm{null}\, T)$ and $\mathrm{range}\,T$ are isomorphic vector spaces and $\widetilde{T}$ is their isomorphism.

3.69 definition: isomorphism, isomorphic

An isomorphism is an invertible linear map.
Two vector spaces are called isomorphic if there is an isomorphism from one vector space onto the other one.

One of the key uses of $\widetilde{T}$ is demonstrating a canonical isomorphism. For any linear map $T \in \mathcal{L}(V, W)$ , the quotient space $V/(\text{null } T)$ is isomorphic to the image space $\text{range } T$ . This shows that the quotient space $V/(\text{null } T)$ serves as a way to “mod out” the non-injective part of $T$ .

Study on Alias Method

me@iany.me (Ian Yang) — Sat, 29 May 2010 00:00:00 +0000

@miloyip recently published a post which mentioned the Alias Method for generating a discrete random variable in O(1). After some research, I found that it is a neat and clever algorithm. The following are some notes from my study of it.

What is Alias Method

Alias method is an efficient algorithm to generate a discrete random variable with specified probability mass function using a uniformly distributed random variable.

Let $Z$ be the discrete random variable which has n possible outcomes $z_0,z_1,\ldots,z_{n-1}$ . To make the discussion below simple, we study another variable $Y$ , where $P\{Y=i\}=P\{Z=z_i\}$ . And when $Y$ takes on value $i$ , let $Z$ be $z_i$ . So $Z$ can be generated from $Y$ .

Random variable $X$ is uniformly distributed in $(0, n)$, and its probability density function is

 $f(x) = \left\{ \begin{array}{rl} 1/n & \text{if } 0 < x < n\\ 0 & \text{otherwise}\\ \end{array} \right.$

Now generate a variable $Y'$ that

 $Y' = \left\{ \begin{array}{rl} \lfloor x \rfloor & \text{if } (x - \lfloor x \rfloor) < F(\lfloor x \rfloor)\\ A(\lfloor x \rfloor) & \text{otherwise}\\ \end{array} \right.$

$A(i)$ is the alias function. When $x$ falls in the range $[i, i + 1)$ (where $i$ is an integer), $Y'$ is $i$ with probability $F(i)$ and $A(i)$ with probability $1 - F(i)$. Because $x$ is uniformly distributed,

 $\begin{aligned} P\{x \in [i, i + F(i))\} &= \displaystyle\int_i^{i+F(i)}\frac{1}{n}dx\\ &= (i + F(i) - i) \times 1/n\\ &= F(i)/n,\\ \\ P\{x \in [i + F(i), i + 1)\} &= \displaystyle\int_{i+F(i)}^{i+1}\frac{1}{n}dx\\ &= (i + 1 - (i + F(i))) \times 1/n\\ &= (1-F(i))/n \end{aligned}$

Let’s denote the set of values $j$ that satisfies $A(j) = i$ as $A^{-1}(i)$ . The generated variable $Y'$ has following probability mass function:

 $P\{Y' = i\} = F(i)/n + \sum_{j \in A^{-1}(i)}\frac{1-F(j)}{n}$

The alias method is the algorithm that constructs $A$ and $F$ so that $P\{Y' = i\}$ equals $P\{Y = i\}$ for all $i$. Because the domain of both $A$ and $F$ is the integers $0,1,\ldots,n-1$, they can be stored in arrays and looked up in O(1), with O(n) space.

In miloyip’s implementation, $A$ and $F$ are stored in std::vector<AliasItem> mAliasTable, where $A$ ’s values are stored in AliasItem::index and $F$ ’s values are AliasItem::prob.
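The lookup itself is only a few lines. The sketch below is not miloyip's code: it assumes $F$ and $A$ are plain Python lists indexed by $0,\ldots,n-1$, and the function names are made up for this example. The table shown is worked out by hand for the two-outcome distribution $P\{Y=0\}=0.25$, $P\{Y=1\}=0.75$.

```python
import math
import random


def alias_sample(F, A, x):
    """Map a point x in [0, n) to an outcome index using the tables F and A."""
    i = math.floor(x)                 # which unit slot x falls into
    return i if x - i < F[i] else A[i]


def draw(F, A, rng=random):
    # random() is uniform on [0.0, 1.0), so the product is uniform on [0, n).
    return alias_sample(F, A, rng.random() * len(F))


F = [0.5, 1.0]   # slot 0 keeps outcome 0 for the first half of its width
A = [1, 1]       # the rest of slot 0 is aliased to outcome 1

first = alias_sample(F, A, 0.25)    # lower part of slot 0
second = alias_sample(F, A, 0.75)   # upper part of slot 0, redirected by A
```

With this table, outcome 0 is produced only when $x$ lands in $[0, 0.5)$, which has probability $0.5/2 = 0.25$ as required.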

Algorithm

Construct Steps

Initialize the set $S$ to be $\{0,1,\ldots,n-1\}$ and $n$ variables $p_i$ with values:

 $p_i = P\{Y=i\}, i \in S$

Denote the number of elements in $S$ as $\|S\|$. We have an important invariant:

 $\sum_{i \in S}{p_i} = \|S\| / n$

At the beginning of the algorithm, the invariant holds because the sum of all probabilities must equal 1.

The algorithm is performed using the following steps.

If there is an element $i$ in set $S$ such that $p_i < 1/n$, there must be a $j$ in set $S$ such that $p_j > 1/n$.¹ Let $A(i) = j$ and $F(i) = p_i / (1/n) = p_i \times n$. Remove $i$ from $S$ and subtract $1/n - p_i$ from $p_j$. It is easy to verify that the invariant still holds after these changes.²
Repeat step 1 until $S$ is empty or there are no more elements $i$ in $S$ with $p_i < 1/n$. If $S$ is empty, the algorithm finishes. Otherwise for all remaining $i$ in $S$, we must have $p_i = 1/n$.³ Let $A(i)=i$ and $F(i)=p_i\times n=1$ for all remaining $i$, and remove them from the set $S$.

The algorithm finishes when $S$ becomes empty, and an element is removed only when its corresponding $A$ and $F$ have been determined, so all values of $A$ and $F$ have been generated.

In miloyip’s implementation, $p_i$ is stored in AliasItem::prob before $i$ is removed from the set. When $i$ is removed from the set, AliasItem::prob is set to $F(i)$ .
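The two construction steps can be sketched as follows. This is an illustration rather than miloyip's implementation: it uses exact `Fraction` arithmetic so the invariant holds without rounding error, finds $j$ by a linear scan (so building the table is quadratic, which a real implementation would avoid), and all names are chosen for this example. The input is the five-outcome distribution used in the worked example later in this post.

```python
from fractions import Fraction


def build_alias_table(pmf):
    """Return (F, A) for a list of exact probabilities that sum to 1."""
    n = len(pmf)
    unit = Fraction(1, n)
    p = [Fraction(q) for q in pmf]
    F = [Fraction(1)] * n             # step 2 defaults: F(i) = 1 ...
    A = list(range(n))                # ... and A(i) = i
    in_S = [True] * n
    small = [i for i in range(n) if p[i] < unit]
    while small:                      # step 1, repeated
        i = small.pop()
        # The invariant guarantees some j still in S with p[j] > 1/n.
        j = next(k for k in range(n) if in_S[k] and p[k] > unit)
        A[i], F[i] = j, p[i] * n
        in_S[i] = False
        p[j] -= unit - p[i]           # j gives up the unused part of slot i
        if p[j] < unit:
            small.append(j)
    return F, A


def induced_pmf(F, A):
    """P{Y' = i} = F(i)/n plus (1 - F(j))/n for every j with A(j) = i."""
    n = len(F)
    out = [f / n for f in F]
    for j, i in enumerate(A):
        out[i] += (1 - F[j]) / n
    return out


pmf = [Fraction(s) for s in ("0.16", "0.1", "0.32", "0.22", "0.2")]
F, A = build_alias_table(pmf)
```

For this input the sketch yields $F = (4/5, 1/2, 9/10, 1, 1)$ and $A = (2, 2, 3, 3, 4)$, and `induced_pmf(F, A)` returns the original distribution exactly.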

Correctness

The invariant holds at the beginning and at the end of each step, which guarantees that the algorithm can finish. This is easy to prove by mathematical induction. So we only need to prove $P\{Y'=i\}=P\{Y=i\}$ for any $i$, i.e.,

 $P\{Y = i\} = F(i)/n + \sum_{j \in A^{-1}(i)}\frac{1-F(j)}{n}$

Denote $p'_i$ as the value of $p_i$ when $i$ is removed from set $S$. Checking the construction steps again, we get the following properties:

No $p_i$ can increase. Thus $p_i \leq P\{Y=i\}$ in all steps and $p'_i \leq P\{Y=i\}$.
$p_i$ decreases only when its initial value $P\{Y=i\}>1/n$. So if $P\{Y=i\} \leq 1/n$, $p_i = P\{Y=i\}$ throughout the algorithm and $p'_i=P\{Y=i\}$.
$F(i) = p'_i \times n$
$i$ is removed only when $p_i \leq 1/n$ , i.e., $p'_i \leq 1/n$ , thus $F(i)=p'_i \times n \leq 1$ .
$A(j)$ is set to a value $i \neq j$ only if $p_i > 1/n$ (see step 1), i.e., $P\{Y=i\}>1/n$ .

Now consider value $i$ when $P\{Y=i\}<1/n$ , $P\{Y=i\}=1/n$ and $P\{Y=i\}>1/n$ .

P{Y=i} < 1/n

If $P\{Y=i\} < 1/n$ , from property 2 and property 3, $F(i) = p'_i \times n = P\{Y=i\} \times n$ .

Apparently $A^{-1}(i) = \{\}$, because $A$ is only ever set to a value $j$ with $p_j>1/n$ in step 1 or to a value $k$ with $p_k = 1/n$ in step 2.

Thus

 $\begin{aligned} &F(i)/n + \sum_{j \in A^{-1}(i)}\frac{1-F(j)}{n}\\ =&F(i)/n\\ =&P\{Y=i\} \times n / n\\ =&P\{Y=i\} \end{aligned}$

which completes the proof.

P{Y=i} = 1/n

If $P\{Y=i\} = 1/n$, apparently $A(i) = i$. If another value $j \neq i$ also satisfied $A(j) = i$, then by property 5 we would have $P\{Y=i\} > 1/n$, which contradicts the condition. So $A^{-1}(i) = \{i\}$.

Thus

 $\begin{aligned} &F(i)/n + \sum_{j \in A^{-1}(i)}\frac{1-F(j)}{n}\\ =&F(i)/n + (1-F(i))/n\\ =&1/n \end{aligned}$

which completes the proof.

P{Y=i} > 1/n

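When $P\{Y=i\} > 1/n$, $i$ can be in $A^{-1}(i)$ only if it is removed in step 2, where $F(i) = 1$ and $p'_i = 1/n$, so its own term in each sum below is zero and can be ignored.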

Consider each value $j$ in set $A^{-1}(i)$ . Once $j$ is removed from $S$ , $A(j)$ is set to $i$ and $1/n - p'_j$ is subtracted from $p_i$ . Thus

 $p'_i = P\{Y=i\} - \sum_{j \in A^{-1}(i)}(1/n - p'_j)$

Then

 $\begin{aligned} &F(i)/n + \sum_{j \in A^{-1}(i)}\frac{1-F(j)}{n}\\ =&p'_i \times n / n + \sum_{j \in A^{-1}(i)}\frac{1-(p'_j \times n)}{n}\\ =&P\{Y=i\} - \sum_{j \in A^{-1}(i)}(1/n - p'_j) + \sum_{j \in A^{-1}(i)}(1/n - p'_j)\\ =&P\{Y=i\} \end{aligned}$

Since $P\{Y'=i\} = P\{Y=i\}$ holds for all $i$, the proof is complete.

Intuitive Presentation

The algorithm also has an intuitive interpretation. The range $(0, n]$ is split into $n$ consecutive sub-ranges $(i, i + 1]$ for $i = 0, 1, \ldots, n - 1$. The probability that $X$ falls into any one of them is $(i + 1 - i) \times 1/n = 1/n$.

For $P\{Y=i\} = 1/n$ , we can allocate the whole slot $i$ to it. Let $Y=i$ when $x$ falls in $(i, i + 1]$ which has the probability $1/n$ .

If $P\{Y=i\} < 1/n$ , we can allocate the starting part $(i,i+n\times~P\{Y=i\}]$ in $(i,i+1]$ . Let $Y = i$ when $x$ falls in $(i, i + n\times P\{Y=i\}]$ , where the probability is $n\times~P\{Y=i\}\times(1/n)=P\{Y=i\}$ .

If $P\{Y=i\} > 1/n$, we can allocate unused ranges in $(j + n\times P\{Y=j\}, j + 1]$ for any $j$ with $P\{Y=j\} < 1/n$. However, the unused part of a slot cannot be split again: it must go to a single outcome, because each slot has only one alias.

The figure below demonstrates how to generate $Y$ with $n = 5$ and the following probability mass function:

[Figure: Alias Method]

$P\{Y=0\} = 0.16$
$P\{Y=1\} = 0.1$
$P\{Y=2\} = 0.32$
$P\{Y=3\} = 0.22$
$P\{Y=4\} = 0.2$

$P\{Y=4\}=1/n$, so let $Y = 4$ only when $x$ falls in $(4, 5]$, whose probability is $(5-4)\times 0.2 = 0.2$.

$P\{Y=0\}=0.16<0.2$, so let $Y = 0$ only when $x$ falls in $(0,0.16\times 5]$, i.e., $(0,0.8]$, whose probability is $(0.8-0)\times 0.2=0.16$. $(0.8,1]$ is unused.

$P\{Y=1\}$ is the same. $(1,1.5]$ is allocated and $(1.5,2]$ is unused.

$P\{Y=2\} = 0.32 > 0.2$ , it needs ranges with total length $0.32\times~5=1.6$ . We allocate the range $(0.8, 1]$ and $(1.5, 2]$ . The remaining length $1.6-0.2-0.5=0.9<1$ , then we can allocate a part of its own slot. Finally, three ranges have been allocated, $(0.8,1]$ , $(1.5,2]$ and $(2,2.9]$ . $(2.9,3]$ is unused.

Follow the same step to handle $Y=3$ . The final allocation is depicted in $D$ . The allocation is not unique, $F$ depicts another solution.

References

Suppose $p_j \leq 1/n$ for all $j \neq i$ in $S$. Summing these inequalities together with $p_i < 1/n$ gives $\sum_{k \in S}{p_k} < \|S\| / n$, which contradicts the invariant.
The right side has decreased by $1/n$ because $\|S\|$ has decreased by 1. The left side has decreased by $p_i + (1/n - p_i) = 1/n$, because $i$ is removed from the set and $(1/n - p_i)$ is subtracted from $p_j$. Both sides decrease by the same amount, so the equality still holds.
Because no $p_i < 1/n$, every $p_i \geq 1/n$. To satisfy the invariant, no $p_i$ can be larger than $1/n$. Thus for all $i$ in $S$, $p_i = 1/n$.