Power Laws, Extremistan, and Non-Ergodicity
The previous essay ended with fractals: objects whose structure survives repeated zooming. This essay turns that geometric statement into a statistical one.
If an object has no characteristic length scale, then counting it often produces a power law. If a distribution has a power-law tail, then extremes stop being negligible. If extremes stop being negligible, then averages become unstable. If the process is multiplicative or has absorbing ruin states, then the average over many possible worlds can become irrelevant to the trajectory one person actually lives.
That is the path:
$$\text{scale invariance}\to\text{power laws}\to\text{fat tails}\to\text{non-ergodicity}.$$
This is where Mandelbrot, Taleb, Peters, Kelly, Wilson, Bak, and the central limit theorem start belonging to the same conversation.
#The Minimum Vocabulary
A distribution tells you how likely different outcomes are. For a continuous variable, the density $p(x)$ describes probability per unit of $x$. The survival function:
$$P(X>x)$$
asks for the probability of seeing an outcome larger than $x$.
A tail is the far end of a distribution: rare large values. A thin-tailed distribution, like a Gaussian, makes very large events disappear extremely fast. A fat-tailed distribution makes them disappear slowly enough that extremes remain structurally important.
A power law is a scaling relationship:
$$P(X>x)\sim Cx^{-\alpha}.$$
Multiplying $x$ by a constant multiplies the probability by another constant. There is no preferred scale.
An ensemble average averages across many possible worlds or many parallel copies of a process. A time average follows one process through time and averages along that single trajectory.
A process is ergodic when the time average and ensemble average agree in the relevant long-run sense. It is non-ergodic when they do not. Multiplicative wealth, ruin, and path-dependent systems are often non-ergodic.
#From Scale Invariance to Power Laws
Now zoom into a fractal.
It looks similar.
Zoom again.
Still similar.
Again.
Still similar.
There is no characteristic scale. That is scale invariance.
This section is sometimes summarized as “fractals create power laws,” but that sentence is too crude. The real implication is:
$$\text{scale invariance}\Rightarrow\text{power-law form}.$$
Fractals are one geometric source of scale invariance. They are not the only source, and not every fractal measurement gives a clean power law. Markets, cities, earthquakes, languages, and networks can produce scaling through mechanisms that are not literally fractal geometry: preferential attachment, multiplicative growth, criticality, mixtures of regimes, and renormalization. The common object is not “fractal” but “no preferred scale.”
A characteristic scale is a typical size. Human adult height has a characteristic scale. You can talk about a normal height. Earthquake size, city size, wealth, firm size, and market crashes often do not behave that way. There are many small events, fewer medium events, and rare enormous events, with no single size that organizes the whole distribution.
A power law is a relationship where multiplying the input by a fixed amount multiplies the output by a fixed amount. The simplest form is:
$$y=Cx^{-\alpha}.$$
If you double $x$, then:
$$y(2x)=C(2x)^{-\alpha}=2^{-\alpha}Cx^{-\alpha}=2^{-\alpha}y(x).$$
So every doubling of size reduces frequency by the same factor. That is what “no characteristic scale” means in practice. The distribution does not care whether you move from $10$ to $20$, $100$ to $200$, or $1000$ to $2000$. The multiplicative relationship is the same.
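A quick numerical check of the doubling identity, as a minimal sketch. The values $C=1$ and $\alpha=1.7$ are illustrative assumptions, not values fixed by the argument.

```python
def power_law(x, c=1.0, alpha=1.7):
    """y = C * x**(-alpha); C and alpha are illustrative assumptions."""
    return c * x ** (-alpha)


def doubling_ratio(x, alpha=1.7):
    """Ratio y(2x) / y(x); for a power law it does not depend on x."""
    return power_law(2 * x, alpha=alpha) / power_law(x, alpha=alpha)


# The ratio is the same whether we start at 10, 100, or 1000.
for x in (10, 100, 1000):
    print(x, doubling_ratio(x), 2 ** -1.7)
```

The printed ratio should agree with $2^{-\alpha}$ at every starting point, up to floating-point rounding.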
Scale invariance has a simple mathematical form. Suppose a distribution $P(x)$ satisfies:
$$P(\lambda x)=\lambda^{-\alpha}P(x).$$
The solutions are power laws:
$$P(x)=Cx^{-\alpha}.$$
Here is the more general derivation. Scale invariance first says that rescaling $x$ changes $P$ only by a scale-dependent multiplier:
$$P(\lambda x)=c(\lambda)P(x).$$
Now rescale twice. Scaling by $\mu$ and then by $\lambda$ must agree with scaling once by $\lambda\mu$:
$$P(\lambda\mu x)=c(\lambda)c(\mu)P(x).$$
But the same left-hand side is also:
$$P(\lambda\mu x)=c(\lambda\mu)P(x).$$
So:
$$c(\lambda\mu)=c(\lambda)c(\mu).$$
The regular solutions of this multiplicative equation are:
$$c(\lambda)=\lambda^{-\alpha}.$$
Therefore:
$$P(\lambda x)=\lambda^{-\alpha}P(x),$$
and the compatible shapes are power laws:
$$P(x)=Cx^{-\alpha}.$$
So the power law does not come from visual roughness by itself. It comes from consistency under repeated rescaling.
Here I am using $P(x)$ informally for a scale-dependent quantity. In probability, one has to distinguish two related exponents. If the density behaves like $p(x)\sim Cx^{-\beta}$, then the survival function behaves like:
$$P(X>x)\sim C'x^{-(\beta-1)}.$$
The survival exponent is often called the tail index. In the next section, when I write:
$$P(X>x)\sim x^{-\alpha},$$
$\alpha$ is the tail index, not the density exponent. This convention is common in discussions of Pareto tails, but it is worth making explicit because the two exponents differ by one.
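The shift by one can be checked directly by integrating the density. As a sketch, assume the density is exactly $Cu^{-\beta}$ for all $u\geq x$, with $\beta>1$ so that the integral converges:

$$P(X>x)=\int_x^\infty Cu^{-\beta}\,du=\frac{C}{\beta-1}\,x^{-(\beta-1)}.$$

Under that assumption the survival constant is $C'=C/(\beta-1)$ and the tail index is $\alpha=\beta-1$.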
You can see this by taking logarithms. If:
$$P(x)=Cx^{-\alpha},$$
then:
$$\log P(x)=\log C-\alpha \log x.$$
So on log-log axes, the distribution becomes a straight line.
#Graph: Power Law Versus Gaussian
[Figure: a power-law curve with exponent $-1.7$ plotted against a Gaussian.]
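The comparison can be sketched without a plotting library by measuring the local slope on log-log axes. This is a reconstruction under assumptions: only the exponent $-1.7$ comes from the original; the standard Gaussian density and the grid of $x$ values are choices made here for illustration.

```python
import math

ALPHA = 1.7  # exponent from the original figure


def power_law(x, alpha=ALPHA):
    return x ** (-alpha)


def gaussian_density(x):
    """Standard normal density (an assumed choice of Gaussian)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def loglog_slope(f, x, factor=2.0):
    """Slope of log f against log x, measured between x and factor * x."""
    return (math.log(f(factor * x)) - math.log(f(x))) / math.log(factor)


for x in (1.0, 2.0, 4.0, 8.0):
    print(f"x={x:4.1f}  power-law slope={loglog_slope(power_law, x):7.2f}  "
          f"gaussian slope={loglog_slope(gaussian_density, x):9.2f}")
```

The power-law slope stays at $-\alpha$ for every $x$, which is the straight line. The Gaussian slope keeps steepening, which is what a characteristic scale looks like on log-log axes.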
#Simulation: Power-Law Sampling
Compare running sample means from Gaussian and Pareto draws.
[Listing: running sample means for 50_000 Gaussian and Pareto draws.]
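A minimal standard-library sketch of that comparison follows. The sample size of 50_000 comes from the original; the tail index `alpha=1.2`, the seed, and the function names are assumptions. `random.paretovariate` already returns values of at least 1, which plays the role of the `+ 1` shift in the original fragment.

```python
import random
from itertools import accumulate


def running_means(samples):
    """Mean of the first k samples, for k = 1..len(samples)."""
    return [total / k for k, total in enumerate(accumulate(samples), start=1)]


def simulate(n=50_000, alpha=1.2, seed=0):
    rng = random.Random(seed)
    gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # paretovariate returns values >= 1 with P(X > x) = x**(-alpha).
    pareto = [rng.paretovariate(alpha) for _ in range(n)]
    return running_means(gaussian), running_means(pareto)


gaussian_means, pareto_means = simulate()
for k in (100, 1_000, 10_000, 50_000):
    print(k, gaussian_means[k - 1], pareto_means[k - 1])
```

Passing the two lists to any plotting tool gives the graph. With a tail index this close to 1 the mean exists, but the running Pareto mean tends to move in visible steps rather than settling smoothly.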
What the reader should see: the Gaussian mean stabilizes. The Pareto mean keeps jumping when a new extreme appears. More data does not necessarily make a fat-tailed average feel calm.
Fractals are geometric scale invariance. Power laws are statistical scale invariance.
Same phenomenon.
Different viewpoint.
The sixth lesson:
Power laws are the probability version of scale invariance. Fractals are one geometric route to that invariance, not the whole story.
There is another way to see why scale invariance forces power laws. Let:
$$Q(t)=\log P(e^t).$$
The scaling law:
$$P(\lambda x)=\lambda^{-\alpha}P(x)$$
becomes, with $\lambda=e^s$ and $x=e^t$:
$$Q(t+s)=Q(t)-\alpha s.$$
The only continuous solutions are affine:
$$Q(t)=C-\alpha t.$$
Returning to $x=e^t$ gives:
$$P(x)=e^C x^{-\alpha}.$$
So a power law is not an arbitrary curve. It is the unique distributional shape compatible with translation invariance in log-space. Zooming becomes shifting. Self-similarity becomes linearity.
That is why power laws are so often the statistical shadow of renormalization. If the system looks the same after coarse-graining, its observables often become eigenfunctions of a scaling operator. The exponent $\alpha$ is the eigenvalue written in statistical form.
If that sentence feels too compressed, read it this way: when zooming out leaves the system with the same form, the quantities you measure must transform predictably under zooming. The simplest predictable transformation is multiplication by a constant. Power laws are exactly the functions that do that.
The canonical physics example is Wilson’s renormalization group for second-order phase transitions. Near a critical point, a magnet, fluid, or lattice model can look statistically similar after coarse-graining. Details of the microscopic system wash out; critical exponents remain. This is the same kind of universality that Feigenbaum found in period doubling, now appearing in equilibrium statistical physics.
Self-organized criticality is another route. In Bak, Tang, and Wiesenfeld’s sandpile model, grains are added slowly until the system organizes itself near a critical state. Avalanches of many sizes appear. The point is not that every power law comes from a sandpile, but that repeated local rules can create scale-free statistics without an external planner tuning the system by hand.
Preferential attachment is a third route, and it is probably the one readers meet most often. If new links, people, capital, or attention attach preferentially to nodes that already have many links, people, capital, or attention, then the large get larger faster. This is the Yule-Simon or rich-get-richer mechanism. It appears in city sizes, word frequencies, citation networks, wealth distributions, internet links, and firm sizes.
The toy rule is simple. If node $i$ has size $k_i$, then the probability that the next unit attaches to it is proportional to $k_i$:
$$P(i)=\frac{k_i}{\sum_j k_j}.$$
That one reinforcement rule can create a heavy-tailed distribution without invoking fractal geometry or a sandpile. The connection to the ladder is still the same: a repeated local transformation changes the distribution, and the long-run distribution approaches a scale-free fixed shape.
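The reinforcement rule can be simulated in a few lines. This sketch is a Simon-style toy model, and it adds one assumption the text does not state: with probability `new_node_prob` a new node of size 1 is created, otherwise the next unit attaches to an existing node with probability proportional to its size. All names and parameter values are hypothetical.

```python
import random


def preferential_attachment(steps=20_000, new_node_prob=0.1, seed=0):
    rng = random.Random(seed)
    sizes = [1]   # sizes[i] is k_i
    owners = [0]  # one entry per unit, holding the index of the node that owns it
    for _ in range(steps):
        if rng.random() < new_node_prob:
            sizes.append(1)
            owners.append(len(sizes) - 1)
        else:
            # A uniform draw over units picks node i with probability k_i / sum_j k_j.
            i = rng.choice(owners)
            sizes[i] += 1
            owners.append(i)
    return sizes


sizes = preferential_attachment()
ordered = sorted(sizes)
print("nodes:", len(sizes))
print("median size:", ordered[len(ordered) // 2])
print("largest size:", ordered[-1])
```

In a typical run the median node stays tiny while the largest node holds a sizeable fraction of all units, which is the rich-get-richer signature.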
There is also a direct probability version of renormalization: the central limit theorem.
Take two independent copies of a random variable, add them, and rescale:
$$X\mapsto \frac{X_1+X_2}{\sqrt{2}}.$$
Repeat this operation. For distributions with zero mean and finite variance, the Gaussian is the attracting fixed point. That is why sums of many small independent effects so often look normal.
But if the variance is infinite, the Gaussian is no longer the right attractor. The fixed points become the non-Gaussian Lévy stable laws, all of which have power-law tails. Fat tails are not merely “failed Gaussians.” They belong to a different basin of attraction in distribution space.
The rescaling exponent is the tell. To keep a finite-variance sum fixed you divide by $\sqrt{n}=n^{1/2}$; a stable law with tail index $\alpha$ instead rescales like $n^{1/\alpha}$. That exponent is not new. It is the same $n^{1/\alpha}$ that governs how fast the maximum grows in the next section. The way fat-tailed sums refuse to shrink and the way fat-tailed maxima refuse to stay small are one fact wearing two faces.
This is one of the cleanest ways to understand Extremistan. A Gaussian world and a power-law world are governed by different renormalization fixed points.
The bridge from fractals to power laws is measurement.
A fractal is scale invariance seen as shape. A power law is scale invariance seen as counting. If a coastline, river network, fault system, market cascade, or city system has structure across many scales, then counting “how many things of size at least $x$?” often produces a power law.
The important connection is that both are fixed points of a rescaling operation.
For a fractal, rescale the picture and the shape is statistically similar.
For a power law, rescale the variable and the distribution changes only by a multiplicative factor:
$$P(\lambda x)=\lambda^{-\alpha}P(x).$$
This is not a metaphor. It is the same mathematical form: an object is transformed by zooming, and its essential structure survives. The zoom operation is the transformation; the fractal or power-law distribution is the invariant object.
The geometry and the probability are not separate mysteries. They are two ways of observing the same lack of characteristic scale.
#Power Laws Create Extremistan
This is where Taleb enters.
In a Gaussian world:
- averages matter,
- variance is finite,
- large events are rare enough to ignore most of the time.
In a power-law world:
- extremes dominate,
- variance may not exist,
- averages can remain unstable for a very long time.
For a Pareto tail:
$$P(X>x)\sim x^{-\alpha}.$$
This means: the probability that $X$ exceeds $x$ falls like a power of $x$. The symbol $\sim$ means “asymptotically proportional to.” It does not say the equality is exact at every size. It says the tail behaves like that for large values.
The mean exists only when:
$$\alpha>1.$$
The variance exists only when:
$$\alpha>2.$$
So if $1<\alpha\leq 2$, the average exists but the variance is infinite. If $\alpha\leq 1$, even the mean does not exist. This is not a small technicality. It changes what evidence means.
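These thresholds can be read off from an exact Pareto law. As a worked sketch, assume $P(X>x)=(x_m/x)^{\alpha}$ for $x\geq x_m$, where the minimum value $x_m$ is a symbol introduced here for illustration. Then:

$$\mathbb{E}[X]=\int_{x_m}^{\infty}x\,\frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}\,dx=\frac{\alpha x_m}{\alpha-1}\quad(\alpha>1),\qquad \mathbb{E}[X^2]=\frac{\alpha x_m^{2}}{\alpha-2}\quad(\alpha>2).$$

Each integral diverges at its threshold, because the integrand then decays only like $1/x$.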
In a thin-tailed world, more samples stabilize your estimate quickly. In a fat-tailed world, one new observation can dominate the entire history.
One earthquake dominates a century.
One company dominates a market.
One city dominates a country.
One idea dominates an era.
The seventh lesson:
Power laws move importance from the average to the extreme.
The operational difference can be seen in the maximum.
For $n$ Gaussian observations, the maximum grows slowly, roughly like:
$$\sqrt{2\log n}.$$
For $n$ Pareto observations with tail index $\alpha$, the maximum grows like:
$$n^{1/\alpha}.$$
That is the whole Extremistan difference. In Mediocristan, the maximum grows like the square root of a logarithm. In Extremistan, it grows as a power of sample size. More observations do not merely refine the average. They create room for a new dominant event.
This is why the sample mean can look stable for a while and then jump. It was not converging in the way your Gaussian-trained intuition expected. It was waiting for a new maximum.
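The two growth rates can be compared by simulation. This is a rough sketch: the tail index `ALPHA = 1.5`, the number of trials, and the use of the median across trials to tame noise are all assumptions made for illustration.

```python
import math
import random
import statistics

ALPHA = 1.5  # assumed tail index for the Pareto draws


def gaussian(rng):
    return rng.gauss(0.0, 1.0)


def pareto(rng):
    return rng.paretovariate(ALPHA)


def median_maximum(draw, n, trials=25, seed=0):
    """Median, over independent trials, of the largest of n draws."""
    rng = random.Random(seed)
    return statistics.median(
        max(draw(rng) for _ in range(n)) for _ in range(trials)
    )


for n in (100, 1_000, 10_000):
    print(n,
          round(median_maximum(gaussian, n), 2), round(math.sqrt(2 * math.log(n)), 2),
          round(median_maximum(pareto, n), 1), round(n ** (1 / ALPHA), 1))
```

The formulas are asymptotic orders of growth, so the simulated values should track them only up to constant factors. The contrast to look for is that a hundredfold increase in $n$ barely moves the Gaussian maximum and multiplies the Pareto maximum many times over.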
The bridge from power laws to Extremistan is dominance.
In a Gaussian world, no single observation is allowed to matter too much. The central limit theorem is the mathematical expression of this. Many small independent contributions add up, and the aggregate becomes stable. Individual terms disappear into the average.
In a power-law world, the largest observation can be the story. The aggregate is not a democratic sum of comparable pieces. It is often an aristocracy of extremes.
This changes what it means to understand a system. In Mediocristan, the typical case is informative. If you understand the average human height, you understand a lot about human height. In Extremistan, the typical case can be almost irrelevant. The typical startup does not explain venture returns. The typical earthquake does not explain geological damage. The typical word does not explain language frequency. The typical city does not explain urban concentration.
So instead of asking what the ordinary case looks like, you ask:
$$\frac{\text{largest few observations}}{\text{total mass}}.$$
If that ratio is large, then averages become summaries of extremes, not summaries of typicality.
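The ratio is easy to compute on a sample. In this sketch the helper name `top_share`, the height-like Gaussian sample, and the Pareto tail index of 1.1 are illustrative assumptions; the helper also assumes all values are positive.

```python
import random


def top_share(values, k=1):
    """Share of the total mass held by the k largest observations."""
    ordered = sorted(values, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


rng = random.Random(0)
heights = [rng.gauss(170.0, 10.0) for _ in range(10_000)]   # thin-tailed
fortunes = [rng.paretovariate(1.1) for _ in range(10_000)]  # fat-tailed

print("top 10 share, heights: ", top_share(heights, k=10))
print("top 10 share, fortunes:", top_share(fortunes, k=10))
```

For the thin-tailed sample the top ten observations hold roughly their headcount share of the total. For the fat-tailed sample they hold far more than that.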
This is the statistical version of the earlier attractor story. In dynamics, the attractor organizes trajectories. In Extremistan, the tail organizes the sample. The center is no longer sovereign. The edge is.
#Extremistan Breaks Ergodicity
Now introduce time.
An ensemble average asks what happens across many parallel worlds. A time average asks what happens to one system as it moves through time.
These are not the same question.
An ergodic system is one where, roughly, watching one typical trajectory for a long time gives you the same statistics as looking at many copies of the system at one time.
A clean mathematical version is this. Suppose a system evolves by a transformation $T$, and suppose $g(x)$ is some quantity you measure at state $x$. The time average along one trajectory is:
$$\frac{1}{N}\sum_{n=0}^{N-1}g(T^n x).$$
The ensemble average is:
$$\int g(x)\,d\mu(x),$$
where $\mu$ is the probability distribution over states.
Ergodicity says that, under the right conditions:
$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1}g(T^n x)=\int g(x)\,d\mu(x).$$
Do not worry about the measure-theory details. The practical meaning is simple: one long-lived path eventually samples the space fairly.
This is the content of Birkhoff’s ergodic theorem. It is one of the theorems that makes the phrase “time average equals ensemble average” mathematically precise.
Non-ergodicity means this fails. One path through time is not equivalent to many parallel samples.
There is another fixed-point object nearby. In a Markov chain, a stationary distribution $\pi$ satisfies:
$$\pi=\pi P,$$
where $P$ is the transition matrix. So the long-run distribution is a fixed point of the operator that pushes distributions forward one step.
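One way to find such a fixed point is to apply the operator repeatedly until the distribution stops changing. The two-state transition matrix below is a made-up example, and the function names are hypothetical.

```python
def step(pi, P):
    """Push a row distribution forward one step: returns pi P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, iterations=1_000):
    """Approximate a stationary distribution by repeated application of P."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        pi = step(pi, P)
    return pi


# Made-up chain: state 0 is sticky, state 1 leaves half the time.
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = stationary(P)
print(pi)
print(step(pi, P))  # applying P once more should leave pi unchanged
```

For this matrix the balance condition $0.1\,\pi_0=0.5\,\pi_1$ gives $\pi=(5/6,\,1/6)$, which the iteration should approach. Repeated application converges here because the chain is irreducible and aperiodic; that is an assumption about the example, not a general guarantee.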
For deterministic chaotic maps, the analogous object is an invariant measure. Instead of asking where one point goes, ask how a whole density of points moves. The transfer, or Perron-Frobenius, operator sends today’s density to tomorrow’s density. An invariant density $\rho$ satisfies:
$$\mathcal{P}\rho=\rho.$$
For the logistic map at $r=4$, the natural invariant density is:
$$\rho(x)=\frac{1}{\pi\sqrt{x(1-x)}}.$$
This connects chaos back to ergodicity. A chaotic map can be unpredictable point by point while still having a stable statistical distribution. The point trajectory is unstable; the measure is the fixed object.
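The agreement between the two averages can be checked numerically for the logistic map at $r=4$. This is a sketch under assumptions: the starting point, burn-in, and step counts are arbitrary, and floating-point orbits only shadow true orbits, so the match is approximate. The ensemble average uses the substitution $x=\sin^2\theta$, which turns $\rho(x)\,dx$ into $(2/\pi)\,d\theta$ on $[0,\pi/2]$.

```python
import math


def time_average(g, x0=0.123, steps=200_000, burn_in=1_000):
    """Average of g along one trajectory of x -> 4x(1-x)."""
    x = x0
    for _ in range(burn_in):
        x = 4.0 * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        total += g(x)
        x = 4.0 * x * (1.0 - x)
    return total / steps


def ensemble_average(g, n=10_000):
    """Integral of g against rho(x) = 1 / (pi * sqrt(x(1-x))), by the midpoint rule in theta."""
    h = (math.pi / 2) / n
    return sum(g(math.sin((i + 0.5) * h) ** 2) for i in range(n)) / n


for name, g in (("x", lambda x: x), ("x^2", lambda x: x * x)):
    print(name, time_average(g), ensemble_average(g))
```

Under the invariant density the exact ensemble values are $1/2$ for $g(x)=x$ and $3/8$ for $g(x)=x^2$, and the single-trajectory averages should land close to them.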
Suppose wealth evolves multiplicatively:
$$W_{t+1}=W_t(1+R_t).$$
The ensemble object is:
$$\mathbb{E}[W_t].$$
The time-growth object is:
$$\lim_{T\to\infty}\frac{1}{T}\log\left(\frac{W_T}{W_0}\right)=\mathbb{E}[\log(1+R)].$$
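The logarithm appears because repeated multiplication becomes repeated addition in log space. The equality with $\mathbb{E}[\log(1+R)]$ then follows from the law of large numbers, assuming the returns $R_t$ are independent and identically distributed and that this expectation is finite.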
If the system has fat tails, ruin states, or rare dominating events, a single trajectory does not sample all possibilities in any useful way. The ensemble average becomes a bad guide to the lived path.
That is non-ergodicity:
$$\text{time average}\neq \text{ensemble average}.$$
This is why Taleb, Peters, Kelly, and ergodicity economics care so much about fat tails. A strategy can look good in expectation and still be fatal through time.
#Simulation: Ensemble Versus Time
[Listing: 20_000 simulated wealth paths over 300 steps, with mostly small gains and rare large losses.]
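A standard-library sketch of such a simulation follows. The path count (20_000) and horizon (300) come from the original; the payoff numbers are assumptions chosen so that the expected multiplier is above 1 ($0.95\times1.03+0.05\times0.5=1.0035$) while the expected log growth is negative (about $-0.0066$ per step). Function and parameter names are hypothetical.

```python
import random
import statistics


def simulate_wealth(n_paths=20_000, n_steps=300, p_loss=0.05,
                    gain=1.03, loss=0.50, record_every=50, seed=0):
    """Return (step, ensemble mean, median path) at regular checkpoints."""
    rng = random.Random(seed)
    wealth = [1.0] * n_paths
    history = []
    for step in range(1, n_steps + 1):
        # Mostly small gains, rare large losses.
        wealth = [w * (loss if rng.random() < p_loss else gain) for w in wealth]
        if step % record_every == 0:
            history.append((step, statistics.fmean(wealth), statistics.median(wealth)))
    return history


history = simulate_wealth()
for step, mean, median in history:
    print(f"step={step:3d}  ensemble mean={mean:8.3f}  median path={median:8.3f}")
```

The sample mean over a finite number of paths is itself noisy here, because it is carried by a small number of lucky paths. That noise is part of the point.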
Plot the ensemble mean and the median path. They separate. The average gets pulled by winners. The typical path is governed by survival and multiplicative compounding.
To make the graph explicit:
[Figure: ensemble mean versus median path.]
What the reader should see: the ensemble mean can rise while the typical path stagnates or dies. In multiplicative systems, the average path and the lived path are different mathematical objects.
The eighth lesson:
Power laws often make systems non-ergodic because rare events dominate long-run outcomes.
A minimal two-outcome example makes the ensemble/time split sharper.
Suppose a gamble multiplies wealth by $1.5$ with probability $1/2$ and by $0.6$ with probability $1/2$.
The ensemble-average multiplier is:
$$\mathbb{E}[M]=\frac{1.5+0.6}{2}=1.05.$$
Across parallel worlds, average wealth rises by 5 percent per round. But the time-average growth rate is:
$$g=\frac{1}{2}\log(1.5)+\frac{1}{2}\log(0.6)=\log\sqrt{0.9}<0.$$
One person repeating the gamble sees their wealth decay exponentially toward zero, almost surely, even though the ensemble average grows. The fixed point has moved again: it is now the long-run growth rate of a repeated multiplicative process, and the logarithm is the coordinate system that reveals it.
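Both numbers, and the behaviour of a single long path, can be checked directly. In this sketch the multipliers 1.5 and 0.6 come from the example above; the round count and seed are arbitrary assumptions.

```python
import math
import random

UP, DOWN = 1.5, 0.6  # multipliers from the example, each with probability 1/2

ensemble_multiplier = 0.5 * UP + 0.5 * DOWN
time_growth_rate = 0.5 * math.log(UP) + 0.5 * math.log(DOWN)


def realized_growth_rate(rounds=10_000, seed=0):
    """Per-round log growth along one simulated path of the repeated gamble."""
    rng = random.Random(seed)
    log_wealth = 0.0
    for _ in range(rounds):
        log_wealth += math.log(UP if rng.random() < 0.5 else DOWN)
    return log_wealth / rounds


print("ensemble multiplier:", ensemble_multiplier)
print("time-average growth rate:", time_growth_rate)
print("one simulated path:", realized_growth_rate())
```

Working in log wealth avoids floating-point underflow and makes the comparison with $g$ direct: the realized rate of one long path should sit near $\log\sqrt{0.9}$, not near $\log 1.05$.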
This is why repeated transformation is the right primitive. A one-shot gamble and a repeated gamble are different mathematical objects. Expected value is already a poor guide to the first in some cases. It is a catastrophic guide to the second when multiplication, ruin, and fat tails enter.
The bridge from Extremistan to non-ergodicity is the single path.
Power laws tell you that extremes dominate the ensemble. Non-ergodicity asks whether one trajectory gets to experience the ensemble in the right proportions. Often it does not.
This is a deeper shift than it first appears. Probability theory often starts by imagining many possible outcomes side by side. But life is not side by side. Life is sequential. You do not get to average over parallel versions of yourself after ruin. You move through one path, in order.
That order matters. Multiplicative systems remember losses differently from gains. A 50 percent loss followed by a 50 percent gain is not zero:
$$1.0 \times 0.5 \times 1.5 = 0.75.$$
The arithmetic average return is zero, but the path lost 25 percent. The logarithm sees this because logarithms turn multiplication into addition:
$$\log(ab)=\log a+\log b.$$
So the quantity that matters for repeated wealth dynamics is not expected return. What matters is expected log growth, together with survival probability and drawdown structure.
If the rare event kills the process, the process stops before it can average anything. If the rare event creates a giant winner, the ensemble mean may be dominated by paths you will almost surely not live. Time is not a neutral sampling device. It has order, survival, and path dependence.
This connects back to attractors. In an ergodic system, a single long trajectory eventually explores the relevant space in the right proportions. In a non-ergodic system, the trajectory gets trapped, ruined, amplified, or path-dependent. The basin you fall into matters more than the ensemble average over all basins.
The invariant summary of this essay is:
$$ P(\lambda x)=\lambda^{-\alpha}P(x) \quad\Rightarrow\quad P(x)\propto x^{-\alpha}. $$
Scale invariance does not always come from fractals. It can come from renormalization, multiplicative growth, preferential attachment, criticality, or self-organized criticality. The shared object is the absence of a characteristic scale.
Once those scale-free distributions enter repeated multiplicative processes, the relevant invariant is not the ensemble expectation:
$$ \mathbb{E}[W_t], $$
but the time-average growth rate:
$$ \lim_{T\to\infty}\frac{1}{T}\log\frac{W_T}{W_0}. $$
The next essay asks why recurrence and self-reference force arithmetic into the story.
#Further Reading
- Benoit Mandelbrot, The Fractal Geometry of Nature. The classic source for fractals and scaling.
- Mark Newman, Power laws, Pareto distributions and Zipf’s law (2005). A careful first paper on power-law distributions.
- Aaron Clauset, Cosma Shalizi, and Mark Newman, Power-law distributions in empirical data (2009). The statistical cautionary reference.
- Kenneth Wilson, The renormalization group and critical phenomena (1982 Nobel lecture, published 1983). The clean conceptual source for renormalization and critical exponents.
- Ole Peters, The ergodicity problem in economics (2019). A direct route into time averages, ensemble averages, and multiplicative wealth.
- John Kelly, A new interpretation of information rate (1956). The original Kelly criterion paper.