Deriving the Gibbs State

Suppose you have a set of observables X_1, \ldots, X_n over a set of orthonormal basis states S_1, \ldots, S_m. You want to add the constraint that for each i, <X_i> = x_i, where x_i is a scalar. What would this tell you about the relative probabilities of the states S_1, \ldots, S_m?

This does not fix the state, but it does skew the probability distribution over possible states. Without any constraints, the distribution with the largest entropy is the uniform one. With the constraints in place, we look for the distribution that has maximal entropy among all those satisfying the constraints.

The precise way to define a “probability distribution over the state” is the density matrix \rho, chosen to be diagonal in our basis: <S_i|\rho|S_j> = 0 for i \neq j, and <S_i|\rho|S_i> is the probability of state S_i. By linearity, this gives a complete description of the density matrix, and one can check that \rho, viewed as an operator, is basis independent.

For any observable A, we have <A> = tr(\rho A). So our constraints can be written x_i = tr(\rho X_i). The entropy of the density matrix is S = -tr(\rho\ln(\rho)). We also require tr(\rho) = 1. At this point the problem is practically screaming for Lagrange Multipliers. Define the function L(\rho, \lambda_1, \cdots, \lambda_n) by

L = S - \sum_i \lambda_i tr(\rho X_i) - \lambda_0 tr(\rho) + \lambda_0 + \sum_i \lambda_i x_i

We achieve maximum entropy when \frac{dL}{d\rho} = 0, so in this case

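\frac{dL}{d\rho} = -\ln(\rho) - I - \lambda_0 I - \sum_i \lambda_i X_i = 0

\ln(\rho) = -(1 + \lambda_0)I - \sum_i \lambda_i X_i

So our maximum entropy density matrix takes the form \rho = \frac{1}{Z}e^{-\sum_i \lambda_i X_i}, where the constant Z = e^{1 + \lambda_0} = tr(e^{-\sum_i \lambda_i X_i}) is fixed by the normalization tr(\rho) = 1. The mixed state corresponding to this density matrix is called the Gibbs State.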
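As a sanity check, here is a minimal Python sketch of the classical (diagonal) case with a single observable. It assumes three states with values 0, 1, 2 and a target expectation of 0.8; these numbers and all function names are made up for illustration. It solves for the multiplier by bisection and compares the entropy of the resulting exponential distribution against another distribution with the same expectation.

```python
import math

def gibbs_probs(values, lam):
    """Normalized weights exp(-lam * x) / Z for each value x."""
    weights = [math.exp(-lam * v) for v in values]
    z = sum(weights)
    return [w / z for w in weights]

def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def solve_lambda(values, target, lo=-50.0, hi=50.0, iters=200):
    """Bisection on lam; the mean is a decreasing function of lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(values, gibbs_probs(values, mid)) > target:
            lo = mid  # mean too large, so lam must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

values = [0.0, 1.0, 2.0]       # hypothetical eigenvalues of the observable
target = 0.8                   # hypothetical constraint <X> = 0.8
lam = solve_lambda(values, target)
p_gibbs = gibbs_probs(values, lam)
p_other = [0.4, 0.4, 0.2]      # also has mean 0.8, but is not exponential
print(entropy(p_gibbs), entropy(p_other))
```

The exponential distribution should come out with the larger entropy of the two, which is what the Lagrange multiplier argument predicts.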



Legendre Transformations in Physics

The Helmholtz Free Energy measures the maximum amount of work W that can be extracted from a system in contact with a heat bath. If the internal energy of the system is U, the temperature of the heat bath is T, and the entropy of the system is S, the Helmholtz Energy is

F = U - TS

This is the way you usually see the Helmholtz Energy written. As we will see later in this post, though, it makes more sense to talk about the Helmholtz Energy as the negative of its usual definition, F = ST - U. This is the definition we will use from now on – instead of thinking of energy that can be extracted from the system, we are thinking of energy the system can lose.

The free energy plays an important role in Chemistry, as a reaction at constant temperature and volume will only occur spontaneously if it decreases U - TS, that is, if it increases F in our sign convention.

Energy conservation guarantees

\delta U_{bath} + \delta U + W = 0

Since we consider the system to have a fixed volume, Q_{bath} = \delta U_{bath}, so

W = -Q_{bath} -\delta U

The heat bath stays at constant temperature, so \delta S_{bath} = \frac{Q_{bath}}{T} = \frac{-W - \delta U}{T}. Then the total entropy change is

\delta S_{bath} + \delta S = \frac{-W - \delta U + T\delta S}{T}

Note that the left side is the total entropy change of the process. There is nothing in principle stopping this change from being negative, just as there is nothing stopping all the air molecules in a box from lining up on one side; it is just overwhelmingly unlikely. The entropy of the system alone can certainly decrease: when water freezes, its molecules become more ordered, and the heat released raises the entropy of the surroundings by at least as much. So we assume that the total entropy never decreases, so that

0 \le \frac{-W - \delta U + T\delta S}{T}

W \le \delta(TS - U)

W \le \delta F

In other words, the Helmholtz Energy limits the amount of useful work a system can do on its environment.

Interestingly, the Helmholtz Energy is related to internal energy by a Legendre Transformation, the same transformation that relates Hamiltonians and Lagrangians. In other words,

F(T, V) = \sup_S (ST - U(S, V))

Where we can see that the right side will give the right value of S for a given T by noting that at a critical point,

\frac{d}{dS} (U(S, V) - ST) = 0

T = \frac{dU}{dS}

Which is the definition of temperature.

Here the analogy Helmholtz Energy:Internal Energy::Hamiltonian:Lagrangian becomes quite clear. We think of Internal Energy as a function of entropy, and we think of Helmholtz Energy as a function of entropy’s conjugate variable, temperature. We think of Lagrangians as a function of velocity, and we think of Hamiltonians as a function of velocity’s conjugate variable, canonical momentum. Here are the important equations for the two relationships, side by side.

T = \frac{dU}{dS}

F = TS - U

p = \frac{dL}{dv}

H = pv - L
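To make the transform F(T) = \sup_S (ST - U(S)) concrete, here is a rough numerical sketch in Python. It assumes a toy convex energy U(S) = S^2/2 (not a physical model) and approximates the supremum by scanning a grid; the function names and grid bounds are illustrative choices. For this toy U the transform should come out close to T^2/2.

```python
def legendre_transform(u, t, s_min=-10.0, s_max=10.0, steps=20001):
    """Approximate sup over S of (S*t - u(S)) by scanning a uniform grid."""
    best = float("-inf")
    for k in range(steps):
        s = s_min + (s_max - s_min) * k / (steps - 1)
        best = max(best, s * t - u(s))
    return best

def u_toy(s):
    # Hypothetical convex "internal energy", chosen so dU/dS = S
    return 0.5 * s * s

for t in (0.5, 1.5, -2.0):
    print(t, legendre_transform(u_toy, t), 0.5 * t * t)
```

The maximizing S sits where T = dU/dS, matching the critical point condition above. A grid scan is crude, and only works when the maximizer lies inside the scanned range.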

The Einstein-Hilbert Action

R_{ij} - \frac{1}{2}Rg_{ij} + \Lambda g_{ij} = \frac{8\pi G}{c^4}T_{ij}

These are the Einstein field equations, a set of differential equations for the Riemannian Metric g of spacetime. R_{ij} is the Ricci curvature of spacetime and R is its scalar curvature. \Lambda is the cosmological constant.

T is the stress-energy tensor, a symmetric 2-tensor which describes the flux of energy and momentum through a surface in spacetime. Intuitively, T_{ij} = \frac{\partial S}{\partial g_{ij}}, so that T controls how the action S varies with the metric. As an example from Special Relativity, if you take a particle beam of density \rho_0 at rest in some frame K, and do a Lorentz transformation with Lorentz factor \gamma to frame K’, you end up with a beam of density \gamma\rho_0. So density by itself is not coordinate-invariant, and this is why we are interested in the stress-energy tensor. Flux is a density times a velocity, so the flux four-vector of the beam is \rho_0 v, where v is the four-velocity, and its stress-energy tensor is T^{ij} = \rho_0 v^i v^j. These are well-defined geometric objects because v is a four-vector and \rho_0, the density measured in the rest frame, is a scalar.

Riemannian metrics like g are normally required to be positive definite, or in other words have a signature of (n,0). But here we allow g to be a Lorentz metric with signature (n-1, 1). We will work in local coordinates rather than an orthonormal frame, so we can take advantage of the fact that the coordinate vector fields commute, [\partial_i, \partial_j] = 0.

Suppose we have a vacuum, so that T vanishes. Then the field equations give

R_{ij} - \frac{1}{2}Rg_{ij} + \Lambda g_{ij} = 0

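Contracting with g^{ij}, and using g^{ij}g_{ij} = n, gives

R - \frac{n}{2}R + n\Lambda = 0

Solving for R and substituting it back into the vacuum equations gives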

R_{ij} = \frac{2\Lambda}{n-2}g_{ij}

In particular, R = \frac{2n\Lambda}{n-2} is a constant over the manifold. Note that physically, the presence of the cosmological constant means that even if space had no matter or energy, it would still not be flat. In other words, even empty space would have some inherent energy in it. Physicists have for a while expected that the cosmological constant is related to the fact that a quantized electromagnetic field has a nonzero minimum energy as well. Unfortunately, the comparison doesn’t seem to work experimentally. Measured values of the cosmological constant put it at the vanishingly small value of 1.19\times 10^{-52} m^{-2}, with a correspondingly small energy density. The zero-point energy density from quantized electromagnetic fields is not nearly so small. Explaining why the cosmological constant is so small is in fact still an open problem in physics. One possible explanation is anthropic – it might be that if the cosmological constant took a different value, life could not form in the universe.

If you additionally make the approximation \Lambda = 0, then R_{ij} = 0, and so R = 0. This does not mean that the curvature tensor R_{ijkl} vanishes, however. Consider the Schwarzschild Solution to the field equations, which gives the metric of spacetime in the exterior of a spherically symmetric mass (like a star or black hole). If R_{ijkl} vanished, then spacetime with this metric would be locally isometric to flat R^n. Geodesics in flat R^n are straight lines. But the curved orbit of a planet around a star is a geodesic in spacetime, so there must be some R_{ijkl} that do not vanish in the Schwarzschild Solution.

On a manifold with metric g, we define the trace of a tensor T_{ij} to be \sum_{i,j} g^{ij}T_{ij}. We also define the traceless part of T_{ij} to be U_{ij} = T_{ij} - \frac{Tr(T)}{n}g_{ij}. If your metric is the identity, then this corresponds to taking U = T - \frac{Tr(T)}{n}I. It is a good exercise to verify that Tr(U) = 0.
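The definition is easy to try numerically. Below is a small Python sketch that assumes a flat Lorentz metric diag(-1, 1, 1, 1) and an arbitrary symmetric tensor with made-up components; the helper names are illustrative. It does not replace the exercise, but it shows the trace being taken with g^{ij} rather than by summing the diagonal.

```python
def trace_with_metric(t, g_inv):
    """Tr(T) = sum over i, j of g^{ij} T_{ij}."""
    n = len(t)
    return sum(g_inv[i][j] * t[i][j] for i in range(n) for j in range(n))

def traceless_part(t, g, g_inv):
    """U_{ij} = T_{ij} - (Tr(T) / n) g_{ij}."""
    n = len(t)
    tr = trace_with_metric(t, g_inv)
    return [[t[i][j] - tr / n * g[i][j] for j in range(n)] for i in range(n)]

# Assumed metric: diag(-1, 1, 1, 1), which is its own inverse
g = [[0.0] * 4 for _ in range(4)]
g[0][0] = -1.0
for i in range(1, 4):
    g[i][i] = 1.0
g_inv = g

# Arbitrary symmetric example tensor (made-up numbers)
t = [[1.0, 2.0, 0.0, 0.5],
     [2.0, 3.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, 2.5],
     [0.5, 0.0, 2.5, -2.0]]

u = traceless_part(t, g, g_inv)
print(trace_with_metric(t, g_inv), trace_with_metric(u, g_inv))
```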

Note that if the stress-energy tensor T_{ij} = \frac{\partial S}{\partial g_{ij}} is traceless, then a scaling of the metric given by g'_{ij} = ag_{ij} for some constant a leaves the action invariant.

In vacuum, the Einstein Field Equations require that the traceless part of R_{ij} vanishes, so that R_{ij} = \frac{R}{n}g_{ij}.

The interplay between physics and math in these equations is quite interesting. It turns out that the curvature tensor R_{abcd} can be decomposed as R_{abcd} = S_{abcd} + E_{abcd} + C_{abcd}, where S_{abcd} is the scalar part (built from R), E_{abcd} is the semi-traceless part (built from the traceless part of R_{ij}), and C_{abcd} is the fully traceless part.

Physicists call the traceless part of the curvature the Weyl Curvature – this is the component of gravity that can exist without the presence of matter and energy, and gives rise to gravitational waves. It turns out that if spacetime had fewer than four dimensions, the Weyl Curvature would vanish – there would be no gravitational waves!

Let’s try to answer some questions about this equation. First, why does only the Ricci Curvature appear, and not the full Riemannian Curvature tensor?

An Einstein Manifold satisfies the equation Ric = kg, where k is a constant. g is called the Einstein Metric. As one might expect, Einstein Manifolds are solutions to the vacuum Einstein Field Equations. It turns out that Einstein Metrics are critical points of a certain functional of the metric on 4-dimensional spacetimes, called the Hilbert Functional, and this is one explanation for why realistic spacetimes satisfy Ric = kg.

Exercise: Prove that on any 2-d Riemannian manifold, Ric = Kg, where K is the Gaussian curvature, so that a 2-d manifold is an Einstein Manifold exactly when K is constant.

Before diving into the Hilbert Functional, let’s remember a more basic application of the Least Action Principle. In classical mechanics, we give a dynamical system coordinates (q_i(t), \dot{q_i}(t), t), and we define the Lagrangian L(q_i(t), \dot{q_i}(t), t) to be such that

\frac{d}{dt}\frac{\partial L}{\partial \dot{q_j}} = \frac{\partial L}{\partial q_j}

To give some intuition from classical mechanics, if we let L = T - U where T is kinetic energy and U is potential energy, then we get \frac{\partial L}{\partial v} = p, while \frac{\partial L}{\partial x} = F. So the equations are saying that \dot{p} = F.

It is important which derivatives are partial and which are total. It turns out that if x_i(t) describes the evolution of the system, then x_i(t), \dot{x}_i(t) must be a critical point of the action S of the system, defined by S(L) = \int_{t_0}^{t_1} L(q_i, \dot{q_i}, t) dt. Note that if x_i is a critical point of S(L), then it is also a critical point of S(L') where L' = L + \frac{df(q_i, t)}{dt}, since the integral of \frac{df}{dt} is just f evaluated at the endpoints, which is the same constant for every path we allow. In particular, Lagrangians cannot be uniquely defined.

Maybe we should also mention what it means for x_i(t) to be a critical point of the action. In general this depends on what family of functions x_i(t), \dot{x_i}(t) we are allowing, but let’s say we’re allowing some family such that all the x_i(t) and \dot{x_i}(t) take the same value at the endpoints t_0, t_1. Then x_i(t), \dot{x_i}(t) is a critical point if the action of x_i(t) + \epsilon f_i(t), \dot{x_i}(t) + \epsilon \dot{f_i}(t) agrees with the action of x_i(t), \dot{x_i}(t) to first order in \epsilon, for every f_i, \dot{f_i} that is zero at the endpoints. The critical point need not be a minimum of the action.
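Here is a small numerical illustration in Python, under assumptions chosen purely for convenience: a unit-mass particle with L = \frac{1}{2}\dot{x}^2 - x (uniform gravity with g = 1), endpoints x(0) = x(1) = 0, and a midpoint-style discretization of the action. The path x(t) = t(1-t)/2 solves \ddot{x} = -1, so the first-order change of the action under a perturbation that vanishes at the endpoints should be essentially zero, while a path that does not solve the equation of motion should show a nonzero first-order change. All names are illustrative.

```python
def action(xs, ts):
    """Discretized action for L = 0.5 * v**2 - x, summed segment by segment."""
    total = 0.0
    for k in range(len(ts) - 1):
        h = ts[k + 1] - ts[k]
        v = (xs[k + 1] - xs[k]) / h          # velocity on the segment
        x_mid = 0.5 * (xs[k] + xs[k + 1])    # position at the segment midpoint
        total += (0.5 * v * v - x_mid) * h
    return total

n = 200
ts = [k / n for k in range(n + 1)]
true_path = [0.5 * t * (1 - t) for t in ts]   # solves x'' = -1 with zero endpoints
flat_path = [0.0 for _ in ts]                 # same endpoints, not a solution
bump = [t * t * (1 - t) for t in ts]          # perturbation, zero at both endpoints

def first_variation(xs, eps=1e-3):
    """Central-difference estimate of dS/d(eps) at eps = 0."""
    up = [x + eps * b for x, b in zip(xs, bump)]
    down = [x - eps * b for x, b in zip(xs, bump)]
    return (action(up, ts) - action(down, ts)) / (2 * eps)

print(first_variation(true_path), first_variation(flat_path))
```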

The Hilbert Functional is defined as the normalized total scalar curvature of the manifold – in other words, H(g)=Vol(g)^{-\frac{n-2}{n}}\int_M RdV. It maps each metric to a real number, and it is invariant under diffeomorphisms – H(\varphi^*g) = H(g). Hilbert showed that for dimension n\ge 3, the critical points of H are exactly the Einstein Metrics. Further, the normalization by volume makes H invariant under constant rescalings of g, and ensures that the metrics which are critical for H under conformal transformations are exactly the metrics of constant scalar curvature.

In the case n=2, we have H(g) = \int_M RdV. If we have a set of orthonormal vectors E_1, E_2, \ldots, E_n, then R = \sum_{j\neq k}K(E_j, E_k), where K is the sectional curvature. So in two dimensions, R = 2K(E_1, E_2) = 2K by Gauss’s Theorema Egregium, where K is the Gaussian Curvature. But by the Gauss-Bonnet Theorem, \int_M K dV = 2\pi \chi(M), where \chi denotes the Euler Characteristic. So H(g) = 4\pi\chi(M), and H does not give us much useful information about the metric for n = 2.

The Second Bianchi Identity for Riemannian Curvature states that

R_{abcd;e} + R_{abec;d} + R_{abde;c} = 0

(The semicolon denotes a covariant derivative with respect to the torsion-free connection). This is quite messy to prove, so we’ll take it for granted for now. This symmetry of the full curvature tensor also gives a symmetry involving the Ricci and Scalar curvatures, called the Contracted Second Bianchi Identity:

R_{;m} = 2R_{jm;}^j
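For readers who want to see where the factor of 2 comes from, here is a sketch of the contraction. It assumes the convention R_{bd} = g^{ac}R_{abcd} for the Ricci curvature (other conventions move signs around but give the same final identity), the usual antisymmetry of R_{abcd} in each index pair, and a metric-compatible connection, so that g^{ac} can be moved through the semicolon.

```latex
\begin{align*}
0 &= g^{ac}g^{bd}\left(R_{abcd;e} + R_{abec;d} + R_{abde;c}\right) \\
  &= R_{;e} - g^{bd}R_{be;d} - g^{ac}R_{ae;c} \\
  &= R_{;e} - 2R_{je;}{}^{j}
\end{align*}
```

The two minus signs come from swapping the last index pair (R_{abec} = -R_{abce}) and the first index pair (R_{abde} = -R_{bade}) before contracting.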

Unfortunately everyone seems to have their own favorite way of writing this identity, and translating between them takes a bit of work. We’ll see two other ways to write this by raising and lowering indices:

R_{;m} = 2R^i_{m;i}

g^{jm}R_{;m} = 2R^{ij}_{;i}

g^{ji}R_{;i} = 2R^{ij}_{;i}

This is one possible way to express it, and we can also show that the equation \sum_{j,k} g^{jk}(R_{ij} - \frac{1}{2}R(g)g_{ij})_{;k} = 0 is equivalent to the original Contracted Bianchi Identity:

2\sum_k R^k_{i;k} = \sum_k R_{;k}\delta^{k}_i

2R_{ki;}^k = R_{;i}

R_{;m} = 2R_{jm;}^j

The Einstein-Hilbert functional, which sends metrics on M to \mathbb{R}, is defined by A(g) = Vol(g)^{-\frac{n-2}{n}}\int_M R_g dV_g, where dV_g is the volume form corresponding to g.

KL-Divergence and the Fisher Information Metric

One of the most important ideas from Riemannian Geometry is that zeroth and second order metrics are equivalent. In other words, given a distance function for points on a manifold, you can find a Riemannian metric, and vice-versa. The relationship between the KL divergence and the Fisher Information Metric is a special case of this observation.

To show the general case, we derive a distance function from the Riemannian metric by defining

d(p, q) = \inf \{L(\gamma) | \gamma(0) = p, \gamma(1) = q\}

Where we take the infimum over piecewise smooth curves \gamma, and the length under a metric g is L(\gamma) = \int_0^1 \sqrt{g_{jk}\dot{\gamma}^j\dot{\gamma}^k} dt. It is intuitive that this distance function is symmetric, nonnegative and satisfies the triangle inequality.

Alternatively, we can get a Riemannian metric from a distance function by noting that f_p(q) = \frac{1}{2}d(p, q)^2 vanishes at q = p along with its first derivatives, and in particular will have Taylor Series in local coordinates centered at p

f_p(q) = \frac{1}{2}q^iq^jH_{ij}(p) + O(|q|^3)

Where H(p) is the Hessian of f_p at p. It seems natural, then, to define the Riemannian metric by g(p) = H(p). To show that this definition is smooth, we would just need to make use of the fact that d^2 is smooth in both of its arguments near the diagonal p = q.

Now that we’ve seen the principle in general, we can derive the relationship between the KL Divergence (zero order) and the Fisher Information Metric (second order).

Defined by KL(P, Q) = \int_{X} \log (\frac{P}{Q}) P, the KL Divergence is a measure of distance between probability distributions that are parametrized by some real-valued vector \theta.

The KL Divergence is always nonnegative, which requires a bit of work to show. One way is to work in the discrete case, so that instead of distributions P and Q we have sequences P_i and Q_i. Then we show that for j such that \frac{P_j}{Q_j} is as large as possible, \frac{\partial KL}{\partial Q_j} \ge 0, and \frac{\partial KL}{\partial Q_j} \le 0 when \frac{P_j}{Q_j} is as small as possible. However, the KL Divergence is not a true distance function, since in general KL(P, Q) \neq KL(Q, P). So you can’t get to the KL Divergence from the Fisher Information Metric. However, you can go the other way – taking the Hessian of the KL Divergence gives the Fisher Information Metric, which is a true Riemannian Metric.

Let P and Q be drawn from some underlying family of distributions R, so that P = R(\theta_1) and Q = R(\theta_0). Now we’ll set \theta_1 = \theta_0 + \triangle \theta, so that

P = Q + \triangle\theta^j\frac{dR}{d\theta^j}

Now we simply plug this into our KL-divergence formula to obtain

KL(P, Q) = \int_{X}(R + \triangle\theta^j\frac{dR}{d\theta^j})(\log(1 + \frac{\triangle\theta^j\frac{dR}{d\theta^j}}{R}))

Since we are only interested in the second order term, we can Taylor expand the log factor to obtain

KL(P, Q) = \int_{X}(R + \triangle\theta^j\frac{dR}{d\theta^j})(\frac{1}{R}\triangle\theta^j\frac{dR}{d\theta^j} - \frac{1}{2R^2}(\triangle\theta^j\frac{dR}{d\theta^j})^2)

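Now we can see that the second order term within the integral will be

\frac{1}{R}(\triangle\theta^j\frac{dR}{d\theta^j})^2 - \frac{1}{2R}(\triangle\theta^j\frac{dR}{d\theta^j})^2 = \frac{1}{2R}\triangle\theta^i\triangle\theta^j\frac{dR}{d\theta^i}\frac{dR}{d\theta^j}

(The first order term integrates to zero, since \int_X R = 1 for every \theta.)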
So that the Hessian H of the KL Divergence is

H_{ij} = \int_X \frac{1}{R}\frac{dR}{d\theta^i}\frac{dR}{d\theta^j}

This is indeed the Fisher Information Metric. The Fisher Information Metric is more commonly written in the form

H_{ij} = \int_X \frac{d\log R}{d\theta^i}\frac{d\log R}{d\theta^j}R

But it is easy to see that this is equivalent to what we have obtained.
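A quick numerical check of this relationship, sketched in Python for the simplest family I could think of: Bernoulli distributions with parameter \theta (an assumption made for illustration; the function names are also made up). The second derivative of the KL Divergence in \triangle\theta, estimated by finite differences, should agree with the integral formula for H, which for a Bernoulli family is \frac{1}{\theta} + \frac{1}{1 - \theta}.

```python
import math

def kl_bernoulli(p, q):
    """KL(P, Q) for Bernoulli distributions with success probabilities p and q."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_hessian(theta, delta=1e-3):
    """Second difference of KL(R(theta + d), R(theta)) in d; KL is zero at d = 0."""
    return (kl_bernoulli(theta + delta, theta)
            + kl_bernoulli(theta - delta, theta)) / delta ** 2

def fisher_bernoulli(theta):
    """Sum over outcomes of (1/R) (dR/dtheta)^2, with R(1) = theta, R(0) = 1 - theta."""
    return 1.0 / theta + 1.0 / (1.0 - theta)

for theta in (0.2, 0.3, 0.5):
    print(theta, kl_hessian(theta), fisher_bernoulli(theta))
```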

It is important to note that if Q is a model distribution and P is an empirical distribution we are trying to fit, then minimizing the KL divergence corresponds to finding the maximum likelihood Q model.

It may feel a bit unsatisfying to use the KL-divergence, since it’s not a true distance function. It turns out that there’s an easy way to symmetrize the KL-divergence, called the Jensen-Shannon Divergence, which is defined by

J(P, Q) = \frac{1}{2}KL(P, M) + \frac{1}{2}KL(Q,M)

Where M = \frac{1}{2}(P + Q).
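The asymmetry of the KL Divergence and the symmetry of the Jensen-Shannon Divergence are easy to see on a small discrete example. The Python sketch below uses two made-up distributions over three outcomes; the function names are illustrative.

```python
import math

def kl(p, q):
    """Discrete KL(P, Q) = sum of P_i * log(P_i / Q_i), skipping zero-probability terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon Divergence with M = (P + Q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.3, 0.2]   # made-up example distributions
q = [0.1, 0.6, 0.3]

print(kl(p, q), kl(q, p))   # two different nonnegative numbers
print(js(p, q), js(q, p))   # the same number twice
```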

Lie Groups

A Lie Group G is a smooth manifold endowed with a group structure on all its points such that the multiplication map m(g, h) = gh and inversion map i(g) = g^{-1} are both smooth. For example, you can think of the points on a circle as forming a Lie Group – the group operation would just add the angles of any two points, and we would say that the group is parametrized by the rotation angles. The General Linear group is also a Lie Group (what does its group operation represent?).

Lie groups can transform other manifolds, even manifolds that themselves do not have a group structure. Think of manifolds in the X-Y plane, for example. A 2D blob can be quite messy and lack a group structure, but you could still use the Lie Group of the circle to transform its points by rotation.

Formally, if G is a Lie Group and M a smooth manifold, a left action of G on M is a map G\times M\to M by (g, p)\to g\cdot p satisfying g_e \cdot p = p and g_1\cdot (g_2\cdot p) = (g_1g_2)\cdot p. A Lie Group action is said to be effective if only g_e, the identity element of G, acts as the identity on M.

A homogeneous space for a Lie Group G is a manifold on which G acts transitively.

A vector field X on a Lie Group G is left-invariant if L_{g*}X(x) = X(gx).

The Lie Algebra Lie(G) of a Lie Group G is the set of smooth, left-invariant vector fields on G.

Exercise: Prove that Lie(G) is isomorphic to T_eG, so that dim(G) = dim(Lie(G)).

A Lie Group is abelian if it is commutative as a group. The intuition for abelian Lie Groups is that commutativity forces the Lie Bracket on the Lie Algebra to be zero, and that curvature tends to mess with commutativity. The most natural abelian Lie Groups are all flat (have curvature everywhere zero). For example, all connected and abelian Lie Groups are isomorphic to (S^1)^m\times R^n.

The radical of a Lie Group is the maximal connected normal solvable subgroup of the group.

It turns out that Lie Algebra is applied all over the place in quantum mechanics, which I never realized until I studied Lie Algebra. This makes sense, because quantum mechanics has to talk about non-commutativity in things like the Heisenberg Uncertainty Principle, and the Lie Bracket is the natural tool for studying non-commutativity. You also have to use a bit of representation theory to rotate quantum systems, since classical rotations don’t always do what you want – rotating a spin-1/2 particle by 2\pi multiplies its state by -1, for example.

While studying QM, I heard it said that “hermitian operators generate unitary operators”. Operators, by the way, mean linear operators on a complex Hilbert Space, which we’ll just take to mean complex-valued matrices for now. If we have a unitary operator that is very slightly different from I, we can write it as I + i\lambda H, where H is some operator, and \lambda\in R is so small that \lambda^2 = 0. Writing A^\dagger for the conjugate transpose of A, unitarity gives us

(I+i\lambda H)^\dagger = (I+i\lambda H)^{-1}

I - i\lambda H^\dagger = I - i\lambda H

H^\dagger = H

So infinitesimal unitary operators differ from the identity by a hermitian operator times some constants. That’s the sense in which hermitian operators generate unitary operators.

In other words, the Lie Group of unitary operators is generated by the Lie Algebra of hermitian operators. That’s a pretty good reason to think of hermitian matrices as interacting additively, and unitary matrices as interacting multiplicatively.
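To see the finite (not just infinitesimal) version of this, here is a small Python sketch that exponentiates i times a hermitian 2x2 matrix using a truncated power series and checks that the result is unitary. The matrix entries are made up, the helper names are illustrative, and the truncated series is only a reasonable approximation for matrices of modest size and norm.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(a):
    """Conjugate transpose."""
    n = len(a)
    return [[a[j][i].conjugate() for j in range(n)] for i in range(n)]

def mat_exp(a, terms=60):
    """Truncated power series: sum of a^k / k! for k < terms."""
    n = len(a)
    result = [[complex(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = mat_mul(term, a)
        term = [[x / k for x in row] for row in term]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# A made-up hermitian matrix: equal to its own conjugate transpose
h = [[1.0 + 0j, 2.0 - 1.0j],
     [2.0 + 1.0j, -0.5 + 0j]]
i_h = [[1j * x for x in row] for row in h]

u = mat_exp(i_h)
product = mat_mul(u, dagger(u))   # should be close to the identity
print(product)
```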

Exercise 1: Check that the sum of hermitian matrices is hermitian, but not necessarily the product. Check that the product of unitary matrices is unitary, but not necessarily the sum.

Exercise 2: Figure out what the Lie Algebra for the Lie Group of invertible matrices should be.

Exercise 3: Prove that for a matrix A, we have det(\exp(A)) = e^{Tr(A)}

It turns out that Lie Algebras and Lie Groups are related by an exponential map. There are a few signs pointing in this direction. The domain of an exponential map is additive, while the range is multiplicative, just like a Lie Algebra is additive, while a Lie Group is multiplicative. From exercise 3, if \exp(A) has determinant 1, then e^{Tr(A)} = 1, which for A near zero forces Tr(A) = 0. This corresponds to the fact that the special unitary matrices, a Lie Group of determinant 1, have a traceless Lie Algebra.

Exercise 4: Prove that if A and B are matrices, then Tr([A, B]) = 0


Riemannian Geometry: Coordinate Invariant Differentiation

If you need a refresher on any differential geometry definitions, see the Differential Geometry Formalism section below.

Connections are a kind of differentiation operator that were introduced to give a coordinate-invariant definition of acceleration. In particular, connections give us a way to make the “zero acceleration” definition of geodesics coordinate-free. To see why this might be necessary, notice that in polar coordinates, the curve r=2 has constant coordinate velocity, so its naive acceleration is zero. But in R^2, a circle with radius 2 is clearly not a geodesic. The naive definitions of geodesics (length minimizing curves, or curves with zero acceleration) become inconsistent if you have to deal with different coordinate systems.

Let E be a vector bundle over M. For a smooth section X of the tangent bundle and smooth section Y of E, the connection \nabla_XY is tensorial in X and a derivation in Y. A connection is called linear if E is the tangent bundle. Every linear connection and smooth curve \gamma: I\to M determines a unique operator D_t that maps the space of smooth vector fields along \gamma to itself. If a vector field V along \gamma can be extended a little beyond \gamma to a vector field V', then D_tV = \nabla_{\dot{\gamma}(t)}V'. Just as \nabla_XY acts as a derivation in Y, D_t acts as a derivation in V. D_t is the way we define differentiation along curves – we say that \gamma is a geodesic, a zero-acceleration curve, if D_t \dot{\gamma} = 0.

Connections allow us to define the covariant derivative \nabla w of a differential form w as \nabla w(V_1,\ldots, V_n, X) = Xw(V_1,\ldots,V_n) - \sum_i w(V_1,\ldots,\nabla_XV_i,\ldots,V_n)

Note that the X(w(\ldots)) term is well-formed, since w(\ldots) is just a member of C^{\infty}(M) – each V_i takes a new value at each point of the manifold, so w(\ldots) takes a new real value at each point of the manifold. You can think of the subtracted terms on the right hand side as the correction terms that make the derivative coordinate invariant. They account for the fact that it’s not just the vector fields that change when you move a little bit on the manifold – the manifold itself will change.

A differential form w is parallel if \nabla w = 0.

A connection on a Riemannian Manifold (M, g) is called metric-compatible if X(g(Y, Z)) = g(\nabla_XY, Z) + g(Y, \nabla_XZ)

Recall that the metric gives an inner product on the tangent space, g(X, Y) = <X, Y>. This in turn induces a metric on \Lambda^k(TM) – if a = A^1\wedge\ldots\wedge A^k and b = B^1\wedge\ldots\wedge B^k, then <a, b> = det(<A^i, B^j>)

On a local frame E_1\ldots E_n, define the Christoffel Symbols \Gamma^k_{ij} by \nabla_{E_i}E_j = \Gamma^k_{ij}E_k.

We can write the Christoffel Symbols in a coordinate frame in terms of the metric as \Gamma^k_{ij} = \frac{1}{2}g^{kl}(g_{il,j} + g_{jl,i} - g_{ij,l}), where a comma denotes a partial derivative. This shows that the Christoffel Symbols are scale-invariant – \Gamma^k_{ij}(\lambda g) = \Gamma^k_{ij}(g) if \lambda is a positive constant. It follows that the Levi-Civita Connection is scale-invariant as well.
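Scale invariance can be checked numerically. The Python sketch below uses the standard coordinate formula \Gamma^k_{ij} = \frac{1}{2}g^{kl}(\partial_j g_{il} + \partial_i g_{jl} - \partial_l g_{ij}) with partial derivatives estimated by central differences, and takes the flat plane in polar coordinates, g = diag(1, r^2), as an assumed example. The function names and the step size are illustrative choices.

```python
def polar_metric(r, theta):
    """Flat plane in polar coordinates (r, theta)."""
    return [[1.0, 0.0], [0.0, r * r]]

def scaled_metric(r, theta):
    """The same metric multiplied by the positive constant 3."""
    return [[3.0 * x for x in row] for row in polar_metric(r, theta)]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def christoffel(metric, point, h=1e-5):
    """Return gamma[k][i][j] = Gamma^k_{ij} at the given 2-d point."""
    n = 2
    dg = []  # dg[l][i][j] = partial derivative of g_ij along coordinate l
    for l in range(n):
        plus, minus = list(point), list(point)
        plus[l] += h
        minus[l] -= h
        gp, gm = metric(*plus), metric(*minus)
        dg.append([[(gp[i][j] - gm[i][j]) / (2 * h) for j in range(n)]
                   for i in range(n)])
    g_inv = inverse_2x2(metric(*point))
    return [[[0.5 * sum(g_inv[k][l] * (dg[j][i][l] + dg[i][j][l] - dg[l][i][j])
                        for l in range(n))
              for j in range(n)] for i in range(n)] for k in range(n)]

gamma = christoffel(polar_metric, (2.0, 0.7))
gamma_scaled = christoffel(scaled_metric, (2.0, 0.7))
print(gamma[0][1][1], gamma[1][0][1])   # Gamma^r_{theta theta}, Gamma^theta_{r theta}
```

At r = 2 the two printed symbols should be close to -r = -2 and 1/r = 0.5, and the scaled metric should give the same values.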

Let M be an immersed submanifold of (N, g) with immersion map f:M\to N. We can essentially use g to define a metric g_M on M, called the induced metric, such that g_M(X, Y) = g(f_*X, f_*Y). Note that with the induced metric, the tangent space of the submanifold takes its structure from the tangent space of the parent manifold, so the first order properties of the two manifolds are all the same. The second-order properties, like curvature, can still be quite different. The sphere S^n is an immersed submanifold of R^{n+1}, for example.

Suppose U\subset R^n is open. If X:U\to R^m is a local parametrization of some submanifold M\subset R^m, the induced metric on U in local coordinates (u^1, \ldots, u^n) is g = \sum_{i = 1}^m (dX^i)^2 = \sum_{i=1}^m (\frac{\partial X^i}{\partial u^j}du^j)^2.
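In components, this formula says g_{jk} = \sum_i \frac{\partial X^i}{\partial u^j}\frac{\partial X^i}{\partial u^k}. As an illustration, the Python sketch below estimates these components by finite differences for an assumed example, the usual spherical-angle parametrization of the unit sphere in R^3, where the expected answer is diag(1, \sin^2\theta). The names and step size are illustrative.

```python
import math

def sphere(theta, phi):
    """Parametrization X: U -> R^3 of the unit sphere by spherical angles."""
    return [math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta)]

def induced_metric(param, u, h=1e-6):
    """g_jk = sum over i of (dX^i/du^j)(dX^i/du^k), via central differences."""
    n = len(u)
    partials = []  # partials[j][i] = dX^i / du^j
    for j in range(n):
        plus, minus = list(u), list(u)
        plus[j] += h
        minus[j] -= h
        xp, xm = param(*plus), param(*minus)
        partials.append([(a - b) / (2 * h) for a, b in zip(xp, xm)])
    return [[sum(a * b for a, b in zip(partials[j], partials[k]))
             for k in range(n)] for j in range(n)]

g_sphere = induced_metric(sphere, (1.0, 0.3))
print(g_sphere, math.sin(1.0) ** 2)
```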

The length L of a curve \gamma with domain [a,b] is L(\gamma) = \int_a^b \sqrt{g(\dot{\gamma}(t), \dot{\gamma}(t))} dt.

A space M is called connected if it cannot be written as the disjoint union of two nonempty open sets. A connected manifold is also path-connected, meaning that for any two points in M there is a continuous function f:R\to M with both points in its range.

We can define a distance metric d between two points on a manifold as the infimum of the lengths of the piecewise regular curves connecting the points.

A minimizing geodesic \gamma :I\to M is such that for any t_0, t_1 \in I, we have d(\gamma(t_0), \gamma(t_1)) = |t_0 - t_1|.

A maximal geodesic is one which cannot be extended to have a larger domain.

A Riemannian manifold is geodesically complete if every maximal geodesic is defined over all of R. Note that geodesic completeness does not imply connectedness – the disjoint union of two copies of S^1 is geodesically complete since each copy is geodesically complete, but it is not connected.

Hopf-Rinow Theorem: A connected manifold (M,g) is geodesically complete if and only if it is complete as a metric space.

Since geodesic and metric completeness are equivalent, we combine them into one property called completeness. It is easier to answer global questions about a manifold that is complete.

Fundamental Lemma of Riemannian Geometry: If (M, g) is a Riemannian manifold, there is a unique linear connection that is compatible with g and torsion free.

The Exponential Map

If you specify a point on a manifold and an initial velocity, you uniquely determine a maximal geodesic (maximal meaning containing as many points as possible). Very conveniently, a point and an initial velocity is just an element V of the tangent bundle. This allows us to define the exponential map: if \gamma_V is the geodesic associated with V, then

exp(V) = \gamma_V(1)

That we evaluate the geodesics at 1 is an arbitrary choice, since for t\in R, \gamma_{tV}(1) = \gamma_V(t).

The important point here is that given a p\in M, the exponential map exp_p exists in every direction – every direction is the start of some nonzero-length zero-acceleration curve. More formally,

Normal Neighborhood Lemma: For any p\in M, there is a neighborhood V of the origin in T_pM and a neighborhood U of p such that exp_p: V\to U is a diffeomorphism. (Such neighborhoods U are called normal neighborhoods).

If exp_p is a diffeomorphism on the ball B_r(0)\subset T_pM for some r>0 (where distance is defined by g), the image is called a geodesic ball. The normal neighborhood lemma shows that we can construct a small geodesic ball around any point on a manifold.

Exercise: Prove that if \varphi is an isometry and p\in M has a geodesic ball of radius r, then so does \varphi(p).

If E_1\ldots E_n is an orthonormal basis for T_pM, then we can define an isomorphism E:R^n\to T_pM by E(x) = x^iE_i. Pairing this with the exponential map turns a normal neighborhood U centered at p into a coordinate chart, with \varphi = E^{-1}\circ exp_p^{-1}: U\to R^n.

The name “exp” is not coincidental – the exponential map is studied more generally in Lie Algebra, where the connection to the usual exponential map exp:R\to C by x\to exp(ix) is more clear. Here, if M is geodesically complete, we can give the set of geodesics passing through p\in M a group structure by identifying each geodesic with its initial velocity, and letting the group operation be addition on T_pM. Then exp(aV+bW) = exp(aV)\circ exp(bW), which looks familiar.

Given a vector field V on manifold M, an integral curve is a smooth curve \gamma :R\to M such that \dot{\gamma}(t) = V_{\gamma(t)}. Integral curves allow us to define an interesting map \theta: R\times M\to M called a flow by defining \theta^p to be an integral curve starting at p for each p\in M, and letting \theta(t, p) = \theta^p(t).

On Euclidean hypersurfaces, the shape operator s is defined such that s\partial_i = -\partial_iN, where N is a unit normal vector field to the surface. The principal curvatures are the eigenvalues of the shape operator, the gaussian curvature is the determinant of the shape operator, and the mean curvature is the mean of the principal curvatures.

Two metrics g_1 and g_2 are said to be conformal to each other if there is some positive function f\in C^{\infty}(M) such that g_1 = fg_2. g_1 and g_2 are said to be conformally equivalent if there is a diffeomorphism \varphi such that \varphi^*g_1 is conformal to g_2.

The Laplacian is given by \Delta f = \nabla\cdot\nabla f = Tr(Hess(f)), where Hess(f) is the Hessian of f.

Differential Geometry Formalism

What follows are the definitions you need to go straight from real analysis to differential geometry.

Let P be a set of points, and S a set of subsets of P. The pair (P, S) is a topological space if S satisfies the following:

(a) It contains P and the empty set

(b) It is closed under finite intersections

(c) It is closed under arbitrary unions
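For a finite set of points these three conditions can be checked mechanically. Here is a small Python sketch; the function name and the example collections are made up. It relies on the fact that for a finite collection, closure under pairwise intersections and unions already implies closure under all finite intersections and all unions.

```python
from itertools import combinations

def is_topology(points, opens):
    """Check conditions (a), (b), (c) for a finite collection of subsets."""
    points = frozenset(points)
    opens = {frozenset(s) for s in opens}
    # (a) contains the whole set and the empty set
    if points not in opens or frozenset() not in opens:
        return False
    # (b) and (c): pairwise closure suffices because the collection is finite
    for a, b in combinations(opens, 2):
        if (a & b) not in opens or (a | b) not in opens:
            return False
    return True

nested = [set(), {1}, {1, 2}, {1, 2, 3}]
broken = [set(), {1}, {2}, {1, 2, 3}]      # missing the union {1, 2}
print(is_topology({1, 2, 3}, nested), is_topology({1, 2, 3}, broken))
```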

A homeomorphism is a continuous bijection between topological spaces that has a continuous inverse. Let U and V be topological spaces, f:U\to V be a homeomorphism, and let \{u_i\} and \{v_i\} be the open sets of the respective topological spaces. Then f^{-1}(f(u_i)) = u_i since f is bijective. Also,

f^{-1}(f(u_1 \cap u_2)) = \{x : f(x) \in f(u_1) and f(x) \in f(u_2)\} = u_1 \cap u_2

Similar reasoning gives f^{-1}(f(u_1 \cup u_2)) = u_1 \cup u_2

f is a bijection, and the topological operations, intersections and unions, act the same way in V and U. This means that f preserves the topological properties of U.

Exercise: Convince yourself that the interior of the n-dimensional ball is homeomorphic to R^n, but the closed n-dimensional ball itself is not.

A chart is a homeomorphism \phi from an open subset U of M to an open subset of R^n. The chart is often written as the pair (U, \phi).

An atlas is a collection of charts (U_i, \phi_i) such that \bigcup_i U_i = M.

Topological manifolds of dimension n are topological spaces that additionally satisfy the following properties:

(a) M is a Hausdorff Space: Every pair of points p and q in M can be separated into disjoint neighborhoods.

(b) M is second-countable: There is a countable basis for M’s topology

(c) Locally, M “looks” like R^n. More specifically, every point p \in M has a neighborhood U that is homeomorphic to R^n

We will need some additional structure to talk about differentiation on manifolds, because homeomorphisms do not preserve differential properties. For example, the function f(u,v) = (u^{\frac{1}{3}}, v^{\frac{1}{3}}) is a homeomorphism, and g(x,y) = x is differentiable, but g \circ f is not differentiable at the origin. Differentiability is easy to define for euclidean space, so we will try to extend that definition to arbitrary manifolds.
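The failure of differentiability in this example is easy to see numerically. In the Python sketch below (function names are illustrative), the difference quotient of g \circ f at the origin along the u direction is h^{-2/3}, which grows without bound as h shrinks instead of settling toward a limit.

```python
import math

def cbrt(x):
    """Real cube root, valid for negative inputs as well."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

def g_after_f(u, v):
    """g(f(u, v)) with f(u, v) = (cbrt(u), cbrt(v)) and g(x, y) = x."""
    return cbrt(u)

def difference_quotient(h):
    """Slope of g after f between the origin and (h, 0)."""
    return (g_after_f(h, 0.0) - g_after_f(0.0, 0.0)) / h

for h in (1e-3, 1e-6, 1e-9):
    print(h, difference_quotient(h))
```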

Let (U, g) and (V, f) be two charts of a manifold. Consider the transition map g \circ f^{-1}, a homeomorphism from f(U \cap V) to g(U \cap V). Its domain and range are both open subsets of R^n, so it makes sense to talk about whether it is smooth. We say that the two charts are smoothly compatible if either U \cap V = \emptyset or both g \circ f^{-1} and f \circ g^{-1} are smooth. Finally, we define a smooth manifold to be a topological manifold together with an atlas in which any two charts are smoothly compatible.

A smooth function f: M\to R^k has the property that for every chart (U, \phi) on M, f\circ \phi^{-1} is smooth.

A smooth map f: M \to N has the property that for every pair of charts (U, \phi) of M and (V, \psi) of N, \psi \circ f \circ \phi^{-1} is smooth wherever it is defined.

It can be proven that these conditions need only be checked for the charts of one particular smooth atlas, rather than for every possible chart.

A smooth curve in M is a smooth map \gamma: J \to M where J is an interval in R.

If \gamma is a smooth curve, we define the tangent vector to \gamma at t_0\in J to be \gamma'(t_0) = \gamma_*\frac{d}{dt}(t_0).

A diffeomorphism is a smooth bijection between two smooth manifolds that also has a smooth inverse. You may notice a similarity in the definitions of diffeomorphisms and homeomorphisms – just interchange “smooth manifold” with “topological space” and “smooth” with “continuous”.

Note that the maps g\circ f^{-1} from a smooth atlas of a smooth manifold are diffeomorphisms, since both g\circ f^{-1} and f\circ g^{-1} must be smooth.

If we want to apply our definition of a manifold to spaces with boundaries, we run into a slight issue. For example, if you take a point on the boundary of a closed ball, the neighborhoods of that point do not look like \mathbb{R}^n; they look like one half of \mathbb{R}^n, such as the set \mathbb{R}^{n+} = \{(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n | x_n \ge 0\}. So we generalize our definition of smooth manifolds to that of smooth manifolds-with-boundary, defined such that all interior points must still have a neighborhood diffeomorphic to \mathbb{R}^n, but points on the boundary must have a neighborhood diffeomorphic to \mathbb{R}^{n+}.

Let’s define the tangent space T_pM at a point p of manifold M. A derivation at p is a linear map D:C^{\infty}(M)\to R such that D(fg) = f(p)D(g) + g(p)D(f). T_pM is the vector space of all derivations at p, and we call members of T_pM tangent vectors. Two tangent vectors X and Y in T_pM are orthogonal with respect to a Riemannian metric g (defined below) if g(X, Y) = 0. See here for a more thorough introduction to the tangent space.

The Inverse Function Theorem states that if F:M\to N is a smooth map between manifolds such that dF_p: T_pM\to T_{F(p)}N is a linear isomorphism at p\in M, then there exists an open neighborhood of p on which F is a diffeomorphism onto its image. Intuitively, if dF is an isomorphism at a point, it must be an isomorphism over a neighborhood, and near p the map F behaves like its invertible linear approximation dF_p.

The Inverse Function Theorem is actually a special case of an even more useful theorem, the Constant Rank Theorem. Suppose F:M\to N is a smooth map with constant rank m. Then around each point there exists a choice of local coordinates (x_1, x_2,\ldots x_n) on M and (v_1, v_2, \ldots, v_k) on N such that F(x_1, x_2, \ldots, x_n) = (x_1, x_2, \ldots, x_m, 0, \ldots, 0).

The tangent bundle TM is the disjoint union of the tangent spaces T_pM over all p\in M, that is, the set of all tangent vectors in a manifold.

Let M and N be differentiable manifolds and F a differentiable map between them. Then F is a submersion if at every point p, F_{p*} is a surjective map from T_pM to T_{F(p)}N. This definition implies dim(M)\geq dim(N), so we can imagine intuitively that a submersion looks locally like a projection from a higher-dimensional vector space to a lower dimensional vector space.

If F_{p*}:T_pM \to T_{F(p)}N is injective for all p, then F is called an immersion. Equivalently, F is an immersion if rank F_* = dim M at every point. F is called proper if for every compact subset A of F(M), \{q\in M | F(q)\in A\} is compact. An embedding is a proper, injective immersion. An alternative definition is that an embedding is an injective immersion that is also a homeomorphism onto its image F(M), where F(M) carries the subspace topology from N.

Example: If \gamma is a smooth curve, \gamma is an immersion if and only if \gamma'(t) \neq 0 for all t.

Proof: The tangent space T_tJ is one-dimensional and spanned by \frac{d}{dt}(t), so \gamma_{*} fails to be injective at t if and only if \gamma_{*}\frac{d}{dt}(t) = 0. Since \gamma'(t) = \gamma_* \frac{d}{dt}(t), this happens if and only if \gamma'(t) = 0. Hence \gamma is an immersion if and only if \gamma'(t) \neq 0 for all t.
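To make the criterion concrete, here is a small sketch comparing two plane curves chosen for illustration (they are not from the discussion above): the cusp \gamma(t) = (t^2, t^3), whose velocity vanishes at t = 0, and the unit circle \gamma(t) = (\cos t, \sin t), whose velocity never vanishes. The velocities are written out by hand, and sampling finitely many values of t only illustrates the condition rather than proving it.

```python
import math

def cusp_velocity(t):
    # gamma(t) = (t**2, t**3)  =>  gamma'(t) = (2t, 3t**2)
    return (2 * t, 3 * t ** 2)

def circle_velocity(t):
    # gamma(t) = (cos t, sin t)  =>  gamma'(t) = (-sin t, cos t)
    return (-math.sin(t), math.cos(t))

def speed(velocity):
    """Euclidean length of a velocity vector in R^2."""
    return math.hypot(*velocity)

samples = [k / 10 for k in range(-10, 11)]  # t in [-1, 1], including t = 0

cusp_min = min(speed(cusp_velocity(t)) for t in samples)
circle_min = min(speed(circle_velocity(t)) for t in samples)

print(cusp_min, circle_min)
```

The cusp has zero speed at t = 0, so it is not an immersion there, while the circle has speed 1 at every sampled point.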

A vector field X on a manifold M is a smooth map from M to its tangent bundle such that X(p) \in T_p(M) for every p \in M.

A Riemannian Metric on a smooth manifold is a smooth, positive-definite, symmetric 2-tensor field, that is, a symmetric bilinear form on each tangent space that varies smoothly from point to point. A manifold M paired with its metric g is often denoted (M, g). The smoothness condition here means that if X and Y are smooth, locally defined vector fields, then the map p\to g(X_p, Y_p) must be smooth. If one relaxes the positive-definite condition to mere nondegeneracy, one gets a Pseudo-Riemannian Metric on the manifold. The Lorentz Metric is pseudo-Riemannian, while the Euclidean Metric is Riemannian.

An isometry from (M, g) to (M', g') is a diffeomorphism \varphi: M \to M' such that g(X, Y) = (\varphi^*g')(X, Y) = g'(\varphi_*X, \varphi_*Y) for any two vector fields X, Y on M. Just as homeomorphisms preserve topological structure and diffeomorphisms preserve smooth structure, isometries preserve geometric structure.

A local parametrization of a submanifold S of a manifold M is a smooth embedding X:U\to M for some open U \subset R^n whose image is an open subset of S.

A vector bundle of rank k over a manifold M is a smooth manifold E paired with a smooth, surjective map \pi:E\to M such that for each p\in M,

(a) E_p = \pi^{-1}(p) is a vector space. E_p is called the fiber of E over p.

(b) There is a neighborhood U of p and a diffeomorphism \phi : \pi^{-1}(U)\to U\times R^k such that \pi_1 \circ \phi = \pi on \pi^{-1}(U), where \pi_1 is the projection onto the first factor. Also, for each q \in U, \phi restricted to E_q must be a linear isomorphism that takes it to \{q\}\times R^k, a copy of R^k.

This precise definition from John Lee’s Smooth Manifolds is a bit convoluted. Intuitively, we are assigning a vector space to each p\in M, and we can choose a neighborhood U of p such that the vector spaces inside aren’t too different, since there is a diffeomorphism that takes them to U\times R^k. The cylinder S^1\times R (which is diffeomorphic to S^1\times (0,1)) is a so-called trivial bundle, where each point in S^1 is assigned the same vector space. The open Möbius strip is a nontrivial bundle over S^1, since the fiber comes back flipped after one trip around the circle.

Given a manifold M, a vector field V on M and a point p\in M, a flow y_p is a curve defined by y_p(0) = p and y_p'(t) = V(y_p(t)) for t taking values in some open interval I\subset R containing 0. Such a curve is also called an integral curve of V through p.
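A flow curve can be approximated numerically by integrating the defining equation y_p'(t) = V(y_p(t)). The sketch below assumes M = R^2 and uses the illustrative field V(x, y) = (-y, x), for which the exact curve through p = (1, 0) is (\cos t, \sin t). The classical Runge-Kutta scheme is one possible integrator, chosen here for convenience.

```python
import math

def V(point):
    """Illustrative vector field V(x, y) = (-y, x) on R^2."""
    x, y = point
    return (-y, x)

def shifted(point, direction, scale):
    """Return point + scale * direction, componentwise."""
    return (point[0] + scale * direction[0], point[1] + scale * direction[1])

def rk4_step(point, dt):
    """One classical Runge-Kutta step for y'(t) = V(y(t))."""
    k1 = V(point)
    k2 = V(shifted(point, k1, dt / 2))
    k3 = V(shifted(point, k2, dt / 2))
    k4 = V(shifted(point, k3, dt))
    return (
        point[0] + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6,
        point[1] + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6,
    )

def integral_curve(p, t_end, steps=1000):
    """Approximate y_p(t_end), starting from y_p(0) = p."""
    point = p
    dt = t_end / steps
    for _ in range(steps):
        point = rk4_step(point, dt)
    return point

end = integral_curve((1.0, 0.0), math.pi / 2)
print(end)  # close to (cos(pi/2), sin(pi/2)) = (0, 1)
```

Following the curve for time \pi/2 rotates the starting point a quarter turn around the origin, matching the exact solution.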

The tensor bundle of type (k, l) on M is defined as the disjoint union \bigsqcup_{p\in M} T^k_l(T_pM). Every tensor bundle is a vector bundle, since we can simply project each tensor onto its point in the manifold. A tensor field is a smooth section of a tensor bundle.

Given a smooth function f, we define its exterior derivative df by df(X) = Xf, where X is a vector field.

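If M and N are smooth manifolds and F: M\to N is a smooth map, the push-forward F_* associated with F is, at each point p, the differential dF_p: T_pM \to T_{F(p)}N. When F is a diffeomorphism, F_* carries a vector field X on M to a vector field on N, defined such that (F_*X)_q = dF_{F^{-1}(q)}(X_{F^{-1}(q)}).

If F:M\to N is a smooth map, then the rank of F at p is the rank of the linear map F_*: T_pM \to T_{F(p)}N.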

Given a push-forward F_*, we can also define its dual F^*, called the pullback. Pullbacks allow us to transport differential forms from N back to M: for any n-form X on N, we can define (F^*X)(E_1,\ldots, E_n) = X(F_*E_1,\ldots, F_*E_n), where E_1, \ldots, E_n are tangent vectors at a point of M.

A local frame for a vector bundle (E, \pi) of rank k over a manifold M is an ordered tuple of smooth sections E_1, E_2, \ldots, E_k of E defined on some open subset U of M such that for each p\in U, the E_{ip} are linearly independent and span the fiber E_p.

A topological group is a topological space whose points have a group structure for which multiplication and inversion are continuous. The unit circle S^1 is an example of a topological group, where the group operation simply adds the angles of two points.

An n-dimensional manifold M is called parallelizable if there exist vector fields V_1, V_2, \ldots, V_n globally defined on M that form a basis for the tangent space at any given p\in M. Obviously, all one-dimensional manifolds are parallelizable. In fact, every topological group that is also a manifold (that is, every Lie group) is parallelizable, but the converse is not true: the 7-dimensional sphere S^7 is parallelizable, but is not a topological group (In fact, S^0, S^1 and S^3 are the only spheres which are topological groups; see here for a full proof). The hairy-ball theorem says that no even-dimensional sphere of positive dimension has a continuous tangent vector field that is non-vanishing. The usual proof of the hairy-ball theorem requires the Euler characteristic, a tool from algebraic topology.

Exercise: Use the hairy-ball theorem to convince yourself that even-dimensional spheres of positive dimension are not parallelizable, and thus are not topological groups.

Exercise: Show that the product of parallelizable manifolds is parallelizable, proving that S^2 is not diffeomorphic to S^1\times S^1.

In fact, S^1\times S^1 is the 2-torus, and S^2 = (S^1\times S^1) / (S^1\vee S^1), where \vee denotes the wedge sum of topological spaces. We can see how this makes sense using cell-decompositions, described here. Essentially, S^1 is a one-cell and a zero-cell, so S^1\times S^1 is a two-cell, two one-cells, and a zero-cell. And S^1\vee S^1 is just two one-cells attached to a shared zero-cell, so taking the quotient leaves a two-cell and a zero-cell, which is topologically S^2.

Let a be a k-form. a is closed if da = 0 and a is exact if there is some form b such that db = a. There is a close relationship between a geometric property of a space, whether it has “holes”, and an algebraic property, whether it can contain forms which are closed but not exact. This is formalized by the de Rham Cohomology. The k-th de Rham cohomology of a space is the quotient of the space of closed k-forms by the space of exact k-forms. For a great exposition on de Rham Cohomology, see here.
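A standard example of a closed form that is not exact lives on the plane with the origin removed, a space with a “hole”. The sketch below takes the 1-form a = P dx + Q dy with P = -y/(x^2+y^2) and Q = x/(x^2+y^2), an example chosen for illustration. It checks numerically that da = (\partial Q/\partial x - \partial P/\partial y) dx \wedge dy vanishes at a sample point, and that the integral of a around the unit circle is 2\pi. An exact 1-form integrates to zero around every closed loop, so a nonzero loop integral rules out exactness.

```python
import math

def P(x, y):
    return -y / (x * x + y * y)

def Q(x, y):
    return x / (x * x + y * y)

def d_coefficient(x, y, h=1e-5):
    """Central-difference estimate of dQ/dx - dP/dy, the coefficient of da."""
    dQ_dx = (Q(x + h, y) - Q(x - h, y)) / (2 * h)
    dP_dy = (P(x, y + h) - P(x, y - h)) / (2 * h)
    return dQ_dx - dP_dy

def unit_circle_integral(n=10000):
    """Midpoint-rule integral of P dx + Q dy around the unit circle."""
    dt = 2 * math.pi / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * dt
        x, y = math.cos(t), math.sin(t)
        dx, dy = -math.sin(t) * dt, math.cos(t) * dt
        total += P(x, y) * dx + Q(x, y) * dy
    return total

closed_check = d_coefficient(0.7, -1.3)   # sample point away from the origin
loop = unit_circle_integral()
print(closed_check, loop, 2 * math.pi)
```

The closedness check is only a spot check at one point; the loop integral is the part that detects the hole at the origin.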

Exercise: Prove that all exact forms are closed – in other words d^2 = 0.

Let V(M) be the space of vector fields on M. A connection is a map \nabla:C^{\infty}(TM)\times V(M) \to V(M), denoted \nabla_XY, where C^{\infty}(TM) denotes the smooth sections of TM (again vector fields), such that \nabla is a tensor in X and a derivation in Y. In other words, for any smooth functions f, g:M\to R,

(a) \nabla_X(Y_1 + Y_2) = \nabla_XY_1 + \nabla_XY_2

(b) \nabla_X(fY)(p) = (X_pf) Y(p) + f(p) \nabla_XY(p)

(c) \nabla_{fX_1 + gX_2}Y(p) = f(p)\nabla_{X_1}Y(p) + g(p)\nabla_{X_2}Y(p)

For more about connections, see here

The Lie Bracket [X, Y] of two smooth vector fields X and Y is defined such that for any f\in C^{\infty}(M), [X,Y]f = X(Y(f)) - Y(X(f)).
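The definition can be tried out numerically on R^2. In the sketch below, the fields X = \partial/\partial x and Y = x \partial/\partial y and the test function f(x, y) = x^2 y + \sin y are illustrative choices, and derivatives are approximated by central differences, so the result is only accurate to roughly the square of the step size. Working by hand, Y(f) = x^3 + x\cos y and X(f) = 2xy, so [X, Y]f = (3x^2 + \cos y) - 2x^2 = x^2 + \cos y, which equals \partial f/\partial y. In other words, [X, Y] = \partial/\partial y for these two fields.

```python
import math

H = 1e-3  # finite-difference step (an arbitrary choice for this sketch)

def apply_field(field, f):
    """Return the function p -> (field f)(p) for a vector field on R^2."""
    def result(p):
        x, y = p
        a, b = field(p)  # components of the field at p
        df_dx = (f((x + H, y)) - f((x - H, y))) / (2 * H)
        df_dy = (f((x, y + H)) - f((x, y - H))) / (2 * H)
        return a * df_dx + b * df_dy
    return result

def lie_bracket(X, Y, f):
    """Return the function p -> X(Y(f))(p) - Y(X(f))(p)."""
    XYf = apply_field(X, apply_field(Y, f))
    YXf = apply_field(Y, apply_field(X, f))
    return lambda p: XYf(p) - YXf(p)

X = lambda p: (1.0, 0.0)        # d/dx
Y = lambda p: (0.0, p[0])       # x d/dy
f = lambda p: p[0] ** 2 * p[1] + math.sin(p[1])

p = (0.5, 1.2)
numeric = lie_bracket(X, Y, f)(p)
expected = p[0] ** 2 + math.cos(p[1])  # df/dy at p, computed by hand
print(numeric, expected)
```

Even though X and Y each involve only first derivatives, the second-derivative terms in X(Y(f)) and Y(X(f)) cancel, which is why the bracket is again a vector field.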