The math we learned in high school is pre-19th century: it goes only as far as the Newtonian calculus of the 17th century.

Modern math, born in the 19th century and developed up to WW2, is “Abstract Algebra”, which grew out of the “Group Theory” of the French revolutionary Galois.

From WW2 until now, mathematics has faced a crisis of “Truth”: whether its foundation is correct.

Three schools fought over the foundations of mathematics:

1. Russell (Logic with Types to fix the “Russell Paradox” in Set Theory Crisis)

2. Hilbert (Axiomatization of all Mathematics)

3. Brouwer (Intuitionism), against the Law of the Excluded Middle

Winner: Gödel, with “The Incompleteness Theorems”

However, a by-product of the fight among these 3 schools was a new mathematical discovery in machine proving (2010s):

Homotopy Type Theory (HoTT) = Logic (Proof) + Type (Intuitionist) + Topology (Homotopy).

i.e. Math Proof = Computer Program
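The slogan "Math Proof = Computer Program" can be illustrated with a minimal Lean 4 sketch. It shows only the basic propositions-as-types idea, not the homotopy part of HoTT, and the names `andSwap` and `pairSwap` are hypothetical, chosen for this illustration.

```lean
-- A proposition is a type; a proof is a program inhabiting that type.
-- `andSwap` and `pairSwap` are hypothetical names chosen for this sketch.

-- Proof that A ∧ B implies B ∧ A: a function that swaps the two components.
theorem andSwap (A B : Prop) : A ∧ B → B ∧ A :=
  fun ⟨a, b⟩ => ⟨b, a⟩

-- The same program written for ordinary data: swap the components of a pair.
def pairSwap {α β : Type} : α × β → β × α :=
  fun (a, b) => (b, a)
```

The proof and the program have the same shape, which is the point of the slogan: checking the proof is type-checking the function.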

Top 20 Math Books on Machine Learning / AI


    The first 4 books (by Strang, Lang, etc) are the Masterpieces.

    • Linear Algebra by Strang. He writes math like few folks do, no endless paragraphs of definitions and theorems. He tells you why something is important. He wears his heart on his sleeve. If you want to spend a lifetime doing ML, sleep with this book under your pillow. Read it when you go to bed and wake up in the morning. Repeat to yourself: “eigen do it if I try.”




      • Introduction to Applied Math by Strang. You’ll need to understand differential equations at some point, even to understand the dynamics of deep learning models, so you’ll benefit from Strang’s tour de force of a survey through a vast landscape of ideas, from numerical analysis to Fourier transforms.
      • Algebra by Lang. This legendary Yale professor has written more “yellow jacketed” tomes in math in the Springer series than anyone else. I secretly think he’s a fictitious person actually made up of the entire Yale math faculty. Yes, it’s a long book. Yes, it’s hard going. No, it’s about as far from Strang as you can get. You want out? Be my guest. Mount Everest cannot be climbed by everyone. Here’s a nice phrase: “Today we Strang. Tomorrow we will Lang”. Meaning ML today uses basic linear algebraic ideas like eigenvectors, singular value decomposition etc. In the coming decades, the far more powerful machinery in Lang’s book will come into use. Want to be a leader in the ML of tomorrow? This is what it might require.
      • Computational Homology by Kaczynski. Many of the books above cover some basic topology, the abstract study of shapes. You know, the subfield of math that shows why a coffee cup is the same as a doughnut. Most ML methods assume smoothness of the underlying space. Can one learn anything in a space that has no smoothness metrics defined on it? This subfield of topology studies how to extract geometric structure from datasets without assuming any continuity or smoothness.

      • In All Likelihood by Yudi Pawitan. Fisher’s concept of likelihood is the most important idea in statistics you need to understand and no book I’ve read explains this core idea better than this gem of a book by Yudi. Likelihoods are not probabilities. Repeat to yourself. Yudi wisely avoids complex examples and sticks to simple 1 dimensional examples for the most part. You’ll come away with a much deeper appreciation of statistics from this fine book.
      • Convex Optimization by Boyd. Much of modern ML is couched in the language of optimization. The separating line between tractable and intractable problems is not linear vs. nonlinear but convex vs. nonconvex. Boyd leaves out a lot of important modern ideas but he covers the basics well. Hint: his Stanford lecture notes cover a lot of what is not in the book.
      • Optimization in Vector Spaces by Luenberger. At some point in reading ML papers, you’ll start encountering phrases like “inner product spaces” or “Hilbert” spaces. The latter was popularized by the founder of computer science John von Neumann to formalize quantum mechanics. The joke is he gave a talk at Göttingen on Hilbert spaces and the great mathematician David Hilbert was in the audience. He asked a colleague after the talk: what in the world are these so-called Hilbert spaces? Luenberger covers optimization in infinite dimensional spaces. He explains the most important and profound theorem in optimization: the Hahn Banach theorem. Why do neural nets with sigmoid nonlinear activations represent any smooth function? The HB theorem is the reason. Slim book but a tough one to master.
      • Causal Representations in Statistics by Judea Pearl. For the past 25 years, Pearl has single-handedly pursued this problem. To anyone who listens, he will tell you why above all, causality is the most important idea after likelihood in statistics, which however cannot be expressed in the language of probabilities. For all its power, probability theory cannot express so basic a concept as: diseases cause symptoms, not the other way around. Correlation is symmetric. Causality is fundamentally asymmetric. Pearl explains when and whether one can go from the former to the latter. Pearl is the Isaac Newton of modern AI.
      • Group Representations in Probability and Statistics by Persi Diaconis. Persi is a world famous mathematician who started his career as a magician. He ran away from home when he was young and joined a traveling circus, inventing some very cool card tricks that caught the attention of none other than Martin Gardner who used to write the famous “puzzle column” in Scientific American. When Persi decided to learn math more seriously so he could invent better tricks, he had a problem that he barely had what anyone would call an education. Martin Gardner wrote him a recommendation to Harvard that simply read: “here’s a magician who wants to be a mathematician” and explained why Persi would one day be a famous one. Harvard took the chance and the rest is history. In this slim book, Persi elegantly explains why the mathematics of symmetries — group theory and group representations— can shed deeper light into statistics.
      • Linear Statistical Models by C. R. Rao. For most of you who haven’t heard of this “living god” of statistics, your statistics professor’s PhD advisor likely learned statistics from this book. The famous Rao-Blackwell theorem is at the heart of the foundational concept of sufficient statistics. The equally famous Rao-Cramer theorem relates the ability to learn effectively from samples to the curvature of the likelihood function. In a dazzling paper written in his 20s, he showed that the space of probability distributions was not Euclidean, but a curved Riemannian manifold. This idea shows up in machine learning in a hundred different ways currently. Rao invented multivariate statistics as a young postdoctoral researcher at Cambridge. Hard to believe, but this “Gauss” of statistics is still alive, in his 90s, teaching at a university in India named after him.
      • Convex Analysis by Rockafellar. Unlike Boyd’s book, this one has no pictures. You can instantly tell the difference between a serious math book and a more elementary one. The serious one has no pictures. You want to dig deep into the geometry of convex functions and convex sets? Rockafellar is your guide.
      • The Symmetric Group by Sagan. Group theory comes in two flavors: finite groups and continuous infinite groups. Sagan digs deep into finite groups and their linear algebraic representations in this slim beautiful tome. Think you really understand linear algebra? Reading the first few pages of this book will have you scurrying back to Strang when you realize what you haven’t yet mastered. You might read this along with Persi’s more chatty and less refined presentation. The beautiful concept of the character of a group representation is explained here. Unlike the matrices that represent a group, characters are basis independent (they are traces, and the trace of a matrix is the same in any basis).
      • Analysis of Incomplete Multivariate Data by J. L. Schafer. The book to learn EM from, the famous expectation maximization algorithm presented in the way statisticians developed it, not the confusing way it is presented in ML textbooks using mixture models and HMMs. General advice: the statistics you need to learn for ML is best learned from statistics books, not ML textbooks.
      • Neurodynamic Programming by Tsitsiklis and Bertsekas. Still the most authoritative treatment of reinforcement learning. Valuable in many other ways, including a superb treatment of nonlinear function approximation by neural network models. The most enjoyable bus ride of my life was in the company of these two eminent MIT professors a decade ago going to a workshop in a remote region of Mexico. If you really want to understand why Q-learning works, this is your salvation. You’ll quickly discover how weak your math background is, and why you need to understand the deep concept of martingales, which capture the notion of a fair betting game.
      • Non-cooperative Games by John Nash. Yes, the guy who Russell Crowe plays in A Beautiful Mind. This slim 25-page Princeton math PhD thesis earned its author the well-deserved Nobel prize in economics. Legend has it von Neumann dismissed this idea when he heard of it as “just another fixed point theorem”. Von Neumann’s own massive tome on games and economic decisions focused entirely on simpler weaker models of games. Nash’s concept has proved more enduring. If you want to understand GAN models more deeply, you need to understand Nash equilibria.
      • Best Approximation in Inner Product Spaces by Deutsch. If you want to see how mathematicians think of machine learning, you need to read this book. Mathematicians tend to think in generalities. This book captures beautifully the way mathematicians think of learning from data, e.g. least squares methods as projections in Hilbert spaces. Even more beautiful ideas like von Neumann’s famous algorithm using alternating projections, the most rediscovered and reinvented algorithm in history, are explained here. Yes, you’ll find that many ideas you thought came from ML or statistics can all be viewed as special cases of von Neumann’s work (EM, non-negative matrix approximation, and a dozen other ideas). This book teaches you the power of abstraction.
      • The “Lord of the Rings” trilogy on manifolds by Lee. I’m getting to the end of my list of 20 math books for ML, and like most humans, I’m going to start cheating by including “course packs”. You need to really grok manifolds at some point in your quest to study the foundations of ML. Lee’s trilogy on “Topological Manifolds”, “Smooth Manifolds” and “Riemannian manifolds” is the definitive modern guide to understanding curved spaces, like space time (four dimensions), string theory, and probability spaces.
      • Set Theory and Measure Theory by Paul Halmos. PH wasn’t a great mathematician, but he was a great writer. ML is deeply based on being able to measure distances between objects and measure theory is the abstract theory of how to define metrics on sets. Ultimately, probability is just a measure on a set with some special properties.
      • Probability Theory: Independence, Exchangeability, Martingales by Chow and Teicher. Yes, probability is just a measure on sets, but this tour-de-force of a book explains the unique measure-theoretic properties of probability. This book shows you how mathematicians think of probability. I’m guessing you know all about independent random variables. Do you know about exchangeability? Ever used bag of words representations in NLP or computer vision? Why do they work? Why does Q-learning converge? You need to understand the other two foundations of probability theory.
      • For my last book, I’ll choose The Topology of Fiber Bundles by Steenrod. These are ways of parameterizing spaces, and manifolds and Euclidean geometry are special types of fiber bundles. Let’s take the Earth’s surface as a fiber bundle. At each point on the surface, the set of tangents form a second space. The first space, the surface of the Earth, parameterizes the second space of tangents at each point. Ergo, we have a tangent bundle, a special case of fiber bundles. Today’s ML heavily uses the concept of manifolds. Tomorrow’s ML will likely build on fiber bundles.
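The remark in the list above that "likelihoods are not probabilities" can be checked numerically. The sketch below assumes a made-up coin-toss dataset (7 heads in 10 tosses, not from any of the books) and treats the likelihood as a function of the parameter θ. It finds the maximum on a grid and shows that the area under the curve is nowhere near 1, so the likelihood is not a probability distribution over θ.

```python
# Hypothetical data for illustration: 7 heads observed in 10 coin tosses.
HEADS, TOSSES = 7, 10

def likelihood(theta):
    """Likelihood of theta for the fixed data (binomial coefficient omitted)."""
    return theta ** HEADS * (1 - theta) ** (TOSSES - HEADS)

# Evaluate on a grid of theta values in [0, 1].
grid = [i / 1000 for i in range(1001)]

# The maximum likelihood estimate is the theta where the curve peaks.
mle = max(grid, key=likelihood)

# Riemann-sum approximation of the area under the likelihood curve.
area = sum(likelihood(t) for t in grid) / 1000

print("maximum likelihood estimate:", mle)
print("area under likelihood curve:", area)
```

The peak sits at the observed frequency of heads, while the area is a small number far from 1. Only the ratio of likelihoods between two values of θ carries meaning.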

      Hilbert’s Problem Solving

      David Hilbert was a most concrete, intuitive mathematician who invented, and very consciously used, a principle: namely, if you want to solve a problem first strip the problem of everything that is not essential. Simplify it, specialize it as much as you can without sacrificing its core. Thus it becomes simple, as simple as it can be made, without losing any of its punch, and then you solve it. The generalization is a triviality which you don’t have to pay too much attention to.

      Lord of the Ring

      The term Ring was first introduced by David Hilbert (1862-1943) for Z and Polynomials.
      The fully abstract axiomatic theory of commutative rings was developed by Emmy Noether, whom Hilbert brought to Göttingen, in her paper “Ideal Theory in Rings” (1921).

      e.g. 3 Classical Rings:
      1. Matrices over a Field
      2. The Integers Z
      3. Polynomials over a Field.

      Ring Confusions
      Assume all Rings have an identity 1 for the * operation.

      In a Ring, the operation + forms an Abelian group and the operation * forms a semigroup (closed, associative).

      1) Ever ask why + must form an Abelian group?
      Expand (a+b).(1+1) in two ways using the Distributive Axioms:
      (a+b).(1+1) = a.(1+1) + b.(1+1)
      = a + a + b + b …[1]

      (a+b).(1+1) = (a+b).1 + (a+b).1
      = a + b + a + b …[2]

      Equating [1] and [2]:
      a + (a + b) + b = a + (b + a) + b
      Cancel a on the left and b on the right (every element has an additive inverse):
      => a + b = b + a

      Therefore, + must be Abelian in order for the Ring’s * to comply with the distributive axioms wrt +.

      2) Subring
      Z/6Z = {0,1,2,3,4,5}
      3.4 = 12 = 0 (mod 6) => 3 and 4 are zero divisors

      Z/6Z has subrings: {0,2,4}, {0,3}

      3) Identity 1 and Units of a Ring

      Z/6Z has identity 1,
      but these 2 subrings do not have 1 as identity.
      Subring {0,2,4}:
      4.2 = 2 and 4.4 = 4 => the identity is 4
      4 is also a unit there (4.4 = 4, so 4 is its own inverse).

      Units: let R be a Ring with 1.
      For a ∈ R, if ∃b ∈ R s.t.
      a.b = b.a = 1
      => a is a unit
      and b is its inverse a^-1

      Z/6Z: the identity for * is 1
      5.5 = 25 = 1 (mod 6)
      5 is a unit, besides 1, which is also a unit (1.1 = 1).
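The Z/6Z facts above (zero divisors, the identity of each subring, and the units) are small enough to verify by brute force. The following is a minimal sketch, not a general ring library. The helper names `mul`, `zero_divisors`, `identity_of` and `units` are made up for this illustration.

```python
from itertools import product

MODULUS = 6  # we work in Z/6Z, as in the text

def mul(a, b):
    """Multiplication in Z/6Z."""
    return (a * b) % MODULUS

def zero_divisors(elements):
    """Nonzero a for which some nonzero b in the set gives a.b = 0."""
    return sorted({a for a, b in product(elements, repeat=2)
                   if a != 0 and b != 0 and mul(a, b) == 0})

def identity_of(elements):
    """Return e with e.x = x.e = x for every x in the set, or None."""
    for e in elements:
        if all(mul(e, x) == x and mul(x, e) == x for x in elements):
            return e
    return None

def units(elements):
    """Elements having an inverse relative to the set's own identity."""
    e = identity_of(elements)
    if e is None:
        return []
    return sorted(a for a in elements
                  if any(mul(a, b) == e and mul(b, a) == e for b in elements))

whole_ring = list(range(MODULUS))
for name, elements in [("Z/6Z", whole_ring),
                       ("{0,2,4}", [0, 2, 4]),
                       ("{0,3}", [0, 3])]:
    print(name, "identity:", identity_of(elements), "units:", units(elements))
print("zero divisors of Z/6Z:", zero_divisors(whole_ring))
```

Note that 2 also shows up as a zero divisor of Z/6Z (2.3 = 0), and that inside the subring {0,2,4} the element 2 is a unit because 2.2 = 4, the identity of that subring.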

      Prime Secret: ζ(s)

      Riemann intuitively found a pattern in the zeros of the Zeta Function ζ(s) (the Riemann Hypothesis below), but couldn’t prove it. Computers have ‘tested’ it on billions of zeros without finding a counterexample.


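      \zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^{s}} = 1+\frac{1}{2^{s}}+\frac{1}{3^{s}}+\frac{1}{4^{s}}+\dots

      Or equivalently (see note *)

      \frac {1}{\zeta(s)} =(1-\frac{1}{2^{s}})(1-\frac{1}{3^{s}})(1-\frac{1}{5^{s}})\dots(1-\frac{1}{p^{s}})\dots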

      ζ(1) = Harmonic series (Pythagorean music notes) -> diverges to infinity
      (See note #)

      ζ(2) = π²/6 [Euler]

      ζ(3) = not a rational number [Apéry].

      1. The Riemann Hypothesis:
      All non-trivial zeros of the zeta function have real part one-half.

      i.e. ζ(s) = 0 where s = ½ + bi

      Trivial zeros are the negative even integers, s = -2, -4, -6, …:
      ζ(-2) = ζ(-4) = ζ(-6) = ζ(-8) = … = 0

      You might ask what Re(s) = 1/2 has to do with prime numbers.

      There is also the Prime Number Theorem (PNT), conjectured by Gauss and proved by Hadamard and de la Vallée Poussin:

      π(N) ~ N / log N
      ε = π(N) – N / log N
      The error ε hides in the Riemann Zeta Function’s non-trivial zeros, which (if the hypothesis is true) all lie on the critical line Re(s) = 1/2:

      All non-trivial zeros of ζ(s) are complex numbers in the critical strip 0 < Re(s) < 1; the hypothesis places them on the vertical line x = 1/2.
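To get a feel for the Prime Number Theorem stated above, one can count primes with a sieve and compare π(N) with N / log N. This is a minimal sketch: the function name `prime_count_table` and the cut-off of one million are choices made for this illustration.

```python
import math

def prime_count_table(limit):
    """Sieve of Eratosthenes; returns a list pi with pi[n] = number of primes <= n.

    Assumes limit >= 2.
    """
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            # Cross out every multiple of p starting at p*p.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    pi = [0] * (limit + 1)
    count = 0
    for n in range(limit + 1):
        if is_prime[n]:
            count += 1
        pi[n] = count
    return pi

pi = prime_count_table(10 ** 6)
for k in range(2, 7):
    n = 10 ** k
    estimate = n / math.log(n)   # natural logarithm, as in the PNT
    error = pi[n] - estimate     # the error term called ε in the text
    print(n, pi[n], round(estimate), round(error), round(pi[n] / estimate, 3))
```

The last column is the ratio π(N) / (N / log N), which the theorem says tends to 1. The convergence is slow, which is why the size of the error term ε is the interesting question.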

      2. David Hilbert:

      ‘If I were to awaken after 500 yrs, my 1st question would be: Has Riemann been proven?’

      It will be proven in the future by a young man ‘uncorrupted’ by today’s math.

      Note (*):

      \zeta(s)=1+\frac{1}{2^{s}}+\frac{1}{3^{s}}+\frac{1}{4^{s}}+\dots = \sum \frac {1}{n^{s}} …[1]

      \text {Multiply both sides by } \frac{1}{2^{s}} \text {:}

      \frac {1}{2^{s}}\zeta(s) =  \frac{1}{2^{s}}(1+\frac{1}{2^{s}}+\frac{1}{3^{s}}+\frac{1}{4^{s}}+\dots)

      \frac {1}{2^{s}}\zeta(s) =  \frac {1}{2^{s}}+ \frac{1}{4^{s}} + \frac{1}{6^{s}} + \frac{1}{8^s} +\dots … [2]

      \text {Subtract [2] from [1]; every term with even } n \text { cancels:}

      (1- \frac{1}{2^{s}})\zeta(s) = 1+ \frac{1}{3^{s}} + \frac{1}{5^{s}} + \frac{1}{7^{s}} + \frac{1}{9^{s}} + \dots

      \text {Repeat with } (1-\frac{1}{3^s}) \text { on both sides; every multiple of 3 cancels:}

      (1- \frac{1}{3^{s}})(1- \frac{1}{2^{s}})\zeta(s) = 1+ \frac{1}{5^{s}} + \frac{1}{7^{s}} + \frac{1}{11^{s}} + \frac{1}{13^{s}} + \dots

      \text {Repeat for every prime } p \text {; only the term 1 survives:}

      \dots (1- \frac{1}{p^{s}}) \dots (1- \frac{1}{5^{s}})(1- \frac{1}{3^{s}})(1- \frac{1}{2^{s}})\zeta(s) = 1

      \zeta(s) = \prod_{p} \frac {1}  {1- \frac{1}{p^{s}}}= \sum_{n} \frac {1}{n^{s}}
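The identity derived in Note (*) can be sanity-checked numerically at s = 2, where Euler's value π²/6 is known. The sketch below truncates both sides, so the two numbers only approximate each other. The function names and the cut-offs (100000 terms, primes up to 10000) are arbitrary choices for this illustration.

```python
import math

def primes_up_to(n):
    """Primes <= n by trial division; adequate for the small n used here."""
    return [p for p in range(2, n + 1)
            if all(p % d for d in range(2, math.isqrt(p) + 1))]

def zeta_sum(s, terms):
    """Partial sum 1 + 1/2^s + ... + 1/terms^s."""
    return sum(1 / n ** s for n in range(1, terms + 1))

def euler_product(s, prime_limit):
    """Partial product of 1 / (1 - 1/p^s) over primes p <= prime_limit."""
    result = 1.0
    for p in primes_up_to(prime_limit):
        result *= 1 / (1 - p ** (-s))
    return result

target = math.pi ** 2 / 6
print("pi^2 / 6         :", target)
print("sum over n       :", zeta_sum(2, 100_000))
print("product over p   :", euler_product(2, 10_000))
```

Both truncations approach the same value from below, one by adding terms over all integers and the other by multiplying factors over primes only.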

      Note #:
      \zeta(s) = \prod \frac {1}  {1- \frac{1}{p^{s}}}= \sum \frac {1}{n^{s}}

      Let s = 1.
      RHS: the Harmonic series diverges to infinity.
      The product
      \prod \frac {1}{1- \frac{1}{p}}= \prod \frac{p}{p-1}
      must therefore also diverge to infinity. A product over finitely many primes would be finite => there are infinitely many primes p
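The argument in Note # can be made concrete. Expanding each factor p/(p-1) as 1 + 1/p + 1/p² + … and multiplying out the factors for all primes p ≤ N produces 1/n for every n built from those primes, which includes every n ≤ N. So the partial product is at least the partial harmonic sum. The sketch below checks this for a few values of N; the function names are made up for this illustration.

```python
import math

def primes_up_to(n):
    """Primes <= n by trial division; adequate for the small n used here."""
    return [p for p in range(2, n + 1)
            if all(p % d for d in range(2, math.isqrt(p) + 1))]

def harmonic(n):
    """Partial harmonic sum 1 + 1/2 + ... + 1/n."""
    return sum(1 / k for k in range(1, n + 1))

def prime_product(n):
    """Product of p / (p - 1) over all primes p <= n."""
    result = 1.0
    for p in primes_up_to(n):
        result *= p / (p - 1)
    return result

for n in (10, 100, 1000, 10_000):
    print(n, round(harmonic(n), 3), round(prime_product(n), 3))
```

Since the harmonic sum grows without bound, the product must too, and that is only possible if the supply of primes never runs out.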

      Figure: Zero-free region for the Riemann zeta function (Photo credit: Wikipedia)