Top 20 Math Books on Machine Learning / AI

    The first 4 books (by Strang, Lang, etc) are the Masterpieces.

    • Linear Algebra by Strang. He writes math like few folks do, no endless paragraphs of definitions and theorems. He tells you why something is important. He wears his heart on his sleeve. If you want to spend a lifetime doing ML, sleep with this book under your pillow. Read it when you go to bed and wake up in the morning. Repeat to yourself: “eigen do it if I try.”

      Strang’s MIT OpenCourse:

      • Introduction to Applied Math by Strang. You’ll need to understand differential equations at some point, even to understand the dynamics of deep learning models, so you’ll benefit from Strang’s tour de force of a survey through a vast landscape of ideas, from numerical analysis to Fourier transforms.
      • Algebra by Lang. This legendary Yale professor has written more “yellow jacketed” tomes in math in the Springer series than any one else. I secretly think he’s a fictitious person actually made up of the entire Yale math faculty. Yes, it’s a long book. Yes, it’s hard going. No, it’s about as far from Strang as you can get. You want out. Be my guest. Mount Everest cannot be climbed by everyone. Here’s a nice phrase : “Today we Strang. Tomorrow we will Lang”. Meaning ML today uses basic linear algebraic ideas like eigenvectors, singular value decomposition etc. in the coming decades, the far more powerful machinery in Lang’s book will come into use. Want to be a leader in the ML of tomorrow? This is what it might require.
      • Computational Homology by Kaczynski. Many of the books above cover some basic topology, the abstract study of shapes. You know, the subfield of math that shows why a coffee cup is the same as a doughnut. Most ML methods assume smoothness of the underlying space. Can one learn anything in a space that has no smoothness metrics defined on it? This subfield of topology studies how to extract geometric structure from datasets without assuming any continuity or smoothness.

      • In All Likelihood by Yudi Pawitan. Fisher’s concept of likelihood is the most important idea in statistics you need to understand and no book I’ve read explains this core idea better than this gem of a book by Yudi. Likelihoods are not probabilities. Repeat to yourself. Yudi wisely avoids complex examples and sticks to simple 1 dimensional examples for the most part. You’ll come away with a much deeper appreciation of statistics from this fine book.
      • Convex Optimization by Boyd. Much of modern ML is couched in the language of optimization. The separating line between tractable and intractable problems is not linear vs. nonlinear but convex vs. nonconvex. Boyd leaves out a lot of important modern ideas but he covers the basics well. Hint: his Stanford lecture notes cover a lot of what is not in the book.
      • Optimization in Vector Spaces by Luenberger. At some point in reading ML papers, you’ll start encountering phrases like “inner product spaces” or “Hilbert” spaces. The latter was popularized by the founder of computer science John von Neumann to formalize quantum mechanics. The joke is he gave a talk at Göttingen on Hilbert spaces and the great mathematician David Hilbert was in the audience. He asked a colleague after the talk: what in the world are these so-called Hilbert spaces? Luenberger covers optimization in infinite dimensional spaces. He explains the most important and profound theorem in optimization: the Hahn Banach theorem. Why do neural nets with sigmoid nonlinear activations represent any smooth function? The HB theorem is the reason. Slim book but a tough one to master.
      • Causal Representations in Statistics by Judea Pearl. For the past 25 years, Pearl has single handedly pursued this problem. To anyone who listens, he will tell you why above all, causality is the most important idea after likelihood in statistics, which however cannot be expressed in the language of probabilities. For all its power, probability theory cannot express such a basic concept like diseases causes symptoms, not the other way. Correlation is symmetric. Causality is fundamentally asymmetric. Pearl explains when and whether one can go from the former to the latter. Pearl is the Isaac Newton of modern AI.
      • Group Representations in Probability and Statistics by Persi Diaconis. Persi is a world famous mathematician who started his career as a magician. He ran away from home when he was young and joined a traveling circus, inventing some very cool card tricks that caught the attention of none other than Martin Gardner who used to write the famous “puzzle column” in Scientific American. When Persi decided to learn math more seriously so he could invent better tricks, he had a problem that he barely had what anyone would call an education. Martin Gardner wrote him a recommendation to Harvard that simply read: “here’s a magician who wants to be a mathematician” and explained why Persi would one day be a famous one. Harvard took the chance and the rest is history. In this slim book, Persi elegantly explains why the mathematics of symmetries — group theory and group representations— can shed deeper light into statistics.
      • Linear Statistical Models by C. R. Rao. For most of you who haven’t heard of this “living god” of statistics, your statistics professor’s PhD advisor likely learned statistics from this book. The famous Rao-Blackwell theorem is at the heart of the foundational concept of sufficient statistics. The equally famous Rao-Cramer theorem relates the ability to learn effectively from samples to the curvature of the likelihood function. In a dazzling paper written in his 20s, he showed that the space of probability distributions was not Euclidean, but a curved Riemannian manifold. This idea shows up in machine learning in a hundred different ways currently. Rao invented multivariate statistics as a young postdoctoral researcher at Cambridge. Hard to believe, but this “Gauss” of statistics is still alive, in his 90s, teaching at a university in India named after him.
      • Convex Analysis by Rockafellar. Unlike Boyd’s book, this one has no pictures. You can instantly tell the difference from a serious math book from a more elementary one. The serious one has no pictures. You want to dig deep into the geometry of convex functions and convex sets, Rockafellar is your guide.
      • The Symmetric Group by Sagan. Group theory comes in two flavors: finite groups and continuous infinite groups. Sagan digs deep into finite groups and their linear algebraic representations in this slim beautiful tome. Think you really understand linear algebra. Reading the first few pages of this book will have you scurrying back to Strang when you realize what you haven’t yet mastered. You might read this along with Persi’s more chatty and less refined presentation. The beautiful concept of the character of a group is explained here. Unlike their linear algebraic cousins, group representations are basis independent (like the trace of a matrix, which is the same in any basis).
      • Analysis of Incomplete Mulitivariate Data by J. L. Shafer. The book to learn EM from, the famous expectation maximization algorithm presented in the way statisticians developed them, not the confusing way it is presented in ML textbooks using mixture models and HMMs. General advice: the statistics you need to learn for ML is best learned from statistics books, not ML textbooks.
      • Neurodynamic Programming by Tsitsiklis and Bertsekas. Still the most authoritative treatment of reinforcement learning. Valuable in many other ways, including a superb treatment of nonlinear function approximation by neural network models. The most enjoyable bus ride of my life was in the company of these two eminent MIT professors a decade ago going to a workshop in a remote region of Mexico. If you really want to understand why Q-learning works, this is your salvation. You’ll quickly discover how weak your math background is, and why you need to understand the deep concept of martingales, which capture the notion of a fair betting game.
      • Non-cooperative Games by John Nash. Yes, the guy who Russell Crowe plays in The Beautiful Mind. This slim 25-page Princeton math PhD thesis earned its author the well deserved Nobel prize in economics. Legend has it von Neumann dismissed this idea when he heard of it as “just another fixed point theorem”. Von Neumann’s own massive tome on games and economic decisions focused entirely on simpler weaker models of games. Nash’s concept has proved more enduring. If you want to understand GAN models more deeply, you need to understand Nash equilibria.
      • Best Approximation in Inner Product Spaces by Deutsch. If you want to see how mathematicians think of machine learning, you need to read this book. Mathematicians tend to think in generalities. This book captures beautifully the way mathematicians think of learning from data, e.g. least squares methods as projections in Hilbert spaces. Even more beautiful ideas like von Neumann’s famous algorithm using alternating projections, the most rediscovered and reinvented algorithm in history, is explained here. Yes, you’ll find that many ideas you thought that came from ML or statistics can all be viewed as special cases of von Neumann’s work (EM, non-negative matrix approximation, and a dozen other ideas). This book teaches you the power of abstraction.
      • The “Lord of the Rings” trilogy on manifolds by Lee. I’m getting to the end of my list of 20 math books for ML, and like most humans, I’m going to start cheating by including “course packs”. You need to really grok manifolds at some point in your quest to study the foundations of ML. Lee’s trilogy on “Topological Manifolds”, “Smooth Manifolds” and “Riemannian manifolds” is the definitive modern guide to understanding curved spaces, like space time (four dimensions), string theory, and probability spaces.
      • Set Theory and Measure Theory by Paul Halmos. PH wasn’t a great mathematician, but he was a great writer. ML is deeply based on being able to measure distances between objects and measure theory is the abstract theory of how to define metrics on sets. Ultimately, probability is just a measure on a set with some special properties.
      • Probability Theory: Independence, Exchangeability, Martingales by Chow and Teicher. Yes, probability is just a measure on sets, but this tour-de-force of a book explains the unique measure-theoretic properties of probability. This book shows you how mathematicians think of probability. I’m guessing you know all about independent random variables. Do you know about exchangeability? Ever used bag of words representations in NLP or computer vision. Why do they work? Why does Q-learning converge? You need to understand the other two foundations of probability theory.
      • For my last book, I’ll choose The Topology of Fiber Bundles by Steenrod. These are ways of parameterizing spaces, and manifolds and Euclidean geometry are special types of fiber bundles. Let’s take the Earth’s surface as a fiber bundle. At each point on the surface, the set of tangents form a second space. The first space, the surface of the Earth, parameterizes the second space of tangents at each point. Ergo, we have a tangent bundle, a special case of fiber bundles. Today’s ML heavily uses the concept of manifolds. Tomorrow’s ML will likely build on fiber bundles.

      Big Picture: Linear Algebra 

      The Big Picture of Linear Algebra 线性代数 (MIT Open-Courseware by the famous Prof Gilbert Strang)

      A.X = B

      A(m,n) = \begin{pmatrix} a_{11} &  a_{12} & \ldots & a_{1n}\\ a_{21} & a_{22} & \ldots & a_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{m1}&a_{m2} &\ldots & a_{mn} \end{pmatrix}

      A(m, n)  is a matrix of m rows, n columns

      4 sub-Vector Spaces:
      Column Space {C(A)} , dim = r = rank (A) 
      Row Space {C(A^{T})} , dim = r = rank (A)

      Nullspace {N(A)}   {\perp C(A^{T})} , dim = n – r

      Left Nullspace {N(A^{T})} {\perp C(A)} , dim = m – r

      Abstract Vector Spaces ​向量空间

      Any object satisfying these 8 axioms belong to the algebraic structure of Vector Space: eg. Vectors, Polynomials, Functions, …

      Note: “Vector Space” + “Linear Map” = “CategoryVect_{K}

      Eigenvalues & Eigenvectors (valeurs propres et vecteurs propres) 特征值/特征向量

      [ Note: “Eigen-” is German for Characteristic 特征.]

      Important Trick: (see Monkeys & Coconuts Problem)
      If a transformation A is linear, and the “before” state and “after” state of the “vector” v remain the same  (keep the status-quo) , then : Eigenvalue \boxed {\lambda = 1}

      \boxed {A.v = \lambda.v = 1.v  = v}

      Try to compute:
      { \begin{pmatrix} a_{11} &  a_{12} & \ldots & a_{1n}\\ a_{21} & a_{22} & \ldots & a_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1}&a_{n2} &\ldots & a_{nn} \end{pmatrix} }^{1000000000}
      is more difficult than this diagonalized equivalent matrix:
      { \begin{pmatrix} b_{11} & 0 & \ldots & 0\\ 0 & b_{22} & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 &0 &\ldots & b_{nn} \end{pmatrix} }^{1000000000}

      = { \begin{pmatrix} {b_{11}}^{1000000000}&0 &\ldots &0\\ 0 &{b_{22}}^{1000000000} &\ldots & 0\\ \vdots &\vdots & \ddots & \vdots\\ 0 &0 &\ldots &{b_{nn}}^{1000000000} \end{pmatrix}}

      Note: This is the secret of the Google computation of diagonalized matrix of billion columns & billion rows, where all the bjk are the “PageRanks” (web links coming into a particular webpage and links going out from that webpage).

      The Essence of Determinant (*): (行列式)

      (*) Determinant was invented by the ancient Chinese Algebraists 李冶 / 朱世杰 /秦九韶 in 13th century (金 / 南宋 / 元) in《天元术》.The Japanese “和算” mathematician 关孝和 spread it further to Europe before the German mathematician Leibniz named it the “Determinant” in 18th century. The world, however, had to wait till the 19th century to discover the theory of Matrix 矩阵 by JJ Sylvester (Statistical Math private Tutor of Florence Nightingale, the world’s first nurse) closely linked to the application of Determinant.

      [NOTE] 金庸 武侠小说 《神雕侠女》里 元朝初年的 黄蓉 破解 大理国王妃 瑛姑 苦思不解的 “行列式”, 大概是求 eigenvalues & eigenvectors ? 🙂

      Linear Algebra : Left & Right Eigenvectors and Eigenvalues

      Let A matrix, vector x, λ eigenvalue

      1. Right Eigenvectors and eigenvalues:

      A.x = λx


      A = \begin{pmatrix} 5 & -7 &  & 7 \\ 4 & -3 & & 4\\ 4 & -1 & & 2 \end{pmatrix} and, x = \begin{pmatrix} 1\\ 1\\ 1 \end{pmatrix}

      A.x = \begin{pmatrix} 5\\ 5\\ 5 \end{pmatrix} = 5. \begin{pmatrix} 1\\ 1\\ 1 \end{pmatrix} = 5.x
      (right) eigenvalue = 5

      2. Left Eigenvectors and eigenvalues:

      x.A = λx

      However, be careful that:
      x.A = \begin{pmatrix} 1 & 1 && 1 \end{pmatrix}.\begin{pmatrix} 5 & -7 &  & 7 \\ 4 & -3 & & 4\\ 4 & -1 & & 2 \end{pmatrix} = \begin{pmatrix} 13 & -11 && 13 \end{pmatrix}

      If we want to find the left eigenvector associated with the eigenvalue 5, then we find the eigenvector A^T.

      This would lead us to see that:
      (-1 1 -1).A = (-5 5 -5) = 5. (-1 1 -1)
      So, in this example, the eigenvalue 5 has different left and right eigenvectors:
      (-1 1 -1) & (1 1 1) respectively.

      Remark 1: However, the nice fact about matrices is that always :
      left eigenvalue = right eigenvalue.
      So we just simply call eigenvalue for short.

      Remark 2: Intuitive explanation of eigenvector and eigenvalue.

      《Math Bytes》by Tim Charter
      Princeton University Press
      [NLB #510 CHA]

      Khan Academy

      I find Khan Linear Algebra video excellent. The founder / teacher Sal Khan has the genius to explain this not-so-easy topic in modular videos steps by steps, from 2-dimensional vectors to 3-dimensional, working with you by hand to compute eigenvalues and eigenvectors, and show you what they mean in graphic views.

      If you are taking Linear Algebra course in university, or revising it, just go through all the Khan’s short (5-20 mins) videos on Linear Algebra here:

      In 138 lessons sequence:

      or random revision: