The first 4 books (by Strang, Lang, etc) are the Masterpieces.

Linear Algebraby Strang. He writes math like few folks do, no endless paragraphs of definitions and theorems. He tells you why something is important. He wears his heart on his sleeve. If you want to spend a lifetime doing ML, sleep with this book under your pillow. Read it when you go to bed and wake up in the morning. Repeat to yourself: “eigen do it if I try.”

Introduction toApplied Mathby Strang. You’ll need to understand differential equations at some point, even to understand the dynamics of deep learning models, so you’ll benefit from Strang’s tour de force of a survey through a vast landscape of ideas, from numerical analysis to Fourier transforms.

Algebraby Lang. This legendary Yale professor has written more “yellow jacketed” tomes in math in the Springer series than any one else. I secretly think he’s a fictitious person actually made up of the entire Yale math faculty. Yes, it’s a long book. Yes, it’s hard going. No, it’s about as far from Strang as you can get. You want out. Be my guest. Mount Everest cannot be climbed by everyone. Here’s a nice phrase : “Today we Strang. Tomorrow we will Lang”. Meaning ML today uses basic linear algebraic ideas like eigenvectors, singular value decomposition etc. in the coming decades, the far more powerful machinery in Lang’s book will come into use. Want to be a leader in the ML of tomorrow? This is what it might require.

Computational Homologyby Kaczynski. Many of the books above cover some basic topology, the abstract study of shapes. You know, the subfield of math that shows why a coffee cup is the same as a doughnut. Most ML methods assume smoothness of the underlying space. Can one learn anything in a space that has no smoothness metrics defined on it? This subfield of topology studies how to extract geometric structure from datasets without assuming any continuity or smoothness.

In All Likelihoodby Yudi Pawitan. Fisher’s concept of likelihood is the most important idea in statistics you need to understand and no book I’ve read explains this core idea better than this gem of a book by Yudi. Likelihoods are not probabilities. Repeat to yourself. Yudi wisely avoids complex examples and sticks to simple 1 dimensional examples for the most part. You’ll come away with a much deeper appreciation of statistics from this fine book.

Convex Optimizationby Boyd. Much of modern ML is couched in the language of optimization. The separating line between tractable and intractable problems is not linear vs. nonlinear but convex vs. nonconvex. Boyd leaves out a lot of important modern ideas but he covers the basics well. Hint: his Stanford lecture notes cover a lot of what is not in the book.

Optimization in Vector Spacesby Luenberger. At some point in reading ML papers, you’ll start encountering phrases like “inner product spaces” or “Hilbert” spaces. The latter was popularized by the founder of computer science John von Neumann to formalize quantum mechanics. The joke is he gave a talk at Göttingen on Hilbert spaces and the great mathematician David Hilbert was in the audience. He asked a colleague after the talk: what in the world are these so-called Hilbert spaces? Luenberger covers optimization in infinite dimensional spaces. He explains the most important and profound theorem in optimization: the Hahn Banach theorem. Why do neural nets with sigmoid nonlinear activations represent any smooth function? The HB theorem is the reason. Slim book but a tough one to master.

Causal Representations in Statisticsby Judea Pearl. For the past 25 years, Pearl has single handedly pursued this problem. To anyone who listens, he will tell you why above all, causality is the most important idea after likelihood in statistics, which however cannot be expressed in the language of probabilities. For all its power, probability theory cannot express such a basic concept like diseases causes symptoms, not the other way. Correlation is symmetric. Causality is fundamentally asymmetric. Pearl explains when and whether one can go from the former to the latter. Pearl is the Isaac Newton of modern AI.

Group Representations in Probability and Statisticsby Persi Diaconis. Persi is a world famous mathematician who started his career as a magician. He ran away from home when he was young and joined a traveling circus, inventing some very cool card tricks that caught the attention of none other than Martin Gardner who used to write the famous “puzzle column” in Scientific American. When Persi decided to learn math more seriously so he could invent better tricks, he had a problem that he barely had what anyone would call an education. Martin Gardner wrote him a recommendation to Harvard that simply read: “here’s a magician who wants to be a mathematician” and explained why Persi would one day be a famous one. Harvard took the chance and the rest is history. In this slim book, Persi elegantly explains why the mathematics of symmetries — group theory and group representations— can shed deeper light into statistics.

Linear Statistical Modelsby C. R. Rao. For most of you who haven’t heard of this “living god” of statistics, your statistics professor’s PhD advisor likely learned statistics from this book. The famous Rao-Blackwell theorem is at the heart of the foundational concept of sufficient statistics. The equally famous Rao-Cramer theorem relates the ability to learn effectively from samples to the curvature of the likelihood function. In a dazzling paper written in his 20s, he showed that the space of probability distributions was not Euclidean, but a curved Riemannian manifold. This idea shows up in machine learning in a hundred different ways currently. Rao invented multivariate statistics as a young postdoctoral researcher at Cambridge. Hard to believe, but this “Gauss” of statistics is still alive, in his 90s, teaching at a university in India named after him.

Convex Analysisby Rockafellar. Unlike Boyd’s book, this one has no pictures. You can instantly tell the difference from a serious math book from a more elementary one. The serious one has no pictures. You want to dig deep into the geometry of convex functions and convex sets, Rockafellar is your guide.

The Symmetric Groupby Sagan. Group theory comes in two flavors: finite groups and continuous infinite groups. Sagan digs deep into finite groups and their linear algebraic representations in this slim beautiful tome. Think you really understand linear algebra. Reading the first few pages of this book will have you scurrying back to Strang when you realize what you haven’t yet mastered. You might read this along with Persi’s more chatty and less refined presentation. The beautiful concept of the character of a group is explained here. Unlike their linear algebraic cousins, group representations are basis independent (like the trace of a matrix, which is the same in any basis).

Analysis of IncompleteMulitivariate DatabyJ. L.Shafer. The book to learn EM from, the famous expectation maximization algorithm presented in the way statisticians developed them, not the confusing way it is presented in ML textbooks using mixture models and HMMs. General advice: the statistics you need to learn for ML is best learned from statistics books, not ML textbooks.

Neurodynamic Programmingby Tsitsiklis and Bertsekas. Still the most authoritative treatment of reinforcement learning. Valuable in many other ways, including a superb treatment of nonlinear function approximation by neural network models. The most enjoyable bus ride of my life was in the company of these two eminent MIT professors a decade ago going to a workshop in a remote region of Mexico. If you really want to understand why Q-learning works, this is your salvation. You’ll quickly discover how weak your math background is, and why you need to understand the deep concept of martingales, which capture the notion of a fair betting game.

Non-cooperative Gamesby John Nash. Yes, the guy who Russell Crowe plays inTheBeautifulMind. This slim 25-page Princeton math PhD thesis earned its author the well deserved Nobel prize in economics. Legend has it von Neumann dismissed this idea when he heard of it as “just another fixed point theorem”. Von Neumann’s own massive tome on games and economic decisions focused entirely on simpler weaker models of games. Nash’s concept has proved more enduring. If you want to understand GAN models more deeply, you need to understand Nash equilibria.

Best Approximation in Inner Product Spacesby Deutsch. If you want to see how mathematicians think of machine learning, you need to read this book. Mathematicians tend to think in generalities. This book captures beautifully the way mathematicians think of learning from data, e.g. least squares methods as projections in Hilbert spaces. Even more beautiful ideas like von Neumann’s famous algorithm using alternating projections, the most rediscovered and reinvented algorithm in history, is explained here. Yes, you’ll find that many ideas you thought that came from ML or statistics can all be viewed as special cases of von Neumann’s work (EM, non-negative matrix approximation, and a dozen other ideas). This book teaches you the power of abstraction.

The “Lord of the Rings” trilogy on manifolds by Lee. I’m getting to the end of my list of 20 math books for ML, and like most humans, I’m going to start cheating by including “course packs”. You need to really grok manifolds at some point in your quest to study the foundations of ML. Lee’s trilogy on “Topological Manifolds”, “Smooth Manifolds” and “Riemannian manifolds” is the definitive modern guide to understanding curved spaces, like space time (four dimensions), string theory, and probability spaces.

Set Theory and Measure Theoryby Paul Halmos. PH wasn’t a great mathematician, but he was a great writer. ML is deeply based on being able to measure distances between objects and measure theory is the abstract theory of how to define metrics on sets. Ultimately, probability is just a measure on a set with some special properties.

Probability Theory: Independence, Exchangeability, Martingalesby Chow and Teicher. Yes, probability is just a measure on sets, but this tour-de-force of a book explains the unique measure-theoretic properties of probability. This book shows you how mathematicians think of probability. I’m guessing you know all about independent random variables. Do you know about exchangeability? Ever used bag of words representations in NLP or computer vision. Why do they work? Why does Q-learning converge? You need to understand the other two foundations of probability theory.

For my last book, I’ll chooseThe Topology of Fiber Bundlesby Steenrod. These are ways of parameterizing spaces, and manifolds and Euclidean geometry are special types of fiber bundles. Let’s take the Earth’s surface as a fiber bundle. At each point on the surface, the set of tangents form a second space. The first space, the surface of the Earth, parameterizes the second space of tangents at each point. Ergo, we have a tangent bundle, a special case of fiber bundles. Today’s ML heavily uses the concept of manifolds. Tomorrow’s ML will likely build on fiber bundles.

Eigenvalues & Eigenvectors (valeurs propres et vecteurs propres) 特征值/特征向量

[ Note: “Eigen-” is German for Characteristic 特征.]

Important Trick: (see Monkeys & Coconuts Problem)
If a transformation A is linear, and the “before” state and “after” state of the “vector” v remain the same (keep the status-quo) , then : Eigenvalue

Try to compute:

is more difficult than this diagonalized equivalent matrix:

Note: This is the secret of the Google computation of diagonalized matrix of billion columns & billion rows, where all the bjk are the “PageRanks” (web links coming into a particular webpage and links going out from that webpage).

The Essence of Determinant (*): (行列式)

(*) Determinant was invented by the ancient Chinese Algebraists 李冶 / 朱世杰 /秦九韶 in 13th century (金 / 南宋 / 元) in《天元术》.The Japanese “和算” mathematician 关孝和 spread it further to Europe before the German mathematician Leibniz named it the “Determinant” in 18th century. The world, however, had to wait till the 19th century to discover the theory of Matrix 矩阵 by JJ Sylvester (Statistical Math private Tutor of Florence Nightingale, the world’s first nurse) closely linked to the application of Determinant.

Linear transformations are intuitively those maps of everyday space which preserve “linear” things. Specifically, they send lines to lines, planes to planes, etc., and they preserve the origin.
(One which does not preserve the origin is very similar but has a different name; see Affine Transformation)

This would lead us to see that: (-1 1 -1).A = (-5 5 -5) = 5. (-1 1 -1)
So, in this example, the eigenvalue 5 has different left and right eigenvectors:
(-1 1 -1) & (1 1 1) respectively.

Remark 1: However, the nice fact about matrices is that always : left eigenvalue = right eigenvalue.
So we just simply call eigenvalue for short.

I find Khan Linear Algebra video excellent. The founder / teacher Sal Khan has the genius to explain this not-so-easy topic in modular videos steps by steps, from 2-dimensional vectors to 3-dimensional, working with you by hand to compute eigenvalues and eigenvectors, and show you what they mean in graphic views.

If you are taking Linear Algebra course in university, or revising it, just go through all the Khan’s short (5-20 mins) videos on Linear Algebra here:

Google Search Engine & Linear Algebra:
1) Let M(nxn) matrix of size n (say 1 billion) web pages:

Note: This PageRank of 0 & 1 is over-simplied. The actual PageRank is a fuzzy number between 0 and 1, based on Larry Page’s patented PageRank formula, taking into accounts of the importance of the pages linked from and to, plus many other factors.

2) Let v(n) eigenvector of n webpages’ PageRank ak:

(all Page j pageRanks)
The page k pointed to by all pages j.

3) Let λ eigenvalue: M.v =λ.v

4) Iterate n times:

=> page k is ranked more important if many important pages j point to it;
& pages j themselves pointed by other important pages, …(iterate n times).
=> highest ranked pages appear first in Search list.

Solution 2: Use Linear Algebra Eigenvalue equation: A.X = λ.X

A =S(x)= where x = coconuts

S(x)=λx

Since each iteration of the transformation caused the coconut status ‘unchanged’, which means λ = 1 (see remark below)

We get
x = – 4

Also by recursive, after the fifth monkey: =

=

divides (x)

Minimum positive x= – 4 mod ( )= = 3,121 [QED]

Note: The meaning of eigenvalue λ in linear transformation is the change by a scalar of λ factor (lengthening or shortening by λ) after the transformation. Here

λ = 1 because “before” and “after” (transformation A) is the SAME status (“divide coconuts by 5 and left 1”).