Tuesday, July 29, 2014

Conditional Probabilities in "Thinking Fast and Slow"

I'm currently reading Thinking Fast and Slow by Daniel Kahneman. (Thanks to Elianna for letting me borrow it.) I'm not finished yet, but 60% of the way through I definitely recommend it.

While reading the "Causes Trump Statistics" chapter (number 16), there is an description of a study about cabs and hit-and-run accidents. It describes a scenario where participants are told that 85% of cabs are Green, 15% are Blue and a given observer has an 80% chance of correctly identifying the color of a given cab. Given this data, the chapter presents a scenario where a bystander identifies a cab in an accident as Blue and Kahneman goes on to explain how we fail to take the data into consideration. I really enjoyed this chapter, but won't wreck the book for you.

Instead, I want to do some math (big surprise, I know). However, I want to make it accessible to non-mathematicians (atypical for my posts).

Given the data, Kahneman tells us that the true probability that the cab was Blue is 41%, though we likely bias our thinking towards the 80% probability of the identification being correct. I was on the bus when I read this, and the claim kept bothering me, to the point that I couldn't continue reading. Eventually I figured it out (when I got to the train) and I wanted to explain how this is computed using Bayes' Law. As a primer, I wrote a post using layman's terms explaining how we use Bayes' Law. (There is some notation introduced but I hope it isn't too confusing.)

Putting Bayes' Law to Use

We need to understand what 41% even corresponds to before we can compute it. What's actually happened is that we know the event \(IDB\) has occurred -- the cab has been identified (\(ID\)) as Blue (\(B\)). What we want is the probability that the cab is Blue given we know it has been identified -- we want:
\[\text{Pr}(B \mid IDB).\] Using Bayes' Law, we can write
\[\text{Pr}(B \mid IDB) = \frac{\text{Pr}(B \text{ and } IDB \text{ both occur})}{\text{Pr}(IDB)} \quad \text{and} \quad \text{Pr}(IDB \mid B) = \frac{\text{Pr}(B \text{ and } IDB \text{ both occur})}{\text{Pr}(B)}.\] We're told that a cab can be correctly identified 80% of the time hence
\[\text{Pr}(IDB \mid B) = 0.8\] (i.e. the probability of correct ID as Blue given it is actually Blue). We're also told that 15% of the cabs are Blue hence
\[\text{Pr}(B) = 0.15.\] We can combine these with the second application of Bayes' Law above to show that
\[\text{Pr}(B \text{ and } IDB \text{ both occur}) = \text{Pr}(IDB \mid B) \cdot \text{Pr}(B) = 0.8 \cdot 0.15 = 0.12.\] The only piece of data missing now to finish our computation is \(\text{Pr}(IDB)\).

Using the extended form of Bayes' Law, since we know that the events \(B\) and \(G\) (the cab is Blue or Green) are exclusive and cover all possibilities for the cab, we can say that
\[\text{Pr}(IDB) = \text{Pr}(IDB \mid B) \cdot \text{Pr}(B) + \text{Pr}(IDB \mid G) \cdot \text{Pr}(G).\] Since there is only an 80% chance of correct identification, we know that \(\text{Pr}(IDB \mid G) = 0.2\) (the probability of misidentifying a Green cab as Blue). We also know that 85% of the cabs are Green hence we can plug these in (along with numbers already computed) to get
\[\text{Pr}(IDB) = 0.8 \cdot 0.15 + 0.2 \cdot 0.85 = 0.12 + 0.17 = 0.29.\] Putting it all together we get our answer
\[\text{Pr}(B \mid IDB) = \frac{\text{Pr}(B \text{ and } IDB \text{ both occur})}{\text{Pr}(IDB)} = \frac{0.12}{0.29} = \boxed{\frac{12}{29} \approx 0.413793103}.\] Fantastic! Now we can get back to reading...
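(Well, almost back to reading.) For anyone who wants to check the arithmetic mechanically, here is a quick Python sketch of the computation above (my own translation, not anything from the book):
p_blue = 0.15        # Pr(B): 15% of cabs are Blue
p_green = 0.85       # Pr(G): 85% of cabs are Green
p_correct_id = 0.8   # Pr(IDB | B), and likewise Pr(IDG | G)

# Pr(B and IDB both occur) = Pr(IDB | B) * Pr(B)
p_b_and_idb = p_correct_id * p_blue  # 0.12

# Pr(IDB) = Pr(IDB | B) * Pr(B) + Pr(IDB | G) * Pr(G)
p_idb = p_correct_id * p_blue + (1 - p_correct_id) * p_green  # 0.29

print(p_b_and_idb / p_idb)  # 0.41379..., i.e. 12/29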

Bayes' Law Primer

I'm currently writing a blog post that uses Bayes' Law but don't want to muddy that post with a review in layman's terms. So that I have something to link to, here is a short description and a chance to flex my teaching muscles before the school year starts.

Bayes' Law

For those who aren't sure, Bayes' Law tells us how to compute the probability that event \(X\) occurs given that we know event \(Y\) has occurred. This probability is written \(\text{Pr}(X \mid Y)\), where the vertical bar is read as the word "given"; in other words, the event \(X\) is distinct from the event \(X \mid Y\) (\(X\) given \(Y\)). Bayes' Law states that
\[\text{Pr}(X \mid Y) = \frac{\text{Pr}(X \text{ and } Y \text{ both occur})}{\text{Pr}(Y)}.\]
This effectively is a re-scaling of the events by the total probability of the given event: \(\text{Pr}(Y)\).

For example, suppose \(X\) is the event that a \(3\) is rolled on a fair die and \(Y\) is the event that the roll is odd. We know of course that \(\text{Pr}(Y) = \frac{1}{2}\) since half of the rolls are odd. The event \(X \text{ and } Y \text{ both occur}\) in this case is the same as \(X\), since a roll of \(3\) is always odd. Thus
\[\text{Pr}(X \text{ and } Y \text{ both occur}) = \frac{1}{6}\]
and we can compute the conditional probability
\[\text{Pr}(X \mid Y) = \frac{\frac{1}{6}}{\frac{1}{2}} = \frac{1}{3}.\]
As we expect, one out of every three odd rolls is a \(3\).
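To see this empirically, here is a quick simulation sketch: we estimate \(\text{Pr}(X \mid Y)\) by throwing away every roll where \(Y\) didn't occur and counting how often \(X\) happens among the rest:
import random

trials = 10**6
odd_rolls = 0          # number of times Y occurred
threes_among_odd = 0   # number of times X occurred among those

for _ in range(trials):
    roll = random.randint(1, 6)  # roll a fair six-sided die
    if roll % 2 == 1:            # condition on Y: the roll is odd
        odd_rolls += 1
        if roll == 3:            # X: the roll is a 3
            threes_among_odd += 1

print(threes_among_odd / float(odd_rolls))  # approximately 1/3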

Bayes' Law Extended Form

Instead of considering a single event \(Y\), we can consider a range of \(n\) possible events \(Y_1, Y_2, \ldots, Y_n\) that may occur. We require that exactly one of these \(Y\)-events must occur; that is, they are mutually exclusive and cover all possible events that could occur. For example, \(Y_1\) is the event that H2O is vapor, \(Y_2\) is the event that H2O is water and \(Y_3\) is the event that H2O is ice.

In such a case, since the \(Y\)-events are mutually exclusive and exhaustive, we know that
\[\text{Pr}(X) = \text{Pr}(X \text{ and } Y_1 \text{ both occur}) + \text{Pr}(X \text{ and } Y_2 \text{ both occur}) + \text{Pr}(X \text{ and } Y_3 \text{ both occur}).\]
Using Bayes' law, we can reinterpret as
\[\text{Pr}(X \text{ and } Y_j \text{ both occur}) =  \text{Pr}(X \mid Y_j) \cdot \text{Pr}(Y_j)\]
and the above becomes
\[\text{Pr}(X) = \text{Pr}(X \mid Y_1) \cdot \text{Pr}(Y_1) + \text{Pr}(X \mid Y_2) \cdot \text{Pr}(Y_2) + \text{Pr}(X \mid Y_3) \cdot \text{Pr}(Y_3).\]
The same is true if we replace \(3\) with an arbitrary number of events \(n\).
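As a toy numeric check of the extended form, here is a sketch using the die from before with three exclusive and exhaustive \(Y\)-events:
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}
x = {2, 4, 6}                  # X: the roll is even
ys = [{1, 2}, {3, 4}, {5, 6}]  # Y_1, Y_2, Y_3: exclusive, cover everything

def pr(event):
    return Fraction(len(event), len(outcomes))

def pr_given(event, given):
    return Fraction(len(event & given), len(given))

total = sum(pr_given(x, y) * pr(y) for y in ys)
print(total == pr(x))  # True: both sides equal 1/2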

Monday, November 25, 2013

Trigonometry and Nested Radicals

Early last month, I was chatting with one of my officemates about a curious problem I had studied in high school. I hadn't written any of the results down, so much of the discussion involved me rediscovering the results and proving them with much more powerful tools than I once possessed.

Before writing about the problem I had played around with, I want to give a brief motivation. For as long as humans have been doing mathematics, computing the value of \(\pi\) has been deemed worthwhile (or every generation has just found it worthwhile to waste time computing its digits).

One way the Greeks (particularly Archimedes) computed \(\pi\) was by approximating a circle by a regular polygon and letting the number of sides grow large enough so that the error between the area of the unit circle (\(\pi \cdot 1^2\)) and the area of the polygon would be smaller than some fixed threshold. Usually these thresholds were picked to ensure that the first \(k\) digits were fully accurate (for some appropriate value of \(k\)).

In many introductory Calculus courses, this problem is introduced exactly when the limit is introduced and students are forced to think about the area problem in the regular polygon:
Given \(N\) sides, the area is \(N \cdot T_N\) where \(T_N\) is the area of each individual triangle given by one side of the polygon and the circumcenter.

Call one such triangle \(\Delta ABC\) and let \(BC\) be the side that is also a side of the polygon while the other sides have \(\left|AB\right| = \left|AC\right| = 1\) since the polygon is inscribed in a unit circle. The angle \(\angle BAC = \frac{2\pi}{N}\) since each of the triangles has the same internal angle and there are \(N\) of them. If we can find the perpendicular height \(h\) from \(AB\) to \(C\), the area will be \(\frac{1}{2} h \left|AB\right| = \frac{h}{2}\). But we also know that
\[\sin\left(\angle BAC\right) = \frac{h}{\left|AC\right|} \Rightarrow h = \sin\left(\frac{2\pi}{N}\right).\] Combining all of these, we can approximate \(\pi\) with the area:
\[\pi \approx \frac{N}{2} \sin\left(\frac{2\pi}{N}\right) = \pi \frac{\sin\left(\frac{2\pi}{N}\right)}{\frac{2 \pi}{N}}. \] As I've shown my Math 1A students, we see that
\[\lim_{N \to \infty} \pi \frac{\sin\left(\frac{2\pi}{N}\right)}{\frac{2 \pi}{N}} = \pi \lim_{x \to 0} \frac{\sin(x)}{x} = \pi\] so these are indeed good approximations.
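Here is a quick numeric sketch of these approximations. (It cheats by using the library's sine, which is exactly what Archimedes lacked; the next section removes that crutch.)
import math

for n in (4, 8, 16, 32, 64, 1024):
    area = (n / 2.0) * math.sin(2 * math.pi / n)
    print(n, area, math.pi - area)  # the error shrinks as n grows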

Theory is Nice, But I Thought We Were Computing Something

Unfortunately for us (and Archimedes), computing \(\sin\left(\frac{2\pi}{N}\right)\) is not quite as simple as dividing by \(N\), so often special values of \(N\) were chosen. In fact, starting from \(N\) and then using \(2N\), the areas could be computed via a special way of averaging the previous areas. Lucky for us, such a method is equivalent to the trusty half angle identities (courtesy of Abraham De Moivre). To keep track of these polygons with a power of two as the number of sides, we call \(A_n = \frac{2^n}{2} \sin\left(\frac{2\pi}{2^n}\right)\).

Starting out with the simplest polygon, the square with \(N = 2^2\) sides, we have
\[A_2 = 2 \sin\left(\frac{\pi}{2}\right) = 2.\] Jumping to the octagon (no not that "The Octagon"), we have
\[A_3 = 4 \sin\left(\frac{\pi}{4}\right) = 4 \frac{\sqrt{2}}{2} = 2 \sqrt{2}.\] So far, the toughest thing we've had to deal with is a \(45^{\circ}\) angle, and we haven't yet had to lean on Abraham (him, not him) for help. The hexadecagon wants to change that:
\[A_4 = 8 \sin\left(\frac{\pi}{8}\right) = 8 \sqrt{\frac{1 - \cos\left(\frac{\pi}{4}\right)}{2}} = 8 \sqrt{\frac{2 - \sqrt{2}}{4}} = 4 \sqrt{2 - \sqrt{2}}.\]
To really drill home the point (and motivate my next post) we'll compute this for the \(32\)-gon (past the point where polygons have worthwhile names):
\[A_5 = 16 \sin\left(\frac{\pi}{16}\right) = 16 \sqrt{\frac{1 - \cos\left(\frac{\pi}{8}\right)}{2}}.\] Before, we could rely on the fact that we knew what a \(45\)-\(45\)-\(90\) triangle looked like, but now we come across \(\cos\left(\frac{\pi}{8}\right)\), a value which we haven't seen before. Luckily, Abraham has help here as well:
\[\cos\left(\frac{\pi}{8}\right) = \sqrt{\frac{1 + \cos\left(\frac{\pi}{4}\right)}{2}} = \sqrt{\frac{2 + \sqrt{2}}{4}} = \frac{1}{2} \sqrt{2 + \sqrt{2}}\] which lets us compute
\[A_5 = 16 \sqrt{\frac{1 - \frac{1}{2} \sqrt{2 + \sqrt{2}}}{2}} = 8 \sqrt{2 - \sqrt{2 + \sqrt{2}}}.\]

So why have I put you through all this? If we wave our hands like a magician, we can see this pattern continues and for the general \(n\)
\[A_n = 2^{n - 2} \sqrt{2 - \sqrt{2 + \sqrt{2 + \sqrt{\cdots + \sqrt{2}}}}}\]
where there are \(n - 3\) nested radicals, the interior ones joined by plus signs, and only one minus sign at the beginning.

This motivates us to study two questions: what is the limiting behavior of such a nested radical
\[\sqrt{2 + s_1 \sqrt{2 + s_2 \sqrt{ \cdots }}}\] as the signs \(s_1, s_2, \ldots\) take values in \(\left\{-1, 1\right\}\), and how quickly does it converge? Recasting in terms of the discussion above, we want to know how close we are to \(\pi\) as we increase the number of sides.
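Before tackling general sign sequences, here is a quick Python sketch of \(A_n\) built from square roots alone, following the pattern above (valid for \(n \geq 4\)):
import math

def polygon_area(n):
    """A_n for the 2**n-gon, using only square roots (requires n >= 4)."""
    inner = math.sqrt(2)
    for _ in range(n - 4):     # build sqrt(2 + sqrt(2 + ... + sqrt(2)))
        inner = math.sqrt(2 + inner)
    return 2 ** (n - 2) * math.sqrt(2 - inner)

for n in range(4, 14):
    print(n, polygon_area(n), math.pi - polygon_area(n))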

When I was in high school, I just loved to nerd out on any and all math problems, so I studied this just for fun. Having heard about the unfathomable brain of Ramanujan and the fun work he had done with infinitely nested radicals, I wanted to examine which sequences of signs \((s_1, s_2, \ldots)\) produced an infinite radical that converged and what the convergence behavior was.

I'm fairly certain my original questions came from an Illinois Council of Teachers of Mathematics (ICTM) contest problem along the lines of
\[\text{Find the value of the infinite nested radical } \sqrt{2 + \sqrt{2 + \cdots}}\] or maybe the slightly more difficult \[\text{Find the value of the infinite nested radical } \sqrt{2 - \sqrt{2 + \sqrt{2 - \sqrt{2 + \cdots}}}}.\] Armed with my TI-83, I set out to do some hardcore programming and figure it out. It took me around a month of off-and-on tinkering. This second time around as a mathematical grown-up, it took me the first half of a plane ride from SFO to Dallas.

In the next few weeks/months, I hope to write a few blog posts, including math, proofs and some real code on what answers I came up with and what other questions I have.

Tuesday, September 10, 2013

Calculating a Greatest Common Divisor with Dirichlet's Help

Having just left Google and started my PhD in Applied Mathematics at Berkeley, I thought it might be appropriate to write some (more) math-related blog posts. Many of these posts I jotted down on napkins and in various other places on the web and just haven't had time to post until now.

For today, I'm posting a result which was somewhat fun to figure out with/for one of my buddies from Michigan Math. I'd also like to point out that he is absolutely kicking ass at Brown.

While trying to determine if
\[J(B_n)_{\text{Tor}}\left(\mathbb{Q}\right) \stackrel{?}{=} \mathbb{Z}/2\mathbb{Z}, \] where \(J(B_n)\) is the Jacobian of the curve \(B_n\) given by \(y^2 = (x + 2) \cdot f^n(x)\), \(f^n\) denotes \(\underbrace{f \circ \cdots \circ f}_{n \text{ times}}\), and \(f(x) = x^2 - 2\), my buddy ran into the number theory problem below.

Now, his and my interests diverged some time ago, so I can't appreciate what steps took him from this to the problem I got to help with. However, he was able to show (trivially maybe?) that this was equivalent to showing that
\[\gcd\left(5^{2^n} + 1, 13^{2^n} + 1, \ldots, p^{2^n} + 1, \ldots \right) = 2 \qquad (1)\] where the \(n\) in the exponents is the same as that in \(B_n\) and where the values we are using in our greatest common divisor (e.g. \(5, 13\) and \(p\) above) are all of the primes \(p \equiv 5 \bmod{8}\).

My buddy, being sadistic and for some reason angry with me, passed along the stronger statement:
\[\gcd\left(5^{2^n} + 1, 13^{2^n} + 1\right) = 2 \qquad (2)\] which I of course struggled with and tried to beat down with tricks like \(5^2 + 12^2 = 13^2\). After a few days of this struggle, he confessed that he was trying to ruin my life and told me about the weaker version \((1)\).

When he sent me the email informing me of this, I read it at 8am, drove down to Santa Clara for PyCon, and by the time I arrived at 8:45am I had figured the weaker case \((1)\) out. This felt much better than the days of struggle and made me want to write about my victory (which I'm doing now). Though, before we actually demonstrate the weaker fact \((1)\), I will admit that I am not in fact tall. Instead I stood on the shoulders of Dirichlet and called myself tall. Everything else is bookkeeping.

Let's Start the Math

First, if \(n = 0\), we see trivially that
\[\gcd\left(5^{2^0} + 1, 13^{2^0} + 1\right) = \gcd\left(6, 14\right) = 2\] and all the remaining terms are divisible by \(2\) hence the \(\gcd\) over all the primes must be \(2\).

Now, if \(n > 0\), we will show that \(2\) divides our \(\gcd\), but \(4\) does not and that no odd prime can divide this \(\gcd\). First, for \(2\), note that
\[p^{2^n} + 1 \equiv \left(\pm 1\right)^{2^n} + 1 \equiv 2 \bmod{4}\] since our primes are odd. Thus they are all divisible by \(2\) and none by \(4\).

Now assume some odd prime \(p^{\ast}\) divides all of the quantities in question. We'll show no such \(p^{\ast}\) can exist by contradiction.

In much the same way we showed the \(\gcd\) wasn't divisible by \(4\), we seek to find a contradiction in some modulus. But since we are starting with \(p^{2^n} + 1 \equiv 0 \bmod{p^{\ast}}\), if we can find some such \(p\) with \(p \equiv 1 \bmod{p^{\ast}}\), then we'd have our contradiction from
\[0 \equiv p^{2^n} + 1 \equiv 1^{2^n} + 1 \equiv 2 \bmod{p^{\ast}}\] which can't occur since \(p^{\ast}\) is an odd prime.

With this in mind, and since our primes form a subsequence of the arithmetic progression \(\left\{5, 13, 21, 29, \ldots\right\}\), it seems that using Dirichlet's theorem on arithmetic progressions may be a good strategy. However, this progression only tells us about the residue modulo \(8\), while we also want to control the residue modulo \(p^{\ast}\). Naturally, we look for a subsequence in
\[\mathbb{Z}/8\mathbb{Z} \times \mathbb{Z}/p^{\ast}\mathbb{Z}\] corresponding to the residue pair \((5 \bmod{8}, 1 \bmod{p^{\ast}})\). Due to the Chinese remainder theorem this corresponds to a unique residue modulo \(8p^{\ast}\).

Since this residue \(r\) has \(r \equiv 1 \bmod{p^{\ast}}\), we must have
\[r \in \left\{1, 1 + p^{\ast}, 1 + 2p^{\ast}, \ldots, 1 + 7p^{\ast}\right\} .\] But since \(1 + kp^{\ast} \equiv r \equiv 5 \bmod{8}\), we have \(kp^{\ast} \equiv 4 \bmod{8}\) and \(k \equiv 4\left(p^{\ast}\right)^{-1} \bmod{8}\) since \(p^{\ast}\) is odd and invertible mod \(8\). But this also means its inverse is odd, hence \(k \equiv 4\cdot(2k' + 1) \equiv 4 \bmod{8}\). Thus we have \(1 + 4 p^{\ast} \in \mathbb{Z}/8p^{\ast}\mathbb{Z}\) corresponding to our residue pair. Thus every element in the arithmetic progression \(S = \left\{(1 + 4p^{\ast}) + (8p^{\ast})k \right\}_{k=0}^{\infty}\) is congruent to \(1 + 4 p^{\ast} \bmod{8p^{\ast}}\) and hence \(5 \bmod{8}\) and \(1 \bmod{p^{\ast}}\).
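As a quick numeric sanity check of this residue (a tiny sketch, nothing deep): for any odd prime \(p^{\ast}\), the number \(1 + 4p^{\ast}\) is indeed \(5 \bmod{8}\) and \(1 \bmod{p^{\ast}}\):
for p_star in (3, 5, 7, 11, 13, 97):  # a few odd primes
    r = 1 + 4 * p_star
    assert r % 8 == 5 and r % p_star == 1
    print(p_star, r)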

What's more, since \(5 \in \left(\mathbb{Z}/8\mathbb{Z}\right)^{\times}\) and \(1 \in \left(\mathbb{Z}/p^{\ast}\mathbb{Z}\right)^{\times}\), we have \(1 + 4 p^{\ast} \in \left(\mathbb{Z}/8p^{\ast}\mathbb{Z}\right)^{\times}\) (again by the Chinese remainder theorem). Thus the arithmetic progression \(S\) satisfies the hypotheses of Dirichlet's theorem. Hence there must be at least one prime \(p\) occurring in the progression (since there are infinitely many). But that also means \(p\) occurs in \(\left\{5, 13, 29, 37, \ldots\right\}\), hence we've reached our desired contradiction. RAA.

Now What?

We still don't know if the strong version \((2)\)
\[\gcd\left(5^{2^n} + 1, 13^{2^n} + 1\right) = 2\]
holds. By similar arguments as above, if any odd prime \(p^{\ast}\) divides this \(\gcd\), then we have
\[5^{2^n} \equiv -1 \bmod{p^{\ast}},\] hence \(5\) has order \(2^{n + 1}\) in \(\left(\mathbb{Z}/p^{\ast}\mathbb{Z}\right)^{\times}\). This means the order of the multiplicative group, \(\varphi\left(p^{\ast}\right) = p^{\ast} - 1\), is divisible by \(2^{n + 1}\). Beyond that, who knows? We're still thinking about it (but only passively, more important things to do).
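In the meantime, numerical evidence is easy to gather. Here is a quick experiment (a sketch, not a proof) computing the weak gcd \((1)\) over the primes \(p \equiv 5 \bmod{8}\) below \(500\) alongside the strong gcd \((2)\) for small \(n\):
from math import gcd

def is_prime(m):
    return m > 1 and all(m % d for d in range(2, int(m ** 0.5) + 1))

# Primes p = 5 (mod 8), as in the weak statement (1).
primes = [p for p in range(5, 500) if p % 8 == 5 and is_prime(p)]

for n in range(4):
    e = 2 ** n
    weak = 0  # gcd(0, x) == x, so this accumulates a running gcd
    for p in primes:
        weak = gcd(weak, p ** e + 1)
    strong = gcd(5 ** e + 1, 13 ** e + 1)
    print(n, weak, strong)  # both gcds are 2 for each n here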

Sunday, August 18, 2013

Some Fibonacci Fun with Primes

I haven't written in way too long and just wanted to post this fun little proof.

Assertion: Let \(F_n\) be the \(n\)th Fibonacci number, defined by \(F_n = F_{n-1} + F_{n-2}\), \(F_0 = 0, F_1 = 1\). Show that for an odd prime \(p \neq 5\), we have \(p\) divides \(F_{p^2 - 1}\).

Proof: We do this by working inside \(\mathbb{F}_p\) instead of working in \(\mathbb{R}\). The recurrence is given by

\[ \left( \begin{array}{cc}
1 & 1 \\
1 & 0 \end{array} \right)
\left( \begin{array}{c}
F_{n-1} \\
F_{n-2} \end{array} \right)
=
\left( \begin{array}{c}
F_{n-1} + F_{n-2} \\
F_{n-1} \end{array} \right)
=
\left( \begin{array}{c}
F_n \\
F_{n-1} \end{array} \right)\] and in general
\[ \left( \begin{array}{cc}
1 & 1 \\
1 & 0 \end{array} \right)^{n}
\left( \begin{array}{c}
1 \\
0 \end{array} \right)
=
\left( \begin{array}{cc}
1 & 1 \\
1 & 0 \end{array} \right)^{n}
\left( \begin{array}{c}
F_1 \\
F_0 \end{array} \right)
=
\left( \begin{array}{c}
F_{n + 1} \\
F_n \end{array} \right)\] The matrix \(A = \left(\begin{array}{cc} 1 & 1 \\ 1 & 0 \end{array} \right)\) has characteristic polynomial
\[\chi_A(t) = (1 - t)(0 - t) - (1)(1) = t^2 - t - 1.\] If this polynomial has distinct roots, then \(A\) is diagonalizable (this is sufficient, but not necessary). Assuming the contrary, we have \(\chi_A(t) = (t - \alpha)^2\) for some \(\alpha \in \mathbb{F}_p\); we can assume \(\alpha \in \mathbb{F}_p\) since \(-2\alpha = -1\) is the coefficient of \(t\), which means \(\alpha = 2^{-1}\) (we are fine with this since \(p\) odd means that \(2^{-1}\) exists). In order for this to be a root of \(\chi_A\), we must have
\[0 \equiv 4 \cdot \chi_A\left(2^{-1}\right) \equiv 4 \cdot \left(2^{-2} - 2^{-1} - 1\right) \equiv 1 - 2 - 4 \equiv -5 \bmod{p}.\] Since \(p \neq 5\) is prime, this is not possible, hence we reached a contradiction and \(\chi_A\) does not have a repeated root.

Thus we may write \(\chi_A(t) = (t - \alpha)(t - \beta)\) for \(\alpha, \beta \in \mathbb{F}_{p^2}\) (it's possible that \(\chi_A\) is irreducible over \(\mathbb{F}_p\), but due to degree considerations it must split completely over \(\mathbb{F}_{p^2}\)). Using this, we may write

\[A = P \left(\begin{array}{cc} \alpha & 0 \\ 0 & \beta \end{array} \right) P^{-1}\] for some \(P \in GL_{2} \left(\mathbb{F}_{p^2}\right)\) and so

\[A^{p^2 - 1} = P \left(\begin{array}{cc} \alpha & 0 \\ 0 & \beta \end{array} \right)^{p^2 - 1} P^{-1}
= P \left(\begin{array}{cc} \alpha^{p^2 - 1}  & 0 \\ 0 & \beta^{p^2 - 1}  \end{array} \right)P^{-1}\] Since \(\chi_A(0) = 0 - 0 - 1 \neq 0\) we know \(\alpha\) and \(\beta\) are nonzero, hence \(\alpha^{p^2 - 1} = \beta^{p^2 - 1} = 1 \in \mathbb{F}_{p^2} \). Thus \(A^{p^2 - 1} = P I_2 P^{-1} = I_2\) and so

\[\left( \begin{array}{c}
F_{p^2} \\
F_{p^2 - 1} \end{array} \right)
=
\left( \begin{array}{cc}
1 & 1 \\
1 & 0 \end{array} \right)^{p^2 - 1}
\left( \begin{array}{c}
1 \\
0 \end{array} \right)
=
I_2 \left( \begin{array}{c}
1 \\
0 \end{array} \right)
=
\left( \begin{array}{c}
1 \\
0 \end{array} \right)\] so we have \(F_{p^2 - 1} = 0\) in \(\mathbb{F}_p\) as desired.
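As a quick empirical check of the assertion, here is a small sketch that computes Fibonacci numbers mod \(p\), reducing at every step so the numbers stay small:
def fib_mod(n, m):
    """Return F_n modulo m by iterating the recurrence."""
    a, b = 0, 1  # F_0, F_1
    for _ in range(n):
        a, b = b, (a + b) % m
    return a

for p in (3, 7, 11, 13, 17, 19, 23):  # odd primes != 5
    print(p, fib_mod(p * p - 1, p))   # the second entry is always 0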

Monday, December 24, 2012

Bridging OAuth 2.0 objects between GData and Discovery

My colleague +Takashi Matsuo and I recently gave a talk about using OAuth2Decorator (from the google-api-python-client library) with request handlers in Google App Engine. Shortly after, a Stack Overflow question sprung up asking about the right way to use the decorator and, as a follow-up, whether the decorator could be used with the Google Apps Provisioning API. As I mentioned in my answer,
The Google Apps Provisioning API is a Google Data API...As a result, you'll need to use the gdata-python-client library to use the Provisioning API. Unfortunately, you'll need to manually convert from an oauth2client.client.OAuth2Credentials object to a gdata.gauth.OAuth2Token object to use the same token for either one.
Instead of making everyone and their brother write their own, I thought I'd take a stab at it and write about it here. The general philosophy I took was that the token subclass should be 100% based on an OAuth2Credentials object:
  • the token constructor simply takes an OAuth2Credentials object
  • the token refresh updates the OAuth2Credentials object set on the token
  • values of the current token can be updated directly from the OAuth2Credentials object set on the token
Starting from the top, we'll use two imports:
import httplib2
from gdata.gauth import OAuth2Token
The first is needed to refresh an OAuth2Credentials object using the mechanics native to google-api-python-client, and the second is needed so we may subclass the gdata-python-client native token class.

As I mentioned, the values should be updated directly from an OAuth2Credentials object, so in our constructor, we first initialize the values to None and then call our update method to actually set the values. This allows us to write less code, because repeating yourself is bad (I think someone told me that once?).
class OAuth2TokenFromCredentials(OAuth2Token):
  def __init__(self, credentials):
    self.credentials = credentials
    super(OAuth2TokenFromCredentials, self).__init__(None, None, None, None)
    self.UpdateFromCredentials()
We can get away with passing four Nones to the superclass constructor, as it only has four positional arguments: client_id, client_secret, scope, and user_agent. Three of those have equivalents on the OAuth2Credentials object, but there is no place for scope because that part of the token exchange is handled elsewhere (by OAuth2WebServerFlow) in the google-api-python-client library.
  def UpdateFromCredentials(self):
    self.client_id = self.credentials.client_id
    self.client_secret = self.credentials.client_secret
    self.user_agent = self.credentials.user_agent
    ...
Similarly, the OAuth2Credentials object only implements the refresh part of the OAuth 2.0 flow, so it only has the token URI; hence auth_uri, revoke_uri and redirect_uri will not be set either. However, the token URI and the token data are the same for both.
    ...
    self.token_uri = self.credentials.token_uri
    self.access_token = self.credentials.access_token
    self.refresh_token = self.credentials.refresh_token
    ...
Finally, we copy the extra fields which may be set outside of a constructor:
    ...
    self.token_expiry = self.credentials.token_expiry
    self._invalid = self.credentials.invalid
Since OAuth2Credentials doesn't deal with all parts of the OAuth 2.0 process, we disable those methods from OAuth2Token that do.
  def generate_authorize_url(self, *args, **kwargs): raise NotImplementedError
  def get_access_token(self, *args, **kwargs): raise NotImplementedError
  def revoke(self, *args, **kwargs): raise NotImplementedError
  def _extract_tokens(self, *args, **kwargs): raise NotImplementedError
Finally, the last method which needs to be implemented is _refresh, which should refresh the OAuth2Credentials object and then update the current GData token after the refresh. Instead of using the passed in request object, we use one from httplib2 as we mentioned in imports.
  def _refresh(self, unused_request):
    self.credentials._refresh(httplib2.Http().request)
    self.UpdateFromCredentials()
After refreshing the OAuth2Credentials object, we can update the current token using the same method called in the constructor.

Using this class, we can simultaneously call a discovery-based API and a GData API:
import webapp2

from apiclient.discovery import build
from gdata.contacts.client import ContactsClient

service = build('calendar', 'v3', developerKey='...')

# `decorator` is the OAuth2Decorator instance discussed at the start of
# this post; it is assumed to be defined elsewhere in the module.
class MainHandler(webapp2.RequestHandler):
  @decorator.oauth_required
  def get(self):
    auth_token = OAuth2TokenFromCredentials(decorator.credentials)
    contacts_client = ContactsClient()
    auth_token.authorize(contacts_client)
    contacts = contacts_client.get_contacts()
    ...
    events = service.events().list(calendarId='primary').execute(
        http=decorator.http())
    ...

Monday, September 10, 2012

Last to Cross the Finish Line: Part Three

Recently, my colleague +Fred Sauer and I gave a tech talk called "Last Across the Finish Line: Asynchronous Tasks with App Engine". This is part three in a three-part series where I will share our learnings and give some helpful references to the App Engine documentation.

Check out the previous post if you haven't already. In this section, we'll define the PopulateBatch function and explore the ndb models and Task Queue operations that make it work.

Imports

Before defining the models and helper functions in models.py, let's first review the imports:
import json

from google.appengine.api import channel
from google.appengine.ext.deferred import defer
from google.appengine.ext import ndb
Again, we import json and channel for serialization and message passing. We import the defer function from the deferred library to abstract away task creation and take advantage of the ability to "defer" a function call to another thread of execution. Finally, we import ndb as a means for interacting with the App Engine Datastore.

Method Wrapper Built for Tasks

As we saw in the BeginWork handler in part two, units of work are passed to PopulateBatch as 3-tuples containing a method, the positional arguments and the keyword arguments to that method.

In order to keep our task from hanging indefinitely due to unseen errors and to implicitly include the work unit in the batch, we define a wrapper around these method calls:
def AlwaysComplete(task, method, *args, **kwargs):
  try:
    method(*args, **kwargs)
  except:  # TODO: Consider failing differently.
    pass
  finally:
    defer(task.Complete)
As you can see, we catch any and all errors thrown by our method and don't retry the method if it fails. In our example, if the call method(*args, **kwargs) fails, the data won’t be sent through the channel and the given square will not show up in the quilt. However, since we catch these exceptions, the batch will complete and the spinner will disappear with this square still missing.

This part is likely going to be customized to the specific work involved, but for our case, we didn't want individual failures to cause the whole batch to fail. In addition, we implicitly link the work unit with a special type of task object in the datastore.

In the finally section of the error catch, we defer the Complete method on the task corresponding to this work unit. We defer the call to this complete method in order to avoid any errors (possibly from a failed datastore action) that the method may cause. If it were to throw an error, since AlwaysComplete is called in a deferred task, the task would retry and our unit of work would execute (or fail) again, which is bad if our user interface is not idempotent.

Task Model

As we saw above, we need a datastore model to represent tasks within a batch. We start out initially with a model having only one attribute — a boolean representing whether or not the task has completed.
class BatchTask(ndb.Model):
  # Very important that the default value True of `indexed` is used here
  # since we need to query on BatchTask.completed
  completed = ndb.BooleanProperty(default=False)
As we know, we'll need to define a Complete method in order to use the task in AlwaysComplete, but before doing so, we'll define another method which will put the task object in the datastore and pass a unit of work to AlwaysComplete:
  @ndb.transactional
  def Populate(self, method, *args, **kwargs):
    self.put()
    kwargs['_transactional'] = True
    defer(AlwaysComplete, self.key, method, *args, **kwargs)
In this Populate method, we first put the object in the datastore transactionally by using the ndb.transactional decorator. By adding the _transactional keyword to the keyword arguments, defer strips away the underscore and creates a transactional task. By doing this
"the task is only enqueued — and guaranteed to be enqueued — if the transaction is committed successfully."
We need this deferred task to be enqueued transactionally for consistency of the completed boolean attribute. The datastore put in Populate uses the default value of False, but after Complete is called we want to set this boolean to True. If this value was not consistent, we may have a race condition that resulted in a completed task in the datastore being marked as incomplete. As we'll see later, we rely on this consistency for a query that will help us determine if our batch is done.

To signal that a unit of work has completed, we define the Complete method on the task object:
  @ndb.transactional
  def Complete(self):
    self.completed = True
    self.put()

    batcher_parent = self.key.parent().get()
    defer(batcher_parent.CheckComplete, _transactional=True)
It performs two functions. First, it sets completed to True in a transaction. Second, it retrieves the parent entity of the task object and defers the CheckComplete method on this parent. As we will see in more depth in the PopulateBatch function, we use a special type of batch parent object to create an entity group containing all the worker tasks for the batch. We don't want to check if the batch has completed until the datastore put has succeeded, so we defer the call to CheckComplete transactionally, just as we did with AlwaysComplete in the Populate method.

Note: It may seem that these get calls to retrieve the parent via self.key.parent().get() are using more bandwidth than necessary. However, we are relying here on the power of ndb. Using a combination of instance caching and memcache, most (if not all) of these gets will use the cache and will not incur the cost of a round-trip to the datastore.

Batch Parent Model

Given what we rely on in BatchTask, we need to define a special type of datastore object that will act as the parent entity for a batch. Since we are going to use it to check when a batch is complete, we define the boolean attribute all_tasks_loaded to signal whether or not all worker tasks from the batch have begun. We can use this as a short circuit in our CheckComplete method (or as a guard against premature completion).
class TaskBatcher(ndb.Model):
  all_tasks_loaded = ndb.BooleanProperty(default=False, indexed=False)
To check if a batch is complete, we first determine if all tasks have loaded. If that is the case, we perform an ancestor query that simply attempts to fetch the first worker task in the entity group which has not yet completed. If such a task does not exist, we know the batch has completed, and so start to clean up the task and batch parent objects from the datastore.
  def CheckComplete(self):
    # Does not need to be transactional since it doesn't change data
    session_id = self.key.id()
    if self.all_tasks_loaded:
      incomplete = BatchTask.query(BatchTask.completed == False,
                                   ancestor=self.key).fetch(1)
      if len(incomplete) == 0:
        channel.send_message(session_id, json.dumps({'status': 'complete'}))
        self.CleanUp()
        return

    channel.send_message(session_id, json.dumps({'status': 'incomplete'}))
We again do the utmost at this step to ensure consistency by using an ancestor query:
"There are scenarios in which any pending modifications are guaranteed to be completely applied...any ancestor queries in the High Replication datastore. In both cases, query results will always be current and consistent."
After checking if a batch is complete, we need to communicate the status back to the client. We'll rely on PopulateBatch to create instances of TaskBatcher with the ID of the session corresponding to the batch as the datastore key. We send a status complete or incomplete message to the client using the session ID for the channel. In order to correctly handle these messages on the client, we'll need to update the onmessage handler (defined in part two) to account for status updates:
socket.onmessage = function(msg) {
  var response = JSON.parse(msg.data);
  if (response.status !== undefined) {
    setStatus(response.status);
  } else {
    var squareIndex = 8*response.row + response.column;
    var squareId = '#square' + squareIndex.toString();
    $(squareId).css('background-color', response.color);
  }
}
Just as the setStatus method revealed the progress spinner when work began, it will remove the spinner when the status is complete.

We'll next define the CleanUp method that is called when the batch is complete:
  def CleanUp(self):
    children = BatchTask.query(ancestor=self.key).iter(keys_only=True)
    ndb.delete_multi(children)
    self.key.delete()
This method uses the key from the batch parent to perform another ancestor query and creates an object which can iterate over all the keys of the tasks in the batch. By using the delete_multi function provided by ndb, we are able to delete these in parallel rather than waiting for each to complete. After deleting all the tasks, the batcher deletes itself and clean up is done. Since the TaskBatcher.CheckComplete spawns CleanUp in a deferred task, if the deletes time out, the task will try again until all tasks in the batch are deleted.

As a final method on TaskBatcher, we define something similar to BatchTask.Populate that is triggered after all workers in the batch have been added:
  @ndb.transactional
  def Ready(self):
    self.all_tasks_loaded = True
    self.put()
    self.CheckComplete()
This simply signals that all tasks from the batch have loaded by setting all_tasks_loaded to True and calls CheckComplete in case all the tasks in the batch have already completed. This check is necessary because if all worker tasks complete before all_tasks_loaded is True, then none of the checks initiated by those tasks would signal completion. We use a transaction to avoid a race condition with the initial datastore put — a put which is a signal that all tasks have not loaded.

Populating a Batch

With our two models in hand, we are finally ready to define the PopulateBatch function used (in part two) by the BeginWork handler. We want users of this function to be able to call it directly, but don't want it to block the process they call it in, so we wrap the real function in a function that will simply defer the work:
def PopulateBatch(session_id, work):
  defer(_PopulateBatch, session_id, work)
In the actual function, we first create a TaskBatcher object using the session ID as the key and put it into the datastore using the default value of False for all_tasks_loaded. Since this is a single synchronous put, it blocks the thread of execution and we can be sure our parent is in the datastore before members of the entity group (the task objects) are created.
def _PopulateBatch(session_id, work):
  batcher_key = ndb.Key(TaskBatcher, session_id)
  batcher = TaskBatcher(key=batcher_key)
  batcher.put()
After doing this, we loop through all the 3-tuples in the passed in batch of work. For each unit of work, we create a task using the batcher as parent and then call the Populate method on the task using the method, positional arguments and keyword arguments provided in the unit of work.
  for method, args, kwargs in work:
    task = BatchTask(parent=batcher_key)
    task.Populate(method, *args, **kwargs)
Finally, to signal that all tasks in the batch have been added, we call the Ready method on the batch parent:
  batcher.Ready()
Note: This approach can cause performance issues as the number of tasks grows, since contentious puts within the entity group can cause task completions to stall or retry. I (or my colleagues) will be following up with two posts on the following topics:
  • using task tagging and pull queues to achieve a similar result, but reducing contention
  • exploring ways to extend this model to a hierarchical model where tasks may have subtasks