[Go](Go.md) is terrible with [passwords](Passwords.md).

[[Storing passwords]] in databases for the purpose of authentication should always be done using purpose-built [[Password hashing algorithms]]. These exist to prevent attackers from easily recovering passwords from a stolen database of hashes. This is achieved by making sure that hashing a password takes some combination of:

* a lot of compute instructions,
* a minimum number of cores,
* a minimum amount of RAM.

An attacker would then need to spend quite a bit of money to acquire enough CPUs and memory sticks to practically reverse the output of a one-way hashing function.

Only three high-quality password hashing functions should really be used:

* [Argon2](Argon2.md)
    * Winner of the [Password Hashing Competition](https://www.password-hashing.net)
    * Customizable cores, instructions, memory
* [bcrypt](bcrypt.md)
    * Computes a lot of instructions, specified via a cost parameter (approximately `2^(cost+1)` large, expensive computations; a typical cost is 10)
* [scrypt](scrypt.md)
    * Computes a lot of instructions, specified via an iterations parameter
    * Based on [PBKDF2](PBKDF2.md) with [SHA-256](SHA-2.md)

Clearly, then, using passwords in a system is quite expensive. Any time the system needs to store or authenticate a password, it needs to run through a lot of CPU instructions. Special care must be taken when using these functions in a system written in Go.

## What does Go do?

Go is a language and runtime designed for building high-concurrency servers. Using its [Goroutines](Goroutines.md) paradigm it's able to handle hundreds of thousands of concurrent connections at once.

Goroutines can be thought of as threads that run concurrently, but they're not exactly OS threads. They require Go's runtime and cooperative scheduler to execute properly. Essentially the compiler, runtime and scheduler create a program which intercepts blocking system calls made in each Goroutine, adds the syscall to an [epoll](epoll.md) set and runs the loop. There are as many epoll select loops as there are CPU cores on the machine, each running in a normal OS thread. Each loop iteration lets Go identify Goroutines that are no longer blocked by external events, so it runs them in sequence on the OS thread that ran the loop.

Conclusions:

1. High concurrency is bound by the number of epoll select loops.
2. Fewer instructions between runs mean higher concurrency.
3. The duration between runs is latency incurred by all Goroutines (blocked or ready).

In fact, this is why Go's scheduler is called _cooperative_. Goroutines need to play nice with each other to achieve the overarching goal of high concurrency.

A Goroutine running a password hashing function _does not cooperate_ with others. Because it executes billions of instructions, it increases the time between epoll select runs, producing these consequences:

- Each password hashing attempt induces latency on all scheduled Goroutines on the same core.
- Dependent Goroutines (B waits on A) may start experiencing pathological latencies.
- Higher latencies mean more in-kernel buffering and system resource depletion (inability to accept more connections: too many file descriptors / sockets, TCP flow control adjustments).

Thus the consequences are not limited to the password hashing Goroutines alone, but sprawl across the whole application and system.

In traditional systems this problem is practically invisible because the OS thread scheduler is _preemptive_ instead of _cooperative_. Every few milliseconds a timer interrupt drops the CPU into the kernel and another thread is scheduled to run.
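To see the lack of cooperation concretely, here is a minimal sketch, assuming `golang.org/x/crypto/bcrypt`. The exact timings depend on your machine and Go version (newer runtimes can asynchronously preempt tight loops), but on a single scheduler loop the trivial Goroutine generally waits behind the hash:

```go
package main

import (
	"fmt"
	"runtime"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	// Confine the program to a single scheduler loop so the effect
	// is easy to observe.
	runtime.GOMAXPROCS(1)

	start := time.Now()
	waited := make(chan time.Duration, 1)

	// A trivial Goroutine, standing in for a quick network handler.
	go func() { waited <- time.Since(start) }()

	// A non-cooperative Goroutine: one bcrypt hash, i.e. hundreds of
	// milliseconds of uninterrupted CPU work.
	go bcrypt.GenerateFromPassword([]byte("password1"), 12)

	// The trivial Goroutine typically reports a wait that is a large
	// fraction of the hash duration, not the microseconds it needs.
	fmt.Printf("trivial Goroutine waited %v to run\n", <-waited)
}
```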
In a bounded thread-pool network server setup, the worst case is when the pool is fully busy doing only password hashing work; only then will other workloads experience latencies due to the load. In an unbounded thread-pool setup, password hashing attempts are effectively given the least priority.

## Solution

To overcome this problem, password hashing algorithms need to be forced to cooperate. There are two options:

1. Dedicate some CPU cores to only do password hashing when necessary.
2. Yield the CPU every few milliseconds so other Goroutines can do their thing.

### Motivation

But why solve this problem at all? Aren't passwords dying?

Technically, yes. Passwords are out of fashion, yet they can still be found in many places. And not all of those places do the same things or have the same purpose. In fact, these are the major uses today:

1. [Key](Cryptographic%20keys.md) [derivation](Key%20derivation.md) from a memorable phrase.
2. Authenticating users, in two flavors:
    1. Via keyboard.
    2. Via network.

Key derivation is usually a one-off operation, so it's not a particularly motivating case, even when done in Go. Authenticating users via a keyboard (e.g. a user logging on to their computer, or using `sudo`) also occurs rarely and is often rate-limited (you can only make so many attempts before you're locked out, or you have to wait exponentially longer each time).

However, authenticating users via the network is something different entirely. Each such request to a server poses a major challenge: a disproportionately large amount of work and resources is dedicated to someone who may or may not be a legitimate user. This asymmetry gives attackers leverage to cause denial of service to operators, or to just rack up their cloud bills. As shown earlier, software written in Go is particularly susceptible to this denial of service (or worsening of service) issue.

Service operators often have two goals:

1. Great service for authenticated users. Low latencies. Great SLA.
2. Best-effort service for unauthenticated users. If the service is under attack you may not be able to get in; if that's unacceptable, don't use passwords.

But the issue with a naive implementation of a password hashing algorithm in Go affects all network requests, regardless of whether their origin is legitimate or not. Thus, the motivation for solving this problem: giving service operators a fighting chance against attackers whose goal is to disrupt service for authenticated users, or who are just on an automated [credential stuffing](Credential%20stuffing%20attack.md) spree.

Typically this problem does not really have a solution. You'll find CAPTCHAs or some form of (semi-)intelligent rate limiting used to help filter legitimate from illegitimate traffic. But, because of the CPU intensity of these algorithms, as few as 10 concurrent requests can cause issues for many operators.
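Per-request costs like the ones below can be reproduced on any machine with Go's standard benchmarking tooling. A minimal sketch, assuming `golang.org/x/crypto/bcrypt` (the exact harness behind the numbers below isn't specified):

```go
package bench

import (
	"testing"

	"golang.org/x/crypto/bcrypt"
)

// Run with: go test -bench=CompareHashAndPassword
func BenchmarkCompareHashAndPassword(b *testing.B) {
	// Cost 11, matching the measurements below.
	hash, err := bcrypt.GenerateFromPassword([]byte("password1"), 11)
	if err != nil {
		b.Fatal(err)
	}

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		// A nil error means the password matched.
		if err := bcrypt.CompareHashAndPassword(hash, []byte("password1")); err != nil {
			b.Fatal(err)
		}
	}
}
```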
Here are some metrics on hashing with `bcrypt` at a cost of 11 on some common AWS EC2 instance types:

Instance | vCPUs | Time per request | r/s for 1 s latency per vCPU
-|-:|-:|-:
t2.micro | 1 | 170 ms | 5.9
t3.small | 2 | 142 ms | 7.0
t2.medium | 2 | 168 ms | 5.9
t3.medium | 2 | 140 ms | 7.1
m5.large | 2 | 136 ms | 7.4

Let's try to quantify the probability of at least half of all quick (non-password-hashing) Goroutines incurring some password hashing latency.

$Q$ is the number of quick Goroutines ready to be scheduled. $L$ is the number of password hashing Goroutines ready to be scheduled. The scheduler (imagining it's completely fair) randomly pulls a permutation of the $Q$ and $L$ ones, and then executes them in sequence. Thus the total number of ways to randomly order the Goroutines is, unsurprisingly:

$$(Q+L)!$$

To count the orderings where $n$ quick Goroutines are scheduled at the start (and thus do not incur a latency penalty for being scheduled after one or more password hashing Goroutines), we can use this process:

When $n = 1$:

$$Q (Q - 1 + L)!$$

_because there are $Q$ choices for the first position, and the remaining $Q - 1 + L$ Goroutines may follow in any order._

When $n = 2$:

$$Q (Q - 1) (Q - 2 + L)!$$

_because there are $Q$ choices for the first position, $Q - 1$ for the second, and the remaining $Q - 2 + L$ may follow in any order._

Or, as a formula:

$$(Q-n+L)! \prod_{i=0}^{n-1}{(Q-i)}$$

The probability that $n$ quick Goroutines incur no latency (i.e. are scheduled in front of any password hashing Goroutines) is then:

$$P(n) = \frac{(Q-n+L)! \prod_{i=0}^{n-1}{(Q-i)}}{(Q+L)!}$$

Or, in the world's best calculator programming language, Ruby:

```ruby
def permutations_n_are_quick n, q, l
  (n-1).times.inject(1) do |a, i|
    a * (q-i)
  end * (q-n+l-1).times.inject(1) do |a, i|
    a * (i+1)
  end
end

def probability_n_are_quick n, q, l
  Rational(
    permutations_n_are_quick(n, q, l),
    permutations_n_are_quick(0, q, l)
  )
end
```

For multiple cores, the formula follows from this process. Imagine two cores, i.e. $C = 2$. The probability that 5 out of 10 quick Goroutines are scheduled before any password hashing ones is equivalent to the sum of the probabilities where:

- 5 occurred on core no. 1, and 0 on core no. 2
- 4 occurred on core no. 1, and 1 on core no. 2
- ...
- 1 occurred on core no. 1, and 4 on core no. 2
- 0 occurred on core no. 1, and 5 on core no. 2

So in Ruby:

```ruby
def c2 n
  v = []

  (n+1).times do |i|
    (n+1).times do |j|
      v << [i, j] if i + j == n
    end
  end

  v
end

def c4 n
  v = []

  (n+1).times do |i|
    (n+1).times do |j|
      (n+1).times do |k|
        (n+1).times do |l|
          v << [i, j, k, l] if i + j + k + l == n
        end
      end
    end
  end

  v
end

def probability_n_are_quick_with_c c, n, q, l
  v = [[n]]  if c == 1
  v = c2(n)  if c == 2
  v = c4(n)  if c == 4

  v.map do |x|
    x.map do |i|
      probability_n_are_quick i, q, l
    end.inject(1) do |a, i|
      a * i
    end
  end.sum
end
```
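As a cross-check in Go itself, here's a sketch that mirrors the single-core Ruby helpers above using `math/big` rationals (the function names simply transliterate the Ruby ones):

```go
package main

import (
	"fmt"
	"math/big"
)

// permutationsNAreQuick mirrors permutations_n_are_quick above.
func permutationsNAreQuick(n, q, l int64) *big.Int {
	r := big.NewInt(1)

	for i := int64(0); i < n-1; i++ {
		r.Mul(r, big.NewInt(q-i))
	}

	for i := int64(0); i < q-n+l-1; i++ {
		r.Mul(r, big.NewInt(i+1))
	}

	return r
}

// probabilityNAreQuick mirrors probability_n_are_quick above.
func probabilityNAreQuick(n, q, l int64) *big.Rat {
	return new(big.Rat).SetFrac(
		permutationsNAreQuick(n, q, l),
		permutationsNAreQuick(0, q, l),
	)
}

func main() {
	// Reproduces the 1 vCPU, L = 1 entry of the table below: 16.67%.
	p, _ := probabilityNAreQuick(5, 10, 1).Float64()
	fmt.Printf("%.2f%%\n", p*100)
}
```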
Some practical numbers. Let's assume a server processes 10 quick requests per second (i.e. they do no password hashing). Below are the probabilities that **at least half of those incur 0 ms latency due to other password hashing requests** in the system.

Requests $L$ | 1 vCPU | 2 vCPU | 4 vCPU
-|-:|-:|-:
1 | 16.67% | 38.97% | 100.0%
2 | 9.09% | 21.49% | 58.03%
3 | 5.30% | 12.65% | 34.63%
4 | 3.26% | 7.84% | 21.72%
5 | 2.10% | 5.08% | 14.2%
6 | 1.40% | 3.40% | 9.60%
7 | 0.96% | 2.35% | 6.68%
8 | 0.68% | 1.67% | 4.77%
9 | 0.49% | 1.21% | 3.48%
10 | 0.36% | 0.89% | 2.58%
11 | 0.27% | 0.67% | 1.95%
12 | 0.21% | 0.51% | 1.50%
13 | 0.16% | 0.40% | 1.17%
14 | 0.12% | 0.31% | 0.92%
15 | 0.10% | 0.25% | 0.73%
16 | 0.08% | 0.20% | 0.59%
17 | 0.06% | 0.16% | 0.48%

```ruby
def summary
  (1..17).map do |l|
    "#{l.to_s.rjust(2, ' ')}|" + [1, 2, 4].map do |c|
      probability_n_are_quick_with_c c, 5, 10, l
    end.map do |x|
      (x * 100.0).round(2).to_s.rjust(6, ' ') + "%"
    end.join("|")
  end.join("\n")
end
```

As you can see, not so happy numbers. The moment you add just 1 password hashing run there is _only_ a 16.7% chance that half of your users will not experience an unnecessary latency of 100+ ms. (Obviously, for a Go application running on one vCPU.) Thus an attacker can visibly slow down your server's response latency by injecting just a handful of password hashing requests per second. Such a low number is likely to evade rate limiters.

Put more bluntly, if at the top of every second an attacker sends out 7, 10 or 14 (for 1, 2 or 4 vCPUs) bogus password hashing requests, your server *will slow down* responses for at least 50% of your users all day, except for about 15 minutes. Note that it's not like in those 15 minutes none of your users experience latency; it's just that fewer than 50% of them do (which may be 49.99% or may be 0%).

So objectively, an attacker can very cheaply affect service on your Go servers that do password hashing. Increasing the number of cores does somewhat improve the situation, but not significantly.

Since this is a model, a word on the assumptions. It is very likely that, due to the nature of `epoll`, almost all Goroutines marked as ready in the kernel will be scheduled at once on the first core asking for them, instead of being evenly distributed among multiple cores, so the extra safety the model above attributes to multiple cores is probably overstated.

Thus, it does make sense to solve this problem and take away much of the power attackers have over unsuspecting Go servers.

### Option (1): Dedicated CPU cores

Essentially, it's a queue. If we dedicate some portion of the CPUs to only do password hashing, we would not be affecting other Goroutines.

```go
var hashRequests chan func()

func init() {
	hashRequests = make(chan func())

	processors := runtime.GOMAXPROCS(0) / 2
	if processors < 1 {
		processors = 1
	}

	for i := 0; i < processors; i++ {
		go func() {
			for fn := range hashRequests {
				fn()
			}
		}()
	}
}

func CompareHashAndPassword(hash, password []byte) (ok bool, err error) {
	done := make(chan struct{})

	hashRequests <- func() {
		defer close(done)

		// bcrypt.CompareHashAndPassword returns a nil error on match.
		err = bcrypt.CompareHashAndPassword(hash, password)
		ok = err == nil
	}

	<-done

	return ok, err
}
```

Here the package runs `CompareHashAndPassword` on only half of the cores available to the process. All other attempts essentially queue up.

An obvious drawback of this setup is that you actually need more than 1 core for it to be somewhat effective. This is not always an option (cheaper EC2 instances, for example). It also gives password hashing requests some reserved capacity, which is likely not what is desired; cheaper, lower-latency requests should probably be given priority for a better user experience.

Furthermore, there's no guarantee that the scheduler won't schedule other, quicker Goroutines right after picking up a password hashing one on the dedicated cores. Again, this is an issue with the way `epoll` works. So this option does take away some of the power the attacker has, but not all of it.

In a multi-instance environment, CPU is going to be a very bad metric to base auto-scaling on, as it won't go above the allotted cores. A good thing about this setup, though, is that the queue is quite predictable, and `CompareHashAndPassword` may even be able to reject attempts that are likely to take too long to complete because the queue is full. In that case, a good auto-scaling metric is the rate of these errors, as sketched below.
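The rejection idea can be sketched by bounding the queue and failing fast when it is full. `maxQueued` and `ErrQueueFull` are hypothetical names chosen for this sketch; the worker Goroutines would range over `hashRequests` exactly as in the `init` above:

```go
// Hypothetical bound on how many hashing attempts may wait in line.
const maxQueued = 64

var hashRequests = make(chan func(), maxQueued)

// ErrQueueFull is a hypothetical sentinel error; callers can map it to
// HTTP 503, and its rate makes a good auto-scaling metric.
var ErrQueueFull = errors.New("password hashing queue is full")

func CompareHashAndPassword(hash, password []byte) (ok bool, err error) {
	done := make(chan struct{})

	work := func() {
		defer close(done)

		err = bcrypt.CompareHashAndPassword(hash, password)
		ok = err == nil
	}

	select {
	case hashRequests <- work:
		// Queued; wait for a dedicated worker to run it.
		<-done
		return ok, err

	default:
		// Queue full: fail fast instead of making the caller wait.
		return false, ErrQueueFull
	}
}
```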
### Option (2): Yield the CPU every few milliseconds

A better way to solve the problem, though it requires a change in the underlying algorithm. Essentially, every few milliseconds the password hashing algorithm, aware of its problems, yields the CPU so other ready Goroutines can run.

To illustrate, we'll use [bcrypt](bcrypt.md). The costly portion of the algorithm is as follows:

```go
rounds = 1 << cost

for i = 0; i < rounds; i++ {
	blowfish.ExpandKey(ckey, c)
	blowfish.ExpandKey(csalt, c)
}
```

For a typical cost parameter of 10, it runs 2048 `ExpandKey` operations from the [Blowfish](Blowfish.md) cipher. Each of those does a non-trivial number of operations.

We can modify this cycle to yield the CPU core for other Goroutines to run. Thankfully, you can call the [`runtime.Gosched()`](https://pkg.go.dev/runtime#Gosched) function every few cycles to achieve this.

```go
rounds = 1 << cost

for i = 0; i < rounds / 16; i++ {
	for j = 0; j < 16; j++ {
		blowfish.ExpandKey(ckey, c)
		blowfish.ExpandKey(csalt, c)
	}

	runtime.Gosched()
}
```

This example yields the processor every 16 rounds. If a password hashing attempt takes 100 ms, which previously was latency that all Goroutines on the same core had to take on, the latency they may now take on is just the duration of 16 of the 1024 rounds, roughly 1.6 ms.

But even better, we can add support for cancellation too:

```go
rounds = 1 << cost

for i = 0; i < rounds / 16; i++ {
	for j = 0; j < 16; j++ {
		blowfish.ExpandKey(ckey, c)
		blowfish.ExpandKey(csalt, c)
	}

	select {
	case <-ctx.Done():
		return nil, ctx.Err()

	default:
		runtime.Gosched()
	}
}
```

This way, when the work is cancelled, no CPU is wasted on queued-up password hashing requests.

This approach is obviously fairer to other Goroutines and takes away all of the power attackers had. But it significantly penalizes password hashing attempts, since their whole group now competes equally for resources. That is, the group of password hashing Goroutines will now bear much of the latency of other password hashing Goroutines. Since the most efficient way to finish hashing a password is to let it run on a core uninterrupted, and since there's a fixed number of CPU cores on a system, it does not make much sense to have too many of these Goroutines running concurrently.

### Optimal

Thus the optimal solution, as often in computer science, is to combine (1) and (2), as sketched after this list:

- Run as many as `runtime.GOMAXPROCS(0)` password hashing Goroutines at once, but
- let them yield so they cooperate well with others.

The queue approach gives First-In First-Out service guarantees to password hashing attempts (and a way to signal service disruption only on those endpoints), while less CPU-intensive tasks are not penalized for running next to password hashing workloads. All the power attackers had is thus removed.
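Here is what the combination might look like, again as a sketch. `yieldingCompareHashAndPassword` is a hypothetical stand-in for a bcrypt comparison built on the yielding, cancellable loop from Option (2); it is not part of `golang.org/x/crypto`:

```go
// A FIFO queue drained by exactly runtime.GOMAXPROCS(0) workers, each
// running a cooperative (yielding) hash.
var hashQueue = make(chan func())

func init() {
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		go func() {
			for fn := range hashQueue {
				fn()
			}
		}()
	}
}

func CompareHashAndPassword(ctx context.Context, hash, password []byte) (bool, error) {
	var ok bool
	var err error

	done := make(chan struct{})

	work := func() {
		defer close(done)

		// Hypothetical: yields via runtime.Gosched() every few rounds
		// and honors ctx cancellation, as in the Option (2) sketch.
		ok, err = yieldingCompareHashAndPassword(ctx, hash, password)
	}

	select {
	case hashQueue <- work:
		<-done
		return ok, err

	case <-ctx.Done():
		// Give up while still queued; no hashing work is wasted.
		return false, ctx.Err()
	}
}
```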
### Library

I've written a library that implements the above and is a drop-in replacement for the `golang.org/x/crypto/{bcrypt,argon2,scrypt,pbkdf2}` packages. Check it out: [github.com/hf/passwords](https://github.com/hf/passwords).

## Metrics and proof

Here are some metrics to prove the problem and the solution. Assumptions:

- Running on an Apple MacBook Pro 14" with an M1 chip, on battery.
- Instead of simulating network requests, `runtime.Gosched()` is used to delay the execution of all Goroutines, as if they were received over the network.
- An equal number of password hashing and non-password-hashing Goroutines are started, as fast as possible (hashers going first).
- Only the low-CPU Goroutine durations are measured, effectively showing the latency caused by scheduling each Goroutine.
- Using bcrypt with a cost of 10; hashing `password1`.

[Test code and results.](https://gist.github.com/hf/0296692fca5e5d734cb886bffcd0722c)

### Without Gosched

![](Gotrue_and_passwords_H_4096_without_Gosched.svg)

As you can see, most of the non-password-hashing Goroutines took an amazing 11.9 to 15.5 **seconds** to complete, even though they do absolutely nothing inside! In total, hashing 4096 passwords took 41.7 seconds.

### With Gosched and queue

![](Gotrue_and_passwords_H_4096_with_Gosched.svg)

As you can now see, essentially all of the non-password-hashing Goroutines took between 8.4 and 11.2 **milliseconds** to complete. A very small number of them took up to 33.6 milliseconds. In total, hashing 4096 passwords took 41.6 seconds. Essentially the same!
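For reference, here's a condensed sketch of that kind of harness (the Gist contains the real test code; this version is simplified and illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	const hashers = 4096

	hash, _ := bcrypt.GenerateFromPassword([]byte("password1"), 10)

	var wg sync.WaitGroup
	durations := make([]time.Duration, hashers)

	for i := 0; i < hashers; i++ {
		wg.Add(2)

		// Password hashing Goroutines go first.
		go func() {
			defer wg.Done()
			bcrypt.CompareHashAndPassword(hash, []byte("password1"))
		}()

		// A low-CPU Goroutine; only its duration (from creation to
		// execution) is recorded.
		start := time.Now()
		go func(i int) {
			defer wg.Done()
			durations[i] = time.Since(start)
		}(i)

		// Stand-in for network arrival, per the assumptions above.
		runtime.Gosched()
	}

	wg.Wait()

	fmt.Println(durations[len(durations)-10:])
}
```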