Thursday, April 6, 2023

CRC64 in Redis is getting even faster

On Friday last week, I submitted this PR to Redis: https://github.com/redis/redis/pull/11988 .

In the PR, I make single-threaded CRC64 53-73% faster for a CPU that is 13 years old, and ~67% faster for a CPU that is 5 years old. I'll get into the details more as to how / why, but modern CPUs could see 2x+, and an update to the algorithm should allow for leveraging more CPU hardware as it scales. This is also generally applicable to all CRC polynomials, and brings software CRC within 2x of hardware CRC performance.

Oh, then I also made CRC concatenation >10,000x faster, but I suspect that was previously known, and just not reported for what it was.

I know, this is going to feel like "learn this one simple trick to make all your programs faster" clickbait, but for some algorithms, 2x faster is just the start (and the same approach extrapolates to other CRCs).

Computer Architecture Background

For folks who already know about the history of computers, and how / why CPUs do what they do, jump ahead to "Making it faster".

Decades before I was born, Alan Turing and John von Neumann developed competing ideas about how to conceptualize and build computers. Ultimately, the von Neumann architecture won out. It is best characterized by a central processing unit that handles control flow and arithmetic, along with interfaces to memory, input devices, and output devices. To go a little further for our specific example, processors have been built to do a few things really well: read instructions, decode those instructions into sub-instructions, wait for data to arrive, then finally execute on any data that is available.

Fetching data can take a while, and execution isn't instant either. Even with your 3 or 4 gigahertz PC, most CPU instructions, especially complex mathematical operations like square root, division, and modulo, take multiple steps. So if you are performing a division, you may only get a few bits of the result each cycle, and you may need to wait tens of cycles for the final answer. Normally this is all hidden, so you only see completed results, never partial ones.

In the specification manuals for most CPUs, you can look up information about how many instructions of what types can be dispatched in a cycle, and how many cycles each operation will take. Depending on the operation, and other things going on at the time, sometimes you can start multiple operations of the same type in adjacent "pipelines", like having multiple lines in a factory. Sometimes you can add items to the lines every cycle, and sometimes each step takes longer than a cycle and items to be added to the pipeline have to wait. Sometimes, there are groups of 2, 4, or 8 items that need to be processed at once, and we have instructions specifically designed to process these groups in what is called vectorization.

Making it faster

Back in 2013, Mark Adler released performance improvements to the CRC class of algorithms. I didn't learn of them until several years later, through Matt Standcliff's Redis work and optimizations around 2017 or 2018. Huge thank you to Mark for writing the code, Matt for making it work with Redis, and Madelyn Olson for finally getting it merged in 2020.

A friend and former coworker sent me a reference to the "hacker's delight CRC", which was posted in 2014. That looks to use the same techniques that Mark Adler released in 2013.

I don't know who originated the work, but I'm going to guess it was Mark Adler or one of his colleagues, given that he is the creator of the Adler-32 checksum and co-author of zlib and gzip, among others.

What was done? Mark transformed the typically bit-wise or byte-wise operations of CRC into operations performed over up to 8 bytes at a time. CPUs are built to do many such operations concurrently, so this naturally sped up CRC from around 400-408 MB/second to ~1420 MB/second, at least on one of my daily-workhorse Intel Xeon E5-2670s. That's a ~3.5x speedup just from switching from 1 byte at a time to 8; I would scarcely have believed it had I not compiled, run, and compared the outputs myself.
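
To make the baseline concrete, here is a minimal byte-at-a-time sketch in Python of the kind of table-driven loop being replaced. The polynomial constant is the one quoted for Redis's "Jones" CRC-64, though whether it matches Redis's exact bit-ordering conventions is an assumption on my part; the point is the shape of the loop, one table lookup per input byte, with each step depending on the one before it:

POLY = 0xad93d23594c935a9  # quoted Jones CRC-64 constant; exact convention assumed

def make_table():
    # one entry per possible low byte of the CRC state
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
        table.append(crc)
    return table

TABLE = make_table()

def crc64(data, crc=0):
    # classic byte-at-a-time loop: one table lookup and one shift per byte,
    # and each iteration depends entirely on the previous one
    for b in data:
        crc = TABLE[(crc ^ b) & 0xff] ^ (crc >> 8)
    return crc

Roughly speaking, the 8-bytes-at-a-time version replaces that single dependency chain with eight independent table lookups per 8-byte word, which the CPU can issue in parallel.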

Rarely used and rarely mentioned, Mark had also provided a method for CRC combining. Normally, if you have a block of data, one function has to run over that whole block from start to finish. You can pause, but you have to process the first part first. This limitation is very common among hashing algorithms.

Again, I don't know where the idea originated, but Mark determined a way to merge the CRCs of conceptually adjacent segments. So you can cut one segment into two, compute over the segments in parallel, combine the results, and get the same value as if you had computed the whole thing serially.
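
In code terms, and building on the crc64() sketch above, combining computes crc64(A + B) from crc64(A), crc64(B), and len(B) alone. A slow but easy-to-follow version simply advances the first CRC through len(B) zero bytes; this identity holds for any CRC with a zero initial value and no final xor, like the sketch above (real combine code handles other variants with small adjustments):

def crc64_combine_slow(crc_a, crc_b, len_b):
    # advance crc_a through len_b zero bytes, then fold in crc_b;
    # this is O(len_b), which is exactly what the fast versions avoid
    return crc64(b"\x00" * len_b, crc_a) ^ crc_b

a, b = b"some bytes", b"some more bytes"
assert crc64(a + b) == crc64_combine_slow(crc64(a), crc64(b), len(b))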

Normally, most folks would add some threads, build a thread scheduler, and work out how to make this faster with threads. I did that, but only because I already had a threaded snapshot engine for Redis (not a part of this PR), so my CRCs were naturally faster. But I wasn't just interested in threads; Mark's improvements didn't come from shuffling threads around on one CPU, they came from putting more operations into the CPU from a single thread.

Armed with the knowledge that I could combine CRCs, I pulled the 8-bytes-at-a-time CRC operation into a group of macros, then had a single thread walk 2 or 3 segments at once, producing 2 or 3 concurrent 8-byte segment CRCs over the data (a sketch of the idea follows).
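
Conceptually, the single-threaded version looks like the sketch below, leaning on the crc64() and crc64_combine_slow() sketches above. In the real C code the per-segment loops are interleaved inside a single loop body via macros, giving the CPU two or three independent dependency chains to keep its execution units busy; Python can only show the bookkeeping, not the instruction-level parallelism:

def crc64_segmented(data, nsegments=3):
    # carve the buffer into N roughly equal segments, CRC each one
    # independently, then staple the partial CRCs back together
    n = len(data)
    if n == 0:
        return 0
    step = (n + nsegments - 1) // nsegments
    parts = [data[i:i + step] for i in range(0, n, step)]
    total = crc64(parts[0])
    for part in parts[1:]:
        total = crc64_combine_slow(total, crc64(part), len(part))
    return total

assert crc64_segmented(b"x" * 1000 + b"y" * 999) == crc64(b"x" * 1000 + b"y" * 999)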

I had difficulty making CRC64 faster for data smaller than about 2-3 megabytes until I noticed something important. The original CRC combine method was taking ~270 microseconds to staple ~1 meg CRCs together. After some vectorization, manual and automatic, I settled on a compiler-vectorized method that got me down to ~29 microseconds for the CRC combine operation back in spring 2018.

On Wednesday, March 29, after looking at the algorithm further while preparing the patch for what ultimately became this pull request, I noticed that if you were to cache a certain set of data, much like we had done with the CRC computation in the first place, you could eliminate much of that remaining ~29 microseconds. So I did.

After a quick cache initialization that takes around 140 microseconds, or just more than half the time of the original staple over 1 meg, a 32 kilobyte cache is populated for your polynomial that allows CRC64 stapling. Due to an apparent cycle in the CRC after 64 bits of merge, we can actually merge unlimited length segments. After initialization, calls to combine about a megabyte now take roughly 25 nanoseconds. Yes, nanoseconds. From 270 microseconds without vectorization and caching, to 29 microseconds with vectorization, then all the way to 25 nanoseconds with caching.
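
The speedup comes from treating "advance a CRC through N zero bytes" as a linear operator over GF(2): precompute the operators for 1, 2, 4, 8, ... zero bits once per polynomial, and then any length is handled by applying at most 64 of them, selected by the bits of N. Sixty-four 64x64 bit-matrices at 8 bytes per row is exactly a 32 kilobyte cache. The sketch below, continuing from the crc64() sketch above, is my reconstruction of the idea rather than the code from the PR (and building the cache in pure Python takes seconds, where the C version takes the ~140 microseconds mentioned above):

def gf2_times(mat, vec):
    # multiply a 64x64 GF(2) matrix by a 64-bit vector; mat is a list of 64
    # row integers, where mat[i] is the image of basis vector i
    total, i = 0, 0
    while vec:
        if vec & 1:
            total ^= mat[i]
        vec >>= 1
        i += 1
    return total

def build_combine_cache():
    # operator that advances the CRC state through a single zero bit
    op = [POLY] + [1 << (n - 1) for n in range(1, 64)]
    cache = []
    for _ in range(64):                             # cache[k]: 2**k zero bits
        cache.append(op)
        op = [gf2_times(op, row) for row in op]     # square the operator
    return cache

CACHE = build_combine_cache()                       # 64 * 64 * 8 bytes = 32KB

def crc64_combine(crc_a, crc_b, len_b):
    # apply the cached operators selected by the set bits of len_b * 8
    zero_bits = len_b * 8
    k = 0
    while zero_bits:
        if zero_bits & 1:
            crc_a = gf2_times(CACHE[k], crc_a)
        zero_bits >>= 1
        k += 1
    return crc_a ^ crc_b

a, b = b"some bytes", b"some more bytes"
assert crc64_combine(crc64(a), crc64(b), len(b)) == crc64(a + b)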

So at least in Redis-land, CRC64 combining is about 10,800 times faster overall with the use of a 32 kilobyte cache and some auto-vectorization.

Wow.

I am pretty sure the caching portion is a well-known optimization, at least to Mark Adler; I see a similar set of CRC combine optimizations in the most recent version of his codebase, which makes me think I should do a literature check before I write software. That said, Mark does not provide the additional vector operations, and without them, the cache speedup (at least in my version) only gets us to about 8 microseconds, instead of the 50 nanoseconds typical worst case.

So overall, the Redis CRC combine is about 160x faster than crcany's combine, and nets a >10,000x speedup over what I was originally working with, compared to around a 60x speedup with caching alone, or only ~10x with vectorization alone (the combination is super effective).

Here are some benchmark outputs in a Github gist.

Overall, that gets us from 400-408 MB/second for 1 byte at a time, to ~1420 MB/second for 8 bytes at a time, to up to ~2325 MB/second, roughly 63% faster than the 8-byte version, on an Intel Xeon E5-2670 v0 @ 2.60GHz with 20MB of cache.

That is an old CPU, originally marketed in Q1 2012, so it has relatively few load, store, and arithmetic units compared to a more modern processor. I also happen to have a slightly newer Intel Core i3-8130U @ 2.20GHz, first listed for sale in Q1 2018, which I got in the summer of 2019 inside the laptop it still runs in.

Again, we see 400-408 MB/second for 1-byte-at-a-time processing. The 8-byte-at-a-time version gets us up to ~1400 MB/second, or 3.5x faster, with the 24-byte-at-a-time version seeing ~2350 MB/second on larger data. That's an approximate 68% speedup for the 24-byte version over the 8-byte-at-a-time method.

In some benchmarks on ~10k of data (fitting inside L1 cache), I saw CRC64 achieve upwards of 5670 MB/second on the 8130U.

Beating 1 byte/cycle

On my 5-year-old mobile CPU, we see the 2.2GHz processor handling 2350 MB/second, which is 1.068 bytes/cycle. If the data happens to already be loaded into cache after being created by another step, as is the case in many Redis snapshot operations, we see 2.6 bytes/cycle processed.

This speed puts us on par with the performance of "properly engineered" hashes that don't use any tables, and we are within a factor of 2 of the hardware CRC64 available in some more modern CPUs. For those about to ask: no, we can't use that CRC in Redis or generally, because it is not the right polynomial; run both functions on the same data and you get different outputs.

I do not know if you can convert the output of CRC from one polynomial to another, but I suspect the answer is no. Might be an interesting question to ask a good mathematician.

Similar optimizations on 16 and 24-way (or more) processing can be done on CRC16 in Redis, or any other CRCs anywhere, as the 8 bytes at a time processing has already been implemented in CRCany. We skip this 16/24 byte optimization in Redis because CRC16 is typically used there for cluster slot identification, which is performed on keys that are generally in the tens to maybe hundreds of bytes in size, and there isn't a benefit to the optimization until 128 bytes for 16-way, and ~2k for 24-way. On the other hand, CRC64 is typically performed over snapshots of keys and data, which can reach to tens of gigabytes in size.

I'll try to get the optimizations upstream to crcany soon.

Summary

That's a lot of information, but CRC64 in Redis is getting about 60-70% faster for old machines, and likely 2x or faster on even newer hardware. For CRC combining, we achieve ~50ns worst-case time.

References

Pull request: https://github.com/redis/redis/pull/11988

Benchmark outputs: https://gist.github.com/josiahcarlson/9a147d39cc492951faffb723149f7926

Mark Adler's crcany project: https://github.com/madler/crcany/

Thursday, January 26, 2023

Getting an early birthday present from YouTube

Back in 2008-2010, I was working as one of the software engineers tasked with developing features and functionality on YouTube for Google. I had joined as the Santa Monica team was growing, right before a summer-long hiring freeze, about 3 months after officially getting my doctorate in 2008. Among other things, I worked on auto-redirection based on demographics (age-gating alcohol-related YouTube pages, but it was nearly Turing-complete, using user profiles as nodes and cookies as the tape), home-page customization, and user-page customization (whee, CSS sanitizers!), but the biggest thing I did at YouTube was rewrite the built-in social Groups platform.

Humble beginnings


I can't recall the original author of the YouTube Groups platform, but the software had several bothersome limitations and design constraints. The first was that no more than (if I remember correctly) 5000 videos could be posted to a group, and that after a video was posted, there could be up to 1000 unthreaded comments related to that video. To find a conversation, you looked through the reverse-chronological listing by posting date. What happened when you hit 5000? Sorry, you couldn't post any more videos. Oops. What about text posts? Well, text posts only existed as chunks of text like any standard bulletin board at the time, and were mostly just a repeating wall of author, text, horizontal line.

While some of these could be fixed with some minor edits, nearly everyone that I spoke with internally wanted to delete Groups. Not for any reason except that it was pulling less than 1 million total views weekly (this was shortly before YouTube publicly announced 1 billion daily watch page views), and occasionally changes to some Python-generated HTML widget would cause a maintenance headache as someone needed to figure out what broke Groups. It also didn't help that the design was a bit out of date, owing to zero real updates in the several years since it had been written.

A first step


By the time I started paying attention in late 2008, I had hacked together a mini Twitter / Reddit / Digg / whatever forum thingy for posting daily standup statuses for our group, and it had more interaction functionality than the existing YouTube Groups functionality. I had built it using AppEngine as an, "I should use AppEngine for something so I can learn it" project, and after working up enough "I am embarrassed by YouTube Groups" intellectual inertia, I started considering building a new YouTube Groups in AppEngine.

After designing the database schema for the non-relational AppEngine datastore, there was a request from our Chief Architect Mike Solomon to also provide a MySQL schema as an option. At the time, Groups would be the only YouTube property not using either the sharded or non-sharded MySQL databases that stored everything not pictures or video. While comments would eventually move to one of the BigTable successors in a separate process / migration, Groups ultimately ended up using the non-sharded MySQL database, as we never expected the thing to grow very big. We were wrong. :)

Once the schema was approved, I started hacking on a horribly designed proof-of-concept 20% project. A month or two later, after it was far enough along to understand what I was going for (think threaded LiveJournal / Reddit with video posts and comments you could upvote / downvote), my manager James Phillips convinced our PM to get one of the YouTube designers on it. Fortunately for me, I ended up getting the illustrious Gunthar Hartwig to help integrate the developing YouTube profile / channel styles into YouTube Groups. This integration ended up going pretty far, and by the summer of '09, old groups migration and new groups release, Groups was using nearly all of the same widget building, widget positioning, data packing, and unpacking code as YouTube Channels (I also attempted to get the same code pushed to the homepage for customization there, but the homepage team went with something different, and looking at LinkedIn profiles after the fact, apparently earned a few patents for their efforts).

Pitfalls


No project is complete without hitting some bumps, or in my case, pretty deep pitfalls. Along the way in rebuilding Groups, I made several mistakes.

The first obvious mistake was that I built Groups against what we called "mono", which was a single large database meant to hold things that didn't make sense to shard, or couldn't be sharded, across our sharded DB instances. Clearly Groups could have been sharded on a per-group basis, with the only real costs being a lookup of non-cached group name -> id (we sharded on ids), and a need to hit all shards to get a listing of groups.

This was compounded by a particular design choice / mistake where I both avoided joins and tried to satisfy the 4 most common queries with covering indexes. Which queries?
  • Show me the top k videos for this group
  • Show me the most recent k videos for this group
  • Show me the top k videos for this topic
  • Show me the most recent k videos for this topic
Four queries, four covering indexes. I didn't worry about people posting; that wasn't a big deal, just a main table + 4 index writes (all of which were spam-checked, throttled, etc.). The problem was deletion. Because there were no joins, removing a thread (and its associated videos) from the 4 indexes meant every one of its video posts had to be removed from each index. This was (relatively) rare, but occurred often enough, and for long enough threads, that I had added 'order by ... limit 1000' to keep the database from stuttering, with a secondary batch process that ran daily to clean up any threads longer than 1000 posts. Had I sharded the data by group in the first place, used a join to store the thread's delete bit in 1 place (indexed), and considered just using caches instead of covering indexes, I could have avoided most of the database load that I had induced. To the point: occasionally, those delete-thread operations would show up in our weekly database query reports as one of the top-100 latency queries. Funny how a little join on the queries and some caching could have solved it.

Foolishly, I also wrote my own javascript Ajax mini-library separate from what most of the rest of YouTube was using. I wrote it because I couldn't find the documentation for any of the internal libraries that YouTube had been using for Javascript, and literally everyone else at my site just magically seemed to know how to use it. This included the recently graduated undergrad who started the same day I did! I didn't find out until a bit later, but the recent grad had done an internship with YouTube the summer before, so already knew the tools, people, process, ... I probably could have saved myself a lot of time if I'd asked, "where are the docs?" Alternatively, maybe I did and I don't remember, as there were several other projects I was working on at the time, and I was overworked.

If you want me to be honest, internally much of the code I wrote for Groups was crap. Some of the HTML templates weren't too bad, but the inline onclick bindings nowadays might get me fired in some shops, even if the css selectors didn't. There was also an if / elif block in the Python-side API that dispatched 95% of the interactive elements processing based on a variety of context (upvotes, downvotes, posting, replies, content pagination, ...), and it got worse to maintain with every addition.

My only real excuse for any of this is that working on YouTube was the first web application I touched at any level, and I got to work on: subscription email libraries, memcache libraries, sharded + non-sharded sql schemas and indexes, homepage html, css, javascript, user profile pages, all of the (groups) templates, all of the (groups) css, all of the (groups) javascript, all of the (groups) Python, (groups) data migration, daily (groups) batch processes, (groups) API design, ensuring proper layout for RTL languages for user channels and groups, ... along with several other side projects and helping others as necessary.

Trouble in paradise


Having just come out of a much smaller organization before Google, I had been used to being recognized and promoted for my abilities and achievements. That doesn't really / actually happen at Google. It's a bit of a fight, and the best way you can fight your way to the top is to get folks with a higher position than you to give you a recommendation for what amounts to a 'promotion packet' (conversations suggested that 3-5 was the minimum for promotion, but more is better). I didn't put those numbers together early enough (I didn't know the game), and by the time I did, whatever career I might have had at Google had been sand-bagged by not doing the right work for the right people, while also stepping on toes trying to create something new / better.

Once I realized that my time at YouTube / Google was coming to a close, I started looking at what was causing the most issues with Groups stability. And when I ran the numbers, nearly 99% of all issues with Groups were caused by periodic updates and incompatibility with the Channels carousel / player widget and backend. I'm not going to get into details, but keeping that API stable was tough, breaking some part of the YouTube world roughly 1 out of every 3 deploys.

Several times over the 9 months after migration to what we called "New Groups", I needed to write fixes during Wednesday deploys, which should have been prevented / nearly impossible, except for the following series of events that seemed to occur with startling regularity...

1. Code that broke the channel carousel would be committed just prior to code freeze on a Friday/Monday (maybe or maybe not breaking Groups).
2. Monday/Tuesday we would find out that the carousel was broken.
2a. Because it was very difficult to get an LGTM from me during code reviews, the breaker would commit new fixes *after* I left for the day, with someone else's approval, on Monday/Tuesday at 5-6PM.
3. Newly broken Groups would then go out on the Wednesday late morning / midday deploy, even when flags were raised, because "your shit should have been fixed".
4. This inevitably caused production errors, as new Groups was 10x-100x bigger than old Groups almost overnight.
5. I would write fixes, hopefully to get re-deployed that day, in the Wednesday (to sometimes Thursday afternoon) "oh shit" deploy shuffle / kerfuffle.
6. In the worst case, the Groups carousel was broken for a week (usually only 2 days, thanks to other hotfixes going out on Friday).

Oops.

For the future!


As part of a going-away birthday present to myself (I left Google in March 2010, my birthday is in March), and as a way to future-proof against breakage, I wrote a new player widget specifically for Groups that bypassed the carousel. Instead of calling the shared API endpoint, I called my own endpoint, that only did the 4 things I needed it to do. Okay, exception solved, but what about the carousel itself? Can't rely on the templates either...

So I wrote a new player; inside a conversation thread, you could watch a big video at the top, while interacting with the comment stream below, or with the video in a thumbnail in the upper-right. On all pages, a scrollable and paginated list of top / recent videos for that thread / group was available for browsing and viewing. As you scrolled down and interacted with posts and topics, you could watch videos, cue up videos to play, vote on videos, start a video without leaving the page, and more. This worked great on every browser - including IE6, as I was mostly returning to the original viewer interface I'd written before YouTube deprecated IE6 [1] and I was "encouraged" to use the carousel for a consistent experience. Ultimately it was very similar to the carousel (in terms of total functionality), but with a completely different backend and widget, so any later changes to the channel carousel wouldn't affect it (hard to break a code path you don't interact with, at least not without trying).

The new player / interface wasn't supposed to turn on until my birthday, a few weeks after I had left. Sadly, the world doesn't quite work the way we intend. On my last day, I was still committing code for the Groups revamp, and it turns out that resurrecting code while being plied with whiskey shots at a rate of 1.5 / hour is tough. I was up to 11 shots by the time I wrote "gvn commit" and had my exit interview at 5PM, which was probably not all that good for the quality of the code I was committing. Also, I was clearly rambling drunkenly at my exit interviewer.

Thinking that was the end of that, I got a call just one week later, when a former coworker broke the carousel for Groups. I had let a few folks in on the surprise, so when the carousel was inevitably broken, they called to ask which line they should change to make my player live. After updating the line, then running a test on their local machine, I got the question... "um, did you forget to commit a template?" Indeed, 11 shots is too many. After giving them the password to my old workstation, they got the template, checked again, committed, and Groups limped along for another 9 months.

That any of this was possible was fortunate: thanks to our development group's hoarding tendencies[2], one of my coworkers had kept my old workstation aside as a "backup workstation" away from IT re-imaging, "just in case".

During the migration I performed for Old Groups to New Groups in August 2009, Groups had been seeing ~1 million weekly video views. By December 2009, that had grown to ~10 million weekly video views, or ~80% growth month-over-month. When I left in March 2010, hundreds of new groups were being created every day, people were posting and voting on content, ... and it was pretty incredible. By the time Groups was turned down in December 2010, there were groups with hundreds of thousands of members and videos posted, along with hundreds of millions of votes for videos and posts. All of that is gone now, the 8195 lines of Python, HTML, Javascript, CSS, and SQL archived in an old SVN (or now git) commit, a database backup, and whatever is scattered in the Internet Archive. I won't miss banning another Sub4Sub group, but I will miss running into the anime and game fandom music video groups.

I suspect YouTube ultimately turned Groups off because it was unnecessary database load, no one wanted to maintain it, and it was suffering from 9+ months of bit rot. I'd rewritten old groups because it wasn't great at what it should have been, but I suspect that Groups was also a partial victim of its own success. Good enough for users to use it / abuse it, but not good enough for Google to want to dedicate even a single engineer to the task of maintenance. This was tough for me, because ad revenue for Groups could have paid for several full-time engineers by December 2009, and even a team by the time they turned it off.

My updated Groups was a perfect example of someone building up a revenue stream for Google on an existing service, allowing Google to control the media, as well as the message boards discussing and disseminating the media. Can you imagine? YouTube was hosting hundreds of individual forums of 200k+ individuals each, with 100k+ posts in each of hundreds of groups, after ~15 months of use. That dwarfs many current "big name" social media sites, and that was in 2010.

Alternatively, because groups self-organized, they also helped filter (bad) content, as bad content would gather in Groups postings for mass-flagging and deletion. Groups are the perfect opportunity for honeypots: groups intended to gather bad content and users to feed into the automated filters for blocking, banning, and/or legal action. Some people like to think that algorithms are good at filtering data, and I agree. But people are really good at gathering around their interests and collecting videos related to them.

But Groups is gone, my birthday present relegated to the corners of the Internet Archive.

Looking back at a different path


As I get older, I wonder if there was something else that I could / should have done, or could have known earlier that would have provided me with success at Google. About the only thing I can come up with was that just a few months prior to my leaving, my manager left to join early SpaceX. Now I'm not saying I should have followed him to SpaceX, but maybe I should have tried to step up into his position at Google.

At the time, I had quite a few ideas about engineering, and about applying what I knew to solving problems for startups. I thought that if I could find a startup with the right mix of business model, technology, people, and funding, I could beat the 90-95% 2-year failure rate for startups (I did, 4 times in a row, FWIW). Ultimately, I ended up leaving and starting down a path of startups and entrepreneurship.

I sometimes wonder if trying to become a manager at Google for YouTube would have been a good path, or if I had so many ideas about engineering and software design that I wouldn't have been able to give it up completely. Hard to say, considering that I still write software today, some 13 years later.


[1] This is a great story from one of the SBO team's efforts to kill IE6, which talks about what OldTuber is and what it meant: http://blog.chriszacharias.com/a-conspiracy-to-kill-ie6

[2] My particular tendencies were extreme; I had a Mac desktop workstation (originally allocated), Linux desktop workstation (acquired from leaving coworker), Windows laptop workstation (swapped with my original Powerbook, used as travel / SSH + Synergy host), and 2x 30" monitors. Normally 1 desktop, 1 laptop, and your choice of 2x 24" monitors or 1x 30" was standard, but I'd acquired or traded up enough to have an extra linux desktop, an extra 30" monitor, and when my cube-mate wasn't around (3-4 days a week; due to travel and WFH), a 3rd 30" monitor. I had more monitor area than our site director and original GnuPlot author Thomas Williams by at least 50%. It was glorious.

Sunday, December 11, 2022

Goodbye Twitter

Some people have heavy hearts, but I don't. I originally signed up for Twitter for development / API access. I wrote software that analyzed millions of tweets and Twitter profiles every day for Adly and our celebrity tweet network. I'm about a decade removed from Adly, never much got into tweeting, and lately I've spent most of my time on Twitter posting updates about my software, book, etc. Things that could be blog posts here. I'll probably do that going forward.

I'm leaving Twitter because after Elon Musk's takeover, all I've seen is hard-working people who built the platform being denigrated, while the most useless CEO on the planet tweet-abuses, blocks, and suspends anyone who has a problem with his banality echoing through his slowly emptying anti-social network. Any semblance of an original thought seems to be limited to lightly-edited pro-Russia and pro-MAGA talking points, while pretending that he's the all-seeing eye traversing dimensions.

Imagine having access to billions of dollars... then using it to put Trump back on Twitter, where Trump already has a whole social network dedicated to him - "Truth Social" (the irony is not lost on me, and it shouldn't be lost on you). And using it to insult / abuse the people who made Twitter what it was, from developers, to community managers, to the entire Twitter community. Imagine having had the support of most of the world to put rockets on Mars... then squandering it on posting memes to your $40 billion micro-blog platform.

I was over Musk when he talked about couping Bolivia, over the union busting at his factories, over how he summarily fired a huge portion of the Twitter workforce, and over the recent public details of his failure at being a father. All of that behavior is anti-people, anti-labor, and obviously seeks to create animosity among as many people as possible.

No Tesla, no powerwall, and no Musk-profiting products, not ever. I've got other, better things to spend my time and money on. I bet you all do too.

Wednesday, May 29, 2019

Re: Leaky Python Batteries

After reading about Amber Brown's talk regarding Python and its included "batteries" in the form of the standard library, I find there is a lot to agree with, and only a small amount to disagree with. I know I'm late.

Here's my take...


For experience consideration; I'm the reason that asyncore/asynchat/... (callback-defined async sockets in Python before asyncio) was included and updated in Python 2.6, and lived to see Python 3.0 (other folks took on maintenance after 2.6.?/3.0.?). I stepped on a lot of toes to make that happen, but I was still using asyncore/asynchat, and not having them in the standard library would have been bad for me and folks like me. So I "stepped up".

My experience is that moving anything in Python in any direction is hard and frustrating, no matter how "sane" you think your words and opinions are*. I remember being confused that I was even having to argue for static byte literals in Python 3.0, never mind pushing for a syntax like b'hello' instead of bytes([104, 101, 108, 108, 111]) [1], or even that an immutable variant should be available (instead of using a tuple for hash keys).

Thinking about it well after the fact, I have the sense it was because I didn't feel like I had any real level of ownership or control. I felt responsible as a member of the community to voice my opinion about the software I was using, to try to contribute. But I also felt as though I didn't have any level of real ownership. Like, somehow there was a path through the Python community that led to being able to fix a 2-line bug without 25 people signing off on the change. I saw people walking the path, but I couldn't really find it or manage to walk it myself.

In Contrast


But compare that experience to your own project. Or even to working in almost any tech-based organization. I was fortunate: in the 21 months I put in at YouTube / Google from 2008-2010, I was made an "oldtuber", and later "owner" of several files early on, as I hit readability with Python quickly and showed that I could actually fix problems with the platform. Within 3 months of joining in 2008, I could write code, ensure it passed the automated tests, commit without *any* code reviews in a half dozen files, and have code running on youtube.com within a week of writing it (pushes for hotfixes could happen randomly, sometimes 3-4 times a week back then).

Compare that with any library in Python? I can't even remember how many folks needed to sign off on my asyncore / asynchat changes, but it was far more than the 0/1/2 I needed to commit at YouTube. Python has more folks weighing in now than ever before, and I bet that's not making contributing easier.

I don't know how much this feeling pervades other existing / past python contributors, nor how much this is a barrier for creating new contributors. I don't know if this is a pain point for others, but it was big for me.

I don't know if Amber's frustration comes from similarly comparing her experience in Twisted vs. Python core, where in Twisted she found it easy to make an impact and help the project, but Python core was uninterested in / unwilling to accept similar contributions. I don't know (a lot of things).

Regardless, I don't know if splitting the Python standard library out makes sense. On the one hand, pulling out the standard library would give some people more control and nearly absolute ownership. But that will mostly just create inconsistencies in code quality, maintenance, and interoperability - like we've all seen in our own dependency hells elsewhere. All the while making Python a language without batteries.

Address the Leaky Batteries


Personally, even if the batteries are leaking, I'd rather have them included. Compiling Python issues aside (you can disable compiling the tkinter / lxml modules, FWIW), that's as much a "work on Python's Makefiles" problem as it is anything else. Which is a much more approachable and appropriate thing than getting the organization to create dozens of sub-projects.


I do agree with Amber that asyncio's inclusion in Python does make it hard for the community, especially given how long Twisted has been around and doing much of the same stuff. Heck, I even suggested that if Python were to deprecate and remove modules from the standard library because the functionality was duplicated in Twisted, then the answer was to dump those modules and include Twisted [2] (even though I've never actually used Twisted, and have more recently used asyncore/asynchat, asyncio, pyuv, uvloop, and shrapnel since I read the Twisted source). But it's going to be hard to put that genie back in the bottle; the best-case scenario is that there aren't any more "big" packages included that quickly.

Ultimately, I believe that the answer might just be better organization, better management of maintainers, better expectation management re: possible contributors, and better management of contributions. I think that giving people more ownership of the modules / packages they maintain, while also allowing the community to override a bad maintainer, might help things move forward in each individual module. If everything splits, many sorts of organization-level efficiencies are lost, including the simple benefit that being part of a larger project gets you more attention at all levels.

Personally, I'd suggest sticking together and working on trying to manage the project itself better. For some subjective definition of "better".

A Mistake 20 Years in the Making


That said, until visiting [3] while writing this entry, I didn't even know there was a formal process of being involved in the Python Software Foundation and moving through the organization. I know I spent 10-20 hours a week trying to help back when I was active on the mailing lists. No wonder I failed to find a path to success in Python core for the 20 years I've used Python; I didn't know you literally needed to be a part of the club. Hah.

C'est la vie.
[1] - https://mail.python.org/pipermail/python-3000/2007-February/005822.html
[2] - https://mail.python.org/pipermail/python-dev/2007-February/071202.html
[3] - https://www.python.org/psf/membership/

Sunday, May 15, 2016

rom Indexes and Search

So... some updates on rom.

At the end of January, I posted a teaser picture about "Interleaved Indexes" in rom/Redis via tweet; if you haven't seen the picture, it was attached to that tweet.


Interleaved index?

I ended up building an interleaved index in Redis using a bit of Lua scripting and Python. No ZRANGEBYLEX, surprisingly. What is an interleaved index? It's what I was calling an indexing mode that has the same ordering and filtering options as a few different multi-dimensional indexes stored as tree-structures. Specifically some KD trees, Redshift interleaved sort keys, BSP trees, a specific implementation of crit-bit trees, and several others offer the specific functionality I was after.

Why

The simple reason I was implementing an interleaved index was that I saw some intersection operations on data in Redis as potentially less efficient than a proper multi-dimensional index would or could be. Long story short, it worked, but not without issues. I mentioned some of these issues in a series of tweets 1, 2, and 3, semi-abandoned the code in late February, and am now ultimately not releasing it. Why? Because it doesn't actually add anything. It was complex, borrowed about 750 lines of code I wrote 5 1/2 years ago, and ... no.

A better option

There were a few very simple wins that I knew could be made with the query optimizer, including a fix on my side for a bug when calling TYPE from within a Lua script (it returns a table instead of a string). The ultimate result of that work is that some queries on numeric ranges can, in theory, be hundreds or thousands of times faster on large indexes. Partly from picking the correct set/sorted set to start with, but also from implementing a direct scan of an index instead of an intersect/union followed by deleting everything outside the ranges.

Sometimes being more specific for optimizations is worth it. Definitely is in this case. For one of my use-cases involving search, I'm seeing 10-20x faster queries in practice, and 150x faster in a simple canned test.

I also removed non-Lua writing mode code. Sorry for those of you living in pre-2.6 days, but you'll have to upgrade. Hell, even with Lua scripting turned off, the query optimizer still used Lua, so if this worked in Redis 2.4 recently, I'd be surprised.

So that's what's going on right now.

Rebuild your indexes

Yeah. And rebuild your indexes. I'm sorry. Whenever I'm using rom as a cache or index of some kind, I re-cache and re-index daily so things like this always eventually resolve themselves, especially immediately after a library upgrade. Not a new idea; Google did it with their bigtables for cleanup, Postgres does auto-vacuum. Call this a manual vacuum via re-indexing.

Once/day or however often, maybe during off hours:

# import all of your models first
# then...
from rom import columns, util
for model in columns.MODELS.values():
    util.show_progress(util.refresh_indices(model))

That will rebuild all of the indexes on all of your models.

Almost P.S. - Loadable modules in Redis?

Redisconf 2016 happened last week and loadable modules were announced. I think that for people who host their own Redis, it could be awesome. Think of it like an answer to Postgres plugins. Hope I can pay someone to run my loadable modules, if I ever get around to building any :P

Wednesday, July 29, 2015

Transactions in Redis

Over the last few months, I've been thinking about and implementing transactions for Lua scripting in Redis. Not everyone understands why I'm doing this, so let me explain with a bit of history.

MySQL and Postgres

In 1998-2003, if you wanted to build a serious database-driven web site/service and didn't have money to pay Microsoft or Oracle for their databases, you picked either MySQL or Postgres. A lot of people picked MySQL because it was faster, and much of that was due to the MyISAM storage engine, which traded away transaction capability for performance - speed is speed. Some people went with Postgres because, despite its measurably slower performance on the same hardware, you could rely on Postgres to not lose your data (to be fair, data loss with MySQL was relatively rare, but data loss is never fun).

A lot of time has passed since then; MySQL moved on from MyISAM as the default storage engine to InnoDB (which has been available for a long time now), gained full transaction support in the storage engine, and more. At the same time, Postgres got faster, and added a continually expanding list of features to distinguish itself in the marketplace. And now the choice of whether to use MySQL or Postgres usually boils down to experience and preference, though occasionally business or regulatory needs dictate other choices.

TL;DR; data integrity

In a lot of ways, Redis up to now resembles MySQL back before InnoDB was an option. There is already a reasonable best-effort to ensure data integrity (replication, AOF, etc.), and the introduction of Lua scripting in Redis 2.6 has helped Redis grow up considerably, both in its capabilities and in the overall simplification of writing software that uses Redis.

Comparatively, Lua scripting operates very much like stored procedures in other databases, but script execution itself has a few caveats. The most important caveat for this post is that once a Lua script has written to the database, it will execute until any one of the following occurs:
  1. The script exits naturally after finishing its work, all writes have been applied
  2. The script hits an error and exits in the middle, all writes that were done up to the error have occurred, but no more writes will be done from the script
  3. Redis is shut down without saving via SHUTDOWN NOSAVE
  4. You attach a debugger and "fix" your script to get it to do #1 or #2 (or some other heroic deed that allows you to not lose data)
If you are writing software against a database, I expect you agree that only case #1 in that list is desirable. Cases #2, #3, and #4 are situations where you can end up with data corruption (cases #2 and #4) and/or data loss (cases #3 and #4). If you care about your data, you should be doing just about anything possible to prevent data corruption and loss. This is not philosophy, this is doing your job. Unfortunately, current Redis doesn't offer a lot of help here. I want to change that.
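
If you want to see case #2 for yourself, here is a small redis-py sketch (assuming a local Redis server and the redis-py client; the key names are made up). The script performs one write and then fails; the error comes back to the client, but the first write is not rolled back:

import redis

r = redis.Redis()
r.delete('demo:list')
r.set('demo:not-a-number', 'abc')

partial = r.register_script("""
redis.call('RPUSH', KEYS[1], 'first write')
-- this call fails (INCR against a non-numeric value) and aborts the script
return redis.call('INCR', KEYS[2])
""")

try:
    partial(keys=['demo:list', 'demo:not-a-number'])
except redis.exceptions.ResponseError as e:
    print('script failed:', e)

print(r.lrange('demo:list', 0, -1))  # the RPUSH above is still there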

Transactions in Lua

I am seeking to eliminate cases #2, #3, and #4 above, replacing the entire list with:
  1. The script exits naturally after finishing its work, all writes have been applied
  2. The script exits with an error, no changes have been made (all writes were rolled back)
No data loss. Either everything is written, or nothing is written. This should be the expectation of any database, and I intend to add it to the expectations that we all have about Redis.

The current pull request is a proof of concept. It does what it says it does, removing the need to lose data as long as you either a) explicitly run your scripts using the transactional variants, or b) force all Lua script calls to have transactional semantics with a configuration option.

There are many ways the current patch can be made substantially better, and I hope for help from Salvatore (the author of Redis) and the rest of the community.

Wednesday, November 26, 2014

Introduction to rate limiting with Redis [Part 2]

This article first appeared on November 3, 2014 over on Binpress at this link. I am reposting it here so my readers can find it easily.

In Introduction to rate limiting with Redis [Part 1], I described some motivations for rate limiting, as well as provided some Python and Lua code for offering basic and intermediate rate limiting functionality. If you haven’t already read it, you should, because I’m going to discuss several points from that article. In this post, I will talk about and address some problems with the previous methods, while also introducing sliding window functionality and variable-cost requests.

Problems with previous methods

The last rate limiting function that we wrote was over_limit_multi_lua(), which used server-side Lua scripting in Redis to do the heavy lifting of actually performing the rate limiting calculations. It is included below with the Python wrapper as a reference.

def over_limit_multi_lua(conn, limits=[(1, 10), (60, 120), (3600, 240)]):
    if not hasattr(conn, 'over_limit_lua'):
        conn.over_limit_lua = conn.register_script(over_limit_multi_lua_)

    return conn.over_limit_lua(
        keys=get_identifiers(), args=[json.dumps(limits), time.time()])

over_limit_multi_lua_ = '''
local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
for i, limit in ipairs(limits) do
    local duration = limit[1]

    local bucket = ':' .. duration .. ':' .. math.floor(now / duration)
    for j, id in ipairs(KEYS) do
        local key = id .. bucket

        local count = redis.call('INCR', key)
        redis.call('EXPIRE', key, duration)
        if tonumber(count) > limit[2] then
            return 1
        end
    end
end
return 0
'''

Hidden inside this code are several problems that can limit its usefulness and correctness when used for its intended purpose. These problems and their solutions are listed below.

Generating keys in the script

One of the first problems you might notice was mentioned by a commenter named Tobias on the previous post: we are constructing keys inside the Lua script. If you’ve read the Redis documentation about Lua scripting, you should know that we are supposed to pass all keys to be used in the script from outside when calling it.

The requirement to pass keys into the script is how Redis attempts to future-proof Lua scripts that are being written, as Redis Cluster (currently in beta) distributes keys across multiple servers. By having your keys known in advance, you can calculate which Redis Cluster server the script should run on, and, if the keys are on multiple Cluster servers, you can know that the script can’t run properly.

Our first problem is that generating keys inside the script can make the script violate Redis Cluster assumptions, which makes it incompatible with Redis Cluster, and generally makes it incompatible with most key-based sharding techniques for Redis.

To address this issue for Redis Cluster and other client-sharded Redis setups, we must use a method that handles rate limiting with a single key. Unfortunately, this can prevent atomic execution for multiple identifiers for Redis Cluster, but you can either rely on a single identifier (user id OR IP address, instead of both), or stick with non-clustered and non-sharded Redis in those cases.

What we count matters

Looking at our function definition, we can see that our default limits were 10 requests per second, 120 requests per minute, and 240 requests per hour. If you remember from the “Counting correctly” section, in order for our rate limiter to complete successfully, we needed to only increment one counter at a time, and we needed to stop counting if that counter went over the limit.

But if we were to reverse the order that the limits were defined, resulting in us checking our per-hour, then per-minute, then per-second limits (instead of per-second, minute, then hour), we would have our original counting problem all over again. Unfortunately, due to details too involved to explain here, just sorting by bucket size (smallest to largest) doesn’t actually solve the problem, and even the original order could result in requests failing that should have succeeded. Ultimately our problem is that we are counting all requests, both successful and unsuccessful (those that were prevented due to being over the limit).

To address the issue with what we count, we must perform two passes while rate limiting. Our first pass checks to see if the request would succeed (cleaning out old data as necessary), and the second pass increments the counters. In previous rate limiters, we were basically counting requests (successful and unsuccessful). With this new version, we are going to only count successful requests.

Stampeding elephants

One of the most consistent behaviors that can be seen among APIs or services that have been built with rate limiting in mind is that usually request counts get reset at the beginning of the rate limiter’s largest (and sometimes only) time slice. In our example, at every hour on the hour, every counter that had been incremented is reset.

One common result for APIs with these types of limits and limit resets is what’s sometimes referred to as the “stampeding elephants” problem. Because every user has their counts reset at the same time, when an API offers access to in-demand data, many requests will occur almost immediately after limits are reset. Similarly, if the user knows that they have outstanding requests that they can make near the end of a time slice, they will make those requests in order to “use up” their request credit that they would otherwise lose.

We partially addressed this issue by introducing multiple bucket sizes for our counters, specifically our per-second and per-minute buckets. But to fully address the issue, we need to implement a sliding-window rate limiter, where the count for requests that come in at 6:01PM and 6:59PM aren’t reset until roughly an hour later at 7:01PM and 7:59PM, respectively, not at 7:00PM. Further details about sliding windows are a little later.

Bonus feature: variable-cost requests

Because we are checking our limits before incrementing our counts, we can actually allow for variable-cost requests. The change to our algorithm will be minor, adding an increment for a variable weight instead of 1.

Sliding Windows

The biggest change to our rate limiting is actually the process of changing our rate limiting from individual buckets into sliding windows. One way of understanding sliding window rate limiting is that each user is given a number of tokens that can be used over a period of time. When you run out of tokens, you don't get to make any more requests. And when a token is used, that token is restored (and can be used again) after the time period has elapsed.

As an example, if you have 240 tokens that can be used in an hour, and you used 20 tokens at 6:05PM, you would only be able to make up to another 220 requests until 7:04PM. At 7:05PM, you would get those 20 tokens back (and if you made any other requests between 6:06PM and 7:05PM, those tokens would be restored later).

With our earlier rate limiting, we basically incremented counters, set an expiration time, and compared our counters to our limits. With sliding window rate limiting, incrementing a counter isn’t enough; we must also keep history about requests that came in so that we can properly restore request tokens.

One way of keeping a history, which is the method that we will use, is to imagine the whole window as being one large bucket with a single count (the window has a ‘duration’), similar to what we had before, with a bunch of smaller buckets inside it, each of which has their own individual counts. As an example, if we have a 1-hour window, we could use smaller buckets of 1 minute, 5 minutes, or even 15 minutes, depending on how precise we wanted to be, and how much memory and time we wanted to dedicate (more smaller buckets = more memory + more cleanup work). We will call the sizes of the smaller buckets their “precision.” You should notice that when duration is the same as precision, we have regular rate limits. You can see a picture of various precision buckets in a 1 hour window below.


As before, we can consider the smaller buckets to be labeled with individual times, say 6:00PM, 6:01PM, 6:02PM, etc. But as the current time becomes 7:00PM, what we want to do is to reset the count on the 6:00PM bucket to 0, adjust the whole window’s count, and re-label the bucket to 7:00PM. We would do the same thing to the 6:01PM bucket at 7:01PM, etc.

Data representation

We’ve now gotten to the point where we need to start talking about data representation. We didn’t really worry about representation before, simply because we were storing a handful of counters per identifier. But now we are no longer just storing 1 count for a 1-hour time slice; we could store 60 counts for a 1-hour time slice (or more if you wanted more precision), plus a timestamp that represents our oldest mini-bucket label.

For a simpler version of sliding windows, I had previously used a Redis LIST to represent the whole window, with each item in the LIST including both a time label, as well as the count for the smaller buckets. This can work for limited sliding windows, but restricts our flexibility when we want to use multiple rate limits (Redis LISTs have slow random access speeds).

Instead, we will use a Redis HASH as a miniature keyspace, which will store all count information related to rate limits for an identifier in a single HASH. Generally, for a sliding window of a specified duration and precision for an identifier, we will have the HASH stored at the key named by the identifier, with contents of the form:

<duration>:<precision>:o --> <timestamp of oldest entry>
<duration>:<precision>: --> <count of successful requests in this window>
<duration>:<precision>:<ts> --> <count of successful requests in this bucket>

For sliding windows where more than one sub-bucket has had successful requests, there can be multiple <duration>:<precision>:<ts> entries that would each represent one of the smaller buckets. For regular rate limits (not sliding window), the in-Redis schema is the same, though there will be at most one <duration>:<precision>:<ts> key, and duration is equal to precision for regular rate limits (as we mentioned before).

Because of the way we named the keys in our HASH, a single HASH can contain an arbitrary number of rate limits, both regular and windowed, without colliding with one another.
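
As a concrete (and entirely made-up) example: with an identifier of 'rl:user:123', a single sliding-window limit of 240 requests per hour at 1-minute precision, and a request arriving at Unix time 1414880400, the hash might look something like the following after that first successful request. Note that the sub-bucket labels are the block indices the script below computes (the timestamp divided by the precision), not raw timestamps:

3600:60:o        --> 23581281  (oldest block index that can still matter)
3600:60:         --> 1         (successful requests across the whole window)
3600:60:23581340 --> 1         (successful requests in the current 1-minute sub-bucket)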

Putting it all together

And finally, we are at the fun part; actually putting all of these ideas together. First off, we are going to use a specification for our rate limits to simultaneously support regular and sliding window rate limits, which looks a lot like our old specification.

One limit is: [duration, limit, precision], with precision being optional. If you omit the precision option, you get regular rate limits (same reset semantics as before). If you include the precision option, then you get sliding window rate limits. To pass one or more rate limits to the Lua script, we just wrap the series of individual limits in a list: [[duration 1, limit 1], [duration 2, limit 2, precision 2], ...], then encode it as JSON and pass it to the script.

Inside the script we need to make two passes over our limits and data. Our first pass cleans up old data while checking whether this request would put the user over their limit; the second pass increments all of the bucket counters to record that the request was allowed.

To explain the implementation details, I will be including blocks of Lua that can be logically considered together, describing generally what each section does after. Our first block of Lua script will include argument decoding, and cleaning up regular rate limits:

local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
local weight = tonumber(ARGV[3] or '1')
local longest_duration = limits[1][1] or 0
local saved_keys = {}
-- handle cleanup and limit checks
for i, limit in ipairs(limits) do

    local duration = limit[1]
    longest_duration = math.max(longest_duration, duration)
    local precision = limit[3] or duration
    precision = math.min(precision, duration)
    local blocks = math.ceil(duration / precision)
    local saved = {}
    table.insert(saved_keys, saved)
    saved.block_id = math.floor(now / precision)
    saved.trim_before = saved.block_id - blocks + 1
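    -- For illustration (made-up numbers): with duration=3600, precision=60,
    -- and now=1680000000, we get blocks=60, block_id=28000000, and
    -- trim_before=27999941; any bucket labeled before 27999941 is stale.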
    saved.count_key = duration .. ':' .. precision .. ':'
    saved.ts_key = saved.count_key .. 'o'
    for j, key in ipairs(KEYS) do

        local old_ts = redis.call('HGET', key, saved.ts_key)
        old_ts = old_ts and tonumber(old_ts) or saved.trim_before
        if old_ts > now then
            -- don't write in the past
            return 1
        end

        -- discover what needs to be cleaned up
        local decr = 0
        local dele = {}
        local trim = math.min(saved.trim_before, old_ts + blocks)
        for old_block = old_ts, trim - 1 do
            local bkey = saved.count_key .. old_block
            local bcount = redis.call('HGET', key, bkey)
            if bcount then
                decr = decr + tonumber(bcount)
                table.insert(dele, bkey)
            end
        end

        -- handle cleanup
        local cur
        if #dele > 0 then
            redis.call('HDEL', key, unpack(dele))
            cur = redis.call('HINCRBY', key, saved.count_key, -decr)
        else
            cur = redis.call('HGET', key, saved.count_key)
        end

        -- check our limits
        if tonumber(cur or '0') + weight > limit[2] then
            return 1
        end
    end
end

Going section by section through the code visually, where a blank line distinguishes individual sections, we can see 6 sections in the above code:
  1. Argument decoding, and starting the for loop that iterates over all rate limits
  2. Prepare our local variables, prepare and save our hash keys, then start iterating over the provided user identifiers (yes, we still support multiple identifiers for non-clustered cases, but you should only pass one identifier for Redis Cluster)
  3. Make sure that we aren’t writing data in the past
  4. Find those sub-buckets that need to be cleaned up (see the worked example after this list)
  5. Handle sub-bucket cleanup and window count updating
  6. Finally check the limit, returning 1 if the limit would have been exceeded
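
To make the trimming in section 4 a little less abstract, here is a small Python model of the same arithmetic, reusing the illustrative numbers from the comments in the script above (this is only a sketch of the logic, not code you need to run):

# Model of the stale-bucket discovery from section 4 (illustrative numbers).
blocks = 60                  # 3600-second window at 60-second precision
old_ts = 27999935            # oldest bucket label previously stored in the HASH
trim_before = 27999941       # first bucket label still inside the window

# Never look further than `blocks` buckets past the stored oldest label.
trim = min(trim_before, old_ts + blocks)

# These are the HASH fields the script would HGET, sum, and then HDEL.
stale = ['3600:60:%d' % block for block in range(old_ts, trim)]
# -> ['3600:60:27999935', '3600:60:27999936', ..., '3600:60:27999940']
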
Our second and last block of Lua operates under the precondition that the request should succeed, so we only need to increment a few counters and set a few timestamps:

-- there are enough resources, update the counts
for i, limit in ipairs(limits) do
    local saved = saved_keys[i]

    for j, key in ipairs(KEYS) do
        -- update the current timestamp, count, and bucket count
        redis.call('HSET', key, saved.ts_key, saved.trim_before)
        redis.call('HINCRBY', key, saved.count_key, weight)
        redis.call('HINCRBY', key, saved.count_key .. saved.block_id, weight)
    end
end

-- We calculated the longest-duration limit so we can EXPIRE
-- the whole HASH for quick and easy idle-time cleanup :)
if longest_duration > 0 then
    for _, key in ipairs(KEYS) do
        redis.call('EXPIRE', key, longest_duration)
    end
end

return 0

Going section by section one last time gets us:
  1. Start iterating over the limits and grab our saved hash keys
  2. Set the oldest data timestamp, and update both the window and bucket counts for all identifiers passed
  3. To ensure that our data is automatically cleaned up if requests stop coming in, set an EXPIRE time on the keys where our hash(es) are stored
  4. Return 0, signifying that the user is not over the limit

Optional fix: use Redis time

As part of our process for checking limits, we fetch the current unix timestamp in seconds. We use this timestamp to determine the sliding window's start and end times, as well as which sub-bucket to update. If clients are running on servers with reasonably correct clocks (within 1 second of each other at least, and within 1 second of the true time ideally), then there isn’t much to worry about. But if your clients are running on servers with drastically different system clocks, or on systems where you can’t necessarily fix the system clock, then we need to use a more consistent clock.

While we can’t always be certain that the system clock on our Redis server is necessarily correct (just like we can’t for our other clients), if every client uses the time returned by the TIME command from the same Redis server, then we can be reasonably assured that clients will have fairly consistent behavior, limited to the latency of a Redis round trip with command execution.

As part of our function definition, we will offer the option to use the result of the TIME command instead of system time. This will result in one additional round trip between the client and Redis to fetch the time before passing it to the Lua script.
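
For reference, TIME returns the server's unix time as two integers, seconds and microseconds; since the script only needs whole seconds, the wrapper below keeps just the first value (a minimal redis-py sketch, assuming conn is the same client used by the wrapper):

# TIME gives (unix seconds, microseconds into the current second);
# only the seconds portion is passed to the Lua script.
seconds, microseconds = conn.time()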

Add in our Python wrapper, which handles the optional Redis time and request weight parameters, and we are done:

def over_limit_sliding_window(conn, weight=1, limits=[(1, 10), (60, 120), (3600, 240, 60)], redis_time=False):
    if not hasattr(conn, 'over_limit_sliding_window_lua'):
        conn.over_limit_sliding_window_lua = conn.register_script(over_limit_sliding_window_lua_)

    now = conn.time()[0] if redis_time else time.time()
    return conn.over_limit_sliding_window_lua(
        keys=get_identifiers(), args=[json.dumps(limits), now, weight])
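
As a quick usage sketch (assuming a standard redis-py connection, and that get_identifiers() can resolve the caller's identifiers as in the earlier examples), the wrapper's 1/0 return value mirrors the Lua script:

import redis

conn = redis.Redis()

# 1 means over the limit (rejected), 0 means the request was counted and allowed.
if over_limit_sliding_window(conn, redis_time=True):
    print("rate limit exceeded, rejecting request")
else:
    print("request allowed")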

If you would like to see all of the rate limit functions and code in one place, including the over_limit_sliding_window() Lua script with wrapper, you can visit this GitHub gist.

Wrap up and conclusion

Congratulations on getting this far! I know, it was a slog through problems and solutions, followed by a lot of code, so now that you have seen all of it, let me tell you what you should take away from this article.

Obviously, the first thing you should get out of this article is an implementation of sliding window rate limiting in Python, which is trivially ported to other languages -- all you need to do is reimplement the wrapper. Just be careful when sending timestamps, durations, and precision values to the script, as the script and the EXPIRE call at the end expect all values to be in seconds, but some languages natively return timestamps as milliseconds instead of seconds.

You should also have learned that performing rate limiting with Redis can range from trivial (see our first example in part 1) to surprisingly complex, depending on the features required, and how technically correct you want your rate limiting to be. It also turns out that the problems that were outlined at the beginning of this article aren’t necessarily deal-breakers for many users, and I have seen many implementations similar to the over_limit_multi_lua() method from part 1 that are perfectly fine for even heavy users*. Really it just means that you have a choice about how you want to rate limit.

And finally, you may also have learned that you can use Redis hashes as miniature keyspaces to collect data together. This can be used for rate limiting as we just did, as well as a DB row work-alike (the hash keys are like named columns, with values the row content), unique (but unsorted) indexes (i.e. email to user id lookup table, id to encoded data lookup table, ...), sharded data holders, and more.
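
As one final sketch of the "hashes as miniature keyspaces" idea outside of rate limiting (using the same redis-py connection as before; the key and field names here are purely illustrative):

# An email -> user id lookup table packed into a single HASH.
conn.hset('email:to:id', 'user@example.com', 1234)
user_id = conn.hget('email:to:id', 'user@example.com')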

For more from me on Redis and Python, you can check out the rest of my blog at dr-josiah.com.

* When Twitter first released their API, they had a per-hour rate limit that was reset at the beginning of every hour, just like our most basic rate limiter from part 1. The current Twitter API has a per-15 minute rate limit, reset at the beginning of every 15 minute interval (on the hour, then 15, 30, and 45 minutes after the hour) for many of their APIs. (I have no information on whether Twitter may or may not be using Redis for rate limiting, but they have admitted to using Redis in some capacity by virtue of their release of Twemproxy/Nutcracker).