After a bit of a hiatus, I picked up my copy of “Agile Data Science” again and began working through the examples from the beginning. However, the book was originally published in 2013, which is an eternity in the computing and data science world. In fact, the version I’m using is old enough that the author has since released “Agile Data Science 2.0,” which came out in 2017. Given that I had the original, I decided to give it a go with that one, rather than updating to Spark and everything else.
Chapter 3 is where the real coding begins, and that’s where I immediately ran into problems. I had spun up my Amazon Linux EC2 instance as a micro with the AWS-optimized kernel. When I tried to install Google’s Snappy (which is a dependency of Avro), it choked, as apparently 512 MB isn’t enough… so, terminate that instance and spin up a ‘small’ EC2 instance… OK, some success. But then I realized how old (and perhaps abandoned) Snappy is… and that there is a new, easier way to install Snappy than what’s in the book:
The book’s command: pip install python-snappy
Au contraire, the current install is: pip install snappy
Installing Avro then went according to the book. Then, continuing with the example that mines your Gmail inbox for data, everything came off the rails. First of all, pages 42-44 purport to show you what you need to do to gather the data from your Gmail account. The reality is that these meager pages hide the 90% of the code that actually makes up this endeavor, kind of like the calculus teaching assistant who skips over the ‘obvious’ middle steps and ‘solves’ the end of the problem without showing you how to get there. To boot, the book references GitHub links for the code without ever mentioning the base URL where you can find them… Off to the interwebs!
Spelunking through GitHub, I finally found the four full Python examples. Not so fast, grasshopper! The code bombs almost immediately. You have to go find and install the library that provides lepl.apps.rfc3696 (note: it is no longer being developed). Fortunately, that was pretty painless:
pip install lepl
Surely now that all these little discrepancies are cleared up, the way forward is easy, right? Not really! There have been a number of security improvements to IMAP and Gmail over the years that make much of this code inoperative. Fortunately, I have suffered through the pain and can light the path forward for you. Follow these instructions:
- First, enable two-factor authentication on your Gmail account.
- Next, you will need to add an application password:
  - Note: you must first enable two-factor authentication.
  - Go to your Google account’s security settings.
  - Click on the link labeled ‘App passwords’ under the ‘Signing in to Google’ section.
  - Click on the drop-down menu labeled ‘Select app’.
  - Choose the custom app option.
  - Give your app a name. The great Google will generate a 16-character password. Copy it down in a safe place.
- Now, for the changes you will need to make when running gmail.py from the Python examples referenced above:
  - For your username, use your Gmail address.
  - For your password, use the app password you generated above.
  - For the folder name, use ‘INBOX’ (do not use any other variant; it is case sensitive).
If you follow the above, you will not experience the dreaded error message “imaplib.error: command SELECT illegal in state NONAUTH, only allowed in states AUTH, SELECTED”.
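For the curious, here is a minimal sketch of what those three settings amount to when you talk to Gmail with Python’s standard imaplib. This is not the book’s gmail.py; the function name open_inbox and the imap_cls parameter are my own inventions for illustration, and stripping spaces from the app password is just a convenience I am assuming is harmless. As the error text itself says, the NONAUTH complaint means SELECT was issued on a connection that never got past login, so this sketch lets a failed login raise right away instead of carrying on.

```python
import imaplib

GMAIL_IMAP_HOST = "imap.gmail.com"  # Gmail's IMAP-over-SSL endpoint


def open_inbox(username, app_password, imap_cls=imaplib.IMAP4_SSL):
    """Log in with an app password and select INBOX.

    Returns (connection, message_count). imap_cls is a hypothetical hook so
    the function can be exercised without a network connection.
    """
    conn = imap_cls(GMAIL_IMAP_HOST)

    # Google displays the app password in groups separated by spaces;
    # removing them here is an assumption made for convenience.
    # login() raises an IMAP error if the credentials are rejected.
    conn.login(username, app_password.replace(" ", ""))

    # The folder name is exactly 'INBOX'. readonly=True avoids marking
    # messages as read while you are only mining them for data.
    status, data = conn.select("INBOX", readonly=True)
    if status != "OK":
        raise RuntimeError("could not select INBOX: %r" % (data,))

    # On success, select() reports the number of messages in the mailbox.
    return conn, int(data[0])
```

To try it, call open_inbox with your Gmail address and the app password you generated, then call logout() on the returned connection when you are done.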