6 ways to make working with DynamoDB an awesome experience
I've had a pleasant experience working with DynamoDB. Amazon has done fantastic work over the past few years improving their NoSQL offering. It's now truly powerful and versatile.
Thanks to our client, ClearCare, for enabling me to work with DynamoDB and share my lessons. Together, we created an open source library called cc_dynamodb3 to help others make the most of their python integration.
Background
As part of ClearCare's broader move to SOA (and the now more popular term, microservices), we had more freedom to choose the right tools for the job.
The need to launch a new feature created an opportunity to experiment with a separate, simpler datastore. Since they were already using so much of the AWS stack, DynamoDB was a natural choice.
We set up the new feature as a separate service, with its own servers, separate domain and database, interacting with the main product via APIs.
End result
Thanks largely to DynamoDB's scalability, plus a solid API integration (mostly via JSON), we were able to launch the feature on a very tight deadline and have it scale immediately.
Throughout 6+ months of enhancements, we rarely had fires to put out, and almost no downtime. Remarkable, considering usage grew ~100x.
We learned a lot along the way, and I'm happy to share some of those lessons with you right now!
Alright, help me make the most of working with DynamoDB and python!
The following tips may help you work with Amazon Web Services' DynamoDB and python (including Django or Flask).
1. Use boto3
As of June 2015, Amazon recommends boto3 moving forward. boto2 is being deprecated, and boto3 offers official support plus a much cleaner, more pythonic API for working with AWS!
Upgrading from boto2 to boto3 is fairly easy, although I strongly suggest you write tests for the affected code. Unit tests are essential, and you should have at least one for each low-level piece of code that used boto2 directly.
I also suggest at least one or two higher level tests, and manually try out a code path that's core to the business (to avoid breaking key user paths and waking up the Ops department ;).
Check out how we implemented the boto3 connection.
2. Validate your data
DynamoDB natively supports several data types that your language will probably handle a bit differently.
In our case, it's nice to convert our data to native python objects where appropriate:
- date and datetime objects, for dates and timestamps (we always store date-like data as integers, for easy comparison)
- booleans
- empty attributes (to avoid pesky "AttributeValue may not contain an empty string" errors)
- automatic UUID generation (see #3)
Fortunately, we have that work all done and tested as part of cc_dynamodb3. Check out models.py here and here especially.
We used schematics to create a light ORM.
3. Consider a UUID HashKey for your primary key
A table's primary key cannot be changed once created. There are lots of resources out there to help you design your tables, and I suggest you research your use cases ahead of time.
In our experience, the safest choice is a string UUID Hash Key. It gives you 100% flexibility over how you're going to uniquely identify your data. You can always add, modify or remove GSIs to improve performance for specific query operations.
Here are some easily avoidable mistakes:
- Don't include a RangeKey in an index unless you actually have multiple records with the same HashKey. It makes it a pain to GetItem directly, since you'll need both the Hash and Range keys to find it. (PS. range key = sort key)
- Choose the primary key carefully. You cannot change the primary key once the table is created, and then you'll have to deal with migrating your data to a new table.
- When you update an item via PutItem, DynamoDB will create a new entry if the primary key has changed. The old item will remain unchanged!
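Generating the UUID hash key is a one-liner; here's a sketch (the helper name and `id` key are hypothetical, not from cc_dynamodb3):

```python
import uuid


def with_uuid_key(data, key_name='id'):
    # Assign a random UUID string as the hash key if one isn't already set.
    # uuid4 collisions are astronomically unlikely, so this is safe to use
    # as a unique identifier without coordination.
    item = dict(data)
    item.setdefault(key_name, str(uuid.uuid4()))
    return item


item = with_uuid_key({'title': 'Good to Great'})
```

Because the key carries no business meaning, you never need to migrate the table when your uniqueness requirements change.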
4. Consider a composite RangeKey for extra flexibility
You can use a `begins_with` filter on a RangeKey, but you cannot use it on a HashKey.
For example, say you have a book library. If you first and foremost care about organizing books by publisher, that could be your HashKey. Then you may sort the books by year of publication and uniquely identify them by their ISBN. So you can have:
- HashKey: publisher ID (or publisher name), e.g. HarperBusiness
- RangeKey: year + ISBN, e.g. 1995-9512512
You can find all HarperBusiness books published in 1995 via a single query using `begins_with`. What great performance!
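Building the composite key is simple string concatenation; a sketch (the helper name is mine):

```python
def year_isbn_key(year, isbn):
    # Build the composite RangeKey described above, e.g. '1995-9512512'.
    # Zero-padding the year keeps lexicographic order identical to
    # chronological order, which is what DynamoDB sorts by.
    return '{:04d}-{}'.format(year, isbn)


key = year_isbn_key(1995, '9512512')
```

On the query side, the matching boto3 condition would be `Key('publisher').eq('HarperBusiness') & Key('year_isbn').begins_with('1995-')` from `boto3.dynamodb.conditions`.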
5. Scans and queries do not return all data in the table
A single scan or query returns at most 1MB of data, so you have to paginate using LastEvaluatedKey. Here is how we handle that in the light ORM from cc_dynamodb3:
```python
class DynamoDBModel(Model):
    # moar stuff here...

    @classmethod
    def all(cls):
        response = cls.table().scan()
        # DynamoDB scan only returns up to 1MB of data, so we need to keep scanning.
        while True:
            metadata = response.get('ResponseMetadata', {})
            for row in response['Items']:
                yield cls.from_row(row, metadata)
            if response.get('LastEvaluatedKey'):
                response = cls.table().scan(
                    ExclusiveStartKey=response['LastEvaluatedKey'],
                )
            else:
                break
```
Caveat: `Item.all()` may perform multiple queries against DynamoDB, and thus has hidden cost and latency. Note the use of `yield` to lazily evaluate the results. This lets you retrieve only as much data as you actually need, instead of scanning the whole table in one go.
In practice, you don't want to perform table scans or large queries on the main server thread anyway.
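Because `all()` is a generator, consuming only part of it fetches only the scan pages those items need. A small helper sketch (the name `take` is mine):

```python
import itertools


def take(iterable, n):
    # Consume only the first n items of a lazy generator like Model.all().
    # Pagination stops as soon as those items are yielded, so at most the
    # scan pages covering them are ever fetched from DynamoDB.
    return list(itertools.islice(iterable, n))
```

For example, `take(Book.all(), 25)` on a background worker fetches one or two pages rather than the whole table.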
6. Include `created` and `updated` columns in each table
In this age of big data and analytics, I suggest always having a `created` and an `updated` column.
In our light ORM, we can do this with a `BaseModel`:
```python
from schematics import types as fields

from cc_dynamodb3.models import DynamoDBModel


class BaseModel(DynamoDBModel):
    created = fields.DateTimeType(default=DynamoDBModel.utcnow)
    updated = fields.DateTimeType(default=DynamoDBModel.utcnow)

    def save(self, overwrite=False):
        self.updated = DynamoDBModel.utcnow()
        # super() must reference BaseModel here, not DynamoDBModel,
        # otherwise DynamoDBModel's own save() would be skipped.
        return super(BaseModel, self).save(overwrite=overwrite)
```
Then, in your code:
```python
from schematics import types as fields

from myproject.db import BaseModel


class Book(BaseModel):
    publisher = fields.StringType(required=True)
```
This makes sure all your models inheriting from `BaseModel` will have the two columns automatically populated. Ta-da!
If you'd enjoy working on these types of projects while having a chance to help make aging better, ClearCare is hiring full-time engineers! Check out their careers section.