Geo Code

Hyperpublic Local Data Engineering Blog
May 31

Migrating From PostgreSQL to MongoDB at Hyperpublic

The engineering team at Hyperpublic has been hard at work over the last few weeks re-architecting our platform in order to migrate from a relational database (PostgreSQL) to a NoSQL datastore (MongoDB). We completed the migration about two weeks ago, and nothing major has broken down so far, so it's about time to do a recap so that the community can benefit from what we learned along the way. This post starts off anecdotal and moves to technical, and hopefully after reading it you'll have a good background on the reasons why you may want/need to migrate and how to go about performing the migration.

Where we came from
The Hyperpublic platform was originally built with Ruby on Rails on the Heroku platform so that we could develop and iterate quickly. Until we discovered our true utility as an open, rich location data platform, there was no sense in over-engineering a custom system for performance and reliability. As a result, our choice of database was made for us by Heroku, which only supports PostgreSQL out of the box. This was fine, as most local objects were being added to our system by our users, and we were only adding objects within a couple of major US cities in order to prove the concept and utility of our platform.

For those of you not familiar with Hyperpublic, what we do is provide a rich data layer on top of local objects. For every real-world person, place, or thing, we want to be able to provide developers with the object's physical location, tags that describe it, photos, descriptions, and various properties that will be useful to anyone building an application that could use local data. Since the data was modeled relationally, you can probably make some accurate guesses about the database tables that we defined:

  • People
  • Places
  • Things
and each of the above has many different...
  • Locations
  • Images
  • Tags
  • Properties

...among others.

So what was the problem?
As Hyperpublic grew, we began building up the data programmatically. As the number of local objects in our database increased and we found our niche as a data provider, we ran into our first two problems of scale.

As you can imagine, in order to return a local object to a user making an API call or viewing our application at Hyperpublic.com, we would have to join on all of the above tables. This was slow. Also, it felt illogical that we would constantly have to join across a normalized data structure to receive images, tags, locations, and properties for a given place, when those images, tags, locations, and properties only belonged to one specific place every time. 
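To make the contrast concrete, here is a minimal sketch in plain Ruby. The table rows, field names, and the sample place are all hypothetical stand-ins, not our actual schema. The first half stitches a place together from separate "tables" on every read; the second half shows the same data stored once, already in the shape we want to return.

```ruby
# Hypothetical rows, as they might sit in separate relational tables.
PLACES    = [{ id: 1, name: "Corner Cafe" }]
LOCATIONS = [{ place_id: 1, lat: 40.7359, lon: -73.9911 }]
TAGS      = [{ place_id: 1, tag: "coffee" }, { place_id: 1, tag: "wifi" }]
IMAGES    = [{ place_id: 1, url: "cafe.jpg" }]

# Relational style: one lookup per table, stitched together on every read.
def assemble_place(place_id)
  place = PLACES.find { |row| row[:id] == place_id }
  {
    name:      place[:name],
    locations: LOCATIONS.select { |r| r[:place_id] == place_id }.map { |r| [r[:lon], r[:lat]] },
    tags:      TAGS.select { |r| r[:place_id] == place_id }.map { |r| r[:tag] },
    images:    IMAGES.select { |r| r[:place_id] == place_id }.map { |r| r[:url] }
  }
end

# Document style: the same data stored once, in the shape we serve.
PLACE_DOCUMENT = {
  name:      "Corner Cafe",
  locations: [[-73.9911, 40.7359]],
  tags:      ["coffee", "wifi"],
  images:    ["cafe.jpg"]
}

# Both approaches yield the same object; only the read cost differs.
puts assemble_place(1) == PLACE_DOCUMENT
```

In the relational version every additional kind of metadata means another table to visit per read, which is the cost we kept paying.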

The second problem that we faced was support for geo-spatial queries. Without PostGIS, a geo-extension for PostgreSQL, proximity, bounding-box/radius, and nearest-neighbor queries require doing math on the stored lat/lon of every point in your system. This means a table scan, and once you get beyond one or two cities' worth of local object data, as we did at Hyperpublic, this gets very slow. We began researching and educating ourselves on PostGIS. It is undoubtedly a reasonable option for the types of problems we were facing, but the implementation felt less than clean: it felt ugly to program against, and it felt tacked onto Postgres rather than embedded within it from the beginning.
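As an illustration of what "doing math on every point" means, here is a minimal sketch of a radius query without a spatial index, in plain Ruby. It uses the haversine great-circle formula, which is one common choice and an assumption here, not necessarily the exact math we ran. The sample points and helper names are hypothetical.

```ruby
EARTH_RADIUS_KM = 6371.0

# Great-circle distance between two lat/lon points, in kilometers (haversine formula).
def haversine_km(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  # Clamp to 1.0 to guard against floating-point overshoot before asin.
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt([a, 1.0].min))
end

# Radius query without an index: the distance is computed for every stored point.
def within_radius(points, lat, lon, radius_km)
  points.select { |p| haversine_km(lat, lon, p[:lat], p[:lon]) <= radius_km }
end

points = [
  { name: "Times Square",    lat: 40.7580, lon: -73.9855 },
  { name: "Brooklyn Bridge", lat: 40.7061, lon: -73.9969 },
  { name: "Chicago Loop",    lat: 41.8781, lon: -87.6298 }
]

# Everything within 5 km of Union Square; every point is visited to answer this.
puts within_radius(points, 40.7359, -73.9911, 5.0).map { |p| p[:name] }.inspect
```

The work grows linearly with the number of stored points, which is why this approach stops being viable once the dataset covers more than a city or two.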

Enter MongoDB
While evaluating solutions for the above problems we were looking for a database that could store arbitrary properties and undefined quantities of metadata along with each object. This is the prototypical use case for a NoSQL datastore. Additionally, we were looking for geo-spatial index and query support, one of the oft-focused-upon features of MongoDB. We knew that the 10gen team was here in NYC, as we've participated in many events and conferences at which they've been present, and they're very supportive of startups building on their technology. The choice to migrate to MongoDB was obvious given that it supports all of the near-term features that we require from a datastore. Then it was just a matter of how we would do the migration...

How we migrated
(Note - this section is somewhat Rails/ActiveRecord/Mongoid heavy, but you can replace these terms with the framework and ORM/ODM of your choice.)

Step 0 - Have a unit test suite with good coverage on your models and make sure all your tests pass.

Step 1 - Write scripts to copy your data into MongoDB. Do not delete/change anything in the current schema.

In a non-optimized Rails application, the closest that you get to your database is by writing your model classes using the ActiveRecord object-relational mapper. When switching to MongoDB we chose to use the Mongoid object-document mapper as a not-quite-drop-in replacement for ActiveRecord. Our migration script first namespaced the Mongoid objects, defined the collections they would be stored in, and defined the fields that would be mapped in each collection. It looked something like this...

The goal of the data migration scripts is to instantiate your ActiveRecord objects and then insert them into Mongo using Mongoid objects. (You can bypass Mongoid and go straight to MongoDB using the Ruby driver, but Mongoid gives you some conveniences like relationships and timestamps.) We opted to always instantiate the objects, but we did so using a queue and background jobs so that we wouldn't exceed the memory on the server by instantiating every single object in one process. Our script would basically just map the data from AR to Mongoid like so...
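The original listing is not reproduced here. As a rough sketch of the idea only, the plain Ruby below maps relational records to document-shaped hashes in fixed-size batches. Everything in it is a stand-in: `LegacyPlace` plays the role of an ActiveRecord model, the hash returned by `to_document` plays the role of a Mongoid object, and the block passed to `migrate_in_batches` plays the role of a background job. The field names are hypothetical.

```ruby
# Stand-in for an ActiveRecord model instance (hypothetical fields).
LegacyPlace = Struct.new(:id, :name, :lat, :lon, :tags)

# Map one relational record to the hash that would be stored as a document.
def to_document(record)
  {
    old_id:    record.id, # keep the Postgres ID for later lookups
    name:      record.name,
    locations: [{ lon: record.lon, lat: record.lat }],
    tags:      record.tags
  }
end

# Hand records to the block one batch at a time. In this sketch the records
# are already in an array; with a real ORM each batch would be loaded lazily.
def migrate_in_batches(records, batch_size)
  records.each_slice(batch_size) do |batch|
    yield batch.map { |record| to_document(record) }
  end
end

legacy = (1..5).map { |i| LegacyPlace.new(i, "Place #{i}", 40.0 + i, -74.0, ["demo"]) }

inserted = []
migrate_in_batches(legacy, 2) { |docs| inserted.concat(docs) } # the block stands in for a job
puts inserted.size
```

In a Rails application, ActiveRecord's `find_each` or `find_in_batches` can play the role of `each_slice`, so that only one batch of records is instantiated at a time.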

Notice that we are copying the old_id from AR to Mongoid. This has proven useful many times over when needing to do lookups after the fact, and I recommend you keep it around for a while until you are certain that you'll never need it again. Regarding IDs, keep in mind that your object IDs will change. You'll need to update any external resources that refer to the old object IDs. For example, our Amazon S3 storage was configured to store photos in buckets named after the object's ID.
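As a hedged sketch of that kind of cleanup, the plain Ruby below rewrites storage keys from old IDs to new ones. The key layout `<prefix>/<old id>/<file>`, the `remap_key` helper, and the sample IDs are all assumptions made for illustration; they are not our actual S3 layout.

```ruby
# Hypothetical mapping built during the migration: Postgres ID => new document ID.
id_map = {
  42 => "4de5c1a2e4b0f1a3c9000001",
  43 => "4de5c1a2e4b0f1a3c9000002"
}

# Rewrite a storage key of the assumed form "<prefix>/<old id>/<file>".
# Returns nil when the key does not match that layout or the old ID is unknown,
# so the caller can log it instead of silently losing a file.
def remap_key(key, id_map)
  match = key.match(%r{\A([^/]+)/(\d+)/(.+)\z})
  return nil unless match

  new_id = id_map[match[2].to_i]
  new_id && "#{match[1]}/#{new_id}/#{match[3]}"
end

puts remap_key("places/42/front.jpg", id_map).inspect
puts remap_key("places/99/front.jpg", id_map).inspect
```

Returning nil for unknown IDs makes it easy to collect the keys that need manual attention before anything is deleted.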

After your script is written, you should be able to safely run it on a copy of your production dataset as a test, since it won't modify or delete any production data.

Step 2 - Update your application
At this point, you'll want to create another branch to update your application. The reason is that the migration needs to reference the ActiveRecord models in the current application, so you can't delete or modify them until all the data is copied over. On a separate branch, you can update your models to use Mongoid instead of ActiveRecord. The meat of this process can be copied over from the models you created during your data migration. Once your application is working again with the updated models, update and run your unit tests to make sure that they all pass.

Step 3 - Configure your production MongoDB environment
We won't go into details here, but we recommend at least a three-node replica set configuration. MongoDB has a good writeup on how to set this up here.

Step 4 - Deploy to production
Back up your database and take your application down for maintenance so that no writes come in after the data migration. Deploy the branch with your data migration and run it. Assuming all goes well, deploy the second branch with your updated application. Restart your application and you'll be up and running on MongoDB.

The results
I don't have hard numbers, so this is going to be more anecdotal than scientific, but after migrating to MongoDB the Hyperpublic platform was immediately faster and more scalable than it was previously. We've seen over 5x speed increases in the user-facing application and 20x speed increases within our API.

Geospatial queries now use indexes computed ahead of time and return instantly, instead of doing a table scan and computing distances for every point on each query.

Loading all of the metadata associated with a local object is now completed without any joins, and the number of queries per page was reduced dramatically.

We went from two cities' worth of data pushed into the production system as a proof of concept to 10+ cities' worth of data pushed in and usable by third-party developers off of our live platform, with no performance bottlenecks in sight in the near term.

Gotchas and lessons learned
10gen is iterating very quickly on MongoDB, and as a result programming against it is sometimes a moving target. Here are some of the lessons learned along the way during the migration:

  • MongoDB has great support for geo-spatial indexes, but if you want multiple locations indexed per document, you'll have to use MongoDB 1.9. Our "People" can have multiple locations - where they live, where they work, etc. - so this was a requirement for us. 1.9 is supposedly unstable and not recommended for production use, but we have been quite happy with it.
  • The ODMs likely won't be fully featured, up-to-date, or drop-in replacements for the ORM you may have been using with PostgreSQL. If you're a beginner, I recommend getting very familiar with the MongoDB driver for the language of your choice, as you'll frequently have to drop down to it directly.
  • Do not try to model things relationally in MongoDB. If your problem is suited to relational modeling, then stick with a relational DB. 
  • Keep old Postgres IDs around. You frequently have to refer to them when doing in-memory "joins" with your old relational data or when referring to legacy data stored in external services.
  • The MongoDB community and 10gen are very helpful. Talk to people at local meetups and conferences, and they're usually happy to help you with any issues you have in your migration to MongoDB.
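
The in-memory "joins" mentioned in the list above can be sketched in a few lines of plain Ruby. The documents, the legacy rows, and the `attach_legacy_rows` helper are hypothetical; the point is only that a stored `old_id` lets you group legacy rows once and attach them to the migrated documents without a database join.

```ruby
# Attach legacy relational rows to migrated documents by matching the
# legacy foreign key against each document's stored old_id.
def attach_legacy_rows(documents, legacy_rows, foreign_key)
  rows_by_old_id = legacy_rows.group_by { |row| row[foreign_key] }
  documents.map do |doc|
    doc.merge(legacy: rows_by_old_id.fetch(doc[:old_id], []))
  end
end

# Migrated documents (hypothetical), each still carrying its Postgres ID.
documents = [
  { _id: "a1", old_id: 7, name: "Corner Cafe" },
  { _id: "b2", old_id: 9, name: "Book Shop" }
]

# Rows that still live in the old relational data or an external service.
legacy_reviews = [
  { place_id: 9, stars: 5 },
  { place_id: 7, stars: 4 },
  { place_id: 9, stars: 3 }
]

joined = attach_legacy_rows(documents, legacy_reviews, :place_id)
puts joined.map { |d| [d[:name], d[:legacy].map { |r| r[:stars] }] }.inspect
```

Grouping the legacy rows once keeps the join to a single pass over each list rather than a nested loop.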

I hope this post was useful. If you're migrating to MongoDB and have any questions or could use some help, drop me a line anytime @petkanics on Twitter.

About Geo Code

Geo Code is the engineering blog of Hyperpublic, an open location platform.