Unlike relational databases, Bigtable is a schema-less database, so adding new properties is quite easy. You just need to add the new property definition(s) like this:
from google.appengine.ext import db

class Foo(db.Model):
    existingProperty = db.StringProperty()
    newProperty = db.StringProperty()
When you db.Model.get() one of the existing entities (created before you defined this new property), newProperty will automatically be set to None (because the default value of a db.StringProperty is None).
This is because Bigtable is a schema-less database and does not require all entities to have the same set of properties. When you look at existing entities in the Datastore Viewer, you will see <missing> as the value of this new property.
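For example, loading one of the old entities shows the default (a quick sketch; it assumes at least one Foo was saved before newProperty was added):

foo = Foo.all().get()   # an entity saved before newProperty existed
print foo.newProperty   # None -- supplied by the property default,
                        # shown as <missing> in the Datastore Viewer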
It's very easy, isn't it?
There is one caveat, though: you can't query against those missing properties. The following query will NOT find the entities with a <missing> property value.
query = Foo.all().filter('newProperty =', None)
This is a little counter-intuitive, because you actually see None as the property value when you db.Model.get() those entities. You need to understand that this None is not coming from the database; it is simply the default value from the property definition.
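You can see the mismatch directly (a sketch):

foo = Foo.all().get()
print foo.newProperty is None                          # True -- the default
print Foo.all().filter('newProperty =', None).count()  # 0 -- no index row to match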
Then how can you find those entities with missing properties? This is a common question among App Engine newbies, and you see a lot of discussion around this topic, such as:
AppEngine: Query database for records with <missing> value
There is even an article about this topic written by one of Google's engineers.
That article is, however, a bit outdated (it was written in 2008, before the introduction of the task queue) and not generalized.
After creating quite a few App Engine applications, I came to the conclusion that it is much easier to use a schema-versioning technique instead.
Here is a quick summary of this technique.
1. Define a "version" property when you define a model class
class Foo(db.Model):
    VER = 1
    version = db.IntegerProperty(default=VER)
    existingProperty = db.StringProperty()
2. Increment the version number when you change the schema
class Foo(db.Model):
    VER = 2
    version = db.IntegerProperty(default=VER)
    existingProperty = db.StringProperty()
    newProperty = db.StringProperty()
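Note why version is queryable while newProperty is not: version = 1 was physically written to the datastore when the old entities were saved, so an inequality filter can find them (a quick sketch):

old = Foo.all().filter('version <', Foo.VER).get()  # saved under VER = 1
# old.version == 1        -- the stored value wins over the new default
# old.newProperty is None -- <missing> in the datastore, filled from the default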
3. Define a migration method, which typically looks like this (notice that we don't need to explicitly set the values of new properties as long as the default is appropriate, None in this case). Each run migrates one entity and then re-queues itself with the deferred library until nothing is left; a batched variant follows the snippet.
from google.appengine.ext import deferred

class Foo(db.Model):
    ...
    @classmethod
    def migrate(cls):
        # Find one entity that is still on an older schema version.
        query = cls.all().filter("version <", cls.VER)
        items = query.fetch(1)
        if items:
            item = items[0]
            item.version = cls.VER  # new properties get their defaults on put()
            item.put()
            # Chain another deferred task until no unmigrated entities remain.
            deferred.defer(cls.migrate)
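Migrating one entity per task keeps each request tiny, but with many entities the chain takes a while. A batched variant cuts the number of task-queue hops (a sketch, not from the original post; migrate_batch and batch_size are illustrative names):

from google.appengine.ext import db, deferred

class Foo(db.Model):
    VER = 2
    version = db.IntegerProperty(default=VER)
    existingProperty = db.StringProperty()
    newProperty = db.StringProperty()

    @classmethod
    def migrate_batch(cls, batch_size=100):
        # Grab a batch of unmigrated entities instead of one at a time.
        items = cls.all().filter("version <", cls.VER).fetch(batch_size)
        for item in items:
            item.version = cls.VER
        if items:
            db.put(items)  # one datastore RPC for the whole batch
            deferred.defer(cls.migrate_batch, batch_size)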
4. Execute this method from your own admin screen (via an admin URL)
gdispatch.route(lambda: ('/admin/api/migrate', AdminMigrateHandler, 'admin'))
from google.appengine.ext import webapp

class AdminMigrateHandler(webapp.RequestHandler):
    def get(self):
        Foo.migrate()  # kicks off the deferred migration chain
        self.response.out.write('{"success": true}')
Once you run this migration code, the property values of existing entities will become None (or the default value you specified in the property definition), and you can query against those property values.
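You can verify the migration from the interactive console (a sketch; the first count drops to 0 as the deferred chain drains):

print Foo.all().filter('version <', Foo.VER).count()   # 0 once every entity is migrated
print Foo.all().filter('newProperty =', None).count()  # now matches the old entities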
You can create a set of 100 task queue elements and taskqueue.add() them 100 at a time (per RPC call). I would suggest the mapreduce API as a better alternative to executing your migration method from the remote API or interactive console. This would eliminate the need for the task queue or deferred API.
Posted by: Kyle Roberts | December 24, 2010 at 11:42 AM
How do you deal with updating reference properties? We are stuck with them.
Posted by: sai | February 25, 2011 at 01:12 AM
I don't buy the Cache argument, I'm afraid. It implies all updates are occurring only through HTTP? How practical is that in reality? If I have an application running inside Amazon's network that applies some general discounting rules, then all cached URIs that represent books are immediately out of date anyway. It is not as if, because we expose our data RESTfully, that is all of a sudden going to be the only way you can update it. So I don't see why you should enforce rules about PUT granularity in your REST implementation; it makes no sense to me.
I also don't buy the HTTP swamping argument. Updates are generally a lot less frequent than gets, so why not go granular on the update? If you have a number of tuples to update, sure, go more chunky; that is a standard piece of distributed-architecture advice. But if you have only one thing to update, why pass everything else? That can have some nasty side effects too: you might introduce the lost-update problem because of optimistic-concurrency issues, i.e. the server doesn't actually know which fields you are trying to update, so it assumes you are trying to update everything. If in our distributed systems we mandate chunkiness, we simply further increase the system's brittleness.
Posted by: Shafiqul | August 04, 2012 at 12:43 PM