MongoWave: Persistence on Google FedOne Wave Server with mongoDB

Posted Mar 03, 2010 by Anthony in Architecture, Blogs, Google Wave, Strategy

Purpose

My company, SESI, has been working on applying advanced collaboration and analytical techniques to complex situations to aid decision-making. As part of that effort I have been quite involved with the Google Wave Protocol project. I wanted to share my preliminary success implementing persistence on FedOne with MongoDB.  In my previous post on the topic of persistence I explored where you could add persistence.  Since I followed the recommendations of that post, it is probably a good primer for this entry.  This post will be more focused on the mechanics of my implementation.  I will discuss my current progress, why I chose mongoDB, my design goals, how I implemented the design in code, the hurdles I encountered, and my next steps.

Current Progress

I have tested persistence locally and through federation with the console client, QWaveClient, and wavesandbox.com.  I can create multiple waves, add and remove participants, and host conversations between multiple wavesandbox, QWave, and console clients. When I restart my server the clients’ waves are loaded automatically.  When I open a wave, the conversation is brought to the latest version (even if the client was not online when the last updates were made).  I was pretty excited when the QWave client restored correctly, because I tested solely on the console client that is part of FedOne and had no knowledge of how QWave was implemented.  Many thanks to Torben Weis for his excellent work on QWave (and not just because it worked with my persistence).

I have NOT constructed or run any JUnit tests.  My persistence mechanisms are focused on storing information in the current implementation of FedOne and I have not expended any effort into correctly handling the commit-notice.  That is to say I persist information when I receive a wavelet-update as FedOne does not currently have an algorithm in place to store pending deltas in a separate container from applied deltas that have been acknowledged by a commit-notice (see this post for details).  This isn’t an attempt to blame FedOne, rather to say that is not where my focus has been. And as a side note, I’m hoping the Google team decides to do away with the notion of a separate commit-notice.

Why MongoDB

As defined on their website, MongoDB (from “humongous”) is a scalable, high-performance, open source, schema-free, document-oriented database.  I admittedly like the sound and thought of ‘MongoWave’ as in huge wave, but that was not part of the decision to go in that direction.  My primary reasons for using mongoDB were the flexibility of schema-free storage (no tables), a JSON-style syntax for storing and retrieving information (converted to and stored in BSON), embedded map-reduce, auto-sharding for scalability, good documentation, easy setup, drivers in many languages (Java, C++, Ruby, Python, Perl), and an active, expanding community.  I do not want to start a holy war here or engage in the “why didn’t you go with or have you tried <insert my favorite storage solution>?”.  Suffice it to say this is what I feel mongoDB offers me and that is why I am using it.

For more in-depth info on mongoDB, I recommend:

Design Goals

My overarching goals in building persistence into the FedOne solution were
  • Minimal disruption to code base.  FedOne is an open source project that is rapidly changing.  I did not want my code to be littered throughout such that it became tedious to re-implement persistence anytime a new release was made available.
  • Flexibility to modify what information I persist and where I persist/restore that information.  As FedOne matures, I expect there to be more information that needs to be persisted (e.g. - user management/access controls).  I also expect there to be refactoring due to optimizations, updates to the protocol, and other progress in general.  In consideration of this, I wanted to be able to easily modify the information I persist and update the location in code of my persistence mechanisms.
  • Loose coupling and clear separation of responsibilities.  Just good design practices here.
  • Ability to swap out different implementations of storage solution.  This applies at any level of the persistence package and is not solely related to the use of mongoDB.

Implementation

To achieve my design goals, I decided to incorporate the following tools and techniques:
  • Google Guice for dependency injection and Aspect Oriented Programming (AOP)
  • Separation of code into AOPInterceptor, AOPImplementation, PersistenceManager, and WaveStore
  • MongoDB/Schema-free database

I have previously written about Wave’s use of Guice.  Another capability of Guice outside of dependency injection is the ability to provide annotations on methods that will act as interceptors.  Whenever an annotated method is invoked, Guice will proxy the call and forward execution to a segment of code I have defined.  I don’t want to go over the benefits of AOP in general, so I’ll just point to the wiki article and the Guice AOP site.

The reason I used AOP was to avoid introducing a lot of cross-cutting concerns directly into the code base.  With the FedOne code base changing rapidly I did not want to go through great effort restoring my persistence every time a significant release was made.  By using AOP I can just annotate the method and all the work is done in a separate set of classes defined by me.  Lastly, I gain great flexibility using AOP.  If I decide at some point that it is more efficient to save or restore information at another location, I simply move my annotation and adjust my code for the new set of parameters (which may not be any different since I am probably still storing the same information).  In this way there is no refactoring of the main code base.  FedOne will continue to function without regard to my update.
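As a rough illustration of annotation-driven interception, the sketch below uses the JDK's java.lang.reflect.Proxy instead of Guice's interceptor support, so that it runs without extra libraries. The PersistDelta annotation, the DeltaContainer interface, and the handler are hypothetical names chosen for the example; they are not taken from FedOne or from the actual implementation.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical marker annotation: "persist the argument of this call".
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface PersistDelta {}

// Hypothetical stand-in for a container whose calls we want to intercept.
interface DeltaContainer {
    @PersistDelta
    void applyDelta(String delta);

    int size();
}

// The "core" code: it contains no persistence logic at all.
class InMemoryDeltaContainer implements DeltaContainer {
    private final List<String> deltas = new ArrayList<>();

    public void applyDelta(String delta) { deltas.add(delta); }

    public int size() { return deltas.size(); }
}

// The interceptor: runs around every call and acts only on annotated methods.
class PersistingHandler implements InvocationHandler {
    private final Object target;
    final List<String> persisted = new ArrayList<>(); // stands in for a real store

    PersistingHandler(Object target) { this.target = target; }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        Object result = method.invoke(target, args); // proceed with the real call
        if (method.isAnnotationPresent(PersistDelta.class)) {
            persisted.add((String) args[0]);         // cross-cutting concern lives here
        }
        return result;
    }
}

class InterceptionDemo {
    static DeltaContainer wrap(DeltaContainer target, PersistingHandler handler) {
        return (DeltaContainer) Proxy.newProxyInstance(
            DeltaContainer.class.getClassLoader(),
            new Class<?>[] {DeltaContainer.class},
            handler);
    }

    public static void main(String[] args) {
        InMemoryDeltaContainer target = new InMemoryDeltaContainer();
        PersistingHandler handler = new PersistingHandler(target);
        DeltaContainer container = wrap(target, handler);
        container.applyDelta("delta-1");
        System.out.println("applied=" + container.size()
            + " persisted=" + handler.persisted.size());
    }
}
```

Moving the annotation to a different method changes where persistence happens without touching InMemoryDeltaContainer, which is the property the paragraph above relies on.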

I provide an activity diagram below to show how the internals of my persistence solution interact.  What’s important at a high level is the way the components have been separated to allow for localized changes.  I am using Guice for dependency injection, so everything is written to interfaces with the implementations being bound in a separate Module class.  This allows me to easily swap one implementation for another in the case of optimizations, changes in course, testing, etc.  For example, my persistence modules will still invoke the WaveStore interface if I later decide to switch to CouchDB for a comparison.  I simply change WaveStoreMongo to WaveStoreCouch in my Guice binding without the need to update any of my other source code.

A breakdown of responsibilities of each component is as follows:
  • AOPInterceptor - invoked by Guice when a bound annotation is encountered. Responsible for interpreting the annotation and forwarding to the correct method of AOPImplementation.  This module only has knowledge of the annotation and the methods available on the AOPImplementation.  It has no knowledge of Wave structures and takes no action on them.
  • AOPImplementation - acts as a bridge between the method invocation passed from the interceptor and the PersistenceManager.  It will extract parameters, invoke the method, and enact the correct method within the PersistenceManager (not necessarily in that order).  It has no knowledge of how the information extracted is used and therefore no knowledge of the underlying WaveStore.
  • PersistenceManager - prepares the data passed in for storing or as a key for retrieving values from the WaveStore.  It has no knowledge of its caller (AOPImplementation) and has no upward dependency.  This means that I  can easily choose not to use AOP and instead make direct invocations within code by calling the PersistenceManager.  This class is responsible for understanding the wavelet data and its internal structures so that it can be stored efficiently.
  • WaveStore - acts as a store for serialized information.  It has no knowledge of Wave structures, AOP, or the PersistenceManager.  This keeps the interface simple and immune to changes in Wave structures.  Any serialization that needs to be done prior to storing information is the responsibility of the PersistenceManager.
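The separation between PersistenceManager and WaveStore might be sketched roughly as follows. This is a minimal illustration under stated assumptions: the method names (append, load, saveDelta, loadDeltas), the use of plain strings for deltas, and the in-memory store are all hypothetical, and a MongoDB-backed store would implement the same interface.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Knows nothing about Wave structures: only opaque keys and serialized bytes.
interface WaveStore {
    void append(String key, byte[] value);

    List<byte[]> load(String key);
}

// Hypothetical in-memory implementation, handy for tests; a WaveStoreMongo
// or WaveStoreCouch would be swapped in behind the same interface.
class InMemoryWaveStore implements WaveStore {
    private final Map<String, List<byte[]>> data = new HashMap<>();

    public void append(String key, byte[] value) {
        data.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public List<byte[]> load(String key) {
        return new ArrayList<>(data.getOrDefault(key, new ArrayList<>()));
    }
}

// Understands the (here heavily simplified) wavelet data and serializes it
// before handing it to the store. It has no dependency on its caller.
class PersistenceManager {
    private final WaveStore store;

    PersistenceManager(WaveStore store) { this.store = store; }

    void saveDelta(String waveletName, String delta) {
        store.append(waveletName, delta.getBytes(StandardCharsets.UTF_8));
    }

    List<String> loadDeltas(String waveletName) {
        List<String> deltas = new ArrayList<>();
        for (byte[] raw : store.load(waveletName)) {
            deltas.add(new String(raw, StandardCharsets.UTF_8));
        }
        return deltas;
    }
}

class LayeringDemo {
    public static void main(String[] args) {
        PersistenceManager manager = new PersistenceManager(new InMemoryWaveStore());
        manager.saveDelta("example.com/w+abc/conv+root", "add participant");
        manager.saveDelta("example.com/w+abc/conv+root", "insert text");
        System.out.println(manager.loadDeltas("example.com/w+abc/conv+root"));
    }
}
```

Because PersistenceManager receives its WaveStore through the constructor, it can be called directly without AOP, and the storage backend can change without touching the serialization logic.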

The following activity diagram illustrates a general interaction:

[Figure: activity diagram of a general interaction between the persistence components]

The following is a dependency graph produced by using GraphViz with Guice.  You will see the dependencies exist only on generic interfaces such that the implementations may be substituted at any level.

[Figure: Guice dependency graph of the persistence package]

Challenges

I did experience a couple of challenges while implementing persistence.  Recall that one of my goals was minimal disruption of the FedOne code base.  The framework I established allows for this by and large, but there were still a couple of changes that needed to be made to the FedOne core.

If you read my previous post on the topic of persistence you will recall that I suggested much of the work of storing and restoring wavelet data could be accomplished in the WaveletContainerImpl class.  Perhaps my largest hurdle in implementing persistence within my framework was intercepting the method annotations I placed in this class with Guice.  Guice AOP has a restriction that instances that support method interception must be created by Guice through an @Inject-annotated or no-argument constructor.  Both the LocalWaveletContainerImpl and RemoteWaveletContainerImpl (which both extend WaveletContainerImpl) are created by a Factory method in fedone.waveserver.WaveServerModule as:

[code]

// Binding in WaveServerModule: the factory itself is managed by Guice.
bind(LocalWaveletContainer.Factory.class).to(LocalWaveletContainerFactory.class)
        .in(Singleton.class);

// Hand-written factory: the container is created with "new", not by Guice.
private static class LocalWaveletContainerFactory implements LocalWaveletContainer.Factory {
    @Override
    public LocalWaveletContainer create(WaveletName waveletName) {
      return new LocalWaveletContainerImpl(waveletName);
    }
  }

// ...which invokes this constructor:
public LocalWaveletContainerImpl(WaveletName waveletName) {
    super(waveletName);
  }

// The remote container follows the same pattern.
bind(RemoteWaveletContainer.Factory.class).to(RemoteWaveletContainerFactory.class)
        .in(Singleton.class);

private static class RemoteWaveletContainerFactory implements RemoteWaveletContainer.Factory {
    @Override
    public RemoteWaveletContainer create(WaveletName waveletName) {
      return new RemoteWaveletContainerImpl(waveletName);
    }
  }

// ...which invokes this constructor:
public RemoteWaveletContainerImpl(WaveletName waveletName) {
    super(waveletName);
    state = State.LOADING;
  }

[/code]


Even though the Factory class is bound by Guice, you see that the actual LocalWaveletContainerImpl and RemoteWaveletContainerImpl are not created by Guice, thereby negating the possibility of AOP within these classes.  After a little bit of research I realized that Guice does provide support for creating objects through Factory methods via its @AssistedInject construct.  Using this method the code above became:

[code]

// Guice now generates the factory implementation and creates the container.
bind(LocalWaveletContainer.Factory.class).toProvider(FactoryProvider.newFactory(
     LocalWaveletContainer.Factory.class, LocalWaveletContainerImpl.class))
     .in(Singleton.class);

// ...which invokes this constructor; @Assisted marks the caller-supplied argument:
@Inject
  public LocalWaveletContainerImpl(@Assisted WaveletName waveletName)
  {
    super(waveletName);
  }

bind(RemoteWaveletContainer.Factory.class).toProvider(FactoryProvider.newFactory(
         RemoteWaveletContainer.Factory.class, RemoteWaveletContainerImpl.class))
         .in(Singleton.class);

// ...which invokes this constructor:
@Inject
  public RemoteWaveletContainerImpl(@Assisted WaveletName waveletName) {
    super(waveletName);
    state = State.LOADING;
  }

[/code]

As you can see we reduced the amount of boilerplate code by removing the private Factory classes that contained the create() method.  I did not have to modify the code that called the create method in WaveServerImpl [wc = localWaveletContainerFactory.create(waveletName);]. 

Of course, what is most important to me is that Guice now creates the RemoteWaveletContainerImpl and LocalWaveletContainerImpl classes thereby allowing me to intercept methods through AOP.  Not only does Guice now allow me to intercept the local and remote classes, but I can also add my annotations to the super class WaveletContainerImpl where the actual commit and restore of wavelet deltas occurs! Yes, I think this is super cool and one reason I am a huge fan of Guice.

Another issue was deciding exactly what type of information to persist.  I had to decide whether I wanted to store only the deltas or attempt also to store the substantive data (e.g. - WaveletData).  I made the decision to only store processual information such that when I load wavelet deltas, I must then apply the underlying operations to seed my data containers.  I am quite interested in knowing if Google also stores their data objects along with the deltas.
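To make the "store only the deltas, then replay them" idea concrete, here is a minimal sketch. The Op, Delta, and WaveletState classes are hypothetical simplifications and do not mirror FedOne's real types; the sketch also assumes that each operation advances the wavelet version by one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified operation: mutates the in-memory wavelet state.
interface Op {
    void applyTo(WaveletState state);
}

class AddParticipant implements Op {
    private final String who;
    AddParticipant(String who) { this.who = who; }
    public void applyTo(WaveletState state) { state.participants.add(who); }
}

class RemoveParticipant implements Op {
    private final String who;
    RemoveParticipant(String who) { this.who = who; }
    public void applyTo(WaveletState state) { state.participants.remove(who); }
}

class AppendText implements Op {
    private final String text;
    AppendText(String text) { this.text = text; }
    public void applyTo(WaveletState state) { state.text.append(text); }
}

// A delta is the only thing that gets persisted: a start version plus operations.
class Delta {
    final long startVersion;
    final List<Op> ops;
    Delta(long startVersion, Op... ops) {
        this.startVersion = startVersion;
        this.ops = Arrays.asList(ops);
    }
}

// The substantive data is never stored; it is rebuilt by replaying deltas.
class WaveletState {
    final Set<String> participants = new LinkedHashSet<>();
    final StringBuilder text = new StringBuilder();
    long version = 0;

    void apply(Delta delta) {
        if (delta.startVersion != version) {
            throw new IllegalStateException(
                "expected version " + version + " but delta starts at " + delta.startVersion);
        }
        for (Op op : delta.ops) {
            op.applyTo(this);
        }
        version += delta.ops.size(); // assumption: one version step per operation
    }

    static WaveletState replay(List<Delta> storedDeltas) {
        WaveletState state = new WaveletState();
        for (Delta delta : storedDeltas) {
            state.apply(delta);
        }
        return state;
    }
}

class ReplayDemo {
    public static void main(String[] args) {
        List<Delta> stored = new ArrayList<>();
        stored.add(new Delta(0, new AddParticipant("a@example.com"), new AppendText("hello")));
        stored.add(new Delta(2, new AddParticipant("b@example.com")));
        WaveletState state = WaveletState.replay(stored);
        System.out.println(state.participants + " v" + state.version);
    }
}
```

The trade-off is the one described above: storage stays small and simple, but loading a wavelet costs a full replay of its history.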

The last significant roadblock was determining all the sections of code that implicitly functioned without regard to persistence.  For example, the fedone.waveserver.ClientFrontendImpl.PerWavelet and fedone.waveserver.UserManager are in-memory objects used by the ClientFrontendImpl to keep track of the index wave. When the server is restarted these objects are empty.  One issue that arises from this is if a remote participant updates a wave hosted by our newly started wave server the delta will be applied and the callback ClientFrontendImpl.waveletUpdate() will be invoked.  An IllegalStateException will now occur because the expected version that is pulled from the perWavelet object will have a version of 0 while the start version of the delta sequence will be non-zero.  From this point the local participants will have a different version from the latest version and will never be able to update the wave again.
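The failure mode can be reproduced with a small sketch. VersionTracker below is a hypothetical stand-in for the in-memory version bookkeeping, not the real ClientFrontendImpl.PerWavelet class, and it assumes that a delta advances the version by its number of operations. It shows why an empty tracker rejects the first update after a restart, and how seeding it from persisted data avoids the exception.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for in-memory, per-wavelet version bookkeeping.
class VersionTracker {
    private final Map<String, Long> currentVersion = new HashMap<>();

    // Restore the last known version, e.g. from the persistent store.
    void seed(String waveletName, long version) {
        currentVersion.put(waveletName, version);
    }

    // Called when a delta sequence arrives for a wavelet.
    void onUpdate(String waveletName, long startVersion, int opCount) {
        long expected = currentVersion.getOrDefault(waveletName, 0L);
        if (expected != startVersion) {
            throw new IllegalStateException(
                "expected version " + expected + " but update starts at " + startVersion);
        }
        currentVersion.put(waveletName, startVersion + opCount);
    }

    long versionOf(String waveletName) {
        return currentVersion.getOrDefault(waveletName, 0L);
    }
}

class RestartDemo {
    public static void main(String[] args) {
        // After a restart the tracker is empty, so it expects version 0.
        VersionTracker afterRestart = new VersionTracker();
        try {
            afterRestart.onUpdate("wavelet-1", 5, 2);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // Seeding from persisted data before handling the update avoids the mismatch.
        VersionTracker restored = new VersionTracker();
        restored.seed("wavelet-1", 5);
        restored.onUpdate("wavelet-1", 5, 2);
        System.out.println("accepted, now at version " + restored.versionOf("wavelet-1"));
    }
}
```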

I don’t want to delve too far into this topic, but I have serious reservations about the way the index wave is handled, that there are no recovery mechanisms if versions are ever out of sync, and that in the current implementation of FedOne participants must specify an end version (even though the spec indicates it is optional) negating the possibility of restoring with the latest deltas.

Next Steps

My goal was to establish a flexible framework for persistence just as much as it was to actually have the ability to store and retrieve information in my wave server.  I have many questions regarding the most efficient storage of wavelets and I do not claim that I have chosen the optimal places to store/retrieve information.  My hope is that through the framework any modifications I need to make will be clearly defined and concise.  For now, I want to do more testing and get feedback from Google and the community on my strategy for persistence.


10 Responses to “MongoWave: Persistence on Google FedOne Wave Server with mongoDB”

  1. glvn.li

    03. Mar, 2010

    it’s great.

    “When the server is restarted these objects are empty. One issue that arises from this is if a remote participant updates a wave hosted by our newly started wave server the delta will be applied and the callback ClientFrontendImpl.waveletUpdate() will be invoked. An IllegalStateException ”

    :(

    I have also encountered.
    It seems there is no way at present.

    • Anthony

      03. Mar, 2010

      Thanks for your comment. To “solve” this issue I had to check the persistent store upon receiving a waveletUpdate() to pull all participants for the wavelet and restore the perWavelet and perUser data for those users. This will keep the index wave for all participants in sync and avoid the IllegalStateException. This is only done once per wave and only when the wave is updated (i.e. - I do not pull information from all waves ever submitted to server upon a restart).

  2. glvn.li

    03. Mar, 2010

    Do you mean persistent wave index in waveletUpdate.

    Whether it has been completed and look forward to your good news.

    • Anthony

      03. Mar, 2010

      Sorry for the confusion. When I said persistent store I was referring to pulling information from the mongodb database. The information I am pulling is based from the index wave.

      I have completed the initial version. :) My company will be releasing it to the community for a code review soon. I look forward to your input.

  3. glvn.li

    03. Mar, 2010

    I understand that you have completed persistence (including the wave list).

    :) It is very useful to me.

    I would like to hear that you have to solve the problem when the wave server restart

  4. glen

    25. Mar, 2010

    After reboot, log back in the console, why did not the title of the list?
    I have to re-enter the console after a new message.
    The title was displayed.

  5. glen li

    09. Jun, 2010

    The current state feedback:

    The new open-source can not be used in wave-protocol-io2010.
    The Patch Set 7 code can not compile for io2010.
    They have a lot of changes.
    wave io2010 change so that we do have a new patch.
    :)

    • antwatkins

      09. Jun, 2010

      Hi Glen,

      I am working on an updated patch as well as getting my head wrapped around all of the new code. It is a bunch of new code, but I don’t think most of the changes affect persistence.

  6. glen

    11. Jun, 2010

    that’s great.
    Persistence layer structure has not changed.
    We update the corresponding new changes for tip:dff264eac2.
