RavenDB: dealing with stale data

One of the hardest concepts to take on board when dealing with a document database like RavenDB is the concept of stale data.

First of all, a couple of facts:

RavenDB is a transactional, ACID database (period);
we are deeply used to staleness (even if we do not realize it): every single piece of information we “see” is stale by design, it is a picture of the past.

As we have already seen, RavenDB needs an index in order to query data. So, in order to look for a person document with a query such as the following:

var query = from person in docs.People
            where person.FirstName == "Mauro" && person.Address.City.StartsWith( "f" )
            select person;

we cannot simply issue it (well, we can, because RavenDB has the dynamic index feature that automatically creates an index for us under the hood): we first need to define an index that tells the database engine where we want to search.

The index definition is expressed using the LINQ query syntax, in which we say:

docs.People: use all the documents from the “People” collection;
FirstName = doc.FirstName: index the FirstName field on each document;
Address_City = doc.Address.City: index the nested field City on each nested Address document, the “_” is the convention to map nested fields;
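The three points above can be sketched in plain C#. This is not the RavenDB index API: it is a LINQ-to-objects approximation, reconstructed only from the field names listed above, that shows the flat entries such a map would produce and how the earlier query runs against them. The Person and Address classes, the sample documents, and the case-insensitive prefix match are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical document shapes, inferred from the field names used above.
public class Address { public string City { get; set; } }

public class Person
{
    public string FirstName { get; set; }
    public Address Address { get; set; }
}

public static class IndexMapSketch
{
    // Projects each document into a flat index entry, then searches the
    // entries (not the documents) the way the query above would.
    public static int CountMatches(IEnumerable<Person> people)
    {
        var entries = from doc in people
                      select new
                      {
                          FirstName = doc.FirstName,
                          Address_City = doc.Address.City // nested field, flattened with "_"
                      };

        // Case-insensitive prefix match is a choice of this sketch.
        return entries.Count(e =>
            e.FirstName == "Mauro" &&
            e.Address_City.StartsWith("f", StringComparison.OrdinalIgnoreCase));
    }

    public static void Main()
    {
        // Made-up sample documents.
        var people = new List<Person>
        {
            new Person { FirstName = "Mauro", Address = new Address { City = "Florence" } },
            new Person { FirstName = "Anna", Address = new Address { City = "Milan" } }
        };

        Console.WriteLine($"matches: {CountMatches(people)}"); // prints "matches: 1"
    }
}
```

The point of the projection is that the search never touches the documents themselves, only the entries derived from them, which is why the entries have to exist before a query can be answered.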

Now, if we issue the above query from the Studio using the Lucene syntax, we get the expected results.

Behind the scenes RavenDB has indexed all the documents in the People collection, using the LINQ query we designed as a blueprint.

Staleness

“…Behind the scenes RavenDB has indexed…” means that the indexing process is performed, at least currently, asynchronously on a different thread. So when a new document is inserted, or an existing one is changed, the indexing services are triggered to rebuild the index.

Since the process is asynchronous, queries issued while the server is re-indexing are stale by design, meaning that the data returned by the query can be inconsistent with the expected state:

a document has been changed;
the changed document should be included in the query results;
…but since the indexing process has not yet completed the document is not included in the results;
the query is stale (and we have an API we can use to discover the query status);

that’s it
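The sequence above can be reproduced with a toy model. The sketch below is not the RavenDB client API: ToyStore, its Store, Exists and Query methods, and the IsStale flag are hypothetical names invented for this example. It only shows how a write that is stored immediately, combined with an index updated on another thread, yields a query that misses the document and reports itself as stale. A ManualResetEventSlim stands in for the time indexing takes, so the outcome is deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Toy model, not the RavenDB API: documents are stored at once,
// the index is updated later on another thread.
public class ToyStore
{
    private readonly object sync = new object();
    private readonly Dictionary<string, string> documents = new Dictionary<string, string>();
    private readonly Dictionary<string, string> index = new Dictionary<string, string>();
    private int pendingWrites; // stored but not yet indexed

    // 'indexerMayRun' stands in for the time indexing takes: the background
    // task does not touch the index until the signal is set.
    public Task Store(string id, string firstName, ManualResetEventSlim indexerMayRun)
    {
        lock (sync) { documents[id] = firstName; pendingWrites++; }
        return Task.Run(() =>
        {
            indexerMayRun.Wait();
            lock (sync) { index[id] = firstName; pendingWrites--; }
        });
    }

    // Reads the documents directly, so it sees the write at once.
    public bool Exists(string id)
    {
        lock (sync) { return documents.ContainsKey(id); }
    }

    // Reads the index only and reports whether it is behind the writes.
    public (List<string> Ids, bool IsStale) Query(string firstName)
    {
        lock (sync)
        {
            var ids = index.Where(e => e.Value == firstName).Select(e => e.Key).ToList();
            return (ids, pendingWrites > 0);
        }
    }
}

public static class StalenessSketch
{
    public static void Main()
    {
        var store = new ToyStore();
        var indexerMayRun = new ManualResetEventSlim(false);
        Task indexing = store.Store("people/1", "Mauro", indexerMayRun);

        bool stored = store.Exists("people/1");
        var early = store.Query("Mauro");
        Console.WriteLine($"stored: {stored}");                                         // stored: True
        Console.WriteLine($"early: {early.Ids.Count} hit(s), stale: {early.IsStale}"); // early: 0 hit(s), stale: True

        indexerMayRun.Set(); // let the indexer finish
        indexing.Wait();

        var late = store.Query("Mauro");
        Console.WriteLine($"late: {late.Ids.Count} hit(s), stale: {late.IsStale}");    // late: 1 hit(s), stale: False
    }
}
```

In this model the write itself is never lost or delayed; only the view that queries read from lags behind it, and the staleness flag is what lets the caller tell the two situations apart.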

Is it a problem? Well…no :-) Take a report: it is stale by design, and queries in general are stale by design. The problem, the real problem, is that we have convinced the user that things work like this:

search for some data;
did not find what you were looking for?
insert the new data;
refresh the search;
the new data is immediately there…

grrrr….mainly because, in order to simplify our job, we started to use transactions to lock the user in the process…blocking the calling process. Now think of a distributed system where everything is asynchronous: our application issues an insert request (even to a relational database) and its process is immediately free to move on, without any blocking construct…the query issued immediately after the insert is not guaranteed to retrieve the previously inserted data, it is only “likely” to retrieve it.

The solution is not a technical solution, the solution is a User eXperience solution. The technical problem is how long an index takes to get ready…we’ll see :-)