Friday, April 20, 2012

Dealing with a lot of data: the problem

where “a lot” means 20+ million “records/documents”.
Dealing with a lot of data is usually a problem, but the interesting thing is that the main problem is not a technical one. It is obvious that none of the architectural choices we are used to are guaranteed to work, that generalization is much more evil than it usually is, and that every single technical decision must be made with performance in mind, because with those numbers low performance usually means that nothing works at all.
But a much more interesting problem is the daily development workflow, dealing with those numbers means that:
  • the size on disk of those data is something like 100 GB;
  • generating test data takes a long time: hours;
  • having more than one test environment on the same machine is a problem;
  • changing the shape of the data to test a different scenario takes ages;
  • changing the data indexing strategy takes a long time;
All of these problems, once faced, become key drivers in the design of any application that must deal with that much data.
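A quick back-of-envelope calculation shows why even the raw I/O makes test-data generation slow. The record count and on-disk size come from the numbers above; the average record size is derived from them, and the 50 MB/s sustained write throughput is an illustrative assumption for a 2012-era spinning disk:

```python
# Back-of-envelope sketch of why generating test data takes hours.
# RECORDS and TOTAL_BYTES come from the post; the write throughput
# below is an illustrative assumption, not a measured figure.

RECORDS = 20_000_000          # "20+ million records/documents"
TOTAL_BYTES = 100 * 1024**3   # "something like 100 GB" on disk

avg_record_size = TOTAL_BYTES / RECORDS   # ~5.2 KB per record
write_throughput = 50 * 1024**2           # assume 50 MB/s sustained writes

io_seconds = TOTAL_BYTES / write_throughput
print(f"avg record size: {avg_record_size / 1024:.1f} KB")
print(f"raw write time:  {io_seconds / 60:.0f} minutes")  # ~34 minutes
```

Half an hour is the floor set by disk bandwidth alone; add record generation, serialization, and index building on top of that and "hours" per dataset is the realistic figure.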

1 comment:

  1. 100GB of data means that you are near the limit of an entry-level SSD, which is quite a risk, because using an SSD as storage can save a lot of time when running tests based on that data.

    I'll add other problems: backing up the data is usually limited by the bandwidth to the NAS, restoring a backup is a long process, etc. etc.

    working with a big amount of data sucks :D