where “a lot” means 20+ million records/documents.
Dealing with a lot of data is usually a problem, but the interesting thing is that the main problem is not a technical one. It is obvious that none of the architectural choices we are used to are guaranteed to work at this scale, that generalization is far more harmful than it usually is, and that every single technical choice must be made with performance in mind, because at these numbers low performance generally means that nothing works at all.
But a much more interesting problem is the daily development workflow: dealing with those numbers means that:
  • the size on disk of that data is something like 100 GB;
  • generating test data takes a long time: hours, not minutes;
  • having more than one test environment on the same machine is a problem;
  • changing the shape of the data to test a different scenario takes ages;
  • changing the data indexing strategy takes a long time.
All of these problems, once faced, become key drivers in the design of any application that must deal with data at this scale.
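Since generating the full dataset takes hours, one common mitigation is to generate a small, seeded sample of synthetic records for everyday development and reserve the full 20M+ run for dedicated environments. A minimal sketch in Python (the record fields and function names are hypothetical, just to illustrate the idea):

```python
import json
import random

def make_records(n, seed=42):
    """Generate n synthetic records with a fixed seed so runs are reproducible."""
    rng = random.Random(seed)
    return [
        {
            "id": i,
            "category": rng.choice(["a", "b", "c"]),
            "value": rng.randint(0, 1_000_000),
        }
        for i in range(n)
    ]

def write_jsonl(records, path):
    """Write records as JSON Lines, one document per line."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

if __name__ == "__main__":
    # In development, scale down by three orders of magnitude:
    # the real dataset would be n=20_000_000, a smoke test only n=20_000.
    write_jsonl(make_records(20_000), "test_data.jsonl")
```

The fixed seed matters as much as the scale factor: it makes the small dataset reproducible, so a scenario that fails today can be regenerated identically tomorrow.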