Some months ago, I got involved in a project where I needed to generate quite big reports (more than 1 million rows), extracted mainly from an SQL table that was growing at a very fast pace. This table played a central role in the everyday use of the system.
To avoid losing performance, we scheduled a database backup and a process to empty that large, troublesome table every 3 months. But the problem with the reports remained: at some point we ended up with a table of more than 3 million rows, and sometimes we had to generate reports with more than 1 million rows. Even when the report was small, querying this table took way too much time (sometimes more than 10 minutes).
Here is what we did. First, we decided to design a kind of ETL process using only SQL (somebody will argue that this is not ETL, but I think it is). How? We developed a module that executes SQL scripts, defined at run time, every day at 3:00 am.
These scripts were basically “INSERT INTO” statements: they took data from some tables and inserted the result into another table. The data in the destination table was more suitable for reporting, so we gained time when generating the reports by moving the processing (mostly expensive JOINs) from working hours to 3:00 am, when nobody, not even the CEO, was using the system. Reports became way faster (around 20 seconds for the biggest one).
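To give an idea of what such a script might look like, here is a minimal sketch in PostgreSQL-flavored SQL. The table and column names (orders, customers, products, report_sales) are made up for illustration, and the incremental “previous day only” filter is just one possible approach; the point is the INSERT INTO … SELECT shape that pays for the JOINs at 3:00 am instead of at report time.

```sql
-- Flattened reporting table (hypothetical schema), created once.
-- No JOINs are needed when reports read from it.
CREATE TABLE IF NOT EXISTS report_sales (
    order_id      INT            NOT NULL,
    order_date    TIMESTAMP      NOT NULL,
    customer_name VARCHAR(200),
    product_name  VARCHAR(200),
    quantity      INT,
    total_amount  DECIMAL(12, 2)
);

-- Nightly load run at 3:00 am: move the previous day's rows,
-- doing the expensive JOINs outside working hours.
INSERT INTO report_sales (order_id, order_date, customer_name,
                          product_name, quantity, total_amount)
SELECT o.id,
       o.created_at,
       c.name,
       p.name,
       o.quantity,
       o.quantity * p.unit_price
FROM   orders    o
JOIN   customers c ON c.id = o.customer_id
JOIN   products  p ON p.id = o.product_id
WHERE  o.created_at >= CURRENT_DATE - INTERVAL '1 day'
AND    o.created_at <  CURRENT_DATE;
```

Whether each run moves only the previous day's rows or rebuilds the whole reporting table depends on volumes; the filter above simply keeps each nightly run small.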
A couple of months after going to production with this scheme, we faced another problem. The destination table, which was supposed to be more suitable for reporting, became a problem as well: as you may guess, it was too big. We were back in our initial situation. The solution: split that large data set into smaller ones. How? Every row in that table had a timestamp, so we divided the data into semesters (six-month periods), creating one table per semester. The script executor module was modified to put the data into the correct table according to the current date, and the reporting application was updated to let users select the semester (so the app could query the correct table).
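As a sketch of how that split might look (again with made-up names and PostgreSQL syntax; the real naming scheme and executor logic were the project's own), the per-semester tables and the routing could take this form:

```sql
-- One copy of the reporting table per six-month period, e.g.
-- report_sales_2015_h1 (Jan-Jun) and report_sales_2015_h2 (Jul-Dec).
CREATE TABLE report_sales_2015_h1 (LIKE report_sales INCLUDING ALL);
CREATE TABLE report_sales_2015_h2 (LIKE report_sales INCLUDING ALL);

-- The script executor picks the target table from the current date,
-- so a nightly run in September 2015 writes into the h2 table.
INSERT INTO report_sales_2015_h2 (order_id, order_date, customer_name,
                                  product_name, quantity, total_amount)
SELECT o.id, o.created_at, c.name, p.name,
       o.quantity, o.quantity * p.unit_price
FROM   orders    o
JOIN   customers c ON c.id = o.customer_id
JOIN   products  p ON p.id = o.product_id
WHERE  o.created_at >= CURRENT_DATE - INTERVAL '1 day'
AND    o.created_at <  CURRENT_DATE;

-- The reporting app maps the semester the user selects to a table name,
-- so each report scans at most six months of data.
SELECT customer_name, SUM(total_amount) AS total_sales
FROM   report_sales_2015_h1
GROUP  BY customer_name;
```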
At the time of writing, reports are pretty fast. This strategy bought us some time to think about implementing a Big Data solution.