Tips For Working In The Database - Tip 1. Always Write It Down

Every now and then I have to shift my entire workflow from my standard day-to-day programming to focusing entirely on database queries and performance monitoring. This can be a welcome break from scratching my head over how to apply an alpha change to the parent of a composite object, without affecting any of the children, in a reasonable amount of time. However, each time I make the switch I have to shake off the cobwebs. Any skill you aren't actively practicing will slowly move to the back of the line in your brain. For me that's SQL and HTML. Fortunately, the amount of SQL I write, albeit not nearly as much as my regular game programming, far exceeds the amount of HTML I write, because HTML is the bane of my existence. Whenever I have to get in there and start writing queries, it's because we suddenly have a large amount of data to look at and we don't know exactly what we're looking for.

What I mean by that is we might start by running a query to see how many users did such and such. Based on those results we might or might not be surprised; if we are, we then have to run more queries and sift through the data to explain what is going on, which can very quickly become an iterative process. It can also become painful quickly, because many tables get created on the fly, and with the sheer number of queries I'm running I forget how I created them.

At the beginning of the year we had to analyze upwards of 40,000,000 records for a very important client of ours. This was my first time dealing with an amount of data so large that no amount of schooling (at least none that I've taken) could have prepared me for it. The classes I took focused on how to write a query, not on what to do when the query times out. Issues like that can be avoided with a good schema, proper indexing, etc. Sometimes, however, you simply need to process every single record, and then the job becomes dependent on how good the hardware is. Our schemas were designed to reflect how the data was handed to us and only included the information from each file that we cared about. They were NOT designed for analyzing the interesting trends each column could produce. A table should only have a handful of indexes; otherwise the size of the database can grow very rapidly, and writes take longer since the database has to update every index for each new entry.
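As a sketch of that tradeoff (all table and column names here are made up for illustration), a schema that mirrors the input files might index only the column the original job filters on, since every extra index is paid for on every insert:

```sql
-- Hypothetical schema mirroring the input files: only the column the
-- original job filters on gets a secondary index. Each additional
-- index would grow the database and slow down every one of the
-- 40,000,000 inserts, because the engine must update all indexes
-- for each new row.
CREATE TABLE person (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    role      VARCHAR(32)  NOT NULL,   -- teacher, student, lunch lady, ...
    school_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (id),
    KEY idx_role (role)                -- the one lookup we planned for
);
```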

The problem happens when you need to start looking at the data based on one of those other columns, and the data needs to be grouped and sorted. First off, GROUP BY is a wonderful SQL clause for counting how many records there are for a particular piece of data you are searching on, such as how many teachers, students, and lunch ladies there are in the Person table. Where it isn't wonderful is when you're working with those 40,000,000 records and the query fails because the sort step exceeded the available memory, after you've already waited an hour for the query to finish processing.
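For example (assuming a hypothetical person table with a role column), the counting query would look something like this:

```sql
-- Count how many people hold each role. Harmless on a small table;
-- on tens of millions of rows the grouping and the ORDER BY can
-- exhaust the memory available for the sort.
SELECT role, COUNT(*) AS num_people
FROM person
GROUP BY role
ORDER BY num_people DESC;
```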

This brings me to the first tip... 

Always write down what you are doing and why you are doing it 

To solve the aforementioned issue you can create new tables, insert parts of the processed data into them, and run further processing against those. As you do this, your database starts filling up with temporary tables, and you forget what they were for or what data was used to populate them. Picking a good name helps, but the name you choose will only make sense at the time you come up with it. You might then have to create a second table with a slightly different name that fulfills a slightly different purpose but is otherwise nearly identical to the first. Now you are confused a day later because there are two similarly named tables and you can't remember which is which. Making sense of any of this is especially hard after running the queries on a Friday and coming back to work on a Monday.
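One habit that helps, sketched here with made-up names: keep a scratch file of every intermediate table you create, and record the date, the source data, and the purpose right next to the CREATE statement.

```sql
-- 2024-01-12: users who completed the tutorial, pulled from the raw
-- events table; base population for the follow-up retention queries.
-- (Table and column names are hypothetical.)
CREATE TABLE tmp_tutorial_users AS
SELECT DISTINCT user_id
FROM events
WHERE event_name = 'tutorial_complete';
```

A day (or a weekend) later, the comment answers exactly the two questions you'll have forgotten: which query populated the table, and why it exists.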

Even if you're not making many tables, you might find out that the final set of data you get is wrong. Yesterday I finished producing data from some queries that I thought looked good, yet today, when the data was reviewed, I was told it didn't make sense, and after looking at the numbers I reported I tended to agree. It was perplexing, so I looked at some secondary tables I had created to partition the data I cared about, and those numbers weren't entirely clear to me either; since I couldn't remember exactly where the data originally came from, I ended up starting over. Luckily the data I produced today did make sense, and I felt very good about it.

It also helps to write down the queries you used to generate particular sets of data, so that you can understand what you did in the event the data doesn't look correct. That way you can analyze the query and identify any issues. Today's particular issue turned out to be that the UPPER and LOWER functions are useless for checking whether the characters in a column are upper case or lower case if the collation ends in "ci", meaning "case insensitive". Nobody explained this to me in class. The entire time I've been using latin1_swedish_ci just because the engineer I learned from was using it. If case sensitivity matters, the way to go is a collation that ends in "cs", meaning "case sensitive", such as latin1_general_cs. This is a huge deal in terms of IDFAs and GAIDs, but that's not for today.
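To make the pitfall concrete (the ids table and device_id column here are hypothetical), this is the kind of check that silently lies under a "_ci" collation, and a way around it in MySQL:

```sql
-- Under a case-insensitive collation like latin1_swedish_ci,
-- 'abc' = 'ABC' compares as equal, so this matches EVERY row
-- regardless of the actual casing stored in device_id:
SELECT COUNT(*) FROM ids WHERE device_id = UPPER(device_id);

-- To actually test the case of the stored characters, force a
-- case-sensitive comparison:
SELECT COUNT(*) FROM ids
WHERE device_id COLLATE latin1_general_cs = UPPER(device_id);

-- Comparing as raw bytes also works:
-- WHERE BINARY device_id = UPPER(device_id)
```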

In my experience with databases, there are many tips I have to constantly reiterate to myself, so I'll probably write many more of these kinds of posts; hence the "Tip 1" part of the title. I'm already thinking of another, like "reboot the RDS if an unexpectedly long job is processing and you're nearing your deadline, because you should probably try something else..." but that might just be a personal issue and not worth writing about.
