At some point in your data science career you will have to move away from CSV files handed to you by the operations department. Doing data science analysis directly on a production database may sound daunting, but the simple recipe introduced in this tutorial will show you how to get started.
If you prefer to learn with a video tutorial you can check out my video version of this article over here: Data Science on Production Database.
The setup is minimal, composed of only seven steps.
- 1. Setup Read-Replica
- 2. Setup Secure Version Control
- 3. Investigate DB Visually
- 4. Ask Lots of Questions
- 5. Start Simple
- 6. Get Partial Result Out ASAP
- 7. Write a Report as You Go
It is meant to be followed iteratively, cycling through steps 3 to 7. Let’s jump into the first and most important step of all…
1. Setup Read Replica Database
The very first thing you should aim for is securing access to the data source. Working on a data science project is already difficult, slow and error prone. If you have to jump through hoops every time you need to access data, it will put a serious dent in your productivity. However, this shouldn’t come at the expense of your production database. Accessing the production database directly for data science purposes is highly discouraged, for the following reasons:
- It is a security risk. If you are working directly with the production database, it means you have the credentials to access it remotely. This is problematic, because if you leak these credentials, someone will be able to read from and write to the database. The problem with writing is that it can cause serious havoc on the platform, and it can be difficult to track down the source of the problem. You can also introduce changes in the database yourself while working with it, which can cause varying amounts of problems for the product team.
- It increases the load on the production database. If you are accessing data inside a database, you are making requests to it to serve you that data, which increases its load. This might not be much of a problem if the database is small and you are requesting only a few data points, but this way of working doesn’t scale. If multiple data scientists do the same thing, the pressure on the database will grow over time and cause a load that could easily be avoided altogether.
A read-replica of your production database solves a few of these pain points!
A read replica of a production database is a clone of it that can only be read from. It is not possible to write to a read replica, hence the name. The benefit of a read replica for data science purposes is that you get access to fresh data almost instantly, while avoiding stressing the production database with too many read requests. Furthermore, since you only have read access, there is simply no way to corrupt the state of the database, which is one security risk fewer.
Here is a list of guides for setting up a read replica with the three major cloud providers:
- AWS: https://www.youtube.com/watch?v=COsx7UrMGL4
- Google Cloud: https://cloud.google.com/sql/docs/mysql/replication/create-replica
- Azure: https://docs.microsoft.com/en-us/azure/postgresql/concepts-read-replicas
If you know other useful tutorials for setting up a read replica in other contexts, don’t hesitate to post them in the comment section and I’ll add them to the list!
However, the risk of leaking customer data is still real, and a simple process to mitigate it will be discussed next.
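Once the replica is up, it is worth sanity-checking that the connection you hand to your analysis really points at a standby and not at the primary. As a minimal sketch (assuming a PostgreSQL-compatible replica and any DB-API connection, e.g. from psycopg2), the built-in `pg_is_in_recovery()` function returns true on a read replica:

```python
def is_read_replica(conn):
    """Return True if the connected PostgreSQL server is a standby.

    `conn` is any DB-API connection, e.g. one returned by
    psycopg2.connect(...). pg_is_in_recovery() is true on a
    read replica and false on the primary.
    """
    with conn.cursor() as cur:
        cur.execute("SELECT pg_is_in_recovery();")
        return cur.fetchone()[0]
```

Running this check at the start of a notebook is a cheap guard against accidentally pointing your analysis at the production primary.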
2. Setup Secure Version Control
A version control system is a must when working with anything that changes over time and that you may need to recover at some point: text, code or data analysis. Setting yourself up from the start with solid tracking of your analysis over time is 100% worthwhile.
However, remember that your analysis needs the credentials for the read-replica database in order to work. This is problematic, because once the credentials are committed to the VCS, they will be visible in the history to anyone with access to your remote git repository. If someone gets access to the remote git repo, the data in your production database is automatically compromised.
This seems like a thorny problem: either you push your whole analysis to the remote git repo and increase the attack surface, or you keep your analysis off the remote git repo and risk losing it.
This is a solved problem in software engineering, especially in web development. The solution makes use of a .gitignore file, a .env file and a decoupling library to separate the code that will be sent to the remote repo from the secrets that should stay on your computer.
Add a .gitignore:
The very first thing you should set up after creating the repository for your analysis is a solid .gitignore file. To avoid forgetting to exclude a file for a particular analysis, I always start with a .gitignore generator like gitignore.io.
This will generate a nice .gitignore file which will exclude virtualenv directories, common names for .env files and other files that should stay on the local development machine. Once you have that .gitignore, add it to your project at the top level.
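As a rough illustration (the generated file will be far more complete), the entries that matter most for this setup look like this:

```
# Secrets — never commit these
.env

# Virtual environments
venv/
.venv/

# Jupyter checkpoints
.ipynb_checkpoints/

# Python bytecode
__pycache__/
*.py[cod]
```

The `.env` line is the critical one for the rest of this section.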
Create a .env file:
This file will contain the secrets you do not want anyone to be able to access in your git repository. In the context of this tutorial, it includes the variables used to access your read-replica database:
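The exact variable names are up to you; the ones below are illustrative placeholders for a Redshift-style connection, not real credentials:

```
# .env — database credentials, kept out of version control
DB_HOST=my-read-replica.example.us-east-1.redshift.amazonaws.com
DB_PORT=5439
DB_NAME=analytics
DB_USER=data_scientist
DB_PASSWORD=replace-me
```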
The .env file shown above is for an Amazon Redshift database on AWS, but other cloud providers should follow a similar structure, as the databases are usually similar (i.e. PostgreSQL or MySQL).
Add this .env file at the root level of your project, right next to your .gitignore file. Since you’ve already created the .gitignore, your IDE should now flag the file as not committable.
Load secret into your code using a decoupling library:
Depending on the programming language you are using, you will have different options here. In Python, a great library to use is python-decouple. It is simple to install in any Python project and very easy to use.
First install it using either conda or pip (don’t forget to activate your virtual environment first):
pip install python-decouple
Then you just need to import the right function from decouple:
from decouple import config
Finally, you can use it to collect all the secret variables sitting in the .env file.
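With python-decouple you would simply call `config("DB_HOST")`, `config("DB_PORT", cast=int)` and so on. To make the mechanism concrete, here is a minimal stdlib sketch of what such a decoupling library does under the hood (this is an illustration, not decouple’s actual implementation):

```python
import os

def config(key, default=None, cast=str, env_path=".env"):
    """Minimal sketch of a decoupling library's `config` function:
    look the key up in a .env file, fall back to os.environ,
    then to an optional default."""
    values = {}
    if os.path.exists(env_path):
        with open(env_path) as f:
            for line in f:
                line = line.strip()
                # Skip blanks and comments; parse KEY=VALUE pairs
                if line and not line.startswith("#") and "=" in line:
                    k, _, v = line.partition("=")
                    values[k.strip()] = v.strip()
    raw = values.get(key, os.environ.get(key, default))
    return cast(raw) if raw is not None else None

# Usage, mirroring python-decouple's API:
# DB_HOST = config("DB_HOST")
# DB_PORT = config("DB_PORT", cast=int)
```

In practice, just use the real library; this sketch only shows why the secrets never need to appear in your committed code.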
It’s as simple as that! Now you will be able to access the database without having to worry about accidentally committing secrets to the remote repository! If someone wants to work with you on the project, you will only need to send them the .env file over a secure channel of communication, and voilà 🎊!
You are now all set up and ready to start analyzing!
3. Investigate DB Visually
Once you have access to the database, the natural tendency is to start working on the analysis and write some code to explore the data. It’s not a bad thing to do per se, but it is still premature at this point in the life cycle of the project.
Working very hard and smart on the wrong problem is wasteful. You need to make 100% sure that wherever you are going with your analysis, it’s in the right direction.
To do so, you need to look at the data with as much flexibility as you can. Thankfully, SQL clients are readily available tools for this job and simple enough to set up and use.
A great option is the community version of DBeaver, a free SQL client that lets you navigate and explore many kinds of databases.
After setting up the connection to the read replica, check out the data and try to pinpoint the tables that will be relevant for your analysis. Once you’ve noted down a few of them, check how many data points you have, what kind of columns you can play with, what values these columns hold, and anything that seems out of the ordinary. Note down what you do and don’t understand about the database. This will be very useful for the next step, which is to ask LOTS of questions.
4. Ask Lots of Questions
If there is one step to emphasize heavily, it is this one. Most of the problems and time sinks in a data science project stem from miscommunication: something crucial wasn’t communicated to the data scientist, or a stakeholder thought the analysis was going in one direction while it went completely the opposite way.
Therefore, you should take your time to ask all the people relevant to your analysis as many questions as needed in order to be 100% aligned on all aspects of the project. This includes:
- What is the true purpose of the analysis (an analysis is always embedded in some greater scheme)? Often, knowing this lets you think of alternative or faster ways to get to a result, changing the course of the project right at its start.
- Any questions about the data you will be using. If something looks odd to you, ask and document the answer; it will come in handy afterward. Also, use multiple sources for your answers. Don’t assume that all the knowledge about data collected by a complex system can sit perfectly in one developer’s mind.
- Any questions about the system generating the data. This is important. Understand where the data comes from, who is generating these data points and how the system is generally used. This extra context always comes in handy when something out of the ordinary pops up in an analysis.
After the first round of questions, you are usually itching to get down to the analysis and code away. Watch out, you should always…
5. Start Simple
Start simple! Don’t burden your analysis with the most complex framework or a very complicated methodology right at the start. Starting with the simplest tools and then iteratively increasing complexity whenever necessary is a much better way to get results fast.
When you start simple, the increase in tool and analysis complexity will come naturally and will in fact lead to a much cleaner overall analysis. If you go about it the opposite way and start too big, scoping down will most likely never happen, leading to long, complex, dragging projects.
My tools of choice for starting a data science project are:
- psycopg2 to connect to a PostgreSQL database.
- pandas to structure my data frames.
- numpy for analysis that pandas isn’t suited for.
- matplotlib for figures.
- Jupyter Notebook to iterate and experiment.
That’s it. I start with these and get to a result as fast as possible. If I feel I’m struggling with one of these tools, I swap it for something that makes more sense. For instance, if I’m working with clusters, I might decide to move to something like Dask. This brings us to the next point, which is to…
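The whole starting stack fits in a few lines. Here is a minimal sketch of that first notebook cell; an in-memory SQLite database with a made-up `orders` table stands in for the read replica so the example runs anywhere, but with the real replica you would build a psycopg2 connection from the .env credentials instead:

```python
import sqlite3
import pandas as pd

# Against the real read replica you would do something like:
#   import psycopg2
#   conn = psycopg2.connect(host=config("DB_HOST"), ...)
# Here an in-memory SQLite database with fake data stands in.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL, region TEXT);
    INSERT INTO orders VALUES (1, 10.5, 'east'), (2, 20.0, 'west'),
                              (3, 5.25, 'east');
""")

# pandas turns any DB-API connection plus a SQL query into a DataFrame
df = pd.read_sql(
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region",
    conn,
)
print(df)
```

From there, pandas, numpy and matplotlib take over; the database only ever serves raw rows.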
6. Get Partial Result Out ASAP
Get something out as soon as possible. You shouldn’t wait until you have something clean and polished before iterating with the stakeholders. Waiting too long in a highly exploratory project with lots of unknowns is a sure way to get lost in the weeds.
Put something together with matplotlib and a bunch of tables to show where you could go and what the next steps are, and show this report to whoever requested the analysis. Sometimes, just this tiny step toward the goal will lead to great discussions, more questions being answered, or even a change of direction for the project. Talking about a project in theory and seeing results take shape in practice are vastly different things, and having these details leads to a much more worthwhile discussion for everyone involved.
Here it is important to stress that you shouldn’t be spitting out numbers and graphs without cohesion. You need to prepare something high-level enough to be digestible by the stakeholders and that will support whatever discussion you need to have. This is not the place to show off all the minutiae and details that go into your analysis. If you fail to bring the discussion to the level the stakeholders expect, it will hinder all following discussions and lead to a much more difficult project overall. To make sure that communication goes smoothly and enough details are there, without spending hours putting together a PowerPoint, you should…
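A partial-result figure really can be this small. The sketch below uses made-up numbers purely for illustration (the region totals and title are placeholders, not real results) and saves a stakeholder-ready PNG without needing a display:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Made-up partial results, just enough to anchor a discussion
regions = ["east", "west", "north"]
totals = [15.75, 20.0, 8.5]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(regions, totals)
ax.set_title("Revenue per region (partial, first week of data)")
ax.set_ylabel("Revenue ($)")
fig.tight_layout()
fig.savefig("partial_result.png")
```

One honest chart with a clear title beats a polished deck that arrives three weeks too late.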
7. Write a Report as You Go
Seriously, write the report before you even start doing any sort of analysis. All the insights you got from looking at the database, all the assumptions you’ve cleared up, all the questions you’ve asked and got answers to should be documented in an appendix so that you can reference them if needed.
Also make sure that the report can be contributed to collectively and that it is low-overhead to distribute. Something like a Google Doc shared with everyone involved will ensure that your questions get answered, that the answers get documented and that the stakeholders can discuss freely among themselves if there is any disagreement.
This document is not only vital for the final results you will hand over, it is an important source of information for all the non-data-scientists involved in the project. Having better insight into the system someone else is analyzing is a great way to find bugs or interesting trends to leverage!
Repeat these steps enough times and you will address the hypothesis in the best way possible 🎉!
I cannot stress enough how important it is to go through the iterations quickly. The most important part of a data science project is not really the analysis per se, but the structuring of the knowledge about the data. Analyses will need to be coded, statistical models might need to be trained and graphs produced, but it is much more important to highlight and structure the knowledge generated around the problem.
It’s rare that an analysis goes as initially planned and that the first understanding of the problem space was right. Most often, something was overlooked, not known at all, or learned along the way.
This kind of uncertainty about where a problem will lead you is what makes data science such a rewarding field to work in. However, if not properly balanced with a rigorous research methodology, it can lead to very frustrating situations.
About the Author
Yacine Mahdid is the Chief Technology Officer at Axya. He leads the technical development of the platform and the R&D division alongside his marvelous team of talented developers and scientists. He is also a graduate student at McGill University trained in computational neuroscience (B.A.Sc.) with a specialization in machine learning. Currently working at the Biosignal Interaction and Personhood Technology Lab, his research is focused on creating predictive and diagnostic models to detect consciousness in individuals who are not able to speak or move.