Table of Contents
- Don’t just write code, write good code
- Start with a function
- Delete print statements
- Using style guides
- Create a test
- Conducting code reviews
- Coding level up
Don’t just write code, write good code
I know what you’re thinking if you’re a data scientist who has just been asked whether you can code: “Of course I know how to code. Is it insane to even ask?”
Every day you write hundreds of lines of code in Jupyter notebooks, so clearly you can write code. It’s not as if you train machine learning models by hand or in Excel (although that is possible).
So what exactly do I want to say?
I hate to say it, but most of the coding that data scientists do is not what I would consider real programming. Data scientists use programming languages as tools to explore data and build models. The programs you create along the way exist only to get the job done, and you don’t really care about the quality of the code itself.
A data scientist’s code is often messy and may not even run from top to bottom (thanks, notebooks). You have probably never written a unit test, and you may know very little about how to write good, reusable functions.
However, as data science becomes more embedded in real products, this kind of code is no longer good enough. You can’t trust bad code, and putting code you can’t trust into a product creates a lot of technical debt and a bad user experience.
“Okay, okay, but I’m a data scientist, not a software engineer,” you say. “My job is to build the model; cleaning up the code is someone else’s job.” Some companies might work that way, but I think it is a much better idea for data scientists to learn how to write better code. You may never become an elite software engineer, but a data scientist can write reliable code and, with a little effort, get that code into production.
Start with a function
To take your code to the next level, start by writing functions. Most code is a collection of functions (or potentially classes), and being able to write good functions goes a long way towards improving the quality of your code.
At a minimum, a function should:
- Do only one thing (one task)
- Be documented
- Use good variable names
Many books have been written on how to write clean functions, but let’s start with these three points.
Functions that try to do more than one thing should be avoided at all costs. Signs that a function may be doing too much include:
- It’s longer than a full screen, or longer than about 30 lines of code (in my experience).
- It’s hard to give the function a clear name because it does so many things.
- It has a lot of code inside if/else blocks. These branches usually belong in separate functions.
Functions that do only one thing are important for making code easier to understand, maintain, and test (more on testing later).
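As a rough illustration of splitting work into single-purpose functions, here is a minimal sketch. The function names and the age-cleaning scenario are hypothetical and chosen only for the example; each helper does one thing, and a small top-level function composes them.

```python
def parse_ages(raw_values):
    """Convert raw strings to integers, skipping blank entries."""
    return [int(value) for value in raw_values if value.strip()]


def drop_outliers(ages, max_age=120):
    """Keep only ages in a plausible range (max_age is an assumed cutoff)."""
    return [age for age in ages if 0 <= age <= max_age]


def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def average_valid_age(raw_values):
    """Compose the single-purpose steps above into one readable pipeline."""
    return average(drop_outliers(parse_ages(raw_values)))


print(average_valid_age(["30", " ", "40", "999"]))  # 35.0
```

Because each step is its own function, each one can be named clearly and checked in isolation.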
Functions that go to production need documentation. The documentation should say what the function does, describe the input parameters, and ideally give a simple example of how to use the function. You will thank yourself later for documenting your functions, and others will find your code much easier to understand.
Finally, use descriptive, helpful variable names. Many data scientists fall into the habit of using names like ‘a’, ‘a1’, and ‘a2’. Short, cryptic names are quicker to type while experimenting, but when code goes to production, choose names that help others understand it.
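To make the documentation and naming advice concrete, here is one possible sketch. The function `min_max_scale` is a hypothetical example, and the docstring layout (Args, Returns, Example) is just one common convention, not the only acceptable one.

```python
def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1].

    Args:
        values: A non-empty list of numbers that are not all equal.

    Returns:
        A list of floats in which the smallest input maps to 0.0
        and the largest input maps to 1.0.

    Example:
        >>> min_max_scale([2, 4, 6])
        [0.0, 0.5, 1.0]
    """
    # Descriptive names instead of 'a', 'a1', 'a2'.
    smallest = min(values)
    largest = max(values)
    value_range = largest - smallest
    return [(value - smallest) / value_range for value in values]
```

The docstring covers what the function does, its input, and a usage example, while names like `smallest` and `value_range` let a reader follow the arithmetic without guessing.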
Delete print statements
Data scientists often use print statements to display information about what is happening. In production, however, print statements that are no longer needed should be removed, and the ones that are still useful should be converted to log statements.
Logging should be how your code communicates information and errors. Loguru is a Python library that makes logging simpler: it handles most of the heavy lifting for you, so logging feels almost like using a simple print statement.
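Here is a minimal sketch of replacing a print statement with a log statement. To keep it runnable without extra installs, it uses Python's standard `logging` module rather than Loguru; with Loguru, the rough equivalent would be `from loguru import logger` followed by the same kind of `logger.info(...)` call. The logger name and the `count_missing` function are hypothetical.

```python
import logging

# Configure logging once, near the program's entry point.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("data_checks")  # hypothetical logger name


def count_missing(rows):
    """Return how many entries are None, logging instead of printing."""
    missing = sum(1 for row in rows if row is None)
    # Previously: print("missing:", missing)
    logger.info("Checked %d rows, %d missing", len(rows), missing)
    if missing:
        logger.warning("Found %d missing rows", missing)
    return missing


count_missing([1, None, 3])
```

Unlike print output, log messages carry a level and a timestamp, and they can be filtered or redirected to a file without touching the function itself.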
Using style guides
Style guides in programming exist so that many people can work on the same code while it still looks as if it was written by one person.
Why are style guides important?
A consistent style makes code much easier to navigate and understand, and it makes bugs easier to spot. Following standard coding conventions also helps others navigate your code as easily as you do. Readers don’t have to spend time parsing how the code is formatted; instead, they can focus on what the code does and whether it works correctly.
PEP 8 is probably the most widely used style guide for Python. But there are many style guides out there. Another popular source for style guides is Google, who publish their internal style guides.
The key is to pick one style and stick to it. One way to make this easier is to have your IDE check for style errors, and to set up style checks that prevent code from being pushed if it doesn’t follow the style guide. You can even use an autoformatter: you write the code the way you like, and it is automatically reformatted to conform to the standard when you run the formatter. A popular autoformatter for Python is Black.
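As a small illustration of what following a style guide changes, the sketch below shows a hypothetical function before and after applying a few PEP 8 conventions (snake_case function names, no padding inside parentheses, spaces around binary operators). It is hand-written for this example and is not the output of any particular tool.

```python
# Before (runs, but ignores PEP 8 naming and spacing):
#
#   def CalcMean( L ):
#       return sum(L)/len(L)

# After: snake_case name, descriptive parameter, consistent spacing.
def calculate_mean(numbers):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(numbers) / len(numbers)


print(calculate_mean([1, 2, 3]))  # 2.0
```

An autoformatter can take care of the spacing for you, but choosing good names is still up to you.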
Create a test
I think most data scientists fear testing because they don’t know how to start.
In fact, many data scientists already run what I call ad hoc tests. It is common to quickly run a few “sanity checks” on a new function in a notebook. In other words, they run a few simple test cases to make sure the function works as expected.
Software engineers call this process unit testing.
The only difference between how data scientists and software engineers test is that data scientists delete these quick tests and move on. Instead of deleting a test, save it and run it every time before code is pushed to make sure nothing is broken.
To get started with testing in Python, I use pytest. With pytest you can easily create tests and run them all at once to see whether they pass. A simple approach is to create a directory called “tests” and put Python files whose names start with “test_” in it. For example, you could create “test_addition.py” like this:
# content of test_addition.py
def add(x, y):
    return x + y


def test_add():
    assert add(3, 2) == 5
Running pytest reports a failure if any assert condition is not met.
Typically, you would write the actual function in a separate Python file and import it into your test module. You also probably wouldn’t need to test Python’s addition like this, but I wanted to show a very simple example.
These test modules can store all the “sanity checks” of your functions. In general, it is good practice to test edge cases and potential error cases as well as normal cases.
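A test module covering a normal case, an edge case, and an error case might look like the sketch below. The `normalize` function and the file name are hypothetical; in a real project the function would live in its own module and be imported. The error case uses a plain try/except so the file needs no imports; with pytest installed, `pytest.raises(ValueError)` is the more common idiom.

```python
# content of tests/test_normalize.py (hypothetical file name)

def normalize(values):
    """Scale numbers so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("values must have a non-zero sum")
    return [value / total for value in values]


def test_normal_case():
    assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]


def test_edge_case_single_value():
    assert normalize([5]) == [1.0]


def test_error_case_empty_input():
    # An empty list sums to 0, so normalize should refuse it.
    try:
        normalize([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for empty input")
```

Running `pytest` from the project root would collect the three `test_` functions and report each one as passed or failed.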
Conducting code reviews
Finally, at the top of my list of things that help you write better code is code review.
A code review means having another person, one who is familiar with writing code in your domain, review your code before it is merged into the main branch. This step helps ensure that best practices are followed and, hopefully, catches bad code and bugs.
The reviewer should preferably be someone who codes at least as well as you do, but even having someone less experienced review your code can be very beneficial.
It’s human nature to be lazy, and that laziness can creep into your code. Knowing that someone will review your code is a great incentive to spend the time to write it well. Code review is also the best way I have found to improve my code: having an experienced colleague review it and suggest improvements is invaluable.
Keep the amount of new code in each review small to make life easier for your reviewers. Small, frequent code reviews work well; huge, occasional ones are the worst. No one wants to be sent 1,000 lines of code to review. Such reviews tend to produce poor feedback because reviewers cannot take the time needed to understand that much code at once.
10 lines of code (to review) = 10 issues (pointed out)
500 lines of code (to review) = “looks good”
The point of the tweet above is that a reviewer can examine 10 lines of code in detail and point out even minor problems, whereas a reviewer faced with 500 lines will not put in that much effort and will not examine the code closely. What comes back is a cursory “looks good.”
Coding level up
I hope this article inspires you to take the time to learn how to write better code. Getting good at coding is not especially difficult, but it does take time and effort.
If you follow the five suggestions above, you should see a huge improvement in the quality of your code.