Tuesday, May 28, 2024

A data scientist must know how to write code!

Table of Contents

  • Don’t just write code, write good code
  • Start with a function
  • Delete print statements
  • Using style guides
  • Create a test
  • Conducting code reviews
  • Coding level up

Don’t just write code, write good code

I know what you’re thinking if you’re a data scientist being asked whether you can code: “Of course I know how to code. Is that even a serious question?”

Every day you write hundreds of lines of code in Jupyter notebooks. You can clearly write code. We’re not training machine learning models by hand or in Excel (although that’s possible).

So what exactly do I want to say?

It’s hard to say this, but most of the coding that data scientists do is not what I would consider programming. Data scientists use programming languages as tools to explore data and build models. The programs they create exist only to get the job done, and little care goes into the code itself.

A data scientist’s code is often messy and may not even run from top to bottom (thanks to notebooks). Many data scientists have also never written a unit test and know very little about how to write good, reusable functions.

However, as data science becomes more embedded in real products, such code simply cannot keep up. You can’t trust bad code, and putting untrusted code into your product creates a lot of technical debt and a bad user experience.

Technical debt is the cost of later repairing the damage caused by software development shortcuts such as overly complex source code and undocumented specifications. While defects are few, the repair cost is low, but as defects accumulate the repair cost grows, which is why it is likened to a debt.

“Okay, okay, but I’m a data scientist, not a software engineer,” you say. “My job is to build the model; it’s someone else’s job to clean up the code.” Some companies might do it that way, but I think it’s a much better idea for data scientists to learn how to write better code. You may never be an elite-level software engineer, but a data scientist can write reliable code and, with a little effort, get that code into production.

Start with a function

To take your code to the next level, start by writing functions. Most code is a collection of functions (or potentially classes), and being able to write good functions goes a long way towards improving the quality of your code.

At a minimum, a good function should:

  1. Do only one thing (one task)
  2. Be documented
  3. Use good variable names

Many books have been written on how to write clean functions, but let’s start with these three points.

Functions that try to do more than one thing should be avoided at all costs. Signs that a function may be doing too much include:

  • It’s longer than a full screen, or longer than about 30 lines of code (in my experience).
  • It’s trying to do so many things that it is difficult to give the function a clear name.
  • You have a lot of code inside if/else blocks. In these cases, you really need to split them into separate functions.

Functions that do only one thing are important for making code easier to understand, maintain, and test (more on testing later).
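As a rough sketch of the one-task rule, the helpers below split “remove the missing values, then summarize what is left” into two small functions plus a third that composes them. All of the function names are hypothetical and chosen only for illustration.

```python
def remove_missing(values):
    """Return the values with None entries removed."""
    return [value for value in values if value is not None]


def summarize(values):
    """Return the minimum, maximum, and mean of a non-empty list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


def summarize_clean_values(raw_values):
    """Compose the two single-purpose steps above."""
    return summarize(remove_missing(raw_values))
```

Because each piece does one thing, each can be named clearly and checked on its own.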

Functions that are released to production need documentation. That documentation should describe what the function does and what its input parameters are, and it may include a simple example of how to use the function. You will thank yourself later for documenting a function, and it will make it much easier for others to understand your code.

Finally, try to use descriptive and friendly variable names. Many data scientists fall into the habit of using variable names like ‘a’, ‘a1’, ‘a2’. Short but unfriendly variable names are quicker to type when experimenting, but when you put code into production, choose variable names that help others understand it.
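One possible shape for a documented function with descriptive names is sketched here. The function, its parameters, and the docstring layout are illustrative assumptions rather than a required format; the same logic written as `def f(a, a1)` would run identically but tell the reader nothing.

```python
def normalize_scores(scores, upper_bound=100):
    """Scale raw scores to the range 0 to 1.

    Parameters:
        scores: list of numbers between 0 and upper_bound.
        upper_bound: the largest possible raw score (default 100).

    Returns:
        A list of floats, one per input score.

    Example:
        normalize_scores([50, 100]) returns [0.5, 1.0]
    """
    return [score / upper_bound for score in scores]
```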

Delete print statements

Data scientists often use print statements to display information about what is happening. However, in a production environment, these print statements should be removed or converted to log statements when they are no longer needed.

Logging should be the way your code communicates information and errors. Loguru is a Python library that makes logging simpler. It handles most of the heavy lifting for you, so logging feels almost like using a simple print statement.
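To show what converting a print statement into a log statement can look like, here is a minimal sketch that uses Python’s standard-library `logging` module instead of Loguru, so that it runs without installing anything. The logger name and the `drop_missing_rows` function are hypothetical.

```python
import logging

logger = logging.getLogger("training")  # hypothetical logger name


def drop_missing_rows(rows):
    """Return only the rows that contain no None values."""
    kept = [row for row in rows if None not in row]
    # Previously: print("dropped", len(rows) - len(kept), "rows")
    logger.info(
        "Dropped %d of %d rows with missing values",
        len(rows) - len(kept),
        len(rows),
    )
    return kept


if __name__ == "__main__":
    # Configure logging once, at the program's entry point.
    logging.basicConfig(level=logging.INFO)
    drop_missing_rows([(1, 2), (3, None)])
```

Unlike a print statement, the message can be silenced, redirected to a file, or filtered by level without touching the function itself.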

Using style guides

Style guides in programming are used to allow many people to work on the same code, yet make most of that code look like it was written by one person.

Why are style guides important?

Having a consistent style makes the code much easier to navigate and understand, which in turn makes bugs easier to spot. When everyone adheres to the same coding conventions, you can read other people’s code as easily as your own. You don’t have to spend time deciphering how the code is formatted; instead, you can focus on what the code does and whether it works correctly.

PEP 8 is probably the most widely used style guide for Python. But there are many style guides out there. Another popular source for style guides is Google, who publish their internal style guides.

(*Translation Note 2) PEP 8 is the coding convention for the Python code included in the Python standard library.
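For a sense of what following PEP 8 looks like in practice, the short sketch below applies a few of its conventions. The constant and function are hypothetical and exist only to illustrate naming and layout.

```python
# Constants use UPPER_CASE_WITH_UNDERSCORES.
DEFAULT_THRESHOLD = 0.5


# Functions and variables use lower_case_with_underscores,
# indentation is four spaces, and binary operators get one
# space on each side.
def count_above_threshold(values, threshold=DEFAULT_THRESHOLD):
    """Return how many values are strictly greater than threshold."""
    above_count = 0
    for value in values:
        if value > threshold:
            above_count += 1
    return above_count
```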

The key is to pick one style and stick to it. One way to make this easier is to have your IDE check for style errors and to set up style checks that prevent code from being pushed if it doesn’t follow the style guide. You can even use an autoformatter: you write your code the way you like, and when you run the formatter it rewrites the code so that it conforms to the standard. A popular autoformatter for Python is Black.

(*Translation Note 3) An IDE (Integrated Development Environment) is a comprehensive software development environment that integrates a text editor, compiler, and debugger.
Black, the Python autoformatter, is available on GitHub.

Create a test

I think most data scientists fear testing because they don’t know how to start.

In fact, many data scientists already run what I call temporary tests. It is common to quickly run some “sanity checks” on new functions in a notebook. In other words, they run a few simple test cases to make sure the function works as expected.

Software engineers call this process unit testing.

The only difference between data scientist and software engineer testing is that the former deletes the temporary test and moves on. Instead of deleting the test, you should save it and always run the test before the code is pushed to make sure nothing is broken.

To get started with testing in Python, I use pytest. With pytest you can easily create tests and run them all at once to see if they pass. A simple way is to create a directory called “tests” and put Python files whose names start with “test_” in that directory. For example, you can create “test_addition.py” like this:

# content of test_addition.py
def add(x, y):
    return x + y


def test_add():
    assert add(3, 2) == 5

The Python file above first defines add(x, y), which returns the sum of two values, and then defines test_add() to test it. Running test_add() uses the assert statement to confirm that add(x, y) returns 5 when x is 3 and y is 2. Note that the assert statement raises an AssertionError if its condition is not met.

Typically, you would write the actual function in a separate Python file and import it into your test module. Also, you probably don’t need to test something as trivial as addition, but I wanted to show a very simple example.

These test modules can store all the “sanity checks” of your functions. In general, it is good practice to test edge cases and potential error cases as well as normal cases.

Note: There are many different types of tests. Unit tests seem to be the best place for data scientists to start testing.
An edge case is a test case in which a single parameter is set to an extreme value, for example, turning a stereo’s volume up to its maximum. A similar concept is the corner case. Corner cases also involve extreme parameter values, but they differ from edge cases in that multiple parameters are at their extremes at the same time.
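As a sketch of covering a normal case, an edge case, and an error case, the hypothetical test module below checks a small `mean` function. The file name and function are assumptions for illustration. The error-case test uses a plain try/except so that the file runs without any imports beyond the standard library; with pytest installed, `pytest.raises(ValueError)` is the more idiomatic way to write it.

```python
# content of test_mean.py (hypothetical file name)
import math


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


def test_mean_normal_case():
    assert mean([1, 2, 3]) == 2


def test_mean_single_value_edge_case():
    assert mean([7]) == 7


def test_mean_handles_floats():
    # Compare floats with a tolerance rather than with ==.
    assert math.isclose(mean([0.1, 0.2]), 0.15)


def test_mean_empty_input_raises():
    try:
        mean([])
    except ValueError:
        pass  # the expected error was raised
    else:
        raise AssertionError("expected ValueError for empty input")
```

Running pytest against the “tests” directory collects every function whose name starts with “test_” and reports which ones pass.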

Conducting code reviews

Finally, at the top of the list of things you can do to write better code is code review.

A code review means having another person, someone familiar with writing code in the domain you work in, review your code before it is merged into the main branch. This step helps ensure that best practices are followed and, ideally, catches bad code and bugs.

The code reviewer should preferably be someone who can write code as well as you, but even having someone less experienced than you review your code can be very beneficial.

It’s human nature to be lazy, and that laziness can creep into your code. Knowing that someone will review your code is also a great incentive to spend time writing good code. Code review is also the best code improvement method I’ve found. Having an experienced colleague review your code and give suggestions for improvement is invaluable.

Try to keep the amount of new code small to make things easier for the people reviewing it. Small, frequent code reviews work well. Huge, occasional code reviews are the worst. No one wants to be sent 1,000 lines of code to review. Such reviews tend to produce poor feedback because reviewers cannot take the time needed to understand that much code at once.

(*Translation Note 7) The tweet image in the original article translates as follows.
10 lines of code (review) = 10 issues (pointed out)
500 lines of code (review) = “looks good”
The point of the tweet is that a 10-line code review lets the reviewer examine the code in detail and point out even minor problems, whereas in a 500-line code review the reviewer will not put in the effort to examine the code closely and simply returns a rough “looks good.”

Coding level up

I hope this article inspires you to take the time to learn how to write better code. Getting good at coding is not inherently difficult, but it does take time and effort.

If you follow the five suggestions above, you should see a huge improvement in the quality of your code.


