Table of Contents
- Growth and learning for anyone who aspires to become a data scientist
- Myself as a data scientist
- What are the types of data scientists in Silicon Valley?
- What skills do data scientists need?
- What should not be done as a data scientist?
- 1. Algorithms are the ONE and the only ONE
- 2. Don’t miss real business needs!
- 3. Don’t ignore the big picture
- next thing
Growth and learning for anyone who aspires to become a data scientist
Today, when you search Google for “data scientist” or for the skills it takes to become one, you are overwhelmed by the sheer volume of information posted on Medium, LinkedIn, news sites, private coaching sites, and so on. Everyone says that data scientist is a much-needed profession in the 21st century and that, to become one, you need to master statistics, programming, and machine learning. But does all that information really give us much value? What kind of work does a data scientist in industry actually do?
Work is a combination of reward and self-realization. Every choice is about finding the point of maximum satisfaction in this two-dimensional space. Whether you want to be a data scientist or a better data scientist depends on what value you can provide in your job and how you can fulfill yourself in your career.
Germans say that the world is concrete. So today I’m going to talk specifically about the Silicon Valley tech industry that I’m familiar with: the different requirements and skills of data scientist jobs there, and how to grow in this position. I also want to keep in mind something Charlie Munger said:
“All I want to know is where I’m going to die, so I’ll never go there.”
This quote can be read as saying that you should avoid at all costs any decision that could prove fatal, and that knowing about such risks in advance is important. By quoting Munger, the article’s author, Miao, seems to suggest that it is important to know the risks facing data scientists.
Myself as a data scientist
It’s already been four years since I joined Microsoft’s Silicon Valley office after graduating from the University of Illinois at Urbana-Champaign. Our group is mainly in charge of the “language model” that is the core of the speech recognition service provided by the company’s cloud platform “Azure”.
As a data scientist, what kind of impact and value have I created for the company over the past few years? I will list the results below.
- Designed and built a deep-learning Disfluency Tagging system for Teams Live Caption and Office Word Dictation, improving the Disfluency Tagging F1 score by 26.2% and the BLEU score for English-German translation by 2.54.
- Improved speech recognition accuracy and user experience for Bing Voice Search, reducing Surprise Metric by 7% and Word Error Rate by 2%.
- Piloted pre-training of multilingual neural network language models in 26 European regions to build an all-in-one subword tokenizer.
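For readers unfamiliar with the Word Error Rate mentioned in the list above, here is a minimal sketch of how the metric is commonly defined: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the evaluation code used in the projects above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper: tokenizes on whitespace, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For instance, if the recognizer drops one word from a six-word reference sentence, the function returns 1/6. Production evaluations usually normalize casing and punctuation before scoring, which this sketch leaves out.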
My work is closest to machine learning for natural language processing, a branch of data science commonly referred to as NLP. In Silicon Valley, the profile of a data scientist and the skills required differ from company to company and from group to group. In the next chapter, I want to talk about the types of data scientists in demand in Silicon Valley.
What are the types of data scientists in Silicon Valley?
Generally, there are three types of positions as a data scientist.
- data analyst
- data engineer
- machine learning engineer
First, the skills required for these three positions are different.
- Data analysts use SQL and other languages to process and summarize data, visualize statistics, derive business insights, and create reports based on their analysis. Facebook, for example, has a Data Scientist, Analytics career path, which is primarily responsible for designing statistical experiments such as A/B tests. Suppose you were designing a brand-new news recommendation system today: how would you know whether it helps increase user stickiness, grow the subscriber base, and generate revenue? Online evaluation plays that role, with data analysts performing the experimental design and statistical analysis for a series of A/B tests.
- Data engineers are technically a branch of software engineering and are primarily responsible for designing and building large-scale data infrastructure. On Instagram, for example, real-time user feedback data, such as the time at which a user viewed or clicked on a product or the time elapsed between viewing two products, is stored in the data system. This data helps build user profiles and make personalized product recommendations more accurate. Storing, processing, querying, and maintaining such large-scale online data is all the work of data engineers.
- Machine learning engineers are responsible for designing and developing large-scale machine learning systems. This position requires mastery of machine learning and deep learning, as well as good programming skills. In my view, the most important pillar of the job is turning business problems into machine learning problems. Machine learning engineers design quantitative metrics to define problems, collect and process large-scale data, and use machine learning to improve overall performance through iterative optimization of intelligent algorithms, ultimately realizing automatic decision-making. The recommendation systems that are ubiquitous in our daily lives, such as YouTube video recommendations, Spotify daily playlists, and Amazon product recommendations, are typical applications of machine learning. Other applications include intelligent speech recognition by Google Assistant and Alexa, human-computer interaction, machine translation, intelligent driving assistance, and internet advertising. Behind all of these are machine learning systems.
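The statistical analysis of an A/B test described for data analysts above can be sketched as follows. This is only one common approach, a pooled two-proportion z-test on a conversion-style metric, and it assumes a simple setup with a control group A and a treatment group B. The function name `two_proportion_z_test` and its parameters are hypothetical, chosen for illustration; real experiments may call for different tests and corrections.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and users in control group A (hypothetical names)
    conv_b / n_b: conversions and users in treatment group B
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups must contain at least one user")

    rate_a = conv_a / n_a
    rate_b = conv_b / n_b

    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")

    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

A call such as `two_proportion_z_test(200, 1000, 250, 1000)` compares a 20% conversion rate against a 25% one. A small p-value suggests the difference is unlikely to be due to chance alone, though whether it matters for the business is a separate question.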
Next, looking at compensation, engineering-related positions are generally paid more than analytics positions. Compensation is determined mainly by supply and demand for each position, which is basic economics. On the supply side, the learning curve for chart visualization with a suite of off-the-shelf software is lower than that for machine learning or programming, and people with real industry experience in the latter are scarce. On the demand side, the penetration of cloud computing platforms into many industries has enabled the digitization and free flow of large-scale data, and the rapid expansion of data intelligence has led to a talent shortage in this field while demand keeps increasing. The compensation for each position can be expressed by the following inequality.
Machine Learning Engineer >= Data Engineer >> Data Analyst
Of course, what Silicon Valley needs most is still software development itself. For many products, machine learning is the icing on the cake. Take Zoom, which we all use for video conferencing: some capabilities provided by machine learning can help improve user experience and customer retention, but, reasoning from first principles, what we need first is low-latency, frictionless video communication software. That said, nearly all new products and services these days seem to be built on data intelligence. Such products collect user behavior data in real time, drive intelligent systems through iterative machine learning, and establish a technical system that attracts more users and collects more data. TikTok, for example, is driving this data -> model -> product cycle. It is commonly said that machine learning in industry is essentially an engineering problem, not a purely scientific one. I hope to provide concrete examples of this in future blog posts.
What skills do data scientists need?
The skill tree above contains the hard skills I consider essential for a data scientist, though the emphasis among them differs depending on the needs of each position. Of course, hard skills aren’t everything when it comes to succeeding in the real world. To truly grow, soft skills become even more important: how to communicate and collaborate with colleagues, how to write emails, how to effectively convey what you have done, how to demonstrate leadership, how to manage yourself and work with your direct manager, how to build your influence, and so on. These are areas I reflect on every day and am constantly learning about. I will share more of my ideas on soft skills in future blog posts.
Interviews generally screen job applicants on four dimensions:
- Coding: Python (algorithms/data structures) + SQL
- Machine learning system design
- A/B test design + statistical analysis
- Project experience on your résumé
What should not be done as a data scientist?
1. Algorithms are the ONE and the only ONE
With many mature products already loaded with machine learning, it’s hard to imagine that coming up with a brand new, crazy algorithm in a short period of time would suddenly improve the performance of the product. Opportunities for exponential performance improvements do exist during the transition from traditional machine learning to deep learning, especially for products with massive amounts of data. In the long term, we can also expect the algorithmic breakthrough moments that occur every five years. But in many real-world cases, it is data, that is, new information properly processed, that creates value and helps improve system performance.
We are living in an era in which computing power and general-purpose algorithms have become infrastructure, like water and electricity. Cloud services make it easy for anyone to take advantage of machine learning and create their own data products. Accurate, novel, digitized data that has never been mined before is at the core of your work.
2. Don’t miss real business needs!
If you spend a lot of time building a deep learning model, you might improve your offline metrics by 1%. However, such a 1% offline improvement may not be reflected in online evaluations and may not serve real business needs. How to translate real business needs into data + machine learning solutions, and how to shape what the model learns so that it meets the ultimate business objectives, are the first things we must think about in our work.
3. Don’t ignore the big picture
If you are so caught up in your own project that you do not grasp the overall picture of the product, you will, first, lose opportunities to discover new growth points and, second, see the marginal return on your efforts drop significantly. If it takes half a year to improve a certain offline metric by 1% and no one pays much attention to that improvement, the cost-effectiveness of the project is very low. If you always think about the overall direction, you will be able to find untouched wilderness and lead from 0 to 1.
AINOW translated article preaching the rules of a data scientist
This article is my first post reflecting on my role as a data scientist. In the following posts in this series, I will discuss in detail machine learning algorithms, what real machine learning systems look like, my (painful) LeetCode practice, and new things I learn in my daily work.