Many aspiring data scientists want to know what those already working in the field actually do all day, so I thought that asking a few connections to share their insight might be a useful endeavor.
What follows is some of the great feedback I received via email and LinkedIn messages from those who were interested in providing a few paragraphs on their daily professional tasks. The short daily summaries are presented in full and without edits, allowing the quotes to speak for themselves.
Andriy Burkov is Global ML Team Leader at Gartner, located in Quebec City.
My typical day starts at 9am with a 15-30 min long Webex meeting with my team: my team is distributed, half in India (Bangalore and Chennai) and half in Canada (Quebec City). We discuss the advancement of the projects and decide on how to overcome difficulties.
Then I read my emails received during the night and react if necessary. After that I work on my current project, which currently is a salary extractor from job announcements. I need to create a separate pair of models for each country-language pair we support (around 30 country-language pairs). The process consists of dumping the job announcements for a certain country-language pair, clustering them, then getting the subset of training examples. Then I annotate these examples manually and build the model. I iterate build/test/add data/rebuild until the test accuracy is high enough (~98%).
In the afternoon, I help my team members to improve their models by testing the current model on the real data, identifying the false positives/negatives and creating new training examples to fix the problem. The decision when to stop improving the model and deploy in production depends on the project. For some cases, especially user-facing, we want a very low level of false positives (less than 1%): the user always sees when the extraction of some element from their text is wrong, but does not always notice a missing extraction.
The day ends at around 17:30 with 30 minutes of catching up on tech news and blogs.
Colleen Farrelly is a Data Scientist at Kaplan, located in Miami.
Here’s a little background on me and what a day in my life is like:
I switched into data science and machine learning during an MD/PhD program after a joint humanities and sciences undergraduate degree, and my day-to-day projects are highly interdisciplinary as a rule. Some projects include simulating epidemic spread, leveraging industrial psychology to create better HR models, and dissecting data to obtain risk groups for low socioeconomic status students. The best part of my job is the variety of projects and a new challenge every day.
A typical day for me starts around 8:00 am, when I catch up on my social media accounts related to machine learning and data science. I switch into work projects around 8:30 am and finish around 4:30 pm to 5:00 pm with a break for lunch. About 40% of my time is spent on research and development, with a strong focus in mathematics (topology, in particular)–involving anything from developing and testing new algorithms to writing mathematical proofs to simplify data problems. Sometimes, the results are confidential and stay within the company (shared through monthly Lunch & Learn presentations within the company); other times, I’m allowed to publish or present at external conferences.
Another 30% of my time is spent building relationships across departments at my company and seeking new projects, which often identify problems related to operating procedures, problems related to data capture, or connections between previous projects that provide a more comprehensive view of operations. This is probably one of the most crucial aspects of the job. People I meet often bring up problems they are seeing or mention how neat it would be to have a predictive model for sales/student outcomes/operations, and I’ve found it opens the door to conversations and best practice suggestions down the road. As a data scientist, it’s important to communicate with a wide range of stakeholders, and it’s helped me simplify my explanations of machine learning algorithms to a layman’s level.
The remaining 30% of my time is typically spent on data analysis and writing up results. This includes forecast models, predictive models of key metrics, and data mining for subgroups and trends within a given dataset. Each project is unique, and I try to let the project and its initial findings guide me to next steps. I mainly use R and Tableau for projects, though Python, Matlab, and SAS are occasionally helpful with specific packages or R&D requests. I can usually recycle the code, but each problem has its own assumptions and data limitations with respect to the mathematics. Projects can usually be simplified using tools from topology, real analysis, and graph theory, which speeds up the project and allows for the use of extant packages, rather than a need to code from scratch. As the only data scientist at a large company, this allows me to cover more projects and uncover more insight for our internal customers.
Marco Michelangeli is a Data Scientist at Hopenly, and resides in Reggio Emilia, Italy.
When Matthew asked me to write a few paragraphs about my “typical” day as a data scientist, I started thinking about my routine and daily job, but then I stopped and realised: “I do not really have a routine!” and this is the best thing about being a data scientist! Every day is different, a new challenge comes up and a new problem sits there waiting to be solved. I am not just talking about coding, math and statistics, but about the complexity of the business world: I often talk with business people and clients to understand their real needs, I help the marketing team with content on our products, I participate in meetings about new ETL workflows and architecture design for a new product to be released; I even found myself screening data scientist CVs.
Being a data scientist means being flexible, open minded and ready to solve problems and embrace complexity, but do not get me wrong: I spend more than 80% of my time cleaning data! If you are just starting a career in Data Science, you have probably come across posts of the type: “10 tips to master R and Python in Data Science” or “The best Deep learning library”, therefore I won’t give you any more technical suggestions. The only thing that I can say comes from the professional data science manifesto and it is: “Data Science is about solving problems, not building models.” This means that if you can solve a client need with just a SQL query, do it! Do not frustrate yourself over complex machine learning models: be simple, be helpful.
Ajay Ohri is a Data Scientist at Kogentix Inc. in New Delhi. He has also written 2 books on R and one on Python.
My typical day begins at 9 AM with a scrum call. Our methodology of project working is to divide tasks into two week goals or sprints. This is basically the agile development method for software and it is different from CRISP-DM or KDD methodologies.
A bit of context is necessary to explain why we do so. My current role is a data scientist in a team implementing Big Data Analytics in a Southeast Asian bank. We have data engineers, admin/infrastructure people, data scientists and of course customer engagement managers in the team catering to each specific need of the project. My current organization is an AI startup named Kogentix, not only having Big Data Services but also a Big Data Product named AMP which acts like a GUI on PySpark and tries to automate Big Data. AMP is quite cool and I will come to it soon. This leads to the focus of my startup to get as many clients as possible as well as test and implement our Big Data Product. This means demonstrating success in our client engagements: one of our clients was shortlisted for an award last month. Am I sounding too marketing-oriented? You bet I am. The work a data scientist does is usually of a strategic consequence to the client.
What do I do on a daily basis? It could be many things, including not just emails and meetings. I could be using Hive to pull data, using it to merge data (or using Impala), I could be using PySpark (MLlib) to make churn models or do k-means clustering. I could be pulling data into an Excel file to make summaries and I could be making data visualizations. Some days I prototype in R using some machine learning packages. I also help with testing of AMP, our Big Data Analytics product, and work with that team for feature enhancement of the product (if you forgive the pun, since the product is used for feature enhancement). When I code Big Data, I could be using Hue, the GUI for Hadoop, or I could be using command line programming including batch submitting of code.
Prior to this, when I was working for India’s 3rd biggest software company Wipro, my role was quite the opposite. Our client was India’s Ministry of Finance (the arm that deals with taxes). Junior data scientists pulled data using SQL from an RDBMS (due to legacy issues), and I validated the results. The reports were then sent to the various clients. On an ad-hoc basis we also used SAS Enterprise Miner as a concept test to show time series forecasts of imports and exports for India. Timelines are quite slow and bureaucratic when working for a federal government vis a vis working for the private sector. I remember one presentation when the bureaucrat in charge was astonished that we were executing machine learning and asked why the government had not used it earlier. But SAS/VA (for Dashboards), SAS Fraud Analytics (which I trained on and which was in process of implementation) and Base SAS (the analytics workhorse) are amazing software and I doubt that anything resembling SAS Domain Specific Bundles can be made soon.
Prior to this, for ten years I ran Decisionstats.com. I blogged, sold ads (not very good), wrote 3 books in data science, scores of articles for Programmable Web and StatisticsViews, and did some data consulting. I even wrote a few articles for KDnuggets. You can see my profile here: https://en.m.wikipedia.org/wiki/Ajay_Ohri
Eric Weber is a Senior Data Scientist at LinkedIn, located in Sunnyvale, California.
A day in the life at LinkedIn. Well, I think I can say there is no “typical” day. Keep that in mind as you read!
First, a little bit about me and my major responsibilities. I’m fortunate to work on our LinkedIn Learning team, which is the newest data science group in the organization. Specifically, I support Enterprise level sales for LinkedIn Learning. What does that mean? Think about it like this: we use data, models and analytics to make decisions on how to sell effectively. Of course, the details on how we do that are internal but you can imagine that we want to answer questions like: which accounts do we try to sell into? We work to understand what makes certain accounts stand out from the rest.
Second, a key aspect of every day is communication. I’ve written about this extensively on LinkedIn but I believe that effective communication with teammates and business partners is a defining characteristic of a great data scientist. On a typical day, this involves providing updates on key projects to immediate team members, managers and senior leaders, as appropriate. One thing I find fascinating about this aspect of the job is the need for brevity. A company like LinkedIn has tons of internal communication happening so everything that goes out must be distilled into clear and concise results/talking points.
Finally, an important part of each day is failure. I’m a big believer that if you are not failing, you are not learning. This does not mean catastrophic failure of course. It means that each day I work on things that expand my understanding of analytics, data science and the organization itself. I learn from my mistakes and watch how others do things more efficiently or in different ways from me. When I wake up each day, I seek failure as part of the job because it makes me better the next day. Analytics and the rapid pace of expansion of data science sure provide plenty of these opportunities!
Hopefully these accounts have provided you with some deeper insight into what data scientists do on a daily basis.