# DA/DS Job-Hunt Interview Practice Guide (Part 1) - Includes Referral Opportunities

Preparing for a Data Scientist/Data Analyst role usually means focusing on the following areas:

• Machine Learning
• Statistics, probability, and A/B testing
• Online coding (Python + R)
• SQL
• Product sense
• Project
• Extra Skills

1. Common interview questions
• What is overfitting? / Please briefly describe bias vs. variance.
• How do you overcome overfitting? Please list 3-5 approaches from practical experience. / What is the ‘curse of dimensionality’? How do you prevent it?
• Please briefly describe the Random Forest classifier. How does it work? Any pros and cons in practical implementation?
• Please describe the difference between a GBM tree model and Random Forest.
• What is SVM? What parameters do you need to tune during model training? How do different kernels change the classification result?
• Briefly explain PCA in your own words. How does it work? What are some of its pros and cons?
• Why doesn’t logistic regression use R^2?
• When would you use L1 regularization rather than L2?
• List at least 4 metrics you would use to evaluate model performance and describe the advantage of each. (F1 score, ROC curve, recall, etc.)
• What would you do if an important field has more than 30% missing values before you build the model?
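For the model-evaluation question above, it can help to be able to compute the basic metrics by hand. The following is a minimal sketch in plain Python, assuming binary labels encoded as 0/1; the function names and the toy labels are hypothetical and only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy, made-up labels: 3 positives out of 10 (an imbalanced case).
    y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    print(classification_metrics(y_true, y_pred))
```

On imbalanced data like the toy example, accuracy looks better than precision and recall do, which is one reason interviewers ask for more than one metric.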
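1. Reference materials
• Andrew Ng's Machine Learning course on Coursera: https://www.coursera.org/learn/machine-learning. It is practically an archaeological artifact by now; the content is somewhat dated but classic, and it suits business-school BA students who need to build up an ML background from zero.
• 15 hours of expert ML videos: https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
• ISLR (free direct link), the go-to introductory book.
• Practical Statistics for Data Scientists: 50 Essential Concepts, a very practical book that focuses on small, specific topics. It does not go deep, but after reading it you feel you understand ML a little better.
• The Towards Data Science publication on Medium, for example the short Machine Learning 101 series (Machine Learning 101 – Medium). It is very accessible and helps beginners grasp abstract algorithms through concrete illustrations.
• StackOverflow (https://stackoverflow.com/) naturally cannot be left out. Learning data or programming always throws up small, fiddly problems that articles rarely cover, so you need to lean on the community.
• DataCamp: Machine Learning A-Z: https://lnkd.in/gXqdBsQ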

1. Common interview questions
• What is a p-value? What is a confidence interval? Explain them to a product manager or non-technical person… (Clearly they do not want you to answer: draw a normal distribution and cut off 5% on each side.)
• How do you understand the “Power” of a statistical test?
• If a distribution is right-skewed, what’s the relationship between median, mode, and mean?
• When do you use a t-test instead of a z-test? List some differences between the two.
• Dice problem-1: How would you test whether a coin is fair? How would you design the process (sometimes you are asked to implement it in code)? What test would you use?
• Dice problem-2: How do you simulate a fair coin with one unfair coin?
• The 3-door question (the Monty Hall problem). (Google it yourself; it is one of the classics.)
• Bayes Questions: Tom takes a cancer test and the test is advertised as being 99% accurate: if you have cancer you will test positive 99% of the time, and if you don’t have cancer, you will test negative 99% of the time. If 1% of all people have cancer and Tom tests positive, what is the probability that Tom has the disease? (A classic cancer-screening question; once you can do this one, the rest are no problem.)
• How do you calculate the sample size for an A/B test?
• If, after running an A/B test, you find that the desired metric (e.g., Click Through Rate) is going up while another metric (e.g., Clicks) is decreasing, how would you make a decision?
• Now assume your A/B test result is somewhat negative (e.g., p-value ~= 20%). How would you communicate this to the product manager?
• Given the above 20% p-value, if the product manager still decides to launch the new feature, how would you present your suggestions and warnings?
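Three of the questions above (the cancer-screening Bayes question, simulating a fair coin from an unfair one, and A/B test sample size) lend themselves to short code. The sketch below is one possible way to work them, not a definitive answer. The function names are hypothetical; the fair-coin trick shown is the von Neumann pairing method; the sample-size formula is the common normal-approximation formula for comparing two proportions, with `alpha=0.05` and `power=0.8` as assumed defaults.

```python
import math
import random
from statistics import NormalDist


def posterior_given_positive(prior, sensitivity, specificity):
    """Bayes' rule: P(disease | positive test)."""
    true_pos = sensitivity * prior                 # P(positive and disease)
    false_pos = (1 - specificity) * (1 - prior)    # P(positive and healthy)
    return true_pos / (true_pos + false_pos)


def fair_flip(biased_flip):
    """Von Neumann trick: flip twice, keep the first flip only if they differ.

    (H, T) and (T, H) are equally likely for any fixed bias, so the
    returned value is fair as long as the flips are independent.
    """
    while True:
        first, second = biased_flip(), biased_flip()
        if first != second:
            return first


def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Numbers from the question: 1% prevalence, 99% sensitivity and specificity.
    print(posterior_given_positive(0.01, 0.99, 0.99))

    # Hypothetical coin that lands heads (True) 80% of the time.
    random.seed(0)
    flips = [fair_flip(lambda: random.random() < 0.8) for _ in range(20000)]
    print(sum(flips) / len(flips))

    # Hypothetical baseline rate of 10% and target rate of 12%.
    print(sample_size_per_group(0.10, 0.12))
```

With the numbers in the question, the true positives (0.99 × 0.01) and the false positives (0.01 × 0.99) are equal in size, so the posterior is 50% rather than the 99% that intuition suggests.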
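1. Reference materials
• For A/B testing, the first recommendation is the free A/B Testing course (by Google) on Udacity. Classmates rate it well, and it gives a thorough overview of A/B testing.
• Most of the remaining A/B testing material comes from good articles on Medium. A/B testing has to be tied to real industry use cases; knowing only the theory is little better than knowing nothing. So read articles written by practitioners: only by studying them alongside real cases will you understand how to answer the interview questions.
• If you know senior alumni or older contacts who are already working, do not hesitate to ask them about A/B testing. Ten or twenty minutes of their time can save you hours of digging for material, for the same reason as above.
• For stats, a very quick way to refresh the fundamentals is the intro to stats and prob course on Coursera. It is short enough to finish in one afternoon.
• The Udemy course Data Science Career Guide - Interview Preparation is also quite good. It is lightweight and easy to get through.
• Probability questions are not a problem for most Chinese students; the material was covered in high school and only needs a little refreshing. The Udemy course can help with that.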

1. Interview questions (these vary all over the place, so I won't claim they are the most common)
• Report the largest sum of 3 consecutive numbers in a list, along with the corresponding index.
• Dynamic programming problem: Now you have 5 types of coins (1, 2, 3, 5, 8) and a total sum (a big number, say 589). How many different combinations of coins can you find to reach this total sum?
• Please write a function to reverse the keys and values in a dictionary. When there are repeated values, keep only the first key as the new value.
• Similar to the “gather” and “spread” functions in the tidyr package, write one yourself and test it using the XXX dataset.
• Given a log file with rows featuring a date, a number, and then a string of names, parse the log file and return the count of unique names aggregated by month. (Mine was not this exact question, but it was very similar.)
• Use Python to calculate a 30-day rolling profit. (Roughly, write a rolling window in Python.)
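Several of the coding questions above can be sketched in a few lines of plain Python. The versions below are one possible approach under stated assumptions, not the expected interview answer: function names are hypothetical, "index" is taken to mean the start index of the window, coin combinations are counted without regard to order, and the rolling profit assumes one profit value per day in date order. In practice the last one is often done with pandas (`Series.rolling(30).sum()`).

```python
def max_window_sum(nums, k=3):
    """Largest sum of k consecutive numbers and the window's start index."""
    if len(nums) < k:
        raise ValueError("list is shorter than the window")
    window = sum(nums[:k])
    best, best_start = window, 0
    for i in range(k, len(nums)):
        window += nums[i] - nums[i - k]   # slide the window by one
        if window > best:
            best, best_start = window, i - k + 1
    return best, best_start


def count_coin_combinations(coins, total):
    """Number of unordered coin combinations that add up to total (DP)."""
    ways = [1] + [0] * total              # ways[0] = 1: the empty combination
    for coin in coins:                    # coin in the outer loop avoids
        for amount in range(coin, total + 1):  # counting reorderings twice
            ways[amount] += ways[amount - coin]
    return ways[total]


def invert_dict(d):
    """Swap keys and values; for repeated values keep the first key seen."""
    inverted = {}
    for key, value in d.items():
        inverted.setdefault(value, key)   # later duplicates are ignored
    return inverted


def rolling_sum(values, window=30):
    """Rolling sum; None until a full window is available."""
    result, running = [], 0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        result.append(running if i >= window - 1 else None)
    return result


if __name__ == "__main__":
    print(max_window_sum([1, 5, 2, 8, 3, 1]))
    print(count_coin_combinations([1, 2, 3, 5, 8], 589))
    print(invert_dict({"a": 1, "b": 2, "c": 1}))
    print(rolling_sum([1, 2, 3, 4], window=2))
```

The sliding-window and rolling-sum functions both update a running total instead of re-summing each window, which keeps them linear in the length of the input.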
1. Reference materials

### Remaining content to be continued in the next part…

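1. An intelligent electric vehicle (EV) company in the South Bay of Silicon Valley. It designs, develops, manufactures, and sells smart EVs that integrate seamlessly with advanced internet, artificial intelligence, and autonomous driving technologies. It is committed to in-house R&D and smart manufacturing in order to create a better mobility experience for customers, and to transforming smart EVs through technology and data to shape the future of mobility.

2. A Banking App developer in Southern California, dedicated to creating financial opportunity that advances America's collective potential. Its financial tools, including a debit card and spending account, help more than 8 million customers bank, budget, avoid overdraft fees, find work, and build credit. Partners include Mark Cuban, Norwest Venture Partners, Section 32, and Financial Venture Studios.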

[Job Descriptions/Requirements]

Data Engineer

• Provide seamless and timely data access for your users;
• Build reliable and dependable ETL;
• Build and maintain production machine learning infrastructure;
• Troubleshoot complex issues in distributed systems;
• Debate data processing philosophies and methodologies with your team;
• Familiar with Python, Java, SQL

Machine Learning Engineer

• Profile large-scale training jobs and identify/resolve bottlenecks;
• Increase training speed through mixed precision, faster database design, and preprocessing optimization;
• Work with Infra to build a hyperparameter tuning pipeline and an experiments database;
• Work with Infra to build a release pipeline that includes model pruning, model-to-car release, and writing GPU.

Data Analyst

• Assist the product manager with daily product development, delivery, and client communication duties
• Conduct group research on different business issues based on data in Google Analytics
• Work in a group to improve the marketing copy used to introduce new products directly to customers
• Independently compile competitive product analyses by collecting and analyzing information from financial annual reports and official websites, and present the reports at weekly meetings
• Work in the Business Processing Re-Engineering group to remove redundancy and optimize current sales and marketing processes using ARIS Express
• Prepare dashboards using calculations, parameters, calculated fields, groups, sets, and hierarchies in Tableau
• Publish Tableau dashboards on Tableau Server or Tableau Online and embed them into the portal.