Facebook DS: A Complete Preparation Guide

nnn · July 24, 2019, 20:24

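I entered the FB DS interview process through a referral in November 2018. After more than three months of preparation and interviews, HR called in February 2019 to say I had passed the onsite and that FB was starting to prepare an offer. Three weeks later, though, I was still stuck with the immigration team. During that time I went through joy, waiting, anger, resentment, and helplessness, and in the end I accepted it: I signed with a small company and started applying for the next year's H-1B. Funnily enough, FB HR still hasn't given me an answer. The immigration team keeps stalling, and there's no explicit rejection letter either. Oh well, so be it.
First, the timeline. I'm not sure how useful it is as a reference, but everything felt very slow after the second round: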
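18/11/20: HR reached out by email
18/11/21: HR phone call; received the first-round prep material and scheduled the phone interview
18/12/3: First round
18/12/3: In the afternoon HR said my product answers in the first round were not satisfactory but the SQL was fine, and scheduled a second round
18/12/13: Second round
18/12/18: HR said I passed the second round and started arranging the onsite
18/12/27: Handed over to the onsite coordinator
In between, HR asked for immigration materials twice and confirmed that I still had 25+ months of OPT
19/1/14: Onsite scheduled for 19/1/25
19/2/15: The onsite coordinator called to say I had passed and an offer was being prepared
After that I used other offers to push them countless times. To this day the onsite coordinator still says the immigration team is short-staffed and the departments haven't coordinated, so there is still no answer. I gave up.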
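I went back to my home country on vacation on December 20 and returned on January 4, so I had a bit over two weeks of full-time preparation. Added to the previous two months of serious study on weekends and weeknights, this meant all four onsite rounds went fairly smoothly, with no question I hadn't prepared for. Even though I've been in a bad mood lately, I decided today to write a comprehensive FB DS prep post for everyone who is preparing, in gratitude for what people have shared on this forum.
I have never actually worked as a DS at a FLAG company, so corrections are welcome if my understanding is off.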
1. Overview
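I personally think the FB DS interview is like the gaokao (the Chinese college entrance exam): there are standard answers, even for product questions. Almost all questions come from a known pool, so interview performance is directly proportional to the time spent preparing. If your answer misses the point, the interviewer will keep asking "what else? what else?", which is very frustrating. First, the question types:
The FB onsite has four parts, each lasting 30 minutes.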

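1) Analysis Case: Product Interpretation
2) Analysis Case: Applied Data
3) Quantitative Analysis
4) Technical Analysis
(The phone interview is usually 45 minutes of SQL + product, i.e. a combination of 1 and 4 above.)
Of these four parts, 1 is product understanding, 2 combines product and data, 3 is basic probability and statistics, and 4 is SQL. Below is my take on each of the four.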
2. Analysis Case: Product Interpretation
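Before my first phone interview I had no clue about this part. The reference material FB gave was https://medium.com/stellarpeers . After reading the FB-related posts there I was even more confused; I couldn't see how one question could be answered at such length. So my product performance in the first round was not great.
Then I started reading interview reports. The two threads I found best at this stage were: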

https://www.1point3acres.com/bbs … D311%26sortid%3D311
https://www.1point3acres.com/bbs/thread-462895-1-1.html
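Both stress that you should clarify the question, define metrics, and give a structured answer.
But this guiding principle alone is not enough; you need to practice on your own. The forum has a wealth of product interview reports, but they usually list only the questions without answers, so you have to think them through yourself.
Common interview questions: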

How to identify a best friend https://www.1point3acres.com/bbs/thread-465021-1-1.html
Adding a feature to Marketplace
https://www.1point3acres.com/bbs/thread-465018-1-1.html
https://www.1point3acres.com/bbs/thread-466534-1-1.html
SPAM
https://www.1point3acres.com/bbs/forum.php?mod=viewthread&tid=446618&extra=page%3D5%26filter%3Dsortid%26sortid%3D311%26sortid%3D311
https://www.1point3acres.com/bbs/forum.php?mod=viewthread&tid=449091&extra=&page=1
https://www.1point3acres.com/bbs/forum.php?mod=viewthread&tid=446679
https://www.1point3acres.com/bbs/forum.php?mod=viewthread&tid=459951&extra=
Product health
https://www.1point3acres.com/bbs/forum.php?mod=viewthread&tid=373072&extra=&page=1
https://www.1point3acres.com/bbs/thread-464691-1-1.html
https://www.1point3acres.com/bbs/forum.php?mod=viewthread&tid=405396&extra=&page=1
Parents joining FB
https://www.1point3acres.com/bbs/thread-282664-1-1.html
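There are other threads that I won't list one by one. These high-frequency questions all follow the same pattern: defining the problem and choosing metrics gets you halfway there; after that it is divergent thinking, breaking the analysis down by segment.
After the threads, I carefully read the well-known A Collection of Data Science Take-Home Challenges three times (the forum now has a free download; I bought it myself at the time, which hurt). On the first pass I kept thinking, "I paid several hundred dollars for this?" Toward the end I started to get a feel for product thinking. After two more careful passes I concluded that the author clearly comes from industry with rich DS experience; the answers are concise and to the point, and well worth consulting.
The book is nearly 100 pages, so I condensed it into a few question types. The answers below are all condensed from the book: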
1) Classic question: 15% drop in FB group usage:
Clarify: what specifically dropped (which metric), and by how much (is it practically and statistically significant)? If it is not significant, there is no need to go on.
Then high level:
a. Is it a one-time drop or a progressive one?
– A one-time significant drop is very likely a tech issue. A seasonal pattern is also a reasonable explanation.
b. Does the drop happen in other features?
– If other features dropped too, then we have a bigger problem.
c. Cannibalization
d. Does the drop also happen in competitor products? Maybe a competitor launched something new?
– If yes, it may be a cross-platform, industry-wide issue.
Then deep dive (check whether anything changed in one of the segments; or maybe nothing changed within the segments but the distribution across them changed):
a. New user vs old user (Cohort)
b. Language
c. Country
d. Platform
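The remark in the deep-dive step that nothing changed within segments but the distribution changed can be made concrete. The following is a minimal sketch, not taken from the book: it splits the change in an overall rate into a part caused by rates moving inside segments and a part caused by the segment mix shifting. The segment names and numbers are made up for illustration.

```python
def decompose_change(before, after):
    """Split the change in an overall rate into a rate effect and a mix effect.

    before/after map segment -> (users, rate), where rate is the per-user
    metric in that segment (e.g. share of users active in groups).
    Both dicts are assumed to contain the same segments.
    """
    total_before = sum(users for users, _ in before.values())
    total_after = sum(users for users, _ in after.values())
    rate_effect = 0.0
    mix_effect = 0.0
    for segment in before:
        users_b, rate_b = before[segment]
        users_a, rate_a = after[segment]
        share_b = users_b / total_before
        share_a = users_a / total_after
        # Rate moved inside the segment, holding the old mix fixed.
        rate_effect += share_b * (rate_a - rate_b)
        # Segment share moved, evaluated at the new rate.
        mix_effect += (share_a - share_b) * rate_a
    return rate_effect, mix_effect


# Hypothetical data: (users, share of users active in groups)
before = {"new": (200, 0.10), "old": (800, 0.50)}
after = {"new": (500, 0.10), "old": (500, 0.50)}

rate_effect, mix_effect = decompose_change(before, after)
print(f"rate effect: {rate_effect:+.3f}, mix effect: {mix_effect:+.3f}")
```

In this example the overall rate falls from 0.42 to 0.30 even though neither segment's rate moved; the whole drop comes from new users becoming a larger share of the population.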
2) How to improve the product?
The question is not asking you to be visionary, but to check whether you can find things in datasets as a data scientist. Always try to incentivize “good” and disincentivize “bad”.
First, define the target, say engagement (in order to move long-term retention and revenue).
Then choose a metric to evaluate engagement, e.g. the proportion of users who take at least one action per day interacting with the site.
Pick variables that would move the metric: use both user characteristics and user behavior.
Use a model (random forest is good here) to check the relationship between segments and engagement. Come up with several scenarios to explain the results and make suggestions based on them (which segment to improve).
3) Fake/Fraud detection:
The key with fraud is that it does not happen only once: people who commit fraud will repeat it if they are not caught. All variables are really about something that should be unique but is not, or about extreme values. Hence two main ways to capture fraud:
Same device/IP/bank account/phone number as existing accounts;
Anomaly detection: find outliers (e.g. an extremely low price)
– More specifically, for Marketplace postings we can look at the listing and the seller. For a listing: pictures should not be stolen from elsewhere, descriptions should not be copied, resolution should not be too low, and the price should not be too low.
– With fake profiles (say a fake school): use an ML algorithm or anomaly detection to find outliers. For instance, you may include as variables the percentage of connections who went to the same school, interaction with people from the same school, and the acceptance rate for requests from the same school. To minimize fake profiles, you may want to use 2-step verification for risky users (the goal is to minimize false negatives; you may not want to apply this to all users).
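The two ways of capturing fraud listed above can be sketched roughly as follows. This is an illustration under assumptions, not the book's method: the field names (`device_id`, `ip`, `phone`, `price`) are hypothetical, and the outlier rule is the common modified z-score based on the median absolute deviation, with the conventional 0.6745 constant and 3.5 cutoff.

```python
from collections import defaultdict
from statistics import median


def shared_identifiers(accounts, keys=("device_id", "ip", "phone")):
    """Return {(key, value): [account ids]} for values used by 2+ accounts."""
    seen = defaultdict(list)
    for account in accounts:
        for key in keys:
            value = account.get(key)
            if value is not None:
                seen[(key, value)].append(account["id"])
    return {k: ids for k, ids in seen.items() if len(ids) > 1}


def low_price_outliers(listings, threshold=3.5):
    """Flag listings priced far below the median (modified z-score via MAD)."""
    prices = [item["price"] for item in listings]
    med = median(prices)
    mad = median(abs(p - med) for p in prices)
    if mad == 0:
        return []  # no spread, so nothing can be called an outlier
    flagged = []
    for item in listings:
        robust_z = 0.6745 * (item["price"] - med) / mad
        if robust_z < -threshold:
            flagged.append(item["id"])
    return flagged


# Hypothetical data
accounts = [
    {"id": "a1", "device_id": "d1", "phone": "555-0100"},
    {"id": "a2", "device_id": "d2", "phone": "555-0100"},
    {"id": "a3", "device_id": "d3", "phone": "555-0199"},
]
listings = [
    {"id": "l1", "price": 100}, {"id": "l2", "price": 110},
    {"id": "l3", "price": 95}, {"id": "l4", "price": 105},
    {"id": "l5", "price": 102}, {"id": "l6", "price": 5},
]

print(shared_identifiers(accounts))
print(low_price_outliers(listings))
```

A real system would compare prices within a product category rather than across all listings; the single pool here just keeps the sketch short.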
4) What features to add?
Again, don't be tempted to play the visionary. Start from the datasets: look at current data and identify what you want to incentivize people to do, then simplify the procedure for doing it. You can also learn customer needs from complaints or comments. Then run an A/B test to see whether the feature meets those needs.
E.g.: figure out a way for a user to finish a task in one click, or check use cases to find opportunities.
5) Should we introduce XXX feature?
Layer of logic:
If add, what benefits will we get?
Do we have customer needs? (check from comments or user behavior)
A/B testing process
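Product questions sometimes mix in A/B testing concepts. For this part I mainly relied on the Udacity course https://classroom.udacity.com/courses/ud257 , which I went through three times.
That gave me a basic understanding of A/B testing: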
A/B Testing Process
Goal (increase revenue? Engagement? new user? old user?)
Metrics (invariant + evaluation)
a. For a long-term goal, use a short-term proxy
b. Invariant metrics are for sanity checks
c. Think about how spam/bots would influence your metric
d. Important: when choosing metrics, make sure the metric moves in the direction you expect, that the change is unlikely to be due to bot behavior, and that it does not take too long to evaluate
Unit of diversion and unit of analysis?
a. If they are not the same, you need to use empirical variability
Size and Duration
a. Size is determined by alpha (significance level), power (1 - beta), and the variability of the metric
b. Using the size and the proportion of traffic applied, we can get the duration (if it is greater than 14 days you're done; if less, you may still need to run for 14 days to capture the weekly patterns)
c. Important: once size and duration are determined, you cannot stop halfway because the test result looks great and promising. The size is predetermined in order to reach the required alpha and power. (It is like a nine-game competition: you cannot stop and announce the winner simply because one player wins three games in a row.)
Analyze result
a. Sanity check (make sure test and control are comparable)
b. One metric is easier: construct a confidence interval from the observed difference and its standard error (diff ± z × SE)
c. Multiple metrics: false positives become more common as the number of metrics increases. Use the Bonferroni correction
Make suggestion
a. Do I have statistical/practical significance?
b. Do I understand the change? Who is going to be impacted?
c. Is it worth it? Cost vs benefit?
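The size, duration, and analysis steps above can be put into numbers. The sketch below assumes the metric is a rate (such as a click-through probability) and uses the standard normal-approximation formulas for two proportions. The function names, the 14-day floor as a default, and the example inputs are illustrative choices, not something prescribed by the course.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_base, min_effect, alpha=0.05, power=0.8):
    """Approximate users per group needed to detect an absolute lift in a rate."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = z.inv_cdf(power)           # power = 1 - beta
    p_new = p_base + min_effect
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_beta) ** 2 * variance / min_effect ** 2)


def duration_days(n_per_group, daily_users, traffic_share, min_days=14):
    """Days needed to fill both groups, but never fewer than min_days."""
    days = ceil(2 * n_per_group / (daily_users * traffic_share))
    return max(days, min_days)


def diff_confidence_interval(x_control, n_control, x_treat, n_treat, alpha=0.05):
    """CI for (treatment rate - control rate) using the unpooled standard error."""
    p_c = x_control / n_control
    p_t = x_treat / n_treat
    se = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treat)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    diff = p_t - p_c
    return diff - margin, diff + margin


def bonferroni_alpha(alpha, n_metrics):
    """Per-metric significance level when testing several metrics at once."""
    return alpha / n_metrics


# Hypothetical inputs: 10% baseline rate, want to detect a 2-point lift.
n = sample_size_per_group(p_base=0.10, min_effect=0.02)
days = duration_days(n, daily_users=10_000, traffic_share=0.5)
low, high = diff_confidence_interval(1000, 10_000, 1200, 10_000)
print(n, days, (round(low, 4), round(high, 4)), bonferroni_alpha(0.05, 5))
```

If the interval excludes zero, the change is statistically significant; whether it is practically significant depends on comparing the interval with the minimum effect you care about. With several evaluation metrics, pass `bonferroni_alpha(alpha, n_metrics)` as the `alpha` for each interval.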
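3. Analysis Case: Applied Data
In a sense this part follows directly from the previous one. The main difference is that here you analyze the problem from the data side. An example from my own onsite: FB wants to recommend restaurants; how would you do it?
I answered that there are three options: 1) the user's own like history, 2) friends' like history, and 3) rankings for the user's region. The follow-up asks what data structure each option needs and what problems could arise. The main problem here is that the data for 1 and 2 is mostly empty, so when it is empty you recommend mainly from the regional information. This part may or may not test SQL (mine had none), but you need a basic understanding of the data required to solve the problem, the issues that data may have, and how to address them.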
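The three signals and the empty-data fallback described above could be combined roughly as follows. This is a toy sketch under assumptions: the data structures (a likes map, a friends map, a restaurant-to-category map, and a ranked list for the user's region) and the simple additive scoring are invented for illustration and are not what was discussed in the interview.

```python
from collections import Counter


def recommend(user_id, likes, friends, category, regional_rank, k=3):
    """Recommend up to k restaurants, falling back to regional ranking.

    likes: user_id -> set of liked restaurant ids
    friends: user_id -> list of friend user ids
    category: restaurant id -> cuisine category
    regional_rank: restaurant ids for the user's region, best first
    """
    own = likes.get(user_id, set())
    liked_categories = {category[r] for r in own}
    scores = Counter()
    # Signal 2: restaurants liked by friends.
    for friend in friends.get(user_id, []):
        for restaurant in likes.get(friend, set()):
            scores[restaurant] += 1
    # Signal 1: local restaurants in categories the user already likes.
    for restaurant in regional_rank:
        if category[restaurant] in liked_categories:
            scores[restaurant] += 1
    picks = [r for r, _ in scores.most_common() if r not in own]
    # Signal 3 / cold start: fill remaining slots from the regional ranking.
    for restaurant in regional_rank:
        if restaurant not in own and restaurant not in picks:
            picks.append(restaurant)
    return picks[:k]


# Hypothetical data
likes = {"u1": {"r1"}, "u2": {"r2", "r3"}, "u3": {"r2"}}
friends = {"u1": ["u2", "u3"], "u4": []}
category = {"r1": "thai", "r2": "pizza", "r3": "thai", "r4": "sushi", "r5": "pizza"}
regional_rank = ["r4", "r2", "r1", "r3", "r5"]

print(recommend("u1", likes, friends, category, regional_rank))
print(recommend("u4", likes, friends, category, regional_rank))
```

User `u4` has no likes and no friends, so the result is simply the top of the regional ranking, which is the cold-start behavior the answer describes.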
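4. Quantitative Analysis
Anyone who has studied probability theory and mathematical statistics should have no trouble with this part.
Bayes' formula P(A|B) is almost always tested, along with some confidence interval, p-value, and A/B testing questions, plus the distribution of common metrics (the answer is the exponential distribution).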
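Since Bayes' formula comes up so often, here is a small worked example of P(A|B) = P(B|A)P(A) / P(B), with P(B) expanded by the law of total probability. The fake-account scenario and all of its numbers are hypothetical, chosen only to show the arithmetic.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """P(H | E) from Bayes' rule, expanding P(E) over H and not-H."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / evidence


# Hypothetical: 2% of accounts are fake; a classifier flags 90% of fake
# accounts and 5% of real ones. How likely is a flagged account to be fake?
p_fake_given_flag = posterior(prior=0.02,
                              p_evidence_given_h=0.90,
                              p_evidence_given_not_h=0.05)
print(round(p_fake_given_flag, 3))
```

The posterior is 0.018 / (0.018 + 0.049), roughly 27%: even a fairly accurate classifier produces mostly false alarms when the base rate is low, which is the usual point of this type of question.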
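The question about the expectation and variance of "one ad every 25" versus "an ad with 4% probability" has also come up often in recent months.
I think one pass through basic statistics is enough. You should be familiar with the expectation and variance of the common distributions.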
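The post does not spell out the ad question, so the following rests on one reading of it: compare showing an ad after every 25th feed item with making each feed item an ad independently with probability 4%, and ask for the mean and variance of the number of ads. The 100-item feed length is also an assumption.

```python
def fixed_schedule(n_items, every=25):
    """Ads shown when one ad follows every `every` items: deterministic."""
    return n_items // every, 0.0


def random_schedule(n_items, p=0.04):
    """Ads shown when each item is an ad with probability p: Binomial(n, p)."""
    mean = n_items * p
    variance = n_items * p * (1 - p)
    return mean, variance


n_items = 100  # assumed feed length
print(fixed_schedule(n_items))
print(random_schedule(n_items))
```

Under these assumptions both schemes show 4 ads on average, but the fixed schedule has zero variance while the random one has variance of about 3.84, so some users would see noticeably more or fewer ads than others.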
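5. Technical Analysis
This part is all about heavy, repeated practice. SQL itself is simple, but you get nervous on the spot and time is short, so you have to write it regularly. By my own count I wrote each common question about 3-5 times and practiced 150+ SQL queries on a whiteboard in total. Even when I felt fully prepared I was still a bit nervous on site; if you don't write much beforehand, you may well get stuck.
My practice order was:
Mode Analytics to review the basics, https://mode.com/sql-tutorial/ . I went through it quickly, focusing on reviewing window functions
The SQL in A Collection of Data Science Take-Home Challenges, which I wrote out twice
Right before the onsite another comprehensive thread appeared; it is well organized too, and I practiced its SQL two or three times.
https://www.1point3acres.com/bbs/thread-472684-1-1.html
SQL basically feels new every time you write it. Sometimes you only notice a problem while writing; just reading other people's queries, you may not spot it.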
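For anyone wanting a quick way to practice the window functions mentioned above, the following sketch runs a query against an in-memory SQLite database from Python. The `posts` table, its columns, and its rows are made up and are not from any interview question. Window functions need SQLite 3.25 or newer, which most current Python builds bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (user_id TEXT, post_date TEXT, likes INTEGER);
INSERT INTO posts VALUES
  ('a', '2019-01-01', 3),
  ('a', '2019-01-02', 5),
  ('a', '2019-01-03', 1),
  ('b', '2019-01-01', 7),
  ('b', '2019-01-04', 2);
""")

# Running total of likes per user, plus each post's rank by likes per user.
query = """
SELECT user_id,
       post_date,
       likes,
       SUM(likes) OVER (PARTITION BY user_id ORDER BY post_date) AS running_likes,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY likes DESC) AS like_rank
FROM posts
ORDER BY user_id, post_date;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Writing the query out, running it, and checking the result against what you expected is a cheap way to catch the mistakes that are hard to see when only reading other people's SQL.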
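6. Summary
In short: you reap what you sow, and diligent practice pays off. Even though the immigration team held things up in the end, I still feel I gained a lot from the process.
I heard from a friend that FB DS total compensation can reach 200k+ this year. I'm green with envy. Good luck to everyone who is interviewing!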

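nnn · July 24, 2019, 20:25

May I ask which metric the exponential distribution you mentioned refers to?
I see many people mention the comment/DAU metric; events occurring within a fixed time …

The distribution of a post's total replies, shares, and the like is not about a fixed time window.
These social network metrics are usually exponential (Exponential distribution - Wikipedia), because the vast majority of posts get no replies or shares while some celebrities' posts get a great many: a long tail, with a large mass of zeros on the left.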

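nnn · July 24, 2019, 20:26

Why did you get stuck with the immigration team??

Probably because I had 23.5 months of OPT left when I got the offer. But first, that was itself caused by the drawn-out interview process. Second, if they want to reject me, they should just reject me; instead they still won't give a straight answer. Speechless.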

Yjc · October 6, 2019, 04:40

Very helpful! Thank you for taking the trouble to put this together.

data123 · August 7, 2020, 22:14

This is an absolutely amazing post. Thank you!

orange97 · November 3, 2020, 00:28

This post is really detailed, excellent!! Thank you!! At the risk of being shameless, could you please share A Collection of Data Science Take-Home Challenges? I'd be very grateful.