Facebook DS: A Complete Preparation Post

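I entered the FB DS interview process through a referral in November 2018. After more than three months of preparation and interviews, HR called in February 2019 to say I had passed the onsite and that FB was starting to prepare the offer. Three weeks later, however, I was still stuck with the immigration team. Along the way I went through joy, waiting, anger, resentment, and helplessness, and in the end I accepted it: I signed with a small company and started applying for the new year's H-1B. Funny enough, FB HR still has not given me an answer. The immigration team keeps stalling, and there is not even a clear rejection. So be it.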
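18/11/20: HR contacted me by email
18/11/21: HR phone call; received the prep materials for the first round and scheduled the phone interview
18/12/3: First phone interview
18/12/13: Second phone interview
18/12/18: HR said I had passed the second round and started scheduling the onsite
19/1/14: Onsite scheduled for 19/1/25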
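After that I used other offers to push them countless times. The onsite coordinator still says the immigration team is short-staffed and the departments are not coordinated, so there is still no answer. I gave up.
I went back to China on vacation on December 20 and returned on January 4, so I prepared full-time for a bit over two weeks. Together with the two preceding months of serious studying on weekends and weekday evenings, the four onsite interviews all felt fairly smooth, and there was no question I had not prepared for. Although I have been in a bad mood lately, I decided today to write a comprehensive FB DS preparation post for everyone who is preparing, in gratitude for what people have shared on this forum.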
1. Overview
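To me the FB DS interview feels like the gaokao (China's college entrance exam): every question has a standard answer, even the product questions. Almost all questions come from a known question bank, so interview performance is directly proportional to the time spent preparing. If your answer misses the point, the interviewer keeps asking "what else? what else?", which is very frustrating. First, the question types: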

  • Analysis Case: Product Interpretation
  • Analysis Case: Applied Data
  • Quantitative Analysis
  • Technical Analysis
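The four parts above are, in order: (1) understanding of the product, (2) combining product and data, (3) basic probability and statistics, and (4) SQL. Below is my take on each.
2. Analysis Case: Product Interpretation
Before my first phone interview I had no idea how to approach this part. The reference FB gave was https://medium.com/stellarpeers . After reading its posts about FB I was even more confused and could not see how a single question could have such a long answer, so my product section in the first round was not great.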

https://www.1point3acres.com/bbs … D311%26sortid%3D311
These posts all stress clarifying the question, defining metrics, and giving a structured answer.

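  1. How to identify a best friend https://www.1point3acres.com/bbs/thread-465021-1-1.html
  2. Adding a feature to Marketplace
  3. Spam
  4. Product health
  5. Parents joining FB
There are other posts that I will not list one by one. These high-frequency questions all follow the same pattern: defining the problem and choosing the metrics gets you halfway there; the rest is thinking broadly and analyzing segment by segment.
After the posts, I carefully read the well-known A Collection of Data Science Take-Home Challenges three times (a free download is now available on the forum; I bought my own copy, which hurt). During the first read I kept thinking: I paid several hundred dollars for this? Toward the end I started to get a feel for product thinking, and after two more careful reads I concluded that the author really does come from industry with rich DS experience. The answers are concise and well worth using as a reference.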
  1) Classic question: 15% drop in FB group usage
    1. Clarify: what specifically dropped (which metric), and by how much (practically significant / statistically significant)? If the drop is not significant, there is no need to go on.
    2. Then high level:
      a. Is it a one-time drop or a progressive one?
        – A one-time significant drop is highly likely a tech issue. Seasonality is also a valid explanation.
      b. Does the drop happen in other features?
        – If other features dropped too, we have a bigger problem.
      c. Cannibalization
      d. Does the drop also happen in competitor products? Maybe a competitor launched something new?
        – If yes, it may be a cross-platform industry issue.
    3. Then deep dive by segment (something may have changed within one segment, or nothing changed within segments but the distribution across them shifted):
      a. New users vs. old users (cohort)
      b. Language
      c. Country
      d. Platform
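
The deep-dive step separates two cases: the metric moved inside a segment, or the segment mix moved. A minimal sketch of that split, assuming a hypothetical layout where each segment maps to (users, usage rate); the segment names and all numbers are made up for illustration:

```python
def decompose_change(before, after):
    """Split the change in an overall rate into a within-segment part
    and a mix-shift part.

    before/after: dict segment -> (users, rate). Hypothetical layout;
    both dicts are assumed to contain the same segments.
    """
    total_before = sum(users for users, _ in before.values())
    total_after = sum(users for users, _ in after.values())
    rate_effect = 0.0
    mix_effect = 0.0
    for seg in before:
        users_b, rate_b = before[seg]
        users_a, rate_a = after[seg]
        share_b = users_b / total_before
        share_a = users_a / total_after
        # Rate changed inside the segment, holding the old mix fixed.
        rate_effect += share_b * (rate_a - rate_b)
        # Segment share changed, valued at the new rate.
        mix_effect += (share_a - share_b) * rate_a
    return rate_effect, mix_effect


# Made-up example: no segment changed its rate, but new users tripled.
before = {"new": (1000, 0.20), "old": (1000, 0.60)}
after = {"new": (3000, 0.20), "old": (1000, 0.60)}
rate_effect, mix_effect = decompose_change(before, after)
print(round(rate_effect, 4), round(mix_effect, 4))
```

The two parts sum to the change in the overall rate, so a drop that sits entirely in `mix_effect` points to a shift in who is using the product rather than to a change in behavior.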
  2) How to improve the product?
    The question is not asking you to be visionary, but to check whether you, as a data scientist, can find things in the data. Always try to incentivize "good" behavior and disincentivize "bad" behavior.
    1. First, define the target, say engagement (in order to move long-term retention and revenue).
    2. Then choose the metric used to evaluate engagement, e.g. the proportion of users who take at least one action per day on the site.
    3. Pick variables that could move the metric: use both user characteristics and user behavior.
    4. Use a model (random forest is a good choice here) to check the relationship between segments and engagement. Come up with several scenarios to explain the results and make suggestions based on them (which segment to improve).
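
As a rough illustration of the engagement metric above (share of users with at least one action per day), here is a minimal sketch that compares it across segments. It is a plain per-segment comparison standing in for the random forest step, not the model itself; the data layout, segment labels, and values are assumptions:

```python
from collections import defaultdict


def engagement_by_segment(users, actions, day):
    """Share of users in each segment with at least one action on `day`.

    users:   dict user_id -> segment label (hypothetical layout)
    actions: iterable of (user_id, day) pairs, one per action
    """
    active = {uid for uid, d in actions if d == day}
    totals = defaultdict(int)
    engaged = defaultdict(int)
    for uid, segment in users.items():
        totals[segment] += 1
        if uid in active:
            engaged[segment] += 1
    return {seg: engaged[seg] / totals[seg] for seg in totals}


# Made-up data for illustration.
users = {1: "new", 2: "new", 3: "old", 4: "old", 5: "old", 6: "new"}
actions = [
    (1, "2019-01-01"),
    (3, "2019-01-01"),
    (3, "2019-01-01"),  # repeated actions still count the user once
    (4, "2019-01-01"),
    (5, "2019-01-02"),
]
rates = engagement_by_segment(users, actions, "2019-01-01")
print(rates)
```

A large gap between segments is the kind of finding that would then be checked with a model and turned into a suggestion about which segment to improve.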
  3) Fake/fraud detection:
    The key with fraud is that it does not happen only once: people who commit fraud will repeat it if they are not caught. All the variables are really about something that should be unique but is not, or about extreme values. Hence two main ways to capture fraud:
    1. Same device, IP, bank account, or phone number as an existing account;
    2. Anomaly detection: find outliers (e.g. an extremely low price).
    – More specifically, for Marketplace postings we can look at both the listing and the seller. For the listing: pictures should not be stolen from elsewhere, descriptions should not be copied, the resolution should not be too low, and the price should not be too low.
    – For fake profiles (say, a fake school): use an ML algorithm or anomaly detection to find outliers. For instance, you may include as variables the percentage of connections who went to the same school, interactions with people from the same school, and the acceptance rate of friend requests sent to people from the same school. To minimize fake profiles, you may want to use 2-step verification for risky users (you want to minimize false negatives, but you may not want to apply this to all users).
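
The two checks above (an identifier that should be unique but is shared, and an extreme value) can be sketched as follows. The field names, the 0.2 threshold, and the data are assumptions chosen for illustration, not anything FB is known to use:

```python
from collections import defaultdict
from statistics import median


def shared_identifier_accounts(accounts, key):
    """Return {identifier value: [account ids]} for values used by more
    than one account."""
    groups = defaultdict(list)
    for acc in accounts:
        groups[acc[key]].append(acc["id"])
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


def low_price_outliers(listings, ratio=0.2):
    """Flag listings priced below `ratio` times the median price of their
    category (a deliberately simple outlier rule)."""
    by_category = defaultdict(list)
    for item in listings:
        by_category[item["category"]].append(item["price"])
    medians = {cat: median(prices) for cat, prices in by_category.items()}
    return [
        item["id"]
        for item in listings
        if item["price"] < ratio * medians[item["category"]]
    ]


# Made-up data for illustration.
accounts = [
    {"id": "a1", "phone": "555-0100"},
    {"id": "a2", "phone": "555-0100"},
    {"id": "a3", "phone": "555-0199"},
]
listings = [
    {"id": "l1", "category": "phone", "price": 500},
    {"id": "l2", "category": "phone", "price": 520},
    {"id": "l3", "category": "phone", "price": 480},
    {"id": "l4", "category": "phone", "price": 30},
]
shared = shared_identifier_accounts(accounts, "phone")
suspicious = low_price_outliers(listings)
print(shared, suspicious)
```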
  4) What features to add?
    Again, do not be tempted to play the visionary. Start from the data: look at current data and identify what you want to incentivize people to do, then simplify the procedure for doing it. You can also learn customer needs from complaints or comments. Then run an A/B test to see whether the feature meets the goal.
    E.g.: find a way for a user to finish a task in one click; review use cases to find opportunities.
  5) Should we introduce XXX feature?
    Layers of logic:
    1. If we add it, what benefits will we get?
    2. Is there a customer need? (check comments or user behavior)
    3. A/B testing process
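    Product questions sometimes mix in A/B testing concepts. For this part I mainly relied on the Udacity course https://classroom.udacity.com/courses/ud257 , which I went through three times and which gave me a basic understanding of A/B testing: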
    A/B Testing Process
  1. Goal (increase revenue? engagement? new users? old users?)
  2. Metrics (invariant + evaluation)
    a. For a long-term goal, use a short-term proxy
    b. Invariant metrics are for sanity checks
    c. Think about how spam/bots would influence your metric
    d. Important: when choosing metrics, make sure the direction of change in the metric is in line with your expectation, the change is unlikely to be caused by bot behavior, and the metric does not take too long to evaluate
  3. Unit of diversion and unit of analysis?
    a. If they are not the same, you need to use empirical variability
  4. Size and duration
    a. Size is determined by alpha (significance level), power (1 - beta), the variability of the metric, and the minimum effect you want to detect
    b. From the size and the proportion of traffic in the experiment you get the duration (if it is greater than 14 days you are done; if less, you may still need 14 days to capture weekly patterns)
    c. Important: once size and duration are determined, you cannot stop halfway because the result looks great and promising. The size is predetermined in order to reach the required alpha and power. (It is like a competition with 9 games: you cannot stop and announce the winner simply because one player won 3 games in a row.)
  5. Analyze results
    a. Sanity check (make sure test and control are comparable)
    b. One metric is easier: construct a confidence interval from the observed difference and its standard error
    c. Multiple metrics: false positives become more common as the number of metrics increases; use the Bonferroni correction
  6. Make a suggestion
    a. Do I have statistical/practical significance?
    b. Do I understand the change? Who is going to be impacted?
    c. Is it worth it? Cost vs. benefit?
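
The size, duration, and analysis steps can be sketched for a simple rate metric (such as a click-through probability). This is one common normal-approximation approach under assumed inputs, not necessarily what FB uses; the baseline rate, lift, and traffic numbers are made up:

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_base, min_effect, alpha=0.05, power=0.8):
    """Approximate users per group needed to detect an absolute lift of
    `min_effect` over baseline rate `p_base` with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_new = p_base + min_effect
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / min_effect ** 2)


def duration_days(n_per_group, daily_traffic, share_in_test, min_days=14):
    """Days needed to fill both groups, never shorter than `min_days`
    so that weekly patterns are covered."""
    days = math.ceil(2 * n_per_group / (daily_traffic * share_in_test))
    return max(days, min_days)


def diff_confidence_interval(x_ctrl, n_ctrl, x_test, n_test, alpha=0.05):
    """Confidence interval for (test rate - control rate), using the
    pooled standard error."""
    p_pool = (x_ctrl + x_test) / (n_ctrl + n_test)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_test))
    diff = x_test / n_test - x_ctrl / n_ctrl
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff - z * se, diff + z * se


def bonferroni_alpha(alpha, n_metrics):
    """Per-metric significance level when testing several metrics."""
    return alpha / n_metrics


n = sample_size_per_group(p_base=0.10, min_effect=0.02)
days = duration_days(n, daily_traffic=1000, share_in_test=0.5)
low, high = diff_confidence_interval(100, 1000, 130, 1000)
print(n, days, round(low, 4), round(high, 4), bonferroni_alpha(0.05, 5))
```

If the interval excludes 0, the result is statistically significant; comparing its lower bound with the minimum effect you care about covers the practical-significance question in the last step.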
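3. Analysis Case: Applied Data
My answer was that there could be three sources: 1) the user's own like history, 2) friends' history, 3) rankings in the user's region. The follow-up then asks what data structure each one needs and what problems may come up. The main problem here is that the data for 1 and 2 is mostly empty, and in that case the recommendation relies mainly on region. This part may or may not test SQL (mine had none), but you need a basic understanding of the data required to solve the problem, the issues that data may have, and how to address them.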
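
The fallback described here (use the user's own history, then friends' history, and fall back to the regional ranking when both are empty) can be sketched as below. The dict-based data structures, the function name `recommend`, and the item names are hypothetical; what is being recommended is left generic:

```python
def recommend(user_id, own_candidates, friend_candidates, region_top,
              user_region, k=3):
    """Pick up to k items, preferring the user's own history, then
    friends' history, then the regional ranking.

    own_candidates / friend_candidates: dict user_id -> ranked item list
        (mostly empty for new users, which is the cold-start problem)
    region_top:  dict region -> ranked item list
    user_region: dict user_id -> region
    """
    sources = (
        own_candidates.get(user_id, []),
        friend_candidates.get(user_id, []),
        region_top.get(user_region.get(user_id), []),
    )
    picks = []
    for source in sources:
        for item in source:
            if item not in picks:
                picks.append(item)
            if len(picks) == k:
                return picks
    return picks


# Made-up data: u1 has history, u2 is a new user with none.
own = {"u1": ["x"]}
friends = {"u1": ["y", "a"]}
region_top = {"US": ["a", "b", "c", "d"]}
user_region = {"u1": "US", "u2": "US"}
warm = recommend("u1", own, friends, region_top, user_region)
cold = recommend("u2", own, friends, region_top, user_region)
print(warm, cold)
```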
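4. Quantitative Analysis
Bayes' rule, P(A|B), is almost always tested, along with some questions on confidence intervals, p-values, and A/B testing, plus the distribution of common metrics (the answer is the exponential distribution).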
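
For reference, Bayes' rule is P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded by the law of total probability. A minimal worked sketch with made-up numbers (1% of accounts are fake; a hypothetical classifier flags 95% of fake accounts and 2% of real ones):

```python
def posterior(prior, p_b_given_a, p_b_given_not_a):
    """P(A | B) by Bayes' rule, with P(B) from the law of total
    probability."""
    p_b = p_b_given_a * prior + p_b_given_not_a * (1 - prior)
    return p_b_given_a * prior / p_b


# A = account is fake, B = account is flagged (all numbers made up).
p_fake_given_flag = posterior(prior=0.01, p_b_given_a=0.95,
                              p_b_given_not_a=0.02)
print(round(p_fake_given_flag, 3))  # about 0.324
```

Even with a fairly accurate classifier, most flagged accounts are real because fakes are rare, which is the typical point of this kind of interview question.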
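5. Technical Analysis
This part is all about lots of repeated practice. SQL is actually simple, but you get nervous on the spot and time is short, so you have to write it regularly. By my own count I wrote each common question 3 to 5 times and practiced 150+ SQL queries on a whiteboard in total. Even when you feel you have no problems at all, you will still be a bit nervous on site, and if you have not written much you may well get stuck.
  1. Reviewed the basics with the Mode Analytics tutorial https://mode.com/sql-tutorial/ . I went through it quickly, focusing on reviewing window functions.
  2. Wrote the SQL in A Collection of Data Science Take-Home Challenges twice.
  3. Shortly before the onsite another comprehensive post appeared that was also well organized; I practiced its SQL two or three times as well.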
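
Since window functions are the main review focus, here is a small practice sketch using Python's built-in sqlite3 module. The `posts` table and its columns are made up, and window functions need SQLite 3.25 or newer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (user_id INTEGER, post_date TEXT, likes INTEGER);
    INSERT INTO posts VALUES
        (1, '2019-01-01', 5),
        (1, '2019-01-02', 9),
        (2, '2019-01-01', 3),
        (2, '2019-01-03', 3),
        (2, '2019-01-04', 7);
""")

# For each post: its rank by likes within the user's posts, and the
# user's running total of likes ordered by date.
query = """
    SELECT user_id, post_date, likes,
           RANK() OVER (PARTITION BY user_id ORDER BY likes DESC) AS like_rank,
           SUM(likes) OVER (PARTITION BY user_id ORDER BY post_date) AS running_likes
    FROM posts
    ORDER BY user_id, post_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that `RANK()` gives tied rows the same rank and skips the next one, which is a frequent follow-up when contrasting it with `DENSE_RANK()` and `ROW_NUMBER()`.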
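6. Summary
In short: you reap what you sow, and steady practice pays off. Even though things were held up by the immigration team in the end, I still feel I gained a lot from the process.
I hear from friends that FB DS total compensation this year can reach 200k+. So envious. Good luck to everyone who is interviewing!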

I see many people mention the comment/DAU metric, events occurring within a fixed time window …

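The distributions of a post's total reply count, share count, and so on are not about a fixed time window.
These social network metrics are usually exponential https://en.wikipedia.org/wiki/Exponential_distribution because the vast majority of posts get no replies and no shares, while some celebrities' posts get a great many: a long tail, with a large mass at 0 on the left.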

Why did you get stuck with the immigration team??

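Probably because I still had 23.5 months of OPT when I got the offer. But first, that was itself caused by the drawn-out interview process; second, if they want to reject me they should just reject me. Instead they still will not give a straight answer. Speechless.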



This post is really detailed, great!! Thank you!! If I may be a bit shameless, could you share A Collection of Data Science Take-Home Challenges? Much appreciated.