FB Data Scientist Analytics phone screen: notes from a failed interview

Table: user_actions
ds (STRING) | user_id (BIGINT) | post_id (BIGINT) | action (STRING) | extra (STRING)
'2018-07-01' | 1209283021 | 329482048384792 | 'view' |
'2018-07-01' | 1209283021 | 329482048384792 | 'like' |
'2018-07-01' | 1938409273 | 349573908750923 | 'reaction' | 'LOVE'
'2018-07-01' | 1209283021 | 329482048384792 | 'comment' | 'Such nice Raybans'
'2018-07-01' | 1238472931 | 329482048384792 | 'report' | 'SPAM'
'2018-07-01' | 1298349287 | 328472938472087 | 'report' | 'NUDITY'
'2018-07-01' | 1238712388 | 329482048384792 | 'reshare' | 'I wanted to share with you all'

Q1: How many posts were reported yesterday for each report reason?
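A minimal sketch for Q1, assuming the report reason is stored in `extra` on rows where `action = 'report'` (as the sample rows suggest) and that "yesterday" is passed in as a `ds` string. It runs the query with Python's built-in sqlite3 on the sample rows; the interview itself presumably used a different SQL dialect, and the names `build_db` and `reports_by_reason` are made up for illustration.

```python
import sqlite3

# Sample rows copied from the user_actions table in the post.
ROWS = [
    ("2018-07-01", 1209283021, 329482048384792, "view", None),
    ("2018-07-01", 1209283021, 329482048384792, "like", None),
    ("2018-07-01", 1938409273, 349573908750923, "reaction", "LOVE"),
    ("2018-07-01", 1209283021, 329482048384792, "comment", "Such nice Raybans"),
    ("2018-07-01", 1238472931, 329482048384792, "report", "SPAM"),
    ("2018-07-01", 1298349287, 328472938472087, "report", "NUDITY"),
    ("2018-07-01", 1238712388, 329482048384792, "reshare",
     "I wanted to share with you all"),
]

# Assumption: one row per report, reason in `extra`.
# COUNT(DISTINCT post_id) counts posts, not individual reports.
Q1_SQL = """
SELECT extra AS reason,
       COUNT(DISTINCT post_id) AS posts_reported
FROM user_actions
WHERE ds = :yesterday
  AND action = 'report'
GROUP BY extra
ORDER BY extra
"""


def build_db():
    """Create an in-memory database holding the sample rows (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_actions ("
        "ds TEXT, user_id INTEGER, post_id INTEGER, action TEXT, extra TEXT)"
    )
    conn.executemany("INSERT INTO user_actions VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def reports_by_reason(conn, yesterday):
    """Return (reason, number of distinct posts reported) for the given day."""
    return conn.execute(Q1_SQL, {"yesterday": yesterday}).fetchall()


conn = build_db()
q1_result = reports_by_reason(conn, "2018-07-01")
print(q1_result)
```

Counting distinct `post_id` rather than rows matters if the same post is reported several times for the same reason; whether the interviewer wanted posts or reports is worth asking.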

Table: reviewer_removals
ds (STRING) | reviewer_id (BIGINT) | post_id (BIGINT)
'2018-07-01' | 3894729384729078 | 329482048384792
'2018-07-01' | 8477594743909585 | 388573002873499

Q2: What percent of daily content that users view on Facebook is actually Spam?

Q3: Facebook has decided to be proactive about SPAM, instead of merely reactive. We decide to address the SPAM problem through a Machine Learning solution predicting whether a given post is indeed SPAM. We want to use the predictions in order to downrank/deprioritize suspected SPAM from news feed.

Nice! Looks like FB's DS Analytics role is shifting toward ML! Doing only SQL is too boring.

Q3 asks how you would evaluate whether this machine learning model is actually useful.

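He is really only asking what information you would want in order to assess the ML model; there is no discussion of how the model itself works.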

Which metrics would you use?
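Nobody in the thread spells out the metrics, so the following is only one common set of candidates for a binary spam classifier, not what the interviewer necessarily expected: precision, recall, and false positive rate. The sketch assumes ground-truth labels are available (for example from reviewer decisions, which is an assumption), and the function name `classification_metrics` and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute precision, recall and false positive rate for binary labels.

    y_true: 1 if the post really is spam, 0 otherwise (assumed ground truth).
    y_pred: 1 if the model flagged the post as spam, 0 otherwise.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # spam, flagged
    fp = sum(1 for t, p in pairs if not t and p)      # not spam, flagged
    fn = sum(1 for t, p in pairs if t and not p)      # spam, missed
    tn = sum(1 for t, p in pairs if not t and not p)  # not spam, not flagged

    return {
        # Of the posts we downrank, how many are really spam?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the real spam, how much do we catch?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Of the legitimate posts, how many do we wrongly downrank?
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because flagged posts get downranked in news feed, false positives hurt legitimate content, so precision and false positive rate are probably at least as relevant here as recall. An online comparison (for example, spam views or spam reports with and without the model) would be a natural complement to these offline numbers.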

Q2 needs two joins, which is a bit of a trap.

Why two joins? Wouldn't a single left join do?

First you have to join to pull out of the right-hand table the posts that were reported as spam,
then join the viewed posts with that new table.
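The two-join idea can be sketched roughly as follows, assuming "spam" means a post that was reported with reason SPAM and also appears in `reviewer_removals`, that "content viewed" means distinct posts viewed per day, and that removals are matched on `post_id` regardless of date. The rows are toy data invented for illustration (short ids instead of the real ones), run through Python's built-in sqlite3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_actions ("
    "ds TEXT, user_id INTEGER, post_id INTEGER, action TEXT, extra TEXT)"
)
conn.execute(
    "CREATE TABLE reviewer_removals (ds TEXT, reviewer_id INTEGER, post_id INTEGER)"
)

# Toy data: four posts viewed; 101 reported as SPAM, 103 reported as NUDITY.
conn.executemany(
    "INSERT INTO user_actions VALUES (?, ?, ?, ?, ?)",
    [
        ("2018-07-01", 1, 101, "view", None),
        ("2018-07-01", 2, 101, "view", None),
        ("2018-07-01", 1, 102, "view", None),
        ("2018-07-01", 3, 103, "view", None),
        ("2018-07-01", 4, 104, "view", None),
        ("2018-07-01", 2, 101, "report", "SPAM"),
        ("2018-07-01", 3, 103, "report", "NUDITY"),
    ],
)
# Reviewers removed both reported posts.
conn.executemany(
    "INSERT INTO reviewer_removals VALUES (?, ?, ?)",
    [("2018-07-01", 9001, 101), ("2018-07-01", 9002, 103)],
)

TWO_JOIN_SQL = """
WITH spam_posts AS (
    -- Join 1: posts reported as SPAM that a reviewer actually removed.
    SELECT DISTINCT r.post_id
    FROM user_actions AS a
    JOIN reviewer_removals AS r ON r.post_id = a.post_id
    WHERE a.action = 'report' AND a.extra = 'SPAM'
),
viewed AS (
    -- Distinct posts viewed per day.
    SELECT DISTINCT ds, post_id
    FROM user_actions
    WHERE action = 'view'
)
-- Join 2: mark which viewed posts are confirmed spam.
SELECT v.ds,
       100.0 * COUNT(s.post_id) / COUNT(*) AS pct_spam
FROM viewed AS v
LEFT JOIN spam_posts AS s ON s.post_id = v.post_id
GROUP BY v.ds
ORDER BY v.ds
"""

two_join_result = conn.execute(TWO_JOIN_SQL).fetchall()
print(two_join_result)
```

`COUNT(s.post_id)` only counts rows where the left join found a match, so dividing by `COUNT(*)` gives the share of viewed posts that are confirmed spam. If the interviewer meant view events rather than distinct posts, the `DISTINCT` in `viewed` would have to go.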

Thanks for sharing

I see what you mean now, but I don't think that's needed. Everything in the removals table is definitely spam; it doesn't matter whether it was reported as spam.
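Under this reading, a single left join from viewed posts to `reviewer_removals` is enough. A minimal sketch, assuming every removed post counts as spam (this poster's assumption, not something the question statement confirms) and that "content viewed" means distinct posts per day. The rows are toy data invented for illustration, run through Python's built-in sqlite3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_actions ("
    "ds TEXT, user_id INTEGER, post_id INTEGER, action TEXT, extra TEXT)"
)
conn.execute(
    "CREATE TABLE reviewer_removals (ds TEXT, reviewer_id INTEGER, post_id INTEGER)"
)

# Toy data: four posts viewed; 101 reported as SPAM, 103 reported as NUDITY.
conn.executemany(
    "INSERT INTO user_actions VALUES (?, ?, ?, ?, ?)",
    [
        ("2018-07-01", 1, 101, "view", None),
        ("2018-07-01", 2, 101, "view", None),
        ("2018-07-01", 1, 102, "view", None),
        ("2018-07-01", 3, 103, "view", None),
        ("2018-07-01", 4, 104, "view", None),
        ("2018-07-01", 2, 101, "report", "SPAM"),
        ("2018-07-01", 3, 103, "report", "NUDITY"),
    ],
)
# Reviewers removed both reported posts.
conn.executemany(
    "INSERT INTO reviewer_removals VALUES (?, ?, ?)",
    [("2018-07-01", 9001, 101), ("2018-07-01", 9002, 103)],
)

LEFT_JOIN_SQL = """
WITH viewed AS (
    -- Distinct posts viewed per day.
    SELECT DISTINCT ds, post_id
    FROM user_actions
    WHERE action = 'view'
),
removed AS (
    -- Deduplicate so a post removed twice is not double counted.
    SELECT DISTINCT post_id
    FROM reviewer_removals
)
SELECT v.ds,
       100.0 * COUNT(r.post_id) / COUNT(*) AS pct_spam
FROM viewed AS v
LEFT JOIN removed AS r ON r.post_id = v.post_id
GROUP BY v.ds
ORDER BY v.ds
"""

left_join_result = conn.execute(LEFT_JOIN_SQL).fetchall()
print(left_join_result)
```

In this toy data post 103 was removed after a NUDITY report, so this version counts it as spam while a version that also filters on the SPAM report reason would not. Which reading is right depends on whether `reviewer_removals` only ever contains spam, which is worth clarifying with the interviewer.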