Sharing the topics I used during a 3-month NLP internship

Roys · October 3, 2019, 22:53

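I'm a Stanford MS student, and last summer I did NLP at a fully US-funded VC. Yes, a VC: I can't help marveling that the power of technology has spread from the secondary market into every corner of the investment world. The projects were a lot of fun: I built a weak-supervision labeling pipeline with Snorkel, a text classifier, and a text clustering model. I won't go into the details of each project. These techniques have long been applied in other fields, you can find examples online, and I'll share some links below. They are still fairly new in VC, though, and according to my boss we are at least two years ahead of the rest of the industry in DI & Sourcing.
Below are links for the topics I used, which you can refer to when preparing for NLP-related interviews. They only cover what came up in my internship projects (in both Chinese and English):
First, I highly recommend the Stanford NLP IR-book, which covers pretty much everything: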
https://nlp.stanford.edu/IR-book/html/htmledition/

SQL tools:

https://wiki.postgresql.org/wiki/Psycopg2_Tutorial
Pandas (very important, you must be fluent in it):
https://pandas.pydata.org/pandas … _started/10min.html
https://pandas.pydata.org/pandas … ataFrame.apply.html
https://pandas.pydata.org/pandas … g.html#merging-join
https://scikit-learn.org/stable/ … with_text_data.html

Supervised NLP
ML pipeline (a must in industry, and the biggest difference from course projects):
https://scikit-learn.org/stable/ … eline.Pipeline.html
https://juejin.im/entry/5ad6b20a6fb9a028e46f293a
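
To make the pipeline point concrete, here is a minimal sketch of a scikit-learn `Pipeline` that chains a TF-IDF vectorizer with a classifier. The toy texts, the labels, and the choice of `LogisticRegression` are assumptions made up for illustration, not what was used in the internship.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
texts = [
    "seed round for an AI startup",
    "series A funding for a machine learning company",
    "new recipe for chocolate cake",
    "how to bake sourdough bread at home",
]
labels = [1, 1, 0, 0]  # 1 = investment-related, 0 = other

# Chaining the steps means the vocabulary is learned only from the data
# passed to fit(), and the same transform is reapplied at predict() time.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("model", LogisticRegression()),
])
clf.fit(texts, labels)

predictions = clf.predict(["funding round for a startup", "bake a cake"])
```

Because the whole chain is a single estimator, it can be passed as-is to cross-validation or a hyperparameter search, which is a large part of why it shows up so often in production code.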

NLP:

Stop words:

PorterStemmer:

https://tartarus.org/martin/PorterStemmer/
General ways to solve NLP problem:
https://blog.insightdatascience. … 8e4e?imm_mid=0faff0
https://github.com/hundredblocks … /NLP_notebook.ipynb
Convolutional Neural Networks for Sentence Classification:

LSTM:
https://blog.csdn.net/Jerr__y/article/details/58598296
https://www.jianshu.com/p/9dc9f41f0b29

Feature extraction:
https://scikit-learn.org/stable/ … ext.TfidfVectorizer
https://scikit-learn.org/stable/ … ext.CountVectorizer
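
For interview prep it helps to be able to write TF-IDF by hand. The sketch below uses the textbook weighting `tf * log(N / df)` on whitespace-split tokens; the function name `tfidf` and the toy documents are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency within this document
        vectors.append({
            term: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["the cat sat", "the dog sat", "the cat ran"]
vectors = tfidf(docs)
```

A term that appears in every document ("the" here) gets weight 0, while a term unique to one document ("dog") gets the largest weight.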
Bert:

https://huggingface.co/pytorch-transformers/model_doc/auto.html#
https://mccormickml.com/2019/07/22/BERT-fine-tuning/
Freeze BERT:

https://discuss.pytorch.org/t/ho … the-training/7088/5

Gensim:

Word2vec:
https://rare-technologies.com/parallelizing-word2vec-in-python/
https://machinelearningmastery.c … ings-python-gensim/

https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding

Git:
https://www.atlassian.com/git/tu … ith-bitbucket-cloud

Hyperparameter tuning:
https://scikit-learn.org/stable/ … omizedSearchCV.html

Data matching:
https://recordlinkage.readthedocs.io/en/latest/
Words distance:
https://blog.csdn.net/chaoswork/article/details/5489877
https://blog.csdn.net/asty9000/article/details/81384650
Smith-Waterman algorithm:
https://baike.baidu.com/item/%E5 … 22800982?fr=aladdin
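
A minimal sketch of Smith-Waterman local alignment scoring, in case the dynamic-programming recurrence is easier to remember as code. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the linear gap penalty are assumptions chosen for illustration. The function only returns the best score and does not trace back the alignment itself.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # Clamping at 0 is what makes the alignment local:
            # a bad prefix is simply dropped.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best
```

For example, `smith_waterman("xxabcxx", "yyabcyy")` scores only the shared `abc` region, since the mismatching flanks would lower the score.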
Damerau/Levenshtein Distance
https://www.jianshu.com/p/6cc29bc31eb9
https://blog.csdn.net/vcbin/article/details/52121062
https://blog.csdn.net/asty9000/article/details/81384650
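
A minimal sketch of the edit-distance table. With `allow_transpose=False` it is plain Levenshtein distance; with `allow_transpose=True` it is the restricted Damerau-Levenshtein variant (optimal string alignment), which counts a swap of two adjacent characters as one edit. The function and parameter names are my own choices for illustration.

```python
def edit_distance(a, b, allow_transpose=True):
    """Levenshtein distance, optionally with adjacent transpositions (OSA)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (allow_transpose and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # Swap of two adjacent characters counts as one edit.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

Note that the optimal string alignment variant never edits a substring twice, so it can return a larger value than unrestricted Damerau-Levenshtein on some inputs (for example "ca" vs "abc").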
Jaro-Winkler Distance
https://blog.csdn.net/vcbin/article/details/52121062
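
A minimal sketch of Jaro and Jaro-Winkler similarity, assuming the commonly quoted defaults of a prefix scale `p=0.1` and a maximum prefix length of 4. The function names are mine; for real record-matching work a tested library is the safer choice.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only match if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order, counted in pairs.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

The prefix boost is the reason Jaro-Winkler is popular for matching names, where typos tend to occur later in the string.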

XGBoost:

https://blog.csdn.net/zc02051126/article/details/46711047
https://huggingface.co/pytorch-t … trained_models.html
gbtree / gblinear:

SVD, PCA, latent semantic analysis:
https://medium.com/@jonathan_hui … is-pca-1d45e885e491
https://scikit-learn.org/stable/ … n.TruncatedSVD.html
https://medium.com/@chrisfotache … d-more-b83451a327e0
https://nlp.stanford.edu/IR-book … tic-indexing-1.html
https://blog.csdn.net/qq_27009517/article/details/79361439

Weak supervision and semi-supervised learning:
Snorkel:
https://snorkel.readthedocs.io/e … rityLabelVoter.html
https://hazyresearch.github.io/s … h_tf_blog_post.html
https://www.snorkel.org/blog/babble
https://www.snorkel.org/use-cases/01-spam-tutorial
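
To illustrate the idea behind weak supervision, here is a plain-Python sketch of labeling functions combined by majority vote. It deliberately does not use the Snorkel API. The three heuristics, the label constants, and the tie-breaking rule (abstain on ties) are all assumptions invented for illustration.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

# Each labeling function is a cheap, noisy heuristic that may abstain.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 5 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def majority_vote(text):
    """Combine the non-abstaining votes; abstain if none or on a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]
```

The resulting noisy labels are then used to train an ordinary classifier. Snorkel additionally offers a label model that weights labeling functions by estimated accuracy instead of counting every vote equally.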

Semi supervised
https://scikit-learn.org/stable/modules/label_propagation.html

Active learning :
https://scikit-learn.org/stable/ … ctive_learning.html

Unsupervised NLP:
K-means:
https://medium.com/@MSalnikov/te … tf-idf-f099bcf95183
https://scikit-learn.org/stable/ … ent_clustering.html
https://towardsdatascience.com/k-means-clustering-8e1e64c1561c
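
A minimal sketch of the k-means loop (Lloyd's algorithm) in plain Python. For text clustering the points would be document vectors such as TF-IDF rows; the 2D toy points used here, and the naive "first k points" initialization, are assumptions to keep the example deterministic.

```python
def kmeans(points, k, n_iter=20):
    """Return (assignments, centroids) after at most n_iter Lloyd iterations."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return assignments, centroids

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
assignments, centroids = kmeans(points, k=2)
```

In practice the result depends on initialization, which is why library implementations use smarter seeding and multiple restarts.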
AHC:
https://nlp.stanford.edu/IR-book … e-clustering-1.html
https://www.geeksforgeeks.org/ml … ivisive-clustering/
https://towardsdatascience.com/m … python-1e18e0075019 (this one is a really pleasant read)
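
A minimal sketch of agglomerative hierarchical clustering (AHC) in plain Python, assuming single linkage and Euclidean distance. Every point starts as its own cluster, and the two closest clusters are merged until `n_clusters` remain. The function name and the toy points are made up for illustration, and the brute-force search is only suitable for tiny inputs.

```python
def agglomerative(points, n_clusters):
    """Single-linkage AHC; returns clusters as sorted lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def dist(i, j):
        return sum((x - y) ** 2 for x, y in zip(points[i], points[j])) ** 0.5

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist(i, j) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three = agglomerative(points, n_clusters=3)
two = agglomerative(points, n_clusters=2)
```

Swapping `min` for `max` inside `linkage` gives complete linkage, which tends to produce more compact clusters.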

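Finally, this list certainly can't cover every corner of NLP, especially deep learning applications such as NMT and QA. But interviews will generally probe the most basic material very thoroughly (it's also all the boss remembers), so the more fundamental the NLP topic, the more firmly you need to master it. For state-of-the-art models like BERT and XLNet, being able to explain the architecture clearly, and at most having used their pre-trained models, is enough.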

data123 · July 1, 2020, 12:55

Wow, sis, you are seriously impressive!

Shan_Jiang1 · July 9, 2020, 00:43

Wow... very impressive!

Alex_Ren · November 19, 2021, 19:36

Which VC was it, OP? I'd love to take a look.