Sharing the topics I used during a 3-month NLP internship

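I'm an MS student at Stanford, and last summer I did NLP at a wholly US-funded VC. Yes, a VC. I can't help marveling that technology has spread from the secondary market into every corner of investing. The projects were a lot of fun: I used Snorkel for weak-supervision labeling, built a text classifier, and did text clustering. I won't go into the details of each project. These techniques have long been applied in other fields, you can find examples online, and I'll share some links below. They are still fairly new in VC, though. According to my boss, we were at least two years ahead of the rest of the industry in DI & Sourcing.
Below are links for the topics I used, which you can refer to when preparing for NLP-related interviews. They only cover what came up in my internship projects (a mix of Chinese and English resources):
First, I highly recommend Stanford NLP's IR-book, which covers nearly everything: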
https://nlp.stanford.edu/IR-book/html/htmledition/

SQL tools:
https://wiki.postgresql.org/wiki/Psycopg2_Tutorial
Pandas (very important; make sure you are fluent):
https://pandas.pydata.org/pandas … _started/10min.html
https://pandas.pydata.org/pandas … ataFrame.apply.html
https://pandas.pydata.org/pandas … g.html#merging-join
https://scikit-learn.org/stable/ … with_text_data.html

  1. Supervised NLP
    ML pipeline (a must in industry, and the biggest difference from course projects):
    https://scikit-learn.org/stable/ … eline.Pipeline.html
    https://juejin.im/entry/5ad6b20a6fb9a028e46f293a
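To make the pipeline point concrete, here is a minimal sketch of a scikit-learn `Pipeline` for text classification. The toy texts, labels, and step names are made up for illustration and are not from the internship projects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data invented for illustration only.
texts = [
    "series a funding for an ai startup",
    "seed round closes for a robotics company",
    "quarterly weather report for the coast",
    "local sports team wins the final",
]
labels = [1, 1, 0, 0]  # 1 = funding news, 0 = other

# Chaining the vectorizer and the classifier means the same
# preprocessing is applied in fit() and in predict().
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression()),
])
text_clf.fit(texts, labels)

predictions = text_clf.predict(["new funding round for a startup"])
print(predictions)
```

The main benefit over a course-project script is that the fitted pipeline is a single object, so it can be cross-validated, tuned, or saved as one unit without the vectorizer and model drifting apart.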

NLP:


Stop words:

PorterStemmer:

https://tartarus.org/martin/PorterStemmer/
General ways to solve NLP problem:
https://blog.insightdatascience. … 8e4e?imm_mid=0faff0
https://github.com/hundredblocks … /NLP_notebook.ipynb
Convolutional Neural Networks for Sentence Classification:

LSTM:
https://blog.csdn.net/Jerr__y/article/details/58598296
https://www.jianshu.com/p/9dc9f41f0b29

Feature extraction:
https://scikit-learn.org/stable/ … ext.TfidfVectorizer
https://scikit-learn.org/stable/ … ext.CountVectorizer
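For interview prep it helps to be able to write TF-IDF by hand. The sketch below uses the textbook weighting `tf * log(N / df)` with whitespace tokenization. Both choices are simplifying assumptions: scikit-learn's `TfidfVectorizer` smooths the idf term and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights


# Toy documents invented for illustration.
docs = ["seed round seed", "seed stage startup", "weather report"]
weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document gets weight 0 under this formula, which is the intuition behind TF-IDF: frequent-everywhere words carry no discriminating signal.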
BERT:
https://huggingface.co/pytorch-transformers/model_doc/auto.html#
https://mccormickml.com/2019/07/22/BERT-fine-tuning/
Freeze BERT:
https://discuss.pytorch.org/t/ho … the-training/7088/5

Gensim:

Word2vec:
https://rare-technologies.com/parallelizing-word2vec-in-python/
https://machinelearningmastery.c … ings-python-gensim/


https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding

Git:
https://www.atlassian.com/git/tu … ith-bitbucket-cloud

Hyperparameter tuning:
https://scikit-learn.org/stable/ … omizedSearchCV.html

Data matching:
https://recordlinkage.readthedocs.io/en/latest/
Words distance:
https://blog.csdn.net/chaoswork/article/details/5489877
https://blog.csdn.net/asty9000/article/details/81384650
Smith-Waterman algorithm:
https://baike.baidu.com/item/%E5 … 22800982?fr=aladdin
Damerau/Levenshtein Distance:
https://www.jianshu.com/p/6cc29bc31eb9
https://blog.csdn.net/vcbin/article/details/52121062
https://blog.csdn.net/asty9000/article/details/81384650
Jaro-Winkler Distance:
https://blog.csdn.net/vcbin/article/details/52121062
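The edit distances listed above come up often in data matching, so here is a minimal pure-Python sketch of two of them: Levenshtein distance and the optimal string alignment variant of Damerau-Levenshtein (which also counts a swap of two adjacent characters as one edit). The function names are my own, and this is meant for understanding rather than speed.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Levenshtein plus adjacent transpositions (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent transposition, e.g. "ab" -> "ba", counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("ab", "ba"), osa_distance("ab", "ba"))
```

The second print shows the practical difference: a typo that swaps two neighboring letters costs two edits under plain Levenshtein but only one under the transposition-aware variant.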


XGBoost:
https://blog.csdn.net/zc02051126/article/details/46711047
https://huggingface.co/pytorch-t … trained_models.html
gbtree / gblinear:

SVD, PCA, latent semantic analysis (LSA):
https://medium.com/@jonathan_hui … is-pca-1d45e885e491
https://scikit-learn.org/stable/ … n.TruncatedSVD.html
https://medium.com/@chrisfotache … d-more-b83451a327e0
https://nlp.stanford.edu/IR-book … tic-indexing-1.html
https://blog.csdn.net/qq_27009517/article/details/79361439

  2. Weak supervision and semi-supervised learning:
    Snorkel:
    https://snorkel.readthedocs.io/e … rityLabelVoter.html
    https://hazyresearch.github.io/s … h_tf_blog_post.html
    https://www.snorkel.org/blog/babble
    https://www.snorkel.org/use-cases/01-spam-tutorial
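The core idea of weak supervision is to write several noisy heuristics (labeling functions) and combine their votes into training labels. The sketch below shows the simplest combination, a majority vote, in plain Python. It is not Snorkel's API; the three labeling functions and the spam/ham setup are hypothetical examples in the spirit of the spam tutorial linked above.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1


# Hypothetical labeling functions: each encodes one noisy heuristic
# and abstains when the heuristic does not apply.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_subscribe(text):
    return SPAM if "subscribe" in text.lower() else ABSTAIN


def lf_short_comment(text):
    return HAM if len(text.split()) < 5 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_subscribe, lf_short_comment]


def majority_vote(text, lfs=LABELING_FUNCTIONS):
    """Return the most common non-abstain vote, or ABSTAIN on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics with equal support
    return ranked[0][0]


print(majority_vote("please subscribe at http://example.com now friends"))
print(majority_vote("great song"))
```

Examples where every function abstains, or where the votes tie, stay unlabeled and are typically dropped before training the downstream classifier.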

Semi-supervised learning:
https://scikit-learn.org/stable/modules/label_propagation.html

Active learning:
https://scikit-learn.org/stable/ … ctive_learning.html

  3. Unsupervised NLP:
    K-means:
    https://medium.com/@MSalnikov/te … tf-idf-f099bcf95183
    https://scikit-learn.org/stable/ … ent_clustering.html
    https://towardsdatascience.com/k-means-clustering-8e1e64c1561c
    AHC:
    https://nlp.stanford.edu/IR-book … e-clustering-1.html
    https://www.geeksforgeeks.org/ml … ivisive-clustering/
    https://towardsdatascience.com/m … python-1e18e0075019 (this one is a really pleasant read)

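Finally, this list certainly can't cover every corner of NLP, especially deep learning applications such as NMT and QA. But interviews almost always probe the fundamentals thoroughly (they are also all the boss remembers), so the more basic the NLP topic, the more firmly you should master it. For state-of-the-art models like BERT and XLNet, being able to explain the architecture clearly, and at most having used their pre-trained models, is enough.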


Wow, sis, you could not be more impressive.

Wow... really strong!

Which VC is the OP talking about? I'd like to go admire it.