System Design Series: Design a Web Crawler

Xavier · January 17, 2020, 08:00

Feel free to like the video and subscribe to the YouTube channel.

donnemartin/system-design-primer/blob/master/solutions/system_design/web_crawler/README.md

# Design a web crawler

*Note: This document links directly to relevant areas found in the [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) to avoid duplication.  Refer to the linked content for general talking points, tradeoffs, and alternatives.*

## Step 1: Outline use cases and constraints

> Gather requirements and scope the problem.
> Ask questions to clarify use cases and constraints.
> Discuss assumptions.

Without an interviewer to address clarifying questions, we'll define some use cases and constraints.

### Use cases

#### We'll scope the problem to handle only the following use cases

* **Service** crawls a list of URLs:
    * Generates reverse index of words to pages containing the search terms
    * Generates titles and snippets for pages
        * Titles and snippets are static; they do not change based on the search query
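
As a rough illustration of the reverse index use case above, here is a minimal sketch that maps each word to the set of pages containing it. The function name `build_reverse_index`, the sample pages, and the simple regex tokenizer are assumptions made for this example; they are not taken from the README.

```python
import re
from collections import defaultdict


def build_reverse_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it.

    `pages` is a dict of url -> page text (a hypothetical input shape).
    """
    index = defaultdict(set)
    for url, text in pages.items():
        # Naive tokenizer: runs of ASCII letters and digits only.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


# Hypothetical sample data for illustration.
pages = {
    "https://example.com/a": "Web crawler design",
    "https://example.com/b": "Crawler politeness and robots.txt",
}
index = build_reverse_index(pages)
```

Looking up `index["crawler"]` then yields both sample URLs, while `index["design"]` yields only the first.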


Xavier · January 17, 2020, 08:33

Xavier · January 18, 2020, 00:20

Xavier · January 18, 2020, 21:42

Xavier · January 19, 2020, 01:03

Xavier · January 23, 2020, 07:12

Xavier · January 23, 2020, 07:13

Xavier · January 23, 2020, 07:13

Xavier · January 23, 2020, 07:13

Xavier · January 23, 2020, 07:14

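Jun_Li1 · January 17, 2021, 05:18

The first video mentions robot.txt, which is incorrect. It should be robots.txt.

Xavier · January 17, 2021, 05:35

Yes, thanks for the correction.

Jun_Li1 · January 18, 2021, 05:25

Thank you! I've learned a lot from your videos!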

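xxxxxx · February 4, 2021, 03:17

A question for the OP: how do URL dedup and document dedup handle concurrency issues? Suppose two processes handle URL ABC at the same time. Process A is about to write it into the URL set, but at that moment process B checks the set and finds the URL is not there yet. Wouldn't both processes then end up processing the same URL?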

Xavier · February 4, 2021, 03:48

The workflow here is entirely sequential: dedup has to be called before a URL is written to the URL set, and only then is it added.
So the parallel case shouldn't arise.
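
To make the check-then-add step concrete, here is a minimal sketch of a URL set whose dedup check and insert happen as one atomic operation. This goes slightly beyond the reply: it assumes several threads share one in-process set and uses a lock so that the race described in the question cannot occur. The class name `UrlSet` and the method `add_if_new` are hypothetical, not something from the video.

```python
import threading


class UrlSet:
    """In-memory URL set where the dedup check and the insert are one step."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add_if_new(self, url):
        """Return True only for the first caller that adds `url`."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True


url_set = UrlSet()
claims = []  # one True/False result per worker


def worker():
    # Every worker tries to claim the same URL; only one should succeed.
    claims.append(url_set.add_if_new("https://example.com/ABC"))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the membership test and the insert sit under the same lock, exactly one of the eight workers gets `True` and goes on to process the URL. With separate processes or machines, the same idea would need a shared store that offers an atomic "insert if absent" operation.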

xxxxxx · February 4, 2021, 22:42

Thanks for the reply. If it's sequential, though, doesn't that mean it isn't distributed, so the TPS would have a fairly low upper bound?

Xavier · February 4, 2021, 22:48

If it has to be concurrent, you can partition: each process owns a subset of the URL prefixes.
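
One way to read "partition by URL prefix" is sketched below: take the scheme and host as the prefix, hash it, and use the hash to pick the owning process. The choice of scheme plus host as the prefix, the worker count `NUM_WORKERS`, and the function name `owner_of` are assumptions for illustration; the reply does not specify them.

```python
import hashlib
from urllib.parse import urlsplit

NUM_WORKERS = 4  # assumed number of dedup processes


def owner_of(url, num_workers=NUM_WORKERS):
    """Pick the worker that owns this URL, based on its scheme+host prefix."""
    parts = urlsplit(url)
    prefix = f"{parts.scheme}://{parts.netloc.lower()}"
    # Use a stable hash: Python's built-in hash() of a str varies per process.
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

Every URL under the same prefix goes to the same worker, so each worker can keep its own URL set and run its dedup step sequentially without coordinating with the others.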

xxxxxx · February 4, 2021, 23:12

Got it, thank you!

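tbjc · July 5, 2022, 20:24

I find the Bloom filter part very confusing. I can see it as a fast way to check for duplicates. But how do you handle refreshing a web page, i.e., you crawled it a week ago and now need to crawl it again? You can't tell from the Bloom filter when a URL was last crawled, can you?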
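
To illustrate the point above, here is a minimal Bloom filter sketch. It only answers "possibly seen" or "definitely not seen" and stores no timestamps. The `last_crawled` side table and the one-week `RECRAWL_AFTER_SECONDS` threshold are assumptions added for this example as one possible way to track freshness; the thread does not say how the video's design handles recrawling.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: membership only, no per-item metadata."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


RECRAWL_AFTER_SECONDS = 7 * 24 * 3600  # assumed refresh interval: one week

seen = BloomFilter()
last_crawled = {}  # hypothetical side table: url -> last crawl time (seconds)


def should_crawl(url, now):
    if url not in seen:
        return True  # definitely never crawled, no table lookup needed
    # Possibly crawled (or a false positive): the side table decides.
    crawled_at = last_crawled.get(url)
    return crawled_at is None or now - crawled_at >= RECRAWL_AFTER_SECONDS


def mark_crawled(url, now):
    seen.add(url)
    last_crawled[url] = now
```

In this sketch the Bloom filter acts only as a cheap pre-check for never-seen URLs, and the time state lives in a separate store, which matches the observation that the filter itself cannot provide it.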

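Then there's the data partition part: how should that be done? For example, if you have multiple fetchers, each with its own IP, you would want them to share the fetching of a given website fairly evenly, rather than having a single fetcher handle it alone. How should the partitioning work in that case?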