系统设计问题分享-Design a P2P file sharing application like BitTorrent

Haven’t seen this question discussed before. It’s been asked at Twitch, anyone else encounter it elsewhere? Pretty unique from most systems design questions.

There’s some insight on Geeksforgeeks. Pasted below.

Introduction
In Computer Networking, P2P is a file sharing technology, allowing the users to access mainly the multimedia files like videos, music, e-books, games etc. The individual users in this network are referred to as peers. The peers request for the files from other peers by establishing TCP or UDP connections.

How P2P works(Overview)
A peer-to-peer network allows computer hardware and software to communicate without the need for a server. Unlike client-server architecture, there is no central server for processing requests in a P2P architecture. The peers directly interact with one another without the requirement of a central server.

Now, when one peer makes a request, it is possible that multiple peers have the copy of that requested object. Now the problem is how to get the IP addresses of all those peers. This is decided by the underlying architecture supported by the P2P systems. By means of one of these methods, the client peer can get to know about all the peers which have the requested object/file and the file transfer takes place directly between these two peers.

Three such Architectures exist:

  1. Centralized Directory
  2. Query Flooding
  3. Exploiting Heterogeneity

1. Centralized Directory

  • It is somewhat similar to client server architecture in the sense that it maintains a huge central server to provide directory service.
  • All the peers inform this central server of their IP address and the files they are making available for sharing.
  • The server queries the peers at regular intervals to make sure if the peers are still connected or not.
  • So basically this server maintains a huge database regarding which file is present at which IP addresses.

Working

  • Now whenever a requesting peer comes in, it sends its query to the server.
  • Since the server has all the information of its peers, so it returns the IP addresses of all the peers having the requested file to the peer.
  • Now the file transfer takes place between these two peers. The first system which made use of this method was Napster, for the purpose of Mp3 distribution.

image

The major problem with such an architecture is that there is a single point of failure. If the server crashes, the whole P2P network crashes. Also, since all of the processing is to be done by a single server so a huge amount of database has to be maintained and regularly updated.

2. Query Flooding

  • Unlike the centralized approach, this method makes use of distributed systems.
  • In this, the peers are supposed to be connected into an overlay network. It means if a connection/path exists from one peer to other, it is a part of this overlay network.
  • In this overlay network, peers are called as nodes and the connection between peers is called an edge between the nodes, thus resulting in a graph-like structure.

Working

  • Now when one peer requests for some file, this request is sent to all its neighboring nodes i.e. to all nodes which are connected to this node. If those nodes don’t have the required file, they pass on the query to their neighbors and so on. This is called as query flooding.
  • When the peer with requested file is found (referred to as query hit), the query flooding stops and it sends back the file name and file size to the client, thus following the reverse path.
  • If there are multiple query hits, the client selects from one of these peers.

Gnutella was the first decentralized peer to peer network.

image

This method also has some disadvantages like, the query has to be sent to all the neighboring peers unless a match is found. This increases traffic in the network.

3. Exploiting heterogeneity

  • This P2P architecture makes use of both the above discussed systems.
  • It resembles a distributed system like Gnutella because there is no central server for query processing.
  • But unlike Gnutella, it does not treat all its peers equally. The peers with higher bandwidth and network connectivity are at a higher priority and are called as group leaders/super nodes. The rest of the peers are assigned to these super nodes.
  • These super nodes are interconnected and the peers under these super nodes inform their respective leaders about their connectivity, IP address and the files available for sharing.

KaZaA technology is such an example which makes use of Napster and Gnutella both.
Thus, the individual group leaders along with their child peers form a Napster-like structure. These group leaders then interconnect among themselves to resemble a Gnutella-like structure.

Working

  • This structure can process the queries in two ways.
  • The first one is that the super nodes could contact other super nodes and merge their databases with its own database. Thus, this super node now has information of a large number of peers.
  • Another approach is that when a query comes in, it is forwarded to the neighboring super nodes until a match is found, just like in Gnutella. Thus query flooding exists but with limited scope as each super node has many child peers. Hence, such a system exploits the heterogeneity of the peers by designating some of them as group leaders/super nodes and others as their child peers.

image