How to make your DHT Crawler in 3 steps

Why ?


You all know that torrent indexers and trackers are the target of a witch hunt by copyright owners, who have decided to declare war on free and open file sharing.

Recently, The Pirate Bay founders were targeted with particular ferocity and paid the full price, with enormous fines and sentences. But the community reacted by releasing a database dump of the website along with openbay, a project to create your own indexer. The guys from isohunt were behind this release. After that, The Pirate Bay reopened for a time, then went down and came back up regularly.

All this to say that torrent repositories have become something western governments want to censor. As an act of resistance, I've decided to create a DHT crawler based on nodeJS, so that everyone can easily install their own pirate bay on basic hardware.

At a time when big corporations and governments want to centralize and control data streams, defending free information transfer is a dangerous task.

How ?


I wanted the project to rely on a single technology to limit dependencies, so I chose nodeJS as a base. Through npm it's easy to install extra modules, and it offers all the third-party packages our project needs. And I fucking like async.

So the idea is to use the decentralized DHT protocol to collect the torrents other users are downloading and index them in our database. That builds a substantial list of torrents in which you can search for a movie you want to watch or a book you want to read.
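To make the idea concrete, here is a minimal sketch of how an info hash can be pulled out of a single DHT packet. It is not the project's code (the project relies on the dht.js module for this). It assumes the standard Mainline DHT message format, where queries are bencoded dictionaries and `get_peers` / `announce_peer` queries carry a 20-byte `info_hash` argument. The function names `decode` and `extractInfoHash`, and the sample packet, are made up for illustration.

```javascript
'use strict';

// Minimal bencode decoder: integers, byte strings, lists and dictionaries.
// Byte strings are returned as Buffers because info hashes are raw binary.
function decode(buf, pos = 0) {
  const c = buf[pos];
  if (c === 0x69) { // 'i' <digits> 'e'
    const end = buf.indexOf(0x65, pos);
    if (end === -1) throw new Error('unterminated integer');
    return { value: Number(buf.toString('ascii', pos + 1, end)), next: end + 1 };
  }
  if (c === 0x6c || c === 0x64) { // 'l' list or 'd' dictionary, closed by 'e'
    const isDict = c === 0x64;
    const out = isDict ? Object.create(null) : [];
    pos += 1;
    while (buf[pos] !== 0x65) {
      if (pos >= buf.length) throw new Error('unterminated container');
      if (isDict) {
        const key = decode(buf, pos);
        const val = decode(buf, key.next);
        out[key.value.toString('latin1')] = val.value;
        pos = val.next;
      } else {
        const item = decode(buf, pos);
        out.push(item.value);
        pos = item.next;
      }
    }
    return { value: out, next: pos + 1 };
  }
  // Byte string: <length> ':' <bytes>
  if (!(c >= 0x30 && c <= 0x39)) throw new Error('malformed bencode');
  const colon = buf.indexOf(0x3a, pos);
  if (colon === -1) throw new Error('malformed byte string');
  const len = Number(buf.toString('ascii', pos, colon));
  if (Number.isNaN(len)) throw new Error('malformed length');
  const start = colon + 1;
  return { value: buf.subarray(start, start + len), next: start + len };
}

// Returns the info hash as a hex string, or null if the packet is not a
// get_peers / announce_peer query (responses, pings and garbage are ignored).
function extractInfoHash(packet) {
  let msg;
  try {
    msg = decode(packet).value;
  } catch (err) {
    return null;
  }
  if (!msg || typeof msg !== 'object' || Buffer.isBuffer(msg) || Array.isArray(msg)) return null;
  if (!Buffer.isBuffer(msg.y) || msg.y.toString() !== 'q') return null;
  const query = Buffer.isBuffer(msg.q) ? msg.q.toString() : '';
  if (query !== 'get_peers' && query !== 'announce_peer') return null;
  const hash = msg.a && msg.a.info_hash;
  return Buffer.isBuffer(hash) && hash.length === 20 ? hash.toString('hex') : null;
}

// Hand-built sample get_peers query (node id and transaction id are arbitrary).
const sampleHash = Buffer.from('0123456789abcdef0123456789abcdef01234567', 'hex');
const packet = Buffer.concat([
  Buffer.from('d1:ad2:id20:abcdefghij01234567899:info_hash20:'),
  sampleHash,
  Buffer.from('e1:q9:get_peers1:t2:aa1:y1:qe'),
]);

console.log(extractInfoHash(packet)); // prints 0123456789abcdef0123456789abcdef01234567
```

In a real crawler these packets would arrive on a UDP socket, and each hash returned here is what gets queued for the next stage.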

The project is composed of 3 totally independent modules:
  1. crawlDHT: this module listens on the DHT network for torrent hashes and stores them in a redis queue called DHTS. It also saves the DHT routing table every 10 minutes, as recommended by the BEP specifications. It uses the dht.js module, an implementation of the DHT protocol.
  2. loadDHT: this one loads hashes from redis and tries to download the corresponding torrent metadata using the DHT network, or directly from torcache or torrage. A peculiarity of the latter two is that torrents are returned as gzip files, but thanks to aria2 options it is very easy to save them already decompressed. Thanks to node-aria2, we're able to communicate with aria2 using RPC calls.
  3. loadTorrent: the last module just waits for downloaded torrent metadata, then parses it and indexes it in our mongodb dhtcrawler database. The read-torrent module is used heavily here for torrent parsing.
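The way the three modules hand work to each other can be sketched as below. This is only an illustration under loud assumptions: plain arrays stand in for redis, a `Map` stands in for the mongodb collection, `fetchMetadata` is a hypothetical stub in place of the aria2 download, and the `TORRENTS` queue name is invented (only `DHTS` comes from the project). Everything is synchronous to keep the sketch short, whereas the real modules do network I/O and would be asynchronous.

```javascript
'use strict';

// In-memory stand-ins, for illustration only.
const queues = { DHTS: [], TORRENTS: [] }; // TORRENTS is a hypothetical name
const index = new Map();                   // info hash -> indexed document

// Stage 1 (crawlDHT): queue every info hash seen on the DHT.
function crawl(infoHash) {
  queues.DHTS.push(infoHash);
}

// Stage 2 (loadDHT): take one hash, try to fetch its metadata, pass it on.
// Returns false when the queue is empty.
function load(fetchMetadata) {
  const infoHash = queues.DHTS.shift();
  if (infoHash === undefined) return false;
  const metadata = fetchMetadata(infoHash);
  if (metadata) queues.TORRENTS.push({ infoHash, metadata });
  return true;
}

// Stage 3 (loadTorrent): take one downloaded torrent and index it.
// Returns false when there is nothing left to index.
function indexNext() {
  const job = queues.TORRENTS.shift();
  if (!job) return false;
  index.set(job.infoHash, { ...job.metadata, indexedAt: new Date() });
  return true;
}

// Hypothetical stub: pretends only hashes starting with "01" can be fetched.
const fakeFetch = (hash) =>
  hash.startsWith('01') ? { name: 'example.iso', length: 1024 } : null;

crawl('0123456789abcdef0123456789abcdef01234567');
crawl('89abcdef0123456789abcdef0123456789abcdef');

while (load(fakeFetch)) { /* drain the hash queue */ }
while (indexNext()) { /* drain the metadata queue */ }

console.log(index.size); // prints 1, the second hash had no metadata
```

Because each stage only touches its own input and output, any one of them can be stopped, replaced or run on another machine without the others noticing.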

Thanks to our queue system and mongo's dynamic data format, you can easily use only one specific module, or add new features to whichever one you want.

The portal folder contains the website and is based on an express instance. The front end is built with the angular library and lives in the public folder.

Help ?


Yes, the project is at its very beginning, but I hope you'll fork it and send pull requests with your improvements to share them with everyone. For example, it would be great to have the peer and seed counts.

So if you want to give it a try, go check: https://github.com/FlyersWeb/dht-bay

And remember to have fun :)

Post Script


After 3 days of running, my little hackberry with its limited bandwidth had indexed almost 1000 torrents. It would be great to get an idea of what the most shared things on the DHT network are.

For those who like to watch movies on rainy Sundays, you can use peerflix to stream a torrent download to VLC or any other player. Have a look.
