How to make your DHT Crawler in 3 steps
You all know that torrent indexers and trackers are the target of a witch hunt by copyright owners, who have decided to declare war on the free and open sharing of content.
Lately, The Pirate Bay founders were targeted with such fierceness and paid the full price with enormous fines and sentences. But the community reacted by releasing a database dump of the website and a project for creating your own indexer, openbay. The guys from isohunt were behind this release. After that, The Pirate Bay reopened for a time, then went down and up regularly.
All this to say that torrent repositories have become something governments want to censor. So I've decided to create a DHT crawler using nodeJS so that everyone can run their own pirate bay, one that should be easy to install on basic hardware.
I wanted a project that relies on only one technology to limit dependencies, so I chose nodeJS as a base. npm makes it easy to install extra modules and offers all the third-party packages our project needs.
So the idea is to use the decentralized DHT protocol to listen for torrents downloaded by other users and index them in our database. Then we can search that database for a movie or a book.
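To give an idea of what "listening" means here: on the BitTorrent DHT (BEP 5), other nodes send us bencoded get_peers and announce_peer queries that carry the 20-byte info hash of the torrent they are interested in. The snippet below is only a minimal sketch of that wire format, not how the dht.js module works internally. The decode and extractInfoHash helpers are names I made up for illustration, and the sample packet is the get_peers example from the BEP 5 text.

```javascript
// Minimal bencode decoder: returns [value, nextPosition].
// Byte strings come back as Buffers because info hashes are raw bytes.
function decode(buf, pos = 0) {
  if (!(pos < buf.length)) throw new Error('unexpected end of data');
  const c = buf[pos];
  if (c === 0x69) { // 'i' <digits> 'e' -> integer
    const end = buf.indexOf(0x65, pos);
    if (end < 0) throw new Error('unterminated integer');
    return [parseInt(buf.toString('ascii', pos + 1, end), 10), end + 1];
  }
  if (c === 0x6c || c === 0x64) { // 'l' -> list, 'd' -> dictionary
    const isDict = c === 0x64;
    const out = isDict ? {} : [];
    pos += 1;
    while (buf[pos] !== 0x65) { // until the closing 'e'
      let key = null;
      if (isDict) [key, pos] = decode(buf, pos);
      let val;
      [val, pos] = decode(buf, pos);
      if (isDict) out[key.toString('latin1')] = val;
      else out.push(val);
    }
    return [out, pos + 1];
  }
  // Byte string: <length> ':' <bytes>
  const colon = buf.indexOf(0x3a, pos);
  const len = colon < 0 ? NaN : parseInt(buf.toString('ascii', pos, colon), 10);
  const start = colon + 1;
  if (Number.isNaN(len) || start + len > buf.length) {
    throw new Error('malformed byte string');
  }
  return [buf.subarray(start, start + len), start + len];
}

// Hypothetical helper: returns the hex info hash carried by a get_peers or
// announce_peer query, or null for any other (or malformed) packet.
function extractInfoHash(packet) {
  try {
    const [msg] = decode(packet);
    const isQuery = Buffer.isBuffer(msg.y) && msg.y.toString() === 'q';
    const method = Buffer.isBuffer(msg.q) ? msg.q.toString() : '';
    const hash = msg.a && msg.a.info_hash;
    if (!isQuery || !['get_peers', 'announce_peer'].includes(method)) return null;
    if (!Buffer.isBuffer(hash) || hash.length !== 20) return null;
    return hash.toString('hex');
  } catch (err) {
    return null; // garbage on the wire is common, just skip it
  }
}

// The get_peers example query from BEP 5.
const sample = Buffer.from(
  'd1:ad2:id20:abcdefghij01234567899:info_hash20:mnopqrstuvwxyz123456e' +
  '1:q9:get_peers1:t2:aa1:y1:qe'
);
console.log(extractInfoHash(sample)); // 6d6e6f707172737475767778797a313233343536
```

In a real crawler this function would sit behind a UDP socket, and every non-null result would be a candidate hash to push onto the queue.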
The project is composed of 3 modules:
- crawlDHT: this module listens on the DHT network for torrent hashes and stores them in a redis queue called DHTS. It also saves the DHT routing table every 10 minutes, as recommended by the BEP specifications. It uses the dht.js module, an implementation of the DHT protocol.
- loadDHT: this one is responsible for loading hashes from redis and trying to download the corresponding torrent metadata, either through the DHT network or directly from torcache or torrage. A particularity of the latter two is that torrents are returned as gzip files, but thanks to aria2 options it is very easy to save them directly decompressed. Thanks to node-aria2, we're able to communicate with aria2 using RPC calls.
- loadTorrent: the last module just waits for downloaded torrent metadata, parses it, and indexes it in our mongodb dhtcrawler database. The read-torrent module is used heavily here for torrent parsing.
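About the gzip detail in loadDHT: the project lets aria2 do the decompression, but if you fetch a cached torrent yourself, the same thing can be done with Node's built-in zlib module. This is just a sketch of that alternative. isGzip and toTorrentBytes are hypothetical helper names, and the payload is a tiny stand-in rather than a real torrent.

```javascript
const zlib = require('zlib');

// Gzip streams start with the two magic bytes 0x1f 0x8b (RFC 1952).
function isGzip(buf) {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}

// Return the raw .torrent bytes whether or not the server compressed them.
function toTorrentBytes(body) {
  return isGzip(body) ? zlib.gunzipSync(body) : body;
}

// Stand-in payload: a .torrent file is a bencoded dictionary, so it starts with 'd'.
const fakeTorrent = Buffer.from('d4:infod4:name4:demoee');
const compressed = zlib.gzipSync(fakeTorrent);

console.log(toTorrentBytes(compressed).equals(fakeTorrent)); // true
console.log(toTorrentBytes(fakeTorrent).equals(fakeTorrent)); // true
```

Since a bencoded torrent always begins with the letter d, checking the two magic bytes is enough to tell the compressed and uncompressed cases apart.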
Thanks to our queue system and mongo's dynamic data format, you can easily use only one specific module, or add new features to whichever one you want.
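Here is a rough sketch of why the queues make the modules independent. Plain arrays stand in for the redis lists and for the mongodb collection, so nothing here is the project's real code. Only the DHTS queue name comes from the project; the TORRENTS queue, the function names and the document fields are assumptions for illustration.

```javascript
// In-memory stand-ins (assumptions): the real project uses redis and mongodb.
const queues = { DHTS: [], TORRENTS: [] }; // TORRENTS is a hypothetical queue name
const collection = [];                     // stands in for a mongodb collection

// Stage 1 (crawlDHT role): only knows how to push a hash.
function crawl(infoHash) {
  queues.DHTS.push(infoHash);
}

// Stage 2 (loadDHT role): pops hashes, fetches metadata, pushes the result.
// fetchMetadata is injected, so the download strategy can be swapped freely.
function load(fetchMetadata) {
  while (queues.DHTS.length > 0) {
    const infoHash = queues.DHTS.shift();
    const meta = fetchMetadata(infoHash);
    if (meta) queues.TORRENTS.push({ infoHash, ...meta });
  }
}

// Stage 3 (loadTorrent role): stores whatever document it receives.
// With a schemaless store, documents do not need identical fields.
function index() {
  while (queues.TORRENTS.length > 0) {
    collection.push(queues.TORRENTS.shift());
  }
}

// Usage: the second document carries an extra, made-up "seeders" field.
crawl('a'.repeat(40));
crawl('b'.repeat(40));
load((hash) => (hash.startsWith('a')
  ? { name: 'demo.iso', length: 1024 }
  : { name: 'book.pdf', length: 2048, seeders: 12 }));
index();
console.log(collection.map((doc) => Object.keys(doc)));
```

Each stage only touches the queue next to it, so you could run just the crawler to collect hashes, or add a new field in one stage without changing the others.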
The portal folder contains the website and is based on an express instance. The front end is built with the angular library and lives in the public folder.
Yes, the project is at its very beginning, but I hope that you'll fork it and send pull requests with your improvements to share them with everyone. For example, it would be great to have the peer and seed counts.
So if you want to give it a try, go check: https://github.com/FlyersWeb/dht-bay
And remember to have fun :)
After 3 days of work, my little hackberry with limited bandwidth had indexed almost 1000 torrents. It would be great to have an idea of what the most shared things on DHT networks are.
For those who like to watch movies on rainy Sundays, you can use peerflix to stream a torrent download to VLC or any other player. Have a look.