Trouble to Support SOCKS Proxy for aria2

aria2 cares application layer much and transport layer not enough, causing no easy code point to add transport layer SOCKS proxy.

Created & updated on 2021-12-19.

aria2socks-proxyproxycppinheritance

TOC

Motivation

Recently, to give a proof of my C++ ability, I have tried to add SOCKS1 proxy for aria2.

It may surprise you that until 2021 aria2 still does not have SOCKS proxy support. The corresponding feature request issue was opened at 2013 and now still have not be closed, even has active discussions which are sent in the past 1 year.

In the feature request issue people have actually provided many enough external solutions. One of the most convinient solutions IMO is to use GOST, which is a state-of-the-art L4/L7 protocol tunnelling tool in Golang with 8K stars on GitHub, to convert your SOCKS proxy server into a HTTP proxy server for aria2.

However, why can not we add builtin support of SOCKS proxy for aria2? It is obviously easier to use and can enhance aria2 as a network library meanwhile. As a aria2 user and fan, I decided to spend some time on it.

Finally, I succeeded to add SOCKS proxy support for BT UDP including DHT and UDP tracker connections via this PR. However, when I am going to generalize it to the fundmental HTTP (and currently ignoring other aria2-supported application protocols like FTP and SFTP), I found that it is much more difficult.

The post is just to summarize the tough points in it, avoiding duplicated time consumption on code analysis of following people who also want to work on it, while also maybe helping other library builders to make their libraries more extensive and reusable, at least in the view of an ordinary developer.

Code Analysis of BT UDP

I literally need to set proxy for aria2 BT traffic (you know, I mean, well, to avoid DMCA stuff), plus it is only a part of aria2 and connectionless UDP should be easier, my attempt is started at aria2 BT UDP SOCKS proxy.

The first and maybe most difficulty we faced is code of aria2, all laying flat in src folder, with so few comments. ls | wc -l = 1355 sources and headers make it really hard to jump across them without searching.

Another feature of aria2 code is, as a C++ library and inheritting C++ tradition, it ships no 3rd party dependency as network helper, using socket provided by OS straightforward to send and receive. It wrapped raw OS-specified socket into class SocketCore, which works like Linux socket more, taking SOCK_STREAM or SOCK_DGRAM to work on TCP or UDP respectively.

To figure out code required modification to achieve our goal, which also means to find where it works on UDP in BT code, I searched SOCK_DGRAM and looped through all found points. Finally selected code is class DHTConnectionImpl.

class DHTConnectionImpl implements interface class DHTConnection, and is instantiated in class DHTSetup and passed into class DHTMessageFactoryImpl and class DHTInteractionCommand.

class DHTMessageFactoryImpl is used by many classes, including UDPTrackerClient surprisingly. The comment around it says:

For now, UDPTrackerClient was enabled along with DHT

which means if we got class DHTConnectionImpl done, we got both DHT connections (which are control packets among peers) and UDP trackers done. Similar things can also be found in class DHTInteractionCommand which says:

TODO This name of this command is misleading, because now it also handles UDP trackers as well as DHT.

Except these two UDP connections, other UDP connections of BT left are only LPD (local peer discovery). LPD should be disabled if you want to proxy BT traffic, because at the moment local peers are local for proxy server other than you, which makes them no (or less) advantages compared to other peers as proxy consumption should be the main part. While aria2 BT TCP accepts HTTP proxy (we will talk about it later) according to this issue answered by the main author of aria2 tatsuhiro-t (contribution 1,017,553 ++, 783,132 --), if we got class DHTConnectionImpl done, we can say BT in aria2 can fully go through proxy.

Modification for BT UDP

SOCKS proxy unlike HTTP proxy which is in application layer, is in transport layer, works differently for TCP and UDP, and requires to send and modify transport layer packets. To add SOCKS proxy support for aria2 BT UDP, we have to get the socket used by it to access raw transport layer packets.

Fortunately class DHTConnectionImpl just has a field to store the socket used by itself. So what we are going to do is add a new class DHTConnectionSocksProxyImpl inheritting from class DHTConnectionImpl and overwriting its sendMessage and receiveMessage method, and adding a new method startProxy to initialize SOCKS proxy stuff.

Here we do not add the SOCKS proxy logic into DHTConnectionImpl straightforward because the logic is not short, even requiring a new method startProxy to run all of it. The SOCKS proxy logic should be added on demond.

SOCKS proxy like HTTP proxy requires address, username, password options. We add these options referring to HTTP proxy options by searching HTTP_PROXY and looping thought all found code. This, including option validation, is much easier than previous work thanks to utilities of HTTP proxy options.

The final implementation is available on my PR to aria2.

BTW, in SOCKS proxy RFC RFC1928, page 7, it says:

The DST.ADDR and DST.PORT fields contain the address and port that the client expects to use to send UDP datagrams on for the association.

Do not miss the word ON, which means DST.ADDR and DST.PORT for SOCKS UDP proxy is like local bind address and port.

BTW, again, one interesting thing of C++11, which is the C++ version requirement of aria2, is that, while C++11 has std::make_shared, std::make_unique surprisingly is added in C++14, so if you want to use make_unique in aria2 code, you should include a2functional.h which implements it.

Code Analysis of HTTP

After supporting SOCKS proxy for aria2 BT UDP, I then continue to work on SOCKS proxy support for general HTTP in aria2. It is a tough trip and I have not fully figure out how to implement it.

In src folder we can see many classes with suffix Command, which after reading we can know that aria2 uses event loop to process all actions, which is normal and wise for a modern network library and a language without async/await helper (which is C++11).

The entry point is class HttpInitiateConnectionCommand. It has a comment to illustrate the workflow of HTTP, helped me to understand the event loop wokring way, which is that a Command class as an event, runs its stuff, then calls method getNextCommand and returns true to get the next event, or calls method addCommandSelf and returns false to rerun self and wait. Class specified in method getNextCommand can lead us to the next class we need to find and read.

In class HttpInitiateConnectionCommand we can also see pooled socket even with proxy goes to class HttpRequestCommand like direct connection. Maybe it means socket with proxy will not be reused, but I am not sure. Anyway, let us first work on simple situation.

As for HTTP, class HttpInitiateConnectionCommand instantiate class HttpProxyRequestConnectChain, which goes to class HttpProxyRequestCommand to start the connection chain. The head of the chain about HTTP proxy is HttpProxyRequestCommand -> HttpProxyResponseCommand -> HttpRequestCommand, and the following is normal HTTP stuff. HttpProxyRequestCommand -> HttpProxyResponseCommand is actually to send CONNECT to the proxy server to start the HTTP tunnel, and wait to receive the response. HttpProxyRequestCommand with HttpProxyResponseCommand does not have actual logic. The actual HTTP proxy logic is at their parent class AbstractProxyRequestCommand, in method executeInternal. In it we can see other than overwriting the destination of sending and receiving, it creates a request as a proxy request and sends it with methods setProxyRequest and sendProxyRequest of class HttpConnection.

We can add a new class HttpConnectionSocksProxy like class DHTConnectionSocksProxyImpl to inherit from class HttpConnection. Problem is unlike class DHTConnectionImpl, there is not clear methods to override.

Possible Modification for HTTP

My alpha plan is to add new connection chain class SocksProxyRequestConnectChain. In class HttpInitiateConnectionCommand if SOCKS proxy is specified, then go to this new connection chain. The head of the chain is SocksProxyRequestCommand -> HttpRequestCommand, then the following is normal HTTP stuff. In new class SocksProxyRequestCommand other than simplely inheriting from AbstractProxyRequestCommand, we also override method executeInternal to add our SOCKS proxy initialization logic like DHTConnectionSocksProxyImpl::startProxy.

The remaining is clearing class HttpConnection proxy flag to make it work like no proxy, but we overwrite the packet destination to send packets to the proxy server.

SOCKS TCP proxy is easier than UDP proxy according to RFC, especially for HTTP only. Unlike UDP proxy that needs to add a new header for the packet, TCP proxy just need to send the traffic via the connection to the proxy server other than to the destination. Though for FTP it requires extra steps, let us make HTTP work first.

An attempt unfinished modification can be found on my aria2 fork socks-proxy branch. Hope my analysis can help someone to complete the work.

Summary

From above we can see that aria2 cares application layer much more than tranport layer. It does not provide easy hooks to override tranport layer actions. So when we are going to add extra steps in tranport layer, we will find out that much application layer stuff also starts to bother us, making us hard to step out.

I do not mean it is bad. Foucing on application layer, implementing most protocols and features, aria2 has proved it IS wise. And if you are careful enough and know much enough about the codebase, it should be possible (even not so difficult maybe) to add transport layer actions for aria2, like SOCKS proxy support.

However, well, now I am a little tired and going to stop for some time.

Footnotes

  1. All SOCKS in the post means SOCKS5, specified in RFC1928. SOCKS4 and SOCKS4a are indeed rarely used. (personally speaking I see only support in some softwares, never seeing someone using it.)

Copyright (C) 2020, 2021, 2022 myl7
The posts are licensed under CC BY-SA 4.0 by default unless otherwise explicitly stated.
The posts with different licenses would contain a section named License to indicate their respective licenses.
The website source code and raw post text/image files are available on myl7/mylmoe
The website favicon is made and authorized for the use by Freepik from flaticon.com