TL;DR: How can I connect a local driver to a Spark cluster through a SOCKS proxy?
We have an on-site Spark cluster behind a firewall that blocks most ports. We have SSH access, so I can create a SOCKS proxy with ssh -D 7777 ....
It works fine for browsing the web UIs when my browser uses the proxy, but I do not know how to make a local driver use it.
So far I have this, which obviously does not configure any proxy:
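import org.apache.spark.{SparkConf, SparkContext}

// Connects directly to the master; no proxy is configured here.
val sconf = new SparkConf()
  .setMaster("spark://masterserver:7077")
  .setAppName("MySpark")
val sc = new SparkContext(sconf)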
This logs the following messages 16 times before throwing an exception:
15/01/20 14:43:34 INFO Remoting: Starting remoting
15/01/20 14:43:34 ERROR NettyTransport: failed to bind to server-name/ip.ip.ip.ip:0, shutting down Netty transport
15/01/20 14:43:34 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/01/20 14:43:34 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
15/01/20 14:43:34 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/01/20 14:43:34 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
Your best bet may be to forward a local port to port 7077 on the master, and then call setMaster("spark://localhost:nnnn"), where nnnn is the local port you have forwarded.
To do this, use ssh -L (instead of -D).
I cannot guarantee that this will work, or that it will keep working if it does, but it at least spares you from using an actual proxy for this one port. What is most likely to break it is any secondary connection that the initial connection triggers. I haven't tested this yet, but unless there are secondary connections, it should work in principle.
Also, this doesn't answer the TL;DR version of your question, but since you have SSH access, it's more likely to work.
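As a rough sketch of the port-forwarding idea, the helper below builds the master URL for a forwarded port and checks that something is listening on it before a SparkContext is created. The object name ForwardedMaster, its two methods, and the local port 7078 are made up for illustration; only masterserver and port 7077 come from the question. It assumes a tunnel opened with something like ssh -L 7078:masterserver:7077 ..., and it uses only the JDK, so it cannot tell whether Spark itself will accept the connection.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Hypothetical helper; the names and the example port 7078 are illustrative only.
object ForwardedMaster {

  // Master URL for a tunnel opened with, e.g.:
  //   ssh -L 7078:masterserver:7077 ...
  def masterUrl(localPort: Int): String = {
    require(localPort > 0 && localPort <= 65535, s"invalid port: $localPort")
    s"spark://localhost:$localPort"
  }

  // Returns true if a TCP connection to localhost:localPort succeeds,
  // i.e. the ssh tunnel (or anything else) is listening on that port.
  def tunnelIsUp(localPort: Int, timeoutMs: Int = 2000): Boolean =
    Try {
      val socket = new Socket()
      try socket.connect(new InetSocketAddress("localhost", localPort), timeoutMs)
      finally socket.close()
    }.isSuccess
}
```

With the tunnel open, the driver from the question would call .setMaster(ForwardedMaster.masterUrl(7078)) instead of hard-coding spark://masterserver:7077. A successful tunnelIsUp check only shows that the first hop is reachable. The cluster may still need to open connections back to the driver, and a single forwarded port does not cover those.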