# Zipline in the Cloud

Thomas Wiecki

Source: http://www.mountainexposures.co.uk

• PhD student at Brown University.
• Computational Cognitive Neuroscience.
• Quantitative Researcher at Quantopian.

• An algorithm makes trading decisions automatically, based on external events (price, volume, Twitter trends, ...)

## Zipline http://zipline.io/

• Open-source backtester written in Python.
• Transaction costs
• Slippage
• Streaming of financial data
• Batteries included: moving average, alpha, beta, Sharpe ratio...
• Used in production on Quantopian.

## Story-Time

Article on a trading strategy, posted on the Quantopian forums:

In [9]:
from IPython.display import HTML
HTML("<iframe src=https://www.quantopian.com/posts/olmar-implementation-fixed-bug width=1100 height=350></iframe>")

Out[9]:

Here, I ported the algorithm to Zipline:

In [2]:
import numpy as np
import zipline
from zipline.transforms import MovingAverage

STOCKS = ['AMD', 'CERN', 'COST', 'DELL', 'GPS', 'INTC', 'MMM']

class OLMAR(zipline.TradingAlgorithm):
    def initialize(self, eps=1, window_length=5):
        self.stocks = STOCKS
        self.m = len(self.stocks)
        self.price = {}
        self.b_t = np.ones(self.m) / self.m
        self.last_desired_port = np.ones(self.m) / self.m
        self.eps = eps
        self.init = True
        self.days = 0
        self.window_length = window_length
        # NOTE: the start of this call was cut off in the export; it is
        # reconstructed here assuming Zipline's add_transform/MovingAverage API.
        self.add_transform(MovingAverage, 'mavg', ['price'],
                           window_length=window_length)

        self.set_slippage(zipline.finance.slippage.FixedSlippage())
        self.set_commission(zipline.finance.commission.PerShare(cost=0))

    def handle_data(self, data):
        # Wait until the moving-average window is filled
        self.days += 1
        if self.days < self.window_length:
            return

        # First trading day: buy the initial equal-weight portfolio
        if self.init:
            self.rebalance_portfolio(data, self.b_t)
            self.init = False
            return

        m = self.m

        x_tilde = np.zeros(m)
        b = np.zeros(m)

        # find relative moving average price for each security
        for i, stock in enumerate(self.stocks):
            price = data[stock].price
            # Relative mean deviation
            x_tilde[i] = data[stock]['mavg']['price'] / price

        ###########################
        # Inside of OLMAR (algo 2)
        x_bar = x_tilde.mean()

        # market relative deviation
        mark_rel_dev = x_tilde - x_bar

        # Expected return with current portfolio
        exp_return = np.dot(self.b_t, x_tilde)
        weight = self.eps - exp_return

        variability = (np.linalg.norm(mark_rel_dev))**2

        if variability == 0.0:
            step_size = 0 # no portfolio update
        else:
            step_size = max(0, weight/variability)

        b = self.b_t + step_size*mark_rel_dev
        b_norm = simplex_projection(b)

        #print b_norm
        self.rebalance_portfolio(data, b_norm)

        # Predicted return with new portfolio
        pred_return = np.dot(b_norm, x_tilde)

        # update portfolio
        self.b_t = b_norm

    def rebalance_portfolio(self, data, desired_port):
        # rebalance portfolio to the desired weights
        desired_amount = np.zeros_like(desired_port)
        current_amount = np.zeros_like(desired_port)
        prices = np.zeros_like(desired_port)

        if self.init:
            positions_value = self.portfolio.starting_cash
        else:
            positions_value = self.portfolio.positions_value + self.portfolio.cash

        for i, stock in enumerate(self.stocks):
            current_amount[i] = self.portfolio.positions[stock].amount
            prices[i] = data[stock].price

        # number of shares per stock implied by the desired weights
        desired_amount = np.round(desired_port * positions_value / prices)
        diff_amount = desired_amount - current_amount

        for i, stock in enumerate(self.stocks):
            self.order(stock, diff_amount[i]) #order_stock

def simplex_projection(v, b=1):
    r"""Project a vector onto the simplex domain.

    Implemented according to the paper: Efficient projections onto the
    l1-ball for learning in high dimensions, John Duchi, et al. ICML 2008.
    Implementation Time: 2011 June 17 by Bin@libin AT pmail.ntu.edu.sg
    Optimization Problem: min_{w}\| w - v \|_{2}^{2}
    s.t. sum_{i=1}^{m} w_{i} = b, w_{i}\geq 0

    Input: A vector v \in R^{m}, and a scalar b > 0 (default=1)
    Output: Projection vector w

    :Example:
    >>> proj = simplex_projection([.4 ,.3, -.4, .5])
    >>> print proj
    array([ 0.33333333, 0.23333333, 0. , 0.43333333])
    >>> print proj.sum()
    1.0

    Original matlab implementation: John Duchi (jduchi@cs.berkeley.edu)
    Python-port: Copyright 2012 by Thomas Wiecki (thomas.wiecki@gmail.com).
    """

    v = np.asarray(v)
    p = len(v)

    # Clip negative entries to zero, then sort v into u in descending order
    v = (v > 0) * v
    u = np.sort(v)[::-1]
    sv = np.cumsum(u)

    # Largest index whose sorted entry stays positive after the shift
    rho = np.where(u > (sv - b) / np.arange(1, p+1))[0][-1]
    theta = np.max([0, (sv[rho] - b) / (rho+1)])
    w = (v - theta)
    w[w<0] = 0
    return w
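
To see the update rule in isolation, here is a minimal NumPy-only sketch of a single OLMAR step on made-up numbers. The names `project_to_simplex` and `olmar_step` and all input values are illustrative assumptions, not part of Zipline or the forum post; the projection is the standard sort-based method from Duchi et al., written compactly and without the initial clipping step used above.

```python
import numpy as np


def project_to_simplex(v, b=1.0):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = b} (sort-based)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                         # entries in descending order
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u > (css - b) / k)[0][-1]   # last index meeting the condition
    theta = (css[rho] - b) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)


def olmar_step(b_t, x_tilde, eps):
    """One update: shift weight toward assets with a high mavg/price ratio."""
    dev = x_tilde - x_tilde.mean()               # market relative deviation
    variability = np.dot(dev, dev)
    if variability == 0.0:
        step = 0.0                               # no portfolio update
    else:
        step = max(0.0, (eps - np.dot(b_t, x_tilde)) / variability)
    return project_to_simplex(b_t + step * dev)


b_t = np.ones(3) / 3                             # start from equal weights
x_tilde = np.array([1.10, 1.00, 0.90])           # made-up mavg/price ratios
b_next = olmar_step(b_t, x_tilde, eps=1.05)
```

In this toy case the new weights still sum to one, and the asset trading furthest below its moving average receives the largest weight.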


## Simplified version

In [ ]:
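import zipline
from zipline.transforms import MovingAverage

class OLMAR(zipline.TradingAlgorithm):
    def initialize(self, eps=1, window_length=5):
        # How aggressively should we rebalance the portfolio
        self.eps = eps
        # How many days to compute the moving average over
        self.window_length = window_length
        # Include mavg transform into simulation. Makes it available
        # in handle_data() below.
        # NOTE: the start of this call was cut off in the export; it is
        # reconstructed here assuming Zipline's add_transform/MovingAverage API.
        self.add_transform(MovingAverage, 'mavg', ['price'],
                           window_length=window_length)

    def handle_data(self, data):
        # Schematic only: deviation_from_mavg and current_portfolio stand in
        # for the bookkeeping done in the full version above.
        # Calculate how much each price diverged from its moving average
        for stock in data:
            deviation_from_mavg[stock] = data[stock].price - data[stock].mavg

        # Calculate new portfolio allocation
        desired_portfolio = current_portfolio + self.eps * deviation_from_mavg

        # Rebalance portfolio
        self.rebalance_portfolio(desired_portfolio)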
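
The moving-average step can also be tried outside Zipline. The sketch below assumes a small made-up price matrix (rows are days, columns are assets) and a hypothetical helper `trailing_mean`; it computes the ratio of moving average to latest price, which the full algorithm calls `x_tilde`.

```python
import numpy as np


def trailing_mean(prices, window_length):
    """Mean of the last `window_length` rows (days) for each column (asset)."""
    prices = np.asarray(prices, dtype=float)
    if len(prices) < window_length:
        raise ValueError("not enough history for the requested window")
    return prices[-window_length:].mean(axis=0)


# Made-up closing prices: 5 days x 2 assets.
prices = np.array([
    [10.0, 20.0],
    [11.0, 19.0],
    [12.0, 18.0],
    [13.0, 17.0],
    [14.0, 16.0],
])

mavg = trailing_mean(prices, window_length=5)   # per-asset 5-day average
x_tilde = mavg / prices[-1]                     # > 1 means price is below its average
```

Here the first asset has risen above its average (ratio below one) and the second has fallen below it (ratio above one), so a mean-reversion strategy would favor the second.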


### Import data from Yahoo Finance.

In [3]:
import datetime

start = datetime.datetime(2001, 8, 1)
end = datetime.datetime(2013, 2, 1)

In [4]:
data = zipline.utils.load_from_yahoo(stocks=STOCKS, indexes={}, start=start, end=end)

AMD
CERN
COST
DELL
GPS
INTC
MMM

In [6]:
data.head()

Out[6]:
                             AMD   CERN   COST   DELL    GPS   INTC    MMM
Date
2001-08-01 00:00:00+00:00  18.90  13.51  35.70  26.75  23.42  24.23  41.88
2001-08-02 00:00:00+00:00  19.76  13.47  36.67  27.98  23.55  25.30  41.99
2001-08-03 00:00:00+00:00  19.25  13.44  36.35  27.63  23.25  24.97  42.16
2001-08-06 00:00:00+00:00  17.62  13.88  35.18  27.40  23.03  23.87  41.27
2001-08-07 00:00:00+00:00  17.55  13.92  35.63  27.27  23.03  24.14  41.83

## Let's see how it performs...

In [4]:
def run_olmar(eps=1, window_length=5):
    olmar = OLMAR(eps=eps, window_length=window_length)
    results = olmar.run(data)
    return results.portfolio_value

In [12]:
results_eps2 = run_olmar(eps=2)

[2013-03-07 22:11] INFO: Transform: Running StatefulTransform [mavg]
[2013-03-07 22:12] INFO: Transform: Finished StatefulTransform [mavg]
[2013-03-07 22:12] INFO: Performance: Simulated 2893 trading days out of 2893.
[2013-03-07 22:12] INFO: Performance: first open: 2001-08-01 13:30:00+00:00
[2013-03-07 22:12] INFO: Performance: last close: 2013-02-01 21:00:00+00:00

In [14]:
results_eps2.plot()

Out[14]:
<matplotlib.axes.AxesSubplot at 0xba48b8c>

## Can we do better?

In [25]:
results_eps1 = run_olmar(eps=1)

[2013-03-07 22:44] INFO: Transform: Running StatefulTransform [mavg]
[2013-03-07 22:45] INFO: Transform: Finished StatefulTransform [mavg]
[2013-03-07 22:45] INFO: Performance: Simulated 2893 trading days out of 2893.
[2013-03-07 22:45] INFO: Performance: first open: 2001-08-01 13:30:00+00:00
[2013-03-07 22:45] INFO: Performance: last close: 2013-02-01 21:00:00+00:00

In [26]:
results_eps2.plot()
results_eps1.plot(c='g')

Out[26]:
<matplotlib.axes.AxesSubplot at 0xd6b49cc>

## Parameter Optimization

Find the values of eps and window_length that maximize the objective function (here, cumulative wealth).

## A solved problem?

Lots of optimization algorithms exist, for example:

• Hill climbing / Gradient ascent
• Convex optimization

## Requirements and Properties:

• Continuous, differentiable.
• Convex
• Serial

## Our objective function:

• Might be discontinuous: window_length is discrete.
• Might be multi-modal.
• Slow to compute -> need parallelism.
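
To illustrate why local search struggles with such an objective, here is a toy sketch. The function `toy_objective` is a made-up stand-in with two peaks and an integer argument, not the OLMAR backtest, and `hill_climb` is a plain greedy neighbor search written for this example.

```python
def toy_objective(window_length):
    """Made-up bimodal objective on integers: local peak at 5, global peak at 20."""
    local_peak = 3.0 - abs(window_length - 5)
    global_peak = 10.0 - abs(window_length - 20)
    return max(local_peak, global_peak)


def hill_climb(f, start, lo, hi):
    """Greedy search: move to the better neighbor until no neighbor improves."""
    x = start
    while True:
        neighbors = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = max(neighbors, key=f)
        if f(best) <= f(x):
            return x
        x = best


# Starting near the smaller peak, the greedy search never leaves it ...
found = hill_climb(toy_objective, start=3, lo=1, hi=30)
# ... while an exhaustive scan of the same range reaches the larger one.
best_by_grid = max(range(1, 31), key=toy_objective)
```

The greedy search stops at the first peak it reaches, which is the motivation for methods that look at the whole search space.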

## Support Vector Machine

• Parameters:
  • Normal-vector of hyperplane $\leftarrow$ Convex
• Hyper-parameters:
  • Soft margin parameter C.
  • Kernel-type (RBF, polynomial) and associated parameter.

## Computational Neuroscience

• Parameters:
  • Weights $\leftarrow$ Continuous.
• Hyper-parameters:
  • Network architecture (number of hidden units and layers)

## OLMAR

• Parameters:
  • Portfolio-allocation.
• Hyper-parameters:
  • eps and window_length.

# The problem thus far

• Large-scale models.
• Expensive to compute.
• No knowledge of how the objective function changes w.r.t. the hyper-parameters.

## Required

• Parallelism
• Grid-search
• Bayesian Optimization
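
A grid search over both hyper-parameters can be sketched in a few lines. Everything here is illustrative: `toy_wealth` is a made-up stand-in for a backtest, and `grid_search` is a hypothetical helper. The `map_fn` argument is the hook for parallelism; the built-in `map` is used here, and a parallel map with the same calling convention could presumably be passed instead.

```python
import itertools


def toy_wealth(params):
    """Made-up stand-in for a backtest; peaks at eps=10, window_length=5."""
    eps, window_length = params
    return -(eps - 10.0) ** 2 - (window_length - 5) ** 2


def grid_search(objective, eps_values, window_lengths, map_fn=map):
    """Evaluate every (eps, window_length) pair; return the best pair and score."""
    grid = list(itertools.product(eps_values, window_lengths))
    scores = list(map_fn(objective, grid))       # one evaluation per grid point
    best_index = max(range(len(grid)), key=scores.__getitem__)
    return grid[best_index], scores[best_index]


best_params, best_score = grid_search(
    toy_wealth,
    eps_values=[1.0, 5.0, 10.0, 20.0],
    window_lengths=[3, 5, 10],
)
```

Because every grid point is evaluated independently, the work splits cleanly across workers; the cost is that the number of evaluations grows multiplicatively with each added hyper-parameter.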

# Parallel Computing in Python

## IPython Parallel

• Multi-CPU, multi-host
• Works with various backends (MPI, SSH)

### Launch IPython cluster.

• command line: ipcluster
• IPython notebook

### Connect to master

In [9]:
from IPython.parallel import Client
client = Client()
queue = client.direct_view()
print "Available workers: ", len(queue)

Available workers:  2

In [10]:
squared = queue.map_sync(lambda x: x**2, [1, 2, 3, 4])
print squared

[1, 4, 9, 16]
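
For readers without a running cluster, the same map-style pattern can be tried with the Python 3 standard library. This is a swapped-in analogue, not IPython Parallel: `concurrent.futures.ThreadPoolExecutor` runs the function in local threads, whereas the direct view above distributes the calls over separate engine processes.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x ** 2


with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map returns results in input order
    squared = list(pool.map(square, [1, 2, 3, 4]))
```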


The %%px cell magic executes a cell on all engines:

In [3]:
%%px
import numpy as np
print np.linspace(0, 1, 5)

[stdout:0] [ 0.    0.25  0.5   0.75  1.  ]
[stdout:1] [ 0.    0.25  0.5   0.75  1.  ]


# Great -- can run models in parallel on my laptop!

In [5]:
!cat ~/.starcluster/config_pydata

[global]
INCLUDE=~/.starcluster/aws, ~/.starcluster/keys
DEFAULT_TEMPLATE=pydata

[cluster pydata]
KEYNAME = mykey
CLUSTER_SIZE = 20
NODE_INSTANCE_TYPE = c1.xlarge
NODE_IMAGE_ID = ami-8e38a4e7 # zipline specific
PLUGINS = ipcluster

[plugin ipcluster]
SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster

In [4]:
import time

tic = time.time()

!starcluster -c ~/.starcluster/config_pydata start pydata

toc = time.time()

StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Using default cluster template: pydata
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 20-node cluster...
>>> Creating security group @sc-pydata...
Reservation:r-e5889e9f
>>> Waiting for cluster to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 1.542 mins
>>> The master node is ec2-75-101-213-6.compute-1.amazonaws.com
>>> Configuring cluster...
>>> Running plugin starcluster.clustersetup.DefaultClusterSetup
>>> Configuring hostnames...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Creating cluster user: sgeadmin (uid: 1001, gid: 1001)
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring scratch space for user(s): sgeadmin
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring /etc/hosts on each node
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Starting NFS server on master
>>> Configuring NFS exports path(s):
/home
>>> Mounting all NFS export path(s) on 19 worker node(s)
19/19 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting up NFS took 0.078 mins
>>> Configuring passwordless ssh for root
>>> Running plugin starcluster.plugins.sge.SGEPlugin
>>> Configuring SGE...
>>> Configuring NFS exports path(s):
/opt/sge6
>>> Mounting all NFS export path(s) on 19 worker node(s)
19/19 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting up NFS took 0.090 mins
>>> Installing Sun Grid Engine...
19/19 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Creating SGE parallel environment 'orte'
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Adding parallel environment 'orte' to queue 'all.q'
>>> Running plugin ipcluster
>>> Writing IPython cluster config files
>>> Starting the IPython controller and 7 engines on master
>>> Waiting for JSON connector file...  | / 
/home/whyking/.starcluster/ipcluster/SecurityGroup:@sc-pydata-us-east-1.json 100% || Time: 00:00:00  87.61 M/s
>>> Authorizing tcp ports [1000-65535] on 0.0.0.0/0 for: IPython controller
>>> Adding 152 engines on 19 nodes
19/19 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> IPCluster has been started on SecurityGroup:@sc-pydata for user 'sgeadmin'
with 159 engines on 20 nodes.

To connect to cluster from your local machine use:

from IPython.parallel import Client
client = Client('/home/whyking/.starcluster/ipcluster/SecurityGroup:@sc-pydata-us-east-1.json', sshkey='/home/whyking/.ssh/mykey.rsa')

See the IPCluster plugin doc for usage details:
http://star.mit.edu/cluster/docs/latest/plugins/ipython.html

>>> IPCluster took 0.632 mins
>>> Configuring cluster took 2.485 mins
>>> Starting cluster took 4.077 mins

The cluster is now ready to use. To login to the master node
as root, run:

    $ starcluster sshmaster pydata

If you're having issues with the cluster you can reboot the
instances and completely reconfigure the cluster from
scratch using:

    $ starcluster restart pydata

When you're finished using the cluster and wish to terminate
it and stop paying for service:

    $ starcluster terminate pydata

Alternatively, if the cluster uses EBS instances, you can
use the 'stop' command to shutdown all nodes and put them
into a 'stopped' state preserving the EBS volumes backing
the nodes:

    $ starcluster stop pydata

WARNING: Any data stored in ephemeral storage (usually /mnt)
will be lost!

You can activate a 'stopped' cluster by passing the -x
option to the 'start' command:

    $ starcluster start -x pydata

This will start all 'stopped' nodes and reconfigure the
cluster.

In [6]:
print "Took secs:", toc-tic

Took secs: 246.641336918

In [ ]:
from IPython.parallel import Client
client_sc = Client('/home/whyking/.starcluster/ipcluster/SecurityGroup:@sc-pydata-us-east-1.json', sshkey='/home/whyking/.ssh/mykey.rsa')

In [16]:
queue_sc = client_sc.direct_view()
print "Available workers: ", len(queue_sc)

Available workers:  157

In [2]:
queue_sc.map_sync(lambda x: x**2, [1, 2, 3, 4, 5, 6])

Out[2]:
[1, 4, 9, 16, 25, 36]

## Putting it all together.

In [12]:
eps = np.linspace(1.1, 40, len(queue_sc))
print eps

[  1.1          1.89387755   2.6877551    3.48163265   4.2755102
5.06938776   5.86326531   6.65714286   7.45102041   8.24489796
9.03877551   9.83265306  10.62653061  11.42040816  12.21428571
13.00816327  13.80204082  14.59591837  15.38979592  16.18367347
16.97755102  17.77142857  18.56530612  19.35918367  20.15306122
20.94693878  21.74081633  22.53469388  23.32857143  24.12244898
24.91632653  25.71020408  26.50408163  27.29795918  28.09183673
28.88571429  29.67959184  30.47346939  31.26734694  32.06122449
32.85510204  33.64897959  34.44285714  35.23673469  36.03061224
36.8244898   37.61836735  38.4122449   39.20612245  40.        ]

In [8]:
wealth = queue_sc.map_sync(run_olmar, eps)

In [9]:
import matplotlib.pyplot as plt
plt.plot(eps, wealth)

Out[9]:
[<matplotlib.lines.Line2D at 0xcf829ec>]

# Optimization techniques

Blog post

## Bayesian Optimization via Gaussian Processes.

See Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. arXiv:1206.2944. http://arxiv.org/abs/1206.2944
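
The idea can be sketched with NumPy alone. This is a toy illustration under stated assumptions, not the method of the paper: it uses a fixed squared-exponential kernel, a zero-mean Gaussian process, and an upper-confidence-bound rule to pick the next point, whereas Snoek et al. also treat the kernel hyper-parameters and use expected improvement. The names `rbf_kernel`, `gp_posterior` and `toy_objective` are made up for this example.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_oo, k_on)          # K^-1 k(x_obs, x_new)
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_on * solved, axis=0)     # prior variance is 1 for this kernel
    return mean, np.sqrt(np.clip(var, 0.0, None))


def toy_objective(x):
    """Made-up smooth 1-D objective with its maximum at x = 2."""
    return -(x - 2.0) ** 2


candidates = np.linspace(0.0, 5.0, 101)
x_obs = np.array([0.0, 5.0])                      # two initial evaluations
y_obs = toy_objective(x_obs)

for _ in range(10):
    mean, std = gp_posterior(x_obs, y_obs, candidates)
    ucb = mean + 2.0 * std                        # favor high mean or high uncertainty
    x_next = candidates[np.argmax(ucb)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, toy_objective(x_next))

best_x = x_obs[np.argmax(y_obs)]
```

Each iteration spends one evaluation of the (expensive) objective where the model is either promising or uncertain, which is why this family of methods suits slow objective functions.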

## Also supports parallel optimization

Open-source implementation

## HyperOpt package

In [ ]:
# define a search space
from hyperopt import hp
space = hp.choice('a',
    [
        ('case 1', 1 + hp.lognormal('c1', 0, 1)),
        ('case 2', hp.uniform('c2', -10, 10))
    ])


# Conclusions

• Optimizing large-scale models is a different game.
• Technical solutions:
  • IPython Parallel
  • StarCluster
• Methodological solutions:
  • Can borrow from Machine Learning.

# Thanks!

Thomas Wiecki

email: twiecki@quantopian.com