Friday, July 31, 2009

Price Path Probability (Again)

So I completed a model and calibration procedure that determines the probability of a price trading through a level within a given time period. If one can arrive at a high confidence level, this is incredibly useful for multi-leg execution and as a prop strategy in its own right.

The model uses an SDE with mean reversion, a trend component, and an evolving distribution to describe the price across time. The SDE is evaluated as a Monte Carlo simulation on a grid. We determine the conditional probability of moving from one price level to the next over a given (small) time interval. Summing, over all paths, the product of the transition probabilities along each path gives the probability of being at a given price node at a given time.
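To make the mechanics concrete, here is a minimal sketch of the idea (not the production model): a mean-reverting SDE with a trend term, simulated by Monte Carlo, from which one can query the probability of touching a level within a horizon. The parameters (kappa, mu, trend, sigma) and the Gaussian innovations are simplifying assumptions; as discussed below, the real model replaces the Gaussian with an evolving empirical distribution, and tabulates transition probabilities on a discrete price/time grid rather than storing raw paths.

```python
import numpy as np

def simulate_paths(s0, kappa, mu, trend, sigma, dt, n_steps, n_paths, seed=0):
    """Monte Carlo simulation of a mean-reverting SDE with a trend term:

        dS = kappa * (mu + trend * t - S) * dt + sigma * dW

    Returns simulated paths, shape (n_paths, n_steps + 1).
    NOTE: Gaussian innovations are a simplifying assumption for this sketch.
    """
    rng = np.random.default_rng(seed)
    paths = np.empty((n_paths, n_steps + 1))
    paths[:, 0] = s0
    for i in range(n_steps):
        t = i * dt
        drift = kappa * (mu + trend * t - paths[:, i]) * dt
        shock = sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
        paths[:, i + 1] = paths[:, i] + drift + shock
    return paths

def prob_touch(paths, level, step_horizon, above=True):
    """P(price touches `level` at or before `step_horizon`), estimated as the
    fraction of paths whose running max (or min) crosses the level."""
    window = paths[:, : step_horizon + 1]
    touched = window.max(axis=1) >= level if above else window.min(axis=1) <= level
    return touched.mean()

# Example query: probability of trading up through 101.0 within 30 steps.
paths = simulate_paths(s0=100.0, kappa=2.0, mu=100.0, trend=0.5,
                       sigma=1.5, dt=1.0 / 390, n_steps=390, n_paths=50_000)
print(prob_touch(paths, level=101.0, step_horizon=30, above=True))
```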

With the grid in hand, one can query it to determine the probability of a price being above or below a level within a given time, and so on. For some markets we are seeing a 75% confidence level, meaning we are right three quarters of the time. There were some markets where the approach showed no distinct edge; I have ideas on how to adjust for this, but have not had the time to revisit them.

The evolution of the distribution was the most complex part to model. Unlike idealized option models, where the distribution is stationary and generally Gaussian, the observed intra-day distribution over short periods is neither Gaussian nor stationary. We noted that the first 3 or 4 moments have dynamics which can be modelled and fitted on top of an empirical distribution.
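As a rough illustration of what "moment dynamics on top of an empirical distribution" could look like (this is my sketch, not the calibrated model), one can bucket short-horizon returns by time of day, compute the first four sample moments per bucket, and fit a simple parametric curve to each moment's evolution:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def moment_dynamics(returns, bucket_ids, n_buckets, degree=2):
    """Estimate how the first four sample moments evolve intraday.

    returns    : array of short-horizon returns
    bucket_ids : time-of-day bucket index (0 .. n_buckets-1) for each return
    Returns a dict of low-order polynomial fits, one per moment -- a crude
    stand-in for whatever dynamics model is actually fitted.
    """
    stats = {"mean": [], "var": [], "skew": [], "kurt": []}
    for b in range(n_buckets):
        r = returns[bucket_ids == b]
        stats["mean"].append(r.mean())
        stats["var"].append(r.var())
        stats["skew"].append(skew(r))
        stats["kurt"].append(kurtosis(r))

    t = np.arange(n_buckets)
    return {k: np.polyfit(t, np.asarray(v), degree) for k, v in stats.items()}
```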

The trending and mean reversion functions were fitted using a maximum likelihood estimate, which was easily obtained from the distribution for each time step under given assumptions. The parameters were evolved with a GA to maximize the likelihood.
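For concreteness, here is a sketch of the per-step likelihood under a simplifying Gaussian-transition assumption (the real model evaluates the likelihood against the fitted empirical distribution; the parameter names below are mine):

```python
import numpy as np

def log_likelihood(params, prices, dt):
    """Log-likelihood of a discretized mean-reverting SDE with trend,
    assuming (for this sketch only) Gaussian transition densities.

    params = (kappa, mu, trend, sigma)
    """
    kappa, mu, trend, sigma = params
    t = np.arange(len(prices) - 1) * dt
    expected = prices[:-1] + kappa * (mu + trend * t - prices[:-1]) * dt
    resid = prices[1:] - expected
    var = sigma ** 2 * dt
    return -0.5 * np.sum(np.log(2 * np.pi * var) + resid ** 2 / var)

# A GA (or any global optimizer) then searches parameter space to maximize
# this function: evaluate log_likelihood for each candidate chromosome
# (kappa, mu, trend, sigma) and breed the best performers.
```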

Switch Hitting

I've used many languages over the years, both functional and imperative. Since I live in the real world (i.e. non-academic), I've generally had to use languages with a large user base, strong momentum, etc. In general my must-have criteria have been:

  • performance
  • access to wide range of libraries, both infrastructural and numerical
  • scalable to large, complex, applications
  • runs on one of the major VMs (JVM or CLR)

I would have liked to have had the following as well:

  • elegance
  • functional programming constructs

Changing Climate
Things have changed over the last few years though. We now have strong functional languages with high performance, access to diverse libraries, and a wider user base. There is now even talk in the wider (non-functional programming) community about functional languages.

On the imperative side I have used (by frequency): Java, C#, Python, C/C++, Fortran, etc. I have probably invested the most in Java at this point, as I have found the complexity of C++ over the years distasteful and time consuming.

I write parallel-targeted numerical models and write trading strategies around them. The problem is that none of the above imperative languages maps well to the way I think about problems, and all of them require a lot of unnecessary scaffolding.

Functional Languages on the VM
There have been some stealth functional language projects (i.e. not all that well known outside of the functional community), such as Clojure for the JVM. Clojure, however, does not have the performance level I am looking for and is not much elevated above Lisp. I love Scheme / Lisp, but feel that it doesn't scale well for large projects.

Relatively new on the scene is Scala. Scala aims to be a statically typed language with OO and functional features, bridging the two paradigms. It has many nice features, but is more verbose than other functional languages. It is certainly much more concise than Java, which is a big plus, but it does have some bizarre syntax and inconsistencies I'm not thrilled with. There are also some performance issues related to for-comprehensions that mean I cannot use it right now. Nevertheless, for the JVM, I think it is the only language I could consider for functional programming.

Enter F#. OK, Microsoft scares me. That said, as with C# and the CLR, they have been leading the pack in innovation. They picked up the Java mantle, fixed its flaws, and have since evolved Java into something much better: C#.

F# is a new language (well, a few years old) for the .NET CLR, based on OCaml and not too distant from Haskell or ML. The language is concise, integrates well with the CLR and its libraries, and performs very well.

Frankly, I think F# is superior to any of the other languages available on the JVM or CLR. My concern now is: is it worthwhile to switch from the JVM to the CLR? I have a large set of libraries written in Java. I am also concerned about the degree of portability, and that the Web 2.0 APIs (which Google is largely defining) tend to be Java / JVM based.

Practical Details
Am I going to have to live in a hybrid world where I write, say, some GWT apps in Java and do my core work on the .NET CLR? Is there any hope of anything like F# on the horizon for the JVM? Should I abandon Eclipse and my other tools and look at Mono tooling, or worse, be required to use Visual Studio?

Give me an F# equivalent on the JVM. Please!

Thursday, July 30, 2009

Trader Bots

I came across this site today. I'm not a huge believer in technical analysis as a basis for trading; however, these guys are doing something interesting. They are generating / seeding strategies as a genetic program, feeding a combination of technical, momentum, and sentiment inputs into a neural net. These are then bred / cross-pollinated to refine further.

The next part is an extrapolation from the very little they have indicated. I suspect they are doing the following:

  1. Generate initial strategies using a random genetic program that selects inputs from a subset of available technical, sentiment, and momentum indicators.
  2. Calibrate to the best possible trading signal (given the inputs) using an ANN (neural net)
  3. Evaluate utility function across some years of historical data
  4. Based on results, refine by breeding the strategies with a GA
  5. Rinse and Repeat
It is an automated approach to strategy discovery, avoiding costly manual research. Though it does not appear to make use of more sophisticated inputs and models, the general approach is nice. It would not be a surprise to find that some of these strategies are successful.
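My guess at the skeleton of such a pipeline follows. This is purely speculative, based on the little they disclose; the indicator names, the single-neuron "net", the utility function, and the breeding scheme are all placeholders of my own.

```python
import random

INDICATORS = ["rsi", "macd", "mom_10", "sentiment", "volume_z"]  # placeholders

def random_strategy(n_inputs=3):
    """A 'chromosome': a subset of indicator inputs plus weights for a
    single-neuron signal (a crude stand-in for their neural net)."""
    inputs = random.sample(INDICATORS, n_inputs)
    return {"inputs": inputs, "weights": [random.uniform(-1, 1) for _ in inputs]}

def signal(strategy, row):
    """Toy signal: weighted sum of the selected indicators, thresholded."""
    s = sum(w * row[i] for w, i in zip(strategy["weights"], strategy["inputs"]))
    return 1 if s > 0 else -1

def utility(strategy, history):
    """Placeholder utility: signal * next-period return summed over history.
    `history` is assumed to be a list of dicts keyed by indicator name,
    plus a hypothetical 'next_ret' field."""
    return sum(signal(strategy, row) * row["next_ret"] for row in history)

def breed(a, b, mutation=0.1):
    """Crossover two strategies and mutate the child's weights.
    (Crude: duplicate inputs in the child are tolerated in this sketch.)"""
    cut = len(a["inputs"]) // 2
    child = {"inputs": a["inputs"][:cut] + b["inputs"][cut:],
             "weights": a["weights"][:cut] + b["weights"][cut:]}
    child["weights"] = [w + random.gauss(0, mutation) for w in child["weights"]]
    return child

def evolve(history, pop_size=50, generations=100):
    pop = [random_strategy() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda s: utility(s, history), reverse=True)
        elite = pop[: pop_size // 4]
        pop = elite + [breed(*random.sample(elite, 2))
                       for _ in range(pop_size - len(elite))]
    return pop[0]
```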

The approach can be expanded to incorporate more sophisticated models as inputs (such as basis function based signal decomposition, stochastic state systems, etc).

Monday, July 27, 2009

Future For Commercial HPC

As I noted in an earlier post I have High Performance Computing requirements. Basically if you can give me thousands of processors, I can use them. The problem with HPC today is that it is one or more of the following (depending on where you are):
  • academic and only open on a limited basis to researchers based on their proposals
  • internal
  • available but not cost effective (8 cores @ $7000 / compute year at Amazon)
This flies in the face of what we know:
  • there are many thousands of under-utilized or un-utilized computers available
  • the true cost of computing power plus ancillary costs (power, people) can be scaled to a much lower number
  • organizations should want to monetize this underutilized capacity

Why do I care about this? Well, I could use cheap computing power today, but also I used to be a parallel algorithm researcher back in the day, so have been waiting for this for a long time.

The solution needs to allow compute resource providers a means to auction their unused resources for blocks of time, immediate or future. HPC users that want to evaluate a massively parallel problem can collect a forward-dated / timed group of nodes for execution, finding a group within their cost range or waiting for lower-cost nodes to become available.

How would this be accomplished? (A rough sketch of what such a contract might look like follows the list.)
  1. Exchanges are set up for geographical areas where providers can offer gflop-hr futures and consumers can buy computing futures or alternatively sell their unused futures.
  2. Contract requires standardized power metrics (SPECfprate2006 for instance)
  3. Contract requires standardized non-CPU resource (min memory, disk)
  4. Standard means of code and data delivery (binary form, encryption, etc)
  5. Safe VM in which to run code
  6. Checkpointing to allow for a computation to be moved (optional)
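As promised above, here is a rough sketch of what such a gflop-hr futures contract might look like. The field names and values are illustrative, not a proposed standard.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ComputeFuture:
    """Illustrative gflop-hr futures contract (field names are my own)."""
    provider_id: str
    region: str                     # delivery region / exchange locale
    start: datetime                 # forward-dated start of the compute window
    hours: float                    # contracted duration
    specfp_rate_2006: float         # standardized performance metric per node
    nodes: int
    min_memory_gb: float            # standardized non-CPU resources
    min_disk_gb: float
    vm_image: str                   # safe VM in which the code runs
    supports_checkpoint: bool       # optional migration of long computations
    ask_price_per_node_hour: float

offer = ComputeFuture(
    provider_id="hoster-123", region="us-east",
    start=datetime(2009, 8, 15, 0, 0), hours=12.0,
    specfp_rate_2006=70.0, nodes=200,
    min_memory_gb=8.0, min_disk_gb=50.0,
    vm_image="x86_64-linux-sandbox", supports_checkpoint=True,
    ask_price_per_node_hour=0.05,
)
```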

Research into auction-based scheduling and resource allocation began in the early 90s, perhaps earlier. The first paper I saw in this regard was in 1991. There are now hundreds of papers on this and a few academic experiments. There should be a big market for this amongst web hosting companies, etc.

Amazon and Google, although likely to be very efficient with resource utilization, are likely to have peak periods and slack periods like everyone else. The strategy would be to price resources lower during slack periods to attract "greedy" computations looking for cheap power.

I have specific ideas about how this would be implemented. Contact me if you are interested.

Feed Forward NN in "real life"

Turns out that the nematode Caenorhabditis elegans has a nervous system that is similar to a feed-forward network. A feed-forward network is one where neurons receive no backward feedback from neurons "downsignal" (i.e. the neurons and synapses can be arranged as a directed acyclic graph). This is very analogous to the feed-forward architecture first envisaged for Artificial Neural Networks.

The worm has exactly 302 neurons and ~5000 synapses, with little variation in connectivity from one worm to another. This implies fewer than 20 synaptic connections per neuron on average (roughly 16). This is in contrast to the mammalian brain, where most neurons receive feedback from other neurons downstream of the signal.

I am very enthusiastic about this area of research, as it moves us step by step closer to mapping an organism's brain onto a machine substrate. The nematode is quite tractable because of its fixed and very finite number of neurons.

ANNs are no longer in vogue, but I use feed forward ANNs for some regression problems. Of course my activation function is likely to be quite different from the biological equivalent. ANNs are not a very active area of research given their limitations, but one does find them convenient for massive multivariate regression problems where one does not understand the dynamics.

The regressions that I solve have sparse {X, Y} pairs, if any at all, and can only be evaluated via a utility function across the whole data set. This precludes the various standard incremental "learning" approaches. Instead I use a genetic algorithm to find the synapse matrix that maximizes the utility function.
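A minimal sketch of that approach, assuming a small one-hidden-layer net; the network size, the toy utility, and the GA settings below are illustrative only:

```python
import numpy as np

def forward(weights, X, n_hidden):
    """Small feed-forward net: one hidden tanh layer, linear output.
    `weights` is a flat vector, reshaped into the two weight matrices."""
    n_in = X.shape[1]
    w1 = weights[: n_in * n_hidden].reshape(n_in, n_hidden)
    w2 = weights[n_in * n_hidden:].reshape(n_hidden, 1)
    return np.tanh(X @ w1) @ w2

def evolve_weights(utility, X, n_hidden=8, pop=60, gens=200, sigma=0.1, seed=0):
    """Genetic search over the flattened synapse matrix, maximizing a utility
    function evaluated over the whole data set (no per-sample {X, Y} targets)."""
    rng = np.random.default_rng(seed)
    n_w = X.shape[1] * n_hidden + n_hidden
    population = rng.normal(0, 1, size=(pop, n_w))
    for _ in range(gens):
        scores = np.array([utility(forward(w, X, n_hidden)) for w in population])
        elite = population[np.argsort(scores)[-pop // 4:]]        # keep the best
        parents = elite[rng.integers(0, len(elite), size=(pop, 2))]
        mask = rng.random((pop, n_w)) < 0.5                        # uniform crossover
        children = np.where(mask, parents[:, 0], parents[:, 1])
        population = children + rng.normal(0, sigma, size=children.shape)  # mutate
    scores = [utility(forward(w, X, n_hidden)) for w in population]
    return population[int(np.argmax(scores))]

# Example with a toy utility (mean network output on random data).
X = np.random.default_rng(1).normal(size=(500, 5))
best_weights = evolve_weights(lambda out: float(out.mean()), X)
```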

SVMs are more likely to be used in this era than ANNs for regression. Their drawback is that they require a lot of trial and error to determine an appropriate basis function, one that transforms a nonlinear data set into a reasonably linear data set in another space.
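To illustrate what that trial and error looks like, here is a kernel ridge regression with an RBF basis (a stand-in for SVR; an SVM proper adds the epsilon-insensitive loss). The basis choice reduces to picking a kernel and its width, which in practice means scanning candidates against held-out data:

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) basis: k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_predict(X_train, y_train, X_test, gamma, lam=1e-3):
    """Kernel ridge regression: solve (K + lam*I) alpha = y, then project."""
    K = rbf_kernel(X_train, X_train, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train, gamma) @ alpha

# The trial-and-error part: scan kernel widths, keep the lowest held-out error.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
tr, te = slice(0, 150), slice(150, 200)
errs = {g: float(np.mean((fit_predict(X[tr], y[tr], X[te], g) - y[te]) ** 2))
        for g in (0.01, 0.1, 1.0, 10.0)}
print(errs)
```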

High Performance Computing on the Cheap

I have a couple of trading strategies in research that require extremely compute-intensive calibrations, which can run for many days or weeks on a multi-CPU box. Fortunately the problem lends itself to massive parallelism.

I am starting my own trading operation, so it is especially important to determine how to maximize my gflops / $. Some preliminaries:
  • my calibration is not amenable to SIMD (therefore GPUs are not going to help much)
  • I need to have a minimum of 8 GB memory available
  • my problem performance is best characterized by the SPECfprate benchmark
I started by investigating grid solutions. Imagine if I could use a couple of thousand boxes on one of the grids for a few hours. How much would that cost?

Commercial Grids
So I investigated Amazon EC2 and the Google App Engine. Of the two, only Amazon looked to have higher-performance servers available. Going through the cost math for both Amazon and Google revealed that neither of these platforms is priced in a reasonable way for HPC.

Amazon charges $0.80 per compute hour, about $580 / month or $7000 / compute year, on one of their "extra-large high-CPU" boxes. This configuration is a 2007-spec Opteron or Xeon. This would imply a dual Xeon X5300-family 8-core with a SPECfprate of 66, at best. $7000 per compute year is much too dear; certainly there are cheaper options.

Hosting Services
It turns out that there are some inexpensive hosting services that can provide SPECfprate ~70 machines for around $150 / month. That works out to $1800 / year. Not bad, but can we do better?

Just How Expensive Is One of these "High Spec" boxes?
The high-end MacPro 8-core X5570-based box is the least expensive high-end Xeon-based server. It does not, however, offer the most bang / $ if your computation can be distributed. The X5500 family performs at 140-180 SPECfprates, at a cost of > $2000 just for the 2 CPUs.

There is a new kid on the block, the Core i7 family. The Core i7 920, priced at $230, generates ~80 SPECfprates and can be overclocked to around 100. A barebones compute box can be built for around $550. I could build 2 of these and surpass the performance of a dual-CPU X5500 system, saving nearly $2000 (given that the least expensive such X5500 system is ~$3000).

Cost Comparison Summary
Here is a comparison of cost per 100-SPECfprate compute year for the various alternatives. We will assume 150 watts of power consumption per CPU at $0.10 / kWh, in addition to system costs. (A small script reproducing this arithmetic appears at the end of the post.)

  1. Amazon EC2
    $10,600 / year. 100/66 perf x 0.80 / hr x 365 x 24

  2. Hosting Service
    $2,570 / year. 100/70 perf x $150 x 12

  3. MacPro 2009 8 core dual X5570
    $1070 / year. 100 / 180 perf x $3299 / 2 + $160 power

  4. Core i7 920 Custom Build
    $430 / year. 100 / 80 perf x $550 / 2 + $88 power

  5. Core i7 920 Custom Build Overclocked
    $375 / year. 100 / 100 perf x $550 / 2 + $100 power

The Core i7 920 build is the clear winner. One can build 5-6 of these for the cost of each X5570-based system. I will build a cluster of these.
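For reference, here is the small script reproducing the arithmetic above. The prices, SPECfprate figures, and power numbers are the ones quoted in this post; the performance scaling is applied to the hardware or rental cost, power is added where the hardware is owned, and a 2-year amortization is assumed.

```python
# Cost per 100-SPECfprate compute-year, using the figures quoted above.
options = {
    "Amazon EC2 XL high-CPU": 100 / 66 * 0.80 * 24 * 365,
    "Hosting service":        100 / 70 * 150 * 12,
    "MacPro dual X5570":      100 / 180 * 3299 / 2 + 160,
    "Core i7 920 build":      100 / 80 * 550 / 2 + 88,
    "Core i7 920 OC":         100 / 100 * 550 / 2 + 100,
}

for name, cost in sorted(options.items(), key=lambda kv: kv[1]):
    print(f"{name:24s} ${cost:,.0f} per 100-SPECfprate compute-year")
```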