



This blog post is based on a paper I coauthored, “An algorithm to stabilize a sequence of thermal brain images,” published in February 2007.
You can read the full paper here:
Kovalerchuk, Boris, Joseph Lemley, and Alexander M. Gorbach. “An algorithm to stabilize a sequence of thermal brain images.” Medical Imaging. International Society for Optics and Photonics, 2007.
We had a fun group project that involved using R to analyze stock prices. It later turned into a presentation at SOURCE 2016 once we added some machine learning techniques to make it more interesting.
There is a great R package called quantmod, which we used to get stock data: http://www.quantmod.com/
It is very easy to use, for example:
library(quantmod)
library(ggplot2) # Include ggplot so we can graph it.
start <- as.Date("1986-03-01")
end <- as.Date("2015-12-30")
getSymbols(c('AAPL','MSFT','^IXIC','NDX'), from = start, to = end)
This loads the quantmod package and automatically fetches stock price data between 1986-03-01 and 2015-12-30 for Apple, Microsoft, the NASDAQ Composite, and the Nasdaq-100.
Want to quickly graph the closing prices of Microsoft stocks during that time? That’s just 2 lines of code:
MSFT.df = data.frame(date=time(MSFT), Cl(MSFT))
ggplot(data = MSFT.df, aes(x = date, y = MSFT.Close)) + geom_point() + geom_smooth(se = F) + labs(x = "Date", y = "Close")
As you can see, R facilitates very fast data analytics.
We went on to build some simple predictive regression models using the R packages RSNNS and GMDH.
Like most R packages, RSNNS is very easy to use:
library(quantmod)#for stock data
library(RSNNS) # Stuttgart Neural Network Simulator.
The training and prediction code segment is here:
modelElman = elman(df$date, df$MSFT.Close, size=8, learnFuncParams=c(0.1), maxit=1000)
predictions = append(predictions, predict(modelElman, n+1)[1])
We ran this in a loop to get a series of predictions for various dates.
It’s similarly easy to use the GMDH model:
#####create time series
n = nrow(df)
stock <- ts(df, start=1, end=n, frequency=1)
#####predict
out = fcast(stock, input = 3, layer = 4, f.number = 1, tf = "all")
pre = append(pre,out$mean[1])
We then did a simulation to see which method performs the best on a range of stock values using a simple investment strategy:
Every time the model says the stock prices will go up tomorrow, buy 10 shares.
Every time the model says the stock prices will go down tomorrow: sell everything!
Continue for a year.
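The strategy itself is independent of which model produces the forecasts. Here is a sketch of the backtest loop (mine, in Python rather than the original R, with made-up prices and one-day-ahead predictions for illustration):

```python
# Buy 10 shares whenever the model predicts a rise; sell everything
# whenever it predicts a fall. Prices and predictions are made up.
prices      = [100.0, 102.0, 101.0, 105.0, 107.0]
predictions = [101.0, 101.5, 106.0, 106.0, 104.0]  # model's forecast for the next day

cash, shares = 0.0, 0
for price, predicted in zip(prices, predictions):
    if predicted > price:      # model says "up tomorrow": buy 10 shares
        cash -= 10 * price
        shares += 10
    else:                      # model says "down tomorrow": sell everything
        cash += shares * price
        shares = 0

profit = cash + shares * prices[-1]  # mark any leftover shares to market
print(profit)  # -> 100.0
```

Running this for a year of daily prices and summing the profit gives the dollar figures below.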
Elman neural networks gave the best results on a per-stock basis, followed very closely by GMDH, with regression far behind. Interestingly, however, if you had followed this strategy with all the models in 2015, you would actually have gained money from both Elman and regression. Surprisingly, GMDH lost money.
This is what you’d have made if you used our models and investment strategy on Yahoo, JP Morgan, CMS Energy Corporation, Verizon, Apple, and Microsoft.
2015:
Elman: $1334.999
Regression: $383.696
GMDH: $623.0998
It’s surprising that an Elman neural network did this well with only closing prices. Obviously, closing prices alone are not very reliable predictors of future stock prices but it managed anyway.
Clearly no one should actually use such a simple method with real money at stake, but it’s still interesting.
I defended my master’s thesis on May 27 and am all set to graduate this spring. The defense went very well; my wife and a friend recorded it. I’d upload the recording here, but there is a temporary embargo on my thesis because we plan a third publication based on some of the work I’ve been doing over the last month or so, if it works out.
On June 11, I’ll be speaking in Atlanta at COMPSAC 2016 to present my research on finding large empty areas in high dimensional spaces.
I also published a full paper at the Modern AI and Cognitive Science Conference. The title was “Comparison of Recent Machine Learning Techniques for Gender Recognition from Facial Images,” and it can be accessed here: Modern AI and Cognitive Science 2016 paper 21
I also made a presentation aimed at a more general audience, which I presented at SOURCE 2016 and which can be accessed here: http://digitalcommons.cwu.edu/source/2016/cos/2/
It feels strange (but nice) to not have any urgently pressing deadlines after months of nonstop urgency.
Some interesting output from my algorithm on a 2,600-dimensional dataset with around 19,000 entries:
If(D2: Is Between 0 and 0.756863) AND
(D44: Is Between 0.069281 and 1) AND
(D225: Is Between 0 and 0.538562) AND
(D301: Is Between 0.0993464 and 1) AND
(D575: Is Between 0.0627451 and 1) AND
(D669: Is Between 0.538562 and 1) AND
(D752: Is Between 0.231373 and 1) AND
(D823: Is Between 0 and 0.598693) AND
(D1033: Is Between 0.454902 and 1) AND
(D1172: Is Between 0 and 0.635294) AND
(D1262: Is Between 0 and 0.945098) AND
(D1269: Is Between 0.0300654 and 1) AND
(D1418: Is Between 0 and 0.929412) AND
(D1509: Is Between 0 and 0.947712) AND
(D1577: Is Between 0 and 0.96732) AND
(D1615: Is Between 0.290196 and 1) AND
(D1629: Is Between 0.266667 and 1) AND
(D1787: Is Between 0.266667 and 1) AND
(D1977: Is Between 0 and 0.971242) AND
(D1986: Is Between 0 and 0.971242) AND
(D2130: Is Between 0 and 0.988235) AND
(D2177: Is Between 0 and 0.831373) AND
(D2287: Is Between 0.133333 and 1) AND
(D2416: Is Between 0.0261438 and 1) AND
(D2507: Is Between 0 and 0.836601) AND
(D2566: Is Between 0.0862745 and 1) AND
All other attributes range from 0 to 1.
Then the hyperrectangle bounded by those intervals is empty and has a volume of 0.00351057.
If you counted every elementary particle in the universe, that number would still be MUCH smaller than the number of such holes in this dataset.
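As a sanity check (this snippet is mine, not output of the algorithm): every unconstrained attribute spans all of [0, 1] and contributes a factor of 1 to the volume, so the volume is just the product of the widths of the 26 constrained intervals above:

```python
# Volume of the empty hyperrectangle = product of interval widths
# over the 26 constrained dimensions; all other dimensions span
# [0, 1] and contribute a factor of 1.
bounds = [
    (0, 0.756863), (0.069281, 1), (0, 0.538562), (0.0993464, 1),
    (0.0627451, 1), (0.538562, 1), (0.231373, 1), (0, 0.598693),
    (0.454902, 1), (0, 0.635294), (0, 0.945098), (0.0300654, 1),
    (0, 0.929412), (0, 0.947712), (0, 0.96732), (0.290196, 1),
    (0.266667, 1), (0.266667, 1), (0, 0.971242), (0, 0.971242),
    (0, 0.988235), (0, 0.831373), (0.133333, 1), (0.0261438, 1),
    (0, 0.836601), (0.0862745, 1),
]

volume = 1.0
for lo, hi in bounds:
    volume *= hi - lo

print(volume)  # close to the reported 0.00351057
```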
So I figured I’d try getting Theano running on Cygwin. Here is what I did.
1. Grab Cygwin: https://cygwin.com/
2. From the setup program, install gcc, gfortran, BLAS, LAPACK, Python 2.7, and any other tools you’d like. Don’t forget to install the libgfortran and gcc-fortran libraries. To reduce the chances of future problems I chose to also install the source for these packages.
3. Get pip working. Use:
chmod 775 /usr/lib
chmod 775 /usr/bin
wget https://bootstrap.pypa.io/ez_setup.py -O - | python
easy_install pip
4. pip install numpy
5. pip install Tempita
Install SciPy with:
mkdir scipy
cd scipy
Use wget https://github.com/scipy/scipy/archive/master.zip to download the very latest SciPy source.
unzip master.zip
cd scipy-master
python setup.py install --user
6. pip install scikit-learn
7. pip install six
8. Install Theano from the latest build:
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git --user
9. Find the location where Theano stores cutils_ext (on my system it is ~/.theano/compiledir_CYGWIN_NT6.12.3.10.29153x86_6464bit–2.7.1064/cutils_ext).
Copy cutils_ext.pyd to a new file named cutils_ext.dll (if you don’t, importing theano will fail):
cp cutils_ext.pyd cutils_ext.dll
If you do the Theano step with just “pip install theano”, it will try to uninstall your SciPy and replace it with a buggy one. Don’t let it!
Note: If you came here from Google after trying to install SciPy 0.16.1 on SunOS or Cygwin, you are probably looking for this:
There is currently a bug in the 0.16.1 release related to a variable named infinity. It was fixed in the master branch in November but never backported to the 0.16.1 archive.
If you use 0.16.1 from that link you’ll get a compile error on cygwin:
“error: Command "g++ -fno-strict-aliasing -ggdb -O2 -pipe -Wimplicit-function-declaration -fdebug-prefix-map=/usr/src/ports/python/python2.7.10-1.x86_64/build=/usr/src/debug/python2.7.10-1 -fdebug-prefix-map=/usr/src/ports/python/python2.7.10-1.x86_64/src/Python-2.7.10=/usr/src/debug/python2.7.10-1 -DNDEBUG -g -fwrapv -O3 -Wall -I/usr/include/python2.7 -I/usr/lib/python2.7/site-packages/numpy/core/include -Iscipy/spatial/ckdtree/src -I/usr/lib/python2.7/site-packages/numpy/core/include -I/usr/include/python2.7 -c scipy/spatial/ckdtree/src/ckdtree_query.cxx -o build/temp.cygwin-2.3.1-x86_64-2.7/scipy/spatial/ckdtree/src/ckdtree_query.o" failed with exit status 1”
See this commit for more info if you want to investigate that error: https://github.com/scipy/scipy/commit/832baa20f0b5
If you get error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "theano/__init__.py", line 72, in <module>
    from theano.scan_module import scan, map, reduce, foldl, foldr, clone
  File "theano/scan_module/__init__.py", line 41, in <module>
    from theano.scan_module import scan_opt
  File "theano/scan_module/scan_opt.py", line 65, in <module>
    from theano import tensor
  File "theano/tensor/__init__.py", line 7, in <module>
    from theano.tensor.subtensor import *
  File "theano/tensor/subtensor.py", line 26, in <module>
    import theano.gof.cutils  # needed to import cutils_ext
  File "theano/gof/cutils.py", line 295, in <module>
    compile_cutils()
  File "theano/gof/cutils.py", line 260, in compile_cutils
    preargs=args)
  File "theano/gof/cmodule.py", line 2014, in compile_str
    return dlimport(lib_filename)
  File "theano/gof/cmodule.py", line 289, in dlimport
    rval = __import__(module_name, {}, {}, [module_name])
ImportError: No module named cutils_ext
Try the cutils_ext copy step (cp cutils_ext.pyd cutils_ext.dll) again.
This quarter I’m taking Swim Conditioning, Graduate Research, and CS 540 – Algorithms for Biological Data Analysis.
I’m also taking some Coursera courses in security and machine learning as a review.
I particularly like the machine learning course so far. This is the first time I’ve used IPython Notebook, which I highly recommend; it is included with Anaconda. The course also introduced me to GraphLab’s SFrame, which I’m very impressed by.
In just a few lines of code, I was able to create and test a regression model from a file and display it on a very nice graph.
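The gist of that workflow can be sketched in a few lines of plain Python (a standalone illustration with made-up data, standing in for the SFrame-based version; the plotting step is omitted):

```python
# Fit a one-variable least-squares line to some made-up (x, y) points
# using the closed-form formulas for slope and intercept.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0), (5, 9.9)]

n = float(len(data))
sx  = sum(x for x, _ in data)
sy  = sum(y for _, y in data)
sxx = sum(x * x for x, _ in data)
sxy = sum(x * y for x, y in data)

slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)  # roughly 1.97 and 0.11
```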
I’m also the teaching assistant for CS 528 – Advanced Data Structures and Algorithms. I’m teaching the labs and grading the homework. Last Thursday I went over dictionaries, lists, and functions (including lambda functions, which I think are awesome) and discussed Python’s string functions, which will be needed for next week’s homework.
I’ve been working on getting a paper finished that I plan to submit to an international conference this month.
Here is my abstract:
The problem of finding maximal empty rectangles in a set of points in 2D and 3D space has been well studied, and efficient algorithms exist to identify maximal rectangles in 2D space. Unfortunately, such efficiency is lacking in higher dimensions, where the problem has been shown to be NP-complete when the dimensions are included in the input. We compare existing methods and suggest a novel technique to discover interesting maximal empty hyperrectangles in cases where dimensionality and input size would otherwise make analysis impractical. Applications include big data analysis, recommender systems, automatic knowledge discovery, and query optimization.
Keywords: Maximal Empty Rectangle, Maximal Cuboid, Big Data
One of the big problems is that (until now) there have been no algorithms that scale well for finding empty hyperrectangles.
This is related to something called “the curse of dimensionality”.
My wife is finishing up a drawing (she’s an artist) to include in my presentation, and I’m too tired to keep working, so I figured I’d post something about the curse here.
Here are some interesting lecture notes on The curse of dimensionality: http://math.arizona.edu/~hzhang/math574m/2014Lect10_curse.pdf
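One quick way to see the curse (my own toy example, not from those notes): the ball inscribed in the unit hypercube occupies a vanishing fraction of the cube’s volume as the dimension grows, so almost everything sits in the “corners” far from any center point, which is part of why huge empty hyperrectangles are so easy to find in high-dimensional data.

```python
# Fraction of the unit hypercube's volume taken up by its inscribed
# ball (radius 1/2), using the standard d-ball volume formula
# V_d(r) = pi^(d/2) * r^d / Gamma(d/2 + 1).
from math import pi, gamma

def inscribed_ball_fraction(d):
    """Volume of the radius-1/2 ball divided by the unit cube's volume (1)."""
    return (pi ** (d / 2.0) / gamma(d / 2.0 + 1)) * 0.5 ** d

for d in (2, 3, 10, 100):
    print(d, inscribed_ball_fraction(d))  # shrinks toward zero very fast
```

In 2D the inscribed disk covers about 79% of the square; by d = 100 the fraction is smaller than 10^-60.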
The most obvious method of factoring a number, repeated division, is a very slow process (indeed, the security of modern encryption depends on factoring being slow).
Pollard’s rho is a fast, probabilistic method that can find factors much sooner than trial division.
My implementation, based on the pseudocode on page 976 of “Introduction to Algorithms, third edition”, is as follows:
from random import randint
from fractions import gcd

def PollardRho(n):
    i = 1
    xi = randint(0, n - 1)
    k = 2
    y = xi
    while i < n:
        i = i + 1
        xi = (xi * xi - 1) % n  # note: squaring, not ^ (which is XOR in Python)
        d = gcd(y - xi, n)
        if d != 1 and d != n:
            print d
        if i == k:
            y = xi
            k = 2 * k

PollardRho(1200)
The first lines of the function initialize i to 1 and x1 to a random integer between 0 and n-1.
The update of xi implements the recurrence x_{i+1} = (x_i^2 - 1) mod n, which produces the values of the infinite sequence.
Most pseudocode has this running forever with while(true), so I modified it to try only n times. This is probably overkill, because if n is composite we can expect this method to discover enough divisors to factor n completely after about n^(1/4) updates.
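For completeness, here is a Python 3 variant of the same algorithm that returns the first nontrivial divisor it finds (the function name and the outer retry loop are my additions; in Python 3, gcd lives in math rather than fractions):

```python
from random import randint
from math import gcd

def pollard_rho(n):
    """Return a nontrivial divisor of a composite n, retrying with a
    fresh random start until one is found."""
    if n % 2 == 0:
        return 2
    while True:
        x = randint(0, n - 1)
        y, k, i = x, 2, 1
        while i < n:
            i += 1
            x = (x * x - 1) % n       # the recurrence x_{i+1} = (x_i^2 - 1) mod n
            d = gcd(y - x, n)
            if d != 1 and d != n:
                return d
            if i == k:                # save x at powers of two for cycle detection
                y, k = x, 2 * k

d = pollard_rho(8051)
print(d, 8051 // d)  # the two prime factors of 8051 = 83 * 97, in some order
```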
The fundamental idea of trapezoidal integration is to use a sequence of trapezoids to perform integration numerically. Numerical integration is also called numerical quadrature.
from math import *
f1 = lambda x: x**3 - x**2 + 3*x - cos(x)*x + atan(sin(x) + 1)
First we import the math library, nothing special about that.
The next line may be unfamiliar to people who have not explored Python. Python allows so-called “lambda functions”. It’s a really cool feature that is great for declaring mathematical functions (as opposed to functions that perform tasks that are not mathematical).
The above code is equivalent to saying f(x) = x^3 - x^2 + 3x - x·cos(x) + arctan(sin(x) + 1).
def ti1(f, a, b):
    # single-interval trapezoid rule
    return ((b - a) / 2.0) * (f(a) + f(b))

def diff2(f, x, h=1E-6):
    # centered second-difference approximation of f''(x)
    return (f(x - h) - 2 * f(x) + f(x + h)) / float(h * h)

def trapezint(f, a, b, n):
    # composite trapezoid rule with n subintervals
    h = (b - a) / float(n)
    part1 = 0.5 * h * (f(a) + f(b))
    total = 0
    for i in range(1, n):
        xi = a + i * h
        total = total + f(xi)
    return part1 + h * total

def adaptivetrap(f, a, b, ep):
    # estimate max |f''| by sampling 1000 points in [a, b]
    maxd2 = 0
    step = abs(b - a) / 1000.0
    for i in range(1, 1001):
        dval = diff2(f, a + step * i)
        if abs(dval) > maxd2:
            maxd2 = abs(dval)
    # choose h so the trapezoid error bound (b-a)*h^2*max|f''|/12 <= ep
    h = sqrt(12 * ep) * ((b - a) * maxd2) ** -0.5
    n = (b - a) / h
    return trapezint(f, a, b, int(ceil(n)))

print adaptivetrap(f1, 0.0, 10.0, 1E-5)
The function adaptivetrap takes the f1 we defined above and integrates it using the adaptive trapezoid rule from 0 to 10 with an error tolerance of 1E-5.
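A quick way to convince yourself the composite rule is right (a standalone check of mine, restating trapezint so the snippet runs on its own): the integral of x^2 over [0, 1] is exactly 1/3, and with n = 1000 panels the trapezoid error is about h^2/6 ≈ 1.7e-7.

```python
# Composite trapezoid rule, checked against a known integral.
def trapezint(f, a, b, n):
    h = (b - a) / float(n)
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total

approx = trapezint(lambda x: x ** 2, 0.0, 1.0, 1000)
print(approx)  # about 0.3333335 (exact value is 1/3)
```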