* Matthias Lee
* Dec 4th 2009
* Massively Parallel Astronomic Event Detection
* Event detection using lightcurves of trans-Neptunian objects, matching them against pre-generated templates
* CS264

To run this program, extract Event_detection.tar to a directory, then execute either:

$>./release/tmatch --index
$>./release/tmatch --run

The two executables above should run on resonance without compiling, but just in case I'm wrong, the enclosed Makefile should allow you to compile by just typing:

$>make

There is nothing special about the Makefile; it works on my machine (contact me with problems).

Both modes give status information as they run. The final results are saved in ./OUT/LCchunk.xls and ./OUT/Template.xls (these are just txt files); in ./OUT/FinalResultwithGraph.xls you will find the answer with a graph.

If you have any questions, please contact me at: matthias.a.lee@gmail.com

Code:

There are 2 main functions, plus the kernel:

extern "C" void PreProc_template(char* fname);
1. This function takes a directory as an argument and, with the help of addFolder(), walks the subdirectory structure of the given directory, returning the number of files found and the files themselves in an array of strings.
2. The array of strings is then iterated over; each file is read and normalized, and finally concatenated into one long array, which gets saved in ./BIN/tmpltNCF.bin. These are the templates that pieces of a lightcurve (time-series data) will be compared against. (A sketch of this step follows this section.)
3. The last step is to save a dict of all the filenames stored in the above-mentioned .bin; this is dumped into ./BIN/tmplt.dict.

extern "C" void search_lc(char* ncf, char* metadict, char* lcfname)
1. This function gets passed 3 file names: ncf (path to the indexed templates), metadict (path to the dict file), and lcfname (path to the lightcurve that will be compared against the templates).
2. The function then reads in the lightcurve and the template bin and loads them into texture memory.
3. The next step is to launch the kernel (see gpucorrmultiplex() below).
4. After the kernel completes, it copies back the final result, 1 float3 per launched block, and finds the best match (highest correlation). (A host-side sketch of this step follows this section.)
5. At the end it saves the results, the segment in which the event was detected and the best-matching template, in 2 Excel files in ./OUT/.

__global__ void gpucorrmultiplex(int templ_len, int LC_len, float3* d_res, int LCchunk_cnt)
1. This function takes the template length, the lightcurve length, a pointer for returning the results, and finally the number of lightcurve chunks that will be iterated over.
2. The function gets launched as n blocks of m threads, where n is (length of the lightcurve)/300 (300 being the amount of space allocated in shared memory for the calculation) and m is the number of templates the chunk of lightcurve will be compared against. The comparison happens in a sliding-window fashion, where the length of the templates, in this case 20, dictates the length of each LC segment: the first correlation takes place between the templates and elements 0-19, then 1-20, 2-21, and so on. At the end of each segment's correlation, each thread writes one final value to shared memory, which then undergoes a MAX to find the template with the highest correlation. After the 300-point chunk of LC has been looped over, the next 300-point segment gets loaded into memory; this goes on until the entire LC has been consumed.
3. At the end, each block writes one float3 to global memory containing the correlation, the template index, and the LC index of the best match. (A simplified sketch follows.)
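To illustrate step 2 of PreProc_template(), here is a minimal host-side sketch of reading, normalizing, and appending one template file. It assumes one float per line per file and zero-mean/unit-norm normalization; the actual normalization in PreProc_template() may differ, and addFolder() is not shown.

/* Hypothetical sketch: normalize one template file and append it to
   the open tmpltNCF.bin. One float per line is assumed. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define TLEN 20  /* template length used in this project */

/* zero-mean, unit-norm normalization (an assumption; the real
   normalization may differ) */
static void normalize(float* t, int n)
{
    float mean = 0.0f, norm = 0.0f;
    for (int i = 0; i < n; i++) mean += t[i];
    mean /= n;
    for (int i = 0; i < n; i++) { t[i] -= mean; norm += t[i] * t[i]; }
    norm = sqrtf(norm);
    if (norm > 0.0f)
        for (int i = 0; i < n; i++) t[i] /= norm;
}

/* read one template file and append it to the concatenated .bin */
static int append_template(const char* fname, FILE* out)
{
    FILE* in = fopen(fname, "r");
    if (!in) return -1;
    float buf[TLEN];
    for (int i = 0; i < TLEN; i++)
        if (fscanf(in, "%f", &buf[i]) != 1) { fclose(in); return -1; }
    fclose(in);
    normalize(buf, TLEN);
    fwrite(buf, sizeof(float), TLEN, out);
    return 0;
}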
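The copy-back and best-match scan in step 4 of search_lc() might look like the following host-side sketch. The float3 field layout (x = correlation, y = template index, z = LC index) follows the kernel description above; the function name and error handling are placeholders, not the actual search_lc() code.

/* Hypothetical sketch of step 4: copy one float3 per launched block
   back from the GPU and scan for the highest correlation. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

float3 fetch_best_match(const float3* d_res, int nblocks)
{
    float3* h_res = (float3*)malloc(nblocks * sizeof(float3));
    cudaMemcpy(h_res, d_res, nblocks * sizeof(float3),
               cudaMemcpyDeviceToHost);

    float3 best = h_res[0];
    for (int b = 1; b < nblocks; b++)
        if (h_res[b].x > best.x)   /* keep the highest correlation */
            best = h_res[b];

    printf("best corr %f, template %d, LC index %d\n",
           best.x, (int)best.y, (int)best.z);
    free(h_res);
    return best;
}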
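To make the block/thread layout concrete, below is a simplified sketch of the kernel's inner loop. It is not the production gpucorrmultiplex(): it assumes one template per thread, uses a plain dot product against the pre-normalized templates as the correlation score, serializes the block-wide MAX for clarity (the real kernel reduces in shared memory), and the texture names texLC/texTmpl are placeholders.

#define CHUNK 300   /* shared-memory chunk of the lightcurve */
#define TLEN  20    /* template length */

texture<float, 1, cudaReadModeElementType> texLC;    /* lightcurve */
texture<float, 1, cudaReadModeElementType> texTmpl;  /* concatenated templates */

__global__ void corr_sketch(int LC_len, float3* d_res)
{
    __shared__ float s_lc[CHUNK];
    const int tmpl = threadIdx.x;          /* one template per thread */
    const int base = blockIdx.x * CHUNK;   /* this block's LC chunk */

    /* cooperatively stage the 300-point chunk into shared memory */
    for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
        s_lc[i] = tex1Dfetch(texLC, base + i);
    __syncthreads();

    /* slide the 20-point window over the chunk: 0-19, 1-20, 2-21, ... */
    float bestCorr = -1e30f;
    int   bestIdx  = 0;
    for (int off = 0; off + TLEN <= CHUNK && base + off + TLEN <= LC_len; off++) {
        float c = 0.0f;
        for (int k = 0; k < TLEN; k++)
            c += s_lc[off + k] * tex1Dfetch(texTmpl, tmpl * TLEN + k);
        if (c > bestCorr) { bestCorr = c; bestIdx = base + off; }
    }

    /* block-wide MAX over per-thread bests; serialized here for
       clarity, the real kernel uses a shared-memory reduction */
    __shared__ float3 s_best;
    if (threadIdx.x == 0) s_best = make_float3(-1e30f, 0.0f, 0.0f);
    __syncthreads();
    for (int t = 0; t < blockDim.x; t++) {
        if (threadIdx.x == t && bestCorr > s_best.x)
            s_best = make_float3(bestCorr, (float)tmpl, (float)bestIdx);
        __syncthreads();
    }
    if (threadIdx.x == 0) d_res[blockIdx.x] = s_best;
}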
WRITE UP
=============

1. The goal of this project is to find occultations/events of trans-Neptunian objects passing in front of a distant star, thereby blocking or dilating the light we see from Earth from background stars. This signature gets recorded as a lightcurve by a proposed satellite called Whipple, which will monitor 10,000 stars at 40 Hz. Since the satellite does not exist yet, I am working off simulated data, which is much smaller in size but gives a proof of concept for the final satellite. This program analyzes a lightcurve of 8420 points and compares it against 1950 templates. In the proposal I stated that I would compare the lightcurve against 10,000 or more templates and use a tree structure to speed up the comparison by pruning out irrelevant templates. After some analysis of the situation and talking to the scientist associated with this project, we concluded that ~2000 templates are plenty to get an estimate of the object that caused the occultation. If interest persists in pursuing further analysis, the user may do further matching using more specific templates.

2. The data, as stated above, are round-the-clock brightness readings of stars taken at 40 Hz by a proposed satellite (the data are interpolated and slightly preprocessed at the satellite level, yielding ~1 data point/sec). The data are therefore simulated and act as a proof of concept for the real deal.

3. See the "Code" section of this readme, above.

4. See the top of this document.

5. The performance of this code is ~20 ms per LC of ~8400 points, which scales to ~200 ms per full-size lightcurve of 86,400 points (at 1 point per second). This is much faster than the competing CPU code, though the results are hard to compare directly, since the CPU implementation attacks the problem from a different angle: it fits algorithmically. Starting from a linear, naive implementation (see naive.cu), the optimizations were to minimize accesses to textures and global memory while keeping as much as possible in shared memory and registers. I also implemented sequential (coalesced) reading of texture and global memory. Another optimization was to avoid warp serialization by avoiding possible race conditions and bank conflicts (see the padding sketch after this list). The naive implementation compared ~390 templates against an LC of 8420 points in ~40 ms; the optimized version compares 1950 templates against an LC of 8420 points in ~22.5 ms.

6. There are many things to be learned from a project like this: not waiting for your data but starting even with improvised data, and the many interesting twists to optimizing shared, global, and texture memory, from bank conflicts to unexplained warp serialization.

7. There are many improvements that can be made to a program like this. First off, I'm sure there are more optimizations to be done, but besides that, improvements in the works are more user-friendly controls, more templates, and looping over multiple LCs, and maybe also multi-event detection in a single LC, but for that there need to be thresholds for what qualifies as an event.

8. The most enjoyable part was knowing that I am working on something no one else has done with GPUs before, at least not anywhere to be found. Also, dealing with event detection and such massive data sets as this will be able to handle in the future is amazing.
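To illustrate the bank-conflict point in item 5: on the compute-1.x hardware of this era, shared memory has 16 banks, so a 16x16 tile accessed column-wise lands every thread of a half-warp in the same bank. The standard fix is to pad each row by one element. The matrix-transpose kernel below is a generic textbook illustration of that trick, not code from tmatch.

/* Generic example of shared-memory padding to avoid bank conflicts;
   not part of tmatch. */
__global__ void transpose_tile(const float* in, float* out, int n)
{
    /* 16x17 instead of 16x16: the one-element pad makes column
       accesses (tile[threadIdx.x][y]) hit 16 different banks instead
       of 16-way-conflicting in a single bank */
    __shared__ float tile[16][17];

    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  /* coalesced read */
    __syncthreads();

    x = blockIdx.y * 16 + threadIdx.x;   /* swap block indices */
    y = blockIdx.x * 16 + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y]; /* coalesced write */
}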
There were many challenges in avoiding getting bogged down with memory access and in structuring the kernel so that it would be most efficient. The frustrating parts were definitely dealing with memory accesses and unexplained warp serialization. Next time I will probably start earlier, if I can actually get the data on time, and I would really like to try some different ways of structuring the kernels.