GlueX Lambda SLD analysis

Thursday, November 7, 2019
Relegator update
Kripa has produced some really nice plots of significance vs. decision-function threshold for the regressor. NICE.
We also have plots of analysis significance vs. signal fraction for the three models on the moons dataset. These show that the binary softmax gives much worse results (in sigma), while the relegator gives results similar to the regressor's (with more variation).
Up next, Kripa will investigate the optimal network architecture for classification of the moons data.

Friday, November 1, 2019
Relegator working. Now need to test the following:
There is only one thing that I still don't understand about the relegator moons implementation: the significance that is calculated during training is lower (by a factor of 5-ish) than the ultimate significance when applied to the weighted dataset. This could be because of using the probabilities to weight events in the training signif calculation. Will look into it.
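To check this, here is a toy sketch (NumPy; all names and numbers are made up) contrasting the probability-weighted ("soft") significance with the thresholded ("hard") one. The soft version sweeps every background event's probability into b, which in this toy pushes it well below the hard value:

```python
import numpy as np

def soft_significance(p_sig, is_signal):
    """Each event contributes its predicted signal probability as a weight."""
    s = np.sum(p_sig[is_signal])
    b = np.sum(p_sig[~is_signal])
    return s / np.sqrt(s + b)

def hard_significance(p_sig, is_signal, threshold=0.5):
    """Events are counted only if they pass the decision threshold."""
    sel = p_sig > threshold
    s = np.count_nonzero(sel & is_signal)
    b = np.count_nonzero(sel & ~is_signal)
    return s / np.sqrt(s + b) if s + b else 0.0

rng = np.random.default_rng(0)
is_signal = rng.random(10000) < 0.1
# toy classifier: signal events get higher probabilities on average
p_sig = np.clip(rng.normal(np.where(is_signal, 0.7, 0.3), 0.15), 0.0, 1.0)
print(soft_significance(p_sig, is_signal), hard_significance(p_sig, is_signal))
```

In this toy the soft value comes out a factor of ~2 below the hard one; a similar effect could plausibly explain the factor-of-5 gap.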
OK, to check:
- run the various models on various datasets (i.e., different signal fractions) to see whether the relegator actually gives better performance than the binary softmax. Run multiple trials on each dataset to check stability. run_master.py does this; it's running on wintermute right now. Need to write a script to visualize results.
- apply trained models to many randomly generated weighted datasets. When sig_frac is small, there will be a large amount of statistical variation in n_S between weighted datasets, so applying a trained model to many of them will show the variation in analysis power. Need to write a new script to do this.
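The second check could start from a sketch like this (NumPy; the "trained model" is a stand-in cut, and all parameters are placeholders): generate many weighted datasets, let n_S fluctuate binomially, and histogram the resulting significances.

```python
import numpy as np

def make_weighted_dataset(rng, n_total=10000, sig_frac=0.01):
    n_s = rng.binomial(n_total, sig_frac)   # n_S fluctuates between datasets
    x_sig = rng.normal(1.0, 1.0, n_s)       # stand-in discriminating feature
    x_bkg = rng.normal(-1.0, 1.0, n_total - n_s)
    x = np.concatenate([x_sig, x_bkg])
    y = np.concatenate([np.ones(n_s, bool), np.zeros(n_total - n_s, bool)])
    return x, y

def significance(x, y, cut=0.0):
    sel = x > cut                 # stand-in for a fixed trained model
    s = np.count_nonzero(sel & y)
    b = np.count_nonzero(sel & ~y)
    return s / np.sqrt(s + b) if s + b else 0.0

rng = np.random.default_rng(1)
sigmas = np.array([significance(*make_weighted_dataset(rng)) for _ in range(200)])
print(f"significance: {sigmas.mean():.2f} +/- {sigmas.std():.2f}")
```

The spread of `sigmas` is exactly the variation in analysis power the bullet above is after.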
Monday, October 21, 2019
Relegator: to do this week
1. Separate dataset generation and classifier training into different scripts. This way, the classification script can be run with different configurations on the same train and test datasets. The data generation should dump the train and test datasets to a pickle.
2. Add code that evaluates the significance during training on a weighted dataset. This will take a lot of reworking of the training script (I think) and the model classes.
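Item 2 could begin as a framework-agnostic hook before reworking the model classes. A sketch (NumPy; the predict function and its sharpening are stand-ins for a real training model):

```python
import numpy as np

def eval_significance(predict, x_eval, y_eval, threshold=0.9):
    """Hard-count s/sqrt(s+b) on a fixed weighted evaluation set."""
    p = predict(x_eval)
    sel = p > threshold
    s = np.count_nonzero(sel & y_eval)
    b = np.count_nonzero(sel & ~y_eval)
    return s / np.sqrt(s + b) if s + b else 0.0

rng = np.random.default_rng(2)
y_eval = rng.random(5000) < 0.05     # weighted dataset: ~5% signal
x_eval = np.where(y_eval, rng.normal(1.0, 1.0, 5000), rng.normal(-1.0, 1.0, 5000))

history = []
for epoch in range(3):
    k = 1.0 + epoch   # stand-in for a model sharpening as training proceeds
    predict = lambda x, k=k: 1.0 / (1.0 + np.exp(-k * x))
    history.append(eval_significance(predict, x_eval, y_eval))
print(history)
```

In the real code the loop body would be one training epoch, with `eval_significance` called at epoch end on the held-out weighted set.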
Tuesday, October 15, 2019
Relegator UPDATE
Relegator (as well as the regressor and binary classifiers) is now working on the massive moons dataset with TensorFlow 2.0. Currently running with a modified cross-entropy plus inverse-significance loss function. The classifier does not seem to want to assign any events to the relegation class. SO, a couple of ideas:
- Does the significance part of the loss function need to be computed using tf functions (so that gradients are supplied)?
- Need to try different ways to incorporate the significance into the loss function.
- It might be necessary to train on TWO datasets: an even-population dataset for the accuracy (relegator cross-entropy) and a weighted-population dataset for the significance.
- It might be necessary to make the moons dataset have relatively more background at the intersection. That is, the distribution of background events may need to be nonuniform around the arc, with a higher density of background events (and lower density of signal events) in the region of overlap.
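One possible form of the loss described above, as a NumPy forward-pass sketch (assumptions: the "no penalty" modification gives CE credit for either the true class or the relegation class, and S and B are replaced by differentiable "soft" counts; `alpha` and the label convention are made up):

```python
import numpy as np

def relegator_loss(probs, labels, alpha=1.0):
    """probs: (N, 3) softmax outputs for [signal, background, relegate];
    labels: (N,) with 0 = true signal, 1 = true background."""
    n = len(labels)
    # CE credit for the true class OR the relegation class, so routing an
    # event to relegation carries no cross-entropy penalty
    ce = -np.mean(np.log(probs[np.arange(n), labels] + probs[:, 2] + 1e-9))
    # "soft" counts: predicted signal probability summed over true sig/bkg,
    # a differentiable stand-in for S and B
    s = np.sum(probs[labels == 0, 0])
    b = np.sum(probs[labels == 1, 0])
    signif = s / np.sqrt(s + b + 1e-9)
    return ce + alpha / (signif + 1e-9)

rng = np.random.default_rng(3)
probs = rng.dirichlet(np.ones(3), 200)   # fake softmax outputs
labels = rng.integers(0, 2, 200)
print(relegator_loss(probs, labels))
```

Written with tf ops instead of NumPy, every term here is differentiable, which speaks directly to the gradient question in the first bullet.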
Wednesday, September 25, 2019
Items for Kripa
OK, first off: I previously thought that my relegator code was written such that it would only work with tf2.0. This is apparently not the case, and Kripa is able to run it on her machine with tf1.13.
Over the next two weeks, I'm going to rework the code to run with tf2.0, primarily because I would like to use eager execution. This will make it easier to build custom loss functions that can have batch-dependent features.
For the time being, Kripa will work with tf1.13 and the old code. I'd like her to do the following:
- run (lots of) fits with the regressor. Fits should be run with 10k train events, noise=0.2, angle=0.9. Signal fractions should take three values per decade (log spaced) from 0.001 up to 0.5. (That last value need not respect log spacing.) Run 25 fits for each value combination.
- For each fit, output important result parameters to a master results file (see the 'write_results' loop in master_moons.py). You can make the file a pickle if you want to.
- Add to the output file arrays of the decision function values and the resulting significance values so that you can make the plots in the next bullet.
- Make plots of significance vs decision value threshold for each signal fraction. It would be nice to do this as a "band" that includes the results of all fits for a given signal fraction.
For the time being, let's use $s/\sqrt{s+b}$ as our significance estimator/figure of merit.
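The significance-vs-threshold plots described above could be built from a scan like this sketch (NumPy; the scores and all parameters are toy stand-ins):

```python
import numpy as np

def fom_scan(scores, is_signal, thresholds):
    """s/sqrt(s+b) at each decision-function threshold."""
    out = []
    for t in thresholds:
        sel = scores > t
        s = np.count_nonzero(sel & is_signal)
        b = np.count_nonzero(sel & ~is_signal)
        out.append(s / np.sqrt(s + b) if s + b else 0.0)
    return np.array(out)

rng = np.random.default_rng(4)
is_signal = rng.random(20000) < 0.05
scores = np.clip(rng.normal(np.where(is_signal, 0.7, 0.3), 0.2), 0.0, 1.0)
thresholds = np.linspace(0.0, 0.95, 20)
fom = fom_scan(scores, is_signal, thresholds)
best = thresholds[np.argmax(fom)]
print(f"best threshold ~ {best:.2f}, fom = {fom.max():.2f}")
```

Overlaying `fom` curves from all 25 fits at a given signal fraction would give the "band" plot in the last bullet.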
Monday, September 23, 2019
RELEGATOR update
Most of the code for investigating the relegator NN classifier on the moons+mass dataset is up and running. The plan is to check the increase in analysis significance, s/sqrt(s+b), for various architecture/loss-function combos and various signal fractions.
Arch/loss combos:
- regressor with binary CE loss
- regressor with binary CE + significance loss
- binary nn with categorical CE loss
- binary nn with categorical CE + significance loss
- relegator with modified categorical CE (no penalty for relegation class)
- relegator mod CCE + signif loss
It would be ideal to test each on the same generated dataset, but this might take a lot more coding. Perhaps some way to gen the dataset and THEN pass it into master_moons.py???
For the time being, we'll take the statistical approach --> run each MANY times, histogram the increases in signif.
So, next step is to add CCE+signif loss function capabilities to regressor and nn_binary, and to make the 5th model in the list above. This means that we'll have a total of SIX models to test.
Tuesday, July 30, 2019
Relegator ideas...
For now, work on the simple two-class moons dataset (2 features) with an added third "mass" feature: peaking for signal and exponentially distributed for background.
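A minimal generator for that dataset might look like this sketch (NumPy; a Gaussian peak for the signal mass and an exponential for background, with all shape parameters as placeholders):

```python
import numpy as np

def make_moons_mass(n=2000, noise=0.2, rng=None):
    """Two interleaved moons (2 features) plus a third 'mass' feature."""
    if rng is None:
        rng = np.random.default_rng()
    half = n // 2
    t = rng.uniform(0.0, np.pi, n)
    moon_sig = np.c_[np.cos(t[:half]), np.sin(t[:half])]               # signal arc
    moon_bkg = np.c_[1.0 - np.cos(t[half:]), 0.5 - np.sin(t[half:])]   # background arc
    xy = np.vstack([moon_sig, moon_bkg]) + rng.normal(0.0, noise, (n, 2))
    y = np.r_[np.zeros(half, int), np.ones(n - half, int)]
    # third feature: narrow peak for signal, exponential tail for background
    mass = np.where(y == 0,
                    rng.normal(1.115, 0.005, n),     # e.g. a Lambda-like peak
                    rng.exponential(0.05, n) + 1.08)
    return np.c_[xy, mass], y

X, y = make_moons_mass(rng=np.random.default_rng(6))
print(X.shape, y.shape)
```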
Compare:
1. NN with single output node, logistic regression, with optimal cut based on ROC
2. NN with two output nodes, cat. cross-entropy, optimal cut based on S/sqrt(S+B)
3. relegator NN with three output nodes, tuned with S/sqrt(S+B) in loss function
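One reading of how setup 3 gets evaluated (assumption: an event goes to the argmax of the three outputs, and anything landing in the relegation class is simply dropped from both S and B):

```python
import numpy as np

def relegator_significance(probs, labels):
    """probs: (N, 3) for [signal, background, relegate]; labels 0=sig, 1=bkg.
    Relegated events never enter S or B."""
    cls = np.argmax(probs, axis=1)
    s = np.count_nonzero((cls == 0) & (labels == 0))
    b = np.count_nonzero((cls == 0) & (labels == 1))
    return s / np.sqrt(s + b) if s + b else 0.0

rng = np.random.default_rng(7)
probs = rng.dirichlet(np.ones(3), 1000)   # fake three-class softmax outputs
labels = rng.integers(0, 2, 1000)
print(relegator_significance(probs, labels))
```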
Wednesday, July 24, 2019
Relegation classifier thoughts
Thinking about investigating a NN classifier for separating signal from a pernicious background. The typical approach is to train the classifier and then use the decision cut value that gives the best analysis power, often interpreted from the ROC AUC. I suspect, though, that the NN could train differently if we "build in" that we want to optimize analysis power.
So, my idea is to investigate a binary classification problem with a NN that predicts probabilities for THREE classes: signal, background, and RELEGATION. The idea is that events which are too difficult to characterize correctly will be placed in the relegation class. The penalty for doing so is that the loss function will contain a term that tries to keep S/sqrt(S+B) as high as possible.
For multiclass classification, we would use the categorical cross-entropy loss function:
$$\mathcal{L} = -\displaystyle\sum_{c=1}^{M} y_{o,c} \ln (p_{o,c})$$
We will add to this a term that rewards keeping S/sqrt(S+B) high.
A potential problem is that the total S and B can only be accurately calculated once per epoch, BUT the tuning of the network depends on the changes in S and B (derivatives for backprop). These derivatives can be calculated on a per-event basis.
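The per-event derivatives exist because the "soft" counts are sums over per-event probabilities, so d(S/sqrt(S+B))/dp_i is available for every event. A finite-difference check of that claim (NumPy sketch; toy probabilities):

```python
import numpy as np

def soft_fom(p, is_sig):
    """Soft figure of merit: S and B are sums of per-event probabilities."""
    s = np.sum(p[is_sig])
    b = np.sum(p[~is_sig])
    return s / np.sqrt(s + b)

def analytic_grad(p, is_sig, i):
    """d(S/sqrt(S+B))/dp_i, derived by hand from the soft counts."""
    s, b = np.sum(p[is_sig]), np.sum(p[~is_sig])
    tot = s + b
    if is_sig[i]:
        return 1.0 / np.sqrt(tot) - s / (2.0 * tot ** 1.5)
    return -s / (2.0 * tot ** 1.5)

rng = np.random.default_rng(5)
p = rng.random(100)
is_sig = rng.random(100) < 0.3
i, eps = 0, 1e-6
p2 = p.copy()
p2[i] += eps
fd = (soft_fom(p2, is_sig) - soft_fom(p, is_sig)) / eps
print(fd, analytic_grad(p, is_sig, i))
```

The same structure in tf ops is what lets backprop flow through the significance term.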
Wednesday, July 17, 2019
ODD: mum_ndf_dedx is ZERO
Getting a weird crash when trying to calculate the chisq/ndf for the mum dedx info. All of the mum_ndf_dedx values are zero in my csv files, and they are zero as generated by the DSelector (I am not setting this branch manually in the DSelector). Might need to ask about why this is.
For the time being, I'm going to NOT calculate the chi2/ndf in the sld_pipeline, though it will presumably be important for mu-/pi-/e- separation...
Monday, July 15, 2019
Next steps for Mikey
Adding the electronic SLD to the multiclass classifier. MC is nearly done. Have to turn it into ascii files and add the capability to the tf code.
BDT code. Look into multi-class BDT.
Add the following functions to dnn_tools.py:
1. fcn that reads in data files and returns pandas dfs
2. fcn that sets up the dfs after being read in
3. fcn that generates all of the labels once the data frames are read in (this is all vague rn)
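The three dnn_tools.py helpers might be sketched like this (pandas; column names, setup steps, and the one-class-per-file labeling are all placeholder assumptions, per the "this is all vague" caveat):

```python
import io
import pandas as pd

def read_data_files(paths_or_buffers):
    """1. read each input file into a pandas DataFrame"""
    return [pd.read_csv(p) for p in paths_or_buffers]

def setup_dfs(dfs):
    """2. basic setup: drop incomplete rows, reset indices"""
    return [df.dropna().reset_index(drop=True) for df in dfs]

def generate_labels(dfs):
    """3. label events by which input file they came from (one class per df)"""
    return [pd.Series(i, index=df.index, name="label") for i, df in enumerate(dfs)]

# tiny in-memory stand-ins for two reaction files
f1 = io.StringIO("px,py\n0.1,0.2\n0.3,0.4\n")
f2 = io.StringIO("px,py\n1.0,1.1\n")
dfs = setup_dfs(read_data_files([f1, f2]))
labels = generate_labels(dfs)
print([len(d) for d in dfs], [l.iloc[0] for l in labels])
```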
Sunday, July 14, 2019
Next steps for Megan
Megan showed some plots of kinematic quantities for the "raw-raw" and "raw" muonic sld MC on Friday. These look good.
Btw, what we mean by "raw-raw" is the generated p4 and x4 quantities with x4 for the primary vertex set to (0,0,0,0). We'll call this "generated" from now on.
The "raw" MC is the same p4 vectors, but with the primary x4 set to some position in the target with some physical event time. The Lambda vertex (and any other decay vertices) are fixed by this, too. This step is taken care of by hdgeant4, and all of these quantities are taken from the hddm files that hdgeant spits out.
See /w/halld-scifs17exp/home/mmccrack/mc_processing/gen_raw_vert_files for the code to generate these files.
So far, Megan has looked at 5 files worth (10k events) of this MC.
NOW it's time for Megan to start looking at some "accepted" MC files, meaning events after the detector simulation. So! I have to generate some files that will work for her, and align with the information that she already has. The sld_mu raw files that she's using were generated on May 22, and I haven't generated anything new for this reaction since.
I THINK that I can get away with modifying the protonTRUTH DSelector, and then using my TTree to ascii scripts to get Megan the info that she needs. She should get measured and KF p4, post-KF x4, AND it would be good to have the track ID for each particle (so that we can subtract out any K+ or mu decays that would screw up vertexing). Said DSelector is here:
/w/halld-scifs17exp/home/mmccrack/dsel_protonTRUTH
Actually, I have to do some digging to figure out the track ID stuff, so I'll do that later.
My sld to ascii script is here: /Users/mmccracken/office_comp/lambda_sld/jun2019/sld_ttree_2_megan.py
ACTUALLY, I'm just going to give Megan the same files that I'm working with, but cut all events from file numbers above 1004 (i.e., remove all events with event number greater than or equal to 100500000).
Megan is going to look into differences between raw and accepted MC of the following quantities:
beam photons: energy
K+: px, py, pz, magnitude of p, energy
proton: same as K+
mu-: same as K+
primary vertex: x, y, z, t, and distance between raw and accepted vertex
Lambda (secondary) vertex: same as primary vertex.
For now, Megan will use the kinfit quantities in the accepted files.
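Once the raw and accepted files are loaded into DataFrames matched by event number, the raw-minus-accepted differences for the quantities listed above could be tabulated like this (pandas sketch; the `event` and `kp_px` column names are hypothetical):

```python
import pandas as pd

def diff_table(raw, acc, cols):
    """Per-event (raw - accepted) differences for matched events."""
    merged = raw.merge(acc, on="event", suffixes=("_raw", "_acc"))
    out = pd.DataFrame({"event": merged["event"]})
    for c in cols:
        out[c] = merged[f"{c}_raw"] - merged[f"{c}_acc"]
    return out

# toy stand-ins: three generated events, two of which survive acceptance
raw = pd.DataFrame({"event": [1, 2, 3], "kp_px": [0.1, 0.2, 0.3]})
acc = pd.DataFrame({"event": [1, 3], "kp_px": [0.12, 0.28]})
d = diff_table(raw, acc, ["kp_px"])
print(d)
```

Histogramming each column of `d` gives the raw-vs-accepted comparison plots directly.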
Files are on Google Drive:
kL_acc_fastpi_1000-1004.ascii
kL_acc_ppim_1000-1004.ascii
kL_acc_sl_mu_1000-1004.ascii