Gaussian Mixture Model Augmentation
Example GMM Divide + Cost-sensitive Learning Calculations
The basic functionality for Gaussian Mixture Model (GMM) sample division and cost-sensitive learning is contained in the GMMbasic class. The training of the respective GMMs is handled internally by the class, which is effectively a wrapper around the scikit-learn GaussianMixture class and retains all of its functionality.
from gpz_pype.gmm import GMMbasic
# Define a function for calculating photo-z statistics
def calcStats(photoz, specz):
    # Only include sources with valid photometric and spectroscopic redshifts
    cut = np.logical_and(photoz >= 0, specz >= 0.)
    dz = photoz - specz
    abs_dz = np.abs(dz) / (1 + specz)
    # Scatter of the best 90% of sources
    p90 = (abs_dz < np.percentile(abs_dz, 90.))
    sigma_90 = np.sqrt(np.sum((dz[p90] / (1 + specz[p90]))**2) / float(len(dz)))
    # Median bias
    bias = np.nanmedian(dz[cut] / (1 + specz[cut]))
    # Normalised median absolute deviation (NMAD) and outlier fractions
    nmad = 1.48 * np.median(np.abs(dz[cut] - np.median(dz[cut])) / (1 + specz[cut]))
    ol1 = (abs_dz > 0.15)
    ol2 = (abs_dz > (3 * nmad))
    OLF1 = np.sum(ol1[cut]) / float(len(dz[cut]))
    OLF2 = np.sum(ol2[cut]) / float(len(dz[cut]))
    return nmad, sigma_90, OLF1, bias
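For reference, the two headline statistics quoted in the plots below are the normalised median absolute deviation and the outlier fraction computed in the function above,

$\sigma_{\rm NMAD} = 1.48 \times {\rm median}\left(\frac{|\Delta z - {\rm median}(\Delta z)|}{1+z_{\rm spec}}\right), \qquad {\rm OLF} = \frac{N\left(|\Delta z|/(1+z_{\rm spec}) > 0.15\right)}{N_{\rm total}},$

where $\Delta z = z_{\rm phot} - z_{\rm spec}$.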
The key inputs that we need to define for GMMbasic are as follows:

- `X_pop`: The feature array for the reference population.
- `X_train`: The feature array for the training sample.
- `Y_train`: The labels for the training sample (i.e. the spectroscopic redshifts).
- `ncomp`: Number of mixtures used for the GMM model.
For building the mixture models, we need to manually decide on and create a set of features (`X_pop`/`X_train`) with which we want to represent our sample. The 'best' features will depend on the input data, the scientific goals, etc. They also do not necessarily need to be features used in the GPz training (e.g. luptitudes), or even quantities derived from them.
In the following example, given the limited number of features available, we will just use a combination of colours and magnitudes. But sizes or morphological information could also be sensible choices.
Since for our example the training set is a subset of the reference population, we can define one set of features and then split:
gmm_features = np.array([lupt_cols['lupt_F606W'] - lupt_cols['lupt_F814W'],
                         lupt_cols['lupt_F814W'] - lupt_cols['lupt_F125W'],
                         lupt_cols['lupt_F160W']]).T

X_pop = gmm_features[good]  # The full reference population
# (i.e. representative of the full sample we would like photo-z predictions for)
X_train = gmm_features[good * (np.isnan(lupt_cols['z_spec']) == False)]  # The training subset
Instantiating the class with the inputs defined above will build the GMMs:
gmm = GMMbasic(X_pop=X_pop,
               X_train=X_train,
               Y_train=cat['z_spec'],
               ncomp=4)  # For larger samples and more features, more mixtures could be appropriate
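The number of mixtures is a free choice. One simple way to sanity-check it (illustrative only, not part of the gpz_pype workflow) is to scan a range of component numbers and compare the Bayesian Information Criterion (BIC) of the corresponding scikit-learn fits, picking a value beyond which the BIC stops improving substantially:

```python
from sklearn.mixture import GaussianMixture

# Illustrative check: compare BIC over a small grid of component numbers
for n in range(2, 9):
    fit = GaussianMixture(n_components=n, random_state=0).fit(X_pop)
    print(f'ncomp={n}: BIC={fit.bic(X_pop):.1f}')
```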
GMM Weight
Since cost-sensitive learning (CSL) is the most straightforward of the two options to include in gpz++, we will first generate some CSL weights based on the GMMs we have just produced. Doing so is as simple as:
weights = gmm.calc_weights(X_train, X_pop, max_weight=100)
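Conceptually, each training source is up-weighted by the ratio of how common its features are in the reference population to how common they are in the (biased) training sample, with `eta` softening the denominator and `max_weight` capping extreme values. Below is a minimal sketch of this density-ratio idea written directly against scikit-learn; it is an assumption about the mechanism for illustration, and the exact gpz_pype implementation may differ in detail:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch of density-ratio CSL weights (illustrative assumption, not the gpz_pype code)
def sketch_csl_weights(X_train, X_pop, ncomp=4, eta=1e-3, max_weight=100.):
    gmm_pop = GaussianMixture(n_components=ncomp, random_state=0).fit(X_pop)
    gmm_train = GaussianMixture(n_components=ncomp, random_state=0).fit(X_train)
    # score_samples returns the log probability density at each training point
    dens_pop = np.exp(gmm_pop.score_samples(X_train))
    dens_train = np.exp(gmm_train.score_samples(X_train))
    weights = dens_pop / (dens_train + eta)  # up-weight under-represented regions
    return np.minimum(weights, max_weight)   # cap extreme weights
```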
In principle, we could go straight to running GPz with these inputs. However, it is first useful to verify that the weights are producing sensible results and, if necessary, to quantitatively compare how well different features perform at matching the target distribution.
For the feature set we defined earlier, we can simply plot the distributions of the training sample features, $x_{i}$, before and after weighting is applied:
Fig, Ax = plt.subplots(1, 3, figsize=(9, 3))
Ax = Ax.flatten()

for i, ax in enumerate(Ax):
    c, bins, _ = ax.hist(X_pop[:, i], density=True, bins=25,
                         range=np.percentile(X_pop[:, i], [0.1, 99.9]), histtype='step', lw=2,
                         label='All')
    ax.hist(X_train[:, i], density=True, bins=bins, histtype='step', lw=2, color='firebrick',
            label='Training')
    ax.hist(X_train[:, i], density=True, bins=bins, histtype='step', lw=2, color='firebrick',
            ls='--', weights=weights,
            label='Weighted Training')
    ax.set_ylabel('PDF(x)')
    ax.set_xlabel(f'$x_{i+1}$')

Ax[0].legend(loc='upper right', prop={'size': 7})
Fig.tight_layout()
Although not perfect, we can see that the fainter training sources have been significantly up-weighted when compared to the original distribution. The colours are also distributed more like those of the full sample.
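To go beyond a by-eye comparison (and to quantitatively compare how well different feature sets match the target distribution, as mentioned above), one simple, hypothetical check is to compare the moments of each feature for the full population, the raw training sample, and the weighted training sample:

```python
import numpy as np

# Hypothetical quick check: the weighted training means should move towards the population values
for i in range(X_pop.shape[1]):
    pop_mean = np.mean(X_pop[:, i])
    train_mean = np.mean(X_train[:, i])
    wtrain_mean = np.average(X_train[:, i], weights=weights)
    print(f'x_{i+1}: population = {pop_mean:.2f}, '
          f'training = {train_mean:.2f}, weighted training = {wtrain_mean:.2f}')
```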
To include the weights in GPz we can simply add them to our input catalogue and tell GPz to include them in the training:
cat['weights'] = weights # Add the weights to our previous catalogue
weights_run, paths = test.run_training(  # All arguments as above, with minor changes
    cat,
    outdir='test_dir',
    basename='candels_cosmos_weighted',  # Change the prefix to keep runs separate
    bash_script=False,
    mag_prefix='lupt_',
    error_prefix='lupterr_',
    id_col='ID',
    total_basis_functions=100,
    do_iteration=False,
    verbose=False,
    weight_col='weights',  # Now weight training by this column
)
Fig, Ax = plt.subplots(1, 2, figsize=(9, 4))

Ax[0].errorbar(simple_run['z_spec'], simple_run['value'],
               yerr=simple_run['uncertainty'],
               fmt='o', ms=3, color='k', ecolor='k', alpha=0.2)
Ax[1].errorbar(weights_run['z_spec'], weights_run['value'],
               yerr=weights_run['uncertainty'],
               fmt='o', ms=3, color='steelblue', ecolor='steelblue', alpha=0.4)

labels = ['Simple', 'Weighted']
# calcStats expects (photoz, specz)
stats_simple = calcStats(simple_run['value'], simple_run['z_spec'])
stats_weight = calcStats(weights_run['value'], weights_run['z_spec'])
stats = [stats_simple, stats_weight]

for i, ax in enumerate(Ax):
    ax.set_xlim([0, 7])
    ax.set_ylim([0, 7])
    ax.set_yscale('function', functions=(forward, inverse))
    ax.set_xscale('function', functions=(forward, inverse))
    ax.plot([0, 7], [0, 7], '--', color='0.5', lw=2)
    ax.set_xlabel(r'$z_{\rm{spec}}$')
    ax.set_ylabel(r'$z_{\rm{phot}}$')
    ax.set_title(labels[i], size=11)
    ax.text(0.1, 6.0, rf'$\sigma = $ {stats[i][0]:.3f}')
    ax.text(0.1, 5.0, f'OLF = {stats[i][2]:.3f}')
At face value, the weights have not resulted in an overall statistical improvement to the photo-$z$ estimates. But by eye, there is definitely some improvement at the higher-redshift end. For now, we move on to the GMM-Divide option.
GMM Divide
Dividing the training sample into different mixtures is equally simple. For this example, with its relatively small sample sizes, the number of training sources assigned to a given mixture could be very small. We will therefore lower the threshold above which a source is assigned to a mixture to 0.2, i.e. requiring that a source has at least a 20% probability of belonging to that mixture. (Note that the default of 0.5 effectively assigns each source only to its best-matching mixture.)
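To make the threshold behaviour more concrete, here is a minimal sketch of what thresholded assignment could look like using the underlying scikit-learn mixture model directly. This is an illustrative assumption about the mechanism (e.g. that assignment is based on the population GMM's membership probabilities), not the gpz_pype code itself; the actual divide call follows below.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative sketch only: assign sources to every mixture for which their
# membership probability exceeds the threshold (assumed behaviour of divide)
gmm_sketch = GaussianMixture(n_components=4, random_state=0).fit(X_pop)
probs = gmm_sketch.predict_proba(X_train)   # shape (n_train, n_components)
threshold = 0.2
for m in range(probs.shape[1]):
    members = probs[:, m] >= threshold      # at least a 20% chance of belonging to mixture m
    print(f'Mixture {m}: {members.sum()} training sources')
```

With a threshold below 0.5, the same source can therefore appear in the training set of more than one mixture.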
train_mixtures = gmm.divide(X_train[:, :],
                            weight=False,   # Do not include CSL weights yet
                            threshold=0.2)  # Change the divide threshold

cat['weights'] = 1.  # Re-set the weights to be equal

# Save for later
train_mixtures.write('cosmos_mixtures_ex.csv', format='ascii.csv', overwrite=True)
gmm_output = Table.read('cosmos_mixtures_ex.csv', format='ascii.csv')
divide_run, paths = test.run_training(cat, outdir='test_dir',
                                      basename='candels_cosmos_divide',
                                      gmm_output=gmm_output,
                                      bash_script=False,
                                      weight_col='weights',
                                      mag_prefix='lupt_',
                                      error_prefix='lupterr_', id_col='ID',
                                      total_basis_functions=100, do_iteration=False, verbose=False)
GPz++ Run (mixture 1/4): 0%| | 0/4 [00:00<?, ?it/s]
WARNING:root:Removing existing model file test_dir/candels_cosmos_divide_m0_model.dat
GPz++ Run (mixture 2/4): 25%|██████████████████████▎ | 1/4 [00:01<00:03, 1.08s/it]
WARNING:root:Removing existing model file test_dir/candels_cosmos_divide_m1_model.dat
GPz++ Run (mixture 3/4): 50%|████████████████████████████████████████████▌ | 2/4 [00:02<00:02, 1.20s/it]
WARNING:root:Removing existing model file test_dir/candels_cosmos_divide_m2_model.dat
GPz++ Run (mixture 4/4): 75%|██████████████████████████████████████████████████████████████████▊ | 3/4 [00:03<00:01, 1.16s/it]
WARNING:root:Removing existing model file test_dir/candels_cosmos_divide_m3_model.dat
GPz++ Run (mixture 4/4): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00, 1.17s/it]
As above, let's compare the results between the simple and GMM-divided runs:
Fig, Ax = plt.subplots(1, 2, figsize=(9, 4))

Ax[0].errorbar(simple_run['z_spec'], simple_run['value'],
               yerr=simple_run['uncertainty'],
               fmt='o', ms=3, color='k', ecolor='k', alpha=0.2)
Ax[1].errorbar(divide_run['z_spec'], divide_run['value'],
               yerr=divide_run['uncertainty'],
               fmt='o', ms=3, color='steelblue', ecolor='steelblue', alpha=0.4)

labels = ['No divide', 'Divide']
# calcStats expects (photoz, specz)
stats_simple = calcStats(simple_run['value'], simple_run['z_spec'])
stats_divide = calcStats(divide_run['value'], divide_run['z_spec'])
stats = [stats_simple, stats_divide]

for i, ax in enumerate(Ax):
    ax.set_xlim([0, 7])
    ax.set_ylim([0, 7])
    ax.set_yscale('function', functions=(forward, inverse))
    ax.set_xscale('function', functions=(forward, inverse))
    ax.plot([0, 7], [0, 7], '--', color='0.5', lw=2)
    ax.set_xlabel(r'$z_{\rm{spec}}$')
    ax.set_ylabel(r'$z_{\rm{phot}}$')
    ax.set_title(labels[i], size=11)
    ax.text(0.1, 6.0, rf'$\sigma = $ {stats[i][0]:.3f}')
    ax.text(0.1, 5.0, f'OLF = {stats[i][2]:.3f}')
This time we can see that both the overall outlier fraction and the scatter are improved. Visually, the high-redshift end is also significantly better.
Let's now do a run using both the GMM Divide and the Weights. This can be done with the divide option, but setting `weight=True`. To prevent confusion, we will also remove the weight column we added above, since the weights will now be propagated through the catalogue included as `gmm_output`:
train_mixtures2 = gmm.divide(X_train[:, :],
                             weight=True,    # Now also include the CSL weights
                             threshold=0.2)  # Same divide threshold as before

if 'weights' in cat.colnames:
    cat.remove_columns(['weights'])  # Weights will be taken from the GMM outputs file now instead
Run the GPz training as normal:
combined_run, paths = test.run_training(cat, outdir='test_dir',
                                        basename='candels_cosmos_both',
                                        gmm_output=train_mixtures2,
                                        bash_script=False,
                                        weight_col='weights',
                                        mag_prefix='lupt_',
                                        error_prefix='lupterr_', id_col='ID',
                                        total_basis_functions=100, do_iteration=False, verbose=False)
GPz++ Run (mixture 4/4): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00, 1.15s/it]
Fig, Ax = plt.subplots(1, 2, figsize=(9, 4))

Ax[0].errorbar(simple_run['z_spec'], simple_run['value'],
               yerr=simple_run['uncertainty'],
               fmt='o', ms=3, color='k', ecolor='k', alpha=0.2)
Ax[1].errorbar(combined_run['z_spec'], combined_run['value'],
               yerr=combined_run['uncertainty'],
               fmt='o', ms=3, color='steelblue', ecolor='steelblue', alpha=0.4)

labels = ['No divide', 'Divide + Weight']
# calcStats expects (photoz, specz)
stats_simple = calcStats(simple_run['value'], simple_run['z_spec'])
stats_combined = calcStats(combined_run['value'], combined_run['z_spec'])
stats = [stats_simple, stats_combined]

for i, ax in enumerate(Ax):
    ax.set_xlim([0, 7])
    ax.set_ylim([0, 7])
    ax.set_yscale('function', functions=(forward, inverse))
    ax.set_xscale('function', functions=(forward, inverse))
    ax.plot([0, 7], [0, 7], '--', color='0.5', lw=2)
    ax.set_xlabel(r'$z_{\rm{spec}}$')
    ax.set_ylabel(r'$z_{\rm{phot}}$')
    ax.set_title(labels[i], size=11)
    ax.text(0.1, 6.0, rf'$\sigma = $ {stats[i][0]:.3f}')
    ax.text(0.1, 5.0, f'OLF = {stats[i][2]:.3f}')
Compared to the Divide-only run, our combined Divide+Weights run performs slightly worse on the two metrics we have chosen, but is still an improvement over the Weight-only option. So in this case, using only the GMM-Divide would be the optimal approach. But as training sample sizes increase and additional features are added to the GPz training, this might not remain the case.
Additionally, since the spectroscopic sample is obviously biased, the Weight/Divide+Weight runs might actually provide better statistics for the overall galaxy population (at the expense of the brighter sources). Assessing the statistics as a function of e.g. spectroscopic redshift or magnitude can make these trade-offs clearer.
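As a hypothetical example of such an assessment, the same statistics could be recomputed in bins of spectroscopic redshift (the equivalent works for bins of magnitude):

```python
import numpy as np

# Hypothetical sketch: photo-z statistics in bins of spectroscopic redshift
zbins = np.arange(0., 7.5, 1.0)
for zlo, zhi in zip(zbins[:-1], zbins[1:]):
    in_bin = (combined_run['z_spec'] >= zlo) & (combined_run['z_spec'] < zhi)
    if in_bin.sum() < 10:  # skip sparsely populated bins
        continue
    nmad, sigma_90, olf, bias = calcStats(combined_run['value'][in_bin],
                                          combined_run['z_spec'][in_bin])
    print(f'{zlo:.1f} <= z < {zhi:.1f}: sigma_NMAD = {nmad:.3f}, OLF = {olf:.3f}, N = {in_bin.sum()}')
```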
Code reference

GMMbasic

Bases: object

Gaussian Mixture Model (GMM) for Photo-z augmentation with sklearn.
Attributes:

Name | Type | Description |
---|---|---|
ncomp | int | Number of components for the GMM model. |
niter | int | Number of iterations for the GMM model. |
tol | float | Tolerance for the GMM model. |
random_state | int | Random state for the GMM model. |
scale | bool | If True, the data is scaled before training the GMM model. |
threshold | float | Threshold for the GMM model. |
scaler | sklearn.preprocessing.RobustScaler | Scaler for the GMM model. |
gmm_pop | sklearn.mixture.GaussianMixture | GMM model for the population. |
gmm_train | sklearn.mixture.GaussianMixture | GMM model for the training set. |
mixture_samples | dict | Dictionary with the mixture samples for the population and the training set. |
preprocess | function | Function to preprocess the data before training the GMM model. |
fit | function | Function to train the GMM model. |
population | function | Function to train the GMM model for the population. |
train | function | Function to train the GMM model for the training set. |
divide | function | Function to divide the training set into mixture samples. |
__init__(X_pop=None, X_train=None, Y_train=None, ncomp=10, threshold=0.5, niter=100, tol=0.001, random_state=0, scale=True)
Parameters:

Name | Type | Description | Default |
---|---|---|---|
X_pop | array-like, shape (n_samples, n_features) | The data array for the reference population. | None |
X_train | array-like, shape (n_samples, n_features) | The data array for the training sample. | None |
Y_train | array-like, shape (n_samples,) | The labels for the training sample. | None |
ncomp | int | Number of components for the GMM model. | 10 |
threshold | float | Threshold for the GMM model. | 0.5 |
niter | int | Number of iterations for the GMM model. | 100 |
tol | float | Tolerance for the GMM model. | 1e-3 |
random_state | int | Random state for the GMM model. | 0 |
scale | bool | If True, the data is scaled before training the GMM model. | True |
calc_weights(X_train, X_pop=None, eta=0.001, max_weight=100)
Calculate the weights for the training set.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
X_train | array-like, shape (n_samples, n_features) | The data to train the GMM model. | required |
X_pop | array-like, shape (n_samples, n_features) | The data to train the GMM model. | None |
eta | float | Softening parameter for the weights. | 0.001 |
max_weight | float | Maximum weight to apply to the training set. | 100 |

Returns:

Name | Type | Description |
---|---|---|
weights | array-like, shape (n_samples,) | The weights for the training set. |
divide(X, weight=False, threshold=None, eta=0.001, max_weight=100, return_density=False)
Divide the input array into mixture samples.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
X | array-like, shape (n_samples, n_features) | The input data to map onto the GMM. | required |
threshold | float | Threshold for the GMM model. If None, the default threshold is used. | None |
weight | bool | If True, the cost-sensitive learning weights are calculated. | False |
eta | float | Softening parameter for the weights. | 0.001 |
max_weight | float | Maximum weight for the weights. | 100 |
return_density | bool | If True, the density of the sample for each mixture component is returned. For use with predictions with multiple components. | False |

Returns:

Name | Type | Description |
---|---|---|
mixture_samples | dict | Dictionary with the mixture samples for the population and the training set. |
fit(X)
Train the GMM model.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
X | array-like, shape (n_samples, n_features) | The data to train the GMM model. | required |

Returns:

Name | Type | Description |
---|---|---|
gmm | sklearn.mixture.GaussianMixture | GMM model. |
load(filename)
Load the GMM model from file.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
filename | str | Name of the file to load the GMM model. | required |
population(X)
Train the GMM model for the population.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
X | array-like, shape (n_samples, n_features) | The data to train the GMM model. | required |
rescale(X)
Preprocess the data before training the GMM model.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
X | array-like, shape (n_samples, n_features) | The data to train the GMM model. | required |

Returns:

Name | Type | Description |
---|---|---|
X | array-like, shape (n_samples, n_features) | The rescaled data. |
save(filename)
Save the GMM model to file.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
filename | str | Name of the file to save the GMM model. | required |
train(X)
Train the GMM model for the training set.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
X | array-like, shape (n_samples, n_features) | The data to train the GMM model. | required |