Gaussian Mixture Model Augmentation

Example GMM Divide + Cost-sensitive Learning Calculations

The basic functionality for the Gaussian Mixture Model (GMM) sample division and cost-sensitive learning is contained in the GMMbasic class. The training of the respective GMMs is done internally within the class, which is effectively a wrapper around the scikit-learn GaussianMixture class with all of its functionality.

from gpz_pype.gmm import GMMbasic

# Define a function for calculating photo-z statistics
def calcStats(photoz, specz):
    cut = np.logical_and(photoz >= 0, specz >= 0.)
    dz = photoz - specz
    abs_dz = np.abs(dz)/(1+specz)

    # Robust scatter from the best 90% of sources
    p90 = (abs_dz < np.percentile(abs_dz, 90.))
    sigma_90 = np.sqrt(np.sum((dz[p90]/(1+specz[p90]))**2) / float(len(dz)))

    # Median bias and normalised median absolute deviation (NMAD)
    bias = np.nanmedian(dz[cut]/(1+specz[cut]))
    nmad = 1.48 * np.median(np.abs(dz[cut] - np.median(dz[cut])) / (1+specz[cut]))

    # Outlier fractions: fixed |dz|/(1+z) > 0.15 threshold and 3x NMAD
    ol1 = (abs_dz > 0.15)
    ol2 = (abs_dz > (3*nmad))
    OLF1 = np.sum(ol1[cut]) / float(len(dz[cut]))
    OLF2 = np.sum(ol2[cut]) / float(len(dz[cut]))

    return nmad, sigma_90, OLF1, bias

The key inputs that we need to define for GMMbasic are as follows:

  • X_pop : The feature array for the reference population.
  • X_train : The feature array for the training sample.
  • Y_train : The labels for the training sample (i.e. the spectroscopic redshifts).
  • ncomp : Number of mixtures used for the GMM model.

For building the mixture models, we need to manually decide on and create a set of features (X_pop/X_train) with which to represent our sample. The 'best' features will depend on the input data, the scientific goals, etc. They do not necessarily need to be the features used in the GPz training (e.g. the luptitudes), or even quantities derived from them.

In the following example, given the limited number of features available, we will just use a combination of colours and magnitudes. But sizes or morphological information could also be sensible choices.

Since for our example the training set is a subset of the reference population, we can define one set of features and then split:

gmm_features = np.array([lupt_cols['lupt_F606W']-lupt_cols['lupt_F814W'],
                         lupt_cols['lupt_F814W']-lupt_cols['lupt_F125W'],
                         lupt_cols['lupt_F160W']]).T

X_pop = gmm_features[good] # The full reference population 
                           # (i.e. representative of the full sample we would like photo-z predictions for)
X_train = gmm_features[good & ~np.isnan(lupt_cols['z_spec'])] # The training subset

Instantiating the class with the inputs defined above will build the GMMs:

gmm = GMMbasic(X_pop=X_pop, 
               X_train=X_train, 
               Y_train=cat['z_spec'], 
               ncomp=4) # For larger samples and more features, more mixtures could be appropriate

GMM Weight

Since cost-sensitive learning (CSL) is the most straightforward option to include in gpz++, we will first generate some CSL weights based on the GMMs we have just produced. Doing so is as simple as:

weights = gmm.calc_weights(X_train, X_pop, max_weight=100)
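
Conceptually, the CSL weights up-weight training sources that are common in the reference population but rare in the training sample, based on the densities estimated by the two GMMs, with eta softening the ratio and max_weight capping extreme values. The following is a minimal sketch of such a density-ratio weighting; it is illustrative only and not necessarily the exact GMMbasic implementation:

import numpy as np

def density_ratio_weights(gmm_pop, gmm_train, X, eta=1e-3, max_weight=100):
    # score_samples returns the log-density of each sample under a fitted
    # sklearn GaussianMixture model
    dens_pop = np.exp(gmm_pop.score_samples(X))
    dens_train = np.exp(gmm_train.score_samples(X))
    # Up-weight sources that are common in the population but rare in the
    # training sample; eta softens the ratio and max_weight caps it
    return np.clip(dens_pop / (dens_train + eta), None, max_weight)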

In principle, we could go straight to running GPz with these inputs. But it is first useful to verify that the weights are producing sensible results and, if necessary, to quantitatively compare how well different features perform when trying to match the target distribution.

For the feature set we defined earlier, we can simply plot the distributions of the training sample features, $x_{i}$, before and after weighting is applied:

Fig, Ax = plt.subplots(1,3, figsize=(9,3))

Ax = Ax.flatten()

for i, ax in enumerate(Ax):
    c, bins, _ = ax.hist(X_pop[:,i], density=True, bins=25, 
                      range=np.percentile(X_pop[:,i], [0.1, 99.9]), histtype='step', lw=2,
                        label='All')
    ax.hist(X_train[:,i], density=True, bins=bins, histtype='step', lw=2, color='firebrick',
            label='Training')
    ax.hist(X_train[:,i], density=True, bins=bins, histtype='step', lw=2, color='firebrick', 
            ls='--', weights=weights,
            label='Weighted Training')

    ax.set_ylabel('PDF(x)')
    ax.set_xlabel(f'$x_{i+1}$')

Ax[0].legend(loc='upper right', prop={'size':7})
Fig.tight_layout()

[Figure: PDF(x) histograms of the three GMM features for the full sample ('All'), the training sample, and the weighted training sample]

Although not perfect, we can see that the fainter training sources have been significantly up-weighted when compared to the original distribution. The colours are also distributed more like those of the full sample.

To include the weights in GPz we can simply add them to our input catalogue and tell GPz to include them in the training:

cat['weights'] = weights # Add the weights to our previous catalogue

weights_run, paths = test.run_training( # All of these as above but with minor changes
      cat, 
      outdir='test_dir', 
      basename='candels_cosmos_weighted', # Change the prefix to keep separate
      bash_script=False,
      mag_prefix='lupt_',
      error_prefix='lupterr_', 
      id_col='ID',
      total_basis_functions=100,
      do_iteration=False, 
      verbose=False,
      weight_col='weights', # Now weight training by this column
    )
Fig, Ax = plt.subplots(1,2,figsize=(9,4))

Ax[0].errorbar(simple_run['z_spec'], simple_run['value'], 
            yerr=simple_run['uncertainty'], 
            fmt='o', ms=3, color='k', ecolor='k', alpha=0.2)

Ax[1].errorbar(weights_run['z_spec'], weights_run['value'], 
            yerr=weights_run['uncertainty'], 
            fmt='o', ms=3, color='steelblue', ecolor='steelblue', alpha=0.4)

labels = ['Simple', 'Weighted']

# calcStats expects (photoz, specz)
stats_simple = calcStats(simple_run['value'], simple_run['z_spec'])
stats_weight = calcStats(weights_run['value'], weights_run['z_spec'])

stats = [stats_simple, stats_weight]

for i, ax in enumerate(Ax):
    ax.set_xlim([0, 7])
    ax.set_ylim([0, 7])

    ax.set_yscale('function', functions=(forward, inverse))
    ax.set_xscale('function', functions=(forward, inverse))
    ax.plot([0, 7], [0, 7], '--', color='0.5', lw=2)
    ax.set_xlabel(r'$z_{\rm{spec}}$')
    ax.set_ylabel(r'$z_{\rm{phot}}$')
    ax.set_title(labels[i], size=11)

    ax.text(0.1, 6.0, rf'$\sigma = $ {stats[i][0]:.3f}')
    ax.text(0.1, 5.0, f'OLF = {stats[i][2]:.3f}')


[Figure: z_spec vs z_phot for the 'Simple' and 'Weighted' runs, annotated with sigma and OLF]

At face value, the weights have not resulted in an overall statistical improvement to the photo-$z$ estimates. But by eye, there is definitely some improvement at the higher-redshift end. For now, we move on to the GMM-Divide option.

GMM Divide

Dividing the training sample into different mixtures is equally simple. For this example, with its relatively small sample sizes, the number of training sources assigned to a given mixture could be very small. We therefore lower the threshold above which a source is assigned to a mixture to 0.2, i.e. requiring that a source has at least a 20% chance of belonging to that mixture. (Note that the default of 0.5 effectively assigns each source only to its best match.)

train_mixtures = gmm.divide(X_train[:,:], 
                            weight=False, # Do not include CSL weights yet
                            threshold=0.2) # Change the divide threshold

cat['weights'] = 1. # Re-set the weights to be equal

# Save for later
train_mixtures.write('cosmos_mixtures_ex.csv', format='ascii.csv', overwrite=True)
gmm_output = Table.read('cosmos_mixtures_ex.csv', format='ascii.csv')

divide_run, paths = test.run_training(cat, outdir='test_dir', 
                         basename='candels_cosmos_divide',
                         gmm_output=gmm_output,
                         bash_script=False,
                         weight_col='weights',
                         mag_prefix='lupt_', 
                         error_prefix='lupterr_', id_col='ID',
                         total_basis_functions=100, do_iteration=False, verbose=False)
GPz++ Run (mixture 1/4):   0%|                                                                                                 | 0/4 [00:00<?, ?it/s]

WARNING:root:Removing existing model file test_dir/candels_cosmos_divide_m0_model.dat


GPz++ Run (mixture 2/4):  25%|██████████████████████▎                                                                  | 1/4 [00:01<00:03,  1.08s/it]

WARNING:root:Removing existing model file test_dir/candels_cosmos_divide_m1_model.dat


GPz++ Run (mixture 3/4):  50%|████████████████████████████████████████████▌                                            | 2/4 [00:02<00:02,  1.20s/it]

WARNING:root:Removing existing model file test_dir/candels_cosmos_divide_m2_model.dat


GPz++ Run (mixture 4/4):  75%|██████████████████████████████████████████████████████████████████▊                      | 3/4 [00:03<00:01,  1.16s/it]

WARNING:root:Removing existing model file test_dir/candels_cosmos_divide_m3_model.dat


GPz++ Run (mixture 4/4): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.17s/it]

As above, let's compare the results between the simple and GMM-divided runs:

Fig, Ax = plt.subplots(1,2,figsize=(9,4))

Ax[0].errorbar(simple_run['z_spec'], simple_run['value'], 
            yerr=simple_run['uncertainty'], 
            fmt='o', ms=3, color='k', ecolor='k', alpha=0.2)

Ax[1].errorbar(divide_run['z_spec'], divide_run['value'], 
            yerr=divide_run['uncertainty'], 
            fmt='o', ms=3, color='steelblue', ecolor='steelblue', alpha=0.4)

labels = ['No divide', 'Divide']

# calcStats expects (photoz, specz)
stats_simple = calcStats(simple_run['value'], simple_run['z_spec'])
stats_divide = calcStats(divide_run['value'], divide_run['z_spec'])

stats = [stats_simple, stats_divide]

for i, ax in enumerate(Ax):
    ax.set_xlim([0, 7])
    ax.set_ylim([0, 7])

    ax.set_yscale('function', functions=(forward, inverse))
    ax.set_xscale('function', functions=(forward, inverse))
    ax.plot([0, 7], [0, 7], '--', color='0.5', lw=2)
    ax.set_xlabel(r'$z_{\rm{spec}}$')
    ax.set_ylabel(r'$z_{\rm{phot}}$')
    ax.set_title(labels[i], size=11)

    ax.text(0.1, 6.0, rf'$\sigma = $ {stats[i][0]:.3f}')
    ax.text(0.1, 5.0, f'OLF = {stats[i][2]:.3f}')   

[Figure: z_spec vs z_phot for the 'No divide' and 'Divide' runs, annotated with sigma and OLF]

This time we can see that the overall outlier fraction and scatter are both improved. Visually, the high-redshift end is also significantly better.

Let's now do a run using both the GMM Divide and the CSL weights. This can be done using the divide option, but now setting weight = True. To prevent confusion, we will also remove the weight column we added above, since the weights will now be propagated through the catalogue included as gmm_output:

train_mixtures2 = gmm.divide(X_train[:,:], 
                            weight=True, # Include the CSL weights this time
                            threshold=0.2) # Same lowered divide threshold as before

if 'weights' in cat.colnames:
    cat.remove_columns(['weights'])  # Weights will be taken from the GMM outputs file now instead

Run the GPz training as normal:

combined_run, paths = test.run_training(cat, outdir='test_dir', 
                         basename='candels_cosmos_both',
                         gmm_output=train_mixtures2,
                         bash_script=False,
                         weight_col='weights',
                         mag_prefix='lupt_', 
                         error_prefix='lupterr_', id_col='ID',
                         total_basis_functions=100, do_iteration=False, verbose=False)
GPz++ Run (mixture 4/4): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.15s/it]
Fig, Ax = plt.subplots(1,2,figsize=(9,4))

Ax[0].errorbar(simple_run['z_spec'], simple_run['value'], 
            yerr=simple_run['uncertainty'], 
            fmt='o', ms=3, color='k', ecolor='k', alpha=0.2)

Ax[1].errorbar(combined_run['z_spec'], combined_run['value'], 
            yerr=combined_run['uncertainty'], 
            fmt='o', ms=3, color='steelblue', ecolor='steelblue', alpha=0.4)

labels = ['No divide', 'Divide + Weight']

# calcStats expects (photoz, specz)
stats_simple = calcStats(simple_run['value'], simple_run['z_spec'])
stats_combined = calcStats(combined_run['value'], combined_run['z_spec'])

stats = [stats_simple, stats_combined]

for i, ax in enumerate(Ax):
    ax.set_xlim([0, 7])
    ax.set_ylim([0, 7])

    ax.set_yscale('function', functions=(forward, inverse))
    ax.set_xscale('function', functions=(forward, inverse))
    ax.plot([0, 7], [0, 7], '--', color='0.5', lw=2)
    ax.set_xlabel(r'$z_{\rm{spec}}$')
    ax.set_ylabel(r'$z_{\rm{phot}}$')
    ax.set_title(labels[i], size=11)

    ax.text(0.1, 6.0, rf'$\sigma = $ {stats[i][0]:.3f}')
    ax.text(0.1, 5.0, f'OLF = {stats[i][2]:.3f}')   

[Figure: z_spec vs z_phot for the 'No divide' and 'Divide + Weight' runs, annotated with sigma and OLF]

Compared to the Divide-only run, our combined Divide+Weight run performs slightly worse for the two metrics we have chosen, but is still an improvement over the Weight-only option. So in this case, using only the GMM-Divide would be the optimal approach. But as training sample sizes increase and additional features are added to the GPz training, this might not remain the case.

Additionally, since the spectroscopic sample is obviously biased, the Weight and Divide+Weight runs might actually provide better statistics for the overall galaxy population (at the expense of the brighter sources). Assessing the statistics as a function of e.g. spectroscopic redshift or magnitude can help make these assessments clearer.

Code reference

GMMbasic

Bases: object

Gaussian Mixture Model (GMM) for Photo-z augmentation with sklearn.

Attributes:

  • ncomp (int) : Number of components for the GMM model.
  • niter (int) : Number of iterations for the GMM model.
  • tol (float) : Tolerance for the GMM model.
  • random_state (int) : Random state for the GMM model.
  • scale (bool) : If True, the data is scaled before training the GMM model.
  • threshold (float) : Threshold for the GMM model.
  • scaler (sklearn.preprocessing.RobustScaler) : Scaler for the GMM model.
  • gmm_pop (sklearn.mixture.GaussianMixture) : GMM model for the population.
  • gmm_train (sklearn.mixture.GaussianMixture) : GMM model for the training set.
  • mixture_samples (dict) : Dictionary with the mixture samples for the population and the training set.
  • preprocess (function) : Function to preprocess the data before training the GMM model.
  • fit (function) : Function to train the GMM model.
  • population (function) : Function to train the GMM model for the population.
  • train (function) : Function to train the GMM model for the training set.
  • divide (function) : Function to divide the training set into mixture samples.

__init__(X_pop=None, X_train=None, Y_train=None, ncomp=10, threshold=0.5, niter=100, tol=0.001, random_state=0, scale=True)

Parameters:

  • X_pop (array-like, shape (n_samples, n_features)) : The data array for the reference population. Default: None.
  • X_train (array-like, shape (n_samples, n_features)) : The data array for the training sample. Default: None.
  • Y_train (array-like, shape (n_samples,)) : The labels for the training sample. Default: None.
  • ncomp (int) : Number of components for the GMM model. Default: 10.
  • threshold (float) : Threshold for the GMM model. Default: 0.5.
  • niter (int) : Number of iterations for the GMM model. Default: 100.
  • tol (float) : Tolerance for the GMM model. Default: 1e-3.
  • random_state (int) : Random state for the GMM model. Default: 0.
  • scale (bool) : If True, the data is scaled before training the GMM model. Default: True.
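
For reference, a minimal instantiation sketch with placeholder feature arrays (random values standing in for real colours and magnitudes):

import numpy as np
from gpz_pype.gmm import GMMbasic

rng = np.random.default_rng(0)
X_pop = rng.normal(size=(5000, 3))        # reference population features
X_train = X_pop[:500]                     # labelled training subset
Y_train = rng.uniform(0., 3., size=500)   # e.g. spectroscopic redshifts

gmm = GMMbasic(X_pop=X_pop, X_train=X_train, Y_train=Y_train,
               ncomp=10, threshold=0.5, scale=True)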

calc_weights(X_train, X_pop=None, eta=0.001, max_weight=100)

Calculate the weights for the training set.

Parameters:

  • X_train (array-like, shape (n_samples, n_features)) : The feature array for the training sample (the sources for which weights are calculated). Required.
  • X_pop (array-like, shape (n_samples, n_features)) : The feature array for the reference population. Default: None.
  • eta (float) : Softening parameter for the weights. Default: 0.001.
  • max_weight (float) : Maximum weight to apply to the training set. Default: 100.

Returns:

  • weights (array-like, shape (n_samples,)) : The weights for the training set.
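
A short usage sketch, continuing from the placeholder instantiation above:

# Weights for the training sample relative to the reference population,
# capped at max_weight so that a few rare sources do not dominate the training
weights = gmm.calc_weights(X_train, X_pop=X_pop, eta=1e-3, max_weight=100)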

divide(X, weight=False, threshold=None, eta=0.001, max_weight=100, return_density=False)

Divide the input array into mixture samples.

Parameters:

  • X (array-like, shape (n_samples, n_features)) : The input data to map onto the GMM. Required.
  • weight (bool) : If True, the cost-sensitive learning weights are calculated. Default: False.
  • threshold (float) : Threshold for the GMM model. If None, the default threshold is used. Default: None.
  • eta (float) : Softening parameter for the weights. Default: 0.001.
  • max_weight (float) : Maximum weight for the weights. Default: 100.
  • return_density (bool) : If True, the density of the sample for each mixture component is returned. For use with predictions with multiple components. Default: False.

Returns:

  • mixture_samples (dict) : Dictionary with the mixture samples for the population and the training set.
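
A short usage sketch, continuing from the placeholder instantiation above (the exact structure of the returned object is best checked against the tutorial, where it is written out and read back as a table):

# Assign training sources to mixture components; with weight=True the CSL
# weights are also calculated and carried through in the output
train_mixtures = gmm.divide(X_train, weight=True, threshold=0.2)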

fit(X)

Train the GMM model.

Parameters:

  • X (array-like, shape (n_samples, n_features)) : The data to train the GMM model. Required.

Returns:

  • gmm (sklearn.mixture.GaussianMixture) : GMM model.
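
A minimal sketch, continuing from the placeholder arrays above:

# Fit a mixture model to the population features; the returned object is a
# standard sklearn GaussianMixture instance
gmm_model = gmm.fit(X_pop)
print(gmm_model.means_.shape)   # (ncomp, n_features)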

load(filename)

Load the GMM model from file.

Parameters:

  • filename (str) : Name of the file from which to load the GMM model. Required.

population(X)

Train the GMM model for the population.

Parameters:

  • X (array-like, shape (n_samples, n_features)) : The data to train the GMM model. Required.

rescale(X)

Preprocess the data before training the GMM model.

Parameters:

  • X (array-like, shape (n_samples, n_features)) : The data to be rescaled. Required.

Returns:

  • X (array-like, shape (n_samples, n_features)) : The rescaled data.
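
A short sketch, assuming the class was instantiated with scale=True so that the internal RobustScaler has been fitted:

# Apply the same scaling used internally by the GMMs to a new feature array
X_scaled = gmm.rescale(X_train)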

save(filename)

Save the GMM model to file.

Parameters:

  • filename (str) : Name of the file to which the GMM model is saved. Required.
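
A short save/load round trip; the filename (and its extension) is just a placeholder:

# Persist the trained mixture models to disk and restore them later
gmm.save('gmm_models.pkl')
gmm.load('gmm_models.pkl')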

train(X)

Train the GMM model for the training set.

Parameters:

  • X (array-like, shape (n_samples, n_features)) : The data to train the GMM model. Required.
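
A short sketch of retraining the two underlying GMMs explicitly (ordinarily this happens automatically when the class is instantiated with X_pop and X_train, as in the tutorial above):

gmm.population(X_pop)   # refit the GMM describing the reference population
gmm.train(X_train)      # refit the GMM describing the training subset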