Skip to content

Basic GPz++ usage

Introduction

For this example notebook we will use the CANDELS COSMOS catalogue as an input. This includes lots of photometry from both ground + space, but we will restrict analysis to just the HST filters for simplicity.

Inputting the catalogue into GPz however requires some additional pre-processing. We start by extracting the general common columns and setting any missing z_spec values to np.nan:

import numpy as np

# Plotting modules
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 100

# Astropy modules
from astropy.table import Table, join, vstack, hstack
from astropy.coordinates import SkyCoord
import astropy.units as u

# Import required gpz_pype functions
from gpz_pype.utilities import Params, set_gpz_path, basic_lupt_soft, flux_to_lupt
from gpz_pype.base import GPz

candels = Table.read('CANDELS.COSMOS.F160W.Processed.photz.fits', format='fits')

catalogue = candels[['ID', 'RA', 'DEC', 'CLASS_STAR', 'FLAGS', 'EBV', 'z_spec']]

catalogue['z_spec'][catalogue['z_spec'] < 0] = np.nan

To further trim the catalogue down, i next want to pull out only the flux/magnitude columns that belong to HST filters. In this case, columns that start with either 'ACS' or 'WFC'. As with z_spec, any value set as -99 in the input catalogue (i.e. no data) is converted to np.nan:

for col in candels.colnames:
    if col.startswith('ACS') or col.startswith('WFC'):
        inst, filt, ext = col.split('_') # Split column name into 3 parts based on '_'
        col_vals = candels[col]
        col_vals[col_vals < -90] = np.nan
        catalogue[f'{ext}_{filt}'] = col_vals # [FLUX/FLUXERR]_[FILTER]

To extract the filters that were found in the catalogue, we reprocess the column names to look for filters. This could have been done above, but since there were both MAG/FLUX columns this would have resulted in duplicates. So this is simpler:

filters = [col.split('_')[1] for col in catalogue.colnames if col.startswith('FLUX_')]

Asinh magnitudes

Extensive past experience has shown that training machine learning using magnitudes can provide significantly better results than using linear flux values. This is particularly true for fields with very high dynamic range.

However, the significant drawback of normal AB magnitudes is that they cannot be used at low S/N or for negative fluxes (consistent with zero). For GPz we therefore make use of asinh magnitudes, which remain real+positive for very low S/N that remain informative for non-detections without the need for additional missing value prediction etc. This is particularly key for high-redshift where the non-detection at bluer wavelengths is critical for the redshift estimate.

Ideally, the softening parameter ($b$) used to derive asinh magnitudes from fluxes + uncertainties will be derived from the local noise on a per-object basis. However, when these are not available the use of a global softening parameter does not significantly impact photo-z results.

For convenience, gpz_pype includes a function for estimating a suitable global softening parameter based on a set of input fluxes+errors (that are assumed to be representative of the full field):

# Calculate a softening parameter for each filter in the list of filters derived above:
b_arr = [basic_lupt_soft(catalogue[f'FLUX_{filt}'], catalogue[f'FLUXERR_{filt}']) for filt in filters] 

With softening parameters calculated, we can then calculate the asinh magnitudes (also known as 'luptitudes') for each of our filters. These can be calculated using the flux_to_lupt function included in gpz_pype.

# Make a new catalogue with the relevant key reference columns:
lupt_cols = catalogue[['ID', 'RA', 'DEC', 'CLASS_STAR', 'FLAGS', 'EBV', 'z_spec']]

check_nans = np.zeros(len(catalogue)) # Running counter of null values for each object

for filt, b in zip(filters, b_arr):
    lupt, lupterr = flux_to_lupt(catalogue[f'FLUX_{filt}'], # Input flux (uJy)
                                 catalogue[f'FLUXERR_{filt}'], # Input uncertainty (uJy)
                                 b, # Filter specific softening parameter
                                ) 

    lupt_cols[f'lupt_{filt}'] = lupt
    lupt_cols[f'lupterr_{filt}'] = lupterr

    check_nans += np.isnan(lupt) # Update nan counter for this filter

After filling all the columns, for the purpose of training GPz we want to cut our catalogue down to those sources with values in all the filters we want to use and for which there is a spectroscopic redshift:

good = (check_nans == 0) # 'Good' sources for training are those with 0 NaNs

cat = lupt_cols[good * (np.isnan(lupt_cols['z_spec']) == False)] # Keep only good sources with z_spec

The following code assumes that gpz++ is correctly installed on the system and that it can be run through the command line without issue. The gpz_pype classes simply wrap around gpz++ and streamline the book-keeping required when using more complicated splitting+weighting of training sample.

The path to gpz++ can be set as a system variable so this step might not be required in future. It can also be input directly into the GPz class below, but for safety we can use the convenience function set_gpz_path to set this for the current session.

Running GPz

set_gpz_path('gpzpp/bin/gpz++')

The main workhorse of gpz_pype is the GPz class, it needs to be instantiated with a set of parameters that we can then update manually (and which GPz will automatically set for us when necessary). We can also define the number of CPUs we want gpz++ to make use of during training:

test = GPz(param_file='gpzpp/example/gpz.param', ncpu=4)

To perform GPz training on an input catalogue in gpz++ in our most straight-forward approach, we can use the .run_training function from the GPz class.

Even in its most straight-forward approach, there are lots of options that must be set. Many of which can be left as their degault value. I have tried to ensure that in-code documentation for gpz_pype is relatively complete, so the full set of options available can be seen using the standard python help functionality, e.g.:

help(test.run_training)
Help on method run_training in module gpz_pype.base:

run_training(catalog, outdir, basename, gmm_output=None, bash_script=False, mag_prefix='mag_', error_prefix='magerr_', z_col='z_spec', id_col='id', weight_col=None, output_min=0, output_max=7, test_fraction=0.2, valid_fraction=0.2, do_iteration=False, iter_cov='gpvc', total_basis_functions=100, verbose=False, **kwargs) method of gpz_pype.base.GPz instance
    Run GPz training and prediction for a given input catalogue.

    Parameters
    ----------
    catalog : astropy.table.Table, str
        Catalog to run GPz on. If str, will read in catalog from file.
    outdir : str
        Output directory.
    basename : str
        Basename for output files.
    gmm_output : astroy.table.Table
        GMM divide and/or weight output catalog.
    bash_script: bool
        If True, output gpz++ commands to a bash script instead of running.
    mag_prefix : str, optional
        Prefix for magnitude columns.  Default is 'mag_'.
    error_prefix : str, optional
        Prefix for magnitude error columns.  Default is 'magerr_'.
    z_col : str, optional
        Name of spectroscopic redshift column.  Default is 'z_spec'.
    weight_col : str, optional
        Name of weight column.  Default is None.
    output_min : float, optional
        Minimum redshift to output.  Default is 0.
    output_max : float, optional
        Maximum redshift to output.  Default is 7.
    test_fraction : float, optional
        Fraction of catalog to use for testing.  Default is 0.2.
    valid_fraction : float, optional
        Fraction of training set to use for validation.  Default is 0.2.
    do_iteration : bool, optional
        If True, will set up 2nd gpz++ iteration with a more complex covariance.
        Default is False.
    iter_cov : str, optional
        Covariance type to use for 2nd iteration.  Default is 'gpvc'.
    total_basis_functions : int, optional
        Total number of basis functions to use.  Default is 100.
    verbose : bool, optional
        If True, will print out GPz++ output.  Default is False.
    **kwargs : dict, optional
        Additional keyword arguments to pass to Table.read() if catalog is a
        filename.

    Returns
    -------
simple_run, paths = test.run_training(
      cat, # Run training on this input catalogue - can be a path or a Table object
      outdir='test_dir', # Save training/output catalogues in this directory
      basename='candels_cosmos', # Start our output filenames with this
      bash_script=False, # Run GPz directly, don't just write the parameter files + commands to bash
      mag_prefix='lupt_', # Look for these asinh magnitude columns
      error_prefix='lupterr_', # Look for these error columns
      id_col='ID', # ID column to propagate into output files
      total_basis_functions=100, # NUMBF passed to gpz++
      do_iteration=False, # If True, run second iteration with more complex covariance
      verbose=False, # Don't print all of the gpz++ output to screen. Turn on for debugging
    )
WARNING:root:Removing existing model file test_dir/candels_cosmos_model.dat

There are two outputs from .run_training:

  1. Catalogue: The 'test' subset of the input catalogue, with gpz++ prediction columns appended:
simple_run.show_in_notebook()
  1. Output paths: A dictionary containing the relative paths to the various files produced from the gpz++ training - including both input catalogues (in gpz format) and the trained model + associated parameter file:
paths
{'train': 'test_dir/candels_cosmos_train.txt',
 'test': 'test_dir/candels_cosmos_test.txt',
 'output_cat': 'test_dir/candels_cosmos_output.txt',
 'output_model': 'test_dir/candels_cosmos_model.dat',
 'param': 'test_dir/candels_cosmos.param'}

Plotting outputs

Just for validation purposes, lets plot the predicted photo-$z$ versus spec-$z$ for the test catalogue. We can see that between $1 < z < 3$ the predictions do a decent job, even with just the 4-bands used in training. But at low and high redshifts the performance falls off

# Function x**(1/2)
def forward(x):
    return np.log10(1+x)

def inverse(x):
    return (10**x) - 1

Fig, Ax = plt.subplots(1,1,figsize=(4.5,4.5))

Ax.errorbar(simple_run['z_spec'], simple_run['value'], 
            yerr=simple_run['uncertainty'], 
            fmt='o', ms=3, color='k', ecolor='k', alpha=0.2)
Ax.set_xlim([0, 7])
Ax.set_ylim([0, 7])

Ax.set_yscale('function', functions=(forward, inverse))
Ax.set_xscale('function', functions=(forward, inverse))


Ax.plot([0, 7], [0, 7], '--', color='0.5', lw=2)
Ax.set_xlabel(r'$z_{\rm{spec}}$')
Ax.set_ylabel(r'$z_{\rm{phot}}$')

png

Modules

Bases: object

Class to run GPz.

This class is used to prepare and run gpz++ on a given catalog. Options for running gpz++ are set in a parameter file, which can be read in using the read() method. Alternatively, options can be set directly using the class attributes.

The run_training() method can be used either to run gpz++ on a single catalogue, or to automatically split a catalogue into mixture samples for training and testing based on a given GMM model (see gpz_pype.gmm.GMMbasic).

gpz++ can be called directly from within the class or for large catalogues, the necessary run commands can be written to a shell script.

__init__(param_file=None, gpz_path=None, ncpu=None)

Parameters:

Name Type Description Default
param_file str

Path to parameter file to read.

None
gpz_path str

Path to gpz++ installation. If not specified, will look for GPZPATH environment variable.

None
ncpu int

Number of CPUs for gpz++ to use. If not specified, will use all available CPUs.

None

predict(catalog, outdir, basename, gmm_output=None, bash_script=False, mag_prefix='mag_', error_prefix='magerr_', z_col='z_spec', id_col='id', weight_col=None, output_min=0, output_max=7, verbose=False, **kwargs)

Run GPz training and prediction for a given input catalogue.

Parameters:

Name Type Description Default
catalog (astropy.table.Table, str)

Catalog to run GPz on. If str, will read in catalog from file.

required
outdir str

Output directory.

required
basename str

Basename for output files.

required
gmm_output astroy.table.Table

GMM divide and/or weight output catalog.

None
bash_script

If True, output gpz++ commands to a bash script instead of running.

False
mag_prefix str

Prefix for magnitude columns. Default is 'mag_'.

'mag_'
error_prefix str

Prefix for magnitude error columns. Default is 'magerr_'.

'magerr_'
z_col str

Name of spectroscopic redshift column. Default is 'z_spec'.

'z_spec'
weight_col str

Name of weight column. Default is None.

None
verbose bool

If True, will print out GPz++ output. Default is False.

False
**kwargs dict

Additional keyword arguments to pass to Table.read()

{}

Returns:

Name Type Description
incat astropy.table.Table

Input catalog with GPz predictions appended.

paths dict

Dictionary of paths to output files.

prep_gpz(catalog, outdir, basename, mode='train', model=None, label=None, mag_prefix='mag_', error_prefix='magerr_', z_col='z_spec', weight_col=None, output_min=0, output_max=7, test_fraction=0.2, valid_fraction=0.2, do_iteration=False, iter_cov='gpvc', basis_functions=50, max_iter=500, tol=1e-09, grad_tol=1e-05)

Prepare gpz++ input catalogues and parameter files for a given input catalogue.

Parameters:

Name Type Description Default
catalog astropy.table.Table

Catalog to run GPz on.

required
outdir str

Output directory.

required
basename str

Basename for output files.

required
mode str

Mode to run GPz in. Default is 'train'. Other options are 'predict'.

'train'
model str

Path to GPz model file. Default is None.

None
label str

Label to append to output files.

None
mag_prefix str

Prefix for magnitude columns. Default is 'mag_'.

'mag_'
error_prefix str

Prefix for magnitude error columns. Default is 'magerr_'.

'magerr_'
z_col str

Name of spectroscopic redshift column. Default is 'z_spec'.

'z_spec'
weight_col str

Name of weight column. Default is None.

None
output_min float

Minimum redshift to output. Default is 0.

0
output_max float

Maximum redshift to output. Default is 7.

7
test_fraction float

Fraction of catalog to use for testing. Default is 0.2.

0.2
valid_fraction float

Fraction of training set to use for validation. Default is 0.2.

0.2
do_iteration bool

If True, will set up 2nd gpz++ iteration with a more complex covariance. Default is False.

False
iter_cov str

Covariance type to use for 2nd iteration. Default is 'gpvc'.

'gpvc'
basis_functions int

Number of basis functions to use. Default is 50.

50
max_iter int

Maximum number of iterations. Default is 500.

500
tol float

Tolerance for convergence. Default is 1e-9.

1e-09
grad_tol (float, optional)

Tolerance for gradient convergence. Default is 1e-5.

1e-05

Returns:

Name Type Description
run_string str

Command to run gpz++ for the input catalog.

file_paths list

List of paths to files generated for the input catalog.

read(param_file)

Read parameter file.

Parameters:

Name Type Description Default
param_file str

Path to parameter file to read.

required

run_training(catalog, outdir, basename, gmm_output=None, bash_script=False, mag_prefix='mag_', error_prefix='magerr_', z_col='z_spec', id_col='id', weight_col=None, output_min=0, output_max=7, test_fraction=0.2, valid_fraction=0.2, do_iteration=False, iter_cov='gpvc', total_basis_functions=100, bf_distribution='uniform', min_basis_functions=10, max_iter=500, tol=1e-09, grad_tol=1e-05, verbose=False, **kwargs)

Run GPz training and prediction for a given input catalogue.

Parameters:

Name Type Description Default
catalog (astropy.table.Table, str)

Catalog to run GPz on. If str, will read in catalog from file.

required
outdir str

Output directory.

required
basename str

Basename for output files.

required
gmm_output astroy.table.Table

GMM divide and/or weight output catalog.

None
bash_script

If True, output gpz++ commands to a bash script instead of running.

False
mag_prefix str

Prefix for magnitude columns. Default is 'mag_'.

'mag_'
error_prefix str

Prefix for magnitude error columns. Default is 'magerr_'.

'magerr_'
z_col str

Name of spectroscopic redshift column. Default is 'z_spec'.

'z_spec'
weight_col str

Name of weight column. Default is None.

None
output_min float

Minimum redshift to output. Default is 0.

0
output_max float

Maximum redshift to output. Default is 7.

7
test_fraction float

Fraction of catalog to use for testing. Default is 0.2.

0.2
valid_fraction float

Fraction of training set to use for validation. Default is 0.2.

0.2
do_iteration bool

If True, will set up 2nd gpz++ iteration with a more complex covariance. Default is False.

False
iter_cov str

Covariance type to use for 2nd iteration. Default is 'gpvc'.

'gpvc'
total_basis_functions int

Total number of basis functions to use. Default is 100.

100
bf_distribution str

Basis function distribution. Default is 'uniform'. Other options are 'sqrt' and 'linear', which will distribute basis functions according to the square root and linear of the number of training objects in each mixture, respectively.

'uniform'
min_basis_functions int

Minimum number of basis functions per mixture. Default is 10.

10
max_iter int

Maximum number of iterations. Default is 500.

500
tol float

Tolerance for convergence. Default is 1e-9.

1e-09
grad_tol (float, optional)

Tolerance for gradient convergence. Default is 1e-5.

1e-05
verbose bool

If True, will print out GPz++ output. Default is False.

False
**kwargs dict

Additional keyword arguments to pass to Table.read() if catalog is a filename.

{}

Returns:

Name Type Description
merged_output astropy.table.Table

Merged output catalog.

paths dict

Dictionary of paths to output files.