Basic GPz++ usage
Introduction
For this example notebook we will use the CANDELS COSMOS catalogue as input. This includes extensive photometry from both ground- and space-based facilities, but we will restrict the analysis to just the HST filters for simplicity.
Inputting the catalogue into GPz, however, requires some additional pre-processing. We start by extracting the general common columns and setting any missing `z_spec` values to `np.nan`:
import numpy as np
# Plotting modules
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 100
# Astropy modules
from astropy.table import Table, join, vstack, hstack
from astropy.coordinates import SkyCoord
import astropy.units as u
# Import required gpz_pype functions
from gpz_pype.utilities import Params, set_gpz_path, basic_lupt_soft, flux_to_lupt
from gpz_pype.base import GPz
candels = Table.read('CANDELS.COSMOS.F160W.Processed.photz.fits', format='fits')
catalogue = candels[['ID', 'RA', 'DEC', 'CLASS_STAR', 'FLAGS', 'EBV', 'z_spec']]
catalogue['z_spec'][catalogue['z_spec'] < 0] = np.nan
To trim the catalogue down further, I next want to pull out only the flux/magnitude columns that belong to HST filters: in this case, columns that start with either 'ACS' or 'WFC'. As with `z_spec`, any value set to -99 in the input catalogue (i.e. no data) is converted to `np.nan`:
for col in candels.colnames:
    if col.startswith('ACS') or col.startswith('WFC'):
        inst, filt, ext = col.split('_')  # Split column name into 3 parts based on '_'
        col_vals = candels[col]
        col_vals[col_vals < -90] = np.nan
        catalogue[f'{ext}_{filt}'] = col_vals  # [FLUX/FLUXERR]_[FILTER]
To extract the filters that were found in the catalogue, we reprocess the column names. This could have been done above, but since there were both MAG and FLUX columns it would have produced duplicates, so this is simpler:
filters = [col.split('_')[1] for col in catalogue.colnames if col.startswith('FLUX_')]
Asinh magnitudes
Extensive past experience has shown that training machine-learning methods on magnitudes can give significantly better results than using linear flux values. This is particularly true for fields with very high dynamic range.
However, a significant drawback of normal AB magnitudes is that they cannot be used at low S/N or for negative fluxes (consistent with zero). For GPz we therefore make use of asinh magnitudes, which remain real-valued at very low S/N and stay informative for non-detections, without the need for additional missing-value prediction. This is particularly important at high redshift, where a non-detection at bluer wavelengths is critical for the redshift estimate.
Ideally, the softening parameter ($b$) used to derive asinh magnitudes from fluxes and uncertainties would be derived from the local noise on a per-object basis. However, when local estimates are not available, the use of a global softening parameter does not significantly impact photo-z results.
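Concretely, for a flux $f$ the asinh magnitude is $m = m_0 - \frac{2.5}{\ln 10}\left[\mathrm{asinh}\left(\frac{f}{2b}\right) + \ln b\right]$ (Lupton, Gupta & Szalay 1999). The sketch below illustrates the calculation, assuming fluxes in $\mu$Jy (for which the AB zero-point is $m_0 = 23.9$); the actual `flux_to_lupt` implementation may differ in its conventions:

```python
import numpy as np

def lupt_sketch(flux, fluxerr, b, zp=23.9):
    """Asinh magnitude (luptitude) and its uncertainty.

    Illustrative only (not the gpz_pype implementation); assumes fluxes
    in microjanskys, for which the AB zero-point is 23.9.
    """
    ln10 = np.log(10)
    lupt = zp - (2.5 / ln10) * (np.arcsinh(flux / (2 * b)) + np.log(b))
    # Error propagation: |dm/df| = (2.5/ln10) / sqrt(f^2 + 4 b^2)
    lupterr = (2.5 / ln10) * fluxerr / np.sqrt(flux**2 + 4 * b**2)
    return lupt, lupterr
```

Unlike $-2.5\log_{10} f$, this stays finite for zero or negative fluxes, while converging to the ordinary AB magnitude when $f \gg b$.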
For convenience, `gpz_pype` includes a function for estimating a suitable global softening parameter based on a set of input fluxes and errors (assumed to be representative of the full field):
# Calculate a softening parameter for each filter in the list of filters derived above:
b_arr = [basic_lupt_soft(catalogue[f'FLUX_{filt}'], catalogue[f'FLUXERR_{filt}']) for filt in filters]
With softening parameters calculated, we can then compute the asinh magnitudes (also known as 'luptitudes') for each of our filters, using the `flux_to_lupt` function included in `gpz_pype`.
# Make a new catalogue with the relevant key reference columns:
lupt_cols = catalogue[['ID', 'RA', 'DEC', 'CLASS_STAR', 'FLAGS', 'EBV', 'z_spec']]
check_nans = np.zeros(len(catalogue)) # Running counter of null values for each object
for filt, b in zip(filters, b_arr):
    lupt, lupterr = flux_to_lupt(catalogue[f'FLUX_{filt}'],     # Input flux (uJy)
                                 catalogue[f'FLUXERR_{filt}'],  # Input uncertainty (uJy)
                                 b)                             # Filter-specific softening parameter
    lupt_cols[f'lupt_{filt}'] = lupt
    lupt_cols[f'lupterr_{filt}'] = lupterr
    check_nans += np.isnan(lupt)  # Update NaN counter for this filter
After filling all the columns, for the purpose of training GPz we want to cut our catalogue down to those sources with values in all the filters we want to use and for which there is a spectroscopic redshift:
good = (check_nans == 0)  # 'Good' sources for training are those with 0 NaNs
cat = lupt_cols[good & ~np.isnan(lupt_cols['z_spec'])]  # Keep only good sources with z_spec
The following code assumes that `gpz++` is correctly installed on the system and can be run from the command line without issue. The `gpz_pype` classes simply wrap around `gpz++` and streamline the book-keeping required when using more complicated splitting and weighting of the training sample.
The path to `gpz++` can be set as a system variable, so this step might not be required in future. It can also be passed directly to the `GPz` class below, but for safety we use the convenience function `set_gpz_path` to set it for the current session.
Running GPz
set_gpz_path('gpzpp/bin/gpz++')
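Alternatively, because the `GPz` class falls back to the `GPZPATH` environment variable when no path is supplied, the path can be set through the environment instead (the path here matches the session-based call above):

```python
import os

# Setting GPZPATH is equivalent to calling set_gpz_path for this session:
# the GPz class looks this variable up when gpz_path is not given.
os.environ['GPZPATH'] = 'gpzpp/bin/gpz++'
```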
The main workhorse of `gpz_pype` is the `GPz` class. It needs to be instantiated with a set of parameters that we can then update manually (and which `GPz` will automatically set for us when necessary). We can also define the number of CPUs we want `gpz++` to use during training:
test = GPz(param_file='gpzpp/example/gpz.param', ncpu=4)
To perform GPz training on an input catalogue in the most straightforward way, we can use the `.run_training` method of the `GPz` class.
Even in this simplest approach there are lots of options that must be set, although many can be left at their default values. I have tried to ensure that the in-code documentation for `gpz_pype` is relatively complete, so the full set of available options can be seen using the standard Python help functionality, e.g.:
help(test.run_training)
Help on method run_training in module gpz_pype.base:
run_training(catalog, outdir, basename, gmm_output=None, bash_script=False, mag_prefix='mag_', error_prefix='magerr_', z_col='z_spec', id_col='id', weight_col=None, output_min=0, output_max=7, test_fraction=0.2, valid_fraction=0.2, do_iteration=False, iter_cov='gpvc', total_basis_functions=100, verbose=False, **kwargs) method of gpz_pype.base.GPz instance
Run GPz training and prediction for a given input catalogue.
Parameters
----------
catalog : astropy.table.Table, str
Catalog to run GPz on. If str, will read in catalog from file.
outdir : str
Output directory.
basename : str
Basename for output files.
gmm_output : astropy.table.Table
GMM divide and/or weight output catalog.
bash_script: bool
If True, output gpz++ commands to a bash script instead of running.
mag_prefix : str, optional
Prefix for magnitude columns. Default is 'mag_'.
error_prefix : str, optional
Prefix for magnitude error columns. Default is 'magerr_'.
z_col : str, optional
Name of spectroscopic redshift column. Default is 'z_spec'.
weight_col : str, optional
Name of weight column. Default is None.
output_min : float, optional
Minimum redshift to output. Default is 0.
output_max : float, optional
Maximum redshift to output. Default is 7.
test_fraction : float, optional
Fraction of catalog to use for testing. Default is 0.2.
valid_fraction : float, optional
Fraction of training set to use for validation. Default is 0.2.
do_iteration : bool, optional
If True, will set up 2nd gpz++ iteration with a more complex covariance.
Default is False.
iter_cov : str, optional
Covariance type to use for 2nd iteration. Default is 'gpvc'.
total_basis_functions : int, optional
Total number of basis functions to use. Default is 100.
verbose : bool, optional
If True, will print out GPz++ output. Default is False.
**kwargs : dict, optional
Additional keyword arguments to pass to Table.read() if catalog is a
filename.
Returns
-------
merged_output : astropy.table.Table
    Merged output catalog.
paths : dict
    Dictionary of paths to output files.
With the options understood, we can run the training:
simple_run, paths = test.run_training(
cat, # Run training on this input catalogue - can be a path or a Table object
outdir='test_dir', # Save training/output catalogues in this directory
basename='candels_cosmos', # Start our output filenames with this
bash_script=False, # Run GPz directly, don't just write the parameter files + commands to bash
mag_prefix='lupt_', # Look for these asinh magnitude columns
error_prefix='lupterr_', # Look for these error columns
id_col='ID', # ID column to propagate into output files
total_basis_functions=100, # NUMBF passed to gpz++
do_iteration=False, # If True, run second iteration with more complex covariance
verbose=False, # Don't print all of the gpz++ output to screen. Turn on for debugging
)
WARNING:root:Removing existing model file test_dir/candels_cosmos_model.dat
There are two outputs from `.run_training`:
- Catalogue: the 'test' subset of the input catalogue, with `gpz++` prediction columns appended:
simple_run.show_in_notebook()
- Output paths: a dictionary containing the relative paths to the various files produced by the gpz++ training, including the input catalogues (in gpz format) and the trained model and its associated parameter file:
paths
{'train': 'test_dir/candels_cosmos_train.txt',
'test': 'test_dir/candels_cosmos_test.txt',
'output_cat': 'test_dir/candels_cosmos_output.txt',
'output_model': 'test_dir/candels_cosmos_model.dat',
'param': 'test_dir/candels_cosmos.param'}
Plotting outputs
Just for validation purposes, let's plot the predicted photo-$z$ versus spec-$z$ for the test catalogue. We can see that between $1 < z < 3$ the predictions do a decent job, even with just the 4 bands used in training, but at low and high redshifts the performance falls off.
# Axis scaling functions: log10(1+z) and its inverse
def forward(x):
return np.log10(1+x)
def inverse(x):
return (10**x) - 1
Fig, Ax = plt.subplots(1,1,figsize=(4.5,4.5))
Ax.errorbar(simple_run['z_spec'], simple_run['value'],
yerr=simple_run['uncertainty'],
fmt='o', ms=3, color='k', ecolor='k', alpha=0.2)
Ax.set_xlim([0, 7])
Ax.set_ylim([0, 7])
Ax.set_yscale('function', functions=(forward, inverse))
Ax.set_xscale('function', functions=(forward, inverse))
Ax.plot([0, 7], [0, 7], '--', color='0.5', lw=2)
Ax.set_xlabel(r'$z_{\rm{spec}}$')
Ax.set_ylabel(r'$z_{\rm{phot}}$')
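To go beyond the visual check, photo-$z$ performance is commonly summarised with the normalised median absolute deviation (NMAD) of $\Delta z / (1+z_{\rm spec})$ and the catastrophic outlier fraction (often defined as $|\Delta z|/(1+z_{\rm spec}) > 0.15$). A minimal sketch, which could be applied to the `z_spec` and `value` columns of the test output above:

```python
import numpy as np

def photz_metrics(z_spec, z_phot, outlier_thresh=0.15):
    """NMAD and outlier fraction of (z_phot - z_spec) / (1 + z_spec)."""
    z_spec = np.asarray(z_spec)
    dz = (np.asarray(z_phot) - z_spec) / (1 + z_spec)
    nmad = 1.4826 * np.median(np.abs(dz - np.median(dz)))
    outlier_frac = np.mean(np.abs(dz) > outlier_thresh)
    return nmad, outlier_frac

# e.g. nmad, olf = photz_metrics(simple_run['z_spec'], simple_run['value'])
```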
Modules
Bases: object
Class to run GPz.
This class is used to prepare and run gpz++ on a given catalog. Options for running gpz++ are set in a parameter file, which can be read in using the read() method. Alternatively, options can be set directly using the class attributes.
The run_training() method can be used either to run gpz++ on a single catalogue, or to automatically split a catalogue into mixture samples for training and testing based on a given GMM model (see gpz_pype.gmm.GMMbasic).
gpz++ can be called directly from within the class, or, for large catalogues, the necessary run commands can be written to a shell script.
__init__(param_file=None, gpz_path=None, ncpu=None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `param_file` | `str` | Path to parameter file to read. | `None` |
| `gpz_path` | `str` | Path to gpz++ installation. If not specified, will look for the `GPZPATH` environment variable. | `None` |
| `ncpu` | `int` | Number of CPUs for gpz++ to use. If not specified, will use all available CPUs. | `None` |
predict(catalog, outdir, basename, gmm_output=None, bash_script=False, mag_prefix='mag_', error_prefix='magerr_', z_col='z_spec', id_col='id', weight_col=None, output_min=0, output_max=7, verbose=False, **kwargs)
Run GPz prediction for a given input catalogue.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `catalog` | `astropy.table.Table` or `str` | Catalog to run GPz on. If str, will read in the catalog from file. | required |
| `outdir` | `str` | Output directory. | required |
| `basename` | `str` | Basename for output files. | required |
| `gmm_output` | `astropy.table.Table` | GMM divide and/or weight output catalog. | `None` |
| `bash_script` | `bool` | If True, output gpz++ commands to a bash script instead of running. | `False` |
| `mag_prefix` | `str` | Prefix for magnitude columns. | `'mag_'` |
| `error_prefix` | `str` | Prefix for magnitude error columns. | `'magerr_'` |
| `z_col` | `str` | Name of spectroscopic redshift column. | `'z_spec'` |
| `weight_col` | `str` | Name of weight column. | `None` |
| `verbose` | `bool` | If True, will print out gpz++ output. | `False` |
| `**kwargs` | `dict` | Additional keyword arguments to pass to Table.read(). | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `incat` | `astropy.table.Table` | Input catalog with GPz predictions appended. |
| `paths` | `dict` | Dictionary of paths to output files. |
prep_gpz(catalog, outdir, basename, mode='train', model=None, label=None, mag_prefix='mag_', error_prefix='magerr_', z_col='z_spec', weight_col=None, output_min=0, output_max=7, test_fraction=0.2, valid_fraction=0.2, do_iteration=False, iter_cov='gpvc', basis_functions=50, max_iter=500, tol=1e-09, grad_tol=1e-05)
Prepare gpz++ input catalogues and parameter files for a given input catalogue.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `catalog` | `astropy.table.Table` | Catalog to run GPz on. | required |
| `outdir` | `str` | Output directory. | required |
| `basename` | `str` | Basename for output files. | required |
| `mode` | `str` | Mode to run GPz in: 'train' or 'predict'. | `'train'` |
| `model` | `str` | Path to GPz model file. | `None` |
| `label` | `str` | Label to append to output files. | `None` |
| `mag_prefix` | `str` | Prefix for magnitude columns. | `'mag_'` |
| `error_prefix` | `str` | Prefix for magnitude error columns. | `'magerr_'` |
| `z_col` | `str` | Name of spectroscopic redshift column. | `'z_spec'` |
| `weight_col` | `str` | Name of weight column. | `None` |
| `output_min` | `float` | Minimum redshift to output. | `0` |
| `output_max` | `float` | Maximum redshift to output. | `7` |
| `test_fraction` | `float` | Fraction of catalog to use for testing. | `0.2` |
| `valid_fraction` | `float` | Fraction of training set to use for validation. | `0.2` |
| `do_iteration` | `bool` | If True, will set up a 2nd gpz++ iteration with a more complex covariance. | `False` |
| `iter_cov` | `str` | Covariance type to use for the 2nd iteration. | `'gpvc'` |
| `basis_functions` | `int` | Number of basis functions to use. | `50` |
| `max_iter` | `int` | Maximum number of iterations. | `500` |
| `tol` | `float` | Tolerance for convergence. | `1e-09` |
| `grad_tol` | `float` | Tolerance for gradient convergence. | `1e-05` |
Returns:
| Name | Type | Description |
|---|---|---|
| `run_string` | `str` | Command to run gpz++ for the input catalog. |
| `file_paths` | `list` | List of paths to files generated for the input catalog. |
read(param_file)
Read parameter file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `param_file` | `str` | Path to parameter file to read. | required |
run_training(catalog, outdir, basename, gmm_output=None, bash_script=False, mag_prefix='mag_', error_prefix='magerr_', z_col='z_spec', id_col='id', weight_col=None, output_min=0, output_max=7, test_fraction=0.2, valid_fraction=0.2, do_iteration=False, iter_cov='gpvc', total_basis_functions=100, bf_distribution='uniform', min_basis_functions=10, max_iter=500, tol=1e-09, grad_tol=1e-05, verbose=False, **kwargs)
Run GPz training and prediction for a given input catalogue.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `catalog` | `astropy.table.Table` or `str` | Catalog to run GPz on. If str, will read in the catalog from file. | required |
| `outdir` | `str` | Output directory. | required |
| `basename` | `str` | Basename for output files. | required |
| `gmm_output` | `astropy.table.Table` | GMM divide and/or weight output catalog. | `None` |
| `bash_script` | `bool` | If True, output gpz++ commands to a bash script instead of running. | `False` |
| `mag_prefix` | `str` | Prefix for magnitude columns. | `'mag_'` |
| `error_prefix` | `str` | Prefix for magnitude error columns. | `'magerr_'` |
| `z_col` | `str` | Name of spectroscopic redshift column. | `'z_spec'` |
| `weight_col` | `str` | Name of weight column. | `None` |
| `output_min` | `float` | Minimum redshift to output. | `0` |
| `output_max` | `float` | Maximum redshift to output. | `7` |
| `test_fraction` | `float` | Fraction of catalog to use for testing. | `0.2` |
| `valid_fraction` | `float` | Fraction of training set to use for validation. | `0.2` |
| `do_iteration` | `bool` | If True, will set up a 2nd gpz++ iteration with a more complex covariance. | `False` |
| `iter_cov` | `str` | Covariance type to use for the 2nd iteration. | `'gpvc'` |
| `total_basis_functions` | `int` | Total number of basis functions to use. | `100` |
| `bf_distribution` | `str` | How basis functions are distributed between mixtures: 'uniform', or 'sqrt'/'linear' to scale with the square root of, or linearly with, the number of training objects in each mixture. | `'uniform'` |
| `min_basis_functions` | `int` | Minimum number of basis functions per mixture. | `10` |
| `max_iter` | `int` | Maximum number of iterations. | `500` |
| `tol` | `float` | Tolerance for convergence. | `1e-09` |
| `grad_tol` | `float` | Tolerance for gradient convergence. | `1e-05` |
| `verbose` | `bool` | If True, will print out gpz++ output. | `False` |
| `**kwargs` | `dict` | Additional keyword arguments to pass to Table.read() if catalog is a filename. | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `merged_output` | `astropy.table.Table` | Merged output catalog. |
| `paths` | `dict` | Dictionary of paths to output files. |