To: John Lanzante <John.Lanzante@noaa.gov>, Thomas R Karl <Thomas.R.Karl@noaa.gov>, carl mears <mears@remss.com>, "David C. Bader" <bader2@llnl.gov>, "'Dian J. Seidel'" <dian.seidel@noaa.gov>, "'Francis W. Zwiers'" <francis.zwiers@ec.gc.ca>, Frank Wentz <frank.wentz@remss.com>, Karl Taylor <taylor13@llnl.gov>, Leopold Haimberger <leopold.haimberger@univie.ac.at>, Melissa Free <Melissa.Free@noaa.gov>, "Michael C. MacCracken" <mmaccrac@comcast.net>, "'Philip D. Jones'" <p.jones@uea.ac.uk>, Steven Sherwood <Steven.Sherwood@yale.edu>, Steve Klein <klein21@mail.llnl.gov>, 'Susan Solomon' <ssolomon@al.noaa.gov>, "Thorne, Peter" <peter.thorne@metoffice.gov.uk>, Tim Osborn <t.osborn@uea.ac.uk>, Tom Wigley <wigley@cgd.ucar.edu>, Gavin Schmidt <gschmidt@giss.nasa.gov>

Subject: More significance testing

Date: Thu, 27 Dec 2007 16:26:19 -0800

Reply-to: santer1@llnl.gov

<x-flowed>

Dear folks,

This email briefly summarizes the trend significance test results. As I
mentioned in yesterday's email, I've added a new case (referred to as
"TYPE3" below). I've also added results for tests with a stipulated 10%
significance level. Here is the explanation of the four different types
of trend test:

1. "OBS-vs-MODEL": Observed MSU trends in RSS and UAH are tested against
trends in synthetic MSU data in 49 realizations of the 20c3m experiment.
Results from RSS and UAH are pooled, yielding a total of 98 tests for T2
trends and 98 tests for T2LT trends.

2. "MODEL-vs-MODEL (TYPE1)": Involves model data only. The trend in
synthetic MSU data in each of 49 20c3m realizations is tested against
each trend in the remaining 48 realizations (i.e., no trend tests
involving identical data). Yields a total of 49 x 48 = 2352 tests. The
significance of trend differences is a function of BOTH inter-model
differences (in climate sensitivity, applied 20c3m forcings, and the
amplitude of variability) AND "within-model" effects (i.e., is related
to the different manifestations of natural internal variability
superimposed on the underlying forced response).

3. "MODEL-vs-MODEL (TYPE2)": Involves model data only. Limited to the M
models with multiple realizations of the 20c3m experiment. For each of
these M models, the number of unique combinations C of N 20c3m
realizations taken R = 2 at a time (i.e., into trend pairs) is
determined. For example, in the case of N = 5,
C = N! / [ R!(N-R)! ] = 10. The significance of trend differences is
solely a function of "within-model" effects (i.e., is related to the
different manifestations of natural internal variability superimposed
on the underlying forced response). There are a total of 62 tests (not
124, as I erroneously reported yesterday!).

4. "MODEL-vs-MODEL (TYPE3)": Involves model data only. For each of the
19 models, only the first 20c3m realization is used. The trend in each
model's first 20c3m realization is tested against each trend in the
first 20c3m realization of the remaining 18 models. Yields a total of
19 x 18 = 342 tests. The significance of trend differences is solely a
function of inter-model differences (in climate sensitivity, applied
20c3m forcings, and the amplitude of variability).
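
The test counts quoted in the four cases above can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the per-model ensemble sizes behind the 62 TYPE2 tests are not listed in this email, so the TYPE2 function is shown only for the N = 5 example.

```python
from math import comb


def type1_tests(n_realizations: int) -> int:
    """Each realization tested against every other one (ordered pairs)."""
    return n_realizations * (n_realizations - 1)


def type2_tests(ensemble_sizes) -> int:
    """Unique within-model pairs, C(N, 2), summed over models.

    ensemble_sizes is a hypothetical input: the number of 20c3m
    realizations for each model that has more than one.
    """
    return sum(comb(n, 2) for n in ensemble_sizes)


def type3_tests(n_models: int) -> int:
    """First realization of each model tested against all other models."""
    return n_models * (n_models - 1)


print(type1_tests(49))   # 49 x 48 = 2352
print(type2_tests([5]))  # a single model with N = 5 gives C = 10
print(type3_tests(19))   # 19 x 18 = 342
```

Note that the TYPE1 and TYPE3 counts are ordered pairs (each pair appears twice), whereas the TYPE2 count uses unique combinations.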

REJECTION RATES FOR STIPULATED 5% SIGNIFICANCE LEVEL

Test type                   No. of tests      T2 "Hits"     T2LT "Hits"
1. OBS-vs-MODEL             49 x 2  (98)       2 (2.04%)     1 (1.02%)
2. MODEL-vs-MODEL (TYPE1)   49 x 48 (2352)    58 (2.47%)    32 (1.36%)
3. MODEL-vs-MODEL (TYPE2)   ---     (62)       0 (0.00%)     0 (0.00%)
4. MODEL-vs-MODEL (TYPE3)   19 x 18 (342)     22 (6.43%)    14 (4.09%)

REJECTION RATES FOR STIPULATED 10% SIGNIFICANCE LEVEL

Test type                   No. of tests      T2 "Hits"     T2LT "Hits"
1. OBS-vs-MODEL             49 x 2  (98)       4 (4.08%)     2 (2.04%)
2. MODEL-vs-MODEL (TYPE1)   49 x 48 (2352)    80 (3.40%)    46 (1.96%)
3. MODEL-vs-MODEL (TYPE2)   ---     (62)       1 (1.61%)     0 (0.00%)
4. MODEL-vs-MODEL (TYPE3)   19 x 18 (342)     28 (8.19%)    20 (5.85%)

REJECTION RATES FOR STIPULATED 20% SIGNIFICANCE LEVEL

Test type                   No. of tests      T2 "Hits"      T2LT "Hits"
1. OBS-vs-MODEL             49 x 2  (98)        7 (7.14%)      5 (5.10%)
2. MODEL-vs-MODEL (TYPE1)   49 x 48 (2352)    176 (7.48%)    100 (4.25%)
3. MODEL-vs-MODEL (TYPE2)   ---     (62)        4 (6.45%)      3 (4.84%)
4. MODEL-vs-MODEL (TYPE3)   19 x 18 (342)      42 (12.28%)    28 (8.19%)

Features of interest:

A) As you might expect, for each of the three significance levels, TYPE3
tests yield the highest rejection rates of the null hypothesis of "No
significant difference in trend". TYPE2 tests yield the lowest rejection
rates in all but one case (for T2LT at the 20% level, the TYPE1 rate of
4.25% is slightly below the TYPE2 rate of 4.84%). This is simply telling
us that the inter-model differences in trends tend to be larger than the
"between-realization" differences in trends in any individual model.

B) Rejection rates for the model-versus-observed trend tests are
consistently LOWER than for the model-versus-model (TYPE3) tests. On
average, therefore, the tropospheric trend differences between the
observational datasets used here (RSS and UAH) and the synthetic MSU
temperatures calculated from 19 CMIP-3 models are actually LESS
SIGNIFICANT than the inter-model trend differences arising from
differences in sensitivity, 20c3m forcings, and levels of variability.
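
For anyone who wants to check the percentages and the two features above, here is a small Python sketch that recomputes the rejection rates from the hit counts in the three tables. The dictionary layout and names are just one way to hold the numbers; nothing here goes beyond the tabulated counts.

```python
# Number of tests per test type, as tabulated above.
N_TESTS = {"OBS-vs-MODEL": 98, "TYPE1": 2352, "TYPE2": 62, "TYPE3": 342}

# Hits as (T2, T2LT) for each stipulated significance level (in percent).
HITS = {
    5:  {"OBS-vs-MODEL": (2, 1), "TYPE1": (58, 32), "TYPE2": (0, 0), "TYPE3": (22, 14)},
    10: {"OBS-vs-MODEL": (4, 2), "TYPE1": (80, 46), "TYPE2": (1, 0), "TYPE3": (28, 20)},
    20: {"OBS-vs-MODEL": (7, 5), "TYPE1": (176, 100), "TYPE2": (4, 3), "TYPE3": (42, 28)},
}


def rejection_rates(level: int) -> dict:
    """Return {test type: (T2 rate, T2LT rate)} in percent for one level."""
    return {
        test: tuple(100.0 * h / N_TESTS[test] for h in hits)
        for test, hits in HITS[level].items()
    }


summary = []
for level in HITS:
    rates = rejection_rates(level)
    for i, layer in enumerate(("T2", "T2LT")):
        by_type = {test: r[i] for test, r in rates.items()}
        highest = max(by_type, key=by_type.get)
        lowest = min(by_type, key=by_type.get)
        obs_below_type3 = by_type["OBS-vs-MODEL"] < by_type["TYPE3"]
        summary.append((level, layer, highest, lowest, obs_below_type3))
        print(f"{level:>2}% {layer:<4} highest={highest} lowest={lowest} "
              f"OBS<TYPE3={obs_below_type3}")
```

With the counts above, TYPE3 comes out highest and OBS-vs-MODEL sits below TYPE3 in all six level/layer combinations; TYPE2 is lowest in five of the six.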

I also thought that it would be fun to use the model data to explore the
implications of Douglass et al.'s flawed statistical procedure. Recall
that Douglass et al. compare (in their Table III) the observed T2 and
T2LT trends in RSS and UAH with the overall means of the multi-model
distributions of T2 and T2LT trends. Their standard error, sigma{SE}, is
meant to represent an "estimate of the uncertainty of the mean" (i.e.,
the mean trend). sigma{SE} is given as:

    sigma{SE} = sigma / sqrt{N - 1}

where sigma is the standard deviation of the model trends, and N is "the
number of independent models" (22 in their case). Douglass et al.
apparently estimate sigma using ensemble-mean trends for each model (if
20c3m ensembles are available).

So what happens if we apply this procedure using model data only? This
is rather easy to do. As above (in the TYPE1, TYPE2, and TYPE3 tests), I
simply used the synthetic MSU trends from the 19 CMIP-3 models employed
in our CCSP Report and in Santer et al. 2005 (so N = 19). For each
model, I calculated the ensemble-mean 20c3m trend over 1979 to 1999
(where multiple 20c3m realizations were available). Let's call these
mean trends b{j}, where j (the index over models) = 1, 2, ..., 19.
Further, let's regard b{1} as the surrogate observations, and then use
Douglass et al.'s approach to test whether b{1} is significantly
different from the overall mean of the remaining 18 members of b{j}.
Then repeat with b{2} as surrogate observations, etc. For each
layer-averaged temperature series, this yields 19 tests of the
significance of differences in mean trends.

To give you a feel for this stuff, I've reproduced below the results for
tests involving T2LT trends. The "OBS" column is the ensemble-mean T2LT
trend in the surrogate observations. "MODAVE" is the overall mean trend
in the 18 remaining members of the distribution, and "SIGMA" is the
1-sigma standard deviation of these trends. "SIGMA{SE}" is 1 x
SIGMA{SE} (note that Douglass et al. give 2 x SIGMA{SE} in their Table
III; multiplying our SIGMA{SE} results by two gives values similar to
theirs). "NORMD" is simply the normalized difference (OBS-MODAVE) /
SIGMA{SE}, tabulated here as an absolute value, and "P-VALUE" is the
two-sided p-value for the normalized difference, assuming that this
difference is approximately normally distributed.

MODEL           "OBS"    MODAVE   SIGMA    SIGMA{SE}     NORMD   P-VALUE
CCSM3.0         0.1580   0.2179   0.0910   0.0215       2.7918   0.0052
GFDL2.0         0.2576   0.2124   0.0915   0.0216       2.0977   0.0359
GFDL2.1         0.3567   0.2069   0.0854   0.0201       7.4404   0.0000
GISS_EH         0.1477   0.2185   0.0906   0.0214       3.3153   0.0009
GISS_ER         0.1938   0.2159   0.0919   0.0217       1.0205   0.3075
MIROC3.2_T42    0.1285   0.2196   0.0897   0.0211       4.3094   0.0000
MIROC3.2_T106   0.2298   0.2139   0.0920   0.0217       0.7305   0.4651
MRI2.3.2a       0.2800   0.2111   0.0907   0.0214       3.2196   0.0013
PCM             0.1496   0.2184   0.0907   0.0214       3.2170   0.0013
HADCM3          0.1936   0.2159   0.0919   0.0217       1.0327   0.3018
HADGEM1         0.3099   0.2095   0.0891   0.0210       4.7784   0.0000
CCCMA3.1        0.4236   0.2032   0.0769   0.0181      12.1591   0.0000
CNRM3.0         0.2409   0.2133   0.0918   0.0216       1.2762   0.2019
CSIRO3.0        0.2780   0.2113   0.0908   0.0214       3.1195   0.0018
ECHAM5          0.1252   0.2197   0.0895   0.0211       4.4815   0.0000
IAP_FGOALS1.0   0.1834   0.2165   0.0917   0.0216       1.5314   0.1257
GISS_AOM        0.1788   0.2168   0.0916   0.0216       1.7579   0.0788
INMCM3.0        0.0197   0.2256   0.0790   0.0186      11.0541   0.0000
IPSL_CM4        0.2258   0.2142   0.0920   0.0217       0.5359   0.5920

T2LT: No. of p-values .le. 0.05: 12. Rejection rate: 63.16%
T2LT: No. of p-values .le. 0.10: 13. Rejection rate: 68.42%
T2LT: No. of p-values .le. 0.20: 14. Rejection rate: 73.68%

The corresponding rejection rates for the tests involving T2 data are:

T2: No. of p-values .le. 0.05: 12. Rejection rate: 63.16%
T2: No. of p-values .le. 0.10: 13. Rejection rate: 68.42%
T2: No. of p-values .le. 0.20: 15. Rejection rate: 78.95%
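
The T2LT part of this exercise can be re-run from the rounded "OBS" column alone, as in the Python sketch below (the per-model T2 trends are not listed here, so T2 is not covered). Two choices in the code are assumptions inferred from the tabulated numbers rather than stated above: SIGMA is taken as the population standard deviation of the 18 remaining trends, and SIGMA{SE} divides by sqrt(N - 1) with N = 19. The names `B_T2LT` and `leave_one_out` are illustrative only.

```python
import math
from statistics import NormalDist, fmean, pstdev

# Ensemble-mean T2LT trends b{j}, copied from the "OBS" column (4 decimals).
B_T2LT = {
    "CCSM3.0": 0.1580, "GFDL2.0": 0.2576, "GFDL2.1": 0.3567,
    "GISS_EH": 0.1477, "GISS_ER": 0.1938, "MIROC3.2_T42": 0.1285,
    "MIROC3.2_T106": 0.2298, "MRI2.3.2a": 0.2800, "PCM": 0.1496,
    "HADCM3": 0.1936, "HADGEM1": 0.3099, "CCCMA3.1": 0.4236,
    "CNRM3.0": 0.2409, "CSIRO3.0": 0.2780, "ECHAM5": 0.1252,
    "IAP_FGOALS1.0": 0.1834, "GISS_AOM": 0.1788, "INMCM3.0": 0.0197,
    "IPSL_CM4": 0.2258,
}


def leave_one_out(trends: dict) -> dict:
    """Return {model: (modave, sigma, sigma_se, normd, p_value)}.

    Each model in turn plays the role of the surrogate observations and is
    compared with the mean of the remaining models.
    """
    n = len(trends)  # N = 19 here
    out = {}
    for name, obs in trends.items():
        rest = [b for other, b in trends.items() if other != name]
        modave = fmean(rest)
        sigma = pstdev(rest)                 # ASSUMPTION: population std
        sigma_se = sigma / math.sqrt(n - 1)  # ASSUMPTION: N - 1 = 18
        normd = abs(obs - modave) / sigma_se
        p_value = 2.0 * (1.0 - NormalDist().cdf(normd))  # two-sided
        out[name] = (modave, sigma, sigma_se, normd, p_value)
    return out


results = leave_one_out(B_T2LT)
counts = {}
for level in (0.05, 0.10, 0.20):
    counts[level] = sum(1 for r in results.values() if r[4] <= level)
    rate = 100.0 * counts[level] / len(results)
    print(f"T2LT: p-values <= {level:.2f}: {counts[level]} ({rate:.2f}%)")
```

Because the inputs are rounded to four decimals, individual NORMD and p-values can differ slightly in the last digits from the table, but the T2LT rejection counts should come out as 12, 13 and 14.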

Bottom line: If one applied Douglass et al.'s ridiculous test of
difference in mean trends to model data only - in fact, to virtually the
same model data they used in their paper - one would conclude that
nearly two-thirds of the individual models had trends that were
significantly different from the multi-model mean trend! To follow
Douglass et al.'s flawed logic, this would mean that two-thirds of the
models really aren't models after all...

Happy New Year to all of you!

With best regards,

Ben

----------------------------------------------------------------------------
Benjamin D. Santer
Program for Climate Model Diagnosis and Intercomparison
Lawrence Livermore National Laboratory
P.O. Box 808, Mail Stop L-103
Livermore, CA 94550, U.S.A.
Tel: (925) 422-2486
FAX: (925) 422-7675
email: santer1@llnl.gov
----------------------------------------------------------------------------

</x-flowed>
