Friday, December 23, 2011


From: Ben Santer <>
To: "Thomas.R.Karl" <>
Subject: Re: More significance testing stuff
Date: Wed, 02 Jan 2008 20:52:31 -0800
Cc:, carl mears <>, "David C. Bader" <>, "'Dian J. Seidel'" <>, "'Francis W. Zwiers'" <>, Frank Wentz <>, Karl Taylor <>, Leopold Haimberger <>, Melissa Free <>, "Michael C. MacCracken" <>, "'Philip D. Jones'" <>, Sherwood Steven <>, Steve Klein <>, 'Susan Solomon' <>, "Thorne, Peter" <>, Tim Osborn <>, Tom Wigley <>, Gavin Schmidt <>

Dear Tom,

In the end, I decided to test the significance of trends in the O(t)
minus M(t) difference time series, as you and John Lanzante have
suggested. I still think that this "difference series test" is more
appropriate when one is operating on a pair of time series with
correlated variability (for example, if you wished to test whether an
observed tropical T2LT trend was significantly different from the T2LT
trend simulated in an AMIP experiment). But you and John convinced me
that our response to Douglass et al. would be strengthened by using
several different approaches to address the statistical significance of
differences between modeled and observed temperature trends.

The Tables given below show the results from two different types of
test. You've already seen the "TYPE1" or "PAIRED TREND" results. These
involve b{O} and b{M}, which represent any single pair of Observed and
Modeled trends, with standard errors s{bO} and s{bM} (which are adjusted
for temporal autocorrelation effects). As in our previous work (and as
in related work by John Lanzante), we define the normalized trend
difference d as:

d1 = (b{O} - b{M}) / sqrt[ (s{bO})**2 + (s{bM})**2 ]

Under the assumption that d1 is normally distributed, values of d1 >
+1.96 or < -1.96 indicate observed-minus-model trend differences that
are significant at the 5% level, and one can easily calculate a p-value
for each value of d. These p-values for the 98 pairs of trend tests (49
involving UAH data and 49 involving RSS data) are what we use for
determining the total number of "hits", or rejections of the null
hypothesis of no significant difference between modeled and observed
trends. I note that each test is two-tailed, since we have no
information a priori about the "direction" of the model trend (i.e.,
whether we expect the simulated trend to be significantly larger or
smaller than observed).

The "TYPE2" results are the "DIFFERENCE SERIES" tests. These involve
O(t) and M(t), which represent any single pair of modeled and observed
layer-averaged temperature time series. One first defines the difference
time series D(t) = O(t) - M(t), and then calculates the trend b{D} in
D(t) and its adjusted standard error, s{bD}. The test statistic is then
simply d2 = b{D} / s{bD}. As in the case of the "PAIRED TREND" tests, we
assume that d2 is normally distributed, and then calculate p-values for
the 98 pairs of difference series tests.

As I mentioned in a previous email, the interpretation of the
"DIFFERENCE SERIES" tests is a little complicated. Over half (35) of the
49 model simulations examined in the CCSP report include some form of
volcanic forcing. In these 35 cases, differencing the O(t) and M(t) time
series reduces the amplitude of this externally-forced component in
D(t). This will tend to reduce the overall temporal variability of D(t),
and hence reduce s{bD}, the standard error of the trend in D(t). Such
noise reduction should make it easier to identify true differences in
the anthropogenically-forced components of b{O} and b{D}. But since the
internally-generated variability in O(t) and M(t) is uncorrelated,
differencing O(t) and M(t) has the opposite effect of amplifying the
noise, thus inflating s{bD} and making it more difficult to identify
model-versus-observed trend differences.

The results given below show that the "PAIRED TREND" and "DIFFERENCE
SERIES" tests yield very similar rejection rates of the null hypothesis.
The bottom line is that, regardless of which test we use, which
significance level we stipulate, which observational dataset we use, or
which atmospheric layer we focus on, there is no evidence to support
Douglass et al.'s assertion that all "UAH and RSS satellite trends are
inconsistent with model results".

Test type No. of tests T2 "Hits" T2LT "Hits"
1. OBS-vs-MODEL (TYPE1) 49 x 2 (98) 2 (2.04%) 1 (1.02%)
2. OBS-vs-MODEL (TYPE2) 49 x 2 (98) 2 (2.04%) 2 (2.04%)

Test type No. of tests T2 "Hits" T2LT "Hits"
1. OBS-vs-MODEL (TYPE1) 49 x 2 (98) 4 (4.08%) 2 (2.04%)
2. OBS-vs-MODEL (TYPE2) 49 x 2 (98) 3 (3.06%) 3 (3.06%)

Test type No. of tests T2 "Hits" T2LT "Hits"
1. OBS-vs-MODEL (TYPE1) 49 x 2 (98) 7 (7.14%) 5 (5.10%)
2. OBS-vs-MODEL (TYPE2) 49 x 2 (98) 10 (10.20%) 7 (7.14%)

As I've mentioned in previous emails, I think it's a little tricky to
figure out the null distribution of rejection rates - i.e., the
distribution that might be expected by chance alone. My gut feeling is
that this is easiest to do by generating distributions of the d1 and d2
statistics using model control run data only. Use of Monte Carlo
procedures gets into issues of whether one should use "block
resampling", and attempt to preserve the characteristic decorrelation
times of the model and observational data being tested, etc., etc.

Thanks very much to all of you for your advice and comments. I still
believe that there is considerable merit in a brief response to Douglass
et al. I think this could be done relatively quickly. From my
perspective, this response should highlight four issues:

1) It should identify the flaws in the statistical approach used by
Douglass et al. to compare modeled and observed trends.

2) It should do the significance testing properly, and report on the
results of "PAIRED TREND" and "DIFFERENCE SERIES" tests.

3) It should show something similar to the figure that Leo recently
distributed (i.e., zonal-mean trend profiles in various versions of the
RAOBCORE data), and highlight the fact that the structural uncertainty
in sonde-based estimates of tropospheric temperature change is much
larger than was claimed in Douglass et al.

4) It should note and discuss the considerable body of "complementary
evidence" supporting the finding that the tropical lower troposphere has
warmed over the satellite era.

With best regards,


Thomas.R.Karl wrote:
> Thanks Ben,
> You have been busy! I sent Tom an email before reading the last
> paragraph of this note. Recognizing the "random" placement of ENSO in
> the models and volcanic effects (in a few) and the known impact of the
> occurrence of these events on the trends, I think it is appropriate that
> the noise and related uncertainty about the trend differences be
> increased. Amplifying the noise could be argued as an appropriate
> conservative approach, since we know that these events are confounding
> our efforts to see differences between models and obs w/r to greenhouse
> forcing.
> I know it is more work, but I think it does make sense to calculate
> O(1)-M(1), O(2)-M(2) .... O(n)-M(n) for all combinations of observed
> data sets and model simulations. You could test for significance by
> using a Monte Carlo bootstrap approach by randomizing the years for both
> models and data.
> Regards, Tom
> Ben Santer said the following on 12/26/2007 9:50 PM:
>> Dear John,
>> Thanks for your email. As usual, your comments were constructive and
>> thought-provoking. I've tried to do some of the additional tests that
>> you suggested, and will report on the results below.
>> But first, let's have a brief recap. As discussed in my previous
>> emails, I've tested the significance of differences between trends in
>> observed MSU time series and the trends in synthetic MSU temperatures
>> in a multi-model "ensemble of opportunity". The "ensemble of
>> opportunity" comprises results from 49 realizations of the CMIP-3
>> "20c3m" experiment, performed with 19 different A/OGCMs. This is the
>> same ensemble that was analyzed in Chapter 5 of the CCSP Synthesis and
>> Assessment Product 1.1.
>> I've used observational results from two different groups (RSS and
>> UAH). From each group, we have results for both T2 and T2LT. This
>> yields a total of 196 different tests of the significance of
>> observed-versus-model trend differences (2 observational datasets x 2
>> layer-averaged temperatures x 49 realizations of the 20c3m
>> experiment). Thus far, I've tested the significance of trend
>> differences using T2 and T2LT data spatially averaged over oceans only
>> (both 20N-20S and 30N-30S), as well as over land and ocean (20N-20S).
>> All results described below focus on the land and ocean results, which
>> facilitates a direct comparison with Douglass et al.
>> Here was the information that I sent you on Dec. 14th:
>> ANALYSIS PERIOD 1979 TO 1999)
>> T2LT tests, RSS observational data: 0 out of 49 model-versus-observed
>> trend differences are significant at the 5% level.
>> T2LT tests, UAH observational data: 1 out of 49 model-versus-observed
>> trend differences are significant at the 5% level.
>> T2 tests, RSS observational data: 1 out of 49 model-versus-observed
>> trend differences are significant at the 5% level.
>> T2 tests, UAH observational data: 1 out of 49 model-versus-observed
>> trend differences are significant at the 5% level.
>> In other words, at a stipulated significance level of 5% (for a
>> two-tailed test), we rejected the null hypothesis of "No significant
>> difference between observed and simulated tropospheric temperature
>> trends" in only 1 out of 98 cases (1.02%) for T2LT and 2 out of 98
>> cases (2.04%) for T2.
>> You asked, John, how we might determine a baseline for judging the
>> likelihood of obtaining the 'observed' rejection rate by chance alone.
>> You suggested use of a bootstrap procedure involving the model data
>> only. In this procedure, one of the 49 20c3m realizations would be
>> selected at random, and would constitute the "surrogate observations".
>> The remaining 48 members would be randomly sampled (with replacement)
>> 49 times. The significance of the difference between the surrogate
>> "observed" trend and the 49 simulated trends would then be assessed.
>> This procedure would be repeated many times, yielding a distribution
>> of rejection rates of the null hypothesis.
>> As you stated in your email, "The actual number of hits, based on the
>> real observations could then be referenced to the Monte Carlo
>> distribution to yield a probability that this could have occurred by
>> chance."
>> One slight problem with your suggested bootstrap approach is that it
>> convolves the trend differences due to internally-generated
>> variability with trend differences arising from inter-model
>> differences in both climate sensitivity and in the forcings applied in
>> the 20c3m experiment. So the distribution of "hits" (as you call it;
>> or "rejection rates" in my terminology) is not the distribution that
>> one might expect due to chance alone.
>> Nevertheless, I thought it would be interesting to generate a
>> distribution of "rejection rates" based on model data only. Rather
>> than implementing the resampling approach that you suggested, I
>> considered all possible combinations of trend pairs involving model
>> data, and performed the paired difference test between the trend in
>> each 20c3m realization and in each of the other 48 realizations. This
>> yields a total of 2352 (49 x 48) non-identical pairs of trend tests
>> (for each layer-averaged temperature time series).
>> Here are the results:
>> T2: At a stipulated 5% significance level, 58 out of 2352 tests
>> involving model data only (2.47%) yielded rejection of the null
>> hypothesis of no significant difference in trend.
>> T2LT: At a stipulated 5% significance level, 32 out of 2352 tests
>> involving model data only (1.36%) yielded rejection of the null
>> hypothesis of no significant difference in trend.
>> For both layer-averaged temperatures, these numbers are slightly
>> larger than the "observed" rejection rates (2.04% for T2 and 1.02% for
>> T2LT). I would conclude from this that the statistical significance of
>> the differences between the observed and simulated MSU tropospheric
>> temperature trends is comparable to the significance of the
>> differences between the simulated 20c3m trends from any two CMIP-3
>> models (with the proviso that the simulated trend differences arise
>> not only from internal variability, but also from inter-model
>> differences in sensitivity and 20th century forcings).
>> Since I was curious, I thought it would be fun to do something a
>> little closer to what you were advocating, John - i.e., to use model
>> data to look at the statistical significance of trend differences that
>> are NOT related to inter-model differences in the 20c3m forcings or in
>> climate sensitivity. I did this in the following way. For each model
>> with multiple 20c3m realizations, I tested each realization against
>> all other (non-identical) realizations of that model - e.g., for a
>> model with an 20c3m ensemble size of 5, there are 20 paired trend
>> tests involving non-identical data. I repeated this procedure for the
>> next model with multiple 20c3m realizations, etc., and accumulated
>> results. In our CCSP report, we had access to 11 models with multiple
>> 20c3m realizations. This yields a total of 124 paired trend tests for
>> each layer-averaged temperature time series of interest.
>> For both T2 and T2LT, NONE of the 124 paired trend tests yielded
>> rejection of the null hypothesis of no significant difference in trend
>> (at a stipulated 5% significance level).
>> You wanted to know, John, whether these rejection rates are sensitive
>> to the stipulated significance level. As per your suggestion, I also
>> calculated rejection rates for a 20% significance level. Below, I've
>> tabulated a comparison of the rejection rates for tests with 5% and
>> 20% significance levels. The two "rows" of "MODEL-vs-MODEL" results
>> correspond to the two cases I've considered above - i.e., tests
>> involving 2352 trend pairs (Row 2) and 124 trend pairs (Row 3). Note
>> that the "OBSERVED-vs-MODEL" row (Row 1) is the combined number of
>> "hits" for 49 tests involving RSS data and 49 tests involving UAH data:
>> Test type No. of tests T2 "Hits" T2LT "Hits"
>> Row 1. OBSERVED-vs-MODEL 49 x 2 2 (2.04%) 1 (1.02%)
>> Row 2. MODEL-vs-MODEL 2352 58 (2.47%) 32 (1.36%)
>> Row 3. MODEL-vs-MODEL 124 0 (0.00%) 0 (0.00%)
>> Test type No. of tests T2 "Hits" T2LT "Hits"
>> Row 1. OBSERVED-vs-MODEL 49 x 2 7 (7.14%) 5 (5.10%)
>> Row 2. MODEL-vs-MODEL 2352 176 (7.48%) 100 (4.25%)
>> Row 3. MODEL-vs-MODEL 124 8 (6.45%) 6 (4.84%)
>> So what can we conclude from this?
>> 1) Irrespective of the stipulated significance level (5% or 20%), the
>> differences between the observed and simulated MSU trends are, on
>> average, substantially smaller than we might expect if we were
>> conducting these tests with trends selected from a purely random
>> distribution (i.e., for the "Row 1" results, 2.04 and 1.02% << 5%, and
>> 7.14% and 5.10% << 20%).
>> 2) Why are the rejection rates for the "Row 3" results substantially
>> lower than 5% and 20%? Shouldn't we expect - if we are only testing
>> trend differences between multiple realizations of the same model,
>> rather than trend differences between models - to obtain rejection
>> rates of roughly 5% for the 5% significance tests and 20% for the 20%
>> tests? The answer is clearly "no". The "Row 3" results do not involve
>> tests between samples drawn from a population of randomly-distributed
>> trends! If we were conducting this paired test using randomly-sampled
>> trends from a long control simulation, we would expect (given a
>> sufficiently large sample size) to eventually obtain rejection rates
>> of 5% and 20%. But our "Row 3" results are based on paired samples
>> from individual members of a given model's 20c3m experiment, and thus
>> represent both signal (response to the imposed forcing changes) and
>> noise - not noise alone. The common signal component makes it more
>> difficult to reject the null hypothesis of no significant difference
>> in trend.
>> 3) Your point about sensitivity to the choice of stipulated
>> significance level was well-taken. This is obvious by comparing "Row
>> 3" results in the 5% and 20% test cases.
>> 4) In both the 5% and 20% cases, the rejection rate for paired tests
>> involving model-versus-observed trend differences ("Row 1") is
>> comparable to the rejection rate for tests involving inter-model trend
>> differences ("Row 2") arising from the combined effects of differences
>> in internal variability, sensitivity, and applied forcings. On
>> average, therefore, model versus observed trend differences are not
>> noticeably more significant than the trends between any given pair of
>> CMIP-3 models. [N.B.: This inference is not entirely justified, since,
>> "Row 2" convolves the effects of both inter-model differences and
>> "within model" differences arising from the different manifestations
>> of natural variability superimposed on the signal. We would need a
>> "Row 4", which involves 19 x 18 paired tests of model results, using
>> only one 20c3m realization from each model. I'll generate "Row 4"
>> tomorrow.]
>> John, you also suggested that we might want to look at the statistical
>> significance of trends in time series of differences - e.g., in O(t)
>> minus M(t), or in M1(t) minus M2(t), where "O" denotes observations,
>> and "M" denotes model, and t is an index of time in months. While I've
>> done this in previous work (for example in the Santer et al. 2000 JGR
>> paper, where we were looking at the statistical significance of trend
>> differences between multiple observational upper air temperature
>> datasets), I don't think it's advisable in this particular case. As
>> your email notes, we are dealing here with A/OGCM results in which the
>> phasing of El Ninos and La Ninas (and the effects of ENSO variability
>> on T2 and T2LT) differs from the phasing in the real world. So
>> differencing M(t) from O(t), or M2(t) from M1(t), probably actually
>> amplifies rather than damps noise, particularly in the tropics, where
>> the externally-forced component of M(t) or O(t) over 1979 to 1999 is
>> only a relatively small fraction of the overall variance of the time
>> series. I think this amplification of noise is a disadvantage in
>> assessing whether trends in O(t) and M(t) are significantly different.
>> Anyway, thanks again for your comments and suggestions, John. They
>> gave me a great opportunity to ignore the hundreds of emails that
>> accumulated in my absence, and instead do some science!
>> With best regards,
>> Ben
>> John Lanzante wrote:
>>> Ben,
>>> Perhaps a resampling test would be appropriate. The tests you have
>>> performed
>>> consist of pairing an observed time series (UAH or RSS MSU) with each
>>> one
>>> of 49 GCM times series from your "ensemble of opportunity". Significance
>>> of the difference between each pair of obs/GCM trends yields a certain
>>> number of "hits".
>>> To determine a baseline for judging how likely it would be to obtain the
>>> given number of hits one could perform a set of resampling trials by
>>> treating one of the ensemble members as a surrogate observation. For
>>> each
>>> trial, select at random one of the 49 GCM members to be the
>>> "observation".
>>> >From the remaining 48 members draw a bootstrap sample of 49, and
>>> perform
>>> 49 tests, yielding a certain number of "hits". Repeat this many times to
>>> generate a distribution of "hits".
>>> The actual number of hits, based on the real observations could then be
>>> referenced to the Monte Carlo distribution to yield a probability
>>> that this
>>> could have occurred by chance. The basic idea is to see if the observed
>>> trend is inconsistent with the GCM ensemble of trends.
>>> There are a couple of additional tweaks that could be applied to your
>>> method.
>>> You are currently computing trends for each of the two time series in
>>> the
>>> pair and assessing the significance of their differences. Why not first
>>> create a difference time series and assess the significance of it's
>>> trend?
>>> The advantage of this is that you would reduce somewhat the
>>> autocorrelation
>>> in the time series and hence the effect of the "degrees of freedom"
>>> adjustment. Since the GCM runs are based on coupled model runs this
>>> differencing would help remove the common externally forced variability,
>>> but not internally forced variability, so the adjustment would still be
>>> needed.
>>> Another tweak would be to alter the significance level used to assess
>>> differences in trends. Currently you are using the 5% level, which
>>> yields
>>> only a small number of hits. If you made this less stringent you
>>> would get
>>> potentially more weaker hits. But it would all come out in the wash
>>> so to
>>> speak since the number of hits in the Monte Carlo simulations would
>>> increase
>>> as well. I suspect that increasing the number of expected hits would
>>> make the
>>> whole procedure more powerful/efficient in a statistical sense since you
>>> would no longer be dealing with a "rare event". In the current
>>> scheme, using
>>> a 5% level with 49 pairings you have an expected hit rate of 0.05 X
>>> 49 = 2.45.
>>> For example, if instead you used a 20% significance level you would
>>> have an
>>> expected hit rate of 0.20 X 49 = 9.8.
>>> I hope this helps.
>>> On an unrelated matter, I'm wondering a bit about the different
>>> versions of
>>> Leo's new radiosonde dataset (RAOBCORE). I was surprised to see that the
>>> latest version has considerably more tropospheric warming than I
>>> recalled
>>> from an earlier version that was written up in JCLI in 2007. I have a
>>> couple of questions that I'd like to ask Leo. One concern is that if
>>> we use
>>> the latest version of RAOBCORE is there a paper that we can reference --
>>> if this is not in a peer-reviewed journal is there a paper in
>>> submission?
>>> The other question is: could you briefly comment on the differences
>>> in methodology used to generate the latest version of RAOBCORE as
>>> compared to the version used in JCLI 2007, and what/when/where did
>>> changes occur to
>>> yield a stronger warming trend?
>>> Best regards,
>>> ______John
>>> On Saturday 15 December 2007 12:21 pm, Thomas.R.Karl wrote:
>>>> Thanks Ben,
>>>> You have the makings of a nice article.
>>>> I note that we would expect to 10 cases that are significantly
>>>> different by chance (based on the 196 tests at the .05 sig level).
>>>> You found 3. With appropriately corrected Leopold I suspect you
>>>> will find there is indeed stat sig. similar trends incl.
>>>> amplification. Setting up the statistical testing should be
>>>> interesting with this many combinations.
>>>> Regards, Tom
> --
> *Dr. Thomas R. Karl, L.H.D.*
> */Director/*//
> NOAA's National Climatic Data Center
> Veach-Baley Federal Building
> 151 Patton Avenue
> Asheville, NC 28801-5001
> Tel: (828) 271-4476
> Fax: (828) 271-4246
> <>

Benjamin D. Santer
Program for Climate Model Diagnosis and Intercomparison
Lawrence Livermore National Laboratory
P.O. Box 808, Mail Stop L-103
Livermore, CA 94550, U.S.A.
Tel: (925) 422-2486
FAX: (925) 422-7675

No comments:

Post a Comment