Dear folks,

I'm forwarding an email from Prof. Glenn McGregor, the IJoC editor who

is handling our paper. The email contains the comments of Reviewer #1,

and notes that comments from two additional Reviewers will be available

shortly.

Reviewer #1 read the paper very thoroughly, and makes a number of useful

comments. The Reviewer also makes some comments that I disagree with.

The good news is that Reviewer #1 begins his review (I use this personal

pronoun because I'm pretty sure I know the Reviewer's identity!) by

affirming the existence of serious statistical errors in DCPS07:

"I've read the paper under review, and also DCPS07, and I think the

present authors are entirely correct in their main point. DCPS07 failed

to account for the sampling variability in the individual model trends

and, especially, in the observational trend. This was, as I see it, a

clear-cut statistical error, and the authors deserve the opportunity to

present their counter-argument in print."

Reviewer #1 has two major concerns about our statistical analysis. Here

is my initial reaction to these concerns.

CONCERN #1: Assumption of an AR-1 model for regression residuals.

In calculating our "adjusted" standard errors, we assume that the

persistence of the regression residuals is well-described by an AR-1

model. This assumption is not unique to our analysis, and has been made

in a number of other investigations. The Reviewer would "like to see at

least some sensitivity check of the standard error formula against

alternative model assumptions." Effectively, the Reviewer is asking

whether a more complex time series model is required to describe the

persistence.
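
For concreteness, here is a minimal sketch of the kind of AR-1 adjustment referred to above. It assumes the common approach of shrinking the sample size by the lag-1 autocorrelation of the regression residuals, n_eff = n(1 - r1)/(1 + r1), and using n_eff in place of n when estimating the residual variance. The function name and the monthly time axis are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def adjusted_trend_se(y):
    """Least-squares trend of y and its AR-1 adjusted standard error.

    Sketch only: assumes evenly spaced data and that the residual
    persistence is adequately described by lag-1 autocorrelation.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    t = np.arange(n, dtype=float)

    # Ordinary least-squares linear fit and its residuals.
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (intercept + slope * t)

    # Lag-1 autocorrelation of the residuals (their mean is zero by construction).
    r1 = np.dot(resid[:-1], resid[1:]) / np.dot(resid, resid)

    # Effective sample size under the AR-1 assumption.
    n_eff = n * (1.0 - r1) / (1.0 + r1)
    if n_eff <= 2.0:
        raise ValueError("effective sample size too small for a trend test")

    # Residual variance with n_eff - 2 degrees of freedom, then the trend SE.
    s2 = np.dot(resid, resid) / (n_eff - 2.0)
    se = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))
    return slope, se, r1, n_eff
```

With positively autocorrelated residuals, n_eff is smaller than n, so the adjusted standard error is larger than the unadjusted one.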

Estimating the order of a more complex AR model is a tricky business.

Typically, something like the BIC (Bayesian Information Criterion) or

AIC (Akaike Information Criterion) is used to do this. We could, of

course, use the BIC or AIC to estimate the order of the AR model that

best fits the regression residuals. This would be a non-trivial

undertaking. I think we would find that, for different time series, we

would obtain different estimates of the "best-fit" AR model. For

example, 20c3m runs without volcanic forcing might yield a different AR

model order than 20c3m runs with volcanic forcing. It's also entirely

likely (based on Rick Katz's experience with such AR model-fitting

exercises) that the AIC- and BIC-based estimates of the AR model order

could differ in some cases.

As the Reviewer himself points out, DCPS07 "didn't make any attempt to

calculate the standard error of individual trend estimates and this

remains the major difference between the two paper." In other words, our

paired trends test incorporates statistical uncertainties for both

simulated and observed trends. In estimating these uncertainties, we

account for non-independence of the regression residuals. In contrast,

the DCPS07 trend "consistency test" does not incorporate ANY statistical

uncertainties in either observed or simulated trends. This difference in

treatment of trend uncertainties is the primary issue. The issue of

whether an AR-1 model is the most appropriate model to use for the

purpose of calculating adjusted standard errors is really a subsidiary

issue. My concern is that we could waste a lot of time looking at this

issue, without really enlightening the reader about key differences

between our significance testing procedure and the DCPS07 approach.

One solution is to calculate (for each model and observational time

series used in our paper) the parameters of an AR(K) model, where K is

the total number of time lags, and then apply equation 8.39 in Wilks

(1995) to estimate the effective sample size. We could do this for

several different K values (e.g., K=2, K=3, and K=4; we've already done

the K=1 case). We could then very briefly mention the sensitivity of our

"paired trend" test results to choice of order K of the AR model. This

would involve some work, but would be easier to explain than use of the

AIC and BIC to determine, for each time series, the best estimate of the

order of the AR model.
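
A rough sketch of what this sensitivity check could look like follows. It fits an AR(K) model by solving the Yule-Walker equations and then applies the standard large-sample variance inflation factor for an AR(K) process, V = (1 - sum(phi_k * rho_k)) / (1 - sum(phi_k))^2, with n_eff = n / V. Whether this is exactly the form of equation 8.39 in Wilks (1995) is an assumption on my part, and the function names are hypothetical.

```python
import numpy as np

def sample_autocorr(x, max_lag):
    """Sample autocorrelations of x at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    c0 = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / c0 for k in range(1, max_lag + 1)])

def yule_walker_phi(rho):
    """AR(K) coefficients from autocorrelations rho_1..rho_K (Yule-Walker)."""
    rho = np.asarray(rho, dtype=float)
    K = rho.size
    r = np.concatenate(([1.0], rho[:-1]))  # autocorrelations at lags 0..K-1
    R = np.array([[r[abs(i - j)] for j in range(K)] for i in range(K)])
    return np.linalg.solve(R, rho)

def effective_sample_size(n, rho):
    """n_eff = n / V for an AR(K) model with K = len(rho)."""
    rho = np.asarray(rho, dtype=float)
    phi = yule_walker_phi(rho)
    V = (1.0 - phi @ rho) / (1.0 - phi.sum()) ** 2
    return n / V

# Example of the sensitivity loop over K for one residual series `resid`:
# for K in (1, 2, 3, 4):
#     n_eff = effective_sample_size(len(resid), sample_autocorr(resid, K))
```

For K=1 this reduces to the familiar n(1 - r1)/(1 + r1), so the existing K=1 results provide a consistency check.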

CONCERN #2: No "attempt to combine data across model runs."

The Reviewer is claiming that none of our model-vs-observed trend tests

made use of data that had been combined (averaged) across model runs.

This is incorrect. In fact, our two modified versions of the DCPS07 test

(page 29, equation 12, and page 30, equation 13) both make use of the

multi-model ensemble-mean trend.

The Reviewer argues that our paired trends test should involve the

ensemble-mean trends for each model (something which we have not done)

rather than the trends for each of 49 individual 20c3m realizations. I'm

not sure whether the rationale for doing this is as "clear-cut" as the

Reviewer contends.

Furthermore, there are at least two different ways of performing the

paired trends tests with the ensemble-mean model trends. One way (which

seems to be what the Reviewer is advocating) involves replacing in our

equation (3) the standard error of the trend for an individual

realization performed with model A with model A's intra-ensemble

standard deviation of trends. I'm a little concerned about mixing an

estimate of the statistical uncertainty of the observed trend with an

estimate of the sampling uncertainty of model A's trend.

Alternatively, one could use the average (over different realizations) of

model A's adjusted standard errors, or the adjusted standard error

calculated from the ensemble-mean model A time series. I'm willing to

try some of these things, but I'm not sure how much they will enlighten

the reader. And they will not help to make an already-lengthy manuscript

any shorter.
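
To make the alternatives concrete: they differ only in which model uncertainty term is paired with the adjusted standard error of the observed trend. The sketch below assumes that equation (3) has the usual normalized-difference form, d = (b_model - b_obs) / sqrt(s_model^2 + s_obs^2). The function names are hypothetical and any numbers passed in are purely illustrative.

```python
import math
import statistics

def paired_trend_statistic(b_model, b_obs, s_model, s_obs):
    """Normalized trend difference (assumed form of the paired trends test).

    s_model may be any of the candidate uncertainty terms discussed above:
    the adjusted standard error of a single realization, the intra-ensemble
    standard deviation of trends, or an ensemble-based adjusted standard error.
    """
    return (b_model - b_obs) / math.sqrt(s_model ** 2 + s_obs ** 2)

def intra_ensemble_sd(trends):
    """Sample standard deviation of trends across one model's realizations."""
    return statistics.stdev(trends)
```

Swapping the s_model argument is then enough to compare the variants for a given model.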

The Reviewer seems to be arguing that the main advantage of his approach

#2 (use of ensemble-mean model trends in significance testing) relative

to our paired trends test (his approach #1) is that non-independence of

tests is less of an issue with approach #2. I'm not sure whether I

agree. Are results from tests involving GFDL CM2.0 and GFDL CM2.1

temperature data truly "independent" given that both models were forced

with the same historical changes in anthropogenic and natural external

forcings? The same concerns apply to the high- and low-resolution

versions of the MIROC model, the GISS models, etc.

I am puzzled by some of the comments the Reviewer has made at the top of

page 3 of his review. I guess the Reviewer is making these comments in

the context of the pair-wise tests described on page 2. Crucially, the

comment that we should use "...the standard error if testing the average

model trend" (and by "standard error" he means DCPS07's sigma{SE}) IS

INCONSISTENT with the Reviewer's approach #3, which involves use of the

inter-model standard deviation in testing the average model trend.

And I disagree with the Reviewer's comments regarding the superfluous

nature of Section 6. The Reviewer states that, "when simulating from a

known (statistical) model... the test statistics should by definition

give the correct answer." The whole point of Section 6 is that the DCPS07

consistency test does NOT give the correct answer when applied to

randomly-generated data!

In order to satisfy the Reviewer's curiosity, I'm perfectly willing to

repeat the simulations described in Section 6 with a higher-order AR

model. However, I don't like the idea of simulation of synthetic

volcanoes, etc. This would be a huge time sink, and would not help to

illustrate or clarify the statistical mistakes in DCPS07.

It's obvious that Reviewer #1 has put a substantial amount of effort

into reading and commenting on our paper (and even performing some

simple simulations). I'm grateful for the effort and the constructive

comments, but feel that a number of comments are off-base. Am I

misinterpreting the Reviewer's comments?

With best regards,

Ben

24-Apr-2008

JOC-08-0098 - Consistency of Modelled and Observed Temperature Trends in the Tropical Troposphere

Dear Dr Santer

I have received one set of comments on your paper to date. Although I would normally wait for all comments to come in before providing them to you, I thought in this case I would give you a head start in your preparation for revisions. Accordingly, please find attached one set of comments. Hopefully I should have two more to follow in the near future.

Best,

Prof. Glenn McGregor

