At last, after 24 years of reflection, Professor Flynn has felt able to disburden himself of an ambitious general book[1] aimed at explaining the “massive rise in IQ” during the 20th century, an effect which now carries his name.


The results are curious. To be sure, Flynn has wished to write a popular book of general interest, not devoid of some philosophical self-indulgence, and to achieve both “critical acumen” (a pet concept) and judicious inoffensiveness. For aficionados he includes, in tabular form, much of the psychometric test data which support his finding of long-term secular gains in IQ.


Many debates regarding intelligence are both vexed and thorny, so it is helpful that his general account is careful to avoid animosity and welcomes the contending parties within the congenial circle of his “humane egalitarianism”.


However, I do have some criticisms:  

  1. It is actually quite difficult to discover what he does think, partly because his manner of exposition is amazingly diffuse. Eventually, in the final chapter, he admits to what must be described as an extreme form of environmentalism (an increasingly complex post-industrial society has amplified cognitive skills).
  2. Flynn seems surprisingly uninterested in psychometrics and test construction as such, yet, as I will argue, these topics may contain a large part of the relevant explanation. What is universally known as the ACID profile he calls, for some reason, the AICD profile.
  3. Flynn mentions his background as a moral philosopher and seems disinclined to do any original research into the questions that interest him (though he is generous in suggesting research designs to others). The consequence is that the book is a hash of opinions, mostly unsupported by factual evidence. This does not detract from its stimulating properties but cumulatively leads to some confusion and frustration in the specialist reader.
  4. The book is quite carelessly written, much of it apparently dictated, and poorly proofed: words are missing and the expression is often rough. It reads like a draft. Though often amusing and direct, the manner is frequently somewhat crackerbarrel.


Missing not only from Flynn’s account, but from most other discussions of this topic which has excited so many experts, is any consideration of the vastly changed technology of psychometric assessment. We expect a car built in 2000 to be a great deal more powerful than one built in 1910. Particularly since 1960, when Georg Rasch published his innovative account of test construction theory,[2] and since 1980, when the penny finally dropped after a 20-year lag, the technology of item response theory (IRT) has radically streamlined test administration. Older tests now look like what they are, museum pieces, simply because of their gross superfluity of items. We no longer find it acceptable that a new test should be published that does not utilise IRT-based scaling.
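By way of illustration, the dichotomous Rasch model at the heart of this technology can be sketched in a few lines of Python. The function names and the ability and difficulty values below are assumptions chosen for illustration only; they are not taken from Rasch's text or from Flynn's book. The sketch shows that an item tells us most about an examinee when its difficulty is close to that examinee's ability, which is why a well-targeted short test can do the work of a much longer, poorly targeted one.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are on the same logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Fisher information of one Rasch item at a given ability: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


if __name__ == "__main__":
    ability = 0.0  # hypothetical examinee, in logits
    for difficulty in (-3.0, 0.0, 3.0):  # hypothetical item difficulties
        p = rasch_probability(ability, difficulty)
        info = item_information(ability, difficulty)
        print(f"difficulty={difficulty:+.1f}  p={p:.3f}  information={info:.3f}")
```

Information is greatest when ability and difficulty coincide (the probability of success is then one half) and falls away for items that are far too easy or far too hard for the person tested.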


For instance, the family of Raven tests (the Coloured Progressive Matrices,[3] Standard Progressive Matrices[4] and Advanced Progressive Matrices[5]) between them offer 108 items, probably enough for three equivalent all-age tests. Using Andrich’s Rasch analysis of the difficulty values,[6] it is possible to select every third item, once ranked for difficulty, and come up with something like an economical modern test.[7]
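The selection procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `select_every_nth` is hypothetical, and the difficulty values are evenly spaced placeholders, not Andrich's published estimates.

```python
def select_every_nth(items, n=3, start=0):
    """Rank (item_id, difficulty) pairs by difficulty and keep every n-th one.

    `start` chooses which of the n interleaved forms is returned.
    """
    ranked = sorted(items, key=lambda pair: pair[1])
    return ranked[start::n]


if __name__ == "__main__":
    # Placeholder pool: 108 items with made-up difficulties in logits.
    # These are NOT real Raven item difficulties.
    pool = [(f"item_{i:03d}", -4.0 + 8.0 * i / 107) for i in range(108)]

    short_form = select_every_nth(pool, n=3, start=0)
    print(len(short_form), "items selected")
    print("easiest:", short_form[0])
    print("hardest:", short_form[-1])
```

Calling the function with `start=0`, `start=1` and `start=2` yields three non-overlapping forms that each span the full difficulty range, which is the sense in which a pool of this size is probably enough for three equivalent tests.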


Because of the worldwide establishment and acceptance of the Wechsler tests, there has been a sustained reluctance to modernise them. The Wechsler Intelligence Scale for Children – 3rd edition (WISC-III)[8] was a remarkably tame revision and, even with one new test added, psychologists continued to give the test as if it were the Wechsler Intelligence Scale for Children – Revised (WISC-R),[9] omitting Symbol Search (the new addition) and reporting subtests in their old order. They went on hunting for the subjectively perceived ACID profiles even when provided with the quantitatively based factor or Index scores.[10]


Things changed, however, when the Psychological Corporation assembled a team to revise the WAIS. Originally known as the Wechsler-Bellevue and first published in 1939, this test was standardised in the most limited way on a single convenience sample of adults in a suburb of New York, but became the progenitor of the entire family of Wechsler tests. Desiring a downward extension for children, David Wechsler produced the first WISC in 1949. Although the sampling was now better, all the children were white:


The national standardization sample for the WISC, stratified by geographic region and parental occupation, consisted of 2200 white children (1100 boys, 1100 girls): 100 boys and 100 girls at each age from 5 to 15 years inclusive […] Miele (1958) reported the mean raw scores obtained by the examinees in the WISC standardization sample for each subtest by age and sex. Because standard deviations for WISC raw scores were never reported, total raw-score variability had to be estimated from the tables of norms in the WISC manual, which were used to calculate the interquartile range (the difference between the score at the 25th percentile and the 75th percentile in a normal distribution of scores) … [T]he standard deviation is 25% smaller than the interquartile range in a normal distribution of scores (Rosenthal & Rosnow, 1984) […] [11]
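The interquartile-range estimate described in the passage just quoted can be checked numerically. The sketch below uses only Python's standard library; the function names and the example quartiles are assumptions for illustration and do not come from the WISC manual. For a normal distribution the exact ratio of standard deviation to interquartile range is about 0.74, so the "25% smaller" of the quotation is a convenient rounding.

```python
from statistics import NormalDist


def sd_to_iqr_ratio():
    """Ratio SD / IQR that holds for any normal distribution."""
    z75 = NormalDist().inv_cdf(0.75)  # 75th percentile of the standard normal
    return 1.0 / (2.0 * z75)


def sd_from_quartiles(q25, q75):
    """Estimate the SD of normally distributed scores from two quartiles."""
    return (q75 - q25) * sd_to_iqr_ratio()


if __name__ == "__main__":
    print(f"SD / IQR = {sd_to_iqr_ratio():.4f}")
    # Hypothetical quartiles, chosen to sit near those of an IQ-style scale.
    print(f"estimated SD = {sd_from_quartiles(89.88, 110.12):.2f}")
```

The estimate is only as good as the assumption of normality; with skewed raw scores the same conversion would misstate the spread.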


The test was to prove a permanent success in that it added the vital dimension of visuospatial problem-solving ability to the heavily verbal emphasis embodied in the older Stanford-Binet (Terman Merrill) test.


Flynn does not seem to appreciate the extent to which the content of subtests was revised with each new edition, nor the progressive refinement of technical standards indicated above. By the time the team got to work to produce the Wechsler Adult Intelligence Scale (WAIS-III),[12] the courage to modernise radically had at last been found;[13] the same team was shortly afterwards responsible for the WISC-IV revision also.[14] These two tests give Flynn a problem because of the radical extent of the alterations. Yet if one accepts the factorial logic of g (the statistical general factor in such tests), it should not really matter how it is measured. In particular, he overlooks the case, lying within his own data, that modernisation of the tests seems to result in more measured intelligence.[15]


The technological argument must be that, as the measurement of intelligence threw off its historical legacy of sensorimotor testing,[16] so the tests revealed more and more of what we might nowadays regard as true intelligence, namely higher-order abstract logical reasoning and novel problem-solving. Engineers speak of signal-to-noise ratio, and the history of psychometric technology is one of increasing signal and decreasing noise. In particular, the Wechsler tests included an enormous contamination of fine motor skills (especially in the timed tests) in what seems a mélange of sensorimotor activities. The most streamlined modern tests of general cognitive ability, such as the Differential Ability Scales – Second Edition,[17] are almost free of such noise and show why such psychometric testing is now regarded as virtually a finished technology. This may indeed result in a cessation of further upward movement in population IQs, as is apparently already evident in the Scandinavian countries.


Yet this is counterintuitive. We are all familiar with the antique tests, beloved of Mensa, which happily produce IQs of 180.[18] Is it therefore the case that older tests inflate, and more rigorous modern tests deflate, IQ? This would not constitute an explanation of the Flynn Effect! I believe the opposite is the case. Because of the poor technical standards — to our contemporary eyes — of the older tests, there was poor targeting of intellectual ability and a greater element of randomness. Paradoxically, the more rigorous modern technology actually reveals more of the intelligence that is there. Thus we do not overestimate contemporaries but underestimate, for technical reasons, previous generations.


The special case here, and the counterargument which must be addressed, is that of the Raven tests. If only raw scores were reported in the literature, as they often are by experimental psychologists typically reporting highly focused enquiries, we should be on firm ground. But the massive gains that show up on this test of non-verbal reasoning typically involve feeding raw scores through the wholly inadequate apparatus of the various standardisations published since the 1930s[19] in order to report derived scores of some kind. None of these standardisations has ever commanded general acceptance, with the partial exception of the 1979 norms for the SPM.


Nevertheless, I remain open to the very considerable evidence of the Flynn effect, to which some reliable Raven raw score data must powerfully contribute, but suggest that many of the competing explanations at present remain unsatisfactory. The most parsimonious explanation, or part-explanation, should concern the measurement technology itself, before we start speculating about the complexity of modern society and the scientific training of modern populations. Flynn certainly reveals the extent to which so-called IQ tests are relativistic from top to bottom, though he does not choose to emphasise their powerful clinical accuracy and utility. But he seems unwilling to recognise their central role in contributing to his undoubted effect.


We are left, then, with widely accepted and scarcely challenged evidence of “massive IQ gains” ever since such tests were invented, together with an excessive number of explanations and non-explanations. Though we may feel more sympathy for this one, and less for that, I suggest that a modest but fundamental source of influence lies in the progress of the technology itself. Before we resort to Bonapartist social explanations, we ought surely to consider the little black boxes from which the grand effect arises.


