Historian on the Edge: Archaeology, History and Bad Science: A critique of the analysis of DNA at Szólád (Hungary) and Collegno (Italy). Part 2 (Method; Results).

Method

Imagine a historical study that claimed that a general north-south division was visible in, for the sake of argument, the prologues of medieval charters and that this model had predictive value, such that the geographical origin of a charter could be accurately discerned from the sequence, appearance, or non-appearance of particular phrases. This would be quite an assertion, if not necessarily implausible. Imagine, however, that this model had been constructed from a sample of no fewer than 2,233 charters from Switzerland and southern Europe but of only 179 from northern Europe. The claim would – at best – be regarded as shaky. Yet the geographical distribution of the modern DNA samples against which the aDNA extracted at Szólád and Collegno was compared takes exactly this form. The POPRES (POPulation REference Sample) database,[1] the largest of those used to establish the distribution of genetic types across modern Europe, contains 1,349 subjects from Switzerland, 599 from the United Kingdom, 147 from France, 131 from Portugal, 114 from Italy and 92 from Spain. For the variable ‘Country of Father’, the uneven distribution was skewed further: 1,404 samples from Switzerland, 310 from Italy, 184 from Spain, 177 from France and 158 from Portugal, compared with 91 from Germany, 50 from Belgium and 38 from England.[2] That latter pattern was repeated across the other country of parent or grandparent variables. Every other nation represented, and from which regional characteristics were constructed, contained a few dozen individuals at most. Remember, too, that the Swiss and southern European subjects were drawn from a population of 193.185 million people: they represented, in other words, just over a thousandth of one percent of the modern population. Even the largest sample, which happened to be taken from the smallest population (that of Switzerland), represents only a hundredth of 1% of the latter. The whole POPRES reference population totals only 5,886 subjects, reduced after quality-control procedures to 3,082. These are infinitesimally small samples.
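The arithmetic behind these proportions can be checked with a short Python sketch. The counts and the combined population of 193.185 million are those quoted above; the Swiss population of about 8.6 million is an assumption introduced here for illustration and does not come from the POPRES documentation.

```python
# Counts quoted in the paragraph above ('Country of Father' variable).
southern_and_swiss = {"Switzerland": 1404, "Italy": 310, "Spain": 184,
                      "France": 177, "Portugal": 158}
northern = {"Germany": 91, "Belgium": 50, "England": 38}

southern_total = sum(southern_and_swiss.values())  # 2233
northern_total = sum(northern.values())            # 179

COMBINED_POPULATION = 193_185_000   # figure quoted in the text
SWISS_POPULATION = 8_600_000        # ASSUMPTION: rough modern figure, not from the text
SWISS_SUBJECTS = 1349               # Swiss subjects in POPRES, as quoted


def percent_sampled(sample_size: int, population: int) -> float:
    """Return the sample as a percentage of the population it is drawn from."""
    return 100.0 * sample_size / population


southern_pct = percent_sampled(southern_total, COMBINED_POPULATION)
swiss_pct = percent_sampled(SWISS_SUBJECTS, SWISS_POPULATION)

print(f"southern and Swiss subjects: {southern_total}, northern: {northern_total}")
print(f"share of the combined population sampled: {southern_pct:.5f}%")
print(f"share of the assumed Swiss population sampled: {swiss_pct:.4f}%")
```

Under that assumed Swiss population, the largest sample comes out at a little over a hundredth of one percent, and the combined southern and Swiss sample at just over a thousandth of one percent.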

The POPRES data was compiled from ten collections with the overall aim of providing useful material ‘for population, disease, and pharmacological genetics research’.[3] Its UK data came overwhelmingly from a sample of 431 South Asian and 938 Northern European subjects, aged between thirty-five and seventy-five, collected from fifty-eight GPs in west London for the purposes of research into cardiovascular illness. The largest population (2,809 subjects) was assembled from the Centre Hospitalier Universitaire Vaudois in Lausanne (hence the dominance of Swiss and southern European samples). Other collections included only healthy subjects. Much of the European genetic data was assembled from the declarations of US, Canadian and Australian subjects of their paternal and maternal ancestry. According to the publication of the POPRES project, ‘[t]he second round of quality control included further PCA [Principal Components Analysis] to identify subjects with ... misreported genetic ancestry’.[4] What this means is unclear but it gives the impression that ‘misreporting’ was established upon the basis of perceived statistical anomaly. Another statement of method is worth quoting in full:

Based on this information, we first attributed a best-guess geographic label to each of the family members based on the following rules: 1) missing data was ignored; 2) if ethnicity conflicted with birthplace or first language data, only ethnicity was considered; 3) if birthplace and first language disagreed, a higher level container label was chosen (e.g. an individual who was born in France but reported his first language to be Norwegian was labeled European); and 4) white individuals born in the US or Canada were attributed according to the first language information alone, if other than English.[5]

Such methods and assumptions may be fair enough, especially for the purposes for which the database was assembled, but they might all be subject to discussion if employed to study historical population movement and ethnicity: not a purpose for which the data were collected. Furthermore, there are serious differences in the reliability of these data between their use at the level of population and their employment at the level of individuals.[6]

The other reference populations were smaller than the POPRES sample.[7] The ‘1000 Genomes Project’, from which the genetic ancestry of the Szólád-Collegno individuals was estimated, used populations of only 100-200 subjects.[8] Those that were most significant in the discussion of the results were ‘Central Europeans in Utah’ (CEU: 184 samples); ‘Toscani’ project (TSI, from a small town near Florence: 117 samples); Great Britain (GBR: 107 samples); and ‘Iberians in Spain’ (IBS: 162 samples). Statistically we are looking at fragments of droplets in oceans.

The small modern DNA sample was compared with a still smaller (435 subjects)[9] sample of aDNA collected from archaeologically-recovered skeletons of Bronze Age date ‘or more recent’[10] and the conclusion drawn that, over the past three and a half millennia in Europe, there has been only the barest drift of population, generally to the south.[11] It is significant, however, that the distribution of the Bronze Age sample was almost the diametrical opposite of that of the modern reference population: ninety-three subjects from north of the Alps (mostly from Germany), compared with thirty-three from south of the mountains (including only four from Italy).[12] Some areas heavily represented in the modern sample (Switzerland, France, the UK) featured barely or not at all in the Bronze Age reference set. We have no idea what the population of Bronze Age northern Europe was but one imagines that eighty-eight would be a small fraction of one percent of it. This should raise all sorts of red flags. On this statistical basis, the claims made by Amorim and his fellow authors are, to be generous, bold indeed.

The kindreds were illustrated through a comparison of their genetic ancestry, expressed in terms of the admixture of seven types, of which the most important were labelled ‘CEU+GBR’, ‘TSI’ and ‘IBS’ (see above). It might seem fair to refer to ‘CEU+GBR’ as ‘northern’ and ‘TSI’ and ‘IBS’ as ‘southern’. It should be noted though that the analyses of the aDNA were incapable of clearly separating ‘GBR’ and ‘CEU’. The ‘CEU’ population, as will have been noted, was of modern Americans of European descent. How accurate and precise are its results likely to be in a European context? Even without the problems of sample and method, combining these ancestries covers a very broad region of Europe, inside and outside the Roman frontiers, one unlikely to sustain the very precise but sweeping claims made in the article.

This experiment takes data suggesting immobility, constructed at the level of populations, and compares it against data drawn from individuals and putatively showing migration. If, against the backdrop of a 3,500-year-long history of supposedly general population immobility, aDNA taken from sixty-three burials at two different cemeteries revealed, at both sites, evidence of the arrival of genetically distinct populations, this must have been the equivalent of randomly locating the proverbial needle in a haystack, worthy of a media ‘splash’ in itself. The other implication ought to be that – given the supposed genetic difference of the incomers at Collegno from modern north Italians – whatever its scale, this population movement turned out to be a genetic dead-end, leaving no significant trace in the region’s modern population. If so, the value of this research for the study of the Völkerwanderung should be quite the opposite of that which has been supposed.

Finally, it is worth mentioning that one set of results, which lay outside the expected range, was rejected on hypothetical grounds: ‘this sample showed high levels of contamination (which we hypothesize is the result of plastic wares produced in China that were utilized in DNA extraction) and thus the results are unreliable.’[13] If that were the case, surely that whole body of data should be thrown out of the experiment, not just selected results that did not ‘fit’.

Results

We must assume that the laboratory analyses and subsequent mathematical modelling were flawless but there are strong reasons to discount the experiment’s results on the grounds of its set-up and the problems with its samples. Let us nonetheless treat the results on their own terms. My first point concerns the geographical plotting of different genotypes. The SPA (Spatial Ancestry Analysis) plots geographical coordinates for each allele within a Single Nucleotide Polymorphism (SNP) according to the location of the individual from whom the DNA sample was taken.[14] This data can then be used to predict the location of individuals according to the frequency of particular alleles within the SNPs of their genome. After running a series of SPA analyses, the geographical location of the individuals whose DNA was collected could be represented on a graph, using x and y coordinates, in such a way that the means of samples from different regions stood in a spatial relationship to each other that more or less replicated the geographical relationships between those regions. Thus the means of samples from, say, the Republic of Ireland, the United Kingdom and the Netherlands would be located near each other in the top left quadrant of the graph, above and perhaps to the left of the mean for France, and so on. Now, as Yang et al. illustrate, a very similar result can be produced using Principal Components Analysis.[15] If that is so, it must also be the case that geographical coordinates describe (that is to say represent the variation within) the data to a greater degree than variables within the genotypes. Otherwise, the means of samples taken from particular countries or regions would be pulled into clusters according to those genetic variables rather than their geographical relationships. Alternatively, the genetic variables have been represented in such a way as to describe the data less well than the geographical coordinates and so allow the latter to determine the plot to a greater extent. Now, for the medical purposes for which SPA or other genetic models were created, this need not be an issue; indeed it might be desirable. For the discussion of historical genetics, however, one is left wondering exactly how significantly the genotypes differ from one another. [I am no longer sure that I have expressed (or got) this quite right, but there’s something very problematic about this mapping and its implications, and the assumptions made about it.] The plotting of individual samples against these geographically-driven plots seems to produce anomalies.
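For readers unfamiliar with what it means for a Principal Components Analysis to ‘reproduce’ a map, the following is a toy simulation in Python (using NumPy), not a reconstruction of SPA or of anything done by Yang et al. or Amorim et al. Every quantity in it is hypothetical: the sampling locations, the number of SNPs, and above all the assumption that each allele’s frequency changes gently and smoothly across space. Under that assumption, the first two principal components of the genotypes line up closely with the simulated coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_snps = 300, 2000

# Hypothetical sampling locations in a unit square (x: west-east, y: south-north).
coords = rng.uniform(0.0, 1.0, size=(n_people, 2))

# ASSUMPTION: each simulated SNP has a baseline frequency plus a gentle
# linear gradient across the square (on the logit scale).
baseline = rng.normal(0.0, 0.5, size=n_snps)
slopes = rng.normal(0.0, 1.0, size=(2, n_snps))
logit = baseline + (coords - 0.5) @ slopes
freq = 1.0 / (1.0 + np.exp(-logit))

# Diploid genotypes: 0, 1 or 2 copies of the allele per person per SNP.
genotypes = rng.binomial(2, freq)

# PCA via singular value decomposition of the column-centred genotype matrix.
centred = genotypes - genotypes.mean(axis=0)
u, s, _ = np.linalg.svd(centred, full_matrices=False)
pcs = u[:, :2] * s[:2]  # scores on the first two principal components


def r_squared(target, predictors):
    """Share of the variance in `target` explained by a linear fit on `predictors`."""
    design = np.column_stack([predictors, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return 1.0 - residual.var() / target.var()


r2_x = r_squared(coords[:, 0], pcs)
r2_y = r_squared(coords[:, 1], pcs)
print(f"variance in x explained by PC1 and PC2: {r2_x:.2f}")
print(f"variance in y explained by PC1 and PC2: {r2_y:.2f}")
```

The sketch only shows that a map-like plot can emerge from many small gradients added together; it says nothing about how large the differences at any one SNP are, which is the question raised above.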

The presentation of the experiment’s results steers the reader towards a particular interpretation. At both Szólád and Collegno analyses suggested the existence of two genetically distinct groups. The argument is that one such group represents ‘northerners’, implicitly immigrants, and the other ‘locals’ (we can bracket the question of whether these assumptions are valid). On the published diagrams the former is coloured blue; the latter red. There is no good reason to have used exactly the same colour-coding at both sites or, alternatively, to have overlaid the results from both sites on the same figures,[16] especially when we might suppose that they represent significantly different populations. Clearly, the reader is intended to associate the two groups at the two sites and to see them as parts of two larger, generally distinct populations – of incoming Longobards and local provincial Romans, respectively. This hampers any critical reading of the data.

The analyses strongly suggested that there were genetically distinct kindreds present at both Szólád and Collegno. They also showed, however, that the two sites’ populations were quite similar overall.[17] Both included people with genetic ancestry of predominantly ‘CEU+GBR’ (‘northern’) type and others with ancestry that was overwhelmingly of ‘TSI’ (‘southern’) type,[18] although most individuals showed combinations of the two. ‘IBS’ ancestry at both was only found in subjects who showed ‘TSI’ ancestry (although in most cases ‘CEU+GBR’ was also present). Both sites contained some people with entirely ‘northern’ and others with entirely ‘southern’ genetic ancestry. On the basis of the genetic data, however, there would be as much reason to suppose that, at least in Szólád, the people with ‘southern’ ancestry were the incomers, and those with ‘northern’ ancestry the locals, rather than vice versa. That might superficially seem less likely at Collegno but the lack of significant Italian aDNA comparanda means it is possible there too. The – hardly numerous – prehistoric Italian aDNA subjects clustered in a quite different part of the SPA diagram from the Collegno ‘southerners’.[19] That the kindred with ‘northern’ ancestry were newer to the region of Collegno than the other kindreds was only revealed by the isotopic analyses which were, overall, more interesting than the genetic studies. At Szólád those analyses suggested that kindreds of both ancestry types had moved there quite recently.

Two of the analysts’ presuppositions come into play here, neither of which emerges from the data themselves. The first is that the pattern illustrates a specific episode of demographic movement; the similarity results from one population moving to the area of the other. The second is that, more specifically, this episode was the Longobard migration from Pannonia to Italy. Without these, one could argue that the profiles of the two sites revealed that, genetically, populations in sixth-century northern Italy and in Pannonia were fairly similar and attested to continuous movement back and forth between the two regions as one might expect on historical grounds. This is why a control, or other comparanda, was (or were) essential. How likely is it that a sample of any cemetery in the Po Valley, dating to any period between the ‘Celtic’ settlement of Cisalpine Gaul and now, would – like Collegno – contain at least some people with genetic make-up suggestive of comparatively recent origins north of the Alps? I would propose that the answer is ‘very likely’.

While difficult and dangerous, population movement across the Alps has been constant since Ötzi the Iceman.[20] The Celtic migration into northern Italy has been mentioned; later, the different regions were part of the same imperial state for the best part of five centuries; Carolingian Italy was politically connected with the Rhine valley, Germany, and Provence and many armies (and doubtless countless individuals) moved back and forth over the mountains. Those contacts continued in the period of the ‘Holy Roman Empire’, bringing French and German troops into the peninsula, as happened again in the sixteenth-century Italian Wars. The ensuing Hapsburg dominance of northern Italy strengthened the already significant ties between that region, southern Germany, and Hungary up to the late nineteenth century. The idea that genetic similarities between Italian and transalpine populations at any point in history can (let alone must) be explained by then recent, discrete large-scale events lacks empirical basis. In other words, while the similarities between the populations of Szólád and Collegno surely attest to individual movement across the Alps, there is no good reason to suppose that they must testify to any particular, large-scale ‘migration event’, or to change rather than stasis in patterns of population movement. None of that (or indeed any of the arguments proposed here) means there was no Longobard migration or that that movement did not involve a large number of people: both facts are incontestable. What they do mean is that the evidence from these sites is not necessarily evidence of that migration, and that traces of that migration need not be expected to be especially clear in the genomes of late antique northern Italians. In many regards the experiment was fundamentally ill-conceived.

Overall, the analyses suggested a generally mixed population at Szólád. The results of Principal Components Analysis (PCA) of Hungarian aDNA samples, when overlaid (using ‘Procrustes’[21]) with the Szólád-Collegno and modern reference samples, revealed a distribution that overlapped with the ‘northern’ and ‘southern’ Szólád kindreds.[22] If the authors’ assumptions about long-term population stability between the Bronze Age and the present day, and about their methodology, were correct,[23] this evidence would surely not show very conclusively that either group had moved into the region from a significantly distinct area. Analysis of the strontium content of the teeth at Szólád did not suggest that the ‘northern’ group were necessarily more likely to be outsiders than the ‘southern’ group. They were, however, evidently more heterogeneous in origin than the latter. According to the ‘narrative’ the study was supposed to be ‘testing’, barbarian immigrants were heterogeneous but on what basis would one assume that the population of late Roman Pannonia was not? From historical sources we know that it was a frontier province in which garrisons of diverse origins were stationed; in the late fourth century, Goths passed through the region more than once; the fifth century saw several groups, not least the Huns and Ostrogoths, resident there. The latter, of course, later moved to Italy and established a kingdom there.
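What a ‘Procrustes’ overlay does can be shown with a minimal Python sketch (using NumPy). This is a generic textbook version of the transformation, written for illustration; it is not the software or the settings used in the study, and the function name `procrustes_fit` and the example points are hypothetical.

```python
import numpy as np


def procrustes_fit(reference, moving):
    """Shift, scale and rotate/reflect `moving` so it best overlays `reference`.

    Both arguments are (n_points, n_dims) arrays whose rows are matched
    pairs. 'Best' means least sum of squared distances between the pairs.
    """
    ref_mean = reference.mean(axis=0)
    mov_mean = moving.mean(axis=0)
    ref_c = reference - ref_mean
    mov_c = moving - mov_mean

    # The optimal rotation/reflection comes from the SVD of the cross-product.
    u, s, vt = np.linalg.svd(mov_c.T @ ref_c)
    rotation = u @ vt
    scale = s.sum() / (mov_c ** 2).sum()

    return scale * mov_c @ rotation + ref_mean


# Hypothetical example: the same 20 points, rotated, enlarged and shifted.
rng = np.random.default_rng(1)
reference = rng.normal(size=(20, 2))
angle = 0.7
turn = np.array([[np.cos(angle), -np.sin(angle)],
                 [np.sin(angle), np.cos(angle)]])
moving = 2.5 * reference @ turn + np.array([3.0, -1.0])

fitted = procrustes_fit(reference, moving)
print(np.allclose(fitted, reference))
```

The transformation is chosen precisely to minimise the mismatch between the matched points, so some closeness between the overlaid plots is to be expected from the procedure itself.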

The results of the analyses, as presented in the diagrams in Nature, seem less than convincing when examined closely. Subjects with different genetic ancestry are plotted against modern and Bronze Age subjects, as mentioned earlier. However, their grouping raises critical issues as the Principal Components Analysis, for whatever reason, pulled the data in such a way as to reveal, in some cases, a greater range within the groups defined by their ancestry than between them. For example, the Szólád ‘northerner’ and the two Szólád ‘southerners’ plotted nearest the origin lie closer to each other than they do to the members of their groups plotted farthest from the origin. It is also clear that some Principal Components Analyses have described the data far less clearly than others. Something in the data gives us grounds to wonder about the combination of different analyses of different data sets. Is the PCA calling the ‘Admixture’ analyses into question? This especially muddles the results at Szólád. That issue is further obfuscated by the overlaying of the results from both sites on the same plot and the use of the same colours in their representation, discussed earlier.

The ‘northern’ group at Collegno is in fact clustered more compactly, in a different region of the plot from the Szólád ‘northerners’. Indeed we see that the different genetic kindreds at Collegno are far more significantly separated on that plot, something that might support the conclusions the authors wished to draw. However, we also perceive a third group clearly distinguished from both: those with over 50% ‘TSI’ and ‘IBS’ (Tuscan and Iberian) ancestry who, one would have thought, ought to be plotted much further towards the ‘south’ or ‘south-east’ (or lower left-hand) quadrant on the PCA plot rather than in the region where modern Central European subjects cluster. This must question some of the experiment’s assumptions.[24] Ultimately, though, while there are nine ‘northerners’ (with over 70% ‘CEU+GBR’ ancestry) plotted, there are only four with over 70% Tuscan ancestry and four with Tuscan/Iberian. We may wonder why the authors chose to emphasise only the group with Tuscan ancestry as locals when the ‘TSI+IBS’ group could just as easily be called ‘southerners’, unless it was because this was inconvenient for the narrative that they had decided their results should present.

Finally, given the stress laid upon ancestry in the article’s conclusions, the cautionary note sounded recently by Mathieson and Scally is important:

Another source of confusion is that three distinct concepts – genealogical ancestry, genetic ancestry, and genetic similarity – are frequently conflated. ... but note that only the first two are explicitly forms of ancestry, and that genetic data are surprisingly uninformative about either of them.[25]

Notes

[1] https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000145.v4.p2

[2] https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs000145/phs000145.v4.p2/pheno_variable_summaries/phs000145.v4.pht000659.v2.p2.POPRES_v1_v2_Subject_Phenotypes.var_report.xml (accessed 26/02/2021)

[3] Nelson MR, et al., ‘The Population Reference Sample, POPRES: a resource for population, disease, and pharmacological genetics research.’ Am J Hum Genet. 2008 Sep;83(3):347-58. doi: 10.1016/j.ajhg.2008.08.005. Epub 2008 Aug 28. PMID: 18760391; PMCID: PMC2556436.

[4] Nelson et al., ‘The Population Reference Sample’.

[5] https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/dataset.cgi?study_id=phs000145.v4.p2&phv=173964&phd=&pha=&pht=2998&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1

[6] Mathieson I, Scally A (2020) ‘What is ancestry?’ PLoS Genet 16(3): e1008624. https://doi.org/10.1371/journal.pgen.1008624

[7] The next largest reference population was that created by G. Hellenthal, G.B.J. Busby, et al., ‘A genetic atlas of human admixture history.’ Science. 2014 Feb 14;343(6172):747-751. doi: 10.1126/science.1243518. PMID: 24531965; PMCID: PMC4209567. This contained 1,490 subjects from ninety-five genotyped population groups worldwide (thus an average of fifteen subjects each).

[8] A. Auton et al., ‘A global reference for human genetic variation.’ Nature 526, 68–74 (2015). https://doi.org/10.1038/nature15393. See https://www.internationalgenome.org/1000-genomes-project-publications/

[9] Mathieson et al., ‘Genome-wide patterns of selection in 230 ancient Eurasians.’ Nature 528, 499–503 (2015); Mathieson et al., ‘The genomic history of southeastern Europe.’ Nature 555, 197–203 (2018).

[10] ‘For the latter two studies we only utilize individuals dating from the Bronze Age (within which we included the Beaker Culture) or more recent’. K.R. Veeramah, ‘Supplementary Note 6. Modern and ancient reference dataset construction’. Amorim et al. ‘Understanding’, OSM, pp.28-30, at p.29. The implications of the phrase ‘or more recent’ are unclear.

[11] C.E.G. Amorim & K.R. Veeramah, ‘Supplementary Note 7. Principal Component Analysis’, Amorim et al., ‘Understanding’, OSM, pp.31-35, at pp.32-33.

[12] Mathieson et al., ‘Genome-wide patterns’, Supplementary Data 1; Mathieson et al., ‘The genomic history’, Supplementary Data. Most data were from eastern and south-eastern Europe and western Asia. There were two more Italian aDNA samples from Neolithic subjects.

[13] C.E.G. Amorim & K.R. Veeramah, ‘Supplementary note 7.’, p.31.

[14] Yang, et al., ‘A model-based approach’.

[15] Yang, et al., ‘A model-based approach’, fig.2.e.

[16] Amorim et al., ‘Understanding’, fig.2.

[17] Amorim et al., ‘Understanding’, fig.3.

[18] On which, see above, pp.000.

[19] Amorim et al., ‘Understanding’, fig.2b.

[20] Ötzi, ironically, was included in their ‘Bronze Age’ sample. Above, n.30.

[21] A transformation of one plot so that it overlays another with the best fit.

[22] Amorim et al., ‘Understanding’, fig.2b.

[23] An assumption questioned by the fact that modern Hungarian DNA samples clustered in quite a different part of the diagram; Amorim et al., ‘Understanding’, fig.2a.

[24] See above.

[25] Mathieson & Scally, ‘What is ancestry?’

Thursday, 15 December 2022
