Japanese surname regions

18
Japanese surname regions James A. Cheshire 1 , Paul A. Longley 2 , Keiji Yano 3 , Tomoki Nakaya 3 1 Centre for Advanced Spatial Analysis, University College London, 90 Tottenham Court Road, London W1T 4TJ, UK (e-mail: [email protected]) 2 Department of Geography, University College London, Gower Street, London WC1E 6BT, UK (e-mail: [email protected]) 3 Department of Geography, Ritsumeikan University, 56-1, Toji-in-kita-machi, Kita-ku, Kyoto, 603-8577 Japan (e-mail: [email protected], [email protected]) Received: 13 March 2012 / Accepted: 30 October 2012 Abstract. This paper uses an extended case study of Japan to illustrate how surnames, or family names, can be used as a basis for regionalization. We undertake a comparison between induc- tively surname regions of Japan with areal geographies based upon both contemporary and historical prefecture (administrative) units. The work is seen as using highly disaggregate framework data to evaluate the integrity of the areal units that are used in regional science. It also is relevant to understanding population distributions, past and present, and the consequences of local, regional and national residential mobility and migration. JEL classification: C02, R23, R59, P25 Key words: Surnames, regionalization, geodemographics, Lasker distance, clustering 1 Introduction It is well known that there are few, if any, inherently ‘natural’spatial units of analysis for the study of population characteristics in regional science (Openshaw 1984), and that the outcome of regional analysis is to a greater or lesser extent shaped by the nature, scale and extent of pre-defined areal aggregations. It is usually the case that such areal units have been deemed fit for the purpose of data dissemination because they are convenient for administrative purposes, although zone engineering through geocomputational analysis has become established in dis- seminating some data sources, such as the UK Census of Population (see Martin 1998). Thus the outputs of regional analysis become shackled to administrative geographies that may not engender any strong cultural affinity amongst residents, with attendant negative implications for policy implementation – for example, the demise of UK regional administrations for ‘Avon’ and ‘Humberside’ in part reflects their failure to capture the imaginations of local people. This may be of little consequence in some applications – as where the focus of regional analysis has fundamentally economic drivers – yet in other applications (such as the current localism agendas doi:10.1111/pirs.12002 © 2013 The Author(s). Papers in Regional Science © 2013 RSAI Papers in Regional Science, Volume 93 Number 3 August 2014.

Transcript of Japanese surname regions

Japanese surname regions

James A. Cheshire1, Paul A. Longley2, Keiji Yano3, Tomoki Nakaya3

1 Centre for Advanced Spatial Analysis, University College London, 90 Tottenham Court Road, London W1T 4TJ,UK (e-mail: [email protected])

2 Department of Geography, University College London, Gower Street, London WC1E 6BT, UK(e-mail: [email protected])

3 Department of Geography, Ritsumeikan University, 56-1, Toji-in-kita-machi, Kita-ku, Kyoto, 603-8577 Japan(e-mail: [email protected], [email protected])

Received: 13 March 2012 / Accepted: 30 October 2012

Abstract. This paper uses an extended case study of Japan to illustrate how surnames, or familynames, can be used as a basis for regionalization. We undertake a comparison between induc-tively surname regions of Japan with areal geographies based upon both contemporary andhistorical prefecture (administrative) units. The work is seen as using highly disaggregateframework data to evaluate the integrity of the areal units that are used in regional science. It alsois relevant to understanding population distributions, past and present, and the consequences oflocal, regional and national residential mobility and migration.

JEL classification: C02, R23, R59, P25

Key words: Surnames, regionalization, geodemographics, Lasker distance, clustering

1 Introduction

It is well known that there are few, if any, inherently ‘natural’ spatial units of analysis for thestudy of population characteristics in regional science (Openshaw 1984), and that the outcomeof regional analysis is to a greater or lesser extent shaped by the nature, scale and extent ofpre-defined areal aggregations. It is usually the case that such areal units have been deemed fitfor the purpose of data dissemination because they are convenient for administrative purposes,although zone engineering through geocomputational analysis has become established in dis-seminating some data sources, such as the UK Census of Population (see Martin 1998). Thus theoutputs of regional analysis become shackled to administrative geographies that may notengender any strong cultural affinity amongst residents, with attendant negative implications forpolicy implementation – for example, the demise of UK regional administrations for ‘Avon’ and‘Humberside’ in part reflects their failure to capture the imaginations of local people. This maybe of little consequence in some applications – as where the focus of regional analysis hasfundamentally economic drivers – yet in other applications (such as the current localism agendas

doi:10.1111/pirs.12002

© 2013 The Author(s). Papers in Regional Science © 2013 RSAI

Papers in Regional Science, Volume 93 Number 3 August 2014.

of several countries) the distinctive features that together comprise local context may becatalytic to effective policy prescription and implementation.

The argument for refocusing policy analysis using areal units of analysis that are of conse-quence to citizens is a persuasive one, but may be difficult to achieve in many practicalapplications. Cultures are built upon assemblages of individuals, and thus any attempt to identifythe cultural cohesion of territory using aggregations of unique human individuals has intrinsicmerits. However, the strictures and constraints of disclosure control severely limit the range ofpopulation attributes that can be geographically referenced at this level, and additional compu-tational issues are associated with managing the problems posed by scale and aggregation. Inprevious work, we have established that the residential geography of individuals bearing dif-ferent family names provides a useful basis to regionalization in European cultures (Cheshireet al. 2011). Our premise here is that ‘a name is a statement’, not just of individual identity, butalso a means of assimilating the effects of locational proximity into culturally relevant region-alization of territory. The motivation for this paper is to develop, extend and apply this work tothe regionalization of Japan. We believe that this makes it possible to establish the extent towhich modern and historical administrative geographies reflect population structure (beyondpopulation density). In the case of a poor match we argue that conventional administrativegeographies are not appropriate for studies of population movement, cultural diffusion oraffinity and historical geography.

Regional scientists have long been aware that there are no purely analytical solutions to theregionalization problem that are independent of local, regional and national context. Cognizant ofthis, we argue that zone design in regional science should instead attempt region-building throughinductive empirical analysis of demographic indicators that distil the cumulative effects ofinter-generational socio-spatial differentiation. Moreover, we suggest that surnames (familynames) provide quantifiable indicators of regional and local affinity in Japan (and elsewhere), andthat they continue to exhibit continuity and stability in their regional patterning. The creation ofsurname regions has traditionally been undertaken in the context of population genetics with fewstudies (such as that by Longley et al. 2011) making it beyond this body of literature.

It is our belief that using surnames as a basis for region building has much to offer regionalscience, especially in the context of charting population mobility. In this paper we undertake thefirst comprehensive and automated regionalization of an Asia-Pacific country based on surnamefrequencies taken from the most complete population register available. The primary focus of ourown previous research has been upon the geographic concentrations of Anglo-Saxon names inGreat Britain, Europe and ‘new world’ countries settled in historic times from these areas (e.g.,Longley et al. 2011). Most such names were first coined between the twelfth and fourteenthcenturies and although there are exceptions (common metonyms such as Smith and others such asBrown have multiple origins), their bearers were originally heavily concentrated in particularlocalities.Avery interesting finding of this research is that, five or more centuries later, geographicconcentration in these same areas is very much the norm. Our case study of Japan is motivated bya number of important issues that arise out of this finding. First, does the geography of namingconventions in Japan suggest similar conclusions as regard to historic propensities to migratebetween regions? Second, the European work establishes measures of the spread of names overhalf a millennium, yet surnames in Japan came into common parlance much more recently: is thelater advent of names in Japan evident in their usefulness for region definition? Third, as in theWest, successive national governments have pursued various population settlement policies, andit is interesting to ascertain the degrees to which their effects are manifest in the regionaldistribution of family names. A final motivation that arises out of this, concerns the global reachof regional science as a discipline: to what extent might it be possible to regionalize the entireworld on the basis of family names, and can the international distributions of names be used togauge the magnitudes of international flows of people, current and historic?

540 J.A. Cheshire et al.

Papers in Regional Science, Volume 93 Number 3 August 2014.

As such, our objective here is to generalize the structure of the regional system, rather thanmine more specific local population histories and movements. This amounts to a comparisonbetween the ‘natural’ (that is, inductively generalized) surname regions of Japan with bothcontemporary and historical prefecture (administrative) units. The findings amount to a culturalvalidation of Japan’s administrative geography and we hope that our approach will inspiresimilar studies in other jurisdictions.

2 Geo-genealogy: Names as framework data

Human identity is defined by the distinctive characteristics of individuals or groups, andnumerous reviews have investigated the ways in which the conflation of distinctive individualidentities can be used to define neighbourhoods (e.g., Batey and Brown 2007). It can be arguedthat one of the strongest and most enduring indicators of an individual’s identity is his or hersurname, since this can often provide insights into cultural, ethnic and linguistic characteristics(Mateos et al. 2011). Couched at the level of the individual, this indicator can be deemedinvulnerable to the vagaries of scaling and aggregation. Moreover, the availability of lists ofindividuals from sources such as electoral registers and telephone directories can provide aframework for small area estimation of population size. Finally, and perhaps most importantly,the statements of identity that names offer are enduring, and provide a defining statement aboutthe uniqueness of places for reasons outlined below.

Japanese surnames typically consist of two or three Chinese letters, originally used todenote ancestral settlements, occupations, or the names of ancestral clans (Watanabe 1966).While a minority of Japanese people have long-established surnames, it was not until circa1875 (following the Meiji Revolution) that farmers and the other ‘common’ people joined thenoble and military classes in taking surnames. At this time, most individuals adopted namesthat were already in use by the nobility but, as elsewhere in the world (see Cheshire andLongley 2012), some individuals took the names of their hamlets or other geographical fea-tures. The establishment of the nationwide Jinshin-koseki in 1871, and the 1875 cabinetdecree to make surnames obligatory, initiated the process of regular transmission of a surnamefrom parent to child (Lasker 1985). The process of surname inheritance is importantas it makes it possible to benchmark the effects of subsequent migration at the regional,local and national scales. As such, intergenerational transmission of names in Japan bearsbroad similarities with most Western countries, although there are some specific nationalcustoms that vary this general rule – such as muko-yoshi, whereby the matrilineal name maybe preserved by adoption of the bridegroom-to-be by the bride’s family (Yasuda and Furusho1971).

It is widely recognized that most Japanese surnames relate to specific place names (e.g.,Niwa 2002), and that these in turn often refer to their environmental and geomorphologicalsetting – such as mountain, river, forest, paddy field, and so on (Takemitsu 1998). Yano (2007)has identified the unique structure of surnames in Okinawa (e.g., Haga, Kinjo and Oshiro) and,following migration, has traced occurrences of these distinctive names from this region to thethree largest metropolitan areas of Tokyo, Osaka and Nagoya. The geographic origins of peoplewho participated in the late nineteenth century government sponsored settlement of Hokkaidocan be traced in a similar manner. Yano’s research has also been used to measure spatialsegregation of bearers of Chinese and Korean surnames in the major Japanese metropolitanareas.

In comparison with its neighbours, such as South Korea and China, Japanese society ischaracterized by a very large number of surnames; there were for example approximately 30,000surnames in the country even before they came into common usage following the 1875 decree

541Japanese surname regions

Papers in Regional Science, Volume 93 Number 3 August 2014.

referred to above (Lasker 1985). Such an abundance of different surnames offers a unique anddetailed resource for understanding population structures both in terms of cultural and genetictransmission. As in the Anglo-Saxon world, many names are polyphyletic (that is, they havemultiple origins) but they are not present in sufficient numbers to obscure the distinctive regionalprofiles of the majority of localized names.

As will become clear, the methods employed in this paper have been devised by populationgeneticists to gauge the degree to which a population or multiple populations were character-ized by in-breeding over the centuries, as indicated by the mixing of surnames. These moti-vations are not appropriate here, not least because of the relatively late introduction ofJapanese surnames for most citizens: such research is best undertaken at the local level, asdemonstrated by Yasuda and Furusho (1971). The processes driving the transmission of geneticand cultural attributes, such as surnames, are similar in the sense that they are both passeddown from one generation to the next and only move with the people bearing them. Thecompulsory introduction of surnames for all citizens in 1875 may have been too late toestablish a reliable link between surnames and genetic attributes, but we suggest that it isunlikely to have been too late to provide a cultural benchmark, given that occurred in the earlyyears of Japan’s industrial revolution.

We therefore consider surnames as purely cultural attributes that vary over space. Thisrepresents a departure from much of the previous literature (see Colantonio 2003) becausesurnames have hitherto been considered to correspond with genetic attributes. Such variationscan be detected even given the high diversity of surnames in Japan, because most of them havelocalized origins. Many of Japan’s islands have developed their own distinctive naming con-ventions (see Katayama et al. 1978), sometimes reflecting the interventions of Buddhist priestsor other local cultural preferences (Lasker 1985). On this basis it is reasonable to assume thatJapanese surnames exhibited a clear regionalization in 1875 (when they became compulsory),and in the preceding period. This in turn legitimizes the use of contemporary data to investi-gate whether and to what extent this historic regionalization persists today, following local,regional and national population movements over the last century or so. Previous work (seeColantonio 2003) suggests that most population movements are short-distance, and that as aconsequence ‘surname regions’ should comprise contiguous spatial units (municipalities in thiscase). Where contiguity no longer persists it is anticipated that this is often indicative oflarge-scale migrations to high order city settlements (see Longley et al. 2011 for the case ofGreat Britain).

Japanese regional scientists and geographers have developed many perspectives on theregionalization problem, based upon physical features or population characteristics (e.g., Naka-mura et al. 2005; Yamamoto et al. 2006). The most widely recognized regions of Japan are,moving from north to south, Hokkaido, Tohoku, Kanto, Chubu, Kinki, Chubu, Shikoku andKyushu. Japanese administrative geography is based on prefectures which are widely consid-ered to constitute homogenous regional units with respect to culture, history and physicalenvironment. The current system of prefectures was devised in the Meiji era as part of govern-ment initiatives to abolish feudal clans or Han. Boundaries are still largely based upon historicprefectures, except in the Tohoku region.

There is a growing realization that broader geographic and cultural context needs to beviewed alongside the social similarities that are revealed by conventional geodemographicclassification. Analysis of family names may help to specify unique place effects and crystallizethe long and short-term dynamics of migration and residential mobility, viewed against abackground of stability in local and regional social structures. Place effects are an integral partof human identity, and their specification as part of geodemographic classifications and othermicro scale models to create regions offers the prospect of addressing issues of life chances andsocial mobility (see Birkin and Clarke 2011).

542 J.A. Cheshire et al.

Papers in Regional Science, Volume 93 Number 3 August 2014.

3 Surnames and region-building

This empirical contribution to regional science presented in this paper builds on growing interestin the regional analysis of surnames (e.g., Colantonio et al. 2003; Longley et al. 2011) byexamining their relationship to the historical and contemporary administrative geographies ofJapan and their importance for policy initiatives at a number of scales. As such, we seek to framemore informed regional analysis based on a ubiquitous human phenomenon. Our a prioriexpectations are that the geography of surnames is a manifestation of the flow of people andideas, mitigated by distance effects and specific physical features. By creating more contextsensitive, yet generalized, regional aggregations using surnames, this research can also contrib-ute towards our understanding of historical population movements and structures. Likethe historical prefectures of Japan, our surname regions are a manifestation of cultural interac-tions. It therefore follows that their comparison can gauge the cultural significance, if any,of modern prefectures in Japan. The results of such work will inform future researchers ofthe appropriateness of the imposed administrative spatial units when studying the Japanesepopulation.

Here we follow a process of inductive generalization to produce surname regions through thecreation, and subsequent clustering, of a dissimilarity matrix that in turn is based on thequantitative comparison of the observed mix of surnames in each municipality. The result is aregionalization of Japanese surnames in unprecedented detail, based upon within-region simi-larities and between-region differences. A two-stage process is used to create a regionalizationof surnames. We then compare our surname regions with the administrative geographies of boththe historical and contemporary prefecture divisions of Japan.

The first stage of the methodology entails the calculation of a dissimilarity matrix thatcompares the surname compositions of every municipality in Japan. Such measures have beensuccessfully used in the population genetics literature (see, e.g., Manni et al. 2008) as the basisfor the calculation of Lasker or Nei (1978) distances (see Jobling 2001; Colantonio et al. 2003).It is not uncommon to calculate both measures as each is sensitive to different aspects of thesurname distribution, such as the skew of the underlying population and the importance of smallnumbers of rare names in particular municipalities (see Manni et al. 2008). Our preliminarywork included a comparison of both measures and concluded that the Lasker distance is moreeffective in this context. Given the space constraints we therefore do not describe or report theresults from the Nei distance. Following the distance calculation, principal components analysis(PCA: see Joliffe 2002) is used indicate the relative (dis)similarities of the areal units based ontheir surname compositions.

The Lasker distance is derived from a ‘coefficient of relationship by isonymy’ (Crow andMange 1965), originally designed to establish the probability of relatedness between individualsbased on the frequency of occurrence of the same surname (Lasker 2002). In addition toapplications in the study of inbreeding within and between extended families or social groups,isonymy (sharing of the same surname) can be also be used to establish the degree of apparentrelatedness between two or more population groups at different geographic locations (Smith2002). It is this latter regional interpretation of isonymy that has gained greater currency over thelast decade because of its effectiveness as a general indicator rather than as a specific measureof genetic relatedness (Cheshire et al. 2011). Crow and Mange (1965) propose the coefficient ofrelationship by isonymy (RAB) to be half the proportion of isonymy. In the two population (onein each spatial unit) case it is defined as:

Rp p

ABiA iB

i

= ∑ 2, (1)

543Japanese surname regions

Papers in Regional Science, Volume 93 Number 3 August 2014.

where piA is the relative frequency of the ith surname in population A and piB is the relativefrequency of the ith surname in population B. Regional populations that are dissimilar to oneanother will share relatively few common surnames, resulting in very small numbers. Thus atransformed measure, the Lasker distance (Rodriguez-Larralde et al. 1994) is used here. It isdefined as:

L RAB AB= − ( )ln ,2 (2)

where RAB = (piA ¥ piB)/2. We can think of this Lasker distance as measuring distance in surnamespace, and use it as the basis for creating a (dis)similarity matrix between a set of area basedpopulation measures.

Our motivation in this paper is to compare Japan’s de jure administrative geographies, pastand present, with an inductive regionalization based upon similarities in surname structure that isa de facto indicator of structure and cohesion. This leads us to use Ward’s Hierarchical Clustering(Ward 1963) as a convenient method of partitioning Japan into the same number of regions as arisein the administrative geographies. Our own regionalization thus assumes a pre-specified numberof regions, but unlike the administrative geographies asserts the primacy of spatial clustering andwithin zone homogeneity rather than administrative constraints such as optimum population size.

Ward’s procedure forms hierarchical groups of mutually exclusive subsets in attribute spaceby incrementally minimizing the total increase in within-cluster variance (which is proportionalto the squared Euclidean distance between cluster centres) as region splitting proceeds (Székelyand Rizzo 2005). The algorithm initially assigns the n Japanese municipalities to (n – 1)exclusive sets by considering the union of all possible n(n – 1)/2 pairs, and proceeding to selectthe combination that minimizes within-cluster variance. This procedure is then repeated insubsequent iterations (Ward 1963). The resulting clusters are not necessarily optimal becausethe ‘best route’ between the clusters has the priority, and this may be achievable only through anincremental improvement in the homogeneity of a partitioned cluster. As with other hierarchicalclassifications (see Gordon 1999), a dendrogram can be used to illustrate the relationshipsbetween observations. All of the observations are joined together at the origin of the branchingstructure with nodes joining branches that terminate with individual regions. The length of thesestems (cophenetic distance) indicates the strength of the relationship between the observations.

An important advantage of Ward’s clustering is the stability of the cluster outcome overrepeated iterations: moreover, the result can be replicated across multiple runs and disaggregatedinto more clusters or aggregated into fewer clusters without affecting the overall structure. Othermethods considered here, such as K-means and partitioning around medoids (PAM) weredeemed less suitable as they do not maintain structure and rely on stochastic assignments ofinitial seed locations. It is worth noting that a range of more complex techniques have beenutilized in the context of similar datasets to those used here (Cheshire et al. 2011). It is, forexample, possible to go one step further than the ‘basic’ form of Ward’s by employing bootstrapresampling in the cluster analysis. This has been successfully used in similar contexts, such aslinguistics (or dialectometrics), and provides an indication of the stability and robustness of thecluster outcomes (Manni et al. 2008; Nerbonne et al. 2008). Bootstrapping can be integratedwith any cluster algorithm and works by creating several samples from the original data and toapply the clustering on each of these with several values of K. The cluster validity criterion isbased on the comparisons of a given statistic over the bootstrap samples for each value of K. Itcounts the frequency of the replicates falling in the region across a range of sample sizes andthen calculates the p-value by looking at the change in the frequencies along the changingsample sizes (Shimodaira 2004). The bootstrap probability (BP) value of a cluster is thereforethe frequency with which it appears in the bootstrap replicates. Early implementations of thismethod were primarily designed for phylogenic analysis but the R software environment’s

544 J.A. Cheshire et al.

Papers in Regional Science, Volume 93 Number 3 August 2014.

‘Pvclust’ package facilitates more general hierarchical clustering applications (see Suzuki andShimodaira 2006). It is our experience that more sophisticated methods are extremely compu-tationally intensive on datasets of this size (timely completion of the clustering requires parallelcomputing). Moreover, our preliminary analysis suggested that increasing methodologicalsophistication would have only minor impacts on the characteristics of the regions produced.

Clustering routines that require contiguity constraints were deliberately avoided, on the basisthat these would prevent links between non-adjacent municipalities. Such links can arise throughgovernment initiatives to encourage large-scale migration from one area to another, as forexample with increasing the settlement on the northern Japanese island of Hokaido.As a norm, wenevertheless expect contiguity to emerge from the structure of the data independent of contiguity.There is a further related advantage to hierarchical clustering because it can lead to nested regionsemerging with the first few splits in the dendrogram if there is a clear spatial structure in the data.

The modified Rand statistic was used to determine the similarity between the surnameregions produced at the municipality level and both the historical and contemporary adminis-trative prefectures for Japan. This measures the degree-of-agreement between two sets of arealpartition using the same data. We consider that a pair of individual objects (in our case,municipalities) is ‘agreed’ or ‘consistent’ between the two partitions in cases that the objects areboth found in the same segments (in our case, areal clusters or administrative regions) for eachpartition. The Rand Index was originally proposed as a proportion of consistent pairs amongstall possible pairs of objects. Its modified version proposed by Hubert and Arabie (1985) adjustsfor chance agreement and allows comparisons of partitions with different numbers. The valuesof zero and one correspond to complete independence and agreement, respectively. The modi-fied Rand Index has become a standard measure in comparing the agreements of partitions in thefield of cluster analysis (Miligan and Cooper 1986).

The index for comparing partitions A and B, R is computed as:

R =⋅ +( ) − +( )⋅ +( ) + +( )⋅ +( )[ ]

−P C C C C C C C C C C

P11 22 11 12 11 21 21 22 12 22

2 CC C C C C C C C11 12 11 21 21 22 12 22+( )⋅ +( ) + +( )⋅ +( )[ ], (3)

where P is the number of pairs (P = N(N – 1)/2: N is the number of basic spatial units), and Cijas the number of pairs belonging to same or different clusters of partition A and B defined asfollows: i = 1: pair in same clusters of partition A; i = 2: pair in different clusters of partition A;j = 1: pair in same clusters of partition B; j = 2: pair in different clusters of partition B.

The statistic is used here to provide a summary of the correspondence between the admin-istrative geographies and those provided by surnames. Our interest here extends beyond this tosimilarities in spatial configuration, and for this visual interpretation will be provided. Inputs tothe modified Rand calculation are therefore the N surname regions produced in the Wardsclustering compared to the N (either contemporary or historical) prefectures that we wish tocompare them to.

4 An empirical comparison of prefectures and surname regions

Our Japanese surname database is based on the 1,968 municipalities extant as of 1 April 2007.It was obtained from Acton Winds Co. Ltd., a Japanese direct mail company that regularlyassembles the data from telephone directories, Japanese residential maps and field collection ofname plate data at residential properties. These sources each make possible addresses assign-ment at cho-cho-aza (neighbourhood) level. The total number of household records is approxi-mately 44.9 million, almost equal to the number of households in Japan. Every Japanesemunicipality belongs to one of the 47 present day prefectures.

545Japanese surname regions

Papers in Regional Science, Volume 93 Number 3 August 2014.

The regional geographies of surnames are established by aggregating the administrativedistricts that nest within the historic prefectures as well as the present day ones. As stated above,the historic prefecture system comprised 67 prefectures, and excluded the islands of Hokkaidoand Okinawa. In general terms, the historic prefectures located far from the historic capital ofKyoto tend to be more coarse-grained, while those in the Western part of Japan are smaller. Theboundaries of the present day 47 prefectures broadly follow the historical ones, although somehistoric prefectures in Tohoku have been divided.

The Japanese surnames database records 44,993,886 occurrences of 157,227 names. As isthe case in Anglo Saxon culture, the distribution of occurrences has a very long tail, andsurnames with fewer than 20 occurrences nationally were excluded from our analysis forcomputational reasons (see Figure 1). As a result we use 44,436,703 occurrences of 51,867surnames. Systematic bias in the names recorded in the database is unlikely to be a problem asthe data are derived from near-complete adult population registers. Another filter often appliedin the context of surname research is the removal of so-called polyphyletic surnames (thoseoriginating in multiple places simultaneously). If the focus of the study were upon establishinggenetic relatedness, such surnames would incorrectly suggest relationships between individualswith entirely different ancestry. As our interest is in surnames as cultural attributes, we wouldargue that areas with polyphyletic surnames, most likely because of the actions of Buddhistpriests in the case of Japan, reflect similar cultural influences and preferences and therefore offeruseful information in this context and should not be excluded. It should also be noted that theremoval of such surnames would not be straightforward, requiring high levels of manual inputinto our automated, inductive, methodology.

The analysis was implemented in two stages. Firstly, the Lasker distances were calculated,requiring 3.5 hours of processing using 64-bit MySQL installed on an iMac with a 2.7GHzquad-core Intel Core i5 processor and 12GB of RAM. The resulting matrix was 1,968 (the numberof municipalities) by 1,968 cells (3,873,024 cells). In the second stage, these data were clusteredusing the Ward’s hierarchical algorithm in the R software for statistical computing and graphics(R Development Core Team 2011), and the modified Rand measure of association was acquiredfrom the clues package of Chang et al. (2010). The ability to create such a regionalization with adesktop computer is testament to the impact that recent advances in both computational power anddata availability have had on the ability to resolve complex and relevant problems in regionalscience. Data were plotted using a combination of R and ArcGIS 10.

Figure 2 plots the first two components of the Lasker distance matrix based on principalcomponents analysis (PCA). The distribution of points indicate a clear distinction between thebulk of the Japanese mainland and the known outliers of the Ryukyu Islands in the far south ofthe country with 79 per cent of the variation captured (Figure 2B). A more detailed look at thePCA result (Figure 2A) reveals that the Ryukyu islands surrounding (and including) Ishigaki andOkinawa are clear outliers from the bulk of the municipalities. Given their different populationhistory in comparison to the rest of Japan, this is to be expected and serves as reassurance thatthe Lasker distance provides a plausible metric for measuring the (dis)similarity of the surnamecompositions of Japanese municipalities. These differences are reflected in the clustering result,outlined below.

Ward’s hierarchical clustering routine undertaken on the Lasker distance matrix producedthe dendrogram shown in Figure 3. The dendrogram branches were then ‘pruned’ at pre-determined points (described below) in order to produce the different tessellations with therequired numbers of regions. The dendrogram itself begins to reveal how some of the structurein the data matches geographic regions in Japan. The first branch is between northern andsouthern Japan, as shown in Figure 4A. The branch, at an approximate height of 150 (on thedendrogram’s Y axis), separates the Ryukyu Islands from the surname regions of the mainland.Clusters 4 and 5 split the northern cluster into northern and central regions of Japan. The latter

546 J.A. Cheshire et al.

Papers in Regional Science, Volume 93 Number 3 August 2014.

is largely comprised of the Tokyo conurbation. Clusters 1 to 4 are shown in Figure 4B. As thelabels in Figure 3 show, the next set of branches in the dendrogram approximately identifyfurther north-south divisions within each of the clusters. Beyond this level the picture becomesmore complex as demonstrated by Figure 4D showing 16 broad but homogenous surname

Fig. 1. Maps of the administrative geographies used in this study. Map D shows the population contained within eachmunicipality after people possessing surnames with fewer than 20 occurrences across Japan have been excluded

547Japanese surname regions

Papers in Regional Science, Volume 93 Number 3 August 2014.

regions in Japan. As explained above, the process of adding new branches of the dendrogram cancontinue until each individual municipality has been assigned a unique cluster. For the purposesof this research the dendrogram was terminated at 74 groups (the number of historic prefectures)and 47 groups (the number of contemporary prefectures in Japan). As the number of clustersincreases, it is remarkable that many of the resulting small regions nonetheless remain contigu-ous, providing testimony to the strong spatial patterning of Japanese surname geography. Tokyoand its environs appear increasingly fractured, as is to be expected of a world city with residentsoriginating from throughout Japan. In this area, it may be that the municipality aggregationsmanifest the outcome of inter-regional migration flows or even the out-turn of social mobility.

Perhaps in contrast with other parts of the world, such as Great Britain (see Longley et al.2011), non-contiguous classes with representation at municipality scale in different major citiesdo not occur in Japan. In the case of Great Britain, the names that make up such classes indicatethe preponderance of distinctive ethnic minorities. Japanese urban populations instead compriselarge numbers of migrants from within the national system. The composition of urban surnamesis highly diverse and distinct from neighbouring more rural areas. In this context the LaskerDistance, and subsequent clustering, groups urban areas together and large urban areas such asTokyo, Osaka and Kyoto form a distinctive ‘urban’ cluster in the initial branches of thehierarchical classification. In the lower levels of the hierarchy, the lack of similarity betweenurban conurbations may arise from the distinctive migration histories that individual cities haveexperienced, as well as the minimal impact of international in-migration. It is possible thatremoval of names with fewer than 20 bearers at the national level removes some of thedistinctive shared characteristics of high order cities in the Japanese urban and regional system

Fig. 2. Plots of the first two dimensions from PCA performed on Lasker distance matrixNotes: Approximate regions have been labelled as they appear in the cloud of points. The main plot (A) excludes theRyukyu Islands that, as the inset (B) shows, exhibit significant variation from the majority of municipalities.

548 J.A. Cheshire et al.

Papers in Regional Science, Volume 93 Number 3 August 2014.

– their combined influence might have brought the largest urban areas closer together in termsof their Lasker distance.

The cluster structure in Figure 3 was further subdivided to form 47 and 74 groups in orderthat the regionalization can be compared with the current and historic prefectures respectively.Figure 5 shows the relative number of people with surnames that only exist in a single cluster for47 clusters (A) and 74 clusters (B). There is an apparent increase in the proportion of exclusivelyoccurring surnames for 74 clusters because the total population of each cluster is reduced whenthe number of clusters increases. Areas with large proportions of people with surnames exclu-sive to their cluster represent a tighter group in the sense that it is very different to the otherclusters. Such areas could also, however, present a conflicting view of the population withinthem. Both the Tokyo and the Ryukyu Island clusters exhibit high numbers of surnames uniqueto them but for different reasons. In the case of the former, much like other global cities, externalinfluences have meant rare surnames (to Japan) from around the world have been brought inwith waves of migrants (the full extent of this has been suppressed due to the exclusion ofsurnames with fewer than 20 occurrences in our analysis). In the case of the Southern Islands,and other remote areas such as northern Tohoku, the propensity for unique surnames (incomparison the rest of Japan) is a manifestation of their relative isolation leading to minimalsurname exchange with the mainland. Hokkaido to the north, in contrast, has a low relativefrequency of individuals with surnames exclusive to that cluster, indicating that migrants (orthose descending from them) from other parts of Japan live there.

The prefecture of Kochi provides compelling evidence for the continuing impact of naturalfeatures on both cultural and administrative boundaries. Located on the island of Shikoku, itforms a large crescent shaped surname region that aligns perfectly with both the administrativegeography and the arc of mountains. Despite the existence of multiple freeways on the island

Fig. 3. Dendrogram produced from Ward’s Hierachical Clustering of the Lasker distance matrixNotes: Major branches have been labelled with the approximate regions they produce. These are mapped in Figure 4.

549Japanese surname regions

Papers in Regional Science, Volume 93 Number 3 August 2014.

Fig. 4. Mapped regionsNotes: Corresponding to: (A) 2 clusters; (B) 4 clusters; (C) 8 clusters; and (D) 16 clusters (D). Best viewed on the digitalversion see http://spatialanalysis.co.uk/surnames.

550 J.A. Cheshire et al.

Papers in Regional Science, Volume 93 Number 3 August 2014.

and good transport links between urban areas it continues to be the case that Kochi is relativelyisolated from the rest of Japan in terms of its surname distribution. This is demonstrated by itscluster assignment and also the number of surnames within that cluster (see Figure 5) occurringnowhere else.

The modified Rand Index was also calculated in the form outlined above and yielded asimilarity score of 0.537 for the current prefectures and 0.535 for the historic prefectures thussuggesting no significant difference in agreement between modern prefectures and their histori-cal equivalents. This finding is perhaps surprising given the different purposes and origins of thetwo administrative geographies. One might expect the historical prefectures to be a closermatch, but this does not appear to be the case according to the metric. Some spatial context tothe modified Rand Index scores is provided by Figure 6, which identifies the prefectures thatdeviate most strongly from the surname regions (see also Table 1). These were identified bymeasuring the distance between each vertex of the administrative geographies and each vertexof the surname region boundaries. The further the administrative vertices are from a borderbetween surname regions the darker is the colour (see key). These maps demonstrate the closesimilarities between the two geographies, with many being identical. The largest discrepanciesbetween the old prefectures and surname regions exist in the central area of Japan (explained inmore detail below), where four prefectures failing to align with any surname region. This regionof inconsistency is also identified when comparing the modern prefectures.

Mapping the different boundaries as we have in Figure 6 enables a more subjective inter-pretation of the cluster outcome than the modified Rand Index. The maps suggest that themodern prefectures deviate slightly more from the surname regions than their historical

Fig. 5. The relative number of people with surnames that only exist in a single cluster for 47 clusters (A) and74 clusters (B)

Notes: Hokkaido to the north, for example has a low relative frequency of the population with surnames exclusive to thatcluster, indicating (correctly) that migrants from other parts of Japan live there. There is an apparent increase in theproportion of exclusively occurring surnames for 74 clusters because the total population of each cluster is reduced whenthe number of clusters increases.

551Japanese surname regions

Papers in Regional Science, Volume 93 Number 3 August 2014.

counterparts. For example, to the north of Japan two historic prefectures have been split to formthe contemporary ones and these divisions have evidently ignored cultural transitions. Lookingat the surname regions it would appear that a natural break occurs in the surname characteristicsof those regions to the north of the border between the current prefectures, and so it seems likelythat this partitioning could have been avoided if predicated upon population size arguments. Thesurname regions in the area covered by the prefectures of Nagano, Toyama and Ishikawa in theChubu region (central Japan) bear little resemblance to both the historical and modern admin-istrative geography based on Figure 6. This area has been subject to a number of boundarychanges over the years. For example, the eastern part of Toyama Prefecture (known at the timeas Niikawa) was briefly merged with Ishikawa Prefecture in 1876 before being separated to beincluded in Toyama Prefecture again. It also contains the fewest municipalities (due to arelatively sparse population) and this would have limited their configuration, perhaps exagger-ating the spatial discrepancies between the two geographies.

5 Discussion

The methodology employed here is one that has been successfully applied in a number ofcontexts, but surprisingly never in regional science. This paper marks the first attempt at

Table 1. Illustrative hypothetical input into the modified Rand Index

Prefecture Group 1 2 3 4 5 6

1 183 1 3 0 10 02 2 0 6 40 8 333 1 1 0 0 2 04 2 0 8 0 1 05 1 0 0 0 3 26 0 13 0 0 0 0

Notes: Prefecture identifiers (past or present) are the column headings and rows correspond to surname group. Matrixelements are numbers of municipalities assigned to each prefecture/surname group pair.

Fig. 6. A comparison between the surname regions: (A) the historic prefectures and (B) the contemporary prefecturesNote: Darker colours indicate greater differences.

552 J.A. Cheshire et al.

Papers in Regional Science, Volume 93 Number 3 August 2014.

utilizing quantitative surname data to address key scientific questions of regionalization, and isalso the first surname regionalization inductively produced for Japan based on a near-completepopulation register.

The regions, and the methods used to create them, represent the starting point for a wealthof future regional research. An important illustration of this potential has been demonstratedthrough the inductive comparison of Japan’s contemporary and historical administrative geog-raphy, using surnames as a means of quantifying the cultural significance of such boundaries. Atthe outset we stated it was primarily our intention to validate the ways in which Japaneseadministrative regions were constituted and have evolved, and we are aware that in the moun-tainous terrain of Japan, migration between the interstitial flatlands has been quite limited untilroad and rail routes developed 140 or so years ago. It is clear that most prefectural boundariesfollow mountain ridge lines, and that boundaries within urban areas often follow water coursesor main streets. More generally, we nevertheless feel that our analysis provides a basis to beginto use surnames to answer questions such as: what spatial granularity of locality achieves thebest trade-offs between level of citizen consensus and spatial efficiency of administration? Whatdegree of temporal stability characterizes data collection zones? What zonal attributes areassociated with the greatest within zone homogeneity? Overall, the regions appear relativelystable over time with changes presenting an important research topic in themselves, as observedsurnames are likely to reflect longer-term shifts in population structure in the past and couldinform predictions for the future. Such areas could be used to identify worthwhile studies ofmigration whilst those of stable populations across generations are of interest to the populationgeneticists – interested, for example, in the geography of susceptibility to adult T-cellleukaemia/lymphoma (ATL), which has spread from its original endemic areas in Kyusyu andOkinawa (Western Japan) to metropolitan regions (Tajima et al. 1990), or more broadly withregard to colonic cancer (see Winney et al. 2012). Internationally, names analysis makes itpossible to gauge the relative importance of physical features such as mountains and rivers andsocial divides such as those arising through street geometry in shaping regional identity indifferent parts of the world.

There are, however, a number of limitations that should be addressed in future research. Wehave made no attempt here to establish the ‘optimal’ number of surname clusters, or regions (seeNakaya 2000). Whilst methods exist to facilitate this process (outlined briefly above and inCheshire et al. 2011) they remain inappropriate insofar as they create only one slice throughJapan’s regional geography. Here we are proposing a variable regional geography that can offerdifferent insights depending on the level of interest. Without such flexibility it would not havebeen possible to alter the number of surname regions to match the number of administrativeunits, for example. Had fewer input spatial units (1,968 municipalities) been used in theclassification greater constraints would have been placed on the number of meaningful regionsthat could have been produced. The use of smaller cho-cho-aza units would have been prob-lematic because very low populations within each municipality render formal comparisondifficult. An alternative might be to select or create an input spatial unit that meets a minimumpopulation threshold. Such units are complex to create, as many automated algorithms cannotaccount for natural barriers such as the ocean between islands.

We are also aware that our comparison of different regionalizations is as yet rudimentary inspatial terms, and see the development of improved spatial statistics as a central part of futurerefinement of our methodology. An important element of the research agenda is to improve theindices used to compare different sets of regions generated through cluster analysis. Pontius andMillones (2011) have argued that spatial configuration as well as categorical composition shouldbe taken in account when using the modified Rand Index to consider the degree of consistencybetween the mapped results of cluster analysis. Improved statistics to summarize the differencesbetween boundary lengths and delineations provide one promising way forward.

553Japanese surname regions

Papers in Regional Science, Volume 93 Number 3 August 2014.

Aside from potential methodological limitations, in comparing the surname regions andhistorical prefectures it should be remembered that only contemporary surname distributionswere available to us for analysis. These are the outcome of hundreds of years of populationmovements since the historic prefectures were delineated, and much of this period predates 1875when surnames came into universal parlance. Although it is clear that very strong patterningremains, migratory processes do not occur uniformly so whilst islands and rural areas, forexample, may have been relatively unaffected, Japan’s urban conurbations (especially Tokyoand Kobe-Osaka-Kyoto) have undergone profound changes in population structure over thistime period. It is in and around these twentieth century megalopolises these we find the mostfragmented regions and where comparisons between the contemporary and historic administra-tive structures are likely to have changed the most profoundly. The national-level regionalizationand initial comparisons made here, it is hoped, will provide the foundations for more detailedstudy of such interactions.

These qualifications notwithstanding, it is nonetheless remarkable that it remains possible tocreate a detailed national regionalization using little more than the spatial distribution ofsurname frequencies. The utility of such an approach rests on the potential to liberate quanti-tative cultural research from imposed and often arbitrary administrative units. Within economicgeography bespoke regionalizations are commonplace with the likes of travel to work areas orretail market area analysis. On the basis of such regions it is possible to begin to develop insightsinto the issues highlighted earlier such as the spatial efficiency, temporal stability and withinzone homogeneity of the input spatial units to regional analysis. These in turn might assist indeveloping devolved approaches to economic reconstruction following the 11 March 2011earthquake and tsunami, for example.

In summary, the surname regions of Japan reflect enduring cultural processes that haveoccurred independently of both the historical and contemporary prefectures, with the formerbeing more reflective of this than the latter. This finding is the product of a traditional inductiveapproach to produce a regionalization of a novel ‘big’ dataset using computer intensive methodsthat have been under-utilized in regional science. In so doing it has sought to directly address theneed for regions, and the spatial units that underpin them, to be created along cultural lines, ratherthan simply for administrative convenience. Our regions bear testimony to the evolving functionalregionalization of Japan, arising out of historic and contemporary economic engagement. Theabundance of computational power and detailed spatial datasets facilitate the wide-scale utiliza-tion of the approach taken here which is one that augments regional science research.

References

Batey PW, Brown PJ (2007) The spatial targeting of urban policy initiatives: A geodemographic assessment tool.Environment and Planning A 39: 2774–2793

Birkin M, Clarke GP (2012) The enhancement of spatial microsimulation models using geodemographics. The Annalsof Regional Science 49: 515–532

Chang F, Qiu W, Zamar R, Lazarus R, Wang X (2010) Clues: An R package for nonparametric clustering based on localshrinking. Journal of Statistical Software 33: 1–16

Cheshire JA, Longley PA (2012) Identifying spatial concentrations of surnames. International Journal of GeographicInformation Science 26: 309–325

Cheshire J, Mateos P, Longley P (2011) Delineating Europe’s cultural regions: Population structure and surnameclustering. Human Biology 83: 573–598

Colantonio S, Lasker G, Kaplan BA, Fuster V (2003) Use of surname models in human population biology: A reviewof recent developments. Human Biology 75: 785–807

Crow J, Mange P (1965) Measurement of inbreeding from the frequency of marriages between persons of the samesurname. Eugenics Quarterly 12: 199–203

Gordon A (1999) Classification (2nd edn). Chapman and Hall, LondonHubert L, Arabie P (1985) Comparing partitions. Journal of Classification 2: 193–218Jobling M (2001) In the name of the father: Surnames and genetics. Trends in Genetics 17: 353–357

554 J.A. Cheshire et al.

Papers in Regional Science, Volume 93 Number 3 August 2014.

Joliffe I (2002) Principle component analysis (2nd edn). Springer, New YorkKatayama K, Toyomasu T, Matsumoto H (1978) Genetic study on the local populations in Mie Prefecture: II. Population

structure in Kamishama Island. Journal of the Anthropological Society of Nippon 87: 377–392Lasker G (1985) Surnames and genetic structure. Cambridge University Press, CambridgeLasker G (2002) Using surnames to analyse population structure. In: Postle D (ed) Naming, society and regional

identity. Leopard’s Head Press, OxfordLongley PA, Cheshire JA, Mateos P (2011) Creating a regional geography of Britain through the spatial analysis of

surnames. Geoforum 42: 506–516Manni F, Heeringa W, Toupance B, Nerbonne J (2008) Do surname differences mirror dialect variation? Human Biology

80: 41–64Martin D (1998) Automatic neighbourhood identification from population surfaces. Computers, Environment and

Urban Systems 22: 107–120Mateos P, Longley PA, O’Sullivan D (2011) Ethnicity and population structure in personal naming networks. PLoS ONE

(Public Library of Science) 6: 1–12Miligan GW, Cooper MC (1986) A study of comparability of external criteria for hierarchical cluster analysis.

Multivariate Behavioral Research 21: 441–458Nakamura K, Iwata S, Arai T, Yonekura N (eds) (2005) Geography of Japan, 1 (Physical Geography). Asakura Shoten,

TokyoNakaya T (2000) An information statistical approach to the modifiable areal unit problem in incidence rate maps.

Environment and Planning A 32: 91–109Nei M (1978) Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics

89: 583–590Nerbonne J, Kleiweg P, Heeringa W, Manni F (2008) Projecting dialect distances to geography: Boostrap clustering vs.

noisy clustering. In: Preisach C, Schmidt-Thieme L, Burkhardt H, Decker R (eds) Data analysis, machine learning,and applications. Proceedings of the 31st Annual Meeting of the German Classification Society, Springer, Berlin

Niwa M (2002) Japanese Surnames. Kobunsha, TokyoOpenshaw S (1984) The modifiable areal unit problem, CATMOG 38. GeoBooks, NorwichPontius Jr RG, Millones M (2011) Death to Kappa: Birth of quantity disagreement and allocation disagreement for

accuracy assessment. International Journal of Remote Sensing 32: 4407–4429R Development Core Team (2011) R: A language and environment for statistical computing. R Foundation for Statistical

Computing, ViennaRodriguez–Larralde A, Pavesi A, Siri G, Barrai, I (1994) Isonymy and the genetic structure of Sicily. Journal of

Biosocial Science 26: 9–24Shimodaira H (2004) Approximately unbiased tests of regions using multistep–multiscale bootstrap resampling. Annals

of Statistics 6: 2616–2641Smith MT (2002) Isonymy analysis: The potential for application of quantitative analysis of surname distributions to

problems in historical research. In: Smith MT (eds) Human biology and history. Taylor and Francis, LondonSuzuki R, Shimodaira H (2006) Pvclust: An R package for assessing the uncertainty in hierarchical clustering.

Bioinformatics 22: 1540–1542Székely GJ, Rizzo ML (2005) Hierarchical clustering via joint between-within distances: Extending Ward’s minimum

variance method. Journal Classification 22: 151–183Tajima K, The T– and B–cell Malignancy Study Group (1990) The 4th nation–wide study of adult T–cell leukemia/

lymphoma (ATL) in Japan: Estimates of risk of ATL and its geographical and clinical features. InternationalJournal of Cancer 45: 237–243

Takemitsu M (1998) Surnames and Japanese. Bungeishunju, TokyoWard J (1963) Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association

58: 236–244Watanabe M (1966) Surnames in Japan. Mainichi–Shinbun Press, TokyoWinney B, Boumertit A, Day T, Davison D, Echeta C, Evseeva I, Hutnik K, Leslie S, Nicodemus K, Royrvik EC, Tonks

S, Yang X, Cheshire J, Longley P, Mateos P, Groom A, Relton C, Bishop DT, Black K, Northwood E, ParkinsonL, Frayling TM, Steele A, Sampson JR, King T, Dixon R, Middleton D, Jennings B, Bowden R, Donnelly P,Bodmer W (2012) People of the British Isles: Preliminary analysis of genotypes and surnames in a UK controlpopulation. European Journal of Human Genetics 20: 203–210

Yamamoto S, Taniuchi T, Kanno M, Tabayashi A, Okuno T (eds) (2006) Geography of Japan, 2 (human and socialgeography). Asakura Shoten, Tokyo

Yano K (2007) GIS based Japanese family name maps and their potential in geographic information science. Proceed-ings of IPSJ SIG Computers and the Humanities 15: 47–54

Yasuda N, Furusho T (1971) Random and non–random inbreeding revealed from isonymy study: I. Small cities in Japan.American Journal of Human Genetics 23: 303–316

555Japanese surname regions

Papers in Regional Science, Volume 93 Number 3 August 2014.

Resumen. Este artículo utiliza un estudio de caso ampliado de Japón para ilustrar cómo losapellidos o los nombres familiares se pueden utilizar como base para la regionalización. Serealiza una comparación inductivamente entre regiones de apellidos de Japón con geografías deárea basadas en las dos unidades (administrativas) de prefecturas contemporáneas e históricas.El trabajo se muestra como ejemplo del uso de datos de marco altamente desagregados paraevaluar la integridad de las unidades territoriales utilizadas en ciencias regionales. Tambiénes relevante para entender la distribución de población del pasado y del presente, y lasconsecuencias de la movilidad y migración residencial local, regional y nacional.

要約:

doi:10.1111/pirs.12002

© 2014 The Author(s). Papers in Regional Science © 2014 RSAI

Papers in Regional Science, Volume 93 Number 3 August 2014.