Index Structures for Path Expressions


Extended Abstract

Tova Milo, Tel Aviv [email protected]
Dan Suciu, AT&T [email protected]

1 Introduction

In recent years there has been an increased interest in managing data which does not conform to traditional data models, like the relational or object-oriented model. The reasons for this non-conformance are diverse. On the one hand, data may not conform to such models at the physical level: it may be stored in data exchange formats, fetched from the Internet, or stored as structured files. On the other hand, it may not conform at the logical level: data may have missing attributes, some attributes may be of different types in different data items, there may be heterogeneous collections, or the data may simply be specified by a schema which is too complex or changes too often to be described easily as a traditional schema.

The term semistructured data has been used to refer to such data. The data model proposed for this kind of data consists of an edge-labeled graph, in which nodes correspond to objects and edges to attributes or values. Figure 1 illustrates a semistructured database providing information about a city.

Relational databases are traditionally queried with associative queries, retrieving tuples based on the value of some attributes. To answer such queries efficiently, database management systems support indexes for translating attribute values into tuple ids (e.g. B-trees or hash tables). In object-oriented databases, path queries replace the simpler associative queries. Several data structures have been proposed for answering path queries efficiently: e.g., access support relations [14] and path indexes [4].

In the case of semistructured data, queries are even more complex, because they may contain generalized path expressions [1, 7, 8, 16]. The additional flexibility is needed in order to traverse data whose structure is irregular, or partially unknown to the user.
For example, the following query retrieves all restaurants serving lasagna for dinner:

    select x
    from (*.Restaurant) x (Menu.*.Dinner.*.Lasagna) y

Starting at the root of the database DB, the query searches for paths satisfying the regular expression *.Restaurant and, from the retrieved nodes, searches for paths satisfying another regular expression, Menu.*.Dinner.*.Lasagna.

How are such queries evaluated? A naive evaluation that scans the whole database, traversing all possible paths and selecting those that match the patterns in the query, is obviously very expensive. As in the case of relational and OO databases, we would like to use indexes to speed up the evaluation of such queries. Index structures developed for traditional data models rely on some pre-defined database schema: e.g. relational databases index on a specific attribute of a specific relation, while object-oriented databases index on a specific path [4, 14] in the object-oriented schema (e.g. document.section.title). Hence, these index structures are not applicable to semistructured data, because the schema is missing, unavailable, or only partially known. At the other extreme, full-text indexing systems take the opposite approach: given no knowledge of the structure of the information, they index all the data. But this is still of limited use for semistructured data, where some (perhaps very partial) knowledge of the structure may be available and exploited in queries: e.g. the query above insists that a Dinner item appears inside a Menu.

Recent work has addressed the problem of efficiently evaluating path expressions on semistructured databases [2, 19, 18, 11]. But it focused mainly on deriving and using schema information to rewrite queries and guide the search. The issue of indexing was almost ignored. An exception is the dataguides of [11], which record information on the existing paths in a database, using this as an index.
However, the scope of dataguides is restricted to queries with a single regular expression: they are not adequate for more complex queries, having several regular expressions and variables, like the one above.

In this paper we propose a novel, general index structure for semistructured databases, called T-index. It improves over the previous approaches in several ways. First, T-indexes are flexible in that they allow us to trade space for generality. The class of paths associated with a given T-index is specified by a path template. For example, we can build a T-index to evaluate paths described by the template P x P y: here P can be replaced by any regular expression (P stands for "path expression"). The query above is of this form. An alternative template would be (*.Restaurant) x P y, in which the first regular expression is fixed to *.Restaurant: the corresponding T-index takes less space while being less general. Second, we show that every T-index can be efficiently constructed. Dataguides [11] require a powerset construction over the underlying database, which in the worst case can be of exponential cost; by contrast, T-indexes rely on the computation of a simulation or a bisimulation relation, for which efficient algorithms exist. Third, we offer guarantees for the size of a T-index. For example, the size of a T-index associated with a single regular expression is at most linear in that of the database (again, we contrast this with dataguides which, in the worst case, are exponential), and often, as our experiments show, it is much less. Fourth, we show that T-indexes turn out to be elegant generalizations of index structures considered previously in various contexts: dataguides for semistructured data, Pat trees for full-text indexes [12, 21], and Access Support Relations for OODBs [14].

A T-index starts by grouping database objects into equivalence classes containing objects that are indistinguishable w.r.t. a class of paths defined by a path template as described above. Computing this equivalence relation may be expensive (PSPACE-complete), so we consider finer equivalence classes defined by bisimulation or simulation, which are efficiently computable.
Next, a T-index is built from these equivalence classes, by constructing a non-deterministic automaton whose states represent the equivalence classes and whose transitions correspond to edges between objects in those classes.

While each T-index is designed for a particular class of queries (given by one template), it can be used to answer queries of more general forms. We address the problem of deciding whether a given query with generalized path expressions can be rewritten to take advantage of a given T-index. In its full generality, this problem is a generalization of the query rewriting problem [15] to the case of queries with generalized path expressions and, to the best of our knowledge, is still open. Here we have a more modest goal: we show that a certain restriction of this query rewriting problem is decidable and, moreover, that it is in PTIME for a specific class of queries which is of interest in practice. Even in this restricted form, our result has an interesting corollary: containment of regular expressions consisting of concatenations of constants and wildcards is decidable in PTIME. This comes as a surprise, because the associated deterministic automaton in this case is still exponential in the size of the regular expression.

[Figure 1: Example of a semistructured database with information on small towns in New Jersey. Labels recoverable from the figure: CityHall, Dining, Shopping, Museums, Restaurant, Cafe, Name, Menu, Hours, Westfield, Summit, Aquila, CoachStage, http://www.quintillion.com.]

Organization: In Section 2 we review the data model and query language for semistructured data and introduce the notion of path templates. To explain how T-indexes are built for such templates, we first consider in Sections 3 and 4 two specific templates and their corresponding indexes, called the 1-index and the 2-index respectively. While presenting these two cases we illustrate the details of our techniques; we then carry them over in Section 5 to the general case of T-indexes. We conclude in Section 6.

2 Review: Data Model and Query Languages

We start by reviewing the basic framework of databases and queries.

The data model: All models proposed for semistructured data consist of a labeled graph, in which nodes correspond to the objects in the database, and edges to their attributes. Unlike the relational or object-oriented data models, the labeled graph model carries both data and schema information, making it easy to represent irregular data and to treat data coming from different sources in a uniform manner [2].

Definition 2.1 We assume an infinite set $D$ of data values and an infinite set $N$ of nodes. A data graph $DB = (V, E, R)$ is a labeled rooted graph, where $V \subseteq N$ is a finite set of nodes, $E \subseteq V \times D \times V$ is a set of labeled edges, and $R \subseteq V$ is a set of root nodes. W.l.o.g. we assume that all the nodes in $V$ are reachable from some root in $R$. We will often refer to such a data graph as a database.

Path expressions: Following [16, 6], we use formulas to describe properties of the labels of the edges of data graphs. We assume a set of base predicates $p_1, p_2, \ldots$ over the domain of values $D$, and denote by $F$ the set of formulas obtained by taking boolean combinations of such predicates. We assume that satisfiability of formulas in $F$ is decidable.

A regular path expression, or path expression for short, $P$, is a regular expression over formulas in $F$. That is, $P ::= \varepsilon \mid f \mid (P \mid P) \mid (P.P) \mid P^*$. We denote by $L(P)$ the regular language defined by $P$, and by $W(P)$ the set of all words $w = a_1 \ldots a_n$ over values in $D$ s.t. there exists a word $w' = f_1 \ldots f_n \in L(P)$ and $f_i(a_i)$ holds for all $i = 1, \ldots, n$ (i.e. the set of words obtained by replacing each formula by some value that satisfies it). Using the traditional techniques for regular languages it is easy to see that the languages defined by path expressions are closed under intersection and that the emptiness problem for $W(P)$ is decidable.

Given a data graph $DB$ and a path $p = v_0 \xrightarrow{a_1} v_1 \xrightarrow{a_2} v_2 \ldots v_{n-1} \xrightarrow{a_n} v_n$ in $DB$, we say that $p$ matches the path expression $P$ iff the word $a_1 \ldots a_n$ is in $W(P)$.

For brevity, we will use the following shorthands in the sequel. The path expression $\lambda x.(x = d)$, where $d$ is a constant, is written as d; $\lambda x.\mathit{True}$ is written as _; and _* is written as *. For example, *.Restaurant.*.Name.Fridays is a regular path expression.

Queries: A query path is an expression of the form $P_1\, x_1\, P_2\, x_2 \ldots P_n\, x_n$, where the $x_i$'s are distinct variable names and the $P_i$'s are path expressions. Given a graph database $DB = (V, E, R)$, we say that the nodes $v_0, v_1, \ldots, v_n$ satisfy a query path $P_1\, x_1\, P_2\, x_2 \ldots P_n\, x_n$ if $v_0 \in R$ (i.e. $v_0$ is a root) and, for all $i = 1, \ldots, n$, there exists a path from $v_{i-1}$ to $v_i$ that matches $P_i$.
A query has the form:

    select $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$
    from $P_1\, x_1\, P_2\, x_2 \ldots P_n\, x_n$

where $1 \le i_1 < i_2 < \ldots < i_k \le n$. That is, a query consists of a query path and a set of head variables. The query in Section 1 has this form. We will often refer to a query by giving only its query path, implicitly assuming all its variables to be head variables. The answer of a query is the projection onto the indexes $i_1, \ldots, i_k$ of all tuples $(v_0, v_1, \ldots, v_n)$ that satisfy the query path.
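
The definitions above can be made concrete with a small sketch. The Python code below is an illustration only, not the evaluation strategy of any system discussed in this paper: it evaluates a query path naively, directly on the data graph. The tuple encoding of path expressions, the helper names (lab, cat, reach, evaluate) and the toy restaurant graph are all assumptions introduced for this example, and formulas are restricted to label equality and the wildcard.

```python
# Path expressions as nested tuples (an encoding assumed for this sketch):
#   ("eps",)  ("lab", predicate)  ("alt", p, q)  ("cat", p, q)  ("star", p)

def lab(value):
    """The constant d, i.e. the formula 'label equals value'."""
    return ("lab", lambda a: a == value)

ANY = ("lab", lambda a: True)   # the wildcard _
STAR = ("star", ANY)            # the shorthand *

def cat(*parts):
    """Concatenation P1.P2...Pk, nested to the left."""
    expr = parts[0]
    for p in parts[1:]:
        expr = ("cat", expr, p)
    return expr

def reach(expr, start, edges):
    """Nodes reachable from some node in `start` by a path matching `expr`."""
    kind = expr[0]
    if kind == "eps":
        return set(start)
    if kind == "lab":
        return {t for (s, a, t) in edges if s in start and expr[1](a)}
    if kind == "alt":
        return reach(expr[1], start, edges) | reach(expr[2], start, edges)
    if kind == "cat":
        return reach(expr[2], reach(expr[1], start, edges), edges)
    if kind == "star":
        seen = set(start)
        while True:                       # least fixpoint of repeated steps
            new = reach(expr[1], seen, edges) - seen
            if not new:
                return seen
            seen |= new
    raise ValueError(kind)

def evaluate(query_path, roots, edges):
    """All tuples (v1, ..., vn) satisfying the query path [P1, ..., Pn]."""
    first, *rest = query_path
    tuples = [(v,) for v in reach(first, roots, edges)]
    for expr in rest:
        tuples = [t + (v,) for t in tuples
                  for v in reach(expr, {t[-1]}, edges)]
    return set(tuples)

# Hypothetical toy database: edges are (source, label, target) triples.
edges = [
    ("root", "Dining", "d"),
    ("d", "Restaurant", "r1"), ("d", "Restaurant", "r2"), ("d", "Cafe", "c1"),
    ("r1", "Menu", "m1"), ("m1", "Dinner", "dn1"), ("dn1", "Lasagna", "l1"),
    ("r2", "Menu", "m2"), ("m2", "Lunch", "ln2"), ("ln2", "Lasagna", "l2"),
]
roots = {"root"}

# (*.Restaurant) x (Menu.*.Dinner.*.Lasagna) y
query = [
    cat(STAR, lab("Restaurant")),
    cat(lab("Menu"), STAR, lab("Dinner"), STAR, lab("Lasagna")),
]
print(sorted(evaluate(query, roots, edges)))   # [('r1', 'l1')]
```

Every call to reach scans the full edge list, which is the kind of whole-database traversal that the index structures of the following sections are meant to avoid.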

Path Templates: Relational databases create a separate index for each (relation, attribute) pair. Object-oriented databases associate a separate path index with each path in the object-oriented schema. Hence, an index can answer only a certain class of queries, for which it was designed. The index structures we describe here are also designed for given classes of queries. Such a class is specified by a query template. Formally, a query template $t$ has the form $T_1\, x_1\, T_2\, x_2 \ldots T_n\, x_n$, where each $T_i$ is either a regular path expression or one of the following two placeholders: P and F. A concrete query path q is obtained from a query template t by instantiating each of the P placeholders by some concrete path expression, and each of the F placeholders by some concrete formula. The query path thus obtained is called an instantiation of the query template t. The set of all such instantiations is denoted inst(t).

For example, consider the query template (*.Restaurant) x1 P x2 Name x3 F x4. The following three query paths are possible instantiations:

    q1 = (*.Restaurant) x1 * x2 Name x3 Fridays x4
    q2 = (*.Restaurant) x1 * x2 Name x3 _ x4
    q3 = (*.Restaurant) x1 (ε | _) x2 Name x3 Fridays x4

Given a query template t, our goal is to construct an index structure that will enable efficient evaluation of queries in inst(t). (In fact, as we shall see later, it will also assist in answering several variants of such queries.) The templates are used to guide the indexing mechanism to concentrate on the more interesting (or frequently queried) parts of the data graph. For example, if we know that the database contains a restaurant directory and that the most common queries refer to the restaurant and its name, we may use a query template such as the one above to guide the indexing process.
As another example, assume we know nothing about the database, but assume that users never ask for more than k objects on a path. Then we may take t = P x1 P x2 ... P xk, and build the corresponding index.

Before explaining how indexes are constructed for general templates, we give some intuition about the indexing process using two concrete templates, t1 = P x1 and t2 = * x1 P x2. The first is targeted at queries searching for nodes reachable from the root by some arbitrary path expression (i.e. queries of the form select x from P x, where P is any path expression). The second is targeted at queries searching for pairs of objects connected by some path matching an arbitrary path expression (i.e. queries of the form select x, y from * x P y). We call the index constructed to handle the first case a 1-index and the one for the second case a 2-index. While presenting these two cases we will illustrate the details of our techniques. Then we will carry them over to the general case, called T-index (Template index). We do not address the issue of index maintenance here and consider it only briefly in Section 6.

3 1-Indexes

The 1-index assists, given some path expression P, in finding all objects reachable from the root by a matching path. Putting this in terms of query templates, it assists in computing query paths $q \in inst(P\ x)$. The index consists of a concise description of all possible paths in DB and, for each such path, of the objects reachable by the path. Queries can then be evaluated over this compact representation, rather than on the original database.

A first attempt: A naive way (which we will soon refine) to capture information about the paths in a data graph DB is to proceed as follows. For each node v in DB, let $L_v(DB)$, or $L_v$ for short when DB is understood, be the set of words on paths from some root node to v:

$$L_v(DB) \stackrel{\mathrm{def}}{=} \{ w \mid w = a_1 \ldots a_n \text{ and there exists a path } v_0 \xrightarrow{a_1} \ldots \xrightarrow{a_n} v \text{ in } DB \text{ with } v_0 \text{ being a root node} \}$$

Next, define the language equivalence relation $v \equiv u$ on nodes in DB to be:

$$v \equiv u \iff L_v = L_u$$

We denote by [v] the equivalence class of v in DB. Clearly, there are no more equivalence classes than nodes in DB. The language equivalence is important because two nodes v, u in DB can be distinguished¹ by a query path in inst(P x) iff $u \not\equiv v$.

A naive index can be constructed as follows: it consists of the collection of all equivalence classes $s_1, s_2, \ldots$, each accompanied by (1) an automaton/regular expression describing the corresponding language, and (2) the set of nodes in the equivalence class. We call this set the extent of $s_i$, and denote it by $extent(s_i)$. Given the naive index, a query path of the form P x can be evaluated by iterating over all the classes $s_i$ and, for each class, testing whether the language of that class has a nonempty intersection with $W(P)$.
The answer of the query is the union of all the extents $extent(s_i)$ for which this intersection is not empty.

¹ By distinguished we mean that one node belongs to the query's answer while the other does not.

This naive approach is inefficient, for two reasons:

• Construction cost: the construction of the index is very expensive, since computing the equivalence classes for a given data graph is a PSPACE-complete problem [22].
• Index size: the automata/regular expressions associated with different equivalence classes have overlapping parts which are stored redundantly. This also results in inefficient query evaluation, since we have to intersect $W(P)$ with each regular language.

We next address these problems. To tackle the construction cost we introduce the notion of approximation. We call an equivalence relation $\approx$ an approximation if it has the property:

$$v \approx u \implies v \equiv u \qquad (1)$$

As we shall see, any approximation is fine for constructing 1-indexes, as long as it is efficiently computable: we illustrate below two examples of approximations. The basic idea to tackle the index size was introduced in [19], and consists of a more concise representation for the languages of $s_1, s_2, \ldots$, based on finite state automata. A novelty here over [19] is the use of a non-deterministic automaton to get a more compact structure.

Approximations: We discuss here two choices for approximations $\approx$ of $\equiv$: bisimulation, $\approx_b$, and simulation, $\approx_s$. Both are discussed extensively in the literature [17, 20, 13]. The idea that these can be used to approximate language equivalence dates back to the modeling of reactive systems and process algebras [13]. For completeness, we review their definitions in the Appendix.

Both $\approx_b$ and $\approx_s$ are approximations, i.e. satisfy Equation (1). In fact we have: $v \approx_b u \implies v \approx_s u \implies v \equiv u$. The implications are strict: this is illustrated in Figure 4, where $x \equiv y \equiv z$, $x \not\approx_s y \approx_s z$, and $x \not\approx_b y \not\approx_b z$. Moreover, $\approx_b$ is easiest to compute ($O(m \log n)$), followed by $\approx_s$ ($O(mn)$), then by $\equiv$ (PSPACE).

In constructing our indexes we will use either a bisimulation or a simulation. The reader may wonder how much we lose in practice by using an approximation instead of $\equiv$. The answer is: not much. In fact, for tree data graphs the three coincide.
We prove a slightly more general statement. Let us say that a database DB has unique incoming labels if for any node x, whenever a, b are labels of two distinct edges entering x, then $a \neq b$. In particular, tree databases have unique incoming labels. We prove in the Appendix:

Proposition 3.1 If DB is a graph database with unique incoming labels, then $\equiv$, $\approx_s$, and $\approx_b$ coincide.
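
As an illustration of the approximations discussed above, the following Python sketch computes bisimulation classes by naive partition refinement. It rests on assumptions that are not taken from this section: the definitions live in the Appendix, so the sketch assumes the backward form of bisimulation (roots are only related to roots, and related nodes have matching incoming edges from related nodes), which is the form that fits languages of incoming paths. The function name and both toy graphs are hypothetical, and the loop is a simple quadratic refinement, not the $O(m \log n)$ algorithm cited in the text.

```python
def bisimulation_classes(nodes, edges, roots):
    """Coarsest backward bisimulation, by naive partition refinement.

    Returns a dict mapping each node to a block id; nodes with equal ids
    are bisimilar. Each round costs O(|nodes| * |edges|).
    """
    block = {v: int(v in roots) for v in nodes}   # start: roots vs. the rest
    while True:
        ids, refined = {}, {}
        for v in nodes:
            # Signature: current block plus (label, block) of incoming edges.
            incoming = frozenset((a, block[s]) for (s, a, t) in edges if t == v)
            refined[v] = ids.setdefault((block[v], incoming), len(ids))
        if len(ids) == len(set(block.values())):  # no block was split
            return refined
        block = refined

# A tree: every node has unique incoming labels, so bisimulation and
# language equivalence coincide (cf. Proposition 3.1).
tree_nodes = ["r", "1", "2", "3", "4", "5"]
tree_edges = [("r", "t", "1"), ("r", "t", "2"),
              ("1", "a", "3"), ("2", "a", "4"), ("2", "b", "5")]
cls = bisimulation_classes(tree_nodes, tree_edges, roots={"r"})
print(cls["1"] == cls["2"], cls["3"] == cls["4"], cls["3"] == cls["5"])
# True True False

# A DAG where x and y are reached by the same words {a.b, c.b} but are
# not bisimilar: bisimulation is strictly finer than language equivalence.
dag_nodes = ["r", "p", "q1", "q2", "x", "y"]
dag_edges = [("r", "a", "p"), ("r", "c", "p"),
             ("r", "a", "q1"), ("r", "c", "q2"),
             ("p", "b", "x"), ("q1", "b", "y"), ("q2", "b", "y")]
cls2 = bisimulation_classes(dag_nodes, dag_edges, roots={"r"})
print(cls2["x"] == cls2["y"])   # False
```

In the second graph y violates the unique incoming labels condition (two distinct b edges enter it), which is why the two relations can differ there.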

1-Indexes: We can now define 1-indexes. Given a database DB and an approximation $\approx$ (i.e. an equivalence relation satisfying Equation (1)), we construct a rooted labeled graph $I(DB)$ as follows. Its nodes are the equivalence classes $s_1, s_2, \ldots$, i.e. each $s_i$ is an equivalence class [v] (w.r.t. $\approx$) for some node v in DB. $I(DB)$ has an edge $s_i \xrightarrow{a} s_j$ iff DB contains an edge $v \xrightarrow{a} v'$ for some $v \in s_i$, $v' \in s_j$. Finally, the roots are the equivalence classes of DB's roots, i.e. all the [v] where v is a root of DB. Thus, the regular languages which previously had to be stored explicitly for each equivalence class $s_i$ are now implicitly given as $L_{s_i}(I(DB))$.

We call $I(DB)$ the 1-index of DB, and when DB is clear from the context we omit it and simply write I. We store a 1-index as follows. First, we associate an oid s with each node in I, and store I's graph structure in a standard fashion. Second, we record for each node s the nodes in DB belonging to that equivalence class, which we denote extent(s). That is, if s is an oid for [v], then extent(s) = [v]. The space for I incurs two costs: the space for the graph I, and that for the extents. The graph is at most as large as the data graph DB, but we will argue that in practice it may be much smaller. The extents together contain exactly as many nodes as DB: this may be acceptable for 1-indexes, but will become too costly for more complex indexes, discussed later. We describe in the Appendix techniques for reducing the total size of all extents.

Evaluating Query Paths with 1-Indexes: We describe now how to evaluate a query path P x. Rather than evaluating it on the data graph DB, we evaluate it on the index graph $I(DB)$. Let $\{s_1, s_2, \ldots, s_k\}$ be the set of nodes in $I(DB)$ that satisfy the query path. Then the answer of the query on DB is $extent(s_1) \cup extent(s_2) \cup \ldots \cup extent(s_k)$. The correctness of this algorithm follows from the following proposition, whose proof is in the Appendix:

Proposition 3.2 Let $\approx$ be an approximation (i.e. an equivalence relation satisfying Equation (1)) on DB.
Then, for any node v in DB, $L_v(DB) = L_{[v]}(I(DB))$.

The complexity of evaluating a query q = P x on any graph is proportional to the size of the graph. In fact it is polynomial in the size of the graph, the query path, and the complexity of computing the truth value of unary formulas in F. Since the index is likely to be smaller than the database DB, evaluating the query on the index rather than on the database yields better performance. Note that nodes in the index graph may have many outgoing edges. This is because an equivalence class may contain many nodes, and the outgoing edges of the class node are the union of all their outgoing edges. To make the computation faster, these edges can be further indexed (e.g. by hashing or by using a B-tree on the labels), so that the selection of edges with specific labels is faster.

[Figure 2: A data graph (a) and its 1-index (b). The figure shows nodes numbered 1 to 13 and edges labeled t, a, b, c, d.]

Example 3.3 Figure 2 (a) illustrates a fragment of a database with tuples of an irregular structure (we dropped the values of the attributes). Its 1-index is shown in Figure 2 (b). When evaluating a query q = t.a x we follow the two t.a paths (rather than the 5 in the original database), and take the union of their extents: $\{7, 13\} \cup \{8, 10, 12\}$. If attribute values are added, then the current leaves will have outgoing edges representing them. Typically there are many possible values (hence outgoing edges) for an attribute, so we will index these outgoing edges (e.g. using a B-tree). When searching for a specific value, e.g. t.a.7, we follow the two t.a paths, then in each of them use the corresponding B-tree to identify the outgoing 7 edge.

The Size of a 1-Index: The storage of a 1-index consists of the graph I and the sum of all extents. As explained above, both of them are bounded in size by the size of the database (up to a constant factor). Since query paths are now computed on the index graph I rather than on DB, the smaller I is relative to DB, the greater the improvement in performance. On the experimental side, we tested the technique on a variety of databases and obtained very encouraging results, showing that in common scenarios I is significantly smaller than DB. A brief discussion of the experiments is given in the Appendix, where we also describe three simple implementation techniques to further reduce the storage size for both the graph I and the associated extents. On the theoretical side, we identify here two parameters which may cause the size of I to approach its upper bound. These are: (1) a large number of distinct labels in DB, and (2) the existence of very long acyclic paths.
We prove here that, by imposing limits on these parameters, the upper bound on the size of I is independent of that of DB. Technically this is one of the hardest results in this paper, and we believe it is valuable in focusing future research aimed at reducing the index size.

Formally, for a database DB and a number k, we say that DB is "k-short" if there are no simple² paths of length > k. For example, trees of depth $\le k$ are k-short. Some important instances of semistructured databases are in practice k-short, for some small k. Namely, many web sites have the following structure: they start as a tree of depth d, then add back links, which always point back to some ancestor of the current page, and a navigation bar, consisting of p links to p distinguished pages in the web site; importantly, every page having a navigation bar refers to the same set of p distinguished pages. It is easy to see that such a database is $(d + p(d-1))$-short. In practice, both d and p are very small, even if the web site itself is large.

Theorem 3.4 Let DB be a k-short database having at most p distinct labels, and let $\approx$ be any approximation which is at least as coarse as a bisimulation³. Then the size of I is bounded by some number depending only on k and p, and is independent of the size of DB.

The proof is sketched in the Appendix.

Connection to Related Work: Data Guides. In [19] and [11], the authors proposed for the first time a method for extracting all the possible path information from a given database DB and describing it as a concise labeled graph called a dataguide. In their approach they insist that each path in the data be represented at most once in the dataguide: this implies that the dataguide, when viewed as a finite state automaton, is deterministic. In fact, a dataguide G for DB is any deterministic automaton which generates the same words as DB. Here, and in the following discussion, both DB and G are viewed as automata by taking their roots as initial states and all their nodes as final states.
However, [11] observes that not every dataguide is appropriate for answering queries, because in general there exists no clear correspondence between states in G and sets of nodes in DB (our extents). They therefore consider only dataguides having certain properties, which they call strong dataguides. For any DB there exists exactly one strong dataguide G, namely the standard powerset automaton construction on DB. The correspondence between nodes in G and nodes in DB is now explicit, since each node in G is a set of nodes in DB: this relationship is similar to our extents. However, unlike in our 1-indexes, the extents of a strong dataguide may overlap. Hence, the storage size for dataguides is larger than that for 1-indexes, for two reasons: (1) the size of the dataguide graph may be as large as exponential in that of the database, while the 1-index is at most linear, and (2) the total size of all extents in a dataguide may be as large as exponential in that of the database, due to overlaps, while for 1-indexes it is again linear in the size of DB. We believe that one of the main contributions of our work is to identify that, by relaxing the determinism requirement imposed on dataguides, 1-indexes can be constructed and stored more efficiently, while at the same time achieving a similar performance. We pinpoint the relationship between dataguides and 1-indexes in the following proposition. (Proof omitted.)

Proposition 3.5 Let $\approx$ be any approximation relation on the nodes of a database DB (i.e. $\approx$ satisfies Equation (1)), and let I be the 1-index constructed on DB using $\approx$. Then the deterministic automaton built from I by the standard powerset construction coincides with the strong dataguide.

Thus, 1-indexes are non-deterministic alternatives to dataguides. Moreover, the two coincide on tree databases (because in this case I, when viewed as an automaton, is deterministic).

² A simple path is a path which does not go through the same node twice.
³ That is, $u \approx_b v \implies u \approx v$.

4 2-Indexes

In this section we describe index structures for answering queries of the form select x, y from * x P y, where P can be any regular path expression. The template representing these queries is * x1 P x2. We again use language equivalence to form equivalence classes of nodes. But here we are interested in pairs of nodes (matching x1 and x2), so we will consider the language between pairs of nodes. Formally, define

$$L_{(v,u)}(DB) \stackrel{\mathrm{def}}{=} \{ w \mid w = a_1 \ldots a_n \text{ and there exists a path } v \xrightarrow{a_1} \ldots \xrightarrow{a_n} u \text{ in } DB \}$$

We write $L_{(v,u)}$ when DB is clear from the context. Now, define two pairs to be equivalent, $(v,u) \equiv (v',u')$, iff $L_{(v,u)} = L_{(v',u')}$, and let $[(v,u)]$ denote the equivalence class of $(v,u)$.
As before, computing $\equiv$ is prohibitively expensive, so we consider (efficiently computable) approximations $\approx$, satisfying:

$$(v,u) \approx (v',u') \implies (v,u) \equiv (v',u') \qquad (2)$$

As in the case of the 1-index, it is possible to define efficient approximations $\approx$ using variants of the simulation or bisimulation relations. (Details are omitted for lack of space.) Then, we define the 2-index $I^2(DB)$ of DB to be the following rooted graph.

Its nodes are the equivalence classes (w.r.t. $\approx$), $[(v,u)]$; the roots are all the equivalence classes of the form $[(x,x)]$; finally, there is an edge $s \xrightarrow{a} s'$ iff there exist $v, u, u'$ s.t. $(v,u) \in s$, $(v,u') \in s'$, and DB contains an edge $u \xrightarrow{a} u'$. Besides the graph $I^2$ itself, we also store, for each state s, the extent of s, consisting of all pairs $(v,u)$ in the equivalence class s.

Proposition 3.2 now becomes: $L_{(v,u)}(DB) = L_{[(v,u)]}(I^2(DB))$. Note that the $L_{(v,u)}(DB)$ on the left represents the paths between v and u in the database DB, while the $L_{[(v,u)]}(I^2(DB))$ on the right represents the paths, in the 2-index $I^2(DB)$, between some root of the index and $[(v,u)]$.

Query evaluation with 2-indexes proceeds similarly to that with 1-indexes, with a small modification: to compute select x, y from * x P y, we compute the query path P y on $I^2$ and take the union of the extents. Note that this saves the * search: rather than searching for P from all the nodes in DB, in the index it suffices to look for P paths starting at the roots. These are often fewer than the nodes in DB. For example, in acyclic databases, $I^2$ has a single root, because⁴ $(u,u) \equiv (v,v)$ for all nodes $u, v \in DB$. Figure 3 shows the 2-index (without extents) for the database in Figure 2 (a). It has a single root: the top node. The query select x, y from * x a y is evaluated by traversing the outgoing a edges of that root.

[Figure 3: A 2-index for the data graph. Edge labels recoverable from the figure: t, a, b, c, d.]

As for 1-indexes, the storage of a 2-index consists of two parts: the graph and the extents. Both are now (at worst) quadratic in the size of DB. Again, while this guarantees that querying the index will not take more than querying the database, we would like to keep the index as small as possible. Our experiments (described briefly in the Appendix) indicate that in practice the index size is far smaller than this upper bound, thus providing a significant improvement in query evaluation.
A number of implementation techniques for further reducing the size of 2-indexes are also available, but they are beyond the scope of this paper, and are only mentioned briefly in the Appendix. On the theoretical side, Theorem 3.4 can be extended to 2-indexes to obtain upper bounds on the size of the graph of $I^2$ which are independent of the size of DB: we omit this for lack of space.

⁴ This remains true if we replace $\equiv$ with $\approx_b$ or $\approx_s$.
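
To make the construction of this section more tangible, here is a hedged Python sketch of one possible way to build a 2-index. Since the text omits the details of the approximation on pairs, the sketch makes its own assumption: it forms a "pair graph" whose nodes are the pairs (v, u) with u reachable from v, whose edges are (v, u) -a-> (v, u') for each edge u -a-> u' of DB, and whose roots are the pairs (x, x), and then applies to it the same quotient construction that defines the 1-index in Section 3, using a backward bisimulation computed by naive partition refinement. All function names and the toy database are hypothetical, and path expressions are restricted to sequences of constant labels.

```python
from collections import defaultdict

def pair_graph(nodes, edges):
    """Pairs (v, u) with u reachable from v, linked as described above."""
    succ = defaultdict(list)
    for s, a, t in edges:
        succ[s].append((a, t))
    pairs, pair_edges = set(), set()
    for v in nodes:
        seen, stack = {v}, [v]
        while stack:                          # depth-first search from v
            u = stack.pop()
            for a, t in succ[u]:
                pair_edges.add(((v, u), a, (v, t)))
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        pairs |= {(v, u) for u in seen}
    return pairs, pair_edges, {(v, v) for v in nodes}

def bisimulation_classes(nodes, edges, roots):
    """Coarsest backward bisimulation, by naive partition refinement."""
    block = {v: int(v in roots) for v in nodes}
    while True:
        ids, refined = {}, {}
        for v in nodes:
            incoming = frozenset((a, block[s]) for (s, a, t) in edges if t == v)
            refined[v] = ids.setdefault((block[v], incoming), len(ids))
        if len(ids) == len(set(block.values())):
            return refined
        block = refined

def quotient(nodes, edges, roots, block):
    """Index graph over the classes, plus the extent of each class."""
    index_edges = {(block[s], a, block[t]) for (s, a, t) in edges}
    index_roots = {block[r] for r in roots}
    extent = defaultdict(set)
    for v in nodes:
        extent[block[v]].add(v)
    return index_edges, index_roots, dict(extent)

def follow(labels, index_edges, index_roots):
    """Index nodes reached from the roots along the given label sequence."""
    states = set(index_roots)
    for label in labels:
        states = {t for (s, a, t) in index_edges if s in states and a == label}
    return states

# Hypothetical acyclic database.
nodes = ["r", "1", "2", "3", "4", "5"]
edges = [("r", "t", "1"), ("r", "t", "2"),
         ("1", "a", "3"), ("2", "a", "4"), ("2", "b", "5")]

pairs, pair_edges, pair_roots = pair_graph(nodes, edges)
block = bisimulation_classes(pairs, pair_edges, pair_roots)
index_edges, index_roots, extent = quotient(pairs, pair_edges, pair_roots, block)

# select x, y from * x a y: follow the a edges leaving the index roots.
hits = follow(["a"], index_edges, index_roots)
answer = set().union(*(extent[s] for s in hits))
print(len(index_roots), sorted(answer))   # 1 [('1', '3'), ('2', '4')]
```

On this acyclic example all pairs (x, x) fall into one class, so the index has a single root, in line with the observation made above; the 14 reachable pairs collapse into 6 index nodes.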

Connection to Related Work: Patricia Trees. We conclude this section by briefly explaining the relationship to full-text indexing mechanisms, and in particular to Pat trees [12, 21], whose purpose is to assist in evaluating regular expressions over large text files. A Pat tree is a Patricia tree [9, 12] constructed over all the possible suffixes of a text (viewing the text as infinitely long), as follows. The root node has one outgoing edge for each character in the file. Each of its children, say the one corresponding to the letter k, has one child for each character following that letter, e.g. the children may correspond to ka, kb, kc, ... These nodes in turn have one child for each continuation of that group of two characters, etc. If a node has only one child, that child is deleted, and the node is annotated with the number of descendants being omitted. The leaves of the tree point back into the data, to the beginning of the corresponding strings.

There exists a close relationship between Pat trees and 2-indexes, if we view a file consisting of a sequence of characters $a_1, a_2, \ldots$ as a graph database DB having a single long chain: $v_1 \xrightarrow{a_1} v_2 \xrightarrow{a_2} \ldots$ Here the 2-index for DB is a tree (note that the discussion above implies that the 2-index has a single root). The Pat tree can be obtained from the 2-index by performing some of the optimizations presented in the Appendix (namely (1) keeping only the x values in the extents, (2) skipping nodes and pointing back to the data whenever the descendants form a long chain, and (3) keeping extents only in leaf nodes).

5 T-Indexes

The 1-index and 2-index represent all the paths in the database (or all the paths from the root, in the case of the 1-index); hence, if the path structure is very irregular, the index may become too large and hence inefficient. Further performance improvement can be obtained if we restrict the class of queries which the index supports.
This general principle has been applied successfully to relational and object-oriented databases, where indexes are specific to one attribute or to one fixed path. To illustrate with an example from semistructured data, consider the repository of cities in Figure 1. Assume that a high percentage of the query mix has the form select $x_2$ from $*.\text{Restaurant}\; x_1\; R\; x_2$, where $R$ is some arbitrary path expression: that is, the query conforms to the template $*.\text{Restaurant}\; x_1\; P\; x_2$. Rather than indexing all the paths, it is more convenient to index only those having a Restaurant incoming edge. Another example is the case where most of the information in the database has a fixed,

pre-defined structure, and only certain components are irregular. For example, consider the relation Restaurants(Name, Phone, Menu): Name and Phone have a fixed structure, while the Menu attribute has a complex structure that differs from one restaurant to the other. We want to use standard optimization and indexing techniques for the structured parts, and focus our novel indexing mechanisms on the Menu part, where the standard ones do not apply.

We show here how the principles underlying the 1- and 2-indexes can be extended to more flexible index structures, capturing the above and generalizing relational indexes, object-oriented path indexes, as well as 1- and 2-indexes. For the remainder of this section we consider a query template $t = T_1\, x_1\, T_2\, x_2 \ldots T_n\, x_n$, where each of the $T_i$'s is either a path expression or a place holder $P$ or $F$. We build an index structure, called a T-index, to assist in answering queries $q \in inst(t)$. Before going into the definition of the index, we would like to point out that T-indexes both generalize and specialize 1- and 2-indexes, in certain ways. The generalization comes from the fact that both 1- and 2-indexes are particular cases of T-indexes (see below). But T-indexes also specialize 1- and 2-indexes, because of the following intuition. Suppose we build a T-index for a template $t$, and then want to evaluate a query $Q$ = select $x$ from $P\, x$. We can always use a 1-index to evaluate $Q$, but we can use the T-index only if the path expression $P$ is in some sense "compatible" with the $T_1.T_2 \ldots T_n$ path in $t$: thus T-indexes reduce the class of path expressions they can evaluate. We will discuss below how to test whether a given query can be evaluated using a T-index.

Definitions In the case of 1- and 2-indexes we defined the language equivalence to be the equivalence relation on nodes (resp. on pairs of nodes) in $DB$. We want to proceed similarly for arbitrary templates $t$.
The difference is that here a query binds the variables $x_1, \ldots, x_n$ in some order, hence it makes sense to talk about identifying tuples of nodes corresponding to subsets of these $n$ variables. We make a choice, and impose the evaluation strategy in which the variables $x_1, \ldots, x_n$ are searched and bound in this order. This leads to the definition below. First, some notation: given a tuple $(v_1, \ldots, v_i)$ of nodes in $DB$, we use $\hat{L}_j$, $j = 1, \ldots, i$, to denote the language $L_{(v_{j-1}, v_j)}$ for $j \geq 2$, and the language $L_{v_1}$ for $j = 1$.

Definition 5.1 Let $t = T_1\, x_1 \ldots T_n\, x_n$ be a path template. Let $\$, S_1, \ldots, S_n$ be new data values not in $D$. ($D$ is the domain of data values from Definition 2.1.) For any $i$-tuple $(v_1, \ldots, v_i)$ of nodes in $DB$, $i = 1, \ldots, n$, we define $T_{(v_1, \ldots, v_i)}(DB)$ to be the following

regular language over the alphabet $D \cup \{\$, S_1, \ldots, S_n\}$:

$$T_{(v_1, \ldots, v_i)}(DB) \stackrel{\mathrm{def}}{=} R_1.\$.R_2.\$ \ldots R_i,$$

where the $R_j$'s, $j = 1, \ldots, i$, are the regular expressions below:

• If $T_j = P$ (path template), then $R_j \stackrel{\mathrm{def}}{=} \hat{L}_j$.
• If $T_j = F$ (formula template), then $R_j \stackrel{\mathrm{def}}{=} \hat{L}_j \cap D$. That is, $R_j$ is the set of labels on all edges from $v_{j-1}$ to $v_j$ (where $v_0$ is a root).
• If $T_j = P_j$ (constant path expression): if $\hat{L}_j \cap W(P_j) \neq \emptyset$ then $R_j \stackrel{\mathrm{def}}{=} S_j$, otherwise $R_j \stackrel{\mathrm{def}}{=} \emptyset$.

Finally, for two $i$-tuples $(v_1, \ldots, v_i)$ and $(u_1, \ldots, u_i)$ we define the language-equivalence relation: $(v_1, \ldots, v_i) \equiv (u_1, \ldots, u_i)$ iff $T_{(v_1, \ldots, v_i)}(DB) = T_{(u_1, \ldots, u_i)}(DB)$. The equivalence class of $(v_1, \ldots, v_i)$ is denoted $[(v_1, \ldots, v_i)]$.

As before, two tuples $(v_1, \ldots, v_n)$, $(u_1, \ldots, u_n)$ in $DB$ can be distinguished by a query path $P_1\, x_1 \ldots P_n\, x_n$ in $inst(t)$ iff $(v_1, \ldots, v_n) \not\equiv (u_1, \ldots, u_n)$. The goal of the $\$$ and the new $S_i$ symbols is to pinpoint the range of each of the path terms in the query (and in particular those that match the constant path expressions in the template), and thus determine the assignments of nodes to the query variables. This issue will be further clarified below.

Here again, computing $\equiv$ is expensive, so we consider approximations $\approx$ that can be computed efficiently and satisfy:

$$(v_1, \ldots, v_i) \approx (u_1, \ldots, u_i) \implies (v_1, \ldots, v_i) \equiv (u_1, \ldots, u_i) \qquad (3)$$

As for the case of 1- and 2-indexes, it is possible to define efficient approximations using variants of the traditional simulation and bisimulation relations. (Details omitted.)

Given such an approximation $\approx$, the T-index $I_t(DB)$ for $t$ is the following rooted graph:

Nodes - The nodes include all the equivalence classes (w.r.t. $\approx$) $[(v_1, \ldots, v_i)]$, $i = 1, \ldots, n$.
Also, for each such class we introduce an additional new node, which we denote $[(v_1, \ldots, v_i)]^{\$}$.

Edges - We have edges labeled by $\$$ from each node $[(v_1 \ldots v_{i-1}, v_i)]^{\$}$, $1 \leq i < n$, to $[(v_1 \ldots v_{i-1}, v_i, v_i)]$. Additionally, each $T_i$ in the template $t = T_1\, x_1 \ldots T_n\, x_n$ introduces some edges, depending on its structure:

1. If $T_i = P$, then for each edge $v_i \xrightarrow{a} v'_i$ in $DB$, $I_t$ has an edge $[(v_1 \ldots v_{i-1}, v_i)] \xrightarrow{a} [(v_1 \ldots v_{i-1}, v'_i)]$. Additionally, each $[(u_1 \ldots u_i)]$ has an edge to $[(u_1 \ldots u_i)]^{\$}$ labeled by a special symbol, which is treated as an epsilon move when the index is traversed.
2. If $T_i = F$, then for each edge $v_i \xrightarrow{a} v'_i$ in $DB$, $I_t$ has an edge $[(v_1 \ldots v_{i-1}, v_i)] \xrightarrow{a} [(v_1 \ldots v_{i-1}, v'_i)]^{\$}$.

3. If $T_i = P_i$ (i.e. a constant path expression), then for each node $[(v_1 \ldots v_{i-1}, v_i)]$ and every $v'_i$ s.t. $L_{(v_i, v'_i)} \cap W(P_i) \neq \emptyset$, $I_t$ contains an edge $[(v_1 \ldots v_{i-1}, v_i)] \xrightarrow{S_i} [(v_1 \ldots v_{i-1}, v'_i)]^{\$}$, where $S_i$ is a new symbol.

Root nodes - The roots are all the nodes $[(v)]$ where $v$ is a root of $DB$.

Terminal nodes - Unlike graph databases and 1- and 2-indexes, here we distinguish terminal nodes: these are all nodes of the form $[(v_1, \ldots, v_n)]^{\$}$.

Finally, we remove all nodes not reachable from a root or not having an outgoing path to a terminal node, and associate with each terminal node $[(v_1, \ldots, v_n)]^{\$}$ the extent containing all tuples in $[(v_1, \ldots, v_n)]$ (see footnote 5).

Example 5.2 Consider the template $t = (*.\text{Restaurant}.*.\text{Menu})\; x\; P\; y$. The equivalence classes are the following. For single nodes $u$ there are exactly two classes $[(u)]$: the first, $s_1$, contains all nodes $u$ reachable from a root via a path matching $*.\text{Restaurant}.*.\text{Menu}$, and the second, $s_2$, contains all the other nodes. Considering pairs next, the equivalence classes are now sets of pairs $(u, v)$ for which $u \in s_1$ and which have the same language $L_{(u,v)}$; in addition there are similar equivalence classes for pairs $(u, v)$ with $u \in s_2$.

$I_t$ has one transition $s_1 \xrightarrow{S_1} s_1^{\$}$, continued with $s_1^{\$} \xrightarrow{\$} [(u, u)]$ for all $u \in s_1$; it has arbitrary transitions $[(u, v)] \xrightarrow{a} [(u, v')]$ for all edges $v \xrightarrow{a} v'$ in $DB$ and all $u \in s_1$; and finally it has transitions from $[(u, v)]$ to $[(u, v)]^{\$}$, labeled by the special symbol and ending in a terminal state. Note that $s_2$ has no outgoing edges, hence all nodes $[u]$, $[(u, v)]$ with $u \in s_2$ are removed from the graph. The resulting T-index looks like a 2-index that considers only the data reachable by a $*.\text{Restaurant}.*.\text{Menu}$ path.

Observe that every path from a root to a terminal node traverses exactly $n - 1$ $\$$-edges. We define $L_{[(u_1, \ldots, u_i)]}$ to be the language describing paths from the root to $[(u_1, \ldots, u_i)]$, with the slight modification that the special symbols are interpreted as epsilon moves (i.e.
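they are omitted from the strings).

Evaluating Query Paths with T-Indexes In the simplest scenario the query matches the template completely, i.e. $q = R_1\, x_1 \ldots R_n\, x_n \in inst(t)$. First, let $P_q \stackrel{\mathrm{def}}{=} P'_1.\$ \ldots \$.P'_n$, where $P_i$ denotes the $i$-th path expression of $q$ (written $R_i$ above) and:

$$P'_i \stackrel{\mathrm{def}}{=} \begin{cases} S_i & \text{when } T_i \text{ is a constant} \\ P_i \cap (\neg \$)^{*} & \text{when } T_i \text{ is } P \text{ or } F \end{cases}$$

Footnote 5: As in the case of 1- and 2-indexes, when nodes in $I_t$ have many outgoing edges, we can further index their labels.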

Then evaluate the query path $P_q\, x$ on $I_t$, interpreting the special-symbol edges as epsilon moves. Since $P_q$ has exactly $n - 1$ $\$$-signs, all the retrieved nodes are of the form $[(v_1, \ldots, v_n)]^{\$}$. The answer to the query is the union of the extents of the retrieved nodes. The following guarantees the correctness of this algorithm.

Proposition 5.3 (1) Let $\approx$ be an approximation (i.e. it satisfies Equation (3)) on $DB$. Then, for every $i = 1, \ldots, n$ and every $i$-tuple $(v_1 \ldots v_i)$, we have $T_{(v_1, \ldots, v_i)}(DB) = L_{[(v_1, \ldots, v_i)]^{\$}}(I_t(DB))$. (2) A tuple $(v_1, \ldots, v_n)$ satisfies a query $q$ iff $W(P_q) \cap T_{(v_1, \ldots, v_n)}(DB) \neq \emptyset$ (where $P_q$ is as defined above).

Evaluating More Complex Queries Sometimes we can use a T-index to evaluate queries $q \notin inst(t)$. We illustrate first with two examples.

Example 5.4 Let $t$ and $q$ be:

$t = P\; x\; ((B.A)^{*})\; y\; C\; z$
$q = ((A.B)^{*}).A\; y\; C\; z$

Obviously $q \notin inst(t)$, but we can still use $I_t$ as follows. First instantiate $t$ to $p = A\; x\; ((B.A)^{*})\; y\; C\; z \in inst(t)$ (we have instantiated $P$ with $A$). Then $q$ can be expressed as a projection from $p$, namely as select $y, z$ from $p$, because $A.(B.A)^{*} = (A.B)^{*}.A$.

Example 5.5 Let $t$ and $q$ be:

$t = A\; x\; B\; y\; C\; z$
$q = A\; x\; B\; y\; (C.D)\; u\; E\; v$

Again $q \notin inst(t)$. Here $t$ has a single instance, $p = A\; x\; B\; y\; C\; z$. We can use it to compute a prefix of $q$, namely the variables $x$ and $y$, then continue to compute $u, v$ with a search in the data graph. That is, we rewrite $q$ as: select $(x, y, u, v)$ from $p,\; y\; (C.D)\; u\; E\; v$. A subtle point here is that the unused "tail" of $p$, namely $C\; z$, is not harmful (it is implied by $y\; (C.D)\; u$). In effect we have replaced some prefix of $q$ with an instance of $t$: we call this prefix replacement.

The general problem of deciding whether a path query $q$ can be rewritten in terms of one or more T-indexes generalizes the query rewriting problem [15] to regular path expressions. We do not attempt to solve the rewriting problem for regular expressions: this is still open.
Instead we identify restrictions under which the rewriting can be done efficiently.

Formally, given a template $t$ and a query path $q$ with variables $x_1, \ldots, x_n$, we define a prefix replacement of $q$ w.r.t. $t$ to consist of (1) an instance

$p \in inst(t)$ (with proper variable renamings), and (2) a postfix $q'$ of the query path $q$, such that the query select $(x_1, \ldots, x_n)$ from $p, q'$ is equivalent to $q$. Checking whether a query path $q$ admits a prefix replacement is PSPACE-hard. Indeed, given two arbitrary regular expressions $R, R'$, they are equivalent iff the query path $q = R\; x$ has a prefix replacement w.r.t. the template $t = R'\; y$, and equivalence of regular expressions is PSPACE-complete [22]. In the full version of the paper we prove the converse too: checking whether there exists a prefix replacement (and finding one, when it exists) is in PSPACE. The proof consists of a careful reduction of the prefix replacement problem to two problems: (1) testing equivalence of regular path expressions (which is known to be in PSPACE), and (2) finding, for a regular expression $R$ and number $n$, all $n$-tuples of regular languages $R_1, \ldots, R_n$ for which $R = R_1.R_2 \ldots R_n$; we prove that this problem is in PSPACE too.

Finally, we consider a particular case of templates and queries which we believe to be more frequent in practice. Define a regular path expression to be simple if it consists of a concatenation of (1) constants from $D$, (2) the single-label wildcard _, and (3) *. For example, *.A.*.B._._.C is a simple regular path expression. Similarly, define a template to be simple if all its constant regular expressions (if it has any) are simple. We prove in the full version of the paper that checking/finding a prefix replacement for a simple query w.r.t. a simple template is in PTIME. At the core of this result lies a lemma stating that containment of simple path expressions can be tested in PTIME. This may come as a surprise, since the deterministic automaton associated with a simple regular path expression may have exponentially many states (proof omitted), hence the traditional containment test for regular languages would be much more expensive.
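To make the notion of a simple path expression concrete, the following is a minimal sketch of how a sequence of edge labels can be matched against one. It only illustrates membership, not the containment test from the lemma, whose algorithm is not given here. The token encoding (a Python list where "_" stands for any single label and "*" for any label sequence) and the function name `matches_simple` are assumptions made for this illustration.

```python
def matches_simple(expr, path):
    """Return True if the label sequence `path` matches the simple expression `expr`.

    expr: list of tokens; "_" matches exactly one label, "*" matches any
          (possibly empty) sequence of labels, anything else is a constant label.
    path: list of edge labels.
    """
    # reachable[j] is True when the tokens consumed so far can match path[:j].
    reachable = [True] + [False] * len(path)
    for token in expr:
        nxt = [False] * (len(path) + 1)
        if token == "*":
            # "*" can extend any reachable prefix by zero or more labels.
            seen = False
            for j in range(len(path) + 1):
                seen = seen or reachable[j]
                nxt[j] = seen
        else:
            # A constant or "_" consumes exactly one label.
            for j in range(len(path)):
                if reachable[j] and (token == "_" or token == path[j]):
                    nxt[j + 1] = True
        reachable = nxt
    return reachable[len(path)]


# The example expression *.A.*.B._._.C from the text, in the assumed encoding.
example_expr = ["*", "A", "*", "B", "_", "_", "C"]
```

The sketch runs in time proportional to the product of the expression length and the path length, which is only meant to show that simple expressions are easy to work with directly, without building a deterministic automaton.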
Summarizing:

Proposition 5.6 Given a template $t$ and a query path $Q$, the problem of whether there exists a prefix replacement of $Q$ w.r.t. $t$ is PSPACE-complete. When both $Q$ and $t$ are simple, the problem is in PTIME.

Connection to Related Work T-indexes are flexible structures which can be fine-tuned to trade off space for generality. They capture 1- and 2-indexes, by taking the templates $P\; x$ and $*\; x\; P\; y$ respectively. They also generalize traditional relational indexes: assuming the encoding of relational databases as in [7], an index on attribute $A$ of the relation $R_1$ can be captured with the template $(R_1.\text{tup})\; x\; A\; y\; F\; z$. Finally, they generalize path indexes in OODBs. For example, Kemper and Moerkotte describe in [14] the access support relation (ASR),

an index structure for query paths in OODBs. ASRs are designed to evaluate efficiently paths of the form $o.A_1.A_2 \ldots A_n$, where $o$ is an object and $A_1, \ldots, A_n$ are attribute names. They define an access support relation, ASR, to be an $(n+1)$-ary relation $R$ such that $(u, u_1, u_2, \ldots, u_n) \in R$ iff there exists a path $u \xrightarrow{A_1} u_1 \xrightarrow{A_2} u_2 \ldots u_n$ in the database. Ignoring the mismatch between the object-oriented and the semistructured data model, there exists a close relationship between an ASR and the T-index for the template $*\; x\; A_1\; x_1\; A_2\; x_2 \ldots A_n\; x_n$. The graph structure of the T-index would be a chain of $2n$ nodes $[(r)] \to [(r)]^{\$} \to [(u, u_1)] \to [(u, u_1)]^{\$} \to [(u, u_1, u_2)] \to \ldots \to [(u, u_1, u_2, \ldots, u_n)]^{\$}$, where the last (terminal) node has an associated extent: this extent is precisely the ASR.

6 Conclusions

We presented an indexing mechanism, called T-index, aimed at assisting in the evaluation of query paths in semistructured data. A T-index captures the (possibly partial) knowledge about the structure of data and the type of queries in the query mix, as described by a path template.

Abiteboul and Vianu consider in [3] First-Order equivalence classes over tuples of values in the database. Two tuples $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_n)$ are equivalent if they are indistinguishable by any FO formula. The language equivalences on which we base our index constructs are only superficially related to the FO equivalence classes: the queries we consider to distinguish between two tuples are only chain queries. Hence the language equivalences are coarser than FO equivalences, and result in fewer equivalence classes.

Buchsbaum, Kanellakis, and Vitter consider in [5] the problem of incrementally maintaining query paths given by a fixed regular expression under either database insertions or deletions (but not both). They describe an efficient method for incremental updates.
Since their method refers to a fixed regular expression, it could be used for incremental updates of T-indexes, but only when the template is restricted to constant regular expressions. We do not address index maintenance here, but note that a possible alternative to incremental maintenance can be based on the optimization technique mentioned in the Appendix, of pointing back to the data, doing so whenever a portion of the index graph is invalidated by an update.

Acknowledgment: We thank Micky Frankel for the implementation of the 1-, 2- and T-indexes.

References

[1] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1):68–88, April 1997.
[2] Serge Abiteboul. Querying semi-structured data. In ICDT, 1997.
[3] Serge Abiteboul and Victor Vianu. Generic computation and its complexity. In Proceedings of 23rd ACM Symposium on the Theory of Computing, 1991.
[4] Elisa Bertino and Won Kim. Indexing techniques for queries on nested objects. IEEE Transactions on Knowledge and Data Engineering, 1(2):196–214, June 1989.
[5] Adam Buchsbaum, Paris Kanellakis, and Jeffrey Scott Vitter. A data structure for arc insertion and regular path finding. Annals of Mathematics and Artificial Intelligence, 3:187–210, 1991.
[6] Peter Buneman, Susan Davidson, Mary Fernandez, and Dan Suciu. Adding structure to unstructured data. In Proceedings of the International Conference on Database Theory, pages 336–350, Delphi, Greece, 1997. Springer Verlag.
[7] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. A query language and optimization techniques for unstructured data. In Proceedings of ACM-SIGMOD International Conference on Management of Data, 1996.
[8] V. Christophides, S. Abiteboul, S. Cluet, and M. Scholl. From structured documents to novel query facilities. In Richard Snodgrass and Marianne Winslett, editors, Proceedings of 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, May 1994.
[9] P. Flajolet and R. Sedgewick. Digital search trees revisited. SIAM Journal on Computing, 15:748–767, 1986.
[10] Harold Gabow and Robert Tarjan. Faster scaling algorithms for network problems. SIAM Journal on Computing, 18(5):1013–1036, 1989.
[11] Roy Goldman and Jennifer Widom. DataGuides: enabling query formulation and optimization in semistructured databases. In VLDB, September 1997.

[12] G. Gonnet. Efficient searching of text and pictures (extended abstract). Technical Report OED-88-02, University of Waterloo, 1988.
[13] Monika Henzinger, Thomas Henzinger, and Peter Kopke. Computing simulations on finite and infinite graphs. In Proceedings of 20th Symposium on Foundations of Computer Science, pages 453–462, 1995.
[14] Alfons Kemper and Guido Moerkotte. Access support relations: an indexing method for object bases. Information Systems, 17(2):117–145, 1992.
[15] Alon Levy, Alberto Mendelzon, Yehoshua Sagiv, and Divesh Srivastava. Answering queries using views. In Proceedings of the 14th Symposium on Principles of Database Systems, San Jose, CA, June 1995.
[16] A. Mendelzon, G. Mihaila, and T. Milo. Querying the world wide web. In Proceedings of the Fourth Conference on Parallel and Distributed Information Systems, Miami, Florida, December 1996.
[17] Robin Milner. Communication and concurrency. Prentice Hall, 1989.
[18] S. Nestorov, S. Abiteboul, and R. Motwani. Inferring structure in semistructured data. In Proceedings of the Workshop on Management of Semi-structured Data, 1997.
[19] S. Nestorov, J. Ullman, J. Wiener, and S. Chawathe. Representative objects: concise representation of semistructured, hierarchical data. In ICDE, 1997.
[20] Robert Paige and Robert Tarjan. Three partition refinement algorithms. SIAM Journal on Computing, 16:973–988, 1987.
[21] A. Salminen and F. W. Tompa. Pat expressions: an algebra for text search. In Papers in Computational Lexicography: COMPLEX'92, pages 309–332, 1992.
[22] L. J. Stockmeyer and A. R. Meyer. Word problems requiring exponential time. In 5th STOC, pages 1–9. ACM, 1973.

A Appendix

Bisimulation, Simulation For completeness, we include here the definitions of a bisimulation and a simulation. Note that we need to slightly modify the traditional definitions and "reverse" the directions of edges, because $L_v$ in our context refers to the set of paths leading into $v$, rather than from $v$ as typically found in the literature.

Definition A.1 Let $DB$ be a data graph. A binary relation $\approx$ on its nodes is called a reversed bisimulation if it satisfies:

1. If $v \approx v'$ and $v$ is a root, then so is $v'$.
2. Conversely, if $v \approx v'$ and $v'$ is a root, then so is $v$.
3. If $v \approx v'$, then for any edge $u \xrightarrow{a} v$ there exists an edge $u' \xrightarrow{a} v'$ s.t. $u \approx u'$.
4. Conversely, if $v \approx v'$, then for any edge $u' \xrightarrow{a} v'$ there exists an edge $u \xrightarrow{a} v$ s.t. $u \approx u'$.

A binary relation is called a reversed simulation if it satisfies conditions 1 and 3.

We say that two nodes $v, u$ are reversed bisimilar, in notation $v \approx_b u$, iff there exists a reversed bisimulation $\approx$ s.t. $v \approx u$. There always exists a maximal reversed bisimulation; it is an equivalence relation, and it is precisely $\approx_b$. Paige and Tarjan [20] describe an $O(m \log n)$ time algorithm for computing the maximal bisimulation on an unlabeled graph with $n$ nodes and $m$ edges, which can be easily adapted to an $O(m \log m)$ algorithm for labeled graphs [6].

For the case of reversed simulation, there also exists a maximal one, which we denote $\preceq$; however, it is not an equivalence relation in general, but a preorder (see footnote 6). We say that two nodes $v, u$ are reversed similar if $v \preceq u$ and $u \preceq v$, and use the notation $v \approx_s u$. Henzinger, Henzinger, and Kopke [13] give an $O(mn)$ algorithm for computing the simulation relation on an unlabeled graph with $n$ nodes and $m$ edges, from which one can derive an $O(m^2)$ algorithm for labeled graphs [6].

Proof of Proposition 3.1 Recall that we only consider accessible graph databases, i.e. those in which every node is accessible from some root. We will show that $\equiv$ is a reversed bisimulation: this proves that $v \equiv u \implies v \approx_b u$, and the proposition follows.
We check the four conditions in Definition A.1. If $v \equiv u$ and $v$ is a root, then $\varepsilon \in L_v$, hence $\varepsilon \in L_u$, so $u$ is a root too. This proves items 1 and 2. Let $v \equiv u$ and let $v' \xrightarrow{a} v$ be some edge. Hence $L_v = L_1.a \cup L_2$, where $L_1 = L_{v'}$, while $L_2$ is a language which does not contain any words ending in $a$ (because $DB$ has unique incoming labels). It follows that $L_u = L_1.a \cup L_2$. Since $v'$ is an accessible node in $DB$, we have $L_1 \neq \emptyset$, hence there exists some edge $u' \xrightarrow{a} u$ entering $u$, and it also follows that $L_{v'} = L_{u'}$.

Footnote 6: That is, $\preceq$ is reflexive and transitive, but not necessarily symmetric.

[Figure 4: A data graph on which the relations $\equiv$, $\approx_s$, and $\approx_b$ differ. The drawing (a root, nodes x, y, z, and edges labeled a, b, c, d) is not recoverable from the extracted text.]

Proof of Proposition 3.2 The inclusion $L_v \subseteq L_{[v]}$ holds for any equivalence relation $\approx$, not only approximations: this is because any path $v_0 \xrightarrow{a_1} v_1 \xrightarrow{a_2} v_2 \ldots$ in $DB$, with $v_0$ being a root node, has a corresponding path $[v_0] \xrightarrow{a_1} [v_1] \xrightarrow{a_2} [v_2] \ldots$ in $I$. For the converse, we prove by induction on the length of a word $w$ that, if $w \in L_{[v]}$, then $w \in L_v$. When $w = \varepsilon$ (the empty word), then $[v]$ is a root of $I$: hence $v \approx r$ for some root $r$. This implies $L_v = L_r$, so $\varepsilon \in L_v$. When $w = w_1.a$, we consider the last transition in $I$: $s \xrightarrow{a} [v]$, with $w_1 \in L_s$. By definition there exist nodes $v_1 \in [s]$ and $v' \in [v]$ and an edge $v_1 \xrightarrow{a} v'$ and, by induction, it follows that $w_1 \in L_{v_1}$. This implies $w \in L_{v'}$. Now we use the fact that $\approx$ is an approximation to conclude that $w \in L_v$.

Proof of Theorem 3.4 It suffices to prove the statement for the case when $\approx$ is the reversed bisimulation relation. We will show that in this case there are at most $c(k, k, p)$ equivalence classes under reversed bisimulation, in any database $DB$ with $k$-short paths and at most $p$ distinct labels. Here $c(k, d, p)$ is defined by:

$$c(k, 1, p) \stackrel{\mathrm{def}}{=} 2(k + 1) \qquad (4)$$
$$c(k, d + 1, p) \stackrel{\mathrm{def}}{=} 2k + 1 + 2^{p \times c(k, d, p)} \qquad (5)$$

First a matter of terminology. Due to our particular setting, our edges turned out to be in the opposite direction to the traditional one. Thus we have in this proof reversed bisimulations and reversed trees, i.e. with paths leading into the root rather than out of it. We will drop the attribute "reversed". Note that in the case of trees, edges now lead from children to parents.
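Since the bound counts equivalence classes under reversed bisimulation, it may help to see both quantities computed on a small graph. The sketch below is illustrative only: it uses naive partition refinement, not the Paige and Tarjan algorithm [20], and it reads equation (5) as $c(k, d+1, p) = 2k + 1 + 2^{p \times c(k, d, p)}$, which is the reading that agrees with the two summands in the counting argument. The function names and the edge encoding (triples `(u, label, v)`) are assumptions of this sketch.

```python
def reversed_bisimulation(nodes, roots, edges):
    """Map each node to a block id of the coarsest reversed bisimulation.

    nodes: iterable of hashable node ids
    roots: set of root nodes
    edges: iterable of (u, label, v) triples, meaning an edge u --label--> v
    """
    incoming = {v: [] for v in nodes}
    for u, label, v in edges:
        incoming[v].append((label, u))

    block = {v: 0 for v in incoming}  # start with a single block
    while True:
        # Signature of v: its old block, whether it is a root, and the set of
        # (label, block) pairs of its predecessors (edges are followed backwards).
        signature = {
            v: (block[v], v in roots,
                frozenset((label, block[u]) for label, u in incoming[v]))
            for v in incoming
        }
        ids = {}
        new_block = {v: ids.setdefault(signature[v], len(ids)) for v in incoming}
        if len(ids) == len(set(block.values())):
            return new_block  # no block was split: the partition is stable
        block = new_block


def c(k, d, p):
    """Upper bound c(k, d, p), following equations (4) and (5) as read above."""
    if d < 1:
        raise ValueError("d must be at least 1")
    bound = 2 * (k + 1)                       # equation (4), the case d = 1
    for _ in range(d - 1):
        bound = 2 * k + 1 + 2 ** (p * bound)  # equation (5)
    return bound


# A tiny hypothetical graph: x and y have the same incoming paths, as do z and w.
example_nodes = ["r", "x", "y", "z", "w"]
example_edges = [("r", "a", "x"), ("r", "a", "y"), ("x", "b", "z"), ("y", "b", "w")]
example_blocks = reversed_bisimulation(example_nodes, {"r"}, example_edges)
```

On real data the number of blocks is the number of nodes of the corresponding index graph, and the recurrence grows very quickly with $d$, so the bound is mainly of theoretical interest.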

Consider some node $u \in DB$. Define $T(u)$ to be the infinite, reversed unfolding of $DB$ at $u$. That is, $T(u)$ is a (possibly infinite) tree, having $u$ as its root, such that for every path in $DB$ labeled $a_1.a_2 \ldots a_n$ ending in $u$, there exists a unique corresponding path in $T(u)$ ending in $u$, with the same labels. Each node in $DB$ may correspond to several nodes in $T(u)$, possibly to infinitely many. We will use the same notation for the nodes in $DB$ and those in $T(u)$: thus we will talk about two nodes $x, y$ in $T(u)$ as being "equal", $x = y$. Recall that $DB$ had its own root node(s): we call their unfoldings in $T(u)$ the old roots.

The importance of these trees $T(u)$ is the following. For any two nodes $u, v$ in $DB$, we have $u \approx v$ iff there exists a bisimulation between $T(u)$ and $T(v)$. Hence, in order to count the number of equivalence classes $[u]$ it suffices to count the number of equivalence classes of infinite trees $T(u)$. Here, and in the sequel, the definition of a bisimulation between two graphs $T(u)$ and $T(v)$ is exactly as in Definition A.1, where items 1 and 2 are required both for "roots" and for "old roots".

As a matter of terminology, we will classify $T(u)$'s nodes into levels: thus the root $u$ is on level 1, its children are on level 2, etc. For each such tree $T(u)$ we identify a certain set of nodes which can be cut. For that we consider all paths of length $\leq k + 1$ in $T(u)$ ending in the root $u$, $x \to u_{j-1} \to \ldots \to u_2 \to u_1 = u$, such that $u_{j-1}, u_{j-2}, \ldots, u_1$ are distinct nodes while $x$ is equal to one of $u_{j-1}, \ldots, u_2, u_1$, say $x = u_i$, for $i < j$. We define $x$ to be a cut node, and call $i$ its index. Note that the subtrees of $T(u)$ rooted at $u_i$ and at $x$ are isomorphic. Since $DB$ is $k$-short, all its cut nodes are on levels $\leq k + 1$, and there are only finitely many.

Next we construct a finite tree $D(u)$, by actually performing the cuts. That is, for each cut node $x$ as before, we delete all its children (and children's children etc.): $x$ becomes a leaf.
We label the new leaf with a special symbol indexed by $i$, where $i$ is the index of $x$, with the intent to recapture the information lost by cutting: the level number $i$ will help us restore the lost information. Repeating this for all cut nodes gives us a finite tree, $D(u)$, of depth $\leq k$, in which some of the leaves are labeled with one of the $k$ special symbols (indexed $1, \ldots, k$).

The importance of $D(u)$ lies in the following fact. If there exists a bisimulation between $D(u)$ and $D(v)$, then there also exists a bisimulation between $T(u)$ and $T(v)$. Hence there will be at most as many bisimulation equivalence classes for the $T(u)$'s as for the $D(u)$'s: the latter are easier to count.

We prove the fact first. For this new kind of tree $D(u)$ we change the definition of bisimulation by adding the requirement that, whenever $x \approx y$ and $x$ is labeled with the special symbol of index $i$, then $y$ is labeled with the special symbol of index $i$ too, and

vice versa. Consider a bisimulation $\approx$ between $D(u)$ and $D(v)$. To show that $T(u)$ and $T(v)$ are bisimilar, we consider some intermediate construction first. Namely, define $B(u)$ (and $B(v)$ similarly) to be the graph obtained from $D(u)$ by fusing every cut node $x$ with $u_i$, where $i$ is the index of $x$. That is, we delete the node $x$, redirect the unique edge $x \to y$ to $u_i \to y$, and keep the same label on the edge: we call the new edge a back edge. $B(u)$ is not a tree anymore, since back edges introduce cycles. However, its infinite unfolding at the node $u$ is isomorphic to $T(u)$, and similarly for $T(v)$ and $B(v)$. Hence it suffices to prove that $B(u)$ and $B(v)$ are bisimilar. Recall that we have a bisimulation relation $\approx$ between $D(u)$ and $D(v)$. Define first a subset $\sim$ of the binary relation $\approx$ as: $x \sim y$ iff $x \approx y$, $x$ and $y$ are on the same level, and their parents $x', y'$ (see footnote 7) satisfy $x' \sim y'$. The fact that $D(u)$ and $D(v)$ are trees ensures that $\sim$ remains a bisimulation. We prove now that $\sim$ is a bisimulation between $B(u)$ and $B(v)$. Obviously it satisfies conditions 1 and 2 of Definition A.1, both for all old roots and for the "real" roots $u$ and $v$. We check condition 3: assume $y \sim y'$, and consider only back edges $u_i \to y$ in $B(u)$ (for regular edges it is trivial), with $u_i$ a node on level $i$. It corresponds to an edge $x \to y$ in $D(u)$, where $x$ is labeled with the special symbol of index $i$. Since $\sim$ is in particular a simulation between $D(u)$ and $D(v)$, there exists an edge $x' \to y'$ in $D(v)$, with $x'$ also labeled with the special symbol of index $i$: hence $x'$ is a cut node too, and in $B(v)$ we will find a corresponding back edge $u'_i \to y'$, with $u'_i$ also on level $i$. Since $y \sim y'$ and both $u_i$ and $u'_i$ are their ancestors, and on the same level, it follows that $u_i \sim u'_i$ (that is how we designed $\sim$). This proves that $B(u)$ and $B(v)$ are bisimilar.

Finally we count the number of equivalence classes under bisimulation for finite trees of depth $\leq d$, in which some leaves may be labeled with one of the $k$ special symbols, and in which some nodes may have been designated "old roots".
We prove that $c(k, d, p)$, given by equations (4) and (5), is an upper bound for that number. Indeed, for $d = 1$, each such tree consists of a single node, which is by necessity the ("real") root. In addition it may be designated an old root or not, and it may be labeled with one of the $k$ special symbols, or not at all. In total there are $2(k + 1)$ choices. For the induction case, consider some tree of depth $\leq d + 1$. It is either a single node (which brings us back to the previous case, and gives the $2(k + 1)$ summand in Equation (5)), or has a "real" root with a non-empty set of direct children. In the latter case we start by dropping the duplicate subtrees. We obtain a bisimilar tree, in which the root has at most $p \times c(k, d, p)$ children (every label on the edge paired with every possible bisimulation equivalence class for trees of depth $\leq d$). Hence, there are at most $2^{p \times c(k, d, p)} - 1$ equivalence classes under bisimulation: adding the two summands gives us Equation (5).

Footnote 7: That is, there exist edges $x \to x'$, $y \to y'$.

Table 1: Experiments showing index size

Data Graph    Size    1-index    2-index
Bibtex        150     40         50
Web site      1521    198        1100

Experiments Recall that index storage consists of two parts: the graph $I$ and the extents. The graph carries the schematic information, and its size is critical for query performance: the graph distinguishes our index structures from traditional indexes. We are currently conducting a series of experiments to assess the size of the index graphs: some of the results are reported in Table 1. We are testing the technique on a variety of graph types, including relatively structured ones (BibTeX data), loosely structured Web data (in particular the Web site of the CS department of Tel Aviv University), randomly generated graphs, and mixed graphs composed of components of the above types. We briefly describe these experiments here. In order to assess the schematic information we measure only the number of non-leaf nodes in the graphs.

We started by considering 1- and 2-indexes. Not surprisingly, the smallest indexes were obtained for the BibTeX data: although the structure of BibTeX items may vary (hence a collection of such items is naturally modeled by the semistructured data model), the number of possible paths between nodes is rather limited. We considered increasingly large files and their corresponding graph representations. Already at 150 nodes the sizes of the 1- and 2-indexes almost stabilized, at about 40 and 50 vertices respectively, staying at about the same size regardless of the growth of the data, and thus providing significant performance improvement when querying large files.
Observe that the independence of the index size from the data size is also implied here by Theorem 3.4, but the experimental results show that in practice the index size is much smaller than the theoretical upper bound induced by the proof of the theorem.

To evaluate the technique in a less structured environment we considered the Web site of the CS department at Tel Aviv University. It should be noted that the pages in the site are each built and maintained individually by distinct people without significant constraints on the structure, and are not automatically generated by some application, as is done in some organizations; hence the structure is rather loose and makes the site a typical example of semistructured information. For a graph of about 1500 nodes, the size of the 1-index amounts to about 13% of the original size, and that of the 2-index to about 72%. Observe that the latter is only 0.0475% of the potential upper bound on the size of the 2-index, which is the square of the number of nodes in the graph! This also implies that the effort in evaluating queries of the form $*\; x_1\; P\; x_2$ on the original data can potentially be as much as the square of that needed when using the 2-index. (On $DB$ we need to evaluate the query from each node, while on $I$ we just evaluate $P\; x_2$ from $I$'s root.)

The usage of T-indexes for focusing on specific, more interesting parts of the data was tested on mixed graphs combining randomly generated subgraphs with BibTeX or Web-site-like data, and using templates focusing on the BibTeX/Web parts. The reduction in size was similar to the one reported above or greater, depending on the size of the randomly generated parts being ignored in the construction due to the given template.

Techniques for Reducing the Size of an Index Graph We describe here three such methods. We first explain how they work for 1-indexes, then describe briefly how the techniques are generalized to 2- (or T-) indexes.

Normalizing labels: In many cases, distinct labels in the database graph are synonyms denoting the same concept. For example, the labels "first name", "first-name", "fname", "First Name" may all refer to the same thing. Still, the 1-index, as described above, stores each of them separately. To avoid this, one may choose to "normalize" the database before constructing the index by applying some normalization function $\nu$ to all the labels in $DB$, and thus reduce the size of the index. We denote the 1-index thus obtained by $I(\nu(DB))$.
It is easy to see that:

Proposition A.2 If all the formulas f in a query path q = P x have the property that for every value d ∈ D, f(d) holds iff f(φ(d)) holds, then q, when evaluated using I(φ(DB)), computes exactly the answer of q for the original DB.

Pointing back to the database: Assume we have some equivalence class (node) s in the 1-index having many descendants, all representing very small equivalence classes. Note that we gain very little by computing on this part of the index because it is almost as big as the corresponding part of the data graph. So we would like to remove it and avoid the duplication of data.

Let S be a set of nodes in I having this property. We can reduce I's size by redirecting all edges leaving a node s in S to point into the database DB. That is, for every node s ∈ S we delete all its outgoing edges s -a-> s', and for each of them we introduce new edges from I into DB, from s to each v ∈ extent(s'). The root(s) of the resulting combined structure will still be I's old root(s), and as before, the queries will be evaluated on the hybrid index starting from these roots. The space gain consists in removing those parts of I which are no longer accessible from the root(s). This may come at the cost of increased computation time, since now part of the search is done on DB.

Note that this hybrid index no longer has the property of 1-indexes that all the node extents are disjoint (because a database node may appear in both the I and the DB parts of the structure). Nevertheless, we prove in the full paper that computing on this hybrid structure still yields a correct result.

Dropping extents: Alternatively, we can keep the entire index I, but delete some of the extents. The computation of a regular path expression on such a reduced index proceeds as follows. Let A be some (nondeterministic) automaton equivalent to the regular expression, and let G = (I × A)^acc be the standard product automaton, in which we only retain the accessible part. Consider all states (s, t) of G, where t is a terminal state in A. If s ∈ I has an associated extent, then include extent(s) in the result: so far the computation is similar to that described in Section 3. Otherwise, we have to "backtrack" in I up to some state which does have an extent, and proceed from there in the database. This should be easy to visualize when I is a tree. In the general case, the backtrack step proceeds as follows. We compute in G a cut, meaning a set of nodes S which separates G's initial states from (s, t): more precisely, S has the property that any path from an initial state in G to (s, t) goes through some node in S.
A cut can be found efficiently (in PTIME, with low degree [10]). Moreover, in computing the cut, we only consider states of the form (s', x) where s' has an extent in I. Then, the backtrack step consists in considering, for each (s', x) ∈ S, the automaton A(x, t) obtained from A by taking x as the initial state and t as the only terminal state, and computing A(x, t) on DB with extent(s') as roots.

2- (and T-) indexes: A combination of the techniques discussed above can help us keep the size under control. First, the previous result regarding the normalization of labels holds here as well. Next, regarding dropping the extents, note that the extents here contain pairs (or tuples) of nodes and not individual objects. So rather than dropping an extent completely, we can also take a compromise approach and just drop one (or some) of the attributes. For example, if we know that most of the queries are interested only in the x1 variable of ∗ x1 P x2, then the x2 attribute can be dropped, thus reducing the size of the extents. To restore it, if needed, we can switch to the x1 nodes on DB and look for the corresponding x2's there. This can be combined with the technique for pointing back to the data, except that now whenever the computation is moved from the index to the database, we need to remember which x1 value caused the transition and pair it with the retrieved x2 nodes. Furthermore, when queries are interested only in the x1 values, further reduction can be obtained by the following observations. Note that in acyclic parts of the graph, the set of x1 values in a node's extent is the union of the x1 values in the extents of its children. So we may decide to drop an extent, and if needed compute it (perhaps recursively) from its children. Finally, if the index contains chains of nodes all having the same extent and the same set of outgoing edges, we can skip those nodes, and just record the number of repetitions.
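The observation about recomputing a dropped extent from its children can be sketched in Python as follows. This is only an illustration under stated assumptions: the index fragment is acyclic and given as a dictionary from node to children, and every leaf keeps its stored x1 values. The names children, stored, and x1_extent, and the sample nodes, are hypothetical and not taken from the paper.

```python
def x1_extent(node, children, stored, cache=None):
    """Return the x1 values of `node`, rebuilding dropped extents from children.

    Assumes the fragment reachable from `node` is acyclic, so the recursion
    terminates; `cache` avoids recomputing nodes that are shared between parents.
    """
    if cache is None:
        cache = {}
    if node in stored:            # extent was kept: use it directly
        return stored[node]
    if node not in cache:         # extent was dropped: union over the children
        kids = children.get(node, [])
        if not kids:
            raise KeyError(f"extent of leaf {node!r} was dropped and cannot be rebuilt")
        cache[node] = frozenset().union(
            *(x1_extent(kid, children, stored, cache) for kid in kids)
        )
    return cache[node]


# Hypothetical acyclic index fragment (node -> children).
children = {"s0": ["s1", "s2"], "s1": ["s3", "s4"], "s2": [], "s3": [], "s4": []}

# Stored x1 values; the extents of s0 and s1 have been dropped to save space.
stored = {
    "s2": frozenset({"v5"}),
    "s3": frozenset({"v1", "v2"}),
    "s4": frozenset({"v3"}),
}

print(sorted(x1_extent("s1", children, stored)))  # ['v1', 'v2', 'v3']
print(sorted(x1_extent("s0", children, stored)))  # ['v1', 'v2', 'v3', 'v5']
```

The trade-off is the one the text describes: space is saved on inner nodes, and the cost is paid at query time by the recursive union. Dropping more extents deepens the recursion.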
