[Query processing and optimization for XML] === This document: - is a guide to the three papers. - presented on Mon Feb 18 by madhu@cs.unm.edu, - does not delve into the research details but - only provides outline and perspective === WHY XML -- Why NOT: - commercial opportunity for business (info providers) to have vested interest in co-oporating with each other if adopted: applications use commonly understood tags. => hype about standards. merge with - java hype. -- What is XML? XML is a document format. -- what is semistructured data? - data on the web. Other eg: structured files. scientific data etc. - No schema. While data in traditional databases tends to be highly structured, with a clear notion of schema, data on the web is loosely structured. and is often not accompained by any clear notion of schema. -- Search Types - Structured search: search by restricting on attribute values, as in relational db. - Unstructured search: search in unstructured files, as in text. - Semi-structured search: combine both. -- Requires new DATAMODEL, QUERY LANGUAGE, OPTIMIZER And STORAGE SYSTEM What does it mean to query this data? What do you look for when you mine this data? === Three aspects of XML research pertaining to database community: 1. Storage, 2. Indexing, 3. Query Processing and Optimization. - query optimization will form bulk of discussion. -- Quick Overview of 3 papers - interactweb paper: provides use-case motivation/background and an example of using dataguides. - dataguides: concise and accurate structural summaries of semi-structured databases. - qo paper: bulk of methods used in lore system in stanford. -- More XML Hype: How it helps, limitations - XML can turns web into a db. imposing structure - BUT migrating web information to XML is significant 1st step. - USUAL searching the web: special html page parsers required vs text based search - Querying relational db's is form based and rigid. - web data cannot be put in RDBMS. Because it usually isnt structured. - XML is a document format. not a data model. - XML can encode semistructured data. [Jame's lecture wouldve covered a data model and algebra. (malhotra.pdf)] - translation of database research in semistructured queries applies to XML: Mapping: semistructured data can be represented as graph encoded as XML: - attributes=tags - objects=elems - atomic values=cdata - no data model (Problem). - references (Xpointer Xpath standards). - can be queried with regular path queries. - can be optimized: somehow. [My goal this semester is to understand and survey this optimization] - can be stored (integrity constraints, views updates.. patterns) ---- Three aspects of XML research: 1. Storage, 2. Indexing, 3. Query Processing and Optimization. Query Optimization - challenges: -- statistics: what do they mean in irregular data -- evaluation of regular path expressions: [limited forms of recursion?] -- mismatch beween logical (graph) model and physical (storage) schemsas) XML - Storage in RDBMS [Independent Research topic] [graphs stored by fwd pointers, some backward ptrs and indexing on collections.] My initial Goal: Discuss one project or set of related projects -- Why I chose to present Lore 1) this was successful 2) it didnt have heavy corporation efforts to bias. 3) Limitation: project closed, not maintained. Startups. --- Three aspects of XML research: Indexing 1. Storage, 2. Indexing, 3. Query Processing and Optimization. Indexing: - how indexing plays a part => As it relates to Query Optimization [Need to show Bindex Vindex Pindex etc] - Quickly describe {dataguides.pdf} as it describes dataguides which is basically a type of indexing. --- Three aspects of XML research: Queries -- XML - Queries in RDBMS [Seperate reseearch topic] === [Widom Paper: Query Optimization] - NOTE: though this is an older paper, it was resubmitted with changing semi-structured to XML and subimitted in acm's digital VLDB symposium in 99 or 2000. Also another conf is the workshop on Database Programming Languages (DBPL) - NOTE: implementation details: research systems described here and in other similar papers are about prototype systems built in universities (stanford.cs alon levy at uwashington. collaborator with microsoft research.) other languages: UnQL (Upenn) StruQL (AT&T) [Papers: conferences VLDB, DBPL QL] - Eg: STruqQL query :fn from a set of input graphs to an output graph - Note: regarding where the software is: they build their optimizers into query execution engines.[->Draw where the query optimizer sits in DBMS framework] - ORACLE - DB2 - get db2 specifics below - Postgres [Genetic algorithms Query Optimization] --- XML Query Languages (other aspects) XQL. XQuery --- XML Query processing (other aspects) bulk of research: - development of algebra for manipulating XML. - algorithms for reformulating queries --- XML Query Optimization: What is query optimization? - Decision process that selects the best query evaluation strategy from a set of execution plans. -- Basis to QO: - The performance can be prolly improved by 1) providing meta-data about the contents of the database (Eg: in LORE dataguide = metadata) 2) by incorporating the semantics of the database 3) by considering global plans which optimize execution over a set of queries. - Note: Standard techniques in optimization are based on query rewriting. It consists in associating a logical equivalent expression to a given query expression that is less costly to evaluate. == Semistructured Data Model - OEM data model. graph rooted graph = (V,E) - OEM model = translates to native XML DB. [However no ordering of elements inherent in XML textual defn] - self describing model. in lieu of schema [-> Draw OEM database][qo.ps] --- PATH EXPRESSIONS - Simple path expression x.l y, Path Expression = stringtogether spe's - query language based on path expression. - what a Query looks like [-> Draw Query Example] - Query language: Lorel QL path expressions describing traversals --- LORE INDICES: 1. Vindex, Operator: Scan, pointerchasing 2. Lindex, Operator: reverse pointer chasing 3. Bindex (edge): 4. Pindex, Operator: dataguides. --- LORE: Query Example - stratergies: top-down, bottom-up, hybrid. physical and logical plans. - LOGICAL QUERY PLAN: canonical form of query. [->DRAW] - HOW: define binding of variable x to object o during processing - HOW: define evaluation of a query (sub)plan = listing of variables in plan with objects (if any) bound to the variable - NEED to iteratively generate complete evaluations for all variables in query. - break down queries into independent components (order not fixed). - rotation points. - logical operators: Discover, chain, Glue, Temp, Project. - PHYSICAL QUERY PLAN: conversion to a plan based on index operators. --- LORE Query Engine - selects physical query plan. - implements the cost-based optimization - broadly takes into account shape of graph [measuring proximity based on (weighted) path length in the database graph.] --- LORE Statistics - cannot guarantee clustering of objects (page boundaries) in graphs - Model IO coset by # of object fetches (predicted) - Dataguides CANNOT store statistics (full path expressions) - End up storing statistics about all possible subpaths upto length k (tubable) - What is stored: 1. total # of atomic objects reachable from `p' atomic type 2. min.max of these 3. |p| = #instances of p 4. # distinct obj types reachable from p |p|_d 5. # l-labeled subobjects. |p_l| 6. # incoming l-labeled edges |p^l| - Trade off: accuracy of stats vs. time/space costs. -- What is estimated: [A.B x x.C y], x bound. given an evaluation corresponds to traversal in some path in the data, optimizer estimates whow many objects bind to next simple path expressions. - fan-in => |p| * (|p_l|/|p|_d) [Bottom-up: must estimate avg. no. of parents via a B edge for all C's] - fan-out => |p|* (|p^l|/|p|_d|) [top-down: must estimate avg.no of C subobjects. --- LORE COSTS (dataguide: uniform cost model ) - assign costs (CPU, IO) to nodes of Query plan, formulae in Appendix B - compute costs recursively. cost cost of subplans + cost. - cheapness: I/O cost. >+ CPU cost. --- LORE Plan Enumeration - lists heuristics to prune exponential graph space aggressively. (appendix C) - eg. TODO === Back to dataguides paper: overview [[[ - describes DataGuides as dynamic schemas [Aside: DTD: Note that a DTD acts more as a traditional schema, since it restricts the allowable XML data, while a DataGuide infers rather than imposes structure.] - generated from the database - for browsing database structure use in Lore (as UI for structured bowsing) - for formulating queries - For storing information such as statistics and sample values - for enabling query optimization. - theoretical foundations from graph theory and finite state automaton - presnets for their creation - presents an overview for incremental maintenance. - presents performance results, - label path annotations to improve query formulation and optimization - TODO formalize properties that guarantee or prohibit good performance - TODO heuristics that detect when a database may result in poor 1. will probably analyze the formalized database performance properties, 2. for monitoring purposes as source structure changes ]]] - Define dataguide. [->DRAW] - Use Strong datguides: annotated, equivalent to (path index) - each object in dataguide has link to corresponding target set in the source graph - time to search for all source objects reachable via a path = length of length of path of dataguide. - impl (interactweb.ps) dataguides for query formulation AND browsing results - DBG.Group_Member[10000].{Project, Publication[100]} - Answeredge. Answerobject. available for subsequent queries - Eg: (dataguide.ps) [6.1] SELECT DBG.Group_Member.Publication.Troff - standard query. assume 1 troff obj. => 10000000 objects examined - using dataguides: examine {root, DBG, GroupMember, Publication,Troff, targetset} [6.2] SELECT DBG.Group_Member.Publication.Year WHERE Year < 1975 - Use Vindex for 1975.(~ leaf label) to return objects (bottomup) - COmpute intersection wiuth data guide - assuming: DBG.Group_Member.Pub>year = #20000, #1000<1975. - VIndex lookup: 11000 objs [refer text]. [6.3] SELECT DBG.Group_Member.Publication WHERE Year < 1975 - if select path's target set (P) is empty, return no match - if where path's target set (Y), return no match - intersect aka 6.2 Y with Vindex lookup on year => Y* - find P* (parents of Y*). Intersect P* with P to get result. - for "You saved XXX by shopping at dataguides" # refer text. ==== Conclusions. 1. hype 2. opportunity - dead topic in rdbms still live in non structured databases.