Speaker: Dr. Su Weifeng, Beijing Normal University-Hong Kong Baptist University
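Speaker bio: Su Weifeng is currently Head of the Department of Computer Science, Associate Professor, and doctoral supervisor at Beijing Normal University-Hong Kong Baptist University United International College. Dr. Su received a PhD in 2007 from the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology. As first author, Dr. Su has published more than 10 papers in top international journals and conferences, including ACM Transactions on Database Systems (TODS) and IEEE Transactions on Knowledge and Data Engineering (TKDE). These papers have been cited more than 200 times by other authors and have been adopted in course teaching at several universities. Dr. Su serves as a reviewer for a number of leading journals and well-known international conferences. In recent years, Dr. Su has led four research projects, including a General Program grant from the National Natural Science Foundation of China.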
Abstract: Web databases are an important part of today's Web, and 80% of them are structured databases. To help users retrieve relevant records from different Web databases simultaneously, we propose a simultaneous querying system, called SIM-querying, which comprises three components: a query interface integrator, a data extractor and a result integrator. Each component uses a novel method that performs its function automatically. In the query interface integrator, a holistic schema matching method, HSM, takes advantage of the attribute occurrence patterns in multiple query interfaces to find the attributes that match across different interfaces within a domain. In the data extractor, a domain-based data extraction method, ODE, is presented. ODE first learns a domain ontology from the information overlap and schema matching in the query results and query interfaces of different Web databases within the domain, and then uses the ontology to extract the data encoded in the result HTML pages automatically. In the result integrator, a new duplicate detection method, UDD, identifies the duplicates that exist in the query results from different Web databases. UDD first constructs a set of negative records based on two observations about the query results of Web databases; then, starting from the negative records, an iterative algorithm identifies the duplicates across the Web databases. Experimental results show that each of these methods achieves very high precision and outperforms existing methods in the context of Web databases.
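To give a rough feel for what "attribute occurrence patterns" can mean in schema matching, the sketch below flags attribute pairs that each appear in several query interfaces but never appear together in the same one, on the intuition that such attributes may be alternative names for the same concept. This is only an illustration under that assumption; it is not the HSM method from the talk, and the function name `candidate_matches`, the `min_support` threshold and the sample book-search interfaces are all hypothetical.

```python
from itertools import combinations


def candidate_matches(interfaces, min_support=2):
    """Return attribute pairs that each occur in at least `min_support`
    interfaces but never occur together in the same interface.

    Illustrative heuristic only (an assumption, not the HSM algorithm).
    """
    # Count in how many interfaces each attribute appears.
    counts = {}
    for attrs in interfaces:
        for attr in set(attrs):
            counts[attr] = counts.get(attr, 0) + 1

    # Keep only attributes that appear often enough to be informative.
    frequent = sorted(a for a, c in counts.items() if c >= min_support)

    pairs = []
    for a, b in combinations(frequent, 2):
        # Attributes that co-occur in one interface are assumed distinct.
        co_occur = any(a in attrs and b in attrs for attrs in interfaces)
        if not co_occur:
            pairs.append((a, b))
    return pairs


# Hypothetical query interfaces from four book-search sites.
interfaces = [
    {"title", "author", "isbn"},
    {"title", "writer", "isbn", "publisher"},
    {"title", "author", "publisher"},
    {"title", "writer", "isbn"},
]

print(candidate_matches(interfaces))  # [('author', 'writer')]
```

On this toy input only "author" and "writer" are frequent yet never co-occur, so they are the single candidate pair. A heuristic this simple would produce false matches on real data, which is one reason a holistic method over many interfaces is needed.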