An Improved Approach for Deep Web Data Extraction

The World Wide Web is a valuable source of information which contains data in a wide range of formats. The different formats of web pages act as a barrier to automated processing. Many business organizations require data from the World Wide Web for carrying out analytical tasks such as business intelligence, product intelligence, competitive intelligence, decision making, opinion mining, sentiment analysis, and so on. Many researchers also face difficulty in finding the most appropriate journal for publishing their research articles. Manual extraction is laborious, which has driven the need for an automated extraction process. In this paper, an approach called ADWDE is proposed. This approach is primarily based on heuristic techniques. The purpose of this research is to design an Automated Web Data Extraction System (AWDES) which can identify the target of data extraction with a small amount of human intervention using semantic labeling, and which can perform extraction at an acceptable level of accuracy. In an AWDES, there always exists a trade-off between the level of human intervention and accuracy. The goal of this work is to reduce the level of human intervention while at the same time providing accurate extraction results irrespective of the business domain to which the web page belongs.


Introduction
The growth of the World Wide Web (WWW), a huge repository of information, has been remarkable. The WWW opens the door to a vast amount of data for the whole world. Search engines help us retrieve the needed information from this huge repository by providing a list of links to websites that might contain the requested information. Nevertheless, a large amount of data remains invisible to the user, because much of the information on the web is available today only through search interfaces. For example, if a user wants to find information about flights, she has to go to the airline's website and fill in the details in a search form. The site then displays the flights and the available seats as a result; these are dynamically generated pages containing the required information. This part of the web, reachable only through search interfaces, is the deep web. Deep web pages are generated through dynamic queries on web databases. Compared to the surface web or static web, the deep web contains a larger amount of data, of higher quality as well.
In order to get information from the deep web, customized search engines provided by the individual websites are used. Such search engines submit the user query to the backend database and retrieve the structured records matching the query. In the review of existing systems below, the difficulty of the extraction task, the techniques used, and the degree of automation are considered.
TSIMMIS by Hammer et al. [2]: The target of extraction is specified manually using a specification file written in a declarative language. Each command consists of three parts, namely, variables, source and pattern. The variables are used to store the result of extraction, the pattern specifies how to determine the content to be extracted, and the source specifies the input text from which extraction is performed. The output of extraction is represented using the Object Exchange Model (OEM), which contains the actual data as well as the structure of the data.
W4F (World Wide Web Wrapper Factory) by Sahuguet et al. [3]: It is a Java toolkit used to generate extraction rules. It consists of three layers, namely, the retrieval, extraction and mapping layers. The first layer, the retrieval layer, involves cleaning the input HTML document and converting it to a DOM tree. The second layer, the extraction layer, applies the extraction rules to the DOM tree in order to extract the target data. The extracted target data is represented as a Nested String List structure.
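As an illustration only (W4F itself is a Java toolkit whose actual API is not reproduced here), the following Python sketch mirrors the retrieval, extraction and mapping layering described above; the XPath rules and URL are hypothetical placeholders.

# Illustrative sketch of a retrieval / extraction / mapping pipeline in the
# spirit of W4F. This is NOT the W4F API; it only mirrors the layering above.
from urllib.request import urlopen
from lxml import html

def retrieval_layer(url):
    # Fetch the page and parse it into a DOM tree (cleaning is left to the parser).
    return html.fromstring(urlopen(url).read())

def extraction_layer(dom, rules):
    # Apply extraction rules (XPath expressions here) to the DOM tree.
    return {name: dom.xpath(xpath) for name, xpath in rules.items()}

def mapping_layer(extracted):
    # Package the results into a nested-string-list-like structure.
    return [[name, [str(v) for v in values]] for name, values in extracted.items()]

# Hypothetical usage; the rules and URL are placeholders, not from the paper:
# rules = {"title": "//h1/text()", "issn": "//span[@class='issn']/text()"}
# records = mapping_layer(extraction_layer(retrieval_layer("https://example.com/journal"), rules))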

Supervised Extraction Techniques
This class of techniques requires less human intervention than the previous approach. It does not involve manual creation of the wrapper. Instead, a set of labeled pages is given as input to the system, which generates the wrapper automatically. Techniques such as [8], EXALG by Arasu et al. [9] and FiVaTech by Kayed et al. [10] are page-level extraction techniques, whereas certain others, such as DeLa by Wang et al. [11] and DEPTA by Zhai et al. [12], are record-level extraction techniques.
DeLa (Data Extraction and Label Assignment) by Wang et al. [13]: It consists of four major modules, namely, form crawler, wrapper generator, aligner and label assigner. The form crawler is used to submit queries and to obtain the response pages pertaining to the query. The wrapper generator determines the template in which the data records are embedded. The aligner is responsible for grouping the attribute values so that all the attribute values in a group belong to the same concept. The label assigner assigns meaningful labels to individual data units. A hierarchical web crawler is used to perform form submission and to obtain the search response pages. DeLa uses a wrapper generator derived from the IEPAD [14] system. The Data-rich Section Extraction (DSE) [15] technique is employed to strip off non-informative sections of the web page; the data-rich section [16] is identified by comparing the HTML source code of multiple web pages generated using the same server-side templates. In order to identify C-repeated patterns (continuous repeated patterns), a suffix tree is constructed. The pattern tree is built starting from an empty root node, and whenever a pattern is discovered it is added as a child of the root node. The longest occurring pattern in the pattern tree (PT) represents the general template used for generating the multiple input pages; a simplified sketch of this idea is given below. The shortcoming of this technique is that it does not work with disjunctions in the structure of the web page.
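To make the notion of a continuous repeated (C-repeated) pattern concrete, the following sketch uses a brute-force scan over a tag sequence instead of DeLa's suffix-tree construction; the example tag sequence is hypothetical.

# Simplified stand-in for C-repeated pattern discovery. DeLa builds a suffix
# tree over the tag sequence; here a brute-force scan finds substrings that
# repeat in immediately adjacent (continuous) positions.
def c_repeated_patterns(tokens, max_len=10):
    patterns = set()
    n = len(tokens)
    for length in range(1, min(max_len, n // 2) + 1):
        for start in range(n - 2 * length + 1):
            first = tuple(tokens[start:start + length])
            second = tuple(tokens[start + length:start + 2 * length])
            if first == second:          # the pattern repeats contiguously
                patterns.add(first)
    return patterns

# Hypothetical tag sequence of a response page: the row template repeats.
tags = ["<table>", "<tr>", "<td>", "</td>", "</tr>", "<tr>", "<td>", "</td>", "</tr>", "</table>"]
print(c_repeated_patterns(tags))         # contains ('<tr>', '<td>', '</td>', '</tr>')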
DeLa [17] handles not only extraction but also annotation. The step following extraction is aligning the attribute values. It computes the similarity between data units using features such as tag path, presentation style, etc., and groups the attribute values into disjoint sets. The annotation phase uses four different heuristics, namely form keywords, common prefix/suffix, values matching conventional formats, and keywords in the query/response page, to determine an appropriate label for each attribute value group.
RoadRunner by Crescenzi et al. [18]: This tool attempts to perform unsupervised extraction by comparing two documents generated from the same template. It finds the similarities and differences between the documents and generates a union-free regular expression that matches both input documents. It takes one of the input pages as the initial wrapper and compares it with the other page. Whenever a mismatch is encountered, the wrapper is generalized. The mismatches are of two types, namely, tag mismatches and string mismatches. Tag mismatches are resolved by including an optional or an iterator in the wrapper. Whenever a tag mismatch is encountered, RoadRunner finds the initial tag and the final tag of the mismatched block and treats the block as a candidate square. It then matches this block against the portion of the source code immediately above it to determine whether it represents a repeated pattern. If it does, the block is enclosed inside (…)+ to represent repetition. If no such pattern can be found, RoadRunner searches the wrapper to determine whether the block can be skipped; in that case the block is optional, which is denoted by enclosing the pattern in (…)?. Applying the RoadRunner technique [19] to two sample input pages in this way shows how repetition of a block is determined and how the wrapper is generalized. Whenever a string mismatch is encountered, it represents a data unit and is therefore replaced by #PCDATA, which denotes any sequence of characters.
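The following heavily simplified sketch covers only the string-mismatch case, assuming the two pages already align tag-for-tag; it is not the RoadRunner algorithm itself, which additionally introduces iterators and optionals for tag mismatches as described above. The sample token lists are hypothetical.

# Greatly simplified illustration of wrapper generalization on string mismatches.
# Tag mismatches (iterators / optionals) are deliberately not handled here.
def generalize(page_a, page_b):
    # page_a and page_b are token lists of equal length whose tags line up.
    wrapper = []
    for a, b in zip(page_a, page_b):
        if a == b:
            wrapper.append(a)            # identical token: part of the template
        elif not a.startswith("<") and not b.startswith("<"):
            wrapper.append("#PCDATA")    # string mismatch: a data field
        else:
            raise ValueError("tag mismatch: needs iterator/optional handling")
    return wrapper

p1 = ["<li>", "Database Systems", "<i>", "John Smith", "</i>", "</li>"]
p2 = ["<li>", "Logic Programming", "<i>", "Jane Doe", "</i>", "</li>"]
print(generalize(p1, p2))                # ['<li>', '#PCDATA', '<i>', '#PCDATA', '</i>', '</li>']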

Limitations:
i. Time complexity is exponential in the number of tokens of the input documents. In order to reduce the complexity, the authors have proposed strategies such as limiting the number of alternatives to be considered, limiting the amount of backtracking performed to identify iterations, and removing some regular sub-expressions. ii. The biases thus introduced have a negative impact on the effectiveness of the technique. iii. It requires the input documents to be well-formed.
iv. It does not perform automatic annotation.
OMINI [20] also performs unsupervised extraction, similar to RoadRunner [21]. It consists of three phases, namely, preparing the input web page, identifying the target of extraction, and extracting the target data records. In the first phase the input web page is pre-processed: the page is cleaned in order to ensure that it is well-formed and then transformed into a tag tree representation.
Identifying the target of extraction requires two steps, namely, object-rich sub-tree discovery and object separator extraction. The portion of the tag tree containing all data objects of interest is identified by considering features such as the number of child nodes of a given node (fan-out), the number of bytes of data corresponding to each node, and the count of sub-trees. Object separator extraction involves finding the tags used as separators between data records; five different heuristics are applied for this purpose.
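A rough sketch of object-rich sub-tree discovery, assuming a single combined score of fan-out multiplied by text size; OMINI's actual ranking applies its heuristics separately, so this is illustrative only.

# Illustrative scoring of sub-trees to locate the object-rich region.
from lxml import html

def subtree_score(node):
    fan_out = len(node)                        # number of child elements
    content_size = len(node.text_content())    # amount of text under the node
    return fan_out * content_size

def object_rich_subtree(dom):
    # Return the element whose sub-tree most plausibly holds the data records.
    return max(dom.iter('*'), key=subtree_score)

# Hypothetical usage on a parsed result page:
# dom = html.fromstring(page_source)
# region = object_rich_subtree(dom)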
The authors have made assumptions such as: most of the tokens should be associated with a type constructor, tokens belonging to the template must have unique roles, and the data must be enclosed between tags. Most real-world websites do not satisfy all of these assumptions, which affects the effectiveness of the technique.
MDR (Mining Data Records) by Liu et al. [21]: MDR is based on two observations: i. a group of data records representing similar objects is placed in a contiguous section of the web page and formatted using similar HTML elements; ii. a set of similar records appears as child sub-trees of a common parent node in the tag tree representation of the HTML page. In DEPTA [22], the tag tree is constructed from nested rectangles, since each HTML element is rendered as a rectangular box by the browser. The co-ordinates of all the boxes are determined and compared in order to establish the containment relationship among boxes, and the tag tree is then constructed accordingly. Figure 1 shows the construction of the tag tree from these visual cues. In order to find the data record region, a measure called string edit distance, as in OLERA by Chang et al. [23], is used to compare the tag strings of individual nodes and of combinations of multiple adjacent nodes. FiVaTech (page-level web data extraction from template pages) by Kayed et al. [24]: It is an unsupervised approach for carrying out page-level extraction. It consists of two phases: the first phase transforms the input documents into a collection of DOM trees, which are merged into a pattern tree used to determine the server-side template involved in generating the input documents; the second phase derives the schema from the pattern tree and extracts the embedded data records. Its limitations are: 1) FiVaTech requires well-formed input documents, and parsing errors affect the effectiveness of the approach; 2) it uses a pattern matching algorithm which is computationally expensive; 3) it uses a threshold for identifying peer nodes, which impacts accuracy if a proper threshold is not chosen. Almost all of the methods discussed above can extract web data only under the stated conditions.
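A small sketch of the comparison implied by MDR's first observation: adjacent child sub-trees whose tag strings lie within a small normalized edit distance are candidate data records. The threshold and the representation of sub-trees as tag lists are assumptions for illustration.

# Compare the tag strings of adjacent child sub-trees with edit distance and
# report neighbouring children that are near-duplicates (likely data records).
def edit_distance(a, b):
    # Classic Levenshtein distance between two sequences of tags.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def similar_adjacent_children(children, threshold=0.3):
    # children: one tag-string list per child sub-tree of a node.
    pairs = []
    for left, right in zip(children, children[1:]):
        dist = edit_distance(left, right) / max(len(left), len(right), 1)
        if dist <= threshold:                  # nearly identical structure
            pairs.append((left, right))
    return pairs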

Proposed Architecture
From the literature review it is clear that most of the existing systems make assumptions such as the availability of multiple input pages, rare occurrence of missing or optional attributes, and so on. Also, these techniques are based on string pattern matching or DOM tree matching and therefore rely on the occurrence of repeated patterns in the generated wrappers, so a change in the structure of the web page breaks them. This problem is addressed in the proposed technique by considering semantic information as the basis of extraction rather than the structure or template of the web page. The steps of the proposed AWDES are shown in Figure 1. It involves four steps, namely, Search, Locate, Filter and Extract. The search step involves finding the target web pages given the URL of the website as input. The locate step refers to determining the data-rich regions in the target web pages, and the filter step involves extracting the data-rich nodes after eliminating non-informative sections such as navigation links, pop-up menus, advertisements, etc. The last step extracts the attribute-value pairs, which represent the description of the object, from the target web pages.
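A minimal skeleton of the four steps, with placeholder function names (not taken from the paper) showing how Search, Locate, Filter and Extract compose.

# Skeleton of the proposed four-step pipeline; names and signatures are placeholders.
def search(site_url):
    # Return the URLs of the target web pages of the site.
    raise NotImplementedError

def locate(page_dom):
    # Return the data-rich region of a target page.
    raise NotImplementedError

def filter_region(region):
    # Drop non-informative nodes (navigation links, advertisements, pop-up menus).
    raise NotImplementedError

def extract(nodes):
    # Return the attribute-value pairs describing the object.
    raise NotImplementedError

def awdes(site_url):
    records = []
    for page in search(site_url):
        records.append(extract(filter_region(locate(page))))
    return records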
The proposed approach is based on the observation that the journal home pages linked from publishers' websites are well formatted. Uniformity in presentation across several web pages of a website is achieved because they are generated using the same server-side template. If the location of the target data is identified for a single web page (the XPath for the attribute name-value pairs) from the scientific publisher's website, then the same XPath can be used for extraction from similarly formatted journal specification pages. The overall architecture is shown in Figure 2.
Algorithm for Automatic Navigation
Given the URL of the scientific publisher's site urlp, navigate to the site and extract the set of URLs corresponding to the journal home pages, arranged in alphabetical order (i.e. A-Z). After navigating to a journal's home page, the algorithm extracts a set of journal specifications comprising data such as the title, ISSN, URL of the journal's home page, and so on. The URLs for all the journals are ordered A-Z. The set of target URLs is denoted by H = {h_i, 1 ≤ i ≤ n}. Every journal home page h_i is parsed to gather the required attributes such as title, ISSN, Impact Factor, etc.
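A hedged sketch of this navigation step in Python with lxml; the publisher URL, the link XPath and the attribute XPaths are hypothetical, since the excerpt does not fix them.

# Collect the journal home-page URLs from a publisher's A-Z listing and parse each one.
from urllib.request import urlopen
from lxml import html

def collect_journal_urls(url_p, link_xpath="//a[contains(@href, 'journal')]/@href"):
    # Navigate to the publisher's site url_p and return the set H of journal home pages.
    dom = html.fromstring(urlopen(url_p).read())
    return sorted(set(dom.xpath(link_xpath)))          # H = {h_i}, ordered A-Z

def parse_journal_home(h_i, attribute_xpaths):
    # Parse one journal home page h_i and gather the required attributes.
    dom = html.fromstring(urlopen(h_i).read())
    return {name: dom.xpath(xp) for name, xp in attribute_xpaths.items()}

# Hypothetical usage:
# H = collect_journal_urls("https://publisher.example.com/journals-a-z")
# records = [parse_journal_home(h, {"title": "//h1/text()"}) for h in H]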

Heuristic Technique for Data Extraction
The aim of this technique is to identify the location of the target attributes, extract the attribute values, annotate them, and store the extracted data records containing the attribute values in a relational database. The technique relies on the heuristic that the target data to be extracted is available as visible content, which is displayed in the content area of the browser. In the DOM tree representing the web page, all text nodes appear as leaf nodes. The method involves visiting each URL and obtaining the DOM tree of the page containing the journal specification. The algorithm getLeafNodes is used to obtain the leaf nodes and to check whether each one is a text node. If they match the domain keywords such as SJR, SNIP, Impact Factor, etc., then their XPath is determined using the procedure getFullXPath. The algorithm convert_XPath_to_Selector is used to convert the XPath into a selector string. The same selector string can be used to perform extraction from similarly structured web pages. The algorithms used in the proposed approach are sketched below.
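Since the pseudocode itself is not reproduced in this excerpt, the following Python sketch only approximates the behaviour of getLeafNodes, getFullXPath and convert_XPath_to_Selector described above; the keyword set and the XPath-to-selector conversion rules are assumptions.

# Approximate reconstruction of the three routines named above (not the paper's
# own pseudocode); the keyword list and conversion rules are illustrative assumptions.
from lxml import html

DOMAIN_KEYWORDS = {"SJR", "SNIP", "Impact Factor", "ISSN"}

def get_leaf_nodes(dom):
    # Return the leaf elements of the DOM tree that carry visible text.
    return [el for el in dom.iter('*') if len(el) == 0 and (el.text or "").strip()]

def get_full_xpath(tree, element):
    # Absolute XPath of an element; lxml can compute this directly.
    return tree.getpath(element)

def convert_xpath_to_selector(xpath):
    # Naively turn /html/body/div[2]/span into html > body > div:nth-of-type(2) > span.
    parts = []
    for step in xpath.strip("/").split("/"):
        if "[" in step:
            tag, index = step.rstrip("]").split("[")
            parts.append(tag + ":nth-of-type(" + index + ")")
        else:
            parts.append(step)
    return " > ".join(parts)

def extract_keyword_selectors(page_source):
    # Find leaf text nodes matching domain keywords and return their selectors,
    # which can then be reused on similarly structured journal pages.
    dom = html.fromstring(page_source)
    tree = dom.getroottree()
    selectors = {}
    for leaf in get_leaf_nodes(dom):
        text = leaf.text.strip()
        if any(kw.lower() in text.lower() for kw in DOMAIN_KEYWORDS):
            selectors[text] = convert_xpath_to_selector(get_full_xpath(tree, leaf))
    return selectors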

Result & Analysis
The experiment was conducted on a machine having an Intel Core 2 Duo T5870 @ 2.0 GHz with 3 GB of RAM, running the Windows 7 OS. Tests were performed using a WAMP server equipped with PHP v5. Figures 6 and 7 show the performance measures of the proposed deep web crawler "DWC" for different quantities of query words received by the QIIIEP server, which are used to fetch content from the deep web sites. From the analysis of these graphs it can be seen that, as the number of query words received for a specific domain increases, the precision, recall and F-measure of information extraction continuously increase. So it can be concluded that once a sufficient knowledge base has been collected at the QIIIEP server, the results clearly improve.
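For reference, the precision, recall and F-measure reported in the graphs can be computed in the standard way from the counts of correctly extracted, spurious and missed records; this generic snippet is not the paper's evaluation code.

# Standard definitions of the measures plotted in Figures 6 and 7.
def precision_recall_f_measure(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure

# Example: 90 correctly extracted records, 5 spurious, 10 missed.
print(precision_recall_f_measure(90, 5, 10))   # (0.947..., 0.9, 0.923...)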

Conclusion
The presence of high-quality information in deep web pages, which are dynamically generated by embedding structured data records in server-side templates, has led many researchers to design techniques for automatic extraction of such data. A limitation of the existing systems is their dependency on the structure of HTML pages for inducing wrappers. Our proposed research addresses this challenge.
This system enables integration of journal information from multiple publishers' websites and allows the user to get complete information about journals in a single interaction with the system. The system can be used to extract data records from websites where the attribute label is present explicitly. This is not always the case, since some websites, such as social discussion forums, do not have attribute labels mentioned explicitly.