Michel Banâtre,
Valérie Issarny,
Frédéric Leleu,
Boris Charpiot
IRISA/
INRIA, Campus de Beaulieu, 35042 Rennes Cédex, France
{banatre, issarny, leleu, charpiot}@irisa.fr
The long term success of the World Wide Web depends on the provided quality of service. Users need to be confident about the quality of the information that they are accessing and to have the selected information displayed within a short interval. This paper gives an overview of the ETEL newspaper-based distributed information system, subset of the World Wide Web, that meets the above requirement. The ETEL system guarantees relevance of the information through moderation by a newspaper editor. It further ensures responsiveness through the provision of dedicated system policies, implementing in particular predictive prefetching.
The long-term success of the World Wide Web (Web for short) depends on the the provided Quality of Service which may be evaluated according to two types of criteria:
In the framework of the ETEL project that is conducted at IRISA/INRIA in close collaboration with a major French newspaper editor, we are currently examining solutions towards ensuring both content and access quality over the Web. Our approach relies on the provision of a newspaper-based distributed information system that is a subset of the Web.
Press editor has a long experience in providing quality information in terms of both its relevance and its presentation. In particular, it is recognized that quality newspapers give a serious account of the news and reports. Therefore, content quality in the ETEL system is dealt with through the production of information according to newspapers standards. As regards to access quality, our approach relies on the design and implementation of dedicated system policies that together guarantee the system's responsiveness.
This paper is structured as follows. The next section gives an overview of the ETEL system, introducing its architecture. Then, we detail our approach towards ensuring content and access quality over the Web: we address production of the information according to newspapers standards, and we introduce system policies aimed at ensuring responsiveness. Finally, we conclude by summarizing our contribution and discussing our current and future work.
The entry point of the ETEL distributed information system is an electronic newspaper service from which the user has access to the usual content of a newspaper as well as to a set of plugged-in interactive multimedia services that are either ETEL- or Internet-provided. As common in press editing, the content of the information system, including accessible plugged-in services, is moderated by professional editors.
The ETEL architecture is depicted in figure 1. It is composed of a set of end-users terminals which can be PCs as well as Network Computers connected to the ETEL system via Internet. The ETEL system itself runs on a set of standard workstations interconnected by an ATM network, and gives access to ETEL-selected Web servers.
Figure 1: The ETEL Architecture
In order to provide a distributed information system ensuring content quality, the ETEL electronic newspaper service has been designed so as to exhibit the following features:
From the standpoint of edition production, our solution relies on the subdivision of data structures into two groups: (i) the logical data structure embedding the core information that is common to both electronic and paper editions, and (ii) the physical data structure that is specific to each type of editions and that describes how core information is laid out. Thus, with the physical data structure associated to the electronic edition, we are able to build automatically an electronic edition from the edition's logical data that are also used for the construction of the paper version. Electronic editions are built using the PDF format. More precisely, the construction of the electronic editions relies on the use of two software tools:
Figure 2: A Summary page
The electronic edition enriches the paper version through the provision of edition customization with respect to the user's profile (e.g., see [2]). An electronic edition is organized as a set of pages (i.e, PDF files) where each page embeds either a summary or an article. A summary page corresponds to a page of the paper version except that its content is customized with respect to the user's profile and that every article is summarized into the article's title and photograph (if any), and a hyperlink to the corresponding article. Every summary page further integrates customized hyperlinks to thematic headings according to the user's profile (see figure 2).
In addition to the presentation of articles, the electronic newspaper service gives access to a set of interactive multimedia services. As mentioned previously, these include ETEL-provided services derived from the general information embedded in usual newspapers (e.g., theatre and TV programs, weather forecast). The system further offers encyclopedia-like services that exploit the information base of the newspaper editor. In particular, such services allow end-users to query for customized thematic newspaper-like documents.
Another type of offered services results from the integration of a gateway to the Web. However, in order to guarantee the quality of the provided information, ETEL provides only access to a set of Web servers that are selected by the newspaper editor.
To summarize, the ETEL system gives access to a structured hyperdocument made of a collection of files which are provided either by the newspaper service or plugged-in services. The next section discusses our approach towards ensuring quality access to the ETEL hyperdocument.
In order to guarantee access quality to the ETEL users, we are designing dedicated system policies that are based on the users' profiles in terms of the accessed files. More precisely, we are focusing on the design of the two following policies:
Up to now, we have been concentrating on the design and evaluation of the profile-based predictive prefetching policy. Concerning the profile-based load balancing policy, we are currently examining the use of different data analysis methods [1, 11] for identifying user groups. In the remainder, after providing an overview of predictive prefetching, we detail the profile-based predictive prefetching policy together with its evaluation through simulation experiments.
A predictive prefetching policy aimed at an information system relies on the definition of the following components:
The design of a predictive prefetching policy in the framework of the Web has been previously addressed in [7, 4, 13]. The proposal that is the closest to ours is the one detailed in [13] which undertakes prefetching decisions based on individual access patterns whereas other existing solutions do not differentiate accesses originating from different clients (or users) when recording performed accesses.
The proposal of [13] relies on the use of the extended HTTP protocol, described in [12]. The resulting protocol uses a single long-lived TCP connection for multiple HTTP transactions. The proposed prefetching policy manages a set of frequency graphs on the Web server where each graph reflects the access pattern to that server for a given client during one session (i.e., long-lived TCP connection). Concentrating on one graph, each node corresponds to a previously accessed file and there is an arc from node A to node B when B is accessed from A, the arc being weighed with the corresponding access frequency. Considering a user that is accessing a given file of the Web server, the server is able to identify the files that are likely to be accessed next by means of the frequency graph associated to the user's session. The server then notifies this set of files to the user that decides to prefetch the files based on local environmental factors such as the content of the local cache or the current system load. Results of the evaluation of the proposed prefetching policy through simulation show that the access time to Web files can be significantly reduced. However, this is at the expense of an increase in the amount of data transferred from the server to the clients, and hence in the network traffic. More precisely, the resulting traffic increase is conversely proportional to the prefetching accuracy.
In the above policy, frequency graphs are not permanent data structures; the lifetime of a frequency graph equals the one of a session. In the framework of a newspaper-based information system, it is our belief that each user will exhibit similar access patterns over distinct connections to the system due to the ETEL hyperdocument's structure that is invariant. This leads us to manage a permanent frequency graph, hence getting a frequency graph whose accuracy improves over time. Furthermore, since the frequency graph is user-specific, predictive prefetching can be fully managed on the user site, which discharges the server from this task. In addition, this allows to guarantee privacy of the information regarding the user's profile. The next paragraph details the key components of profile-based predictive prefetching that builds upon the above points. Notice that the invariant structure which is assumed for the hyperdocument that is made available to the ETEL users may not hold for documents that are accessible through plugged-in Web servers. In this case, profile-based predictive prefetching should be complemented with the implementation of a predictive prefetching policy, similar to the one proposed in [13], within Web servers that export unstable documents.
For keeping track of the access pattern of a user, profile-based predictive prefetching relies on a frequency graph that is close to the one proposed in [13]. Furthermore, as previously shown in [8], prediction accuracy can be improved if the frequency of accessing a file is given with respect to the preceding sequence of accesses and not just the currently accessed file. We therefore base prediction decisions on a patterned frequency graph that contains a path for each sequence of accesses. However, in order to minimize the size of the graph, each path that corresponds to the re-access to a 2-nodes sub-sequence is folded leading to introduce the corresponding cycle. The fact that a re-access is identified at the granularity of 2 nodes is due to both the structure of the ETEL hyperdocument and the goal of minimizing the graph size. However, a similar solution can be undertaken for longer sub-sequences if required by the hyperdocument structure. For illustration, a patterned frequency graph is depicted in figure 3 where weighs are omitted. In the figure, paths made of shaded nodes correspond to re-accesses and hence are folded, leading to introduce the bold arcs.
Figure 3: A patterned frequency graph
Given a patterned frequency graph, the access prediction for a node is computed according to the weighs (i.e., access frequencies) of the arcs leaving that node: the greater is the access frequency, the more the pointed node is likely to be accessed. More precisely, a prediction consists of a sequence of file identifiers that are ordered according to the decreasing order of the corresponding access frequencies. The shortcoming of using access frequencies for predicting future accesses is that it may take time to capture changes in the user's access pattern. A change in the access pattern is taken into account in the access prediction only when the order of the access frequencies becomes modified. In order to detect earlier access pattern changes in the prediction process, we use a method based on statistical quality-control schemes [14]. Briefly stated, statistical quality-control schemes are aimed at both controlling that a fabrication process is stable with respect to the required quality specification, and reducing the number of products that do not conform with the product's specification. In our framework, statistical control charts allow to detect that an access pattern deviates from the one reflected by the patterned frequency graph by associating an EWMA control chart to each arc [6, 10]. A change in the access pattern is then taken into account through the following prediction algorithm [9]. If there is no deviation detected using the EWMA charts, the access prediction for a file A is computed according to the weighs of the arcs leaving A within the patterned frequency graph. On the other hand, if at least one EWMA chart indicates a deviation, the files that are accessible from A are ordered according to the decreasing order of detected deviations.
The prefetching algorithm consists of anticipating the user's access requests according to the computation of predictions. Given the prediction computed for the file that is currently accessed, the prediction is read sequentially from the first element up to at most the size of the user's cache in terms of file entries. On each iteration step, the file that is given by the prediction's element, is looked for within the user cache. If the file is absent then a request for the corresponding file is issued to the ETEL system while no action is taken otherwise. Notice that the increase in the network traffic resulting from predictive prefetching is proportional to the number of elements of a prediction that are prefetched. In particular, if the computation of predictions is known to be highly accurate then the best solution from the standpoint of the resulting network traffic is to prefetch only the first element of the prediction. Given the use of EWMA charts, we are able to detect changes in the access pattern and hence that the accuracy of predictions is lowering. Therefore, our approach consists of prefetching only the first element of the prediction while the user's access pattern remains stable. On the other hand, a larger subset of the prediction is prefetched when the user's access pattern is changing, the size of the subset depending on the detected deviations.
The predictive prefetching policy interacts with the event manager (e.g., Netscape with an ETEL plug-in) running on the user site and that handles the user's access requests. The following concurrent actions are undertaken upon each request:
The actual benefit of using predictive prefetching depends on a number of parameters, including the accuracy of the prediction algorithm, the user's cache size together with the network bandwidth, the number of files that are accessible from one file, and the average time the user takes for reading a file. In the following, we propose an evaluation of profile-based predictive prefetching through simulation using the QNAP (Queuing Network Analysis Package) tool. We model the system through the following modules:
Furthermore, based on the features of the ETEL hyperdocument, the average size of a file is equal to 80 Kbytes.
Figure 4: Hyperdocument graph
For simulation experiments, we have considered the hyperdocument graph depicted in figure 4. We have generated 4000 sessions (i.e. connections to the ETEL system), 8 accesses being performed during each session. Furthermore, access patterns over distinct sessions are repeated a number of times varying randomly from 1 to 30, and then randomly changed. Concerning the computation of predictions, simulations have been run using the two following types of frequency graph: the patterned frequency graph and the hyperdocument graph with arcs weighed with the corresponding access frequencies. Notice that the latter graph, qualified as weighed hyperdocument graph in the remainder, corresponds to the type of frequency graph used in the predictive prefetching policy proposed for the Web in [13]. Finally, predictions based on the patterned frequency graph have been computed using either the graph only or the graph in conjunction with EWMA control charts.
Figure 5: Average response time
Figure 5 gives the average response time per file according to the network bandwidth where it is assumed that only 10% of the network bandwidth is available for the ETEL user. Results show that a requested file can be made available to the user in less than 1 second, hence providing a highly responsive service.
Figure 6: Prediction accuracy
Figure 6 evaluates prediction accuracy by giving the rank of the actually accessed file in the prediction over the total number of predictions (i.e., accesses). When the patterned frequency graph is used together with EWMA control charts, most of the actually accessed files appear first in the prediction. As a result, increase in the network traffic due to predictive prefetching may be negligible. Furthermore, the fact that some of the computed predictions are less accurate is due to the time taken for learning the user's access pattern within the prediction algorithm. It follows that extra-accesses occur only during the user's very first connections and when the user's access pattern is changing.
The proposed results show that profile-base predictive prefetching allows to significantly reduce the time taken for making a requested file available to the user. In addition, it is worth mentioning that the increase in the network traffic is negligible although we have run our simulation experiments on a frequently changing access pattern.
Much work has been undertaken in the Web community to ensure responsiveness (e.g., see [5, 3, 7, 4]). However, to our knowledge, the design of a policy based on individual access behaviors has only been previously examined in [13]. Compared to our work, this proposal considers the individual behavior that is exhibited during the lifetime of a session while we take into account behaviors exhibited over the distinct connections to the ETEL system. From our point of view, our approach allows to approximate more closely the resources that are actually needed to serve user requests since the accuracy of the information relating to users' behaviors is highly proportional to the number of observations. Furthermore, the policy introduced in [13] is mostly implemented on the server site. On the other hand, the prefetching policy that we propose is fully implemented on the user site, hence discharging the server from this task.
The design of the ETEL system has begun in October 1995. A prototype demonstrating the automatic construction of electronic editions, and the presentation of the editions on the user's display is now operational. We are currently integrating the profile-based predictive prefetching policy that has been discussed in this paper. We will then focus on the implementation of a profile-based load balancing strategy that complements the predictive prefetching policy from the standpoint of the scalability issue.