• The World Wide Web (WWW) is a repository of information linked together from points all over the world.

  • The WWW has a unique combination of flexibility, portability, and user-friendly features that distinguish it from other services provided by the Internet.

  • The WWW project was initiated by CERN (European Laboratory for Particle Physics) to create a system to handle distributed resources necessary for scientific research.

  • We first discuss issues related to the Web. We then discuss a protocol, HTTP, that is used to retrieve information from the Web.

  • The WWW today is a distributed client/server service, in which a client using a browser can access a service using a server. However, the service provided is distributed over many locations called sites.

  • Architecture of WWW
  • Each site holds one or more documents, referred to as Web pages. Each Web page can contain a link to other pages in the same site or at other sites. The pages can be retrieved and viewed by using browsers.

  • Let us go through the scenario shown in Figure below. The client needs to see some information that it knows belongs to site A.

  • It sends a request through its browser, a program that is designed to fetch Web documents. The request, among other information, includes the address of the site and the Web page, called the URL.

  • The server at site A finds the document and sends it to the client. When the user views the document, she finds some references to other documents, including a Web page at site B.

  • The reference has the URL for the new site. The user is also interested in seeing this document. The client sends another request to the new site, and the new page is retrieved.

  • A variety of vendors offer commercial browsers that interpret and display a Web document, and all use nearly the same architecture.

  • Each browser usually consists of three parts: a controller, client protocol, and interpreters.

  • The controller receives input from the keyboard or the mouse and uses the client programs to access the document. After the document has been accessed, the controller uses one of the interpreters to display the document on the screen.

  • The client protocol can be one of the protocols described previously such as FTP or HTTP.

  • The interpreter can be HTML, Java, or JavaScript, depending on the type of document.

  • Server : The Web page is stored at the server. Each time a client request arrives, the corresponding document is sent to the client.

  • To improve efficiency, servers normally store requested files in a cache in memory; memory is faster to access than disk.

  • A server can also become more efficient through multithreading or multiprocessing.

  • In this case, a server can answer more than one request at a time.
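
  • The two efficiency techniques above (an in-memory cache and multithreading) can be sketched together. This is only an illustration under stated assumptions, not a real Web server: `DISK`, `load_document`, `handle_request`, and `serve_all` are hypothetical names, a dictionary stands in for the disk, and Python's standard `functools.lru_cache` and `ThreadPoolExecutor` stand in for the server's cache and worker threads.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical "disk": maps a path to the stored Web page.
DISK = {
    "/index.html": b"<html>home</html>",
    "/about.html": b"<html>about</html>",
}
disk_reads = []  # records every simulated disk access


@lru_cache(maxsize=128)
def load_document(path):
    """Read a document from the simulated disk; repeat calls hit the cache."""
    disk_reads.append(path)
    return DISK.get(path)


def handle_request(path):
    """Return (status code, body) for one client request."""
    body = load_document(path)
    if body is None:
        return 404, b""
    return 200, body


def serve_all(paths, workers=4):
    """Answer several requests at a time using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_request, paths))
```

  • After the first request for a path, later requests for the same path are answered from memory without touching the simulated disk.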

  • browser

Uniform Resource Locator

  • A client that wants to access a Web page needs the address. To facilitate the access of documents distributed throughout the world, HTTP uses locators.

  • The uniform resource locator (URL) is a standard for specifying any kind of information on the Internet. The URL defines four things: protocol, host computer, port, and path.

  • The protocol is the client/server program used to retrieve the document.

  • Many different protocols can retrieve a document; among them are FTP and HTTP. The most common today is HTTP.

  • The host is the computer on which the information is located, although the name of the computer can be an alias.

  • Web pages are usually stored in computers, and computers are given alias names that usually begin with the characters "www".

  • This is not mandatory, however, as the host can be any name given to the computer that hosts the Web page. The URL can optionally contain the port number of the server.

  • If the port is included, it is inserted between the host and the path, and it is separated from the host by a colon.

  • Path is the pathname of the file where the information is located. Note that the path can itself contain slashes that, in the UNIX operating system, separate the directories from the subdirectories and files.
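
  • The four parts of a URL can be picked apart with Python's standard `urllib.parse.urlsplit`. The sketch below is illustrative only: `describe_url` is a hypothetical helper and the example address is made up.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the four parts named in the text."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # e.g. "http" or "ftp"
        "host": parts.hostname,     # computer (or alias) holding the document
        "port": parts.port,         # None when the URL omits the port
        "path": parts.path,         # pathname of the file on the host
    }


# Made-up address with an explicit port between the host and the path.
info = describe_url("http://www.example.com:8080/docs/guide/index.html")
```

  • When the port is omitted, `port` is `None` and the client would fall back to the protocol's well-known port.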

  • The World Wide Web was originally designed as a stateless entity. A client sends a request; a server responds. Their relationship is over. The original design of WWW, retrieving publicly available documents, exactly fits this purpose.

  • Creation and Storage of Cookies

    1. When a server receives a request from a client, it stores information about the client in a file or a string.

    2. The information may include the domain name of the client, the contents of the cookie (information the server has gathered about the client such as name, registration number, and so on), a timestamp, and other information, depending on the implementation.

    3. The server includes the cookie in the response that it sends to the client.

    4. When the client receives the response, the browser stores the cookie in the cookie directory, which is sorted by the domain server name.
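
  • As a rough sketch of step 3, a server could build the cookie line for its response with Python's standard `http.cookies.SimpleCookie`. The helper name `make_set_cookie_header`, the cookie name and value, and the domain are all hypothetical, and using `Max-Age` as the time-related field is an assumption of this sketch rather than something the text prescribes.

```python
from http.cookies import SimpleCookie


def make_set_cookie_header(name, value, domain, max_age_seconds):
    """Return the header line a server adds to its response to set a cookie."""
    cookie = SimpleCookie()
    cookie[name] = value                         # contents of the cookie
    cookie[name]["domain"] = domain              # domain the cookie belongs to
    cookie[name]["max-age"] = max_age_seconds    # lifetime in seconds (assumed field)
    return cookie.output()                       # "Set-Cookie: name=value; ..."


# Hypothetical registration number stored for a returning client.
header_line = make_set_cookie_header("registration", "12343", "www.example.com", 3600)
```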

  • Using Cookies :

    1. When a client sends a request to a server, the browser looks in the cookie directory to see if it can find a cookie sent by that server.

    2. If found, the cookie is included in the request. When the server receives the request, it knows that this is an old client, not a new one.

    3. Note that the browser does not interpret the contents of the cookie; it only stores the cookie and returns it to the server that created it. It is a cookie made by the server and eaten by the server.

    4. Now let us see how a cookie is used for four common purposes:

       1. A site that restricts access to registered clients sends a cookie to the client only when the client registers for the first time. For any repeated access, only those clients that send the appropriate cookie are allowed.

       2. An electronic store (e-commerce) can use a cookie for its client shoppers. When a client selects an item and inserts it into a cart, a cookie that contains information about the item, such as its number and unit price, is sent to the browser. If the client selects a second item, the cookie is updated with the new selection information, and so on. When the client finishes shopping and wants to check out, the last cookie is retrieved and the total charge is calculated.

       3. A Web portal uses the cookie in a similar way. When a user selects her favorite pages, a cookie is made and sent. If the site is accessed again, the cookie is sent to the server to show what the client is looking for.

       4. A cookie is also used by advertising agencies. An advertising agency can place banner ads on some main website that is often visited by users. The advertising agency supplies only a URL that gives the banner address instead of the banner itself. When a user visits the main website and clicks on the icon of an advertised corporation, a request is sent to the advertising agency. The advertising agency sends the banner, a GIF file, for example, but it also includes a cookie with the ID of the user. Any future use of the banners adds to the database that profiles the Web behavior of the user. The advertising agency has compiled the interests of the user and can sell this information to other parties. This use of cookies has made them very controversial. Hopefully, some new regulations will be devised to preserve the privacy of users.
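
  • The lookup described in the first two steps of this list can be modeled with a small in-memory structure. This is a toy sketch, not how any particular browser stores cookies: `CookieDirectory` and `server_sees_returning_client` are hypothetical names, and real browsers also apply path, expiry, and security rules that are left out here.

```python
class CookieDirectory:
    """Toy model of the browser's cookie directory, keyed by server domain."""

    def __init__(self):
        self._by_domain = {}

    def store(self, domain, name, value):
        """Save (or update) a cookie received from the given server."""
        self._by_domain.setdefault(domain, {})[name] = value

    def request_headers(self, domain):
        """Return the headers to add to a request sent to this server."""
        cookies = self._by_domain.get(domain)
        if not cookies:
            return {}  # no cookie from this server: nothing to include
        pairs = "; ".join(f"{name}={value}" for name, value in cookies.items())
        return {"Cookie": pairs}


def server_sees_returning_client(headers):
    """The server treats a request that carries a cookie as an old client."""
    return "Cookie" in headers
```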

  • The Hypertext Transfer Protocol (HTTP) is a protocol used mainly to access data on the World Wide Web.

  • HTTP functions as a combination of FTP and SMTP. It is similar to FTP because it transfers files and uses the services of TCP.

  • However, it is much simpler than FTP because it uses only one TCP connection.

  • There is no separate control connection; only data are transferred between the client and the server.

  • HTTP is like SMTP because the data transferred between the client and the server look like SMTP messages.

  • In addition, the format of the messages is controlled by MIME-like headers.

  • Unlike SMTP, the HTTP messages are not destined to be read by humans; they are read and interpreted by the HTTP server and HTTP client (browser).

  • SMTP messages are stored and forwarded, but HTTP messages are delivered immediately.

  • The commands from the client to the server are embedded in a request message.

  • The contents of the requested file or other information are embedded in a response message. HTTP uses the services of TCP on well-known port 80.

  • HTTP Transaction : Figure below illustrates the HTTP transaction between the client and server. Although HTTP uses the services of TCP, HTTP itself is a stateless protocol. The client initializes the transaction by sending a request message. The server replies by sending a response.

  • Messages : The formats of the request and response messages are similar; both are shown in Figure below. A request message consists of a request line, a header, and sometimes a body. A response message consists of a status line, a header, and sometimes a body.
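
  • The shared layout of the two message types can be sketched as plain text assembly. The helper names `build_request` and `build_response` are hypothetical. Two details come from the HTTP/1.1 specification rather than from the text above: lines end with CRLF, and an empty line separates the header from the body.

```python
CRLF = "\r\n"  # line terminator used by HTTP/1.1 (from the specification)


def build_request(method, url, headers, body=""):
    """Assemble a request: request line, header lines, blank line, body."""
    request_line = f"{method} {url} HTTP/1.1"
    header_lines = [f"{name}: {value}" for name, value in headers]
    return CRLF.join([request_line, *header_lines, "", body])


def build_response(status_code, phrase, headers, body=""):
    """Assemble a response: status line, header lines, blank line, body."""
    status_line = f"HTTP/1.1 {status_code} {phrase}"
    header_lines = [f"{name}: {value}" for name, value in headers]
    return CRLF.join([status_line, *header_lines, "", body])
```

  • Only the first line differs between the two functions, which mirrors the point that the request and response formats are similar.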

  • HTTP transaction
  • Request and Status Lines : The first line in a request message is called a request line; the first line in the response message is called the status line. There is one common field, as shown in Figure below.

  • Request and status lines

  • Request type : This field is used in the request message. In version 1.1 of HTTP, several request types are defined. The request type is categorized into methods as defined in Table below.

  • Method    Action
    GET       Requests a document from the server
    HEAD      Requests information about a document but not the document itself
    POST      Sends some information from the client to the server
    PUT       Sends a document from the client to the server
    TRACE     Echoes the incoming request
    CONNECT   Reserved
    OPTIONS   Inquires about available options
  • URL : We discussed the URL earlier in the chapter.

  • Version : The version of HTTP described here is 1.1. Newer versions (HTTP/2 and HTTP/3) have since been standardized.

  • Status code : This field is used in the response message.

  • The status code field is similar to those in the FTP and the SMTP protocols.

  • It consists of three digits. Whereas the codes in the 100 range are only informational, the codes in the 200 range indicate a successful request.

  • The codes in the 300 range redirect the client to another URL, and the codes in the 400 range indicate an error at the client site. Finally, the codes in the 500 range indicate an error at the server site. We list the most common codes in Table below.

  • Status phrase : This field is used in the response message. It explains the status code in text form. Table below also gives the status phrase.

Code - Phrase                  Description
100 - Continue                 The initial part of the request has been received, and the client may continue with its request.
101 - Switching protocols      The server is complying with a client request to switch protocols defined in the Upgrade header.
200 - OK                       The request is successful.
201 - Created                  A new URL is created.
202 - Accepted                 The request is accepted, but it is not immediately acted upon.
204 - No content               There is no content in the body.
301 - Moved permanently        The requested URL is no longer used by the server.
302 - Moved temporarily        The requested URL has moved temporarily.
304 - Not modified             The document has not been modified.
400 - Bad request              There is a syntax error in the request.
401 - Unauthorized             The request lacks proper authorization.
403 - Forbidden                Service is denied.
404 - Not found                The document is not found.
405 - Method not allowed       The method is not supported in this URL.
406 - Not acceptable           The format requested is not acceptable.
500 - Internal server error    There is an error, such as a crash, at the server site.
501 - Not implemented          The action requested cannot be performed.
503 - Service unavailable      The service is temporarily unavailable, but may be requested in the future.
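
  • The meaning of the hundreds digit can be captured in a few lines. The sketch below is illustrative: `parse_status_line` and `classify_status` are hypothetical helpers, and the category labels are shorthand for the ranges described above.

```python
def parse_status_line(line):
    """Split a status line into (version, status code, status phrase)."""
    version, code, phrase = line.split(" ", 2)  # the phrase may contain spaces
    return version, int(code), phrase


def classify_status(code):
    """Map a three-digit status code to the range it belongs to."""
    if 100 <= code < 200:
        return "informational"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    raise ValueError(f"not a valid HTTP status code: {code}")
```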

  • Header : The header exchanges additional information between the client and the server. For example, the client can request that the document be sent in a special format, or the server can send extra information about the document. The header can consist of one or more header lines. Each header line has a header name, a colon, a space, and a header value (see Figure below). We will show some header lines in the examples at the end of this chapter. A header line belongs to one of four categories: general header, request header, response header, and entity header. A request message can contain only general, request, and entity headers. A response message, on the other hand, can contain only general, response, and entity headers.
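
  • A header line and the category rule above can be sketched as follows. The names `parse_header_line`, `HEADER_CATEGORY`, `ALLOWED`, and `is_allowed` are hypothetical, and the category table is deliberately tiny: it holds only a handful of standard headers, so any header missing from it is treated as not allowed by this sketch.

```python
def parse_header_line(line):
    """Split 'Name: value' at the first colon into (name, value)."""
    name, sep, value = line.partition(":")
    if not sep:
        raise ValueError(f"header line has no colon: {line!r}")
    return name.strip(), value.strip()


# Small illustrative subset of headers and their categories.
HEADER_CATEGORY = {
    "cache-control": "general",
    "connection": "general",
    "date": "general",
    "accept": "request",
    "server": "response",
    "content-length": "entity",
}

# Which header categories each kind of message may contain.
ALLOWED = {
    "request": {"general", "request", "entity"},
    "response": {"general", "response", "entity"},
}


def is_allowed(header_name, message_kind):
    """Check whether a known header may appear in a request or a response."""
    category = HEADER_CATEGORY.get(header_name.lower())
    return category in ALLOWED[message_kind]
```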

  • Request and status lines
  • General header : The general header gives general information about the message and can be present in both a request and a response. Table below lists some general headers with their descriptions.

  • General headers   Description
    Cache-control     Specifies information about caching
    Connection        Shows whether the connection should be closed or not
    Date              Shows the current date
    MIME-version      Shows the MIME version used
    Upgrade           Specifies the preferred communication protocol
  • Request header : The request header can be present only in a request message. It specifies the client's configuration and the client's preferred document format. See Table below for a list of some request headers and their descriptions.

  • request-header
  • Response header : The response header can be present only in a response message. It specifies the server's configuration and special information about the request. See Table below for a list of some response headers with their descriptions.

  • response-header
  • Entity header : The entity header gives information about the body of the document. Although it is mostly present in response messages, some request messages, such as POST or PUT methods, that contain a body also use this type of header. See Table below for a list of some entity headers and their descriptions.

  • entity-header
  • Body : The body can be present in a request or response message. Usually, it contains the document to be sent or received.

  • The example below retrieves a document. We use the GET method to retrieve an image with the path /usr/bin/image/. The request line shows the method (GET), the URL, and the HTTP version (1.1). The header has two lines that show that the client can accept images in the GIF or JPEG format. The request does not have a body. The response message contains the status line and four lines of header. The header lines define the date, server, MIME version, and length of the document. The body of the document follows the header.

  • http-mechanisms
  • In the example below, the client wants to send data to the server. We use the POST method. The request line shows the method (POST), URL, and HTTP version (1.1). There are four lines of headers. The request body contains the input information. The response message contains the status line and four lines of headers. The created document, which is a CGI document, is included as the body.
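
  • A POST request with a body might be assembled roughly as follows. This sketch does not reproduce the figure: the helper `build_post_request`, the URL, the host, the form fields, and the choice of headers are all assumptions made for illustration. It shows only that the input information travels in the body and that its length is announced in an entity header.

```python
from urllib.parse import urlencode


def build_post_request(url, host, form_fields):
    """Return the bytes of a POST request carrying form data in its body."""
    body = urlencode(form_fields).encode("ascii")  # e.g. b"name=Ada&id=7"
    head = "\r\n".join([
        f"POST {url} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body)}",  # length of the body in bytes
        "",  # blank line that ends the header
        "",
    ])
    return head.encode("ascii") + body


# Hypothetical form submission.
message = build_post_request("/cgi-bin/form", "www.example.com", {"name": "Ada", "id": "7"})
```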

  • http-mechanisms

  • HTTP prior to version 1.1 specified a nonpersistent connection, while a persistent connection is the default in version 1.1.

  • Nonpersistent Connection

  • In a nonpersistent connection, one TCP connection is made for each request/response.

  • The following lists the steps in this strategy:

    1. The client opens a TCP connection and sends a request.

    2. The server sends the response and closes the connection.

    3. The client reads the data until it encounters an end-of-file marker; it then closes the connection.

  • In this strategy, for N different pictures in different files, the connection must be opened and closed N times. The nonpersistent strategy imposes high overhead on the server because the server needs N different buffers and requires a slow start procedure each time a connection is opened.

  • Persistent Connection

  • HTTP version 1.1 specifies a persistent connection by default. In a persistent connection, the server leaves the connection open for more requests after sending a response. The server can close the connection at the request of a client or if a time-out has been reached.
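
  • The difference between the two strategies can be sketched as a small decision function. `connection_stays_open` and `connections_needed` are hypothetical helpers; the sketch follows the simplified rule in the text (version 1.1 is persistent unless a close is requested, earlier versions are not) and does not model time-outs.

```python
def connection_stays_open(http_version, headers):
    """Decide whether the server keeps the TCP connection open after a response."""
    wants_close = headers.get("Connection", "").lower() == "close"
    if http_version == "HTTP/1.1":
        return not wants_close  # persistent by default, closed on request
    return False                # earlier versions: one connection per request


def connections_needed(num_files, persistent):
    """Count the TCP connections opened to fetch num_files documents."""
    if persistent and num_files > 0:
        return 1
    return num_files
```

  • For a page with N embedded pictures, the nonpersistent strategy opens N connections while the persistent one reuses a single connection.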

  • The sender usually sends the length of the data with each response. However, there are some occasions when the sender does not know the length of the data.

  • This is the case when a document is created dynamically or actively. In these cases, the server informs the client that the length is not known and closes the connection after sending the data so the client knows that the end of the data has been reached.
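
  • The two ways of finding the end of the data can be sketched with an in-memory stream standing in for the TCP connection. `read_body` is a hypothetical helper, and the use of the `Content-Length` header name to carry the length is an assumption of this sketch. A real socket may return fewer bytes than requested per read, which is ignored here.

```python
import io


def read_body(stream, headers):
    """Return (body, connection_can_stay_open) for one response."""
    length = headers.get("Content-Length")
    if length is not None:
        # Length known: read exactly that many bytes and keep the connection.
        return stream.read(int(length)), True
    # Length unknown: the end of the data is signaled by the connection closing.
    return stream.read(), False


# The first response announces its length; another message follows it.
stream = io.BytesIO(b"helloNEXT")
body, reusable = read_body(stream, {"Content-Length": "5"})
```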

  • Proxy Server

  • HTTP supports proxy servers. A proxy server is a computer that keeps copies of responses to recent requests.

  • The HTTP client sends a request to the proxy server. The proxy server checks its cache. If the response is not stored in the cache, the proxy server sends the request to the corresponding server.

  • Incoming responses are sent to the proxy server and stored for future requests from other clients.

  • The proxy server reduces the load on the original server, decreases traffic, and improves latency.

  • However, to use the proxy server, the client must be configured to access the proxy instead of the target server.
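
  • The proxy's check-the-cache-first behavior can be sketched as follows. `CachingProxy` is a hypothetical class, and `fetch_from_origin` is an assumed callable that stands in for sending the request to the target server. The sketch omits expiry and revalidation of cached copies, which a real proxy needs.

```python
class CachingProxy:
    """Toy proxy that keeps copies of responses to recent requests."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # callable: url -> response
        self._cache = {}

    def get(self, url):
        """Serve from the cache when possible; otherwise ask the origin server."""
        if url in self._cache:
            return self._cache[url]      # cache hit: origin is not contacted
        response = self._fetch(url)      # cache miss: forward the request
        self._cache[url] = response      # store for future clients
        return response
```

  • Two clients asking for the same URL cause only one request to reach the original server, which is where the reduced load and lower latency come from.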