Tutorial 17

Accessing the Web with the Speech Toolkit

Although this application uses some Tcl code, you don’t need to know how to program in Tcl. You can easily complete the Tcl portion of this tutorial by copying and pasting a few lines of text provided here.

 

Many popular RAD applications involve retrieving information from the World Wide Web. This tutorial teaches how to build applications that interact with web pages. Using speech, the user will be able to retrieve information from an on-line dictionary. Your application will parse the information and present it to the user.

The World Wide Web is a dynamic entity. Web sites go up and come down every second without notification. When building a spoken language interface to the Web, some care should go into the selection of the site. First, try to select a site whose format and existence you know to be stable. Second, try to select a site where the interesting information is relatively easy to extract from the morass of HTML (Hyper Text Markup Language) tags and surrounding text.

 

 

A little background

The following web site returns the definition of the word you type in the form field.

Using your web browser, take a moment to visit the site.

http://www.aitech.ac.jp/~iteslj/s/ib/wedt.html

Enter the word "England" and hit the "Search" button.

 

[Figure: the dictionary search page, with the form field labeled]

After hitting Search, we see that the site returned a standard dictionary entry for the word "England," complete with part of speech, definition, etc.

 

 

On your web browser, select View -> Source to display the full HTML source code. Every web page is made from source code that looks similar to this. Your web browser interprets this source code to provide the web pages you are used to seeing. The source code includes the text, image names, hyperlinks, and formatting information your browser needs to present the document. Our task is to make a RAD application that interacts with this web page as you just did, separates the relevant information from the HTML source, and presents it to the user. The Tcl language includes various commands that help interact with web documents. The HTTP package provides all HTTP 1.0 client-side capabilities.

How it works

All web browsers have a window that displays the address of the current page. When you submit a form on a web page, as we did when we submitted our search for "England," the information you provide is transmitted to the Web server via this same field. For example, we visited the web page below and typed in "England."

 

The resulting web page address field contained the following text.

Notice that this field contains a standard web address:

http://machaut.uchicago.edu/cgi-bin/wedt_terse.sh

Followed by the text string:

?word=England&searchtype=default&constraint=1

The "?" means we are submitting a query.

The "=" sign connects an identifier with its value (i.e. the word is England).

The "&" symbol joins the identifier/value pairs into one string.

All the information needed to perform the word search is contained in this text string. Our word is "England," the searchtype is "default," and the constraint is 1. If you look closely at the original page, you can see the other choices available for your word search. These options relate to the searchtype and constraint fields. Now we know what kind of query the web site expects. Our RAD application will simply mimic the query string above to retrieve the information we need.
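To make the structure of the query string concrete, here is a minimal sketch in plain Tcl that takes the string apart again. The variable names query and pairs are chosen for illustration only, and lassign assumes Tcl 8.5 or later.

```tcl
# The query string exactly as it appeared in the address field.
set query "?word=England&searchtype=default&constraint=1"

# Drop the leading "?", split on "&" to get the pairs, then split each
# pair on "=" to separate the identifier from its value.
set pairs {}
foreach pair [split [string trimleft $query ?] &] {
    lassign [split $pair =] name value
    lappend pairs $name $value
}

# Print one identifier/value pair per line.
foreach {name value} $pairs {
    puts "$name -> $value"
}
```

The loop prints the three identifier/value pairs described above: word, searchtype, and constraint.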

There is another way to anticipate what items we need to include in the query string. Using your web browser, go back to the first web page we visited, http://www.aitech.ac.jp/~iteslj/s/ib/wedt.html , and select View -> Source. The following text should be displayed.

This is a standard HTML form that lets a web page collect information and submit it to a CGI script on the web server. Relax, we don’t need to learn CGI to get the information we need.

The line that reads <FORM ACTION="http://machaut.uchicago.edu………..

indicates the web address where all queries are sent. We will send our query to this address, followed by our query string.

The line that reads <INPUT TYPE="text" NAME="word" ……..

indicates an important field in the query string. In this case we see that after hitting the SEARCH button a field named "word" must be supplied.

The line that reads <INPUT TYPE="radio" NAME="Searchtype"………

indicates that the field Searchtype will need to be included in the query string.

Finally, the line that reads <SELECT NAME="constraint"…………….

indicates that the field constraint must be included as well.
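The same inspection can be done with Tcl’s regexp command rather than by eye. The sketch below runs against a small made-up form, not the real page source: the example.org address and the exact tag layout are assumptions, and the ni operator assumes Tcl 8.5 or later.

```tcl
# Hypothetical form source, standing in for what View -> Source shows.
set formHtml {
<FORM ACTION="http://example.org/cgi-bin/search.sh" METHOD="GET">
<INPUT TYPE="text" NAME="word">
<INPUT TYPE="radio" NAME="searchtype" VALUE="default">
<SELECT NAME="constraint"><OPTION VALUE="1">1</OPTION></SELECT>
</FORM>
}

# The ACTION attribute of the FORM tag is where queries are sent.
regexp -nocase {<form[^>]*action="([^"]+)"} $formHtml -> action

# Every INPUT or SELECT tag with a NAME attribute is a query field.
# Radio buttons can repeat a name, so each name is kept only once.
set names {}
foreach {match name} [regexp -all -inline -nocase \
        {<(?:input|select)[^>]*name="([^"]+)"} $formHtml] {
    if {$name ni $names} {
        lappend names $name
    }
}

puts "send queries to: $action"
puts "fields: $names"
```

Pattern matching like this is fragile against unusual HTML, so treat it as a quick aid for reading a form, not a general parser.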

That’s it!

By combining the web address indicated by <FORM ACTION…. and the query string above, we can retrieve information from a web site without using a web browser. Kind of like cheating. Enough background, let’s get started.

 

 

Instructions

This lesson describes a country using the definition that the on-line dictionary provides, and asks the user to identify it.

Drag and arrange states onto the canvas so that you have the following setup:

[Figure: canvas layout of the application’s objects]

Using the listbuilder, make a list of countries you are interested in. Rename the list "word." Select "Random" at the bottom left corner of the window.

 

 

 

 

 

In the action object named "get_www_info" insert the following Tcl code. We’ll review the code line by line later, so for now just copy and paste the text to save time and reduce possible errors.

 

#Defines the information passed between your application and the web site

package require http 2.0

set queryString [::http::formatQuery word $word searchtype default constraint 1]

set url [list http://machaut.uchicago.edu/cgi-bin/wedt_terse.sh?$queryString]

# retrieves the web address

set filename temp.html

geturl $url $filename

set fhandle1 [open $filename r]

set html [read $fhandle1]

close $fhandle1

set clean [htmlClean $html]

# gets the text from the word 'DEF' to the end of the document

set text [string range $clean [string last DEF: $clean] end]

# gets the text from the word 'DEF' to the end of the line

set text [string range $text [string last DEF: $text] [string first \n $text]]

# gets the actual text of the definition

set definition [string range $text [string wordend $text [string last nbsp $text]] [string first \n $text]]

 

 

 

In the object named "question" enter the following prompt.

Name the country, it is $definition

In the left recognition port, enter $word

In the right recognition port, enter *any

 

 

In the object named "try-again" enter the following prompt.

Nope, try again

 

 

 

In the object named "that’s_right" enter the following prompt.

That’s right

 

 

 

Make sure that you are on-line, then Build and Run your program.


Let’s review the Tcl code line by line

 

package require http 2.0

Loads the functions provided by Tcl that we’ll use for viewing web content. (See your favorite Tcl manual for other available packages)

set queryString [::http::formatQuery word $word searchtype default constraint 1]

The built-in command "::http::formatQuery" takes as arguments a list of identifier and value pairs and returns a properly formatted HTTP query string (including the "&" and "=" symbols we saw above).

This line saves the string in the variable called "queryString"
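As a quick check that needs no network connection, the sketch below calls ::http::formatQuery directly. The value "New Zealand" is an illustrative assumption, included to show that values are escaped; the exact escape used for the space can differ between versions of the http package.

```tcl
package require http

# Inside the Toolkit, $word comes from the list built earlier;
# here it is set by hand.
set word England
set queryString [::http::formatQuery word $word searchtype default constraint 1]
puts $queryString

# Values containing spaces or punctuation are escaped so that they are
# safe to place in a web address.
set escaped [::http::formatQuery word "New Zealand"]
puts $escaped
```

The first puts prints word=England&searchtype=default&constraint=1, the same string we saw in the browser’s address field, minus the leading "?".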

set url [list http://machaut.uchicago.edu/cgi-bin/wedt_terse.sh?$queryString]

This sets the variable "url" to the web address to which, as we determined, all queries must be sent, followed by the "?" symbol to indicate a query, followed by the query string. Notice the "?" and $queryString at the end of this line.

# retrieves the web address

This is just a comment. All comments begin with the # sign.

set filename temp.html

A file name is needed for the geturl procedure below.

geturl $url $filename

The procedure "geturl" takes as arguments a web address and a file name. It retrieves the web page and stores the HTML source code in the file named $filename.
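The geturl procedure used above is the one this tutorial calls with a web address and a file name. Outside the Toolkit, roughly the same step could be sketched with Tcl’s standard http package. The helper name fetchToFile, the 10-second timeout, and the error handling are assumptions made for illustration, not a description of how the Toolkit’s geturl is written.

```tcl
package require http

# Hypothetical helper (not part of the Toolkit): fetch $url and write the
# response body to $filename, raising an error if the request fails.
proc fetchToFile {url filename} {
    set out [open $filename w]
    # -channel streams the body straight into the open file;
    # -timeout (in milliseconds) keeps a dead site from hanging the program.
    set failed [catch {::http::geturl $url -channel $out -timeout 10000} token]
    close $out
    if {$failed} {
        error "could not contact $url: $token"
    }
    set status [::http::status $token]
    set code [::http::ncode $token]
    ::http::cleanup $token
    if {$status ne "ok" || $code != 200} {
        error "request for $url failed (status=$status, code=$code)"
    }
    return $filename
}
```

A call such as fetchToFile $url temp.html would then stand in for the geturl line. Because web sites come and go, checking the status before parsing the file avoids reading an empty or partial page.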

set fhandle1 [open $filename r]

Opens the file we just created for reading and assigns it a "handle" or "access point" so we can read the contents. (This is how files are read in Tcl)

set html [read $fhandle1]

Reads the file referred to by the file handle and sets the variable named html to the contents of the file. In essence, this line sets the variable named html to the web page source code.

close $fhandle1

Closes the file.
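The open, read, close pattern can be tried without a network connection. The sketch below first writes a small stand-in file and then reads it back; the file name sample_page.html, its contents, and the helper name readFile are assumptions for illustration.

```tcl
# Create a small stand-in for the downloaded page.
set filename sample_page.html
set out [open $filename w]
puts -nonewline $out "<html><body>DEF: a sample definition</body></html>"
close $out

# Hypothetical helper: return the whole contents of a file, making sure
# the file handle is closed even if the read fails.
proc readFile {filename} {
    set fh [open $filename r]
    set failed [catch {read $fh} contents]
    close $fh
    if {$failed} {
        error $contents
    }
    return $contents
}

set html [readFile $filename]
puts $html

# Remove the stand-in file again.
file delete $filename
```

Wrapping the three steps in one procedure keeps the handle from being left open by mistake, which matters in an application that fetches a page on every turn.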

set clean [htmlClean $html]

The Toolkit includes the procedure htmlClean that removes most of the formatting HTML code we see in the web page source but saves all of the text. The result is saved in the variable called "clean," named so because all the excess HTML is stripped out. Try this: Insert the line "puts $clean" after this line and run your application again. Look at the difference between HTML source and cleaned HTML.

 

set text [string range $clean [string last DEF: $clean] end]

Notice the text now stored in the clean variable. It still contains lots of information about part of speech, some header information, and some trailing text. We want to extract just the text of the word definition and we need to find a way to extract it for every word we choose to look up. We can use certain words that always appear as markers and remove the text between the markers. In this case, the definition always appears on the line that starts with "DEF:" and finishes before the next <return>.

This line saves everything from the last place where "DEF:" appears to the end of the entire document.

set text [string range $text [string last DEF: $text] [string first \n $text]]

This line saves the text between the word "DEF:" and the end of the line by searching for the first occurrence of the <return> symbol (written "\n" in Tcl).

set definition [string range $text [string wordend $text [string last nbsp $text]] [string first \n $text]]

This line finally isolates the definition by grabbing the text between the last "nbsp" characters and the first <return> symbol. The result is saved in the variable called "definition."
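The three string range steps can be traced on a small sample. The text assigned to clean below is invented for illustration; the real output of htmlClean for this site may be laid out differently. The final string trim is an addition that is not in the tutorial code.

```tcl
# Hypothetical cleaned text (an assumption, not real htmlClean output).
set clean "WORD: England\nPOS: n.\nDEF:&nbsp a country in Europe\nSee also: Britain\n"

# Step 1: keep everything from the last "DEF:" to the end of the document.
set text [string range $clean [string last DEF: $clean] end]

# Step 2: cut that text off at the first newline, leaving the DEF: line.
set text [string range $text [string last DEF: $text] [string first \n $text]]

# Step 3: start just after the word "nbsp" and stop at the newline.
set definition [string range $text \
    [string wordend $text [string last nbsp $text]] [string first \n $text]]

# The result still carries a leading space and the trailing newline,
# so trim them before using the text in a prompt.
set definition [string trim $definition]
puts $definition
```

On this sample the script prints: a country in Europe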

 

Remember where $definition was used in your RAD application? Check the object named "question." To become proficient at parsing information from web pages, you should familiarize yourself with the regexp and string commands in Tcl. In this exercise, we cleaned the HTML and then parsed the information we needed. However, you might find it useful in some cases to parse the raw HTML first.
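As a taste of the regexp command, here is a sketch that pulls the definition out in a single step. It uses the same invented sample text as an assumption about what the cleaned page looks like. Unlike the string range version, it picks up the first line beginning with "DEF:" instead of the last.

```tcl
# Hypothetical cleaned text (an assumption, not real htmlClean output).
set clean "WORD: England\nPOS: n.\nDEF:&nbsp a country in Europe\nSee also: Britain\n"

# -line makes ^ and $ match at line boundaries and stops "." from
# matching a newline, so (.*) captures the rest of the DEF: line only.
# The optional "&nbsp" and any spaces or tabs after it are skipped.
if {[regexp -line {^DEF:(?:&nbsp)?[ \t]*(.*)$} $clean -> definition]} {
    puts $definition
} else {
    puts "no definition found"
}
```

The else branch gives the application something sensible to do when the site changes its layout and the marker can no longer be found.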

 

Improving on this application

This is a simple application that shows the basics of retrieving web pages. Try making some improvements to this application on your own.

  1. Using the media object, make a pop-up window that says "Searching the web……wait please" and remove the window once the page is retrieved.
  2. Expand the application to include other types of words.
  3. Give the user a choice of what subject to study. For example, have the computer ask "Would you like to take a history, geography or grammar lesson?" The application should have a list of words for each subject.
  4. Have a set of images that get displayed for each word in the search. Hint: Use the variable capability of the media object.