UB Information Technology


UB Google - Central Campus Search Engine

The Google Search Appliance replaces UltraSeek as UB's search engine. This page helps Web developers use the Google Search Appliance service.



FAQ

What is crawling?

Crawling, or web crawling, is the method search engines use to create an index of the web. Pages are downloaded by an automated service, called a robot or a spider, and processed by the search engine to allow fast, centralized searching. When a search engine downloads a page, it generally looks at the hyperlinks on that page and decides whether to follow them based on a set of policies.

What is caching?

Some search engines provide an additional feature called caching. When a search engine caches a site, it copies some or all of that site and stores it locally. Caching allows users to view the content of a site directly through the search engine. This can be helpful if a resource on your website becomes unavailable because of maintenance or another outage.



How To

Prevent a search engine from crawling your site

There are two methods to prevent UB's Google Search Appliance from crawling your site. The first method is to create a robots.txt file that follows the Robots Exclusion Protocol. The second method is to add a Robots META tag to each page.

Method 1 [ robots.txt ]:
A robots.txt file is a plain text file that search engine spiders read and that lets a webmaster allow or disallow crawling of a site. Create the robots.txt file and place it in the root directory of your site to control crawling for the entire site. Access is allowed or denied per spider name, called the User-agent. The User-agent of UB's Google Search Appliance is ubgsa.

Here are three examples of robots.txt files that may be helpful for you. You can also get more information, including how to disallow certain subdirectories of a site, by visiting the Web Robots FAQ.

1. To prevent the UB search engine from crawling your site, while still allowing all other search engines to crawl it, include the following information in the robots.txt file:

User-agent: ubgsa
Disallow: /

2. To allow only the UB search engine to crawl your site, include the following information in the robots.txt file:

User-agent: ubgsa
Disallow:

User-agent: *
Disallow: /

3. To prevent all search engines from crawling your site, include the following information in the robots.txt file:

User-agent: *
Disallow: /
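The three examples above can be checked before you deploy them. The following is a minimal sketch that uses Python's standard urllib.robotparser module to evaluate each example for the ubgsa User-agent and for one other spider. The helper name allowed, the agent name otherbot, and the test path /index.html are illustrative assumptions, and the Google Search Appliance may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The three robots.txt examples from this page, as strings.
BLOCK_UB_ONLY = """\
User-agent: ubgsa
Disallow: /
"""

ALLOW_UB_ONLY = """\
User-agent: ubgsa
Disallow:

User-agent: *
Disallow: /
"""

BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, path="/index.html"):
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    examples = [
        ("Example 1", BLOCK_UB_ONLY),
        ("Example 2", ALLOW_UB_ONLY),
        ("Example 3", BLOCK_ALL),
    ]
    for name, rules in examples:
        # "otherbot" is a hypothetical stand-in for any non-UB spider.
        print(name, "ubgsa:", allowed(rules, "ubgsa"),
              "otherbot:", allowed(rules, "otherbot"))
```

A check like this only confirms that the file expresses the rules you intended; it does not confirm that the file is reachable at the root directory of your site.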

Method 2 [ META Tags ]:
To keep all search engines from indexing an individual Web page and from following the links on it, place the following Robots META tag between the <head> and </head> tags of the HTML page. A spider must still download the page to read the tag. More information can be found here.

<meta name="robots" content="noindex, nofollow">

Prevent a search engine from caching your page

To prevent all search engines from showing a "Cached" link for your page, place this tag in the <head> section of your page:

<meta name="robots" content="noarchive">

To remove the "Cached" link only from the UB Search Engine and google.com, while still allowing other search engines to show one, use the following tag:

<meta name="googlebot" content="noarchive">

Note: This tag only removes the "Cached" link for the page. The UB Search Engine and google.com will continue to index the page and display a snippet.
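To confirm that a page carries the META tags described above, you can parse its HTML and list the directives it declares. The following is a minimal sketch using Python's standard html.parser module. The class name RobotsMetaParser and the function name robots_directives are illustrative assumptions. The sketch only reports what the page declares; it does not show how any particular search engine acts on it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        # Maps the meta name to its directives, e.g. {"robots": {"noindex"}}.
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attributes without a value arrive as None; treat them as empty.
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if name not in ("robots", "googlebot"):
            return
        values = {
            item.strip().lower()
            for item in attr.get("content", "").split(",")
            if item.strip()
        }
        self.directives.setdefault(name, set()).update(values)


def robots_directives(html_text):
    """Return the robots-related META directives declared in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives
```

For a page whose head contains both the robots tag with "noindex, nofollow" and the googlebot tag with "noarchive", the function returns one set of directives per tag name.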


Add a search box to your webpage to search all of UB

To search the UB Google Search Appliance from your site, you can add a search box to your Web pages. This is done by adding an HTML form to your Web page. You may also restrict a search to a subset of the pages indexed by the search engine (e.g. your Web site only). To add a search field to your site, add the following to your HTML code:

<form method="GET" action="http://search.buffalo.edu/search">
 <input type="text" name="q" size="32" maxlength="2048" value="">
 <input type="submit" name="btnG" value="Search UB">
 <input type="hidden" name="client" value="UB">
 <input type="hidden" name="proxystylesheet" value="UB">
 <input type="hidden" name="output" value="xml_no_dtd">
 <input type="hidden" name="site" value="UB">
</form>

Restrict searches to a specific website or domain

To restrict the search box so that it returns results only from an entire Web domain, a single Web site, or a subset of a Web site, add the as_sitesearch parameter to your search form:

<input type="hidden" name="as_sitesearch" value="your_url_here">

You can also restrict the search to a specific directory under a domain:

<input type="hidden" name="as_sitesearch" value="your_url_here/directory">

Note: If a trailing slash '/' is used at the end of the URL value, then the search will be restricted to only that specific folder. In the example above, which does not use a trailing slash, results will be returned for the directory folder and all subfolders under it.

Important: A site at UB can often be reached through several URLs, such as hostname.buffalo.edu, www.hostname.buffalo.edu, or wings.buffalo.edu/directory_path. The UB search engine has been configured not to crawl every alternate URL and generally crawls each site using only one address. Therefore, first make sure that the URL or directory you specify in the "as_sitesearch" parameter is the one listed in the UB Search Engine.

To search across multiple Web sites, use the "as_q" parameter instead of "as_sitesearch", and make sure to use "site:" and one or more "OR" operators in the value:

<input type="hidden" name="as_q" value="site:your_first_url OR site:your_second_url">
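If the list of sites is maintained elsewhere, the "as_q" value can be generated rather than typed by hand. The following is a minimal sketch in Python; the function names multi_site_query and hidden_as_q_input are illustrative assumptions, and www.buffalo.edu is used only as a hypothetical second site.

```python
from html import escape


def multi_site_query(sites):
    """Build an as_q value such as 'site:a OR site:b' from a list of sites."""
    cleaned = [site.strip() for site in sites if site.strip()]
    if not cleaned:
        raise ValueError("at least one site is required")
    return " OR ".join(f"site:{site}" for site in cleaned)


def hidden_as_q_input(sites):
    """Return the hidden <input> element that carries the as_q value."""
    # escape() guards against quotes or angle brackets in a site value.
    value = escape(multi_site_query(sites), quote=True)
    return f'<input type="hidden" name="as_q" value="{value}">'


if __name__ == "__main__":
    # www.buffalo.edu is a hypothetical second site for illustration.
    print(hidden_as_q_input(["ubit.buffalo.edu", "www.buffalo.edu"]))
```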

It is also helpful to indicate on the search button that the search is restricted to a site. To do that, modify the code for the btnG parameter:

<input type="submit" name="btnG" value="Search your_sitename_here">

For example, the following code creates a custom search box that searches only the ubit.buffalo.edu website:

<form method="GET" action="http://search.buffalo.edu/search">
 <input type="text" name="q" size="32" maxlength="2048" value="">
 <input type="submit" name="btnG" value="Search UBIT">
 <input type="hidden" name="client" value="UB">
 <input type="hidden" name="proxystylesheet" value="UB">
 <input type="hidden" name="output" value="xml_no_dtd">
 <input type="hidden" name="site" value="UB">
 <input type="hidden" name="as_sitesearch" value="ubit.buffalo.edu">
</form>
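Because the form uses the GET method, submitting it simply requests a URL whose query string holds the form fields. The following is a minimal sketch in Python that builds that URL, which can be useful for testing a configuration or for linking directly to a search. The function name build_search_url and its argument names are illustrative assumptions; the endpoint and parameter values are copied from the form above.

```python
from urllib.parse import urlencode

# Taken from the form's action attribute.
SEARCH_ENDPOINT = "http://search.buffalo.edu/search"


def build_search_url(query, site_restrict=None, button_label="Search UB"):
    """Return the URL the search form would request for the given query."""
    # Fields are listed in the same order as the inputs in the form.
    params = {
        "q": query,
        "btnG": button_label,
        "client": "UB",
        "proxystylesheet": "UB",
        "output": "xml_no_dtd",
        "site": "UB",
    }
    if site_restrict:
        # Optional restriction to one domain, site, or directory.
        params["as_sitesearch"] = site_restrict
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("email", "ubit.buffalo.edu", "Search UBIT"))
```

Leaving site_restrict unset corresponds to the unrestricted search box that searches all of UB.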

Get additional help

If you have any questions, please fill out the Special Requests Form and we will try to respond as soon as possible.