
Return of documentation search!

Now that video and page searching is working pretty well, I’m going to return to one of the earliest problems I tried to solve: looking up information about Python-related terms. In earlier posts, I built a documentation parser to save terms and definitions, a data model to store Python terms, and a few lines to find them in the search lookup code. After thinking about this problem for a while, I think I’ve found a better way to retrieve Python terms, without using a parser, storing them in the database, or checking each query. This will be cheaper and faster than the original idea.

An update to the contest page on the forums mentioned the DuckDuckHack API. I checked it out, and found that DuckDuckGo also provides a “zero-click info” API that includes information from Python documentation. This should be much easier to use than storing and parsing terms in the DaveDaveFind database.

As it turns out, the API is very easy to use. Adding '&format=json' to the end of a search URL returns zero-click info results in JSON, a format that was designed for JavaScript but looks a lot like a Python dictionary. So a search for ‘python str’ like this:

http://api.duckduckgo.com/?q=python+str

becomes this:

http://api.duckduckgo.com/?q=python+str&format=json

and returns an object that looks like this:

{"Definition":"","DefinitionSource":"","Heading":"str (Python)","AbstractSource":"Python Documentation"…}

There’s a basic module in the standard library for reading JSON data into Python, so getting this information into the search engine should be pretty easy. Here’s the code I came up with, in the process_search() procedure of main.py:

# Python 2 imports. In Python 3, quote() and urlopen() live in
# urllib.parse and urllib.request instead.
import json
import urllib

# Check if the search query starts with the word 'python', and strip it
# so it isn't doubled in the API query.
if query == 'python' or query.startswith('python '):
    pyquery = query[7:]
else:
    pyquery = query

# Save the DuckDuckGo API root URL.
ddgurl_root = 'http://api.duckduckgo.com/?q=python+'
# Encode the query as a URL and generate an API URL.
ddgurl_suffix = urllib.quote(pyquery) + '&format=json'

# Get the JSON response from DuckDuckGo.
response = urllib.urlopen(ddgurl_root + ddgurl_suffix)
response_json = response.read()

# Parse the JSON and convert it to a Python dictionary.
pythonterm = json.loads(response_json)
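For readers on Python 3, here is a minimal sketch of the same URL-building and parsing steps, with the network call left out so it runs offline. The helper names (strip_python_prefix, build_api_url) are hypothetical, and the sample response contains only the fields quoted earlier in this post; a real response has more.

```python
import json
from urllib.parse import urlencode

# API root as shown in the example URLs above.
API_ROOT = 'http://api.duckduckgo.com/'


def strip_python_prefix(query):
    """Drop a leading 'python' word so it isn't doubled in the API query."""
    words = query.split()
    if words and words[0].lower() == 'python':
        words = words[1:]
    return ' '.join(words)


def build_api_url(query):
    """Build a zero-click API URL for a Python-related search term."""
    term = strip_python_prefix(query)
    params = {'q': ('python ' + term).strip(), 'format': 'json'}
    # urlencode() escapes the values and turns spaces into '+'.
    return API_ROOT + '?' + urlencode(params)


# Hypothetical, shortened response: only the fields quoted in the post.
sample_json = ('{"Definition":"","DefinitionSource":"",'
               '"Heading":"str (Python)",'
               '"AbstractSource":"Python Documentation"}')

# json.loads() turns the JSON text into an ordinary Python dictionary.
pythonterm = json.loads(sample_json)

print(build_api_url('python str'))
print(pythonterm['Heading'])
```

Splitting on whitespace means a query like 'pythonic code' is left alone, since only the standalone word 'python' is removed.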

Now that we have the response, we need to make sense of it. I looked at the JSON from a few different search terms, and many of them included <code> tags to help format the results. Unfortunately, Bottle's templates escape HTML in the values passed to them by default, so the tags would show up as literal text on the page. (This is actually a good policy, since rendering unencoded HTML can open a lot of security holes.) Instead, we can parse the results with BeautifulSoup to pull out the <code> blocks and then reformat them in the template. Here’s how:

# If there's a response...
if pythonterm:
    pyterm_info = {}
    # If the response is from the Python Documentation...
    if pythonterm.get('AbstractSource') == 'Python Documentation':
        # Get its description and try to find a <code> block.
        pyterm = BeautifulSoup(pythonterm['AbstractText'])
        try:
            pyterm_code = pyterm.find('code').string
            # Remove the enclosing <pre> block so the code isn't
            # repeated in the description.
            pyterm.pre.decompose()
            pyterm_info['code'] = pyterm_code
        except AttributeError:
            # No <code> tag (find() returned None) or no <pre> block.
            pyterm_info['code'] = None
        pyterm_desc = pyterm.get_text()
        pyterm_info['desc'] = pyterm_desc
        pyterm_info['url'] = pythonterm['AbstractURL']
        # We found something, so set results to True
        results = True
else:
    pyterm_info = None
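To make the code/description split easier to try without installing anything, here is a rough sketch of the same idea using the standard library's html.parser in place of BeautifulSoup. It is a substitute technique, not the code used in DaveDaveFind; the CodeSplitter class, the split_abstract helper, and the sample abstract text are all made up for illustration.

```python
from html.parser import HTMLParser


class CodeSplitter(HTMLParser):
    """Collect text inside <code> tags separately from the rest."""

    def __init__(self):
        super().__init__()
        self.in_code = 0       # nesting depth of open <code> tags
        self.code_parts = []
        self.desc_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'code':
            self.in_code += 1

    def handle_endtag(self, tag):
        if tag == 'code' and self.in_code:
            self.in_code -= 1

    def handle_data(self, data):
        # Route each text chunk to the code or the description.
        if self.in_code:
            self.code_parts.append(data)
        else:
            self.desc_parts.append(data)


def split_abstract(abstract_html):
    """Return a dict with the code (or None) and the plain description."""
    parser = CodeSplitter()
    parser.feed(abstract_html)
    parser.close()  # flush any buffered trailing text
    code = ''.join(parser.code_parts).strip() or None
    desc = ''.join(parser.desc_parts).strip()
    return {'code': code, 'desc': desc}


# Hypothetical abstract text, shaped like the responses described above.
sample = '<pre><code>str(object)</code></pre>Return a string version of the object.'
info = split_abstract(sample)
print(info['code'])
print(info['desc'])
```

As in the BeautifulSoup version, an abstract with no <code> tag ends up with 'code' set to None and the full text in 'desc'.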

That’s it! As long as we pass the pyterm_info dictionary to the template in the return line at the bottom of this procedure, the template will have all the Python information it needs. Here are the blocks in results.tpl that display it:

%if pyterm_info:
    <div class="row">
    <div class="span6 well">
        %if pyterm_info['code']:
        <p><a href="{{ pyterm_info['url'] }}"><code>{{ pyterm_info['code'] }}</code></a></p>
        %end
    <blockquote>{{ pyterm_info['desc'] }}</blockquote>
    <p class="source">Read more: <a href="{{ pyterm_info['url'] }}"><img class="icon" src="/styles/py.png" height="15" width="15">  Python documentation</a></p>
    <ul class="nav nav-list">
    <li class="divider"></li>
    <li><p class="pull-right">Python search powered by <a href="http://duckduckgo.com/"><img class="icon" src="/styles/ddg.png" height="15" width="15"> DuckDuckGo</a></p></li>
    </ul>
    </div>
    </div>
    %end
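The {{ }} expressions in this template escape HTML, which is why the code and description are handed over as plain text rather than markup. The effect is comparable to (though not necessarily identical to) the standard library's html.escape, shown here as a small illustration with a made-up string:

```python
import html

# A made-up snippet of the kind of markup found in the API results.
raw = '<code>str(object)</code>'

# Escaping replaces the angle brackets with entities, so a browser
# shows the tags as literal text instead of rendering them.
escaped = html.escape(raw)
print(escaped)
```

Passing the bare code string and wrapping it in a <code> tag inside the template keeps the markup under the template's control.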

If pyterm_info isn't empty, the template prints the code inside a <code> tag (when one was found), with the description beneath it. The conditions of the DuckDuckGo API require attribution to the original source and to DuckDuckGo, so there’s a block at the bottom to give credit. I added a few styles from Bootstrap to make these results look nice, tested a few queries on the development server, and it looks like it’s working well. Using the API saved a lot of time: it took less than an hour to get the last big feature working, and it was pretty painless after figuring out how to parse JSON objects. DaveDaveFind is almost done!


About

Development notes for DaveDaveFind, a simple search engine I built for Udacity CS101.

See also:

  • @ecmendenhall on Twitter
  • ecmendenhall on github