
Putting it all together

Now that the new models are in the database and the crawler is indexing videos, documents, and Udacity URLs, it’s time to put it all together in the search engine script main.py. Last time we visited, it was still pretty simple. Now, there’s a lot of new code. I’ll step through it bit by bit.

First, remember how the process_search() procedure that handles search lookups works:

def process_search():
    search_query = request.GET.get('search_query', '').strip()
    query = search_query.lower()

It gets the string search_query from the search form, and converts it to lowercase. This could be a multiword string, one word, or an empty string, and it’s important to make sure the procedure can handle all these cases.
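
To see why all three cases are safe, here is a minimal sketch of the same normalization on plain strings. It is written for Python 3, and `normalize` is a hypothetical helper for illustration, not something in main.py:

```python
def normalize(raw_query):
    """Strip surrounding whitespace and lowercase, as process_search() does."""
    return raw_query.strip().lower()

multi = normalize('  Python Lists ')
single = normalize('Python')
empty = normalize('   ')

# split() with no arguments never yields empty strings,
# so an empty query simply produces an empty list of terms.
print(multi.split())   # ['python', 'lists']
print(single.split())  # ['python']
print(empty.split())   # []
```

An empty list of terms means the lookup loop later in the procedure never runs, so the empty query falls through to the "no results" branch.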

Next, I set a couple of variables. If nothing intervenes, they will stay False and get passed on to the template:

show_daverank = False
results = False

The next block is a cool feature inspired by !bang syntax on DuckDuckGo (I had no idea Gabriel Weinberg would be a contest judge when I started this project, by the way. I hope he doesn’t mind that this project rips off a few features from DDG!). By checking search queries for certain keywords prefixed with two dashes, like “--cs101”, DaveDaveFind will return results from other sites. This code catches the strings “--cs101”, “--cs373”, and “--python”, and redirects to search the course forums or Python documentation. It also looks for the argument “--daverank”, which prints a page’s Daverank underneath its URL. It’s pretty easy with the string method find() and a little slicing:

if query.find('--') == 0:
    if query.find('--cs101') == 0:
        redirect_url = 'http://www.udacity-forums.com/cs101/search/?q=' + urllib.quote(query[8:])
        return redirect(redirect_url)   
    if query.find('--cs373') == 0:
        redirect_url = 'http://www.udacity-forums.com/cs373/search/?q=' + urllib.quote(query[8:])
        return redirect(redirect_url)   
    if query.find('--python') == 0:
        redirect_url = 'http://docs.python.org/search.html?q=' + urllib.quote(query[9:])
        return redirect(redirect_url)
    if query.find('--daverank') == 0:
        query = query[11:]
        search_query = query
        show_daverank = True

The method urllib.quote() converts the rest of the search string into a format suitable for sending in a URL. For example, the search:

--cs101 url encoding

Becomes:

http://www.udacity-forums.com/cs101/search/?q=url%20encoding

It’s important to encode strings this way: characters like spaces, “&”, and “?” have special meanings in URLs, so an unencoded query could be cut short or misread by the site that receives it.
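
As a rough illustration of the encoding step, here is a sketch in Python 3, where urllib.quote() lives at urllib.parse.quote(). The `BANGS` table and the `bang_redirect_url` helper are hypothetical names: one possible way to avoid repeating the slicing for each keyword, not the code in main.py.

```python
from urllib.parse import quote  # Python 2 spelling: urllib.quote

# Hypothetical lookup table: keyword -> search URL prefix (URLs from the post).
BANGS = {
    '--cs101': 'http://www.udacity-forums.com/cs101/search/?q=',
    '--cs373': 'http://www.udacity-forums.com/cs373/search/?q=',
    '--python': 'http://docs.python.org/search.html?q=',
}

def bang_redirect_url(query):
    """Return a redirect URL for a '--keyword rest' query, or None."""
    keyword, _, rest = query.partition(' ')
    prefix = BANGS.get(keyword)
    if prefix is None:
        return None
    return prefix + quote(rest.strip())

print(bang_redirect_url('--cs101 url encoding'))
# http://www.udacity-forums.com/cs101/search/?q=url%20encoding
```

Unlike the find() checks, this version only matches the keyword exactly, so a query such as “--cs101x” would not redirect.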

Next up, I’ve edited the code to handle multi-word queries. The procedure splits the search query into a list, and iterates through each element, getting the URLs associated with each term from the database, which are now stored in a Python list:

query_words = query.split()
query_urls = []
for term in query_words:
    # Get all SearchTerm objects that match the search_query.
    q = SearchTerm.all().filter('term =', term).get()
    if q:
        query_urls.append(set(q.urls))

Next, a big if block to handle the results. The next few code snippets are all part of this block. (In fact, it’s so long that it’s probably time to move it into its own procedure).

if query_urls:
    query_url_set = set.intersection(*query_urls)
    query_url_list = list(query_url_set)    

    results = True
    if len(query_url_list) > 30:
        query_url_list = query_url_list[0:30]

    page_results = Page.all().filter('url IN', query_url_list).order('-dave_rank').fetch(5)
    page_dicts = []

First, we want to make sure that DaveDaveFind returns pages with all the search terms in a multiword query. To do so, I used a Python set, which is a lot like a list, but contains only unique elements. The previous for block stored each list of URLs as a set, and the first two lines of this block retrieve the intersection of all the sets. The asterisk inside the method call is a new Python trick I learned: it “unpacks” the elements of a list or similar type into separate arguments. I couldn’t get this method to accept a list, but it worked with the asterisk. That’s because set.intersection() expects each set as its own argument, not a single list of sets.
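
A small self-contained demonstration of the unpacking trick, using made-up URL sets rather than real database results (Python 3 syntax):

```python
# Each search term maps to the set of URLs that contain it (made-up data).
query_urls = [
    {'a.html', 'b.html', 'c.html'},
    {'b.html', 'c.html'},
    {'c.html', 'd.html'},
]

# The asterisk spreads the list into separate arguments, so this call is
# equivalent to set.intersection(query_urls[0], query_urls[1], query_urls[2]).
common = set.intersection(*query_urls)
print(common)  # {'c.html'}

# Passing the list itself fails, because the first argument must be a set.
try:
    set.intersection(query_urls)
except TypeError as error:
    print('TypeError:', error)
```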

If the database query returns results, we toggle results to True, which will get passed to the template later.

The next block is a kludgy hack that I hope to fix later. The new data models are much more efficient since they don’t perform a bunch of one-to-many lookups, but I ran into a new error: the filter() method can only handle 30 items 'IN' a particular query. In other words, it can only look up 30 URLs at a time. Most of the time, it’s okay, but the function threw an error for popular terms like ‘Python.’ For now, when a term has more than 30 associated URLs, the code simply limits the URLs returned to the first 30. This is okay, but it doesn’t necessarily return the best URLs. I’m thinking about how to fix this.
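
One possible way around the limit, sketched here as an idea rather than what DaveDaveFind does: split the URL list into batches of at most 30, run one query per batch, and merge the results by rank. `fetch_batch` is a hypothetical stand-in for the datastore query, and the in-memory `PAGES` dictionary is invented for the demo.

```python
def chunks(items, size=30):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Invented data: 100 URLs whose rank equals their number.
PAGES = {'page%d.html' % n: n for n in range(100)}

def fetch_batch(urls):
    """Stand-in for a datastore query; refuses more than 30 URLs."""
    if len(urls) > 30:
        raise ValueError('too many URLs in one query')
    return [(url, PAGES[url]) for url in urls if url in PAGES]

def top_pages(urls, limit=5):
    """Query every batch, then keep the highest-ranked results overall."""
    found = []
    for batch in chunks(urls):
        found.extend(fetch_batch(batch))
    found.sort(key=lambda pair: pair[1], reverse=True)
    return found[:limit]

print(top_pages(list(PAGES)))
```

The trade-off is more queries per search: a term with 100 URLs needs four lookups instead of one, but the best-ranked pages are no longer dropped arbitrarily.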

Next, we get some information about each page in the results, and put it in a dictionary, page_info, that will be passed to the template:

    for page in page_results:
        page_info = {}
        query_index = page.text.find(query)
        if query_index != -1:
            # Clamp at 0 so a match near the start doesn't give a negative index.
            i = max(0, query_index - 50)
            j = query_index + 450
        else:
            i = 0
            j = 500
        text_string = page.text[i:j]
        page_info['text'] = text_string
        page_info['title'] = page.title
        page_info['url'] = page.url
        page_info['daverank'] = page.dave_rank
        page_dicts.append(page_info)

The i and j indexing gets a snippet from the full text of the page to display with the search results. This is another bit of code to improve in the future. It’s pretty dumb right now, and usually cuts off words and sentences. Better code would try to find full words, and wouldn’t use hard-coded index values.
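
Here is one way the word-boundary idea might look, as a sketch only. `snippet` is a hypothetical helper, the window sizes are parameters instead of fixed numbers, and the sample sentence is made up:

```python
def snippet(text, query, before=50, after=450):
    """Return an excerpt around `query`, trimmed to whole words."""
    index = text.lower().find(query.lower())
    if index == -1:
        index = 0
    start = max(0, index - before)
    end = min(len(text), index + after)
    # If the window starts mid-word, skip ahead to the next full word.
    if start > 0 and not text[start - 1].isspace():
        next_space = text.find(' ', start)
        if next_space != -1 and next_space < index:
            start = next_space + 1
    # If the window ends mid-word, back up to the previous space.
    if end < len(text) and not text[end].isspace():
        last_space = text.rfind(' ', start, end)
        if last_space > index:
            end = last_space
    return text[start:end].strip()

sentence = 'The quick brown fox jumps over the lazy dog'
print(snippet(sentence, 'jumps', before=8, after=16))  # fox jumps over the
```

A naive slice with the same numbers would return “own fox jumps over the l”, cutting two words in half.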

Next, we do the same for video objects. This is a pretty gnarly block of nested if statements, and I’m thinking about how to clean it up. But here it is, in all its gory detail:

    # Get the top 3 videos for the search term.
    video_results = Video.all().filter('url IN', query_url_list).order('-views').fetch(3)
    video_dicts = []
    #Iterate through each video and store its information in a dictionary.
    for video in video_results:
        video_info = {}
        #Get a video's subtitles and find the search query.
        subtitles = video.text.lower()
        query_index = subtitles.find(query)
        time_string = ''
        #If the full search query is in the video, find it...
        if query_index != -1:
            #...by splitting the subtitles into a list of lines...
            subtitle_list = subtitles.splitlines()
            #...and iterating over them to find the query.
            for phrase in subtitle_list:
                if phrase.find(query) != -1:
                    #Get the timestamp associated with the search term.
                    timestamp_index = subtitle_list.index(phrase) - 1
                    timestamp = subtitle_list[timestamp_index]
                    if len(timestamp) > 1:
                        #Save its minutes and seconds information
                        minutes = timestamp[3:5]
                        seconds = timestamp[6:8]
                        #Add it to a string
                        time_string = '#t=' + minutes + 'm' + seconds + 's'
                        start = 60 * int(minutes) + int(seconds)

        if time_string:
            url = video.url + time_string
        else:
            url = video.url
            start = 0           
        video_info['title'] = video.title
        video_info['url'] = url
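        # Short excerpt around the match (the opening characters if there is no match).
        video_info['subtitle'] = video.text[max(0, query_index - 20):query_index + 20]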
        video_info['id'] = video.id
        video_info['start'] = start
        video_dicts.append(video_info)

This is basically the same as the last snippet, except for the timestamp stuff. I considered importing the pysrt library to handle subtitles, but decided to use string slicing instead. Since the indexed subtitles contain timestamps, DaveDaveFind can easily do something pretty cool: if it finds the full search query in a video transcript, it will “deep link” directly to the phrase inside the video. This feels pretty magical in practice, but it’s actually just parsing the timestamp information and adding it to the YouTube URLs that it generates.
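
The timestamp logic is easier to follow in isolation. This sketch assumes SRT-style subtitles, where each caption is preceded by a line like “00:01:23,000 --> 00:01:26,500”. `find_start_time` is a hypothetical helper and the sample transcript is made up:

```python
SAMPLE = """1
00:00:05,000 --> 00:00:09,000
welcome to the unit on lists

2
00:01:23,000 --> 00:01:26,500
a list is a sequence of values
"""

def find_start_time(subtitles, query):
    """Return (url_fragment, seconds) for the first caption containing query."""
    lines = subtitles.lower().splitlines()
    for number, line in enumerate(lines):
        if query in line and number > 0:
            timestamp = lines[number - 1]
            if '-->' in timestamp:
                # Characters 3-4 are minutes and 6-7 are seconds; hours are ignored.
                minutes = timestamp[3:5]
                seconds = timestamp[6:8]
                fragment = '#t=' + minutes + 'm' + seconds + 's'
                return fragment, 60 * int(minutes) + int(seconds)
    return '', 0

print(find_start_time(SAMPLE, 'sequence of values'))  # ('#t=01m23s', 83)
```

Like the slicing in main.py, this only looks one line above the match, so a query found on the second line of a two-line caption falls back to starting at the beginning.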

Finally, if there’s a time string, it’s added to the URL. If not, the video is set to start at the beginning. Just like with the webpage lookup, the last few lines of that block add each video’s information to a dictionary that’s passed on to the template. All that’s left is the else branch for queries with no matching terms, and the call that renders the template:

else:
    page_dicts = None
    video_dicts = None


return template('templates/results', search_query=search_query, page_dicts=page_dicts, video_dicts=video_dicts, show_daverank=show_daverank, results=results)

The process_search() procedure takes all this information and sends it to the template to render. If there are no page or video results, it sends None to the template. Next up, we’ll take a look at the changes to the template to handle all this new information.  

  • 1 year ago
About

Development notes for DaveDaveFind, a simple search engine I built for Udacity CS101.

See also:

  • @ecmendenhall on Twitter
  • ecmendenhall on github