Apr 15

Final Thoughts

I’ve finished up all the code behind DaveDaveFind, made sure the pages look nice, and run one last big crawl of the Udacity site and forums. It’s time to deploy the site for good and post it as a contest submission.

I’m impressed at how much I’ve learned in the last two weeks of work on DaveDaveFind (and the last seven weeks of Udacity classes). I now have a pretty good grasp of the Bottle framework, Google App Engine, using BeautifulSoup, interacting with a database, and styling pages with Bootstrap. In fact, I’ve even used DaveDaveFind a few times to look up information from the course, which must be a good sign! This project has turned out better than I expected, and I think it’s met my original goals: search all the course materials and use as much of the original search engine code as possible.

I entered CS101 as a total newcomer to computer science and a novice Python programmer. I knew how to use dictionaries, but had no idea that they worked well because they use hash tables. I knew how to write a for loop, but had no idea why and when some might take longer than others. I knew it was possible to write a recursive function, but not how to do it correctly and think about the base case.

I’ve kept these notes in part to show how well the bite-sized thinking I’ve learned from Udacity can work. I brought a little bit of previous knowledge to this project, but I’ve taught myself much more in the last few weeks. Changing our basic search crawler code into a working web application took a lot of new code, but I wrote it all by breaking big problems into little ones, carefully reading documentation, and asking for help when I got stuck. I didn’t realize how much I’d learned until I finished this project, and I hope these notes will help other Udacity students do the same.

I owe huge thanks to Peter and Professor Evans for an excellent course and the opportunity to take part in this contest. To see the final production code for DaveDaveFind, visit the GitHub repository here.

Appendix: Using Bootstrap

I’ve mostly focused on Python in these posts, but I want to go over the way DaveDaveFind uses Twitter Bootstrap styles. Styling pages with Bootstrap feels a little bit like putting blocks together. Build your page in HTML, find the right class attributes for each piece, and Bootstrap will make it look professional. The Bootstrap documentation is very good, but it can still take a lot of experimentation if you haven’t used CSS in a while. (I haven’t, so most of what’s written here is the product of lots of trial and error. Take my advice at your own risk!)

Adding Bootstrap to a project is very easy. Once the CSS files are somewhere in your project directory, add links in the <head> of your HTML:

<head>
    <link rel="stylesheet" href="/styles/bootstrap.css" type="text/css">
    <link rel="stylesheet" href="/styles/home.css" type="text/css">
    <link rel="stylesheet" href="/styles/bootstrap-responsive.css" type="text/css">
</head>

The first link is the main Bootstrap style sheet, the second is the custom changes I made for DaveDaveFind (see it here), and the third is Bootstrap’s responsive style sheet, which makes it easy to adjust the layout for screens and devices of different sizes. Make sure the responsive sheet is linked last, or it won’t work correctly.

As for the HTML structure of the site, there are only two templates that make up DaveDaveFind: home.tpl, which displays the search box, and results.tpl, which displays results. Here’s the basic structure of the homepage, annotated with the CSS class of each piece:

These are contained in a <div> structure that looks like this, following the Bootstrap documentation:

<div class="container">
    <div class="row">
    Search form here!
    </div>
</div>
<div class="navbar navbar-fixed-bottom">
Navbar contents!
</div>

Getting the structure inside the navbar to work took some experimentation, but here it is:

<div class="navbar-inner">
    <div class="container">
        <ul class="nav pull-left" data-no-collapse="true">
            <li class="dropdown" data-no-collapse="true">
                <a href="#" class="dropdown-toggle" data-toggle="dropdown">
                About
                <b class="caret"></b>
                </a>
                <div class="dropdown-menu infobox" data-no-collapse="true">
                About popup text.
                </div>
            </li>
            <li class="dropdown" data-no-collapse="true">
                <a href="#" class="dropdown-toggle" data-toggle="dropdown">
                Help
                <b class="caret"></b>
                </a>
                <div class="dropdown-menu infobox" data-no-collapse="true">
                Help popup text
                </div>
            </li>
        </ul>
        <div class="pull-right">
        Twitter and Google Plus buttons.
        </div>
    </div>
</div>

The popup boxes in the navbar are designed to be menu items, but with some careful tinkering with their styles, I got them to work as information boxes. Here are the relevant classes from the custom style sheet, which overrides the Bootstrap styles:

.dropdown-menu a {
    display: inline;
    padding: inherit;
    clear: none;
    font-weight: normal;
    line-height: inherit;
    color: #F7A900;
} 

.dropdown-menu p { 
    margin-left: 5px;
    color: #686868;
}

.dropdown-menu { 
    min-width: 240px;
    background-color: #f5f5f5;
    padding: 5px;
}

I also followed this tutorial to keep the boxes from popping up when the window is very small. The rest of the changes to the custom style sheet are mostly cosmetic: colors, gradients, and so forth. You can see the details on GitHub. The only other ingredient on this page is a little JavaScript to handle the popup. Here’s the code from the HTML header:

<script src="/styles/jquery-1.7.2.min.js" type="text/javascript"></script>
<script src="/styles/bootstrap-dropdown.js" type="text/javascript"></script>

Bootstrap’s interactive features will work automatically if the right scripts are included in the page header. They all use jQuery (the first script). Since this page only uses the dropdown plugin, I downloaded it separately from the Bootstrap site.

Here’s the results template:

Its structure is similar to home.tpl, but the navbar is at the top and there are three rows of content: the search terms, the Python term results, and the video/webpage results.

<div class="navbar navbar-fixed-top">
Navbar
</div>
<div class="container">
    <div class="row">
        <div class="span6">
        "You searched for..."
        </div>
    </div>
    <div class="row">
        <div class="span6 well">
        Python term box
        </div>
    </div>
    <div class="row">
        <div class="span5" id="fixed">
        Video results
        </div>
        <div class="span7">
        Webpage results
        </div>
    </div>
</div>

Here’s the structure of the navbar, which now contains a search box:

<div class="navbar navbar-fixed-top">
    <div class="navbar-inner">
        <div class="container">
        <span class="brand">
        <h2><a href="/">DaveDave<strong class="orange">Find</strong></a></h2></span>
        <form class="navbar-form form-inline" action="/search" method="GET" >
            <input type="text" name="search_query" class="input-xlarge">
            <button type="submit" class="btn btn-warning"><i class="icon-search icon-white"></i></button>
        </form>
        </div>
    </div>
</div>

The custom styles for this page are mostly changes to the default colors, but there is one important structural change. The <div> containing videos has the attribute id="fixed". Here’s the associated CSS:

#fixed { width: 470px; }

This ensures that the video column will always be wide enough to hold the embedded video, even if the window is squished to a strange size.

And that’s it! Using mostly standard Bootstrap classes with a few color changes is an easy way to build a simple, attractive page.

Better Lookup

DaveDaveFind is now returning pretty good results, but there are a few problems I’ve put off that I still need to fix: stopwords in search queries, the results links, and the order of results.

Here’s the code I added to deal with stopwords. First, I added them in a list at the very top of main.py. App Engine won’t allow file reading and writing, so I can’t get them from an external file. I considered storing them in the database, but that means writing a new model and performing an extra database query with each search. Hardcoding them might not be the best solution, but it works. This block (below the Python term stuff) breaks the query into a list of words and gets rid of stopwords:

query_words = query.split()
# Loop over a copy ([:]) so that removing items doesn't skip any words.
for word in query_words[:]:
    if word in stopwords:
        query_words.remove(word)

These two extra lines were all it took to solve the stopword problem. Improving the results links was pretty easy, too. Before passing all the information to the template, I split the query string into a list called query_string_words and passed this in place of query_words. We can’t pass in query_words because the stopwords are gone and the string won’t make sense when it’s printed for the user. And we can’t create this list earlier, because the changes made to query_words will also affect it (remember Secret Agent Man?).
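To make the aliasing point concrete, here is a minimal sketch. It is not the code from main.py: the stopword list and the helper name split_query are made up for illustration. It shows one way to keep a display copy of the query words separate from the filtered lookup words, and what happens when two names point at the same list.

```python
# Hypothetical stopword list; the real one sits at the top of main.py.
STOPWORDS = ['a', 'an', 'the', 'of', 'to']


def split_query(query):
    """Return (display_words, lookup_words) for a search query."""
    display_words = query.split()
    # Build a new list instead of removing items in place, so
    # display_words is left untouched for the template.
    lookup_words = [w for w in display_words if w not in STOPWORDS]
    return display_words, lookup_words


display_words, lookup_words = split_query('the base case of a loop')
print(display_words)  # all six words, stopwords included
print(lookup_words)   # only the words worth looking up

# Aliasing: both names refer to the same list object,
# so a change made through one name shows up through the other.
alias = display_words
alias.remove('the')
print(display_words)  # 'the' is gone here too
```

Building the filtered list with a list comprehension sidesteps the aliasing problem entirely, because the original list is never modified.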

Here’s how the results template handles the new information (remember, it’s now a list instead of a string):

<h2>You searched for: <strong class="orange">
    %for word in query_string_words:
    <a href="/search?search_query={{ word }}">{{ word }}</a>
    %end

Pretty simple: it prints each word as a link, with a URL that points to a DaveDaveFind search for that word. This means that stopwords will be links, but this isn’t a big problem—they just won’t return any results.

Last (but definitely not least) is fixing the order of results. I solved this by adding an extra entry to the dictionary associated with each page: True if it contained an exact match, and False otherwise. Here’s the block that iterates through each result and stores its information:

    for page in page_results:
        page_info = {}
        query_index = page.text.find(query)
        if query_index != -1:
            # Start the excerpt at a space shortly before the match.
            # max() keeps the start index from going negative near the top of the page.
            i = page.text.find(' ', max(0, query_index - 25))
            excerpt_words = page.text[i:].split(' ')
            page_info['exact_match'] = True
        else:
            excerpt_words = page.text.split(' ')
            page_info['exact_match'] = False
        excerpt = ' '.join(excerpt_words[:50])

        page_info['text'] = excerpt
        page_info['title'] = page.title
        page_info['url'] = page.url
        page_info['daverank'] = page.dave_rank
        page_info['doc'] = page.doc
        page_dicts.append(page_info)
    page_dicts.sort(key=itemgetter('exact_match'), reverse=True)

The last line is the most important. Before the pages are passed to the template, it sorts the list of dictionaries by the value of the key 'exact_match'. The itemgetter function from the operator module is responsible for this wizardry (make sure you add from operator import itemgetter!). After trying this out, I added reverse=True, which reverses the order so that pages with an exact match are put at the beginning of the list (by default, False sorts before True). One cool thing about this method is that Python’s sort is stable, so it doesn’t change the relative order of elements that tie. Pages without an exact match stay at the end of the list, still ordered by Daverank, and those with exact matches get pulled to the beginning. The template doesn’t use the 'exact_match' key, and the structure of the list hasn’t changed, so this is all we need to fix the result rankings.
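Here is a small, self-contained sketch of that sort. The URLs and Daverank values are invented for the example. The list starts out ordered by Daverank, the way the database query returns it.

```python
from operator import itemgetter

# Made-up results, already ordered by Daverank (highest first).
page_dicts = [
    {'url': 'a', 'daverank': 0.9, 'exact_match': False},
    {'url': 'b', 'daverank': 0.7, 'exact_match': True},
    {'url': 'c', 'daverank': 0.5, 'exact_match': False},
    {'url': 'd', 'daverank': 0.3, 'exact_match': True},
]

# False sorts before True, so reverse=True puts exact matches first.
# The sort stays stable when reversed, so pages that tie keep their
# original (Daverank) order.
page_dicts.sort(key=itemgetter('exact_match'), reverse=True)

order = [page['url'] for page in page_dicts]
print(order)  # ['b', 'd', 'a', 'c']
```

Within each group the pages are still in Daverank order: b before d, and a before c.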

It feels more like a whimper than a bang, but I think I’m done with the code for DaveDaveFind! The rest of the changes I need to make involve cleaning up the HTML and styles and performing one last big crawl to provide a detailed search index. I might make minor changes to the code, but all the essentials are done. If I change anything major, I’ll make sure to note it in a post here. To see the latest version, click here.

Return of documentation search!

Now that video and page searching is working pretty well, I’m going to return to one of the earliest problems I tried to solve: looking up information about Python-related terms. In earlier posts, I built a documentation parser to save terms and definitions, a data model to store Python terms, and a few lines to find them in the search lookup code. After thinking about this problem for a while, I think I’ve found a better way to retrieve Python terms, without using a parser, storing them in the database, or checking each query. This will be cheaper and faster than the original idea.

An update to the contest page on the forums mentioned the DuckDuckHack API. I checked it out, and found that DuckDuckGo also provides a “zero-click info” API that includes information from Python documentation. This should be much easier to use than storing and parsing terms in the DaveDaveFind database.

As it turns out, the API is very easy to use. Simply adding '&format=json' to the end of a normal search URL returns zero-click info results in the JSON format, which was designed for JavaScript but looks a lot like a Python dictionary. So a search for ‘python str’ like this:

http://api.duckduckgo.com/?q=python+str

becomes this:

http://api.duckduckgo.com/?q=python+str&format=json

and returns an object that looks like this:

{"Definition":"","DefinitionSource":"","Heading":"str (Python)","AbstractSource":"Python Documentation"…}

There’s a basic module in the standard library for reading JSON data into Python, so getting this information into the search engine should be pretty easy. Here’s the code I came up with, in the process_search() procedure of main.py:

# Check if the search query starts with 'python'.
if query.find('python') == 0:
    pyquery = query[7:]
else:
    pyquery = query

# Save the DuckDuckGo API root URL. 
ddgurl_root = 'http://api.duckduckgo.com/?q=python+'
# Encode the query as a URL and generate an API URL.
ddgurl_suffix = urllib.quote(pyquery) + '&format=json'

# Get the JSON response from DuckDuckGo
response = urllib.urlopen(ddgurl_root + ddgurl_suffix)
response_json = response.read()

# Parse the JSON and convert it to a Python dictionary.
pythonterm = json.loads(response_json)
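The two steps above (building the URL and parsing the JSON) can be tried without touching the network. This is only a sketch: build_api_url is a hypothetical helper, and sample_response is a made-up string that borrows a few key names from the real response rather than actual DuckDuckGo output. The import fallback is there because urllib.quote moved to urllib.parse in Python 3.

```python
import json

try:
    from urllib import quote        # Python 2, as used in main.py
except ImportError:
    from urllib.parse import quote  # Python 3

API_ROOT = 'http://api.duckduckgo.com/?q=python+'


def build_api_url(pyquery):
    """Encode the query and add the JSON format flag (hypothetical helper)."""
    return API_ROOT + quote(pyquery) + '&format=json'


# A made-up response body, for illustration only.
sample_response = (
    '{"Heading": "str (Python)", '
    '"AbstractSource": "Python Documentation", '
    '"AbstractText": "", "AbstractURL": ""}'
)

# json.loads() turns the JSON text into a regular Python dictionary.
pythonterm = json.loads(sample_response)

print(build_api_url('str'))
print(pythonterm['AbstractSource'])
```

In the real procedure the string passed to json.loads() comes from response.read() instead of a hardcoded sample.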

Now that we have the response, we need to make sense of it. I looked at the JSON from a few different search terms, and many of them included <code> tags to help format the results. Unfortunately, Bottle’s templates escape any HTML in the values passed to them by default, so the tags would show up as plain text on the page. (This is actually a good policy, since accepting unencoded HTML can open a lot of security holes.) Instead, we can parse these with BeautifulSoup to get the <code> blocks and then reformat them in the template. Here’s how:

# If there's a response...
if pythonterm:
    pyterm_info = {}
    # If the response is from the Python Documentation...
    if pythonterm['AbstractSource'] == 'Python Documentation':
        # Get its description and try to find a <code> block.
        pyterm = BeautifulSoup(pythonterm['AbstractText'])
        try:
            pyterm_code = pyterm.find('code').string
            pyterm.pre.decompose()
            pyterm_info['code'] = pyterm_code
        # find() and .pre give None when a tag is missing, which raises AttributeError.
        except AttributeError:
            pyterm_info['code'] = None
        pyterm_desc = pyterm.get_text()
        pyterm_info['desc'] = pyterm_desc
        pyterm_info['url'] = pythonterm['AbstractURL']
        # We found something, so set results to True
        results = True
else: 
    pyterm_info = None

That’s it! As long as we pass the pyterm_info dictionary to the template in the return line at the bottom of this procedure, the template will have all the Python information it needs. Here are the blocks in results.tpl that display it:

%if pyterm_info:
    <div class="row">
    <div class="span6 well">
        %if pyterm_info['code']:
        <p><a href="{{ pyterm_info['url'] }}"><code>{{ pyterm_info['code'] }}</code></a></p>
        %end
    <blockquote>{{ pyterm_info['desc'] }}</blockquote>
    <p class="source">Read more: <a href="{{ pyterm_info['url'] }}"><img class="icon" src="/styles/py.png" height="15" width="15">  Python documentation</a></p>
    <ul class="nav nav-list">
    <li class="divider"></li>
    <li><p class="pull-right">Python search powered by <a href="http://duckduckgo.com/"><img class="icon" src="/styles/ddg.png" height="15" width="15"> DuckDuckGo</a></p></li>
    </ul>
    </div>
    </div>
    %end

If pyterm_info isn’t empty, the template prints the code (when there is any) inside a <code> tag, with the description beneath it. The conditions of the DuckDuckGo API require attribution to the original source and DuckDuckGo, so there’s a block at the bottom to give credit. I added a few styles from Bootstrap to make these results look nice, tested a few queries on the development server, and it looks like it’s working well. Using the API saved a lot of time: it took less than an hour to get the last big feature working, and it was pretty painless after figuring out how to parse JSON objects. DaveDaveFind is almost done!

Apr 11

Template tinkering

The last step in adding videos and page information to DaveDaveFind is editing the results template to interpret the data it receives from the main search script. It’s not as tough as you might think! Here’s the HTML that handles all the search results:

<div class="row">
    <div class="span6">
        <h2>You searched for: <strong class="orange">{{ search_query }}</strong></h2>
        %if not results:
        <p>No results found for {{ search_query }}.</p>
        %end
    </div>
</div>

<div class="row">
    <div class="span5" id="fixed">
        %if video_dicts:
            <div>
            <h2>Videos:</h2>
            %for video in video_dicts:
            <div class="results well">
            <strong>{{ video['title'] }}</strong>
            <p><a href="{{ video['url'] }}">{{ video['url'][:70] }}</a></p>
            <iframe width="430" height="248" src="http://www.youtube.com/embed/{{ video['id'] }}?rel=0&start={{ video['start'] }}&wmode=transparent" frameborder="0" allowfullscreen></iframe>
            </div>
            %end
            </div>
        %end
    </div>
    <div class="span7">
        %if page_dicts:
            <div>
            <h2>Webpages:</h2>
            %for page in page_dicts:
            <div class="results well">
            <strong>{{ page['title'] }}</strong>
            <p><a href="{{ page['url'] }}">{{ page['url'][:70] }}</a></p>
                %if show_daverank:
                <p>DaveRank: {{ page['daverank'] }}</p>
                %end
            <p>{{ page['text'] }}</p>
            </div>
            %end
            </div>
        %end
    </div>
</div>

Among all the HTML tags, there are three %if blocks. One handles the ‘no results’ case, one handles videos, and one handles webpages. The video and webpage blocks iterate through each of the dictionaries in the list passed in from the search script, and put some information into the HTML tags. Notice this pretty neat feature:

<iframe width="430" height="248" src="http://www.youtube.com/embed/{{ video['id'] }}?rel=0&start={{ video['start'] }}&wmode=transparent" frameborder="0" allowfullscreen></iframe>
<p><a href="{{ page['url'] }}">{{ page['url'][:70] }}</a></p>

Stuff passed to the template in Python can be inserted directly into tags and URLs, which means the Python script can easily set video and page URLs. This means we can embed each YouTube video, with a link directly to the point in the video where a search term appears! I tested this out last night, and it’s working pretty well, but I accidentally deleted the local datastore, so I don’t have a screenshot for now. Don’t worry, though–DaveDaveFind is coming along well, and soon I’ll have a usable demo of the application.

In addition to these changes to the template, I made some style changes in the header, and messed with the HTML structure a little bit. This involved a lot of tinkering on my part–I’m not a CSS whiz. To see these changes, take a look at the code on GitHub.

Putting it all together

Now that the new models are in the database and the crawler is indexing videos, documents, and Udacity URLs, it’s time to put it all together in the search engine script main.py. Last time we visited, it was still pretty simple. Now, there’s a lot of new code. I’ll step through it bit by bit.

First, remember how the process_search() procedure that handles search lookups works:

def process_search():
    search_query = request.GET.get('search_query', '').strip()
    query = search_query.lower()

It gets the string search_query from the search form, and converts it to lowercase. This could be a multiword string, one word, or an empty string, and it’s important to make sure the procedure can handle all these cases.

Next, I set a couple variables. If nothing intervenes, they will stay False and get passed on to the template:

show_daverank = False
results = False

The next block is a cool feature inspired by !bang syntax on DuckDuckGo (I had no idea Gabriel Weinberg would be a contest judge when I started this project, by the way. I hope he doesn’t mind that this project rips off a few features from DDG!). By checking search queries for certain keywords prepended with two dashes, like “--cs101”, DaveDaveFind will return results from other sites. This code catches the strings “--cs101”, “--cs373”, and “--python”, and redirects to search the course forums or Python documentation. It also looks for the argument “--daverank”, which prints a page’s Daverank underneath its URL. It’s pretty easy with the string.find() method and a little slicing:

if query.find('--') == 0:
    if query.find('--cs101') == 0:
        redirect_url = 'http://www.udacity-forums.com/cs101/search/?q=' + urllib.quote(query[8:])
        return redirect(redirect_url)   
    if query.find('--cs373') == 0:
        redirect_url = 'http://www.udacity-forums.com/cs373/search/?q=' + urllib.quote(query[8:])
        return redirect(redirect_url)   
    if query.find('--python') == 0:
        redirect_url = 'http://docs.python.org/search.html?q=' + urllib.quote(query[9:])
        return redirect(redirect_url)
    if query.find('--daverank') == 0:
        query = query[11:]
        search_query = query
        show_daverank = True

The method urllib.quote() converts the rest of the search string into a format suitable for sending in a URL. For example, the search:

--cs101 url encoding

Becomes:

http://www.udacity-forums.com/cs101/search/?q=url%20encoding

It’s important to encode strings so that they don’t get messed up when they’re sent to the browser.
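Here is a short sketch of the same idea outside of Bottle. The helper name forum_redirect_url is hypothetical, and it strips the keyword by its length instead of a hardcoded index. The import fallback is there because urllib.quote lives in urllib.parse on Python 3.

```python
try:
    from urllib import quote        # Python 2, as used in main.py
except ImportError:
    from urllib.parse import quote  # Python 3

FORUM_SEARCH = 'http://www.udacity-forums.com/cs101/search/?q='


def forum_redirect_url(query, keyword='--cs101'):
    """Strip the leading keyword and URL-encode the rest of the query."""
    rest = query[len(keyword):].strip()
    return FORUM_SEARCH + quote(rest)


print(forum_redirect_url('--cs101 url encoding'))

# Characters with a special meaning in URLs get encoded too, so a
# query can't accidentally add or change URL parameters.
print(quote('lists & dictionaries?'))
```

Without the encoding step, the ampersand in the second query would be read as the start of a new URL parameter.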

Next up, I’ve edited the code to handle multi-word queries. The procedure splits the search query into a list, and iterates through each element, getting the URLs associated with each term from the database, which are now stored in a Python list:

query_words = query.split()
query_urls = []
for term in query_words:
    # Get the first SearchTerm object that matches this term.
    q = SearchTerm.all().filter('term =', term).get()
    if q:
        query_urls.append(set(q.urls))

Next, a big if block to handle the results. The next few code snippets are all part of this block. (In fact, it’s so long that it’s probably time to move it into its own procedure).

if query_urls:
    query_url_set = set.intersection(*query_urls)
    query_url_list = list(query_url_set)    

    results = True
    if len(query_url_list) > 30:
        query_url_list = query_url_list[0:30]

    page_results = Page.all().filter('url IN', query_url_list).order('-dave_rank').fetch(5)
    page_dicts = []

First, we want to make sure that DaveDaveFind returns pages with all the search terms in a multiword query. To do so, I used a Python set, which is a lot like a list, but contains only unique elements. The previous for block stored each list of URLs as a set, and the first two lines of this block retrieve the intersection of each set. The asterisk inside the method is a new Python trick I learned: it “unpacks” each element from a list or similar type into a separate argument. That’s why the method wouldn’t accept the list directly but worked with the asterisk: set.intersection() expects the sets themselves as its arguments, not a list containing them.
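A tiny example shows the difference. The URLs are invented. The unpacked call is the same as writing out each set as its own argument, while passing the list itself fails.

```python
# One set of (made-up) URLs per search term.
query_urls = [
    set(['/lecture1', '/lecture2', '/notes']),
    set(['/lecture2', '/notes', '/forum']),
    set(['/notes', '/lecture2']),
]

# Same as set.intersection(query_urls[0], query_urls[1], query_urls[2]).
common = set.intersection(*query_urls)
print(sorted(common))  # ['/lecture2', '/notes']

# Without the asterisk, the method receives one list instead of sets.
try:
    set.intersection(query_urls)
    accepted_list = True
except TypeError:
    accepted_list = False
print(accepted_list)  # False
```

Only the URLs that appear in every set survive, which is exactly the "all the search terms" behavior we want.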

If the database query returns results, we toggle results to True, which will get passed to the template later.

The next block is a kludgy hack that I hope to fix later. The new data models are much more efficient since they don’t perform a bunch of one-to-many lookups, but I ran into a new error: the filter() method can only handle 30 items 'IN' a particular query. In other words, it can only look up 30 URLs at a time. Most of the time, it’s okay, but the function threw an error for popular terms like ‘Python.’ For now, when a term has more than 30 associated URLs, the code simply limits the URLs returned to the first 30. This is okay, but it doesn’t necessarily return the best URLs. I’m thinking about how to fix this.
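One possible way around the limit, which I haven’t tried on App Engine, is to split the URL list into batches of 30 and run one query per batch. The sketch below only covers the splitting part, and chunk is a hypothetical helper. The results from each batch would still need to be merged and re-sorted by dave_rank in Python, and every extra batch costs another database query.

```python
def chunk(items, size=30):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# 70 made-up URLs, as if a popular term matched 70 pages.
urls = ['/page%d' % n for n in range(70)]
batches = chunk(urls)

print([len(batch) for batch in batches])  # [30, 30, 10]
```

Each batch is small enough to pass to filter('url IN', batch) on its own.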

Next, we get some information about each page in the results, and put it in a dictionary, page_info, that will be passed to the template:

    for page in page_results:
        page_info = {}
        query_index = page.text.find(query)
        if query_index != -1:
            # max() keeps the slice from starting at a negative index.
            i = max(0, query_index - 50)
            j = query_index + 450
        else:
            i = 0
            j = 500
        text_string = page.text[i:j]
        page_info['text'] = text_string
        page_info['title'] = page.title
        page_info['url'] = page.url
        page_info['daverank'] = page.dave_rank
        page_dicts.append(page_info)

The i and j indexing gets a snippet from the full text of the page to display with the search results. This is another bit of code to improve in the future. It’s pretty dumb right now, and usually cuts off words and sentences. Better code would try to find full words, and wouldn’t use hard-coded index values.

Next, we do the same for video objects. This is a pretty gnarly block of nested if statements, and I’m thinking about how to clean it up. But here it is, in all its gory detail:

    # Get the top 3 videos for the search term.
    video_results = Video.all().filter('url IN', query_url_list).order('-views').fetch(3)
    video_dicts = []
    #Iterate through each video and store its information in a dictionary.
    for video in video_results:
        video_info = {}
        #Get a video's subtitles and find the search query.
        subtitles = video.text.lower()
        query_index = subtitles.find(query)
        time_string = ''
        #If the full search query is in the video, find it...
        if query_index != -1:
            #...by splitting the subtitles into a list of lines...
            subtitle_list = subtitles.splitlines()
            #...and iterating over them to find the query.
            for phrase in subtitle_list:
                if phrase.find(query) != -1:
                    #Get the timestamp associated with the search term.
                    timestamp_index = subtitle_list.index(phrase) - 1
                    timestamp = subtitle_list[timestamp_index]
                    if len(timestamp) > 1:
                        #Save its minutes and seconds information
                        minutes = timestamp[3:5]
                        seconds = timestamp[6:8]
                        #Add it to a string
                        time_string = '#t=' + minutes + 'm' + seconds + 's'
                        start = 60 * int(minutes) + int(seconds)

        if time_string:
            url = video.url + time_string
        else:
            url = video.url
            start = 0           
        video_info['title'] = video.title
        video_info['url'] = url
        video_info['subtitle'] = video.text[-20:query_index:20]
        video_info['id'] = video.id
        video_info['start'] = start
        video_dicts.append(video_info)

This is basically the same as the last snippet, except for the timestamp stuff. I considered importing the pysrt library to handle subtitles, but decided to use string slicing instead. Since the indexed subtitles contain timestamps, DaveDaveFind can easily do something pretty cool: if it finds the full search query in a video transcript, it will “deep link” directly to the phrase inside the video. This feels pretty magical in practice, but it’s actually just parsing the timestamp information and adding it to the YouTube URLs that it generates.
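Pulled out of the nested loop, the timestamp handling looks like this. It’s a sketch that assumes the standard .srt timestamp line (hours:minutes:seconds,milliseconds). The function names and the video URL are placeholders, and like the code above it ignores the hours field, which is fine for short lecture videos.

```python
def srt_start_seconds(timestamp):
    """Convert the start of an .srt timestamp line to whole seconds.

    Uses the same slices as the search code, so hours are ignored.
    """
    minutes = timestamp[3:5]
    seconds = timestamp[6:8]
    return 60 * int(minutes) + int(seconds)


def deep_link(url, timestamp):
    """Add a '#t=..m..s' suffix that points at the timestamp."""
    minutes = timestamp[3:5]
    seconds = timestamp[6:8]
    return url + '#t=' + minutes + 'm' + seconds + 's'


line = '00:01:23,456 --> 00:01:26,000'
print(srt_start_seconds(line))  # 83
print(deep_link('http://www.youtube.com/watch?v=VIDEO_ID', line))
```

The number of seconds feeds the start parameter of the embedded player, and the suffix goes on the plain link to the video.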

Finally, if there’s a detailed time string, it’s added to the URL. If not, the video is set to start at the beginning. Just like with the webpage lookup, we add some information to a dictionary that’s passed on to the template in these last few lines:

else:
    page_dicts = None
    video_dicts = None


return template('templates/results', search_query=search_query, page_dicts=page_dicts, video_dicts=video_dicts, show_daverank=show_daverank, results=results)

The process_search() procedure takes all this information and sends it to the template to render. If there are no page or video results, it sends None to the template. Next up, we’ll take a look at the changes to the template to handle all this new information.  

The joy of automation

After fixing my data models, I checked back on my question in the Udacity forums. I found one helpful response, but it didn’t get much traction with anyone else. So I went looking once more for an automated way to download YouTube captions. In the end, I found a mostly-automated solution that was a big improvement on visiting each page and finding its timedtext file. (I won’t share it here, since it seems that Udacity doesn’t want the videos to be too easy to find). It took about four minutes to have a folder full of .srt subtitle files for almost every video in the CS101 curriculum (a few of them weren’t subtitled for one reason or another).

Indexing videos (and their content, thanks to the subtitles) will make DaveDaveFind much more useful. Although it already includes a pretty good index of the Udacity site and forums, the links aren’t always the most useful. In fact, I noticed that the highest-ranked site in the index is the “Legal” page, presumably because it’s linked a lot from other sites in the Udaci-verse.

It’s also one of the only pages on the Udacity site that’s mostly text, and thus legible to our web crawler. Since the main Udacity site is mostly made of embedded videos and coding exercises, it’s (ironically) not very easy for our search crawler to read! Downloading subtitles makes the video information readable by the crawler.

Getting my hands on the subtitles turned out to be pretty easy. But making them readable for the crawler was a little harder. I needed the unique YouTube ID for each video, but couldn’t find a good automated way to get them (it looks like the Udacity site uses a lot of JavaScript). So, I played a few of my favorite podcasts and buckled down to get this information by hand.

Four and a half hours later, I had learned an important lesson about automation: it really sucks to collect data by hand. But now I had each video’s ID in a Python-readable .csv file, which is the key to adding them to the index. Sometimes it’s tough to avoid a little hard work, even with Python.

After a short break, I stopped to think about other information I might need from each video. Since the video links aren’t readable by the crawler, I needed some other way to determine which videos might be more useful than others. I decided to store each video’s view count in the spreadsheet, too. I also had a small realization: Udacity has a good reason for making certain videos hard to find. Since the course will be offered again, it might not be a good idea to make homework solutions and quiz answers searchable (even if they are useful to current students). As a student who had already completed the course, I lost sight of this in my zeal to index everything. To make it easier to disable quiz and homework videos later, I made sure to code each video as a quiz, problem, solution, or lecture. If DaveDaveFind stays around after this contest, hopefully it will be easy to add and remove videos from the index according to course progress, or add some sort of integration with Udacity logins to prevent students from looking up answers before they’re allowed.

In the end, the information I decided to store for each video is a lot like the information for each page:

class Video(db.Model):
    """Models a video in the index."""
    url = db.StringProperty()
    title = db.StringProperty()
    filename = db.StringProperty()
    id = db.StringProperty()
    type = db.StringProperty()
    views = db.IntegerProperty()
    text = db.TextProperty()

Storing the full text of the transcript will help later with implementing multi-word search the easy way (a topic for a later post).

I also downloaded the supplementary documents from the Udacity site: the glossary and each chapter’s notes and Python reference. With all this information in place, it was just a matter of figuring out how to get it into the index. Here’s the script I wrote to index PDF files, and here’s the one for adding videos to the index.

You can probably see how I reused some of the code that we wrote for the crawler. The principle behind each script is the same: break up each document or transcript into its individual words, get the page’s URL and maybe some additional information, add it to the index or pagedata dict, and write out the final data to a .csv file readable by Google App Engine.
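
Stripped of the file handling, that shared principle might look something like the sketch below. The function name, the sample transcript, and the URL are invented for illustration; the real scripts are the ones on GitHub.

```python
def add_text_to_index(index, url, text):
    """Add every word in text to the index, pointing at url."""
    for word in text.split():
        word = word.lower()
        if word not in index:
            index[word] = []
        # Avoid listing the same URL twice for one word.
        if url not in index[word]:
            index[word].append(url)
    return index

index = {}
# Hypothetical transcript snippet and URL, for illustration only.
add_text_to_index(index, 'http://example.com/video1',
                  'A hash table maps keys to values')
```

Each word from the transcript ends up as a key in index, mapped to a list of the URLs where it appears.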

Hooking these procedures up to the web crawler was as easy as importing the scripts and passing index to each one at the end of the crawl_web() procedure:

from index_pdfs import index_pdfs
from add_videos import add_videos_to_index

# Beginning of crawl_web() and the tocrawl loop goes here...

index, pagedata = index_pdfs(index, pagedata)
index = add_videos_to_index('subtitle_index.csv', '/Users/connormendenhall/Python/DaveDaveFind/DaveDaveFind/data/video_info.csv', index)
index = undupe_index(index)
return index, graph, pagedata

The new undupe_index() procedure is pretty simple, too. It checks the finalized index for duplicate URLs and removes them, so they don’t clutter up the database:

def undupe_index(index):
    for key in index.keys():
        index[key] = list(set(index[key]))
    print "[undupe_index()] Index un-duped"
    return index
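
One caveat: converting to a set and back doesn't keep the URLs in their original order. That may not matter if results are sorted by rank later, but if order ever does matter, a variant along these lines would preserve it. This is only a sketch; the function name and sample URLs are made up.

```python
def undupe_index_ordered(index):
    """Remove duplicate URLs from each entry, keeping first-seen order."""
    for key in index:
        seen = set()
        unique = []
        for url in index[key]:
            if url not in seen:
                seen.add(url)
                unique.append(url)
        index[key] = unique
    return index

# Made-up sample index with one duplicated URL.
sample = {'hash': ['http://example.com/a',
                   'http://example.com/b',
                   'http://example.com/a']}
undupe_index_ordered(sample)
```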

Finally, I added a new procedure to the crawler that does one new and very important thing: stores a webpage’s full text in a dictionary, which the script later writes to the page info .csv file:

def get_page_data(page, url, dict):
    try:
        title = page.title.string
    except:
        title = url
    try:
        text = page.body.get_text()
    except:
        text = ''
    dict[url] = [title, text]   

It’s pretty simple, but this makes a huge difference for doing multi-word lookups (and it’s a lot easier than keeping track of string indexes like the final exam question). I noticed as I posted this code that I used the built-in type dict as a variable name. That’s a bad idea, so I’ll make sure to change it in my next update.
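
For reference, the renamed version could look something like this sketch. The parameter name pagedata matches the dictionary used in crawl_web(); catching only AttributeError instead of using a bare except is my own assumption about which error a missing title or body raises, and the EmptyPage class is just a stand-in for a parsed page so the example runs on its own.

```python
def get_page_data(page, url, pagedata):
    """Store a page's title and full text in pagedata, keyed by URL."""
    try:
        title = page.title.string
    except AttributeError:
        # No <title> tag (or no page at all): fall back to the URL.
        title = url
    try:
        text = page.body.get_text()
    except AttributeError:
        text = ''
    pagedata[url] = [title, text]

class EmptyPage(object):
    """Hypothetical stand-in for a parsed page with no title or body."""
    title = None
    body = None

pagedata = {}
get_page_data(EmptyPage(), 'http://example.com/', pagedata)
```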

These are the biggest changes to the crawler code, but I encourage you to check out all the files in the /crawler/ directory on GitHub to see them for yourself. Next up: the changes I’ve made to the search engine script.

Remodeling

Remember the data models I wrote a few posts ago? They were just a few lines of code describing how DaveDaveFind would store crawler data in the database, but as it turns out, they had a huge impact on the way the application worked. Here’s a reminder:

class PythonTerm(db.Model):
    """Models a term from the Python glossary."""
    term = db.StringProperty()
    definition = db.TextProperty()

class SearchTerm(db.Model):
    """Models a search term."""
    term = db.StringProperty()

class PageUrl(db.Model):
    """Models a URL and its Daverank from the index."""
    # A search term can be associated with many pages...
    page = db.ReferenceProperty(SearchTerm,
                                collection_name='pages')

    #...but each page has a URL and Daverank.
    url = db.StringProperty()
    dave_rank = db.FloatProperty()

While I was using these models, the development server was extremely slow. I also tested the site out on the production server (this is a bad idea, but I was stuck), and noticed immediately that the database was doing a ton of write operations. In fact, I wasn’t even able to upload the full index without exceeding the daily quota for a free user on App Engine. I started to suspect that something was up with my models, so I did a little bit of research on the ways that real search engines store their indexes.

Taking a step back and doing some reading was a good idea. I found this helpful blog post, and re-read some of the App Engine documentation on data models. There, I discovered that it’s possible to store a Python list directly in the database. Pretty cool, since our original index mapped keywords to lists of URLs.

I found a way to store the index without using a ReferenceProperty, which was slowing down the application and making uploads interminably long. Here are the new models I came up with:

class SearchTerm(db.Model):
    """Models a search term and its associated URLs."""
    term = db.StringProperty()
    urls = db.StringListProperty()

class Page(db.Model):
    """Models a Page and its Daverank from the index."""
    url = db.StringProperty()
    title = db.StringProperty()
    text = db.TextProperty()
    dave_rank = db.FloatProperty()
    doc = db.BooleanProperty()

class Video(db.Model):
    """Models a video in the index."""
    url = db.StringProperty()
    title = db.StringProperty()
    filename = db.StringProperty()
    id = db.StringProperty()
    type = db.StringProperty()
    views = db.IntegerProperty()
    text = db.TextProperty()

The SearchTerm model now stores each search term and a list of associated URLs, just like the original data structure in our crawler code. Instead of having the database look up each URL according to a ReferenceProperty, the search engine code iterates through the list in Python, which is much faster. The Page model now stores a lot more information, including page titles and their full text! But since the model doesn’t use a ReferenceProperty, it’s actually faster to load onto the server than the earlier models. The Video model is new (more on video indexing later), but it’s more or less like the Page model.
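
To show the shape of a lookup with the list-based layout, here's a sketch that uses plain Python dicts in place of the datastore. The search() helper, the sample URLs, and the ranks are all invented; the real code queries SearchTerm and Page entities instead.

```python
# Stand-ins for the SearchTerm and Page entities (made-up data).
search_terms = {'hash': ['http://example.com/a', 'http://example.com/b']}
page_ranks = {'http://example.com/a': 0.2, 'http://example.com/b': 0.7}

def search(term, limit=5):
    """Return up to `limit` URLs for term, highest rank first."""
    urls = search_terms.get(term.lower(), [])
    ranked = sorted(urls, key=lambda url: page_ranks.get(url, 0.0),
                    reverse=True)
    return ranked[:limit]

results = search('Hash')
```

One lookup fetches the whole list of URLs for a term, and the sorting happens in ordinary Python rather than in a separate database query per URL.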

Of course, I had to write new loader scripts, too. Fortunately, they worked more or less the same way as the earlier ones I’d written. To take a look, you can check out this folder on GitHub.

The good, the bad, and the ugly

I haven’t posted in a couple days, but I’ve been hard at work making a bunch of improvements to DaveDaveFind. I’ve spent a lot of time struggling to figure out Google App Engine, and haven’t kept these posts completely concurrent with the code I’ve written. I hate it when a readable tutorial stops without any notice, and I hope these posts will be useful for other students, so I’ll try to cover most of the changes I’ve made in the last few days, even if I might not go through every line of code. This will be a brief summary post, and I’ll cover a few more interesting things on their own in greater detail. As always, you can see all the changes I’ve made step by step in the repository on GitHub.

The good

DaveDaveFind can now search videos and documents, includes a few shortcuts inspired by !bang syntax, supports multi-word queries, and seems to be pretty fast and reliable. I’ve checked off everything on the TODO list except for adding better Python term queries, and that shouldn’t be too difficult. There are a lot of little improvements to make, and I can always improve the quality of the results, but I think the end is in sight. In fact, when I went to finish up my final exam, I found myself using DaveDaveFind to look up a couple videos, and it worked! I’m pretty impressed.

The bad

I have struggled mightily with Google App Engine. Uploading data to test out on the development server sometimes took hours, even for comparatively small indexes (like, under 1 MB), and as a beginner, it’s always difficult to tell if I’m doing things the right way. I spent a lot of time working on eliminating duplicate entries in the index, but the database was still really slow. In the end, I decided to take another look at my data models, and do some research on the way real search engines store information. As it turns out, my data structures were bad, costly, and inefficient (remember all the one-to-many keys?), so I had to rewrite them. On the plus side, I discovered that it’s possible to store Python lists in the App Engine datastore, which is pretty cool. On the other hand, it took me hours to figure out how to get a Python list from my hard drive onto the server.

The ugly

I’ve made a lot of changes and fixed a lot of little problems, but I can already feel my code slipping from the simple, readable procedures we wrote in class to lots of nested ifs and except blocks designed to catch little, idiosyncratic errors. Straying from the Udacity method of small, documented steps while I was frustrated with App Engine has probably contributed to this. Before I go too far, I should step back and see if I can make my code a little easier to read. But it feels very difficult to fight this tendency as I add more and more procedures and features.

Apr 06

Learning to crawl (again)

In this post, I’ll try to fix some of the smaller problems that have been piling up across my search engine code. Here are the three I came up with at the end of my last post:

I’d also like to test out the engine with a slightly bigger index and take another look at identifying search terms that are also Python words.

I’ll start with the crawler. It splits words in the procedure add_page_to_index(), which hasn’t changed yet:

def add_page_to_index(index, url, content):
    try:
        text = content.get_text()
    except:
        return
    words = text.split()
    for word in words:

        add_to_index(index, word, url)

Adding these lines should strip out the punctuation and convert all strings to lowercase. Since I’m only worried about punctuation at the start and end of each string, I’ll only check the first and last character.

    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    for word in words:
        if word[0] in punctuation:
            word = word[1:]
        if word[-1] in punctuation:
            word = word[:-1]
        word = word.lower()
        add_to_index(index, word, url)

To fix case-sensitivity, I’ll also add a method call to the first line of process_search() in main.py:

search_query = request.GET.get('search_query', '').strip().lower()

On second thought, this isn’t a good idea, because I’d like to save the search term with its original capitalization. Let’s put the lowercase version on its own line:

search_query = request.GET.get('search_query', '').strip()
lowercase_query = search_query.lower()

While I’m there, I should think about how to catch search terms that aren’t in the database. When the database tries to look up a term that’s not in the index, it returns None, which causes an error. This took some experimentation, but here’s how I fixed it:

# Get the first SearchTerm entity that matches the query (or None).
q = SearchTerm.all().filter('term =', lowercase_query).get()

# Now get the PageUrls that are associated with the term...
# ...if they exist!
if q:
    page_urls = q.pages
    # Sort them by dave_rank and return the top five.
    results = page_urls.order('-dave_rank').fetch(5)

# If not, pass None to the results.
else:
    results = None

And here’s how I modified the template for a search term that’s not in the index:

<h3>You searched for: {{ search_query }}</h3>
        %if results:
            %for page in results:
            <a href="{{ page.url }}">{{ page.url }}</a>
            %end
        %else:
            <p>No results found for {{ search_query }}.</p>
        %end

One tricky thing about templates is that all Python code blocks must have an %end statement, and not just at the end of all the code. Initially, I forgot the first %end block, which returned an HTML page that cut off immediately after the <h3> tags at the top. At first, it wasn’t clear that this was the problem. Since the HTML degraded pretty well, it just looked like nothing was happening. After checking the details of SimpleTemplate syntax, I figured out what was wrong.

Now DaveDaveFind accepts any search term with any capitalization, and the crawler ignores punctuation at the beginning and end of words. I’ll try to load a bigger index and see if anything goes wrong. Let’s try crawling 25 pages starting at the main Udacity page, with a depth of 10. For now, I’ve modified the bottom of the code like this:

cache = {}
max_pages = 25
max_depth = 10

def start_crawl():              
    index, graph = crawl_web('http://www.udacity.com/', max_pages, max_depth)
    ranks = compute_ranks(graph)
    write_search_terms('search_terms.csv', index)
    write_url_info('url_info.csv', index, ranks)

    print "INDEX: ", index
    print ""
    print "GRAPH: ", graph
    print ""
    print "RANKS: ", ranks

if __name__ == "__main__":
    start_crawl()

The lines at the bottom that run the crawler, compute ranks, and write the data to external files are now in a procedure of their own. The last two lines are a Python idiom that runs a procedure if the code is run from the command line, but not if it’s imported into something else. Here’s how it works. This code will run the procedure start_crawl() when I run it from the command line, but won’t start crawling if I import it in the interactive terminal to test something out.

You might notice that I haven’t used the cache at all. It’s probably a good idea to think about how I could incorporate it.

When I ran the crawler, it crashed right away! Here’s the error message:

File "udacity_crawler.py", line 85, in add_page_to_index
    if word[-1] in punctuation:
IndexError: string index out of range
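
One likely cause: a "word" that is nothing but a single punctuation character (a lone dash, say) becomes an empty string after the first check, and then word[-1] has nothing left to index. Here's a minimal reproduction of that case; the helper name is made up for the demo.

```python
punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

def strip_first_and_last(word):
    """The original approach: check only the first and last character."""
    if word[0] in punctuation:
        word = word[1:]
    if word[-1] in punctuation:   # fails if word is now empty
        word = word[:-1]
    return word

# A lone dash is stripped to '' by the first check,
# so the second check raises IndexError.
try:
    strip_first_and_last('-')
    crashed = False
except IndexError:
    crashed = True
```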

The punctuation-stripping code I wrote earlier is too fragile, so I’ll have to find a better solution. I checked the Python reference and found the string methods lstrip() and rstrip() which remove characters from the beginning and end (‘left’ and ‘right’) of strings. Thus:

for word in words:
    word = word.lstrip(punctuation)
    word = word.rstrip(punctuation)
    word = word.lower()
    add_to_index(index, word, url)
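
One thing to watch with this approach: a word made entirely of punctuation is stripped down to an empty string, which would then be added to the index as a term. A possible refinement, sketched below, uses strip() to handle both ends at once and skips empty results. The clean_words() helper is hypothetical, and string.punctuation is the standard library's constant holding the same characters as the punctuation string above.

```python
import string

def clean_words(text):
    """Lowercase each word, strip surrounding punctuation, drop empties."""
    cleaned = []
    for word in text.split():
        word = word.strip(string.punctuation).lower()
        if word:   # skip words that were only punctuation
            cleaned.append(word)
    return cleaned

words = clean_words('Hello, "World" -- (Python!)')
```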

It works! The new index is uploading to the development server as I type, and it’s pretty slow! Crawling just a few more pages resulted in a massive increase in the size of the index: the .csv file that contains URLs, terms, and Daveranks is now 1.3 megabytes! Meanwhile, the list of search terms is only 35k. Looking over the terms and URLs, a few immediate problems to solve are clear:

The database is still (!) loading, so I’m going to wrap up this post for now and find another problem to work on.