<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description>Development notes for DaveDaveFind, a simple search engine I built for Udacity CS101.</description><title>http://davedavefind.tumblr.com/</title><generator>Tumblr (3.0; @davedavefind)</generator><link>http://davedavefind.tumblr.com/</link><item><title>Final Thoughts</title><description>&lt;p&gt;I&amp;#8217;ve finished up all the code behind DaveDaveFind, made sure the pages look nice, and run one last big crawl of the Udacity site and forums. It&amp;#8217;s time to deploy the site for good and post it as a contest submission.&lt;/p&gt;

&lt;p&gt;I&amp;#8217;m impressed at how much I&amp;#8217;ve learned in the last two weeks of work on DaveDaveFind (and the last seven weeks of Udacity classes). I now have a pretty good grasp of the Bottle framework, Google App Engine, using BeautifulSoup, interacting with a database, and styling pages with Bootstrap. In fact, I&amp;#8217;ve even used DaveDaveFind a few times to look up information from the course, which must be a good sign! This project has turned out better than I expected, and I think it&amp;#8217;s met my original goals: search all the course materials and use as much of the original search engine code as possible.&lt;/p&gt;

&lt;p&gt;I entered CS101 as a total newcomer to computer science and a novice Python programmer. I knew how to use dictionaries, but had no idea that they worked well because they use hash tables. I knew how to write a &lt;code&gt;for&lt;/code&gt; loop, but had no idea why and when some might take longer than others. I knew it was possible to write a recursive function, but not how to do it correctly and think about the base case.&lt;/p&gt;

&lt;p&gt;I&amp;#8217;ve kept these notes in part to show how well the bite-sized thinking I&amp;#8217;ve learned from Udacity can work. I brought a little bit of previous knowledge to this project, but I&amp;#8217;ve taught myself much more in the last few weeks. Changing our basic search crawler code into a working web application took a lot of new code, but I wrote it all by breaking big problems into little ones, carefully reading documentation, and asking for help when I got stuck. I didn&amp;#8217;t realize how much I&amp;#8217;d learned until I finished this project, and I hope these notes will help other Udacity students do the same.&lt;/p&gt;

&lt;p&gt;I owe huge thanks to Peter and Professor Evans for an excellent course and the opportunity to take part in this contest. To see the final production code for DaveDaveFind, visit the GitHub repository &lt;a href="https://github.com/ecmendenhall/DaveDaveFind"&gt;here&lt;/a&gt;.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/21160971148</link><guid>http://davedavefind.tumblr.com/post/21160971148</guid><pubDate>Sun, 15 Apr 2012 14:52:08 -0400</pubDate></item><item><title>Appendix: Using Bootstrap </title><description>&lt;p&gt;I&amp;#8217;ve mostly focused on Python in these posts, but I want to go over the way DaveDaveFind uses &lt;a href="http://twitter.github.com/bootstrap/"&gt;Twitter Bootstrap&lt;/a&gt; styles. Styling pages with Bootstrap feels a little bit like putting blocks together. Build your page in HTML, find the right &lt;code&gt;class&lt;/code&gt; attributes for each piece, and Bootstrap will make it look professional. The Bootstrap &lt;a href="http://twitter.github.com/bootstrap/scaffolding.html"&gt;documentation&lt;/a&gt; is very good, but it can still take a lot of experimentation if you haven&amp;#8217;t used CSS in a while. (I haven&amp;#8217;t, so most of what&amp;#8217;s written here is the product of lots of trial and error. Take my advice at your own risk!).&lt;/p&gt;

&lt;p&gt;Adding Bootstrap to a project is very easy. Once the CSS files are somewhere in your project directory, add links in the &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; of your HTML:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;head&amp;gt;
    &amp;lt;link rel="stylesheet" href="/styles/bootstrap.css" type="text/css"&amp;gt;
    &amp;lt;link rel="stylesheet" href="/styles/home.css" type="text/css"&amp;gt;
    &amp;lt;link rel="stylesheet" href="/styles/bootstrap-responsive.css" type="text/css"&amp;gt;
&amp;lt;/head&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first link is the main Bootstrap style sheet, the second is the custom changes I made for DaveDaveFind (see it &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/blob/master/styles/home.css"&gt;here&lt;/a&gt;), and the third is Bootstrap&amp;#8217;s responsive style sheet, which makes it easy to adjust the layout for screens and devices of different sizes. Make sure the responsive sheet is linked last, or it won&amp;#8217;t work correctly.&lt;/p&gt;

&lt;p&gt;As for the HTML structure of the site, there are only two templates that make up DaveDaveFind: &lt;code&gt;home.tpl&lt;/code&gt;, which displays the search box, and &lt;code&gt;results.tpl&lt;/code&gt;, which displays results. Here&amp;#8217;s the basic structure of the homepage, annotated with the CSS class of each piece:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m2j9qsFIUU1qz7dqc.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;These are contained in a &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; structure that looks like this, following the Bootstrap documentation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;div class="container"&amp;gt;
    &amp;lt;div class="row"&amp;gt;
    Search form here!
    &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&amp;lt;div class="navbar navbar-fixed-bottom"&amp;gt;
Navbar contents!
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Getting the structure inside the navbar to work took some experimentation, but here it is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;div class="navbar-inner"&amp;gt;
    &amp;lt;div class="container"&amp;gt;
        &amp;lt;ul class="nav pull-left" data-no-collapse="true"&amp;gt;
            &amp;lt;li class="dropdown" data-no-collapse="true"&amp;gt;
                &amp;lt;a href="#" class="dropdown-toggle" data-toggle="dropdown"&amp;gt;
                About
                &amp;lt;b class="caret"&amp;gt;&amp;lt;/b&amp;gt;
                &amp;lt;/a&amp;gt;
                &amp;lt;div class="dropdown-menu infobox" data-no-collapse="true"&amp;gt;
                About popup text.
                &amp;lt;/div&amp;gt;
            &amp;lt;/li&amp;gt;
            &amp;lt;li class="dropdown" data-no-collapse="true"&amp;gt;
                &amp;lt;a href="#" class="dropdown-toggle" data-toggle="dropdown"&amp;gt;
                Help
                &amp;lt;b class="caret"&amp;gt;&amp;lt;/b&amp;gt;
                &amp;lt;/a&amp;gt;
                &amp;lt;div class="dropdown-menu infobox" data-no-collapse="true"&amp;gt;
                Help popup text
                &amp;lt;/div&amp;gt;
            &amp;lt;/li&amp;gt;
        &amp;lt;/ul&amp;gt;
        &amp;lt;div class="pull-right"&amp;gt;
        Twitter and Google Plus buttons.
        &amp;lt;/div&amp;gt;
    &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The popup boxes in the navbar are designed to be menu items, but with some careful tinkering with their styles, I got them to work as information boxes. Here are the relevant classes from the &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/blob/master/styles/home.css"&gt;custom style sheet&lt;/a&gt;, which overrides the Bootstrap styles:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.dropdown-menu a {
    display: inline;
    padding: inherit;
    clear: none;
    font-weight: normal;
    line-height: inherit;
    color: #F7A900;
} 

.dropdown-menu p { 
    margin-left: 5px;
    color: #686868;
}

.dropdown-menu { 
    min-width: 240px;
    background-color: #f5f5f5;
    padding: 5px;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I also followed &lt;a href="http://taylor.fausak.me/2012/03/15/dropdown-menu-in-twitter-bootstraps-collapsed-navbar/"&gt;this tutorial&lt;/a&gt; to keep the boxes from popping up when the window is very small. The rest of the changes to the custom style sheet are mostly cosmetic: colors, gradients, and so forth. You can see the details &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/blob/master/styles/home.css"&gt;on GitHub&lt;/a&gt;. The only other ingredient on this page is a little JavaScript to handle the popup. Here&amp;#8217;s the code from the HTML header:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;script src="/styles/jquery-1.7.2.min.js" type="text/javascript"&amp;gt;&amp;lt;/script&amp;gt;
&amp;lt;script src="/styles/bootstrap-dropdown.js" type="text/javascript"&amp;gt;&amp;lt;/script&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Bootstrap&amp;#8217;s &lt;a href="http://twitter.github.com/bootstrap/javascript.html"&gt;interactive features&lt;/a&gt; will work automatically if the right scripts are included in the page header. They all use jQuery (the first script). Since this page only uses the &lt;a href="http://twitter.github.com/bootstrap/javascript.html#dropdowns"&gt;dropdown plugin&lt;/a&gt;, I downloaded it separately from the Bootstrap site.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s the results template:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m2j9o3AMbg1qz7dqc.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;Its structure is similar to &lt;code&gt;home.tpl&lt;/code&gt;, but the navbar is at the top and there are three rows of content: the search terms, the Python term results, and the video/webpage results.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;div class="navbar navbar-fixed-top"&amp;gt;
Navbar
&amp;lt;/div&amp;gt;
&amp;lt;div class="container"&amp;gt;
    &amp;lt;div class="row"&amp;gt;
        &amp;lt;div class="span6"&amp;gt;
        "You searched for..."
        &amp;lt;/div&amp;gt;
    &amp;lt;/div&amp;gt;
    &amp;lt;div class="row"&amp;gt;
        &amp;lt;div class="span6 well"&amp;gt;
        Python term box
        &amp;lt;/div&amp;gt;
    &amp;lt;/div&amp;gt;
    &amp;lt;div class="row"&amp;gt;
        &amp;lt;div class="span5" id="fixed"&amp;gt;
        Video results
        &amp;lt;/div&amp;gt;
        &amp;lt;div class="span7"&amp;gt;
        Webpage results
        &amp;lt;/div&amp;gt;
    &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here&amp;#8217;s the structure of the navbar, which now contains a search box:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;div class="navbar navbar-fixed-top"&amp;gt;
    &amp;lt;div class="navbar-inner"&amp;gt;
        &amp;lt;div class="container"&amp;gt;
        &amp;lt;span class="brand"&amp;gt;
        &amp;lt;h2&amp;gt;&amp;lt;a href="/"&amp;gt;DaveDave&amp;lt;strong class="orange"&amp;gt;Find&amp;lt;/strong&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/h2&amp;gt;&amp;lt;/span&amp;gt;
        &amp;lt;form class="navbar-form form-inline" action="/search" method="GET" &amp;gt;
            &amp;lt;input type="text" name="search_query" class="input-xlarge"&amp;gt;
            &amp;lt;button type="submit" class="btn btn-warning"&amp;gt;&amp;lt;i class="icon-search icon-white"&amp;gt;&amp;lt;/i&amp;gt;&amp;lt;/button&amp;gt;
        &amp;lt;/form&amp;gt;
        &amp;lt;/div&amp;gt;
    &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The custom styles for this page are mostly changes to the default colors, but there is one important structural change. The &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; containing videos has the attribute &lt;code&gt;id="fixed"&lt;/code&gt;. Here&amp;#8217;s the associated CSS:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#fixed { width: 470px; }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This ensures that the video column will always be wide enough to hold the embedded video, even if the window is squished to a strange size.&lt;/p&gt;

&lt;p&gt;And that&amp;#8217;s it! Using mostly standard Bootstrap classes with a few color changes is an easy way to build a simple, attractive page.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/21158998451</link><guid>http://davedavefind.tumblr.com/post/21158998451</guid><pubDate>Sun, 15 Apr 2012 14:19:00 -0400</pubDate></item><item><title>Better Lookup</title><description>&lt;p&gt;DaveDaveFind is now returning pretty good results, but there are a few problems I&amp;#8217;ve put off that I still need to fix:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Searches including stopwords return no results, because the stopwords aren&amp;#8217;t in the index. The search code should ignore stopwords, too.&lt;/li&gt;
&lt;li&gt;Since multiword lookup still isn&amp;#8217;t perfect, it would be nice if the line above each search that says &amp;#8220;You searched for:&amp;#8221; included separate links for each word in the search query.&lt;/li&gt;
&lt;li&gt;Everything is ordered by Daverank/views. Search results that match the exact query but have low Daverank show up below less relevant pages. &lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Here&amp;#8217;s the code I added to deal with stopwords. First, I added them in a list at the very top of &lt;code&gt;main.py&lt;/code&gt;. App Engine won&amp;#8217;t allow file reading and writing, so I can&amp;#8217;t get them from an external file. I considered storing them in the database, but that means writing a new model and performing an extra database query with each search. Hardcoding them might not be the best solution, but it works. This block (below the Python term stuff), breaks the query into a list of words and gets rid of stopwords:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;query_words = query.split()
    for word in query_words:
        if word in stopwords:
            query_words.remove(word)
&lt;/code&gt;&lt;/pre&gt;
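
&lt;p&gt;One thing to watch when removing stopwords: taking items out of a list while looping over that same list can skip the word right after each removal. Here is a minimal sketch of the pitfall next to a list comprehension that avoids it. The stopword list and both function names are made up for illustration and are not taken from &lt;code&gt;main.py&lt;/code&gt;.&lt;/p&gt;

```python
# Hypothetical stopword list, for illustration only.
stopwords = ['the', 'a', 'of']


def remove_in_place(words):
    # Removes items from the list being looped over. After each removal
    # the remaining items shift left, so the next word is never checked.
    for word in words:
        if word in stopwords:
            words.remove(word)
    return words


def filter_stopwords(words):
    # Builds a new list instead, so every word is checked exactly once.
    return [word for word in words if word not in stopwords]


query = 'the a history of python'
print(remove_in_place(query.split()))
print(filter_stopwords(query.split()))
```

&lt;p&gt;With this sample query, the first version leaves the stopword "a" behind, while the second removes every stopword.&lt;/p&gt;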

&lt;p&gt;These few extra lines were all it took to solve the stopword problem. Improving the results links was pretty easy, too. Before passing all the information to the template, I split the query string into a list called &lt;code&gt;query_string_words&lt;/code&gt; and passed this in place of &lt;code&gt;query_words&lt;/code&gt;. We can&amp;#8217;t pass in &lt;code&gt;query_words&lt;/code&gt; because the stopwords are gone and the string won&amp;#8217;t make sense when it&amp;#8217;s printed for the user. And we can&amp;#8217;t just give &lt;code&gt;query_words&lt;/code&gt; a second name before the stopwords are removed, because both names would refer to the same list and the changes made to &lt;code&gt;query_words&lt;/code&gt; would also affect it (remember &lt;a href="http://davedavefind.appspot.com/search?search_query=agent"&gt;Secret Agent Man?&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s how the results template handles the new information (remember, it&amp;#8217;s now a list instead of a string):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;h2&amp;gt;You searched for: &amp;lt;strong class="orange"&amp;gt;
    %for word in query_string_words:
    &amp;lt;a href="/search?search_query={{ word }}"&amp;gt;{{ word }}&amp;lt;/a&amp;gt;
    %end
&lt;/strong&gt;&lt;/h2&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Pretty simple: it prints each word as a link, with a URL that points to a DaveDaveFind search for that word. This means that stopwords will be links, but this isn&amp;#8217;t a big problem—they just won&amp;#8217;t return any results.&lt;/p&gt;

&lt;p&gt;Last (but definitely not least) is fixing the order of results. I solved this by adding an extra entry to the dictionary associated with each page: &lt;code&gt;True&lt;/code&gt; if it contained an exact match, and &lt;code&gt;False&lt;/code&gt; otherwise. Here&amp;#8217;s the block that iterates through each result and stores its information:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;for page in page_results:
        page_info = {}
        query_index = page.text.find(query)
        if query_index != -1:
            # Clamp at 0 so a match near the start doesn't give a negative index.
            i = page.text.find(' ', max(query_index - 25, 0))
            excerpt_words = page.text[i:].split(' ')
            page_info['exact_match'] = True 
        else:
            excerpt_words = page.text.split(' ')
            page_info['exact_match'] = False
        excerpt = ' '.join(excerpt_words[:50])

        page_info['text'] = excerpt
        page_info['title'] = page.title
        page_info['url'] = page.url
        page_info['daverank'] = page.dave_rank
        page_info['doc'] = page.doc
        page_dicts.append(page_info)
page_dicts.sort(key=itemgetter('exact_match'), reverse=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The last line is the most important. Before the pages are passed to the template, it sorts the list of dictionaries by the value of the key &lt;code&gt;'exact_match'&lt;/code&gt;. The &lt;code&gt;itemgetter&lt;/code&gt; function from the &lt;code&gt;operator&lt;/code&gt; module is responsible for this wizardry (make sure you add &lt;code&gt;from operator import itemgetter&lt;/code&gt;!). After trying this out, I added &lt;code&gt;reverse=True&lt;/code&gt;, which reverses the order, so that pages with an exact match are put at the beginning of the list. One cool thing about this method is that Python&amp;#8217;s sort is stable, so it doesn&amp;#8217;t change the relative order of elements with the same &lt;code&gt;'exact_match'&lt;/code&gt; value. Pages without an exact match stay at the end of the list, ordered by Daverank, and those with exact matches get pulled to the beginning. The template doesn&amp;#8217;t use the &lt;code&gt;'exact_match'&lt;/code&gt; key, and the structure of the list hasn&amp;#8217;t changed, so this is all we need to fix the result rankings.&lt;/p&gt;
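
&lt;p&gt;To see the stable sort in action, here is a small sketch with four made-up results (the titles and Daverank values are invented for illustration), already ordered by Daverank:&lt;/p&gt;

```python
from operator import itemgetter

# Hypothetical results, already ordered by Daverank (highest first).
page_dicts = [
    {'title': 'A', 'daverank': 0.9, 'exact_match': False},
    {'title': 'B', 'daverank': 0.7, 'exact_match': True},
    {'title': 'C', 'daverank': 0.5, 'exact_match': False},
    {'title': 'D', 'daverank': 0.3, 'exact_match': True},
]

# True sorts after False, so reverse=True puts exact matches first.
# The sort is stable, so ties keep their original (Daverank) order.
page_dicts.sort(key=itemgetter('exact_match'), reverse=True)

titles = [page['title'] for page in page_dicts]
print(titles)
```

&lt;p&gt;The exact matches B and D move to the front, and within each group the Daverank order is unchanged.&lt;/p&gt;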

&lt;p&gt;It feels more like a whimper than a bang, but I think I&amp;#8217;m done with the code for DaveDaveFind! The rest of the changes I need to make involve cleaning up the HTML and styles and performing one last big crawl to provide a detailed search index. I might make minor changes to the code, but all the essentials are done. If I change anything major, I&amp;#8217;ll make sure to note it in a post here. To see the latest version, &lt;a href="https://github.com/ecmendenhall/DaveDaveFind"&gt;click here&lt;/a&gt;.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/21147983380</link><guid>http://davedavefind.tumblr.com/post/21147983380</guid><pubDate>Sun, 15 Apr 2012 11:04:10 -0400</pubDate></item><item><title>Return of documentation search!</title><description>&lt;p&gt;Now that video and page searching is working pretty well, I&amp;#8217;m going to return to one of the earliest problems I tried to solve: looking up information about Python-related terms. In earlier posts, I built a &lt;a href="http://davedavefind.tumblr.com/post/20538755037/proof-of-concept"&gt;documentation parser&lt;/a&gt; to save terms and definitions, a &lt;a href="http://davedavefind.tumblr.com/post/20555458924/making-models"&gt;data model&lt;/a&gt; to store Python terms, and a few lines to find them in the search lookup code. After thinking about this problem for a while, I think I&amp;#8217;ve found a better way to retrieve Python terms, without using a parser, storing them in the database, or checking each query. This will be cheaper and faster than the original idea.&lt;/p&gt;

&lt;p&gt;An update to the contest page on the forums mentioned the &lt;a href="http://duckduckhack.com/"&gt;DuckDuckHack API&lt;/a&gt;. I checked it out, and found that DuckDuckGo also provides a &lt;a href="http://duckduckgo.com/api.html"&gt;&amp;#8220;zero-click info&amp;#8221;&lt;/a&gt; API that includes information from Python documentation. This should be much easier to use than storing and parsing terms in the DaveDaveFind database.&lt;/p&gt;

&lt;p&gt;As it turns out, the API is very easy to use. Simply adding &lt;code&gt;'&amp;amp;format=json'&lt;/code&gt; to the end of a normal search URL returns zero-click info results in the JSON format, which was designed for JavaScript but looks a lot like a Python dictionary. So &lt;a href="http://duckduckgo.com/?q=python+str"&gt;a search&lt;/a&gt; for &amp;#8216;python str&amp;#8217; like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&lt;a href="http://api.duckduckgo.com/?q=python+str"&gt;http://api.duckduckgo.com/?q=python+str&lt;/a&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;becomes this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&lt;a href="http://api.duckduckgo.com/?q=python+str&amp;amp;format=json"&gt;http://api.duckduckgo.com/?q=python+str&amp;amp;format=json&lt;/a&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and returns an object that &lt;a href="http://duckduckgo.com/?q=python+str&amp;amp;format=json"&gt;looks like this&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{"Definition":"","DefinitionSource":"","Heading":"str (Python)","AbstractSource":"Python Documentation"…}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There&amp;#8217;s a &lt;a href="http://docs.python.org/library/json.html"&gt;basic module&lt;/a&gt; in the standard library for reading JSON data into Python, so getting this information into the search engine should be pretty easy. Here&amp;#8217;s the code I came up with, in the &lt;code&gt;process_search()&lt;/code&gt; procedure of &lt;code&gt;main.py&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Check if the search query starts with 'python'.
if query.find('python') == 0:
    pyquery = query[7:]
else:
    pyquery = query

# Save the DuckDuckGo API root URL. 
ddgurl_root = 'http://duckduckgo.com/?q=python+'
# Encode the query as a URL and generate an API URL.
ddgurl_suffix = urllib.quote(pyquery) + '&amp;amp;format=json'

# Get the JSON response from DuckDuckGo
response = urllib.urlopen(ddgurl_root + ddgurl_suffix)
response_json = response.read()

# Parse the JSON and convert it to a Python dictionary.
pythonterm = json.loads(response_json)
&lt;/code&gt;&lt;/pre&gt;
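
&lt;p&gt;The parsing step can be tried without a network connection. This sketch feeds &lt;code&gt;json.loads()&lt;/code&gt; a hard-coded string based on the truncated sample response above, so a real response has more fields than shown here. Using &lt;code&gt;.get()&lt;/code&gt; with a default is one way to guard against a field that is missing:&lt;/p&gt;

```python
import json

# Hard-coded sample based on the truncated response shown earlier;
# a real API response contains additional fields.
response_json = (
    '{"Definition":"","DefinitionSource":"",'
    '"Heading":"str (Python)","AbstractSource":"Python Documentation"}'
)

# json.loads() turns the JSON text into an ordinary Python dictionary.
pythonterm = json.loads(response_json)

print(pythonterm['Heading'])
print(pythonterm['AbstractSource'])
# .get() returns the default instead of raising KeyError for a missing key.
print(repr(pythonterm.get('AbstractText', '')))
```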

&lt;p&gt;Now that we have the response, we need to make sense of it. I looked at the JSON from a few different search terms, and many of them included &lt;code&gt;&amp;lt;code&amp;gt;&lt;/code&gt; tags to help format the results. Unfortunately, Bottle&amp;#8217;s templates escape HTML by default, so tags passed straight to the template would show up as plain text. (This is actually a good policy, since accepting unencoded HTML can open a lot of security holes.) Instead, we can parse these with BeautifulSoup to get the &lt;code&gt;&amp;lt;code&amp;gt;&lt;/code&gt; blocks and then reformat them in the template. Here&amp;#8217;s how:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# If there's a response...
if pythonterm:
    pyterm_info = {}
    # If the response is from the Python Documentation...
    if pythonterm['AbstractSource'] == 'Python Documentation':
        # Get its description and try to find a &amp;lt;code&amp;gt; block.
        pyterm = BeautifulSoup(pythonterm['AbstractText'])
        try:
            pyterm_code = pyterm.find('code').string
            pyterm.pre.decompose()
            pyterm_info['code'] = pyterm_code
        except AttributeError:
            pyterm_info['code'] = None
        pyterm_desc = pyterm.get_text()
        pyterm_info['desc'] = pyterm_desc
        pyterm_info['url'] = pythonterm['AbstractURL']
        # We found something, so set results to True
        results = True
else: 
    pyterm_info = None
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That&amp;#8217;s it! As long as we pass the &lt;code&gt;pyterm_info&lt;/code&gt; dictionary to the template in the &lt;code&gt;return&lt;/code&gt; line at the bottom of this procedure, the template will have all the Python information it needs. Here are the blocks in &lt;code&gt;results.tpl&lt;/code&gt; that display it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;%if pyterm_info:
    &amp;lt;div class="row"&amp;gt;
    &amp;lt;div class="span6 well"&amp;gt;
        %if pyterm_info['code']:
        &amp;lt;p&amp;gt;&amp;lt;a href="{{ pyterm_info['url'] }}"&amp;gt;&amp;lt;code&amp;gt;{{ pyterm_info['code'] }}&amp;lt;/code&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;
        %end
    &amp;lt;blockquote&amp;gt;{{ pyterm_info['desc'] }}&amp;lt;/blockquote&amp;gt;
    &amp;lt;p class="source"&amp;gt;Read more: &amp;lt;a href="{{ pyterm_info['url'] }}"&amp;gt;&amp;lt;img class="icon" src="/styles/py.png" height="15" width="15"&amp;gt;  Python documentation&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;
    &amp;lt;ul class="nav nav-list"&amp;gt;
    &amp;lt;li class="divider"&amp;gt;&amp;lt;/li&amp;gt;
    &amp;lt;li&amp;gt;&amp;lt;p class="pull-right"&amp;gt;Python search powered by &amp;lt;a href="http://duckduckgo.com/"&amp;gt;&amp;lt;img class="icon" src="/styles/ddg.png" height="15" width="15"&amp;gt; DuckDuckGo&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;&amp;lt;/li&amp;gt;
    &amp;lt;/ul&amp;gt;
    &amp;lt;/div&amp;gt;
    &amp;lt;/div&amp;gt;
    %end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the &lt;code&gt;pyterm_info&lt;/code&gt; dict isn&amp;#8217;t empty, the template prints the code (when there is any) inside a &lt;code&gt;&amp;lt;code&amp;gt;&lt;/code&gt; tag, with the description beneath it. The conditions of the DuckDuckGo API require attribution to the original source and DuckDuckGo, so there&amp;#8217;s a block at the bottom to give credit. I added a few styles from Bootstrap to make these results look nice, tested a few queries on the development server, and it looks like it&amp;#8217;s working well. Using the API saved a lot of time: it took less than an hour to get the last big feature working, and it was pretty painless after figuring out how to parse JSON objects. DaveDaveFind is almost done!&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/21146099413</link><guid>http://davedavefind.tumblr.com/post/21146099413</guid><pubDate>Sun, 15 Apr 2012 10:21:39 -0400</pubDate></item><item><title>Template tinkering</title><description>&lt;p&gt;The last step in adding videos and page information to DaveDaveFind is editing the results template to interpret the data it receives from the main search script. It&amp;#8217;s not as tough as you might think! Here&amp;#8217;s the HTML that handles all the search results:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;div class="row"&amp;gt;
    &amp;lt;div class="span6"&amp;gt;
        &amp;lt;h2&amp;gt;You searched for: &amp;lt;strong class="orange"&amp;gt;{{ search_query }}&amp;lt;/strong&amp;gt;&amp;lt;/h2&amp;gt;
        %if not results:
        &amp;lt;p&amp;gt;No results found for {{ search_query }}.&amp;lt;p&amp;gt;
        %end
        &amp;lt;/div&amp;gt;

&amp;lt;div class="row"&amp;gt;
    &amp;lt;div class="span5" id="fixed"&amp;gt;
        %if video_dicts:
            &amp;lt;div&amp;gt;
            &amp;lt;h2&amp;gt;Videos:&amp;lt;/h2&amp;gt;
            %for video in video_dicts:
            &amp;lt;div class="results well"&amp;gt;
            &amp;lt;strong&amp;gt;{{ video['title'] }}&amp;lt;/strong&amp;gt;
            &amp;lt;p&amp;gt;&amp;lt;a href="{{ video['url'] }}"&amp;gt;{{ video['url'][:70] }}&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt; 
            &amp;lt;iframe width="430" height="248" src="http://www.youtube.com/embed/{{ video['id'] }}?rel=0&amp;amp;start={{ video['start'] }}&amp;amp;wmode=transparent" frameborder="0" allowfullscreen&amp;gt;&amp;lt;/iframe&amp;gt;          
            &amp;lt;/div&amp;gt;
            %end
            &amp;lt;/div&amp;gt;
        &amp;lt;/div&amp;gt;
        &amp;lt;div class="span7"&amp;gt;
        %if page_dicts:
            &amp;lt;div&amp;gt;
            &amp;lt;h2&amp;gt;Webpages:&amp;lt;/h2&amp;gt;
            %for page in page_dicts:
            &amp;lt;div class="results well"&amp;gt;
            &amp;lt;strong&amp;gt;{{ page['title'] }}&amp;lt;/strong&amp;gt;
            &amp;lt;p&amp;gt;&amp;lt;a href="{{ page['url'] }}"&amp;gt;{{ page['url'][:70] }}&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;
                %if show_daverank:
                &amp;lt;p&amp;gt;DaveRank: {{ page['daverank'] }}&amp;lt;/p&amp;gt;
                %end
            &amp;lt;p&amp;gt;{{ page['text'] }}&amp;lt;/p&amp;gt;               
            &amp;lt;/div&amp;gt;
            %end
            &lt;/div&gt;
         %end
    &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Among all the HTML tags, there are three main &lt;code&gt;%if&lt;/code&gt; blocks. One handles the &amp;#8216;no results&amp;#8217; case, one handles videos, and one handles webpages (a fourth, smaller one shows the Daverank when it&amp;#8217;s requested). The video and webpage blocks iterate through each of the dictionaries in the list passed in from the search script, and put some information into the HTML tags. Notice this pretty neat feature:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;iframe width="430" height="248" src="http://www.youtube.com/embed/{{ video['id'] }}?rel=0&amp;amp;start={{ video['start'] }}&amp;amp;wmode=transparent" frameborder="0" allowfullscreen&amp;gt;&amp;lt;/iframe&amp;gt;
&amp;lt;p&amp;gt;&amp;lt;a href="{{ page['url'] }}"&amp;gt;{{ page['url'][:70] }}&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Stuff passed to the template in Python can be inserted directly into tags and URLs, which means the Python script can easily set video and page URLs. This means we can embed each YouTube video, with a link directly to the point in the video where a search term appears! I tested this out last night, and it&amp;#8217;s working pretty well, but I accidentally deleted the local datastore, so I don&amp;#8217;t have a screenshot for now. Don&amp;#8217;t worry, though–DaveDaveFind is coming along well, and soon I&amp;#8217;ll have a usable demo of the application.&lt;/p&gt;
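
&lt;p&gt;The same embed URL could also be assembled on the Python side. This is only a sketch: DaveDaveFind builds the URL in the template, the function name &lt;code&gt;embed_url&lt;/code&gt; and the video ID are hypothetical, and it assumes Python 3, where &lt;code&gt;urlencode&lt;/code&gt; lives in &lt;code&gt;urllib.parse&lt;/code&gt;:&lt;/p&gt;

```python
from urllib.parse import urlencode


def embed_url(video_id, start):
    # Same parameters as the iframe in the template: hide related videos,
    # start playback at the given second, and use transparent window mode.
    params = urlencode([('rel', 0), ('start', start), ('wmode', 'transparent')])
    return 'http://www.youtube.com/embed/' + video_id + '?' + params


# 'abc123' is a placeholder, not a real video ID.
url = embed_url('abc123', 95)
print(url)
```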

&lt;p&gt;In addition to these changes to the template, I made some style changes in the header, and messed with the HTML structure a little bit. This involved a lot of tinkering on my part–I&amp;#8217;m not a CSS whiz. To see these changes, take a look at &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/blob/master/templates/results.tpl"&gt;the code&lt;/a&gt; on GitHub.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20927976887</link><guid>http://davedavefind.tumblr.com/post/20927976887</guid><pubDate>Wed, 11 Apr 2012 20:55:00 -0400</pubDate></item><item><title>Putting it all together</title><description>&lt;p&gt;Now that the new models are in the database and the crawler is indexing videos, documents, and Udacity URLs, it&amp;#8217;s time to put it all together in the search engine script &lt;code&gt;main.py&lt;/code&gt;. Last time we visited, it was still &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/blob/f857e63435acc0175e6ce376856715fd6f26ff7e/main.py"&gt;pretty simple&lt;/a&gt;. Now, there&amp;#8217;s a lot of &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/blob/master/main.py"&gt;new code&lt;/a&gt;. I&amp;#8217;ll step through it bit by bit.&lt;/p&gt;

&lt;p&gt;First, remember how the &lt;code&gt;process_search()&lt;/code&gt; procedure that handles search lookups works:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def process_search():
    search_query = request.GET.get('search_query', '').strip()
    query = search_query.lower()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It gets the string &lt;code&gt;search_query&lt;/code&gt; from the search form, and converts it to lowercase. This could be a multiword string, one word, or an empty string, and it&amp;#8217;s important to make sure the procedure can handle all these cases.&lt;/p&gt;

&lt;p&gt;Next, I set a couple variables. If nothing intervenes, they will stay &lt;code&gt;False&lt;/code&gt; and get passed on to the template:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;show_daverank = False
results = False
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The next block is a cool feature inspired by &lt;a href="https://duckduckgo.com/bang.html"&gt;!bang syntax&lt;/a&gt; on DuckDuckGo (I had no idea Gabriel Weinberg would be a contest judge when I started this project, by the way. I hope he doesn&amp;#8217;t mind that this project rips off a few features from DDG!). By checking search queries for certain keywords prepended with two dashes, like &lt;code&gt;--cs101&lt;/code&gt;, DaveDaveFind will return results from other sites. This code catches the strings &lt;code&gt;--cs101&lt;/code&gt;, &lt;code&gt;--cs373&lt;/code&gt;, and &lt;code&gt;--python&lt;/code&gt;, and redirects to search the course forums or Python documentation. It also looks for the argument &lt;code&gt;--daverank&lt;/code&gt;, which prints a page&amp;#8217;s Daverank underneath its URL. It&amp;#8217;s pretty easy with the &lt;code&gt;str.find()&lt;/code&gt; method and a little slicing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;if query.find('--') == 0:
    if query.find('--cs101') == 0:
        redirect_url = 'http://www.udacity-forums.com/cs101/search/?q=' + urllib.quote(query[8:])
        return redirect(redirect_url)   
    if query.find('--cs373') == 0:
        redirect_url = 'http://www.udacity-forums.com/cs373/search/?q=' + urllib.quote(query[8:])
        return redirect(redirect_url)   
    if query.find('--python') == 0:
        redirect_url = 'http://docs.python.org/search.html?q=' + urllib.quote(query[9:])
        return redirect(redirect_url)
    if query.find('--daverank') == 0:
        query = query[11:]
        search_query = query
        show_daverank = True
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The function &lt;code&gt;urllib.quote()&lt;/code&gt; converts the rest of the search string into a format suitable for sending in a URL. For example, the search:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;--cs101 url encoding
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Becomes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&lt;a href="http://www.udacity-forums.com/cs101/search/?q=url%20encoding"&gt;http://www.udacity-forums.com/cs101/search/?q=url%20encoding&lt;/a&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;#8217;s important to encode search strings so that spaces and special characters don&amp;#8217;t break the URL when the browser requests it.&lt;/p&gt;
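
&lt;p&gt;Here is a minimal sketch of the same idea as a standalone helper. The names &lt;code&gt;SHORTCUTS&lt;/code&gt; and &lt;code&gt;build_redirect_url()&lt;/code&gt; are hypothetical and not part of &lt;code&gt;main.py&lt;/code&gt;, and unlike the code above, this version only matches a keyword that is followed by a space. The import falls back to Python 3&amp;#8217;s location for &lt;code&gt;quote()&lt;/code&gt; so the snippet runs on either version:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;try:
    from urllib import quote        # Python 2, as used in main.py
except ImportError:
    from urllib.parse import quote  # Python 3

# Hypothetical lookup table: shortcut keyword and the search URL it maps to.
SHORTCUTS = {
    '--cs101': 'http://www.udacity-forums.com/cs101/search/?q=',
    '--cs373': 'http://www.udacity-forums.com/cs373/search/?q=',
    '--python': 'http://docs.python.org/search.html?q=',
}

def build_redirect_url(query):
    """Return a redirect URL for a shortcut query, or None if there is no shortcut."""
    for keyword, prefix in SHORTCUTS.items():
        if query.startswith(keyword + ' '):
            # Everything after the keyword and the space is the search text.
            rest = query[len(keyword) + 1:]
            return prefix + quote(rest)
    return None

print(build_redirect_url('--cs101 url encoding'))
# prints http://www.udacity-forums.com/cs101/search/?q=url%20encoding
&lt;/code&gt;&lt;/pre&gt;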

&lt;p&gt;Next up, I&amp;#8217;ve edited the code to handle multi-word queries. The procedure splits the search query into a list of words and iterates through it, fetching the URLs associated with each term from the database (each term&amp;#8217;s URLs are now stored as a Python list):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;query_words = query.split()
query_urls = []
for term in query_words:
    # Get all SearchTerm objects that match the search_query.
    q = SearchTerm.all().filter('term =', term).get()
    if q:
        query_urls.append(set(q.urls))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, a big &lt;code&gt;if&lt;/code&gt; block to handle the results. The next few code snippets are all part of this block. (In fact, it&amp;#8217;s so long that it&amp;#8217;s probably time to move it into its own procedure).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;if query_urls:
    query_url_set = set.intersection(*query_urls)
    query_url_list = list(query_url_set)    

    results = True
    if len(query_url_list) &amp;gt; 30:
        query_url_list = query_url_list[0:30]

    page_results = Page.all().filter('url IN', query_url_list).order('-dave_rank').fetch(5)
    page_dicts = []
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;First, we want to make sure that DaveDaveFind returns pages with &lt;strong&gt;all&lt;/strong&gt; the search terms in a multiword query. To do so, I used a Python &lt;code&gt;set&lt;/code&gt;, which is a lot like a list, but contains only unique elements. The previous &lt;code&gt;for&lt;/code&gt; block stored each list of URLs as a set, and the first two lines of this block compute the intersection of all the sets. The asterisk inside the method call is a new Python trick I learned: it &amp;#8220;unpacks&amp;#8221; each element &lt;a href="http://docs.python.org/tutorial/controlflow.html#tut-unpacking-arguments"&gt;from a list&lt;/a&gt; or similar type. I couldn&amp;#8217;t get this method to accept a list, but it worked with the asterisk (I&amp;#8217;m not totally sure why).&lt;/p&gt;
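
&lt;p&gt;A likely explanation for the asterisk, sketched with made-up URLs: &lt;code&gt;set.intersection()&lt;/code&gt; expects each set as a separate argument, and the first argument has to be a set itself. Unpacking the list with the asterisk hands over the sets one by one, while passing the list directly raises a &lt;code&gt;TypeError&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical URL sets for a two-word query.
query_urls = [
    set(['a.html', 'b.html', 'c.html']),  # pages containing the first word
    set(['b.html', 'c.html', 'd.html']),  # pages containing the second word
]

# The asterisk unpacks the list, so this call is the same as
# set.intersection(query_urls[0], query_urls[1]).
common = set.intersection(*query_urls)
print(sorted(common))  # prints ['b.html', 'c.html']

# Passing the list itself fails, because the first argument must be a set.
try:
    set.intersection(query_urls)
    list_accepted = True
except TypeError:
    list_accepted = False
print(list_accepted)  # prints False
&lt;/code&gt;&lt;/pre&gt;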

&lt;p&gt;If the database query returns results, we toggle &lt;code&gt;results&lt;/code&gt; to True, which will get passed to the template later.&lt;/p&gt;

&lt;p&gt;The next block is a kludgy hack that I hope to fix later. The new data models are much more efficient since they don&amp;#8217;t perform a bunch of one-to-many lookups, but I ran into a new error: the &lt;code&gt;filter()&lt;/code&gt; method can only handle 30 items &lt;code&gt;'IN'&lt;/code&gt; a particular query. In other words, it can only look up 30 URLs at a time. Most of the time, it&amp;#8217;s okay, but the function threw an error for popular terms like &amp;#8216;Python.&amp;#8217; For now, when a term has more than 30 associated URLs, the code simply limits the URLs returned to the first 30. This is okay, but it doesn&amp;#8217;t necessarily return the best URLs. I&amp;#8217;m thinking about how to fix this.&lt;/p&gt;
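
&lt;p&gt;One possible way around the limit, sketched here in plain Python and not tested against the datastore, is to split the URL list into batches of 30 instead of throwing away everything after the first 30. The helper name &lt;code&gt;chunk()&lt;/code&gt; and the example URLs are made up for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def chunk(items, size=30):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical list of 70 URLs for a popular search term.
urls = ['http://example.com/page%d' % n for n in range(70)]
batches = chunk(urls)
print([len(batch) for batch in batches])  # prints [30, 30, 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each batch could then go into its own &lt;code&gt;'url IN'&lt;/code&gt; filter, with the combined results sorted by Daverank in Python. That means more queries per search, so it trades speed for better results.&lt;/p&gt;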

&lt;p&gt;Next, we get some information about each page in the results, and put it in a dictionary, &lt;code&gt;page_info&lt;/code&gt;, that will be passed to the template:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;    for page in page_results:
        page_info = {}
        query_index = page.text.find(query)
        if query_index != -1:
            # Clamp at zero: a negative start index would slice from the end of the text.
            i = max(query_index - 50, 0)
            j = query_index + 450
        else:
            i = 0
            j = 500
        text_string = page.text[i:j]
        page_info['text'] = text_string
        page_info['title'] = page.title
        page_info['url'] = page.url
        page_info['daverank'] = page.dave_rank
        page_dicts.append(page_info)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;i&lt;/code&gt; and &lt;code&gt;j&lt;/code&gt; indexing gets a snippet from the full text of the page to display with the search results. This is another bit of code to improve in the future. It&amp;#8217;s pretty dumb right now, and usually cuts off words and sentences. Better code would try to find full words, and wouldn&amp;#8217;t use hard-coded index values.&lt;/p&gt;
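
&lt;p&gt;Here is a rough sketch of what a friendlier version might look like. The function &lt;code&gt;make_snippet()&lt;/code&gt; and its parameters are hypothetical, not code from the project: it takes the window sizes as arguments instead of hard-coding them, and drops a first or last word only when the window cuts through the middle of it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def make_snippet(text, query, before=50, after=450):
    """Return an excerpt of text around query, trimmed to whole words."""
    index = text.lower().find(query.lower())
    if index == -1:
        index = 0
    start = max(index - before, 0)
    end = min(index + after, len(text))
    words = text[start:end].split()
    # A word is cut off when the characters on both sides of the edge are not spaces.
    cut_at_start = (start != 0 and not text[start - 1].isspace()
                    and not text[start].isspace())
    cut_at_end = (end != len(text) and not text[end - 1].isspace()
                  and not text[end].isspace())
    if cut_at_start:
        words = words[1:]
    if cut_at_end:
        words = words[:-1]
    return ' '.join(words)

sentence = 'The quick brown fox jumps over the lazy dog'
print(make_snippet(sentence, 'fox', before=6, after=12))  # prints brown fox jumps
&lt;/code&gt;&lt;/pre&gt;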

&lt;p&gt;Next, we do the same for video objects. This is a pretty gnarly block of nested &lt;code&gt;if&lt;/code&gt; statements, and I&amp;#8217;m thinking about how to clean it up. But here it is, in all its gory detail:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Get the top 3 videos for the search term.
video_results = Video.all().filter('url IN', query_url_list).order('-views').fetch(3)
    video_dicts = []
    #Iterate through each video and store its information in a dictionary.
    for video in video_results:
        video_info = {}
        #Get a video's subtitles and find the search query.
        subtitles = video.text.lower()
        query_index = subtitles.find(query)
        time_string = ''
        #If the full search query is in the video, find it...
        if query_index != -1:
            #...by splitting the subtitles into a list of lines...
            subtitle_list = subtitles.splitlines()
            #...and iterating over them to find the query.
            for phrase in subtitle_list:
                if phrase.find(query) != -1:
                    #Get the timestamp associated with the search term.
                    timestamp_index = subtitle_list.index(phrase) - 1
                    timestamp = subtitle_list[timestamp_index]
                    if len(timestamp) &amp;gt; 1:
                        #Save its minutes and seconds information
                        minutes = timestamp[3:5]
                        seconds = timestamp[6:8]
                        #Add it to a string
                        time_string = '#t=' + minutes + 'm' + seconds + 's'
                        start = 60 * int(minutes) + int(seconds)

        if time_string:
            url = video.url + time_string
        else:
            url = video.url
            start = 0           
        video_info['title'] = video.title
        video_info['url'] = url
        video_info['subtitle'] = video.text[-20:query_index:20]
        video_info['id'] = video.id
        video_info['start'] = start
        video_dicts.append(video_info)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is basically the same as the last snippet, except for the timestamp stuff. I considered importing the &lt;a href="http://pypi.python.org/pypi/pysrt"&gt;pysrt&lt;/a&gt; library to handle subtitles, but decided to use string slicing instead. Since the indexed subtitles contain timestamps, DaveDaveFind can easily do something pretty cool: if it finds the full search query in a video transcript, it will &amp;#8220;deep link&amp;#8221; directly to the phrase inside the video. This feels pretty magical in practice, but it&amp;#8217;s actually just parsing the timestamp information and adding it to the YouTube URLs that it generates.&lt;/p&gt;
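
&lt;p&gt;The timestamp handling can be tried on its own. This sketch assumes the timestamp line begins with hours, minutes, and seconds in the form &lt;code&gt;00:01:23,000&lt;/code&gt;, as in the &lt;code&gt;.srt&lt;/code&gt; files. The function name is made up, and it differs a little from the code above: it splits on colons instead of slicing fixed positions, folds any hours into the minutes, and drops leading zeros:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def youtube_time_fragment(timestamp):
    """Turn an SRT-style timestamp into a URL fragment and a start time in seconds."""
    # Assumes the line starts with 'HH:MM:SS', e.g. '00:01:23,000'.
    hours, minutes, seconds = timestamp[:8].split(':')
    total_minutes = 60 * int(hours) + int(minutes)
    fragment = '#t=%dm%ds' % (total_minutes, int(seconds))
    start = 60 * total_minutes + int(seconds)
    return fragment, start

print(youtube_time_fragment('00:01:23,000'))  # prints ('#t=1m23s', 83)
&lt;/code&gt;&lt;/pre&gt;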

&lt;p&gt;Finally, if there&amp;#8217;s a detailed time string, it&amp;#8217;s added to the URL. If not, the video is set to start at the beginning. Just like with the webpage lookup, we add some information to a dictionary that&amp;#8217;s passed on to the template in these last few lines:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;else:
    page_dicts = None
    video_dicts = None


return template('templates/results', search_query=search_query, page_dicts=page_dicts, video_dicts=video_dicts, show_daverank=show_daverank, results=results)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;process_search()&lt;/code&gt; procedure takes all this information and sends it to the template to render. If there are no page or video results, it sends &lt;code&gt;None&lt;/code&gt; to the template. Next up, we&amp;#8217;ll take a look at the changes to the template to handle all this new information. 
 &lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20927925036</link><guid>http://davedavefind.tumblr.com/post/20927925036</guid><pubDate>Wed, 11 Apr 2012 18:55:00 -0400</pubDate></item><item><title>The joy of automation  </title><description>&lt;p&gt;After fixing my data models, I checked back on &lt;a href="http://www.udacity-forums.com/cs101/questions/57350/would-the-udacity-staff-mind-posting-public-video-transcripts"&gt;my question&lt;/a&gt; in the Udacity forums. I found one helpful response, but it didn&amp;#8217;t get much traction with anyone else. So I went looking once more for an automated way to download YouTube captions. In the end, I found a mostly-automated solution that was a big improvement on visiting each page and finding its &lt;code&gt;timedtext&lt;/code&gt; file. (I won&amp;#8217;t share it here, since it seems that Udacity doesn&amp;#8217;t want the videos to be too easy to find). It took about four minutes to have a folder full of &lt;code&gt;.srt&lt;/code&gt; subtitle files for almost every video in the CS101 curriculum (a few of them weren&amp;#8217;t subtitled for one reason or another).&lt;/p&gt;

&lt;p&gt;Indexing videos (and their content, thanks to the subtitles) will make DaveDaveFind much more useful. Although it already includes a pretty good index of the Udacity site and forums, the links aren&amp;#8217;t always the most useful. In fact, I noticed that the highest-ranked site in the index is the &amp;#8220;Legal&amp;#8221; page, presumably because it&amp;#8217;s linked a lot from other sites in the Udaci-verse.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s also one of the only pages on the Udacity site that&amp;#8217;s mostly text, and thus legible to our web crawler. Since the main Udacity site is mostly made of embedded videos and coding exercises, it&amp;#8217;s (ironically) not very easy for our search crawler to read! Downloading subtitles makes the video information readable by the crawler.&lt;/p&gt;

&lt;p&gt;Getting my hands on the subtitles turned out to be pretty easy. But making them readable for the crawler was a little harder. I needed the unique YouTube ID for each video, but couldn&amp;#8217;t find a good automated way to get them (it looks like the Udacity site uses a lot of JavaScript). So, I played a few of my favorite podcasts and buckled down to get this information by hand.&lt;/p&gt;

&lt;p&gt;Four and a half hours later, I had learned an important lesson about automation: it really sucks to collect data by hand. But now I had each video&amp;#8217;s ID in a Python-readable &lt;code&gt;.csv&lt;/code&gt; file, which is the key to adding them to the index. Sometimes it&amp;#8217;s tough to avoid a little hard work, even with Python.&lt;/p&gt;

&lt;p&gt;After a short break, I stopped to think about other information I might need from each video. Since the video links aren&amp;#8217;t readable by the crawler, I needed some other way to determine which videos might be more useful than others. I decided to store each video&amp;#8217;s view count in the spreadsheet, too. I also had a small realization: Udacity has a good reason for making certain videos hard to find. Since the course will be offered again, it might not be a good idea to make homework solutions and quiz answers searchable (even if they are useful to current students). As a student who had already completed the course, I lost sight of this in my zeal to index everything. To make it easier to disable quiz and homework videos later, I made sure to code each video as a quiz, problem, solution, or lecture. If DaveDaveFind stays around after this contest, hopefully it will be easy to add and remove videos from the index according to course progress, or add some sort of integration with Udacity logins to prevent students from looking up answers before they&amp;#8217;re allowed.&lt;/p&gt;

&lt;p&gt;In the end, the information I decided to store for each video is a lot like the information for each page:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class Video(db.Model):
    """Models a video in the index."""
    url = db.StringProperty()
    title = db.StringProperty()
    filename = db.StringProperty()
    id = db.StringProperty()
    type = db.StringProperty()
    views = db.IntegerProperty()
    text = db.TextProperty()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Storing the full text of the transcript will help later with implementing multi-word search the easy way (a topic for a later post).&lt;/p&gt;

&lt;p&gt;I also downloaded the supplementary documents from the Udacity site: the glossary and each chapter&amp;#8217;s notes and Python reference. With all this information in place, it was just a matter of figuring out how to get it into the index. Here&amp;#8217;s the script I wrote to &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/blob/master/crawler/index_pdfs.py"&gt;index pdf files&lt;/a&gt;, and here&amp;#8217;s the one for &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/blob/master/crawler/add_videos.py"&gt;adding videos to the index&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can probably see how I reused some of the code that we wrote for the crawler. The principle behind each script is the same: break up each document or transcript into its individual words, get the page&amp;#8217;s URL and maybe some additional information, add it to the &lt;code&gt;index&lt;/code&gt; or &lt;code&gt;pagedata&lt;/code&gt; dict, and write out the final data to a &lt;code&gt;.csv&lt;/code&gt; file readable by Google App Engine.&lt;/p&gt;

&lt;p&gt;Hooking these procedures up to the webcrawler was as easy as importing the scripts and passing &lt;code&gt;index&lt;/code&gt; to each one in the &lt;code&gt;crawl_web()&lt;/code&gt; procedure:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from index_pdfs import index_pdfs
from add_videos import add_videos_to_index

# Beginning of crawl_web() and the tocrawl loop goes here...

index, pagedata = index_pdfs(index, pagedata)
index = add_videos_to_index('subtitle_index.csv', '/Users/connormendenhall/Python/DaveDaveFind/DaveDaveFind/data/video_info.csv', index)
index = undupe_index(index)
return index, graph, pagedata
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The new &lt;code&gt;undupe_index()&lt;/code&gt; procedure is pretty simple, too. It checks the finalized index for duplicate URLs and removes them, so they don&amp;#8217;t clutter up the database:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def undupe_index(index):
    for key in index.keys():
        index[key] = list(set(index[key]))
    print "[undupe_index()] Index un-duped"
    return index
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally, I added a new procedure to the crawler that does one new and very important thing: stores a webpage&amp;#8217;s full text in a dictionary, which the script later writes to the page info &lt;code&gt;.csv&lt;/code&gt; file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def get_page_data(page, url, dict):
    try:
        title = page.title.string
    except:
        title = url
    try:
        text = page.body.get_text()
    except:
        text = ''
    dict[url] = [title, text]   
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;#8217;s pretty simple, but this makes a huge difference for doing multi-word lookups (and it&amp;#8217;s a lot easier than keeping track of string indexes like the final exam question). I noticed as I posted this code that I used the built-in type &lt;code&gt;dict&lt;/code&gt; as a variable name. That&amp;#8217;s a bad idea, so I&amp;#8217;ll make sure to change it in my next update.&lt;/p&gt;

&lt;p&gt;These are the biggest changes to the crawler code, but I encourage you to check out all the files in the &lt;code&gt;/crawler/&lt;/code&gt; directory &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/tree/master/crawler"&gt;on GitHub&lt;/a&gt; to see them for yourself. Next up: the changes I&amp;#8217;ve made to the search engine script.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20926086015</link><guid>http://davedavefind.tumblr.com/post/20926086015</guid><pubDate>Wed, 11 Apr 2012 18:26:29 -0400</pubDate></item><item><title>Remodeling</title><description>&lt;p&gt;Remember the data models I wrote a few posts ago? They were just a few lines of code describing how DaveDaveFind would store crawler data in the database, but as it turns out, they had a huge impact on the way the application worked. Here&amp;#8217;s a reminder:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class PythonTerm(db.Model):
    """Models a term from the Python glossary."""
    term = db.StringProperty()
    definition = db.TextProperty()

class SearchTerm(db.Model):
    """Models a search term."""
    term = db.StringProperty()

class PageUrl(db.Model):
    """Models a URL and its Daverank from the index."""
    # A search term can be associated with many pages...
    page = db.ReferenceProperty(SearchTerm,
                                collection_name='pages')

    #...but each page has a URL and Daverank.
    url = db.StringProperty()
    dave_rank = db.FloatProperty()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;While I was using these models, the development server was extremely slow. I also tested the site out on the production server (this is a bad idea, but I was stuck), and noticed immediately that the database was doing a ton of write operations. In fact, I wasn&amp;#8217;t even able to upload the full index without exceeding the daily quota for a free user on App Engine. I started to suspect that something was up with my models, so I did a little bit of research on the ways that real search engines store their indexes.&lt;/p&gt;

&lt;p&gt;Taking a step back and doing some reading was a good idea. I found &lt;a href="http://sbyholm.hubpages.com/hub/Search-Engine-Database-Schema"&gt;this helpful blog post&lt;/a&gt;, and re-read some of the App Engine documentation on data models. There, I discovered that it&amp;#8217;s possible to store a &lt;a href="https://developers.google.com/appengine/docs/python/datastore/typesandpropertyclasses#ListProperty"&gt;Python list directly&lt;/a&gt; in the database. Pretty cool, since our original index mapped keywords to lists of URLs.&lt;/p&gt;

&lt;p&gt;I found a way to store the index without using a &lt;code&gt;ReferenceProperty&lt;/code&gt;, which was slowing down the application and making uploads interminably long. Here are the new models I came up with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class SearchTerm(db.Model):
    """Models a search term and its associated URLs."""
    term = db.StringProperty()
    urls = db.StringListProperty()

class Page(db.Model):
    """Models a Page and its Daverank from the index."""
    url = db.StringProperty()
    title = db.StringProperty()
    text = db.TextProperty()
    dave_rank = db.FloatProperty()
    doc = db.BooleanProperty()

class Video(db.Model):
    """Models a video in the index."""
    url = db.StringProperty()
    title = db.StringProperty()
    filename = db.StringProperty()
    id = db.StringProperty()
    type = db.StringProperty()
    views = db.IntegerProperty()
    text = db.TextProperty()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;SearchTerm&lt;/code&gt; model now stores each search term and a list of associated URLs, just like the original data structure in our crawler code. Instead of having the database look up each URL according to a &lt;code&gt;ReferenceProperty&lt;/code&gt;, the search engine code iterates through the list in Python instead, which is much faster. The &lt;code&gt;Page&lt;/code&gt; model now stores a lot more information, including page titles and their full text! But since the model doesn&amp;#8217;t use a &lt;code&gt;ReferenceProperty&lt;/code&gt;, it&amp;#8217;s actually faster to load onto the server than the earlier models. The &lt;code&gt;Video&lt;/code&gt; model is new (more on video indexing later), but it&amp;#8217;s more or less like the &lt;code&gt;Page&lt;/code&gt; model.&lt;/p&gt;

&lt;p&gt;Of course, I had to write new loader scripts, too. Fortunately, they worked more or less the same way as the earlier ones I&amp;#8217;d written. To take a look, you can check out &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/tree/master/loaders"&gt;this folder&lt;/a&gt; on GitHub.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20926051063</link><guid>http://davedavefind.tumblr.com/post/20926051063</guid><pubDate>Wed, 11 Apr 2012 18:25:56 -0400</pubDate></item><item><title>The good, the bad, and the ugly </title><description>&lt;p&gt;I haven&amp;#8217;t posted in a couple days, but I&amp;#8217;ve been hard at work making a bunch of improvements to DaveDaveFind. I&amp;#8217;ve spent a lot of time struggling to figure out Google App Engine, and haven&amp;#8217;t kept these posts completely concurrent with the code I&amp;#8217;ve written. I hate it when a readable tutorial stops without any notice, and I hope these posts will be useful for other students, so I&amp;#8217;ll try to cover most of the changes I&amp;#8217;ve made in the last few days, even if I might not go through every line of code. This will be a brief summary post, and I&amp;#8217;ll cover a few more interesting things on their own in greater detail. As always, you can see all the changes I&amp;#8217;ve made step by step in the &lt;a href="https://github.com/ecmendenhall/DaveDaveFind"&gt;repository on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;The good&lt;/h2&gt;

&lt;p&gt;DaveDaveFind can now search videos and documents, includes a few shortcuts inspired by &lt;a href="http://duckduckgo.com/bang.html"&gt;!bang syntax&lt;/a&gt;, supports multiword queries, and seems to be pretty fast and reliable. I&amp;#8217;ve checked off everything on the TODO list except for adding better Python term queries, and that shouldn&amp;#8217;t be too difficult. There are a lot of little improvements to make, and I can always improve the quality of the results, but I think the end is in sight. In fact, when I went to finish up my final exam, I found myself using DaveDaveFind to look up a couple videos—and it worked! I&amp;#8217;m pretty impressed.&lt;/p&gt;

&lt;h2&gt;The bad&lt;/h2&gt;

&lt;p&gt;I have struggled mightily with Google App Engine. Uploading data to test out on the development server sometimes took hours, even for comparatively small indexes (like, under 1 MB), and as a beginner, it&amp;#8217;s always difficult to tell if I&amp;#8217;m doing things the right way. I spent a lot of time working on eliminating duplicate entries in the index, but the database was still really slow. In the end, I decided to take another look at my data models, and do some research on the way real search engines store information. As it turns out, my data structures were bad, costly, and inefficient (remember all the one-to-many keys?), so I had to rewrite them. On the plus side, I discovered that it&amp;#8217;s possible to store Python lists in the App Engine datastore, which is pretty cool. On the other hand, it took me hours to figure out how to get a Python list from my hard drive onto the server.&lt;/p&gt;

&lt;h2&gt;The ugly&lt;/h2&gt;

&lt;p&gt;I&amp;#8217;ve made a lot of changes and fixed a lot of little problems, but I can already feel my code slipping from the simple, readable procedures we wrote in class to lots of nested &lt;code&gt;if&lt;/code&gt;s and &lt;code&gt;except&lt;/code&gt; blocks designed to catch little, idiosyncratic errors. Straying from the Udacity method of small, documented steps while I was frustrated with App Engine has probably contributed to this. Before I go too far, I should step back and see if I can make my code a little easier to read. But it feels very difficult to fight this tendency as I add more and more procedures and features.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20926022169</link><guid>http://davedavefind.tumblr.com/post/20926022169</guid><pubDate>Wed, 11 Apr 2012 18:25:28 -0400</pubDate></item><item><title>Learning to crawl (again)</title><description>&lt;p&gt;In this post, I&amp;#8217;ll try to fix some of the smaller problems that have been piling up across my search engine code. Here are the three I came up with at the end of my last post:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;The crawler should strip punctuation from terms it saves in the index.&lt;/li&gt;
&lt;li&gt;Search results should not be case-sensitive.&lt;/li&gt;
&lt;li&gt;The procedures in &lt;code&gt;main.py&lt;/code&gt; should handle terms that aren&amp;#8217;t in the database.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;I&amp;#8217;d also like to test out the engine with a slightly bigger index and take another look at identifying search terms that are also Python words.&lt;/p&gt;

&lt;p&gt;I&amp;#8217;ll start with the crawler. It splits words in the procedure &lt;code&gt;add_page_to_index()&lt;/code&gt;, which hasn&amp;#8217;t changed yet:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def add_page_to_index(index, url, content):
    try:
        text = content.get_text()
    except:
        return
    words = text.split()
    for word in words:

        add_to_index(index, word, url)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Adding these lines should strip out the punctuation and convert all strings to lowercase. Since I&amp;#8217;m only worried about punctuation at the start and end of each string, I&amp;#8217;ll only check the first and last character.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;punctuation = '!"#$%&amp;amp;\'()*+,-./:;&amp;lt;=&amp;gt;?@[\\]^_`{|}~'
    for word in words:
        if word[0] in punctuation:
            word = word[1:]
        if word[-1] in punctuation:
            word = word[:-1]
        word = word.lower()
        add_to_index(index, word, url)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To fix case-sensitivity, I&amp;#8217;ll also add a method call to the first line of &lt;code&gt;process_search()&lt;/code&gt; in &lt;code&gt;main.py&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;search_query = request.GET.get('search_query', '').strip().lower()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On second thought, this isn&amp;#8217;t a good idea, because I&amp;#8217;d like to save the search term with its original capitalization. Let&amp;#8217;s put the lowercase conversion on its own line:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;search_query = request.GET.get('search_query', '').strip()
lowercase_query = search_query.lower()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;While I&amp;#8217;m there, I should think about how to catch search terms that aren&amp;#8217;t in the database. When the database tries to look up a term that&amp;#8217;s not in the index, it returns &lt;code&gt;None&lt;/code&gt;, which causes an error. This took some experimentation, but here&amp;#8217;s how I fixed it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Get all SearchTerm objects that match the search_query.
q = SearchTerm.all().filter('term =', lowercase_query).get()    

# Now get the PageUrls that are associated with the term...
# ...if they exist!
if q:
    page_urls = q.pages
    # Sort them by dave_rank and return the top five.
    results = page_urls.order('-dave_rank').fetch(5)

# If not, pass None to the results.
else:
    results = None
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And here&amp;#8217;s how I modified the template for a search term that&amp;#8217;s not in the index:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;h3&amp;gt;You searched for: {{ search_query }}&amp;lt;/h3&amp;gt;
        %if results:
            %for page in results:
            &amp;lt;a href="{{ page.url }}"&amp;gt;{{ page.url }}&amp;lt;/a&amp;gt;
            %end
        %else:
            &amp;lt;p&amp;gt;No results found for {{ search_query }}.&amp;lt;/p&amp;gt;
        %end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One tricky thing about templates is that all Python code blocks must have an &lt;code&gt;%end&lt;/code&gt; statement, and not just at the end of all the code. Initially, I forgot the first &lt;code&gt;%end&lt;/code&gt; block, which returned an HTML page that cut off immediately after the &lt;code&gt;&amp;lt;h3&amp;gt;&lt;/code&gt; tags at the top. At first, it wasn&amp;#8217;t clear that this was the problem. Since the HTML degraded pretty well, it just looked like nothing was happening. After checking the details of &lt;a href="http://bottlepy.org/docs/stable/stpl.html"&gt;SimpleTemplate syntax&lt;/a&gt;, I figured out what was wrong.&lt;/p&gt;

&lt;p&gt;Now DaveDaveFind accepts any search term with any capitalization, and the crawler ignores punctuation at the beginning and end of words. I&amp;#8217;ll try to load a bigger index and see if anything goes wrong. Let&amp;#8217;s try crawling 25 pages starting at the main Udacity page, with a depth of 10. For now, I&amp;#8217;ve modified the bottom of the code like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cache = {}
max_pages = 25
max_depth = 10

def start_crawl():              
    index, graph = crawl_web('http://www.udacity.com/', max_pages, max_depth)
    ranks = compute_ranks(graph)
    write_search_terms('search_terms.csv', index)
    write_url_info('url_info.csv', index, ranks)

    print "INDEX: ", index
    print ""
    print "GRAPH: ", graph
    print ""
    print "RANKS: ", ranks

if __name__ == "__main__":
    start_crawl()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The lines at the bottom that run the crawler, compute ranks, and write the data to external files are now in a procedure of their own. The last two lines are a Python idiom that runs a procedure if the code is run from the command line, but not if it&amp;#8217;s imported into something else. Here&amp;#8217;s &lt;a href="http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#modules-scripts"&gt;how it works&lt;/a&gt;. This code will run the procedure &lt;code&gt;start_crawl()&lt;/code&gt; when I run it from the command line, but won&amp;#8217;t start crawling if I import it in the interactive terminal to test something out.&lt;/p&gt;

&lt;p&gt;You might notice that I haven&amp;#8217;t used the cache at all. It&amp;#8217;s probably a good idea to think about how I could incorporate it.&lt;/p&gt;

&lt;p&gt;When I ran the crawler, it crashed right away! Here&amp;#8217;s the error message:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;File "udacity_crawler.py", line 85, in add_page_to_index
    if word[-1] in punctuation:
IndexError: string index out of range
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The punctuation-stripping code I wrote earlier is too fragile, so I&amp;#8217;ll have to find a better solution. I checked the Python reference and found the string methods &lt;code&gt;lstrip()&lt;/code&gt; and &lt;code&gt;rstrip()&lt;/code&gt; which remove characters from the beginning and end (&amp;#8216;left&amp;#8217; and &amp;#8216;right&amp;#8217;) of strings. Thus:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;for word in words:
    word = word.lstrip(punctuation)
    word = word.rstrip(punctuation)
    word = word.lower()
    add_to_index(index, word, url)
&lt;/code&gt;&lt;/pre&gt;
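
&lt;p&gt;The earlier crash most likely came from a &amp;#8220;word&amp;#8221; made of a single punctuation mark: once the first check removed it, the string was empty, and &lt;code&gt;word[-1]&lt;/code&gt; had nothing to index. The strip methods don&amp;#8217;t have that problem, since they simply return an empty string. Here&amp;#8217;s a small standalone sketch to check that behavior. The helper names are hypothetical, &lt;code&gt;string.punctuation&lt;/code&gt; stands in for the hand-typed punctuation string, and skipping empty strings is an extra step that the crawler code above doesn&amp;#8217;t do:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import string

def clean_word(word, punctuation=string.punctuation):
    """Strip leading and trailing punctuation and convert to lowercase."""
    return word.strip(punctuation).lower()

def clean_words(text):
    """Split text into cleaned words, skipping any that end up empty."""
    cleaned = [clean_word(word) for word in text.split()]
    return [word for word in cleaned if word]

print(clean_word('-') == '')                     # prints True
print(clean_words('Hello, world! -- (Python)'))  # prints ['hello', 'world', 'python']
&lt;/code&gt;&lt;/pre&gt;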

&lt;p&gt;It works! The new index is uploading to the development server as I type, and it&amp;#8217;s pretty slow! Crawling just a few more pages resulted in a massive increase in the size of the index: the &lt;code&gt;.csv&lt;/code&gt; file that contains URLs, terms, and Daveranks is now 1.3 megabytes! Meanwhile, the list of search terms is only 35k. Looking over the terms and URLs, a few immediate problems to solve are clear:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;A lot of URLs are duplicated. I should figure out how to check for this and eliminate them.&lt;/li&gt;
&lt;li&gt;The crawler is picking up a lot of non-word noise, like Javascript functions.&lt;/li&gt;
&lt;li&gt;The crawler is picking up empty strings (&lt;code&gt;''&lt;/code&gt;) as a word.&lt;/li&gt;
&lt;li&gt;Stripping punctuation from the end of strings sometimes messes up things like code samples on the forums. (Or is this really a problem?)&lt;/li&gt;
&lt;li&gt;Usernames and Karma scores from the forum are mashed together in the index. (This might not be a very &lt;strong&gt;big&lt;/strong&gt; problem).&lt;/li&gt;
&lt;li&gt;Maybe the engine shouldn&amp;#8217;t index extremely common words like &amp;#8220;is&amp;#8221; and &amp;#8220;if.&amp;#8221; (Then again, some of these are important terms in Python!)&lt;/li&gt;
&lt;li&gt;The index is really big already. Maybe I should limit it to CS101 content. On the other hand, I get 5 GB of space on App Engine.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;The database is still (!) loading, so I&amp;#8217;m going to wrap up this post for now and find another problem to work on.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20612530936</link><guid>http://davedavefind.tumblr.com/post/20612530936</guid><pubDate>Fri, 06 Apr 2012 17:54:49 -0400</pubDate></item><item><title>Figuring out queries</title><description>&lt;p&gt;Let&amp;#8217;s see if we can get DaveDaveFind working with the new data. First, we need to remember to import the &lt;code&gt;PageUrl&lt;/code&gt; and &lt;code&gt;SearchTerm&lt;/code&gt; models from &lt;code&gt;models.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To figure out how to get URLs out of the database, I read about &lt;a href="https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_order"&gt;queries&lt;/a&gt; and tried things out in the App Engine console until I stumbled on something that worked. Once I started to understand it, I realized that the database methods are pretty intuitive. Here&amp;#8217;s the process, with comments on each line:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Get all SearchTerm objects.
q = SearchTerm.all()

# Filter out the ones that don't match the search_query.
q.filter('term =', search_query)

# Retrieve the first match from the database. (There should just be one).
term = q.get()

# Now get the PageUrls that are associated with the term.
page_urls = term.pages

# Sort them by dave_rank and return the top five.
results = page_urls.order('-dave_rank').fetch(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;#8217;s possible to write this a little more concisely, by chaining some of the methods together:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;q = SearchTerm.all().filter('term =', search_query).get()
page_urls = q.pages
results = page_urls.order('-dave_rank').fetch(5)
&lt;/code&gt;&lt;/pre&gt;
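
&lt;p&gt;One thing to keep in mind: &lt;code&gt;get()&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt; when no entity matches, so asking for &lt;code&gt;.pages&lt;/code&gt; fails for any term that isn&amp;#8217;t in the index. A hedged sketch of a guard, using a hypothetical helper &lt;code&gt;top_results()&lt;/code&gt; that takes whatever &lt;code&gt;get()&lt;/code&gt; returned:&lt;/p&gt;

```python
def top_results(term, limit=5):
    """Return the best pages for a SearchTerm entity, or [] if there is none."""
    if term is None:
        # The search term is not in the index, so there is nothing to rank.
        return []
    return term.pages.order('-dave_rank').fetch(limit)
```

&lt;p&gt;In the handler, this would be called as &lt;code&gt;results = top_results(q)&lt;/code&gt;, and an empty list can be treated as &amp;#8220;no results.&amp;#8221;&lt;/p&gt;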

&lt;p&gt;Now we just need to pass this to the template. The cool thing about templates is that they can include Python logic right alongside HTML. This block will list each URL in the results passed in from &lt;code&gt;main.py&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;h3&amp;gt;You searched for: {{ search_query }}&amp;lt;/h3&amp;gt;
        %if results:
            %for page in results:
                &amp;lt;a href="{{ page.url }}"&amp;gt;{{ page.url }}&amp;lt;/a&amp;gt;
            %end
        %end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, searching for a term in the index should return a list of links, ordered by Daverank:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m22qfnOfSn1qz7dqc.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;This is a pretty good start, but a lot of improvements stand out right away. The crawler did a good job of splitting up each word, but many of them still have punctuation attached. Searches are also case-sensitive, which doesn&amp;#8217;t make any sense. Since I&amp;#8217;ve just been testing the database calls, I still haven&amp;#8217;t put in any code to catch words that aren&amp;#8217;t in the index, so the program crashes for most terms. And eventually, I&amp;#8217;ll have to flesh out the results page to provide something more than a long list of links. For now, you can see the latest update &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/tree/f857e63435acc0175e6ce376856715fd6f26ff7e"&gt;here&lt;/a&gt;.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20606376353</link><guid>http://davedavefind.tumblr.com/post/20606376353</guid><pubDate>Fri, 06 Apr 2012 16:04:40 -0400</pubDate></item><item><title>Steve Holt!</title><description>&lt;p&gt;When I last left off, I was thinking about how to check if a search term is also a Python term and model this in the database. I think simply adding a True/False property to the &lt;code&gt;SearchTerm&lt;/code&gt; model is best for now:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class SearchTerm(db.Model):
    """Models a search term."""
    term = db.StringProperty()
    is_pythonterm = db.BooleanProperty()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When DaveDaveFind looks up a search term, it should be able to check the value of &lt;code&gt;is_pythonterm&lt;/code&gt; and figure out if it should look for a &lt;code&gt;PythonTerm&lt;/code&gt; definition, too. Now, let&amp;#8217;s see if we can get these models working. Just like when I added Python terms, I&amp;#8217;ll need to figure out 1) how to get our crawler dictionaries into an App Engine-readable format and 2) how to load them onto the development server. For simplicity&amp;#8217;s sake, I&amp;#8217;ll start with the very simple index of the dummy site.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s an excerpt of what the crawler returns:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;INDEX:  {u'have': [u'http://www.udacity.com/cs101x/crawling.html'], u'is': ['http://www.udacity.com/cs101x/index.html', 'http://www.udacity.com/cs101x/index.html'], u'am': [u'http://www.udacity.com/cs101x/crawling.html'], u'idea': ['http://www.udacity.com/cs101x/index.html'], u'walk': ['http://www.udacity.com/cs101x/index.html'], ... }

RANKS:  {'http://www.udacity.com/cs101x/index.html': 0.09157335528823045, u'http://www.udacity.com/cs101x/walking.html': 0.06446669411028806, u'http://www.udacity.com/cs101x/flying.html': 0.06446669411028806, ... }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I have a feeling it will be difficult to map these to my models. The &amp;#8216;bulkloader&amp;#8217; code that I wrote last night was very simple, but I&amp;#8217;m not sure how to put terms with many URLs in a &lt;code&gt;.csv&lt;/code&gt; file, where each row can only store one value for each column. I found &lt;a href="http://seewah.blogspot.com/2009/08/datastore-bulk-upload-referenceproperty.html"&gt;this blog post&lt;/a&gt; with some information on this.&lt;/p&gt;

&lt;p&gt;I think I&amp;#8217;ll want to do this in steps. First, put all the search terms in the database, since everything else is associated with them. Next, put in all the URLs and ranks, each of which is associated with a search term. So to start, I&amp;#8217;ll try saving all search terms in the index to a &lt;code&gt;.csv&lt;/code&gt; file. Here&amp;#8217;s the &lt;code&gt;write_csv()&lt;/code&gt; procedure I wrote last night in &lt;code&gt;doc_crawler.py&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def write_csv(filename, dict):
    f = open(filename, 'wt')
    try:
        writer = csv.writer(f)
        writer.writerow(['term', 'definition'])
        for key in dict:
            ascii_key = key.encode('ascii', 'ignore')
            ascii_def = dict[key].encode('ascii', 'ignore')
            writer.writerow([ascii_key, ascii_def])
    finally:
        f.close()
        print "Finished writing CSV file."
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For now, I&amp;#8217;ll add it to &lt;code&gt;udacity_crawler.py&lt;/code&gt;, but if I find myself using it frequently, I&amp;#8217;ll put it in its own file. It&amp;#8217;s pretty easy to change this to write just keys to a &lt;code&gt;.csv&lt;/code&gt; file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def write_csv(filename, dict):
    f = open(filename, 'wt')
    try:
        writer = csv.writer(f)
        writer.writerow(['term'])
        for key in dict:
            ascii_key = key.encode('ascii', 'ignore')
            #ascii_def = dict[key].encode('ascii', 'ignore')
            writer.writerow([ascii_key])
    finally:
        f.close()
        print "Finished writing CSV file."
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;#8217;s see if it works. I added the line&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;write_csv('search_terms.csv', index) 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;to the bottom of our crawler code, and it returned a one-column &lt;code&gt;.csv&lt;/code&gt; file with all our search terms. Since I&amp;#8217;ll probably need to do this again, it seems like a good idea to save this procedure. I&amp;#8217;ll rename it &lt;code&gt;save_search_terms()&lt;/code&gt; and make another copy to tinker with for the second &lt;code&gt;.csv&lt;/code&gt; upload.&lt;/p&gt;

&lt;p&gt;Now, I need to generate a &lt;code&gt;.csv&lt;/code&gt; with one term, one URL, and one Daverank in each row. Let&amp;#8217;s call this procedure &lt;code&gt;write_url_info()&lt;/code&gt;. It needs to read our search index and Daveranks. Here&amp;#8217;s what I came up with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def write_url_info(filename, index, ranks):
    f = open(filename, 'wt')
    try:
        writer = csv.writer(f)
        writer.writerow(['term', 'url', 'dave_rank'])
        for term in index:
            # Get the term's list of urls
            url_list = index[term]
            for url in url_list:
                ascii_url = url.encode('ascii', 'ignore')
                ascii_term = term.encode('ascii', 'ignore')
                dave_rank = ranks[url]
                # Keep the cells in the same order as the header row.
                writer.writerow([ascii_term, ascii_url, dave_rank])
    finally:
        f.close()
        print "Finished writing CSV file."
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This goes through each &lt;code&gt;term&lt;/code&gt; in the &lt;code&gt;index&lt;/code&gt; dictionary and gets its list of URLs. Then it iterates through each &lt;code&gt;url&lt;/code&gt; in the &lt;code&gt;url_list&lt;/code&gt;, writing one row with the &lt;code&gt;term&lt;/code&gt;, the &lt;code&gt;url&lt;/code&gt;, and its &lt;code&gt;dave_rank&lt;/code&gt; to the file. I added the line:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;write_url_info('url_info.csv', index, ranks)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;at the bottom of the crawler code, and it returned a readable &lt;code&gt;.csv&lt;/code&gt; file. Perfect! Now I just have to figure out how to get these into the App Engine datastore. I&amp;#8217;ll start with a loader file like the one I wrote last night. Copying straight from that file, changing all the &amp;#8216;&lt;code&gt;Python&lt;/code&gt;&amp;#8217;s to &amp;#8216;&lt;code&gt;Search&lt;/code&gt;&amp;#8217; and deleting the &lt;code&gt;'definition'&lt;/code&gt; line gives me this beautiful file I&amp;#8217;ll call &lt;code&gt;searchterm_loader.py&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from google.appengine.ext import db
from google.appengine.tools import bulkloader
from models import SearchTerm

class SearchTermLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'SearchTerm',
            [('term', str),
            ])
loaders = [SearchTermLoader]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now I&amp;#8217;ll enter the magic words in the terminal (I had to look them up from yesterday&amp;#8217;s post!):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;appcfg.py upload_data --config_file=searchterm_loader.py --filename=crawler/search_terms.csv --has_header --url=http://localhost:8000/remote_api --kind=SearchTerm
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And the results in the terminal:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[INFO    ] Opening database: bulkloader-progress-20120406.184933.sql3
[INFO    ] Connecting to localhost:8000/remote_api
[INFO    ] Skipping header line.
[INFO    ] Starting import; maximum 10 entities per post
.....
[INFO    ] 41 entities total, 0 previously transferred
[INFO    ] 41 entities (4679 bytes) transferred in 7.6 seconds
[INFO    ] All entities successfully transferred
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Alright! I&amp;#8217;m starting to understand uploading data. Now for a more complicated loader. I used &lt;a href="http://seewah.blogspot.com/2009/08/datastore-bulk-upload-referenceproperty.html"&gt;this blog post&lt;/a&gt; to help me figure out how to write it. Here&amp;#8217;s what I wound up with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from google.appengine.ext import db
from google.appengine.tools import bulkloader
import models


def get_searchterm(term):
    terms = db.GqlQuery("select * from SearchTerm where term = :1", term)
    if terms.count() == 0:
        newSearchTerm = models.SearchTerm(term=term)
        db.put(newSearchTerm)
        return newSearchTerm
    else:
        return terms[0]

class PageUrlLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, "PageUrl",
                                    [("term", get_searchterm),
                                    ("url", str),
                                    ("dave_rank", float) 
                                    ])


loaders = [PageUrlLoader]
if __name__ == '__main__':
    bulkload.main(PageUrlLoader)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I think I mostly understand what&amp;#8217;s going on here. The syntax is complicated, but the key part is this list, which is displayed funkily across several lines:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[("term", get_searchterm), ("url", str), ("dave_rank", float)]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This describes where to send each cell of the &lt;code&gt;.csv&lt;/code&gt; file that&amp;#8217;s being read into the database. If it&amp;#8217;s a term, run the procedure &lt;code&gt;get_searchterm()&lt;/code&gt;. If it&amp;#8217;s a URL or a Daverank number, store it as a new &lt;code&gt;str&lt;/code&gt; or &lt;code&gt;float&lt;/code&gt;. This time it took longer to load, but it eventually worked:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[INFO    ] Opening database: bulkloader-progress-20120406.191050.sql3
[INFO    ] Connecting to localhost:8000/remote_api
[INFO    ] Skipping header line.
[INFO    ] Starting import; maximum 10 entities per post
......
[INFO    ] 53 entities total, 0 previously transferred
[INFO    ] 53 entities (51325 bytes) transferred in 7.1 seconds
[INFO    ] All entities successfully transferred
&lt;/code&gt;&lt;/pre&gt;
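
&lt;p&gt;To make the column mapping concrete, here&amp;#8217;s a plain-Python sketch of the idea (not the bulkloader&amp;#8217;s actual implementation): each cell in a row is passed through the converter paired with its column. The helper &lt;code&gt;load_rows()&lt;/code&gt; and the sample data are made up for illustration, and &lt;code&gt;str.upper&lt;/code&gt; stands in for &lt;code&gt;get_searchterm()&lt;/code&gt;:&lt;/p&gt;

```python
import csv

def load_rows(csv_text, columns):
    """Apply each column's converter to the matching cell of every row."""
    reader = csv.reader(csv_text.splitlines())
    next(reader)  # skip the header row, as --has_header does
    entities = []
    for row in reader:
        entity = {}
        # Pair the first converter with the first cell, and so on.
        for (name, convert), cell in zip(columns, row):
            entity[name] = convert(cell)
        entities.append(entity)
    return entities

columns = [("term", str.upper), ("url", str), ("dave_rank", float)]
sample = "term,url,dave_rank\nwalk,http://www.udacity.com/cs101x/index.html,0.09\n"
print(load_rows(sample, columns))
```

&lt;p&gt;Because the pairing is positional, the order of the cells in each row has to match the order of the list.&lt;/p&gt;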

&lt;p&gt;I did a &lt;a href="http://www.youtube.com/watch?v=rREGbLdOzfg"&gt;Steve Holt!&lt;/a&gt; when this one worked. I&amp;#8217;m amazed that this project has worked at all so far, and that I&amp;#8217;ve figured out some complicated stuff pretty quickly. I came into the Udacity course with a little bit of Python knowledge, but I still considered myself a novice. Now I&amp;#8217;ve cobbled together the makings of a working search engine (I think!). Over the last few days, I&amp;#8217;ve come to realize that the structure of Udacity courses is a brilliant model for learning to program well. We learned one idea at a time, and tried to use it to solve one simple problem right away. In doing so, we broke up the process of writing a web crawler into manageable steps.&lt;/p&gt;

&lt;p&gt;I&amp;#8217;ve been writing these posts the same way, even though I didn&amp;#8217;t realize it at first. Everything I&amp;#8217;ve written here has been more or less real-time, which has been a huge help in clarifying my thinking and preventing me from messing up my code. Approaching this problem the Udacity way, by spending five minutes solving one little problem at a time, has helped me build a pretty complicated thing that I&amp;#8217;m very proud of so far. And when I&amp;#8217;ve needed to learn something new, it&amp;#8217;s almost always been easy to figure out with the help of online documentation or a StackOverflow question.&lt;/p&gt;

&lt;p&gt;Anyways, here&amp;#8217;s what the data looks like in the App Engine console:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m22gvx9AI01qz7dqc.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;The next step will be going back to the code in &lt;code&gt;models.py&lt;/code&gt; to see if we can get DaveDaveFind to start using data from the database (and whether it&amp;#8217;s been loaded correctly!). This seems like a good time to push my changes to GitHub. You can see them &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/tree/553cbfb1ef0218c566d89d80010d3b2057382018"&gt;here&lt;/a&gt;.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20594894352</link><guid>http://davedavefind.tumblr.com/post/20594894352</guid><pubDate>Fri, 06 Apr 2012 12:38:25 -0400</pubDate></item><item><title>Don't get in a pickle!</title><description>&lt;p&gt;Let&amp;#8217;s check in on the TODO list:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;strike&gt;Limit the depth of our crawler code.&lt;/strike&gt;&lt;/li&gt;
&lt;li&gt;&lt;strike&gt;Figure out a way to stop the crawler manually once its index is &amp;#8220;big enough.&amp;#8221;&lt;/strike&gt;&lt;/li&gt;
&lt;li&gt;&lt;strike&gt;Implement the URank (DaveRank?) algorithm.&lt;/strike&gt;&lt;/li&gt;
&lt;li&gt;Think about how to store and look up the best result, like in &lt;code&gt;lookup_best()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Think about how to store our index, graph, and ranks in the database.&lt;/li&gt;
&lt;li&gt;Add multiword lookups. (I still haven&amp;#8217;t answered this question on the final!)&lt;/li&gt;
&lt;li&gt;Get the documentation parser to retrieve more Python terms and better definitions.&lt;/li&gt;
&lt;li&gt;Figure out a way to get information from YouTube transcripts!&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Not bad for a few hours of work! The crawler is working, outputting a usable index, graph, and ranks, even for the Udacity site. I have most of the data I need, but it&amp;#8217;s time to sit down and think about how to store it.&lt;/p&gt;

&lt;p&gt;I&amp;#8217;m not quite sure how to map the dictionaries created by the Udacity crawler to models in the database. (Google App Engine technically uses a &amp;#8216;datastore&amp;#8217;, but I am not smart enough to understand the difference yet). To help out, I re-read some parts of the App Engine &lt;a href="https://developers.google.com/appengine/docs/java/datastore/jdo/relationships"&gt;documentation&lt;/a&gt; (still confused!) and looked at the way some &lt;a href="http://www.allbuttonspressed.com/projects/nonrel-search"&gt;other&lt;/a&gt; &lt;a href="http://code.google.com/p/django-fts/source/browse/trunk/fts/models.py"&gt;people&lt;/a&gt; have implemented search engine models. At one point, I even thought about &lt;a href="http://docs.python.org/library/pickle.html"&gt;pickling&lt;/a&gt; the dictionaries and putting them in the database, but I&amp;#8217;m not sure this is a good idea. Here are a few &lt;a href="http://kovshenin.com/2010/app-engine-json-objects-google-datastore/"&gt;interesting&lt;/a&gt; &lt;a href="http://kovshenin.com/2010/app-engine-python-objects-in-the-google-datastore/"&gt;blog&lt;/a&gt; &lt;a href="http://kovshenin.com/2010/pickle-vs-json-which-is-faster/"&gt;posts&lt;/a&gt; on doing this.&lt;/p&gt;

&lt;p&gt;But before I make any new models, it seems prudent to think about exactly what the search engine needs to return a result. Let&amp;#8217;s think through each data structure created by the crawler.&lt;/p&gt;

&lt;p&gt;The search index is a dictionary that maps Unicode strings to lists of Unicode URLs. There might be more than one URL mapped to each string, since a search term might be on more than one page. Here&amp;#8217;s what it looks like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{ u'evil': [u'http://searchwithpeter.info/'], u'cs101': [u'http://www.udacity.com/', u'http://www.udacity-forums.com'], … }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The graph is a dictionary mapping URLs to lists of URLs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{ u'http://searchwithpeter.info/': [u'http://en.wikipedia.org/list_of_secret_plans', u'http://discount-robo-soldiers.biz/'], … }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And the ranks are a simple dictionary that maps one URL to one floating point number:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{ 'http://www.udacity.com/cs101x/index.html': 0.09157335528823045, … }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Since these dictionaries included lists, I thought I might need to store lists in the database by pickling or figuring out how to use JSON with Python. But let&amp;#8217;s step back and think about what information DaveDaveFind actually needs to return a search result.&lt;/p&gt;

&lt;p&gt;I still haven&amp;#8217;t used the &lt;code&gt;lookup()&lt;/code&gt; procedure in the crawler code, or the updated &lt;code&gt;lucky_search()&lt;/code&gt; procedure from &lt;a href="http://www.udacity.com/view#Course/cs101/CourseRev/feb2012/Unit/528001/Nugget/592002"&gt;Homework 6&lt;/a&gt;, which returns the best result it can find. These procedures aren&amp;#8217;t much use as part of the crawler code, since I&amp;#8217;m not using the crawler to look up search terms. At some point, I&amp;#8217;ll need to move the lookup procedures to the code in &lt;code&gt;main.py&lt;/code&gt; that handles search lookups. But it&amp;#8217;s worth looking at the information that they use. Here&amp;#8217;s the &lt;code&gt;lucky_search()&lt;/code&gt; procedure from the homework:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def lucky_search(index, ranks, keyword):
    pages = lookup(index, keyword)
    if not pages:
        return None
    best_page = pages[0]
    for candidate in pages:
        if ranks[candidate] &amp;gt; ranks[best_page]:
            best_page = candidate
    return best_page
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You might remember that we also wrote a procedure to return all pages matching a particular keyword, in order of their rank, by writing our own quicksort algorithm. Both these questions seemed more about understanding how to sort than making a fast lookup procedure, in the same way that we learned to make hash tables before using dictionaries. (This was one of my favorite moments in the course. I sort of knew about the idea of a hash table without any real understanding, and I definitely knew how to use a Python dictionary, but I had no idea that dictionaries worked so well because they use hash tables!)&lt;/p&gt;

&lt;p&gt;When I went to get the code from the answer to this question, I noticed that I cheated a little by using the &lt;code&gt;list.sort()&lt;/code&gt; method instead of iterating through the pages. (I didn&amp;#8217;t notice, since it&amp;#8217;s a method I&amp;#8217;ve used a lot elsewhere). For now, though, the details of the procedure aren&amp;#8217;t too important. What matters are the values it uses to do a search: the data in our index, the dictionary of ranks, and a keyword.&lt;/p&gt;
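
&lt;p&gt;For reference, here&amp;#8217;s roughly what the sorting shortcut looks like. This is a sketch rather than the homework answer: &lt;code&gt;ordered_search()&lt;/code&gt; is a name chosen for illustration, &lt;code&gt;lookup()&lt;/code&gt; is reduced to a dictionary lookup, and it uses the built-in &lt;code&gt;sorted()&lt;/code&gt; instead of a hand-written quicksort:&lt;/p&gt;

```python
def lookup(index, keyword):
    # Simplified stand-in: return the keyword's URLs, or [] if it is missing.
    return index.get(keyword, [])

def ordered_search(index, ranks, keyword):
    """Return every page matching keyword, highest rank first."""
    pages = lookup(index, keyword)
    return sorted(pages, key=lambda url: ranks[url], reverse=True)

index = {'cs101': ['a.html', 'b.html', 'c.html']}
ranks = {'a.html': 0.1, 'b.html': 0.5, 'c.html': 0.3}
print(ordered_search(index, ranks, 'cs101'))
```

&lt;p&gt;Like &lt;code&gt;lucky_search()&lt;/code&gt;, it only needs the index, the ranks, and a keyword.&lt;/p&gt;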

&lt;p&gt;As it turns out, we don&amp;#8217;t need the graph to look up a search term, which means we can ignore its complicated data structure. After thinking for a minute, this makes sense: the whole point of &lt;strike&gt;Page&lt;/strike&gt;&lt;strike&gt;U&lt;/strike&gt;Daverank is to summarize all the information in a complex graph with one simple number. But we &lt;strong&gt;do&lt;/strong&gt; need the information in the index, which is also more complicated than a simple dictionary.&lt;/p&gt;

&lt;p&gt;To think through how I might model this in the database, I&amp;#8217;ll take a look at the &lt;code&gt;PythonTerm&lt;/code&gt; model I made yesterday. It&amp;#8217;s saved in &lt;code&gt;models.py&lt;/code&gt; in the project root directory.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from google.appengine.ext import db

class PythonTerm(db.Model):
    """Models a term from the Python glossary."""
    term = db.StringProperty()
    definition = db.TextProperty()

def store_pythonterm(query, result):
    pythonterm = PythonTerm(term=query, definition=result)
    pythonterm.put()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For now, I&amp;#8217;m going to delete the &lt;code&gt;store_pythonterm()&lt;/code&gt; procedure. I added it following the example of others, but I can see now that it&amp;#8217;s not necessary. DaveDaveFind should only retrieve information from the database. The only person storing new Python terms should be me, when I update the search index. I might need to use a similar procedure sometime in the future, but for now I don&amp;#8217;t need it. If I ever do, I&amp;#8217;ll look back at my GitHub history and find it.&lt;/p&gt;

&lt;p&gt;So, the relevant things I need to store are URLs, search terms, and ranks. All URLs will have just one rank. Search terms might be associated with one or many URLs, and URLs might be associated with one or many search terms. I think it&amp;#8217;s possible to model either search terms or URLs interchangeably, but something will always have to map one thing to more than one other thing, which means I need to look at &amp;#8220;one to many&amp;#8221; models. I searched this in the App Engine documentation and found &lt;a href="https://developers.google.com/appengine/articles/modeling"&gt;this page&lt;/a&gt;. Huge thanks to Rafe Kaplan, who wrote this tutorial! It made sense right away. Here are the changes I made to the models, based on what I learned from the tutorial:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from google.appengine.ext import db

class PythonTerm(db.Model):
    """Models a term from the Python glossary."""
    term = db.StringProperty()
    definition = db.TextProperty()

class SearchTerm(db.Model):
    """Models a search term."""
    term = db.StringProperty()

class PageUrl(db.Model):
    """Models a URL and its Daverank from the index."""
    # A search term can be associated with many pages...
    page = db.ReferenceProperty(SearchTerm,
                                collection_name='pages')

    #...but each page has a URL and Daverank.
    url = db.StringProperty()
    dave_rank = db.FloatProperty()
&lt;/code&gt;&lt;/pre&gt;
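
&lt;p&gt;In plain Python, the one-to-many layout amounts to flattening the crawler&amp;#8217;s dictionaries into one record per (term, URL) pair. A rough sketch, where &lt;code&gt;PageRecord&lt;/code&gt; and &lt;code&gt;flatten_index()&lt;/code&gt; are illustrative names rather than part of the project:&lt;/p&gt;

```python
from collections import namedtuple

# One record per (term, URL) pair, mirroring what a PageUrl entity holds.
PageRecord = namedtuple('PageRecord', ['term', 'url', 'dave_rank'])

def flatten_index(index, ranks):
    """Turn {term: [urls]} plus {url: rank} into a flat list of records."""
    records = []
    for term in sorted(index):
        for url in index[term]:
            records.append(PageRecord(term, url, ranks[url]))
    return records

index = {'cs101': ['http://www.udacity.com/', 'http://www.udacity-forums.com']}
ranks = {'http://www.udacity.com/': 0.2, 'http://www.udacity-forums.com': 0.1}
for record in flatten_index(index, ranks):
    print(record)
```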

&lt;p&gt;As I think about this, I think there&amp;#8217;s one more thing to include. What if a search term is also a Python term? If so, I want DaveDaveFind to return both search results and the Python definition. I&amp;#8217;m going to do some reading and see if I can implement this in my next post.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20586825125</link><guid>http://davedavefind.tumblr.com/post/20586825125</guid><pubDate>Fri, 06 Apr 2012 09:51:00 -0400</pubDate></item><item><title>DaveRank</title><description>&lt;p&gt;Next, I&amp;#8217;m going to try to add the Urank algorithm from class to the web crawler code. Since Urank is a knock-off of Pagerank, which is a registered trademark of Google, the algorithm will be called Daverank instead.&lt;/p&gt;

&lt;p&gt;The crawler already returns a graph, so I just need to add the &lt;code&gt;compute_ranks()&lt;/code&gt; procedure. Here&amp;#8217;s the procedure we used in class:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def compute_ranks(graph):
    d = 0.8 # damping factor
    numloops = 10

    ranks = {}
    npages = len(graph)
    for page in graph:
        ranks[page] = 1.0 / npages

    for i in range(0, numloops):
        newranks = {}
        for page in graph:
            newrank = (1 - d) / npages
            for node in graph:
                if page in graph[node]:
                    newrank = newrank + d * (ranks[node] / len(graph[node]))

            newranks[page] = newrank
        ranks = newranks
    return ranks
&lt;/code&gt;&lt;/pre&gt;
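
&lt;p&gt;The inner &lt;code&gt;for node in graph&lt;/code&gt; loop scans every page to find the ones that link to &lt;code&gt;page&lt;/code&gt;, which gets slow as the graph grows. A possible variation (not the version from class) is to invert the graph once up front; &lt;code&gt;compute_ranks_fast()&lt;/code&gt; and &lt;code&gt;inlinks&lt;/code&gt; are names made up for this sketch:&lt;/p&gt;

```python
def compute_ranks_fast(graph, d=0.8, numloops=10):
    """Same update rule as compute_ranks(), using a precomputed inlink map."""
    npages = len(graph)

    # Invert the graph once: inlinks[page] lists the pages that link to it.
    inlinks = dict((page, []) for page in graph)
    for node in graph:
        for target in set(graph[node]):  # count each linking page only once
            if target in inlinks:        # ignore links to uncrawled pages
                inlinks[target].append(node)

    ranks = dict((page, 1.0 / npages) for page in graph)
    for _ in range(numloops):
        newranks = {}
        for page in graph:
            newrank = (1 - d) / npages
            for node in inlinks[page]:
                newrank = newrank + d * (ranks[node] / len(graph[node]))
            newranks[page] = newrank
        ranks = newranks
    return ranks

# Two pages that link only to each other should end up with equal ranks.
print(compute_ranks_fast({'a': ['b'], 'b': ['a']}))
```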

&lt;p&gt;I thought adding Daverank would be hard, but I didn&amp;#8217;t think carefully enough. I haven&amp;#8217;t changed the structure of the crawler&amp;#8217;s graph, so it should work just as it is! I added the procedure to &lt;code&gt;udacity_crawler.py&lt;/code&gt; and tested it on the simple crawler site:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m226q918N71qz7dqc.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;Works perfectly, returning &lt;code&gt;index&lt;/code&gt;, &lt;code&gt;graph&lt;/code&gt;, and &lt;code&gt;ranks&lt;/code&gt; as dictionaries. This is a good time to push everything to GitHub. To see the latest code, &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/tree/4e8faaa4f64a3b166361c6fa1a915a42b6bf0d33"&gt;click here&lt;/a&gt;.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20585076839</link><guid>http://davedavefind.tumblr.com/post/20585076839</guid><pubDate>Fri, 06 Apr 2012 08:58:00 -0400</pubDate></item><item><title>Simplifying the problem</title><description>&lt;p&gt;Now that the database is up and running, and I&amp;#8217;m starting to wrap my brain around it, I&amp;#8217;d like to figure out how to store our search index and graph. Last night, I made a short list of problems left to solve:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Limit the depth of our crawler code.&lt;/li&gt;
&lt;li&gt;Figure out a way to stop the crawler manually once its index is &amp;#8220;big enough.&amp;#8221;&lt;/li&gt;
&lt;li&gt;Implement the URank (DaveRank?) algorithm.&lt;/li&gt;
&lt;li&gt;Think about how to store and look up the best result, like in &lt;code&gt;lookup_best()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Think about how to store our index, graph, and ranks in the database.&lt;/li&gt;
&lt;li&gt;Add multiword lookups. (I still haven&amp;#8217;t answered this question on the final!)&lt;/li&gt;
&lt;li&gt;Get the documentation parser to retrieve more Python terms and better definitions.&lt;/li&gt;
&lt;li&gt;Figure out a way to get information from YouTube transcripts!&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;These are pretty big problems and a lot to think about all at once, but they should be soluble one by one. To help out, I think I&amp;#8217;ll go back to the extremely simple crawler &lt;a href="http://www.udacity.com/cs101x/index.html"&gt;test page&lt;/a&gt;. Today&amp;#8217;s goal will be to store a working index of this page in the database and return the results of simple search queries. Stepping back and simplifying the problem might mean modifying some code in the interim, but I think it will be a good way to clarify just how the search engine will work without worrying about a complicated index.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s start with the first item on the TODO list. Limiting the depth and scope of the web crawler was an exercise way back in Unit 3–I just forgot to implement it! Since the current version of our crawler seems to run forever, it&amp;#8217;s probably a good idea to limit the number of pages it sucks up. Here&amp;#8217;s the latest version of &lt;code&gt;crawl_web()&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def crawl_web(seed): # returns index, graph of inlinks
    if is_udacity(seed):
        tocrawl = [seed]
    else: 
        print "This seed is not a Udacity site!"
        return
    crawled = []
    graph = {}  # &amp;lt;url&amp;gt;, [list of pages it links to]
    index = {} 
    while tocrawl: 
        page = tocrawl.pop()
        if page not in crawled:
            soup, url = get_page(page)
            add_page_to_index(index, page, soup)
            outlinks = get_all_links(soup, url)
            graph[page] = outlinks
            add_new_links(tocrawl, outlinks)
            crawled.append(page)
    return index, graph
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It should look pretty similar to the original procedure. I&amp;#8217;ll modify it in the same way we &lt;a href="http://www.udacity.com/view#Course/cs101/CourseRev/feb2012/Unit/252001/Nugget/269003"&gt;changed the code&lt;/a&gt; in class. First, add &lt;code&gt;max_pages&lt;/code&gt; and &lt;code&gt;max_depth&lt;/code&gt; as parameters of the function. Then, edit the &lt;code&gt;if&lt;/code&gt; condition to check whether &lt;code&gt;max_pages&lt;/code&gt; has been reached:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;if page not in crawled and len(crawled) &amp;lt; max_pages 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Implementing a depth limit is a little harder. Here&amp;#8217;s how Peter did it &lt;a href="http://www.udacity.com/view#Course/cs101/CourseRev/feb2012/Unit/252001/Nugget/333001"&gt;in class&lt;/a&gt;. To keep it simple, he modified the crawler to stop using the &lt;code&gt;union()&lt;/code&gt; procedure. This isn&amp;#8217;t an option now that the crawler code is more complicated and &lt;code&gt;union()&lt;/code&gt; has become &lt;code&gt;add_new_links()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;My solution isn&amp;#8217;t too different from Peter&amp;#8217;s code. We both modified &lt;code&gt;tocrawl&lt;/code&gt; to be not just a list of URLs, but a list of lists, where the first element is a URL and the second is the URL&amp;#8217;s &lt;code&gt;depth&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;tocrawl = [[seed, 0]]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We both popped the value of &lt;code&gt;depth&lt;/code&gt; from the list:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;page, depth = tocrawl.pop()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And added &lt;code&gt;max_depth&lt;/code&gt; to this test condition:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;if page not in crawled and len(crawled) &amp;lt; max_pages and depth &amp;lt;= max_depth:
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here&amp;#8217;s the difference: the procedures &lt;code&gt;crawl_web()&lt;/code&gt; and &lt;code&gt;add_new_links()&lt;/code&gt; will both have to keep track of &lt;code&gt;depth&lt;/code&gt;, so &lt;code&gt;crawl_web()&lt;/code&gt; needs to pass this value to &lt;code&gt;add_new_links()&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;add_new_links(tocrawl, outlinks, depth)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And when &lt;code&gt;add_new_links()&lt;/code&gt; appends a new URL to &lt;code&gt;tocrawl&lt;/code&gt;, it needs to make sure it&amp;#8217;s in the new format and increment &lt;code&gt;depth&lt;/code&gt; by 1:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;tocrawl.append([link, depth+1])
&lt;/code&gt;&lt;/pre&gt;
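
&lt;p&gt;The real &lt;code&gt;add_new_links()&lt;/code&gt; isn&amp;#8217;t shown in this post, so here&amp;#8217;s a minimal sketch of how a depth-aware version might look. The body, including the check against URLs that are already queued, is an assumption for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def add_new_links(tocrawl, outlinks, depth):
    """Queue each new outlink one level deeper than the page it came from."""
    # Assumption: tocrawl holds [url, depth] pairs, as described above.
    queued = [entry[0] for entry in tocrawl]
    for link in outlinks:
        if link not in queued:
            tocrawl.append([link, depth + 1])
            queued.append(link)
&lt;/code&gt;&lt;/pre&gt;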

&lt;p&gt;That&amp;#8217;s it! Here&amp;#8217;s the new &lt;code&gt;crawl_web()&lt;/code&gt; procedure:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def crawl_web(seed, max_pages, max_depth): # returns index, graph of outlinks
    if is_udacity(seed):
        tocrawl = [[seed, 0]]
    else: 
        print "This seed is not a Udacity site!"
        return
    crawled = []
    graph = {}  # &amp;lt;url&amp;gt;, [list of pages it links to]
    index = {} 
    while tocrawl: 
        page, depth = tocrawl.pop()
        print "CURRENT DEPTH: ", depth
        print "PAGES CRAWLED: ", len(crawled)
        if page not in crawled and len(crawled) &amp;lt; max_pages and depth &amp;lt;= max_depth:
            soup, url = get_page(page)
            add_page_to_index(index, page, soup)
            outlinks = get_all_links(soup, url)
            graph[page] = outlinks
            add_new_links(tocrawl, outlinks, depth)
            #print tocrawl
            crawled.append(page)
            #print crawled
    return index, graph
&lt;/code&gt;&lt;/pre&gt;
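
&lt;p&gt;To see how the two limits interact without touching the network, here&amp;#8217;s a small sketch that runs the same loop over a made-up site stored in a dictionary. The name &lt;code&gt;crawl_site()&lt;/code&gt; and the sample pages are hypothetical, and the sketch leaves out the index and all &lt;code&gt;'robots.txt'&lt;/code&gt; handling:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def crawl_site(site, seed, max_pages, max_depth):
    """Crawl a fake site: a dict mapping each URL to the URLs it links to."""
    tocrawl = [[seed, 0]]
    crawled = []
    graph = {}
    while tocrawl:
        page, depth = tocrawl.pop()
        # Same three checks as crawl_web(): new page, page budget, depth limit.
        if page in crawled or len(crawled) == max_pages:
            continue
        if depth not in range(max_depth + 1):
            continue
        outlinks = site.get(page, [])
        graph[page] = outlinks
        for link in outlinks:
            tocrawl.append([link, depth + 1])
        crawled.append(page)
    return crawled, graph

# Hypothetical site: index links to a and b, a links to c, c links to d.
site = {"index": ["a", "b"], "a": ["c"], "b": [], "c": ["d"], "d": []}

shallow, shallow_graph = crawl_site(site, "index", 10, 1)
small, small_graph = crawl_site(site, "index", 2, 5)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With &lt;code&gt;max_depth&lt;/code&gt; set to 1, the crawl stops at the pages that &lt;code&gt;index&lt;/code&gt; links to directly. With &lt;code&gt;max_pages&lt;/code&gt; set to 2, it stops after two pages, whatever their depth.&lt;/p&gt;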

&lt;p&gt;When I tested the new &lt;code&gt;crawl_web()&lt;/code&gt;, I ran into an unrelated problem with my old nemesis &lt;code&gt;'robots.txt'&lt;/code&gt;. The new robotparser was having trouble with the blank URLs the crawler sometimes tries to visit. Since our crawler only visits two sites, and never crawls the off-limits pages of one of them, I&amp;#8217;ll feed it the more restrictive &lt;code&gt;'robots.txt'&lt;/code&gt; (from the forums) when it gets confused. This is a lazy stopgap rather than a real fix, but it will work for now. I&amp;#8217;ll put a note in the TODO list to figure out what&amp;#8217;s going on with blank URLs.&lt;/p&gt;

&lt;p&gt;Now, to test the crawler on the &amp;#8220;learn to crawl&amp;#8221; site. This is a nice, simple case, since it&amp;#8217;s completely self-contained. I set &lt;code&gt;max_pages&lt;/code&gt; to 10 and &lt;code&gt;max_depth&lt;/code&gt; to 5, but the crawler should finish before it reaches either limit:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m224amtxgO1qz7dqc.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;Success! A simple index and graph for the dummy site. Next, we&amp;#8217;ll think about implementing URank.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20583721082</link><guid>http://davedavefind.tumblr.com/post/20583721082</guid><pubDate>Fri, 06 Apr 2012 08:06:48 -0400</pubDate></item><item><title>TODO list</title><description>&lt;p&gt;&lt;ul&gt;&lt;li&gt;Limit the depth of our crawler code.&lt;/li&gt;
&lt;li&gt;Figure out a way to stop the crawler manually once its index is &amp;#8220;big enough.&amp;#8221;&lt;/li&gt;
&lt;li&gt;Implement the URank (DaveRank?) algorithm.&lt;/li&gt;
&lt;li&gt;Think about how to store and look up the best result, like in &lt;code&gt;lookup_best()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Think about how to store our index, graph, and ranks in the database.&lt;/li&gt;
&lt;li&gt;Add multiword lookups. (I still haven&amp;#8217;t answered this question on the final!)&lt;/li&gt;
&lt;li&gt;Get the documentation parser to retrieve more Python terms and better definitions.&lt;/li&gt;
&lt;li&gt;Figure out a way to get information from YouTube transcripts!&lt;/li&gt;
&lt;/ul&gt;&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20556590752</link><guid>http://davedavefind.tumblr.com/post/20556590752</guid><pubDate>Thu, 05 Apr 2012 20:16:00 -0400</pubDate></item><item><title>Making models</title><description>&lt;p&gt;In my last post, I explained how I set up an extremely hacky solution for displaying Python search terms on the DaveDaveFind results page. It was pretty poorly designed since it parsed a page and created a big dictionary for every search term. A better solution is to use a database.&lt;/p&gt;

&lt;p&gt;Thinking about databases gives me a headache. It doesn&amp;#8217;t help that the Google App Engine &lt;a href="https://developers.google.com/appengine/docs/python/datastore/"&gt;documentation&lt;/a&gt; is anything but CS101-level. After spending a few migraine-inducing hours slogging through the details of the App Engine Datastore API, reading a bunch of StackOverflow questions, and coming really close to giving up more than a few times, I finally got some data onto the development server and I&amp;#8217;ve started to figure out how I might implement real search results. Here&amp;#8217;s how I got there.&lt;/p&gt;

&lt;p&gt;To start off, I needed to write some data models. These seem complicated, but they&amp;#8217;re really just some code describing how data will be stored in the database. I&amp;#8217;ve done this before in Django, but now I&amp;#8217;m starting to understand it. Here&amp;#8217;s the one simple model in the database, saved in my project directory as &lt;code&gt;models.py&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from google.appengine.ext import db

class PythonTerm(db.Model):
    """Models a term from the Python glossary."""
    term = db.StringProperty()
    definition = db.TextProperty()

def store_pythonterm(query, result):
    pythonterm = PythonTerm(term=query, definition=result)
    pythonterm.put()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The database will be full of &lt;code&gt;PythonTerm&lt;/code&gt;s, each of which has a &lt;code&gt;term&lt;/code&gt;, stored as a string, and a &lt;code&gt;definition&lt;/code&gt;, stored as a longer string.&lt;/p&gt;

&lt;p&gt;Once this was set up, I needed a way to get the Python dictionary returned by &lt;code&gt;doc_crawler.py&lt;/code&gt; into some form readable by the database. It looked like there was a very complicated way to somehow upload &lt;code&gt;.csv&lt;/code&gt; files, so I looked in the trusty standard library for something to help. Here&amp;#8217;s the procedure I added, which uses the library &lt;code&gt;csv&lt;/code&gt;:&lt;/p&gt;

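&lt;pre&gt;&lt;code&gt;import csv

def write_csv(filename, term_dict):
    # Write a header row, then one row per term and definition.
    f = open(filename, 'wt')
    try:
        writer = csv.writer(f)
        writer.writerow(['term', 'definition'])
        for key in term_dict:
            # Drop any non-ASCII characters before writing the row.
            ascii_key = key.encode('ascii', 'ignore')
            ascii_def = term_dict[key].encode('ascii', 'ignore')
            writer.writerow([ascii_key, ascii_def])
    finally:
        f.close()
    # Only reached when every row was written without an error.
    print "Finished writing CSV file."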
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I used this procedure to create a &lt;code&gt;.csv&lt;/code&gt; file of the parsed Python terms.&lt;/p&gt;
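
&lt;p&gt;One quick way to check a file like this is to read it straight back. Here&amp;#8217;s a minimal round-trip sketch, written for Python 3 rather than the Python 2 used elsewhere in these notes. The names &lt;code&gt;write_terms()&lt;/code&gt; and &lt;code&gt;read_terms()&lt;/code&gt; and the two sample definitions are made up for the example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv
import os
import tempfile

def write_terms(filename, terms):
    """Write a dict of term/definition pairs to a CSV file with a header row."""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["term", "definition"])
        for term in sorted(terms):
            writer.writerow([term, terms[term]])

def read_terms(filename):
    """Read the CSV back into a dict, using the header row for column names."""
    with open(filename, newline="") as f:
        return {row["term"]: row["definition"] for row in csv.DictReader(f)}

# Made-up sample entries; the second one contains commas on purpose.
sample = {
    "abs": "Return the absolute value of a number.",
    "len": "Return the length, or number of items, of an object.",
}
path = os.path.join(tempfile.mkdtemp(), "function_dict.csv")
write_terms(path, sample)
round_trip = read_terms(path)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;csv&lt;/code&gt; module quotes any field that contains a comma, so definitions survive the trip unchanged.&lt;/p&gt;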

&lt;p&gt;Now for the annoying part. After hours of despair and endless attempts to understand how to get data from my computer onto the development server, I figured out that I needed to write a &lt;code&gt;bulkloader&lt;/code&gt; file that tells Google App Engine how to map the rows of the &lt;code&gt;.csv&lt;/code&gt; to parts of the database. Here&amp;#8217;s the file, named &lt;code&gt;term_loader.py&lt;/code&gt;, which I more or less copied straight from the documentation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from google.appengine.ext import db
from google.appengine.tools import bulkloader
from models import PythonTerm

class PythonTermLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'PythonTerm',
            [('term', str),
            ('definition', str)
            ])
loaders = [PythonTermLoader]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After wrestling with development server settings for a long time, I finally stumbled on the correct command to upload my data to the development server:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;appcfg.py upload_data --config_file=term_loader.py --filename=crawler/function_dict.csv --has_header --url=http://localhost:8000/remote_api --kind=PythonTerm
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Meanwhile, I discovered that the Google App Engine launcher includes a helpful console page with some behind-the-scenes tools:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m216hgozHP1qz7dqc.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;That&amp;#8217;s where I found my freshly-uploaded Python terms once they were loaded into the database:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m216i6BHrT1qz7dqc.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;To finish up, I added a familiar &lt;code&gt;if&lt;/code&gt; block back into the &lt;code&gt;process_search()&lt;/code&gt; procedure in &lt;code&gt;main.py&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;if search_query[0] == '`':
    term = search_query[1:]
    q = PythonTerm.all()
    q.filter('term =', term)
    python_term = q.fetch(1)[0].definition
    return template('templates/results', search_query=search_query, python_term=python_term)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This time, it uses the Google App Engine database methods &lt;code&gt;all()&lt;/code&gt;, &lt;code&gt;filter()&lt;/code&gt; and &lt;code&gt;fetch()&lt;/code&gt; to get a term from the database and display it on the page. Last of all, I tested it out:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m216ijjxHy1qz7dqc.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;It works just the same as the hacked-together method from our last post, but displays almost instantly, since it&amp;#8217;s a quick retrieval from the database. After a lot of trouble in the last few posts, I&amp;#8217;m feeling pretty good about DaveDaveFind, and starting to figure out how all the parts might work together. The next big task is figuring out how to get our search index and graph into the database, which might be a little tricky. As always, here&amp;#8217;s &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/tree/bd9e453901312f0a29997ba04af3f9244a5fbfbe"&gt;the latest version&lt;/a&gt; of DaveDaveFind.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20555458924</link><guid>http://davedavefind.tumblr.com/post/20555458924</guid><pubDate>Thu, 05 Apr 2012 19:57:41 -0400</pubDate></item><item><title>Proof of concept</title><description>&lt;p&gt;Now that our crawler code is working, and I&amp;#8217;m waiting on an answer from the Udacity forums, I&amp;#8217;ll try to implement a special feature: returning help on Python objects and commands from the online documentation.&lt;/p&gt;

&lt;p&gt;Python includes an &lt;a href="http://docs.python.org/library/functions.html#help"&gt;interactive help&lt;/a&gt; feature that can be really useful. It seemed like this would be an easy way to get DaveDaveFind to print help information to the search results page. But it looks like it&amp;#8217;s not really designed to be used outside of the terminal and getting the code to work correctly would be really hacky. I tinkered with it for a while and gave up.&lt;/p&gt;

&lt;p&gt;Instead, let&amp;#8217;s try another strategy: parsing the online Python documentation with BeautifulSoup. Of course, it would be pretty rude to open the online documentation page and parse it every time someone makes a search query. Fortunately all Python documentation is &lt;a href="http://docs.python.org/download.html"&gt;available for download&lt;/a&gt;, so we can parse it offline.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s start with something simple: the list of &lt;a href="http://docs.python.org/library/functions.html"&gt;built-in functions&lt;/a&gt; in the Python library reference. After some trial and error, I came up with this procedure, which I named &lt;code&gt;doc_crawler.py&lt;/code&gt; and placed in the &lt;code&gt;/crawler&lt;/code&gt; folder:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from bs4 import BeautifulSoup
import io

def make_function_dict(page):
    # Open the documentation HTML file and read it into BeautifulSoup.
    f = io.open(page)
    s = f.read()
    f.close()
    soup = BeautifulSoup(s)

    # Create an empty dict that will map function names to description text.
    function_dict = {}

    # Find all 'dl' tags with the css class 'function'
    functions = soup.find_all('dl', {'class':'function'})
    for function in functions:
        # The name is in a 'tt' tag and the description is in the 'dd' tag.
        title_tag = function.find('tt', {'class':'descname'})
        name = title_tag.text
        desc_tag = function.find('dd')
        desc = desc_tag.text
        function_dict[name] = desc

    return function_dict
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Fortunately, the structure of Python&amp;#8217;s HTML documentation pages is very predictable, which means it&amp;#8217;s easy to parse out each function&amp;#8217;s name and description with just a couple methods from BeautifulSoup. This worked in the terminal, so I moved on to hooking this function up to DaveDaveFind.&lt;/p&gt;

&lt;p&gt;So far, I haven&amp;#8217;t been back to the &lt;code&gt;main.py&lt;/code&gt; file set up in the very first post, but I&amp;#8217;ll try using this new code to make DaveDaveFind do something more interesting than tell users their search terms. This will mostly be an experiment, so I won&amp;#8217;t worry too much about cluttering up the project directories.&lt;/p&gt;

&lt;p&gt;First, I made a copy of the file &lt;code&gt;doc_crawler.py&lt;/code&gt; in the same directory as &lt;code&gt;main.py&lt;/code&gt;. I added the line:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from doc_crawler import make_function_dict
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;to the top of &lt;code&gt;main.py&lt;/code&gt;, and put a new &lt;code&gt;if&lt;/code&gt; block in the &lt;code&gt;process_search()&lt;/code&gt; function:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;if search_query[0] == '`':
    help_string = search_query[1:]
    dict = make_function_dict('http://docs.python.org/library/functions.html')
    help_html = dict[help_string]
    return template('templates/results', search_query=search_query, help_html=help_html)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, if we enter a search term that starts with a backtick (the markdown shortcut for a code snippet), DaveDaveFind will look for it in our Python function dictionary.&lt;/p&gt;
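
&lt;p&gt;The check for the leading backtick could also be pulled into a small helper that copes with empty queries and unknown terms. A minimal sketch, with the hypothetical names &lt;code&gt;split_query()&lt;/code&gt; and &lt;code&gt;lookup_help()&lt;/code&gt; and a plain dictionary standing in for the function dictionary:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def split_query(search_query):
    """Return (is_python_term, text) for a raw search string."""
    query = search_query.strip()
    if query.startswith("`"):
        # Drop the leading backtick, plus a closing one if the user typed it.
        return True, query[1:].strip("`").strip()
    return False, query

def lookup_help(function_dict, search_query):
    """Return the help text for a backtick query, or None if there is none."""
    is_python_term, text = split_query(search_query)
    if not is_python_term:
        return None
    return function_dict.get(text)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Using &lt;code&gt;startswith()&lt;/code&gt; and &lt;code&gt;get()&lt;/code&gt; means an empty search box or a term missing from the dictionary returns &lt;code&gt;None&lt;/code&gt; instead of raising an error.&lt;/p&gt;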

&lt;p&gt;Of course, the new &lt;code&gt;if&lt;/code&gt; block threw an error right away. The &lt;code&gt;io&lt;/code&gt; library doesn&amp;#8217;t work with the development server, so I switched the &lt;code&gt;doc_crawler.py&lt;/code&gt; code to read from a URL instead of a local file (just for now). Here&amp;#8217;s the code for my modified &lt;code&gt;doc_crawler.py&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from bs4 import BeautifulSoup
from urllib import urlopen

def make_function_dict(page):
    # Open the documentation HTML file and read it into BeautifulSoup.
    soup = BeautifulSoup(urlopen(page).read())

    # Create an empty dict that will map function names to description text.
    function_dict = {}

    # Find all 'dl' tags with the css class 'function'
    functions = soup.find_all('dl', {'class':'function'})
    for function in functions:
        # The name is in a 'tt' tag and the description is in the 'dd' tag.
        title_tag = function.find('tt', {'class':'descname'})
        name = title_tag.text
        desc_tag = function.find('dd')
        desc = desc_tag.text
        function_dict[name] = desc

    return function_dict
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Back to the development server to try it out:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m20szwBkoR1qz7dqc.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;Looks good so far…&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_m20t037abc1qz7dqc.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;It works! Unfortunately, this isn&amp;#8217;t a very good solution for a bunch of reasons. The &lt;code&gt;process_search()&lt;/code&gt; procedure in &lt;code&gt;main.py&lt;/code&gt; loads the Python documentation from the web, parses the page, and creates a new dictionary for every search. That&amp;#8217;s pretty expensive, and really slow. Function descriptions aren&amp;#8217;t always formatted correctly. And the code I wrote above is pretty sloppy: it doesn&amp;#8217;t account for a number of important cases, like when there&amp;#8217;s only a normal search term and no Python search term. I&amp;#8217;ll try to solve a few of these problems by putting the index of Python terms in a database instead of generating it on the server, but that&amp;#8217;s a big job that I&amp;#8217;ve been avoiding so far. For now, though, this proof of concept is pretty cool. Even so, I deleted my modified &lt;code&gt;doc_crawler.py&lt;/code&gt; and removed the new code in &lt;code&gt;main.py&lt;/code&gt; before pushing an update to GitHub. To see the code as it stands after this post, &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/tree/7f62527765114a74a92f283858634a4a5692b67b"&gt;click here&lt;/a&gt;.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20538755037</link><guid>http://davedavefind.tumblr.com/post/20538755037</guid><pubDate>Thu, 05 Apr 2012 15:03:00 -0400</pubDate></item><item><title>Robots redux</title><description>&lt;p&gt;After a frustrating morning, I&amp;#8217;ve decided to return to my original problem parsing &lt;code&gt;'robots.txt'&lt;/code&gt; files. I didn&amp;#8217;t get much help with my question on &lt;a href="http://stackoverflow.com/questions/10026708/python-robotparser-module-wont-load-robots-txt"&gt;StackOverflow&lt;/a&gt;, but I did find a third-party library that solves the problem. I was hoping to stick with tools from the standard library as much as possible, but the clock is ticking. 
(I&amp;#8217;m still curious about what was wrong with the original parser if anyone can explain it to me!)&lt;/p&gt;

&lt;p&gt;After tinkering with the standard parser in the interactive terminal, my best guess is that it can&amp;#8217;t handle some of the nonstandard extensions that show up in &lt;code&gt;'robots.txt'&lt;/code&gt; files, like the &lt;code&gt;Sitemap&lt;/code&gt; statement, blank lines, and certain weird URLs. The &lt;code&gt;robotexclusionrulesparser&lt;/code&gt; module from &lt;a href="http://nikitathespider.com/python/rerp/"&gt;Nikita the Spider&lt;/a&gt; adds support for some of these extensions and works just like the standard library parser.&lt;/p&gt;

&lt;p&gt;To get it working, replace the old &lt;code&gt;import robotparser&lt;/code&gt; at the top of &lt;code&gt;udacity_crawler.py&lt;/code&gt; with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import robotexclusionrulesparser as rerp
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This module&amp;#8217;s full name is pretty long, so we&amp;#8217;ll import it as &lt;code&gt;rerp&lt;/code&gt; for short. Now, all I need to do is change the line:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;rp = robotparser.RobotFileParser()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;in the procedures &lt;code&gt;get_page()&lt;/code&gt; and &lt;code&gt;get_all_links()&lt;/code&gt; to:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;rp = rerp.RobotFileParserLookalike()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A good night&amp;#8217;s sleep, a Google search or two, and I solved this problem in a few minutes. The crawler is now plugging away indexing the Udacity forums, and it&amp;#8217;s finally respecting the rules of &lt;code&gt;'robots.txt'&lt;/code&gt;. In fact, it&amp;#8217;s been running for a little more than ten minutes now, which might be a problem of its own! Here&amp;#8217;s &lt;a href="https://github.com/ecmendenhall/DaveDaveFind/blob/9535d770f67034e04b7ccbcf5b9eeedd020b8242/crawler/udacity_crawler.py"&gt;a link&lt;/a&gt; to the latest update of the crawler code.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20527612892</link><guid>http://davedavefind.tumblr.com/post/20527612892</guid><pubDate>Thu, 05 Apr 2012 10:50:00 -0400</pubDate></item><item><title>Trouble with transcripts</title><description>&lt;p&gt;While waiting for an answer to my &lt;code&gt;robotparser&lt;/code&gt; question, I&amp;#8217;ll try to work on another problem. In addition to indexing the Udacity homepage and forums, I&amp;#8217;d like DaveDaveFind to index the transcripts of video lectures, and return results from Python reference documents.&lt;/p&gt;

&lt;p&gt;I spent a few hours looking for ways to extract captions from YouTube videos. It&amp;#8217;s harder than I hoped. I started by looking at the source of a Udacity YouTube video in the developer toolbar on Chrome. I found the subtitle track (a resource in the resources window named &lt;code&gt;'timedtext'&lt;/code&gt;), but this is returned by the YouTube API and doesn&amp;#8217;t have an easy-to-use URL. Next, I searched around and read the YouTube API documentation on &lt;a href="https://developers.google.com/youtube/2.0/developers_guide_protocol_captions"&gt;captions&lt;/a&gt;. It&amp;#8217;s pretty clear that video captions are only supposed to be available to the owners of a given video. It might be possible to extract the file somehow, but it wouldn&amp;#8217;t be very nice. After this, I looked for a third-party library that might help. I found the excellent &lt;a href="http://rg3.github.com/youtube-dl/"&gt;youtube-dl&lt;/a&gt; project, but it didn&amp;#8217;t work on Udacity videos.&lt;/p&gt;

&lt;p&gt;Just to be sure, I checked out the source code (this is one great benefit of understanding how to read Python). The program is pretty complicated, but it looks like it tries to get subtitle information from a Google Video page that no longer works.&lt;/p&gt;

&lt;p&gt;Stuck again! This time, I think I&amp;#8217;ll try a different sort of solution: posting a question to the &lt;a href="http://www.udacity-forums.com/cs101/questions/57350/would-the-udacity-staff-mind-posting-public-video-transcripts"&gt;Udacity forums&lt;/a&gt; to see if the staff might be willing to post the transcripts on the main site.&lt;/p&gt;</description><link>http://davedavefind.tumblr.com/post/20525201385</link><guid>http://davedavefind.tumblr.com/post/20525201385</guid><pubDate>Thu, 05 Apr 2012 09:44:14 -0400</pubDate></item></channel></rss>
