Welcome to the Digital Public Library of America Vertical Search Demo, created for the DPLA Beta Sprint by the California Digital Library.
This demo uses Apache Nutch 1.3 for web crawling and LucidWorks Certified Distribution for Solr Release 3.2 for search, with a simple web interface built on Django templates. The index for the demo was created by running the Cloudera Distribution of Hadoop on Amazon EC2. Code and configuration files are available on Google Code.
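As a rough illustration of how a web front end can talk to a Solr index like this one, the sketch below builds a query URL for Solr's standard select handler and pulls titles and URLs out of a JSON response. It is not the demo's actual code: the base URL, the helper names build_select_url and extract_hits, and the title and url field names are assumptions made for the example.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10, start=0):
    """Build a URL for Solr's select handler that asks for JSON output.

    base_url is a hypothetical Solr location, e.g. http://localhost:8983/solr
    """
    params = {"q": query, "wt": "json", "rows": rows, "start": start}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def extract_hits(response_text):
    """Return (title, url) pairs from a Solr JSON response body.

    Assumes the index stores single-valued 'title' and 'url' fields;
    a real schema may use different names.
    """
    data = json.loads(response_text)
    docs = data.get("response", {}).get("docs", [])
    return [(doc.get("title", ""), doc.get("url", "")) for doc in docs]
```

A view could fetch the URL returned by build_select_url, hand the response body to extract_hits, and pass the resulting list to a template for display.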
The demo focuses on configuring the crawl and search, with minimal programming work. Its functionality could be improved and its content expanded in a number of ways to provide a more robust search experience (see below for a few development ideas for future phases of the project).
The demo targets a range of websites with digital cultural heritage content. Currently, approximately 300,000 unique URLs from 100 sources are included in the index. The IMLS Digital Collections and Content--a registry of digital materials funded by IMLS National Leadership Grants and selected LSTA-supported collections--provides the foundation for the demo index. In addition, resources from the University of California, other libraries and cultural heritage institutions, and aggregated content websites have been included in the initial crawl. The full seed list is maintained in Google Code.
If you have ideas for additional resources that should be included in the demo or a later phase, please let us know by suggesting a site.
Integrating mass-digitized monograph collections such as HathiTrust was not investigated in this demo, for both technical and content development reasons. While we believe a crawler approach could be used to aggregate monographs, digitized monographs do not have the same granularity and characteristics as the web pages and short documents typically found on the web. The out-of-the-box stack we are using is tuned for web crawling and search, and we anticipate that its default configuration may not be well suited to book search. Furthermore, we believe that providing access to digitized local and cultural history is an important role of libraries, and that this content should be made more visible.
There are several areas in which a vertical search could be expanded and refined: