IST Research has completed Scrapy Cluster and released it as open source.
Fredericksburg, VA, April 26, 2015 –(PR.com)– IST Research published its newest contribution to the open-source community, dubbed “Scrapy Cluster,” to GitHub on Wednesday, completing a key component of its work on DARPA’s Memex program. The Memex program is researching and building a domain-specific indexing platform to better navigate the information that currently exists on the internet. The goal of the program is described as “developing the next generation of search technologies and revolutionize the discovery, organization, and presentation of search results.” IST’s development of Scrapy Cluster advances an end user’s ability to gather vast quantities of web content on any topic in a highly efficient manner. In other words, users can now gather large volumes of openly available information without being constrained by the limitations of commercial search engines.
The Scrapy Cluster package is a scalable, distributed web crawling cluster based on Scrapy and coordinated via Apache Kafka and Redis. It provides a baseline framework for intelligent distributed scrapes as well as the ability to conduct time-limited web crawls. With Scrapy Cluster, anyone can scale Scrapy instances across one or more machines, coordinate their scraping effort for desired sites, persist across scraping jobs, or run multiple scraping jobs at the same time. Scrapers can also be added to, removed from, and scaled within the pool at will, without data loss or downtime. Scrapy Cluster is an on-demand set of scrapers that run continuously in the background, so scrape jobs can be submitted at any time to the idling spiders.
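The coordination idea described above can be illustrated with a short Python sketch. This is not Scrapy Cluster's code or API: the names CrawlRequest, SharedQueue, and run_spiders are hypothetical, and a plain in-memory heap stands in for the shared store that Redis provides in the real system. It only shows, under those assumptions, how idle spiders might pull prioritized requests from one shared queue while requests from a time-limited crawl are dropped once they expire.

```python
import heapq
import itertools
import time
from typing import NamedTuple


class CrawlRequest(NamedTuple):
    """Hypothetical crawl request; not a Scrapy Cluster type."""
    url: str
    job_id: str
    priority: int = 0
    expires_at: float = float("inf")  # deadline for a time-limited crawl


class SharedQueue:
    """In-memory stand-in for a shared, prioritized request store."""

    def __init__(self):
        self._heap = []
        # Unique counter breaks ties so equal priorities stay first-in, first-out.
        self._counter = itertools.count()

    def submit(self, request):
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-request.priority, next(self._counter), request))

    def pop(self, now=None):
        """Return the best unexpired request, or None if the queue is drained."""
        now = time.time() if now is None else now
        while self._heap:
            _, _, request = heapq.heappop(self._heap)
            if request.expires_at > now:
                return request
            # Expired requests are discarded and the next one is tried.
        return None


def run_spiders(queue, spider_names, now=None):
    """Hand queued requests to spiders in round-robin order until none remain."""
    handled = []
    for spider in itertools.cycle(spider_names):
        request = queue.pop(now)
        if request is None:
            break
        handled.append((spider, request.url, request.job_id))
    return handled


# Two jobs share one queue; one request has already passed its deadline.
queue = SharedQueue()
queue.submit(CrawlRequest("http://example.com/a", job_id="job-1", priority=1))
queue.submit(CrawlRequest("http://example.com/b", job_id="job-2", priority=5))
queue.submit(CrawlRequest("http://example.com/old", job_id="job-1", expires_at=100.0))
results = run_spiders(queue, ["spider-1", "spider-2"], now=200.0)
```

In this sketch, adding or removing a spider only changes the list passed to run_spiders; the queued requests are untouched, which loosely mirrors the claim that scrapers can join or leave the pool without data loss.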
Scrapy Cluster is built from popular, stable, open-source components such as Apache Kafka, Redis, and Scrapy, which allow anyone to build a scalable web content collection platform. IST Research uses Scrapy Cluster to execute continuous, on-demand scraping of the many different domains required by DARPA Memex performers. Because the project is open source, the community can extend the platform and contribute back to it to improve overall web collection capability.
IST Research, LLC is based in Fredericksburg, VA and has worked on the Memex program as a prime performer since the project’s inception in 2014.