GitHub built a new search engine for code ‘from scratch’ in Rust


The Rust programming language continues to grow in popularity and now developer platform GitHub has used it to build its new code-focused search engine, Blackbird. 

Instead of perusing forums for answers, GitHub wants users to turn to its search engine, which is currently in beta.


Rust is consistently the most loved (but not the most widely used) programming language among developers, according to the developer question-and-answer site Stack Overflow.

As a new project, Blackbird is an interesting reference for Rust, which is usually adopted for building new features in projects previously written in C/C++, and is more popular for systems programming than for building apps. The CTO of Microsoft Azure last year declared that all new projects should be written in Rust rather than C/C++ because of its memory safety features.

But why build a search engine from scratch when GitHub could use another open-source solution, such as Apache Cassandra, Solr, or Elasticsearch?

“At first glance, building a search engine from scratch seems like a questionable decision. Why would you do that? Aren’t there plenty of existing, open source solutions out there already? Why build something new?” writes GitHub’s Timothy Clem.

His short answer is that GitHub hasn’t found success using general text search products to power code search.     

“The user experience is poor, indexing is slow, and it’s expensive to host. There are some newer, code-specific open source projects out there, but they definitely don’t work at GitHub’s scale,” he writes. 

GitHub started experimenting with Elasticsearch in 2011, but Clem notes it took “months” to index GitHub’s then roughly eight million repositories. Today, GitHub supports about 200 million dynamic code repositories.

GitHub’s Blackbird currently supports searching across about 45 million repositories, so it provides only partial coverage, but it still enables code searching across 115 terabytes of code and 15.5 billion documents for programs written in Python, Java, and JavaScript.

The Rust-written custom search engine, Blackbird, is more efficient and gives GitHub “substantial storage savings via deduplication and guarantees a uniform load distribution across shards”, according to Pavel Avgustinov, VP of software engineering at GitHub.  

He argues GitHub’s scale means it can’t use a Unix ‘grep’ (global regular expression print) for search. In effect, it would be too slow, given that it would mean processing hundreds of terabytes of code in memory. Queries would take too long.
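To get a feel for why a brute-force scan does not scale, here is a rough back-of-the-envelope sketch in Rust. The function name, the per-core throughput, and the core count are illustrative assumptions and are not figures from GitHub; only the 115-terabyte corpus size comes from this article.

```rust
/// Estimate how long one brute-force (grep-style) pass over a corpus takes.
///
/// `corpus_tb` is the corpus size in terabytes (1 TB = 1000 GB here).
/// `gb_per_sec_per_core` is an ASSUMED scan throughput per core, not a
/// measured GitHub number.
/// `cores` is the number of cores scanning in parallel.
fn brute_force_scan_seconds(corpus_tb: f64, gb_per_sec_per_core: f64, cores: u32) -> f64 {
    let corpus_gb = corpus_tb * 1000.0;
    let total_gb_per_sec = gb_per_sec_per_core * f64::from(cores);
    corpus_gb / total_gb_per_sec
}
```

Under these assumptions, `brute_force_scan_seconds(115.0, 1.0, 1000)` gives 115 seconds for a single query, even with 1,000 cores fully dedicated to it. Every additional concurrent query would need its own pass, which is why an index is needed instead.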


Clem notes that deduplication and its approach to indexing cut the 115 terabytes it needed to search down to 28 terabytes of unique content. The index itself is now 25 terabytes.
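The general idea of deduplicating content and spreading it evenly across shards can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not Blackbird's actual code: the function names are hypothetical, and the standard library's `DefaultHasher` stands in for whatever content hash a production system would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash a document's content. `DefaultHasher` is only a stand-in here
/// (an assumption for illustration); it is not a cryptographic hash.
fn content_hash(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Keep one copy of each distinct document and place it on a shard chosen
/// by its content hash. Returns one list of documents per shard.
/// Note: a 64-bit hash collision would wrongly drop a distinct document;
/// a real system would need a stronger hash or a content comparison.
fn dedup_and_shard<'a>(docs: &[&'a str], shard_count: usize) -> Vec<Vec<&'a str>> {
    assert!(shard_count > 0, "need at least one shard");
    let mut seen = HashSet::new();
    let mut shards = vec![Vec::new(); shard_count];
    for &doc in docs {
        let hash = content_hash(doc);
        // `insert` returns false when this hash was already seen,
        // so duplicate content is skipped.
        if seen.insert(hash) {
            let shard = (hash % shard_count as u64) as usize;
            shards[shard].push(doc);
        }
    }
    shards
}
```

Because the shard is derived from the content rather than from the repository it lives in, identical files are stored once however many repositories contain them, and a single very large repository cannot overload one shard.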
