Seq – A programming language for computational genomics and bioinformatics

87 points | 9 hours | github.com
bscphil 8 hours ago

> Seq is a Python-compatible language, and the vast majority of Python programs should work without any modifications

> Seq is able to outperform Python code by up to 160x.

So ... a reimplementation of Python that can outperform cpython by over 100 times? I know literally nothing about this project, but I have to say that rings pretty false for me. Hell, even PyPy has trouble with many applications. (Plus they're claiming to outperform "equivalent" C code by 2x.)

Even if the performance claims are overblown, it's always nice to see new work on compiled languages with easy-to-read syntax. It's hard to beat Python for an education / prototyping language, so I will definitely be giving this a look.

amelius 7 hours ago

It's probably in the same sense that Numpy is much faster than doing matrix operations with pure Python arrays and Python for-loops.

drocer88 6 hours ago

Look at the link: https://github.com/seq-lang/seq It says 96% of the code is C++ in the "Languages" box on the right. C (and C++ and Rust) outperforms Python in benchmarks, and certain optimized C code can be 160x faster than very naive Python. So this is very possible, though the routines tested are probably cherry-picked for bragging rights.

arc-in-space 6 hours ago

> We show that many important and widely-used NGS algorithms can be made up to 160× faster than their Python counterparts as well as 2× faster than the existing hand-optimized C++ implementations

It seems it's better to think of this particular claim as "we made a C++ algorithm that is 2x faster than the previous SotA C++ algorithm" (with the help of a heavily optimized DSL).

aldanor 8 hours ago

I also know literally nothing about this particular project, but why not? If you support a small restricted subset of Python it's completely doable under certain conditions for specific types of programs. E.g., Numba can easily outperform Python 100-1000x in numerical applications (done it myself multiple times), simply because it jit-compiles the code by first translating it to LLVM IR.

bscphil 7 hours ago

> If you support a small restricted subset of Python

That's why I quoted their claim that the "vast majority" of Python programs run unmodified. Even PyPy barely achieves that. To really get 100x performance over Python (and even supposedly beat C) with a compiler that works on most unmodified Python code would be an extraordinary achievement.

dunefox 7 hours ago

That seems to misrepresent the original points: it can run the vast majority of Python programs unmodified AND in some cases outperform Python by a wide margin, but not necessarily both at the same time.

hoseja 7 hours ago

Probably, it can outperform generic python specifically for genomics payloads, versus python code/C code.

encode 7 hours ago

Also see this comparison between Julia's BioSequences and Seq by Jakob Nissen and Ben Ward: https://biojulia.net/post/seq-lang/

dgb23 6 hours ago

An interesting takeaway:

> So it appears the primary reason BioJulia code is slower than Seq code in these three benchmarks is that BioSequences.jl is doing important work for you that Seq is not doing. As scientists, we hope you value tools that spend the time and effort to validate inputs given to it rather than fail silently.

Reminds me of the myriads of Excel catastrophes.
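
To make the trade-off in that quote concrete, here is a minimal sketch of the difference between validating input and failing silently. It is plain Python, not BioSequences.jl or Seq code; the function names and the `VALID_DNA` alphabet are assumptions made up for illustration.

```python
# Assumed alphabet for this sketch; real libraries accept more IUPAC codes.
VALID_DNA = frozenset("ACGTN")

def count_gc_unchecked(seq):
    """Count G/C bases without validation: unexpected symbols are silently ignored."""
    return sum(1 for base in seq if base in "GC")

def count_gc_validated(seq):
    """Count G/C bases, raising ValueError on any symbol outside VALID_DNA."""
    for pos, base in enumerate(seq):
        if base not in VALID_DNA:
            raise ValueError(f"invalid base {base!r} at position {pos}")
    return count_gc_unchecked(seq)
```

The unchecked version does less work per base, so it will tend to look better in a benchmark, but on a corrupted input such as `"GAT?ACA"` it returns a number without complaint while the validated version raises.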

dunefox 5 hours ago

This shows imo that BioJulia is better, precisely because it validates data and is built on a broader programming language invented for science, not a DSL that optimises for speed over all else. Besides, the new version of BioJulia seems to perform even better than Seq.

clusterhacks 4 hours ago

I am a CS person who works with bioinformaticians every day as part of my job.

I really like that Seq seems to have built-in some parallelization ability. I spend no small amount of time in my day job doing that manually in R with RcppParallel for loops that are totally independent across each iteration.

Bioinformaticians are often educated to use a specific programming language and environment. They aren't usually looking to try other languages. For example, I support our bioinformatics group and they are basically 100% R and RStudio users. We have a single user of Python and that user is doing "typical" tensorflow stuff with images.

I've noticed this same bias towards a single language for some other academic niches. Like SAS or Stata camps in public health or psychology - I think of these languages as basically the same, but for non-CS folks the perception seems to be more like English vs Russian.

Even more complicated, researchers may be extremely committed to a specific library in a language and suspicious of languages that don't have their favorite library available.

Any shift to new tooling for these highly-committed users will almost certainly require large and obvious benefits to gain traction.

dr_kiszonka 53 minutes ago

Very interesting! I noticed a similar phenomenon in the GIS space. All of my colleagues with formal training in GIS use ArcGIS and its Python API, but those without such background gravitate towards FLOSS solutions.

I am aware of only one case where a community migrated to other software. Many economists I know switched from Stata to R. Some of them later moved on to Python.

travisgriggs 3 hours ago

So basically, the same thing that kept (keeps?) Visual Basic in use for so long.

My son works in polysci analytics and I see the same thing you describe. A group will pick a tool and flog all problems with it. Change rarely occurs. He was in the Stata camp at one university, the TidyVerse at MIT.

It’s very weird for me. I develop and maintain a piece of software that has 3 OSes and 5 languages to wrestle with, as well as multiple “tool” technologies like Ansible/MQTT, etc., so I’m very much in a polyglot-best-tool-for-the-job environment. Observationally, from a casual POV, I see pros/cons both ways.

psychometry 2 hours ago

Scientists like using R instead of Python because the language lets them get set up and coding quickly with RStudio. More importantly, the language, tooling, and ecosystem are very forgiving when it comes to code quality and style. There is good R code out there, but the R community generally lacks the wide acceptance of good coding practices you see with Python users: unit tests, sane dependency management, type hints, documentation, safe namespacing, etc.

It's really saying something when scientists think writing Python code is a pain, because Python's a pretty forgiving language, too.

arshajii 2 hours ago

Hi everyone, I’m one of the developers on the Seq project, and I was delighted to see it posted here! We started this project with a focus on bioinformatics, but since then we’ve added a lot of language features/libraries that have closed the gap with Python by a decent margin, and Seq today can be useful in other areas or even for general Python programs (although there are still limitations of course). We’re in the process of creating an extensible / plugin-able Python compiler based on Seq that allows for other domain-extensions. The upcoming release also has some neat features like OpenMP integration (e.g. “@par(num_threads=10) for i in range(N): …” will run the loop with 10 threads). Happy to answer any questions!
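
For readers more familiar with standard Python, a rough analogue of a 10-thread loop over independent iterations can be sketched with the standard library's `concurrent.futures`. This is not Seq code and says nothing about how Seq implements `@par`; `process` and `parallel_loop` are hypothetical names standing in for the loop body and the loop.

```python
from concurrent.futures import ThreadPoolExecutor

def process(i):
    """Hypothetical per-iteration work; each call is independent of the others."""
    return i * i

def parallel_loop(n, num_threads=10):
    """Run process(i) for every i in range(n) on a pool of num_threads threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(process, range(n)))
```

In standard CPython builds the global interpreter lock generally keeps CPU-bound threads like these from running truly in parallel, which is one reason a compiled, OpenMP-backed loop may behave quite differently from this sketch.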

adgjlsfhk1 2 hours ago

Have follow-up benchmarks vs BioJulia been done since 2019? If I remember correctly at the time, the result was that BioJulia was faster once you consider that it did validation.

arshajii 2 hours ago

We haven't done too many comparisons with BioJulia since that paper, although we did address the (valid) issues they raised such as data validation (i.e. Seq now validates input data by default, but this can be optionally disabled). We did compare against them in our last paper in a sequence alignment benchmark: https://www.nature.com/articles/s41587-021-00985-6 (check the supplement).

dekhn 4 hours ago

Typically, any high-performance (low latency or high throughput) genomics/bioinformatics application is not going to be written in plain Python, except possibly for prototyping. Instead, nearly all codes today are written in C++ or Java, with some sort of command and control in Python or a DAG-based workflow scheduler.

I don't expect the community will adopt other languages at a large scale. My hope, though, is that more of these algorithms move to real distributed processing systems like Spark, to take advantage of all the great ideas in systems like that. But genomics will continue to trail the leading edge by about 20 years for the foreseeable future.

adgjlsfhk1 4 hours ago

IMO, spark isn't the way forward. The typical pattern with it is it lets you scale up to 100 cores really easily which is almost enough to compete with a good single threaded implementation in a fast language.

dekhn 4 hours ago

100 cores? I forgot how to count that low.

The workflows I deal with generally involve moving hundreds of terabytes of storage into memory, processing it, and writing it out. Single machines (even beefy ones) tend to hit their limits (networking, max RAM, cache size, TLB, etc).

Maybe there's another tool better than Spark, I don't know; the important thing is that Spark is the most ubiquitous.

f6v 5 hours ago

> Think of Seq as a strongly-typed and statically-compiled Python: all the bells and whistles of Python, boosted with a strong type system, without any performance overhead.

A pitch most people doing applied bioinformatics won’t understand/appreciate.

totalperspectiv 6 hours ago

It’s odd that they didn’t include Nim in the benchmarks in their paper: https://dl.acm.org/doi/pdf/10.1145/3360551

jpxw 6 hours ago

I know nothing about Nim or genomics. Why is it odd that they didn’t include Nim?

pietroppeter 5 hours ago

Nim has had some success in genomics mainly thanks to the work of https://github.com/brentp

Nim can be sold as "a strongly-typed and statically-compiled high-performance Pythonic language" just like Seq (although it is more than that and does not actually have being Pythonic as a goal; see https://nim-lang.org/ or https://github.com/Araq/nimconf2021/blob/main/zennim.rst).

Still, given the small size of the Nim community and the even smaller size of the genomics Nim subcommunity, I would say it is not that odd that it is not included in the benchmark. The existing Nim genomics library might not even cover the functionality required by the benchmark.

lf-non 2 hours ago

Nim is not really 'pythonic'. It does have some superficial similarity with Python (being whitespace sensitive) but it begins to diverge pretty soon. This is not really a criticism of Nim. I quite like many of the choices in Nim.

Seq claims that the vast majority of Python programs would work as is. I have not validated that claim, but Nim absolutely cannot make it. Any Python library would require substantial porting effort to be translated to Nim.

Zababa 1 hour ago

Calling Nim Python is like calling OCaml or Scala Python, it's not really true. The main reason people use Python is because it is Python, not because of an extractable list of things.

gandalfgeek 2 hours ago

Quick explainer video: https://youtu.be/5bk4Wc5Op2M

car 9 hours ago

Looks great, will definitely give this a try since it does sequence manipulations that I otherwise have to write myself.

Will this be available via conda? And how would Seq integrate with Snakemake, since that is also based on Python?
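
As an illustration of the kind of sequence manipulation people typically end up writing by hand, here is a small plain-Python sketch of a reverse complement and a k-mer iterator. The helper names are made up for this example and are not Seq's API; how Seq exposes these operations is not shown here.

```python
# Translation table mapping each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def kmers(seq, k):
    """Yield every overlapping substring of length k, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]
```

For example, `reverse_complement("GATTACA")` gives `"TGTAATC"`, and `list(kmers("GATTACA", 3))` gives `["GAT", "ATT", "TTA", "TAC", "ACA"]`.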

tdido 9 hours ago

Seems like there's a conda package in the works: https://github.com/bioconda/bioconda-recipes/pull/29660

kasperset 6 hours ago

I like this idea. However, to me it is similar to using à la carte tools/programs along with a bash script or a DSL such as Nextflow. More often than not, these stand-alone programs are already written in compiled languages. I am sure Seq will make it possible to build customized programs, as opposed to scripting or gluing programs together.

chmaynard 8 hours ago

I'm wondering if Seq can also serve as a general-purpose replacement for Python whenever a fast executable is needed.

arshajii 2 hours ago

(I'm one of the developers on Seq.) We've actually been working mostly on closing the gap with Python for the last year or so. Seq can be useful for plain Python programs as well -- I give a bit more context in my comment above.

dunefox 5 hours ago

It's a domain specific language for bioinformatics. So, most likely not.

jack_riminton 3 hours ago

How do you pronounce Seq?

adgjlsfhk1 3 hours ago

Short for sequence.

jack_riminton 3 hours ago

So is it pronounced sequence or like 'seek'?

da39a3ee 3 hours ago

I have high confidence it's pronounced "seek".

winter_squirrel 7 hours ago

This looks cool. I also love how easy the setup was, considering that lots of the niche languages I try seem to have arcane setup steps and dependencies.