Seq – A programming language for computational genomics and bioinformatics

116 points • 4 years ago • github.com

clusterhacks • 4 years ago

I am a CS person who works with bioinformaticians every day as part of my job.

I really like that Seq seems to have some parallelization ability built in. I spend no small amount of time in my day job doing that manually in R with RcppParallel for loops that are totally independent across iterations.

Bioinformaticians are often educated to use a specific programming language and environment. They aren't usually looking to try other languages. For example, I support our bioinformatics group and they are basically 100% R and RStudio users. We have a single user of Python and that user is doing "typical" tensorflow stuff with images.

I've noticed this same bias towards a single language for some other academic niches. Like SAS or Stata camps in public health or psychology - I think of these languages as basically the same, but for non-CS folks the perception seems to be more like English vs Russian.

Even more complicated, researchers may be extremely committed to a specific library in a language and suspicious of languages that don't have their favorite library available.

Any shift to new tooling for these highly-committed users will almost certainly require large and obvious benefits to gain traction.

psychometry • 4 years ago

Scientists like using R instead of Python because the language lets them get set up and coding quickly with RStudio. More importantly, the language, tooling, and ecosystem are very forgiving when it comes to code quality and style. There is good R code out there, but the R community generally lacks the wide acceptance of good coding practices you see with Python users: unit tests, sane dependency management, type hints, documentation, safe namespacing, etc.

It's really saying something when scientists think writing Python code is a pain, because Python's a pretty forgiving language, too.

travisgriggs • 4 years ago

So basically, the same thing that kept (keeps?) Visual Basic in use for so long.

My son works in polysci analytics and I see the same thing you describe. A group will pick a tool and flog all problems with it. Change rarely occurs. He was in the Stata camp at one university, the TidyVerse at MIT.

It’s very weird for me. I develop and maintain a piece of software that has 3 OSes and 5 languages to wrestle with, as well as multiple “tool” technologies like Ansible/MQTT, etc., so I’m very much in a polyglot-best-tool-for-the-job environment. Observationally, from a casual POV, I see pros/cons both ways.

ativzzz • 4 years ago

I assume you are a software engineer? If so, part of our job is to use a variety of software tools, since that's our specialty. The researchers are not software developers. They learn how to use one particular tool to do their jobs, but they are not software specialists, nor do they desire to be.

dr_kiszonka • 4 years ago

Very interesting! I noticed a similar phenomenon in the GIS space. All of my colleagues with formal training in GIS use ArcGIS and its Python API, but those without such background gravitate towards FLOSS solutions.

I am aware of only one case where a community migrated to other software. Many economists I know switched from Stata to R. Some of them later moved on to Python.

encode • 4 years ago

Also see this comparison between Julia's BioSequences and Seq by Jakob Nissen and Ben Ward: https://biojulia.net/post/seq-lang/

dgb23 • 4 years ago

An interesting takeaway:

> So it appears the primary reason BioJulia code is slower than Seq code in these three benchmarks is that BioSequences.jl is doing important work for you that Seq is not doing. As scientists, we hope you value tools that spend the time and effort to validate inputs given to it rather than fail silently.

Reminds me of the myriads of Excel catastrophes.
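To make the "fail silently" point concrete, here is a minimal Python sketch of the trade-off. It is not how BioSequences.jl or Seq validates anything; the `ACGTN` alphabet and all function names are assumptions made up for illustration.

```python
# Assumed alphabet for this sketch: the four bases plus N for "unknown".
DNA_ALPHABET = frozenset("ACGTN")


def validate_dna(seq: str) -> str:
    """Return the upper-cased sequence, or raise ValueError at the first bad symbol."""
    upper = seq.upper()
    for pos, base in enumerate(upper):
        if base not in DNA_ALPHABET:
            raise ValueError(f"invalid DNA symbol {base!r} at position {pos}")
    return upper


def count_gc_unchecked(seq: str) -> int:
    """Fast path: counts G and C, happily accepts garbage or lowercase input."""
    return seq.count("G") + seq.count("C")


def count_gc_checked(seq: str) -> int:
    """Slower path: pays for one validation pass before counting."""
    return count_gc_unchecked(validate_dna(seq))
```

With lowercase input such as `"gattaca"`, the unchecked version quietly reports zero G/C bases, while the checked version normalizes the input first. With a stray symbol such as `"GATTXCA"`, the unchecked version returns a number anyway and the checked version raises. The extra pass is exactly the kind of work that shows up as "slower" in a benchmark.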

dunefox • 4 years ago

This shows imo that BioJulia is better, precisely because it validates data and is a broader programming language invented for science, not a DSL that optimises for speed over all else. Besides, the new version of BioJulia seems to perform even better than Seq.

bscphil • 4 years ago

> Seq is a Python-compatible language, and the vast majority of Python programs should work without any modifications

> Seq is able to outperform Python code by up to 160x.

So ... a reimplementation of Python that can outperform cpython by over 100 times? I know literally nothing about this project, but I have to say that rings pretty false for me. Hell, even PyPy has trouble with many applications. (Plus they're claiming to outperform "equivalent" C code by 2x.)

Even if the performance claims are overblown, it's always nice to see new work on compiled languages with easy-to-read syntax. It's hard to beat Python for an education / prototyping language, so I will definitely be giving this a look.

amelius • 4 years ago

It's probably in the same sense that Numpy is much faster than doing matrix operations with pure Python arrays and Python for-loops.

aldanor • 4 years ago

I also know literally nothing about this particular project, but why not? If you support a small restricted subset of Python it's completely doable under certain conditions for specific types of programs. E.g., Numba can easily outperform Python 100-1000x in numerical applications (done it myself multiple times), simply because it jit-compiles the code by first translating it to LLVM IR.

mhenders • 4 years ago

(minor contributor) I’ve been following the project for a while and have been pleasantly surprised by the ability to manually convert Python programs to Seq without needing to make too many changes. Note, most of my experimentation has been with smallish programs I’ve written. I like that I can still think “Pythonically” and compose mostly correct Seq code using familiar idioms, e.g. list/set/dict comprehensions. The standard library is very readable and a source for “from import” type functionality. Some of the other features I’ve come to appreciate: pipeline operator |>, JIT compile or create an executable (seqc run, seqc build), match statements, and strong typing.
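For anyone who hasn't used a pipeline operator, the general idea is that the output of one stage becomes the input of the next. Below is a rough plain-Python analogue of that idea, not Seq's `|>` itself; `pipe`, `kmers`, and `distinct` are hypothetical helpers written for this sketch.

```python
from functools import reduce


def pipe(value, *stages):
    """Feed value through each stage in order: pipe(x, f, g) == g(f(x))."""
    return reduce(lambda acc, stage: stage(acc), stages, value)


def kmers(k):
    """Return a stage that lazily splits a sequence into overlapping k-mers."""
    def stage(seq):
        return (seq[i:i + k] for i in range(len(seq) - k + 1))
    return stage


def distinct(items):
    """Return the unique items in sorted order."""
    return sorted(set(items))


# Reads left to right, like a shell pipeline.
result = pipe("gattaca", str.upper, kmers(3), distinct)
```

Here `result` holds the sorted distinct 3-mers of the sequence. The appeal of a dedicated operator is mostly readability: the stages appear in the order they run instead of being nested inside out.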

bscphil • 4 years ago

> If you support a small restricted subset of Python

That's why I quoted their claim that the "vast majority" of Python programs run unmodified. Even PyPy barely achieves that. To really get 100x performance over Python (and even supposedly beat C) with a compiler that works on most unmodified Python code would be an extraordinary achievement.

dunefox • 4 years ago

That seems to misrepresent the original points: it can run the vast majority of Python programs unmodified AND in some cases outperform Python, just not both at the same time.

drocer88 • 4 years ago

Look at the link: https://github.com/seq-lang/seq. It says 96% of the code is C++ in the "Languages" box on the right. C (and C++ and Rust) outperforms Python in benchmarks, and certain optimized C code can do 160x over very naive Python. So this is very possible, though the routines tested are probably cherry-picked for bragging rights.

arc-in-space • 4 years ago

> We show that many important and widely-used NGS algorithms can be made up to 160× faster than their Python counterparts as well as 2× faster than the existing hand-optimized C++ implementations

It seems it's better to think of this particular claim as "we made a C++ algorithm that is 2x faster than the previous SotA C++ algorithm" (with the help of a heavily optimized DSL).

snicker7 • 4 years ago

Most newer languages will give you multiple orders of magnitude better performance than python.

Python’s main advantage was that it was easier than some of its competitors (C++/Java). But that is no longer the case with modern languages (Nim/Crystal/Julia/JavaScript) being both faster and comparably as easy (or easier).

It is now coasting off its momentum, mostly due to the vast number of (usually poorly designed) open source libraries. That and Jupyter.

hoseja • 4 years ago

Probably, it can outperform generic python specifically for genomics payloads, versus python code/C code.

arshajii • 4 years ago

Hi everyone, I’m one of the developers on the Seq project. I was delighted to see it posted here! We started this project with a focus on bioinformatics, but since then we’ve added a lot of language features/libraries that have closed the gap with Python by a decent margin, and Seq today can be useful in other areas or even for general Python programs (although there are still limitations of course). We’re in the process of creating an extensible / plugin-able Python compiler based on Seq that allows for other domain extensions. The upcoming release also has some neat features like OpenMP integration (e.g. “@par(num_threads=10) for i in range(N): …” will run the loop with 10 threads). Happy to answer any questions!
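For comparison, this is roughly what "run an independent loop on 10 threads" looks like in standard Python. It is only an analogy using `concurrent.futures`, not Seq's `@par` or OpenMP; `parallel_map` and `gc_fraction` are names invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in one sequence; each call is independent of the others."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def parallel_map(func, items, num_threads=10):
    """Apply func to every item on a fixed-size thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(func, items))


reads = ["GATTACA", "GGCC", "ATAT", ""]
fractions = parallel_map(gc_fraction, reads, num_threads=10)
```

One caveat: in CPython, threads generally do not speed up CPU-bound pure-Python work because of the global interpreter lock, so `ProcessPoolExecutor` (same `map` interface) is the usual substitute there. That gap is part of why a compiled language with real threaded loops is attractive.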

adgjlsfhk1 • 4 years ago

Have follow-up benchmarks vs BioJulia been done since 2019? If I remember correctly at the time, the result was that BioJulia was faster once you consider that it did validation.

arshajii • 4 years ago

We haven't done too many comparisons with BioJulia since that paper, although we did address the (valid) issues they raised such as data validation (i.e. Seq now validates input data by default, but this can be optionally disabled). We did compare against them in our last paper in a sequence alignment benchmark: https://www.nature.com/articles/s41587-021-00985-6 (check the supplement).

fuzzythinker • 4 years ago

Used it for coding Coursera/Stepik's Bioinformatics course [1] when it was first announced 2 years ago.

Not claiming it as any sort of reference, but you can see how it [2] may be used to solve some basic genome sequencing.

[1] https://www.coursera.org/specializations/bioinformatics

[2] https://github.com/fuzzthink/seq-genomics

fwip • 4 years ago

It's an impressive project, but I'm not sure the niche is big enough. It's certainly come a long way since the last time I looked at it!

My biggest concern is that Seq sucks users into a sort of local maximum. While piping syntax is nice, and the built-in routines are handy, it's a lot less flexible than a "mainstream" programming language, simply because of the smaller community and relative paucity of libraries. BioPython[1] has been around a long long time, and I think a lot of potential users of Seq would be better suited by using a regular bioinformatics library in the language they know best.

e.g: The example of reading Fasta files in Seq:

    # iterate over everything
    for r in FASTA('genome.fa'):
        print r.name
        print r.seq

versus BioPython:

    from Bio import SeqIO
    for r in SeqIO.parse("genome.fa", "fasta"):
        print(r.id)
        print(r.seq)
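For a sense of how little machinery either loop needs, here is a dependency-free sketch of a FASTA reader in plain Python. It is not what Seq's `FASTA` or BioPython's `SeqIO.parse` actually do internally; `read_fasta` is a hypothetical helper, and treating the first word after `>` as the record name is an assumption of this sketch.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # emit the previous record
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        yield name, "".join(chunks)  # emit the final record


example = """>chr1 test record
ACGT
ACGT
>chr2
GGCC
""".splitlines()
records = list(read_fasta(example))
```

Since the function accepts any iterable of lines, an open file handle can be passed directly, e.g. `for name, seq in read_fasta(open("genome.fa")): ...`.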

It might be pretty useful as a teaching tool, but I'm skeptical of its long-term benefit to professionals. I'm not sure the ecosystem of Seq users will be large enough, y'know? Again, it's pretty impressive work, and it's come a long way. I wish the devs all the best. :)

1. https://biopython.org/

chmaynard • 4 years ago

> It's an impressive project, but I'm not sure the niche is big enough.

Big enough for what? Instead of a gratuitous critique of its "benefit to professionals", maybe you could comment on the project's design choices and implementation. That would be more useful to us amateurs.

totalperspectiv • 4 years ago

It’s odd that they didn’t include Nim in the benchmarks in their paper: https://dl.acm.org/doi/pdf/10.1145/3360551

jpxw • 4 years ago

I know nothing about Nim or genomics. Why is it odd that they didn’t include Nim?

pietroppeter • 4 years ago

Nim has had some success in genomics mainly thanks to the work of https://github.com/brentp

Nim can be sold as a "A strongly-typed and statically-compiled high-performance Pythonic language" as Seq (although it is more than that and does not actually have as a goal to be Pythonic, see https://nim-lang.org/ or https://github.com/Araq/nimconf2021/blob/main/zennim.rst).

Still, given the small size of the Nim community and the even smaller size of the Nim genomics subcommunity, I would say it is not that odd that it is not included in the benchmark. The existing Nim genomics library might not even cover the functionality required by the benchmark.

lf-non • 4 years ago

Nim is not really 'pythonic'. It does have some superficial similarity with Python (being whitespace sensitive) but it begins to diverge pretty soon. This is not really a criticism of Nim. I quite like many of the choices in Nim.

Seq claims that the vast majority of Python programs would work as is. I have not validated that claim, but Nim absolutely cannot make that claim. Any Python library would require substantial porting effort to be translated to Nim.

goodpoint • 4 years ago

Nim is pretty pythonic in terms of expressiveness.

Of course Nim is statically typed, but a lot of Python code that does not use dynamic typing heavily can be ported to Nim surprisingly easily.

Zababa • 4 years ago

Calling Nim Python is like calling OCaml or Scala Python, it's not really true. The main reason people use Python is because it is Python, not because of an extractable list of things.

dekhn • 4 years ago

Typically, any high performance (low latency or high throughput) genomics/bioinformatics application is not going to be written in plain Python, except possibly for prototyping. Instead, nearly all codes today are written in C++ or Java, with some sort of command and control in Python or a DAG-based workflow scheduler.
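As a sketch of what "DAG-based" means here: steps declare what they depend on, and the scheduler derives a valid run order. The example below uses Python's standard `graphlib` (3.9+); the step names are hypothetical and no real workflow tool is being modeled.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
workflow = {
    "trim_reads": set(),
    "align": {"trim_reads"},
    "sort_bam": {"align"},
    "call_variants": {"sort_bam"},
    "qc_report": {"trim_reads"},
}


def run_order(dag):
    """Return one valid execution order; graphlib raises CycleError on circular deps."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(workflow)
```

In a real setup each step would shell out to a compiled tool, which is the "command and control" role. Steps with no dependency between them, such as `qc_report` and `align`, can run concurrently.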

I don't expect the community will adopt other languages at a large scale. My hope, though, is that more of these algorithms move to real distributed processing systems like Spark, to take advantage of all the great ideas in systems like that. But genomics will continue to trail the leading edge by about 20 years for the foreseeable future.

adgjlsfhk1 • 4 years ago

IMO, spark isn't the way forward. The typical pattern with it is it lets you scale up to 100 cores really easily which is almost enough to compete with a good single threaded implementation in a fast language.

dekhn • 4 years ago

100 cores? I forgot how to count that low.

The workflows I deal with generally involve moving hundreds of terabytes of storage into memory, processing it, and writing it out. Single machines (even beefy ones) tend to hit their limits (networking, max RAM, cache size, TLB, etc).

Maybe there's another tool better than Spark, I don't know. The important thing is that Spark is the most ubiquitous.

east2west • 4 years ago

I recall that the group that created Spark had a bioinformatics project on Spark but I don't know what happened to it. All I could find now is a paper[1] hosted by databricks.

[1]https://databricks.com/wp-content/uploads/2018/08/SSE15-40-D...

heuermh • 4 years ago

We're here, still plugging along.

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

https://github.com/bigdatagenomics/adam

dekhn • 4 years ago

Yep, that's the one I was thinking of (along with GNOMAD, which IIRC uses ADAM or some similar tech). My main complaint with ADAM was that they came up with their own file format (which had some flaws). But the general idea is the right one.

heuermh • 4 years ago

I'm interested in chatting with you about this, and genomics on Spark more generally, feel free to reach out on Github or via my username at the usual suspects.

dekhn • 4 years ago

I left this field, actually. I cofounded Google Cloud Genomics, and when I proposed that we pivot from working with the GA4GH (very stupid APIs) to working with ADAM (real data processing) I got kicked off the team. Since then I've come to see genomics as a minefield of bad practices and don't really work in the field any more, except to help scientists run their workflows in the cloud.

f6v • 4 years ago

> Think of Seq as a strongly-typed and statically-compiled Python: all the bells and whistles of Python, boosted with a strong type system, without any performance overhead.

A pitch most people doing applied bioinformatics won’t understand/appreciate.

car • 4 years ago

Looks great, will definitely give this a try since it does sequence manipulations that I otherwise have to write myself.

Will this be available via conda? And how would Seq integrate with Snakemake, since that is also based on Python?

tdido • 4 years ago

Seems like there's a conda package in the works: https://github.com/bioconda/bioconda-recipes/pull/29660

haihaibye • 4 years ago

I'm in the target market but can't use this unless it supports all of my Python libraries like Django and Numpy.

It seems to me there is a huge demand for making Python faster, whether it be via making a more optimisation friendly subset, or ideally throwing engineering talent into improving the interpreter.

V8 shows this can be done with highly dynamic Javascript. I guess we need a big corporate sponsor or the community to fund some positions.

It's kind of crazy how few developers are working on optimising CPython. It may even be worth it for environmental reasons.

Bostonian • 4 years ago

The code examples look like Python 2 rather than Python 3. Print does not have parentheses. Why was this decision made?

haihaibye • 4 years ago

They support both print syntaxes, and will deprecate Python 2 style soon.

https://github.com/seq-lang/seq/issues/223

kasperset • 4 years ago

I like this idea. However, to me it is similar to using à la carte tools/programs along with a bash script or a DSL such as Nextflow. More often than not, these stand-alone programs are already written in compiled languages. I am sure Seq will allow building customized programs, as opposed to scripting or gluing programs together.

tdido • 4 years ago