Composability in Julia: Implementing Deep Equilibrium Models via Neural ODEs

117 points · 3 years ago · julialang.org
Tarrosion · 3 years ago

I've been part of and/or following the Julia community since 2015, and Julia is my favorite programming language by a wide margin. Seems like every two months there's a new blog post, usually with some of these folks as authors, that describes...something...to do with ODEs, machine learning, neural ODEs, GPUs, adjoints, scientific machine learning, ...

I have never once been able to follow one of these blog posts. These posts seem to universally suffer from a terrible curse of knowledge [1]. To be fair, I know only a light-to-moderate amount about machine learning and very little about differential equations, but still— I'm a long-term Julia fan, professional data scientist, mathy PhD, would hope that's at least table stakes. Maybe I just need to...try harder? I wonder how many people would be excited but are totally lost by these posts.

To that point, if anybody has a recommendation for a gentle introduction to these topics (preferably Julia), I'd be most appreciative.

[1] https://en.wikipedia.org/wiki/Curse_of_knowledge

ViralBShah · 3 years ago

The SciML tutorials are a good start: https://github.com/SciML/SciMLTutorials.jl

And also the 18.337 lecture notes (probably also has videos available): https://github.com/mitmath/18337

uoaei · 3 years ago

> but still— I'm a long-term Julia fan, professional data scientist, mathy PhD, would hope that's at least table stakes.

I don't understand this part. Are you saying you're a "professional data scientist" with a "mathy PhD" who has a "light-to-moderate amount" of machine learning knowledge? How did you get the job?

I would expect anyone with a mathy PhD to understand ODEs and PDEs, and neural ODEs are commonly understood (by those who read the papers, which I think is a fair assumption about anyone who wants to understand this stuff) to be effectively infinite-depth neural networks where every layer represents the same function.

eigenspace · 3 years ago

This is a pretty shitty, non-constructive response. I think neural differential equations are not easy to wrap one's head around even if you have a solid understanding of deep learning and differential equations.

Sure, if they spent a lot of time wading through the literature they'd probably understand fine, but the point they were making was that the post was quite unapproachable without having delved into the specific literature on neural differential equations.

I think this is a reasonably valid complaint, and does not warrant you implying that they don't deserve their job.

orbifold · 3 years ago

I think it would have helped if the people writing that paper had not confused the issue by introducing a new name for something that is well known in optimal control and was invented even before neural networks: adjoint sensitivity analysis. Multilayer networks of switching components even appear in Pontryagin's book on the subject.

uoaei · 3 years ago

Valid complaint how? If I have never studied carcinogenesis, why should I expect to understand the description of a new treatment for bone marrow cancer?

The way any article is written reflects the audience it is suited for. If this article was intended for people unfamiliar with neural ODEs they would have put more effort into writing it in a suitable way.

DNF2 · 3 years ago

There is a big difference between a blog post and a scientific article. A blog post should be expected to popularize a topic and reach a wider audience.

You also seem to be quite unfamiliar with the breadth of mathematics as a field. It is quite possible to complete a mathematics PhD without touching differential equations, except as an undergrad.

BTW, implying that people don't deserve their job is just shitty behaviour, and way out of line.

Tarrosion · 3 years ago

> Are you saying you're a "professional data scientist" with a "mathy PhD" who has a "light-to-moderate amount" of machine learning knowledge? How did you get the job?

Indeed! My job involves approximately zero machine learning, at least for a narrow or stereotypical definition of machine learning. I work on optimization, domain-specific models inspired by queueing theory, various physically-motivated structural models, writing production code for data-heavy products, testing hypotheses in data, brainstorming how existing data can be used to solve new customer problems, writing documentation, communicating with customers, etc.

If you think this isn't data science but know of better nomenclature: please share! I've struggled to write a great job posting for this kind of work, and surely improved nomenclature would help :)

uoaei · 3 years ago

I would say you do machine learning if you're doing optimization on bespoke models. These worlds blend together once you get into the weeds. It doesn't have to be gradient descent (via AD or otherwise) to be machine learning.

By "light to moderate amount" do you mean that you haven't catalogued the menagerie of statistical models that are commonly taught in machine learning / deep learning courses? Because that to me is secondary to having the underlying principles (optimization theory, measure theory, etc.) down pat. Recognizing the fundamentals in the theory which describes how machine learning proceeds is invaluable for comprehension.

If you wanted a flashier name for your role I might suggest "Solutions Architect" or even "Machine Learning Engineer" based on what roles you want to aim for. Because honestly you're doing a lot of what is already entailed by those titles. "Data Scientist" also fits for sure, being such a broad title nowadays.

I interpreted what you said more glibly(?) than it seems you intended, and apparently I expressed my surprise more snarkily than I intended.

amkkma · 3 years ago

What is your PhD in?

diskzero · 3 years ago

Have you been in interview loops or worked with bread-and-butter data scientists performing common tasks? I am curious what your view is of what most data scientists do day in and day out.

uoaei · 3 years ago

What about tasks being common makes a light-to-moderate understanding of machine learning sufficient?

Processes initiated by data scientists during the execution of their role tend to fail silently. What I mean is: throwing an inappropriate model at otherwise good data produces unreliable (in certain situations, catastrophic) results, but it produces results nonetheless. Without proper discernment of the reliability of those results, we have an unequivocal failure to execute the role. This is the oft-unmentioned companion to, and decidedly more insidious than, the "garbage in, garbage out" (i.e., right model, wrong data) aphorism.

It is up to the person performing this operation to deduce whether or not the conclusions are trustworthy. I don't see how someone can be confident of this without either relying on a pre-defined workflow verified by someone else qualified to assess the consequences, or having those qualifications themselves.

What follows is a contrived example, but illustrative of the problem:

Consider e.g. user privacy: it is by now well known that embedding vectors (or even merely the relationships between them) can leak a lot of information about the people or objects they represent. It is not enough to understand how the forward pass of such a model proceeds; you must also understand what is stored in those representations, which, having gone through a master's with quite a few people who now call themselves data scientists, I am not confident is commonly understood.

diskzero · 3 years ago

This is helpful, and I agree with you. There are some "data scientists" who are able to use PyTorch, TensorFlow, etc. and modify code in a Jupyter Notebook without knowing the larger ramifications of the work they are doing.

agumonkey · 3 years ago

What's funny is how most of them revolve around a similar culture of concepts (ODEs, automatic differentiation, analysis) which is very unlike mainstream computing.

PS: I need to read those adjoint articles.

jimsimmons · 3 years ago

I consider myself to be reasonably good at ML and work for one of the cool labs, and this stuff is impenetrable to me. Stuff like this is deceptively written, and a cynic might say it is primarily meant to show off rather than to help.

rpmuller · 3 years ago

One of the best things about Julia is that people like Chris Rackauckas are developing great packages for it.

lytefm · 3 years ago

Definitely. I've just been working with Stan and Pyro + Python so far for modelling, but posts like this encourage me to finally pick up Julia and get serious with neural ODEs.

adgjlsfhk1 · 3 years ago

This is a really cool idea, especially because it is a type of NN that can take more time for harder inputs. That makes it relatively unusual, since most types of NN have O(1) runtime, which is often nice but puts limits on the types of problems they can solve.
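
A minimal sketch of that variable-compute property (the toy map and the fixed_point helper are illustrative, not from the post):

    using LinearAlgebra

    # A fixed-point "layer": the number of iterations (i.e. the compute)
    # depends on the input, unlike a fixed-depth network.
    function fixed_point(g, x; tol = 1e-8, maxiters = 10_000)
        z = zero(x)
        for n in 1:maxiters
            znew = g(z, x)
            norm(znew .- z) < tol && return znew, n
            z = znew
        end
        return z, maxiters
    end

    g(z, x) = tanh.(0.5 .* z .+ x)  # a contraction, so the iteration converges
    fixed_point(g, randn(4))        # returns (z*, iterations used)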

jstx1 · 3 years ago

What are some good references on neural ODEs that don't come from the Julia community? I'm looking for theory and applications - when are they good and who is using them for what?

I'm asking for sources outside of Julia because I find the coupling of algorithm types to tools kind of strange and the whole SciML trend is kind of opaque to me. (Are people applying ML as a solution to newer problems? Are they using new approaches to solve ML problems? How legit is the whole thing? I just don't know.)

orbifold · 3 years ago

Neural ODEs are essentially a rebranding of adjoint sensitivity analysis, which has been around in various forms in established solver suites such as Sundials, PETSc, etc. The machine learning community got ahold of it, cited one book, and otherwise happily reinvented everything.

IlyaOrson · 3 years ago

The actual content of that work was good but very misleading, with an excess of backpropaganda and a poor literature review. The training procedure makes sense as continuous-time backprop, but it is mostly a special case of adjoint sensitivity analysis. Using a NN to define an ODE system seems fair to name a Neural ODE; imho it's a good name, although again it was not as completely novel as the writing style throughout the paper makes it look.

orbifold · 3 years ago

I agree that it is a good name and that evaluating this on "ML tasks" was novel. However, this and subsequent papers did a really poor job of delineating what is novel from what is well known. Moreover, this is a pattern in almost all subsequent papers, where people literally pretend that they were the first to consider parameter-gradient computation of controlled differential equations, hybrid dynamical systems, etc., while in fact this has been worked out in full generality since essentially the '60s.

ChrisRackauckas · 3 years ago

Yes, that's mostly right (instead of PETSc put FATODE since PETSc TS Adjoint was only published in 2019, but it's based heavily on FATODE's techniques https://epubs.siam.org/doi/10.1137/130912335?mobileUi=0). And that's why the Julia tools were so ready for it: we already had adjoint sensitivity analysis (implemented for parameter estimation in systems pharmacology), so neural ODEs were a freebie. Similar to DEQs using the adjoint of a nonlinear solve: that means a DEQ is just a neural network inside of a Julia nonlinear solver; when you ask for gradients, you hit the right overloads that were originally implemented for parameter estimation of elliptic PDEs. There are definitely some aspects of the ML community that have muddied the waters so to speak.
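
For concreteness, here's a minimal sketch of that "neural network inside a nonlinear solver" pattern (toy sizes and names; it assumes the NonlinearSolve.jl + SciMLSensitivity.jl adjoint overloads are loaded, and is a sketch rather than the DiffEqFlux implementation):

    using NonlinearSolve, SciMLSensitivity, Zygote

    # Toy DEQ "layer": find z* with z* = tanh(W z* + U x + b)
    W, U, b = 0.1 .* randn(4, 4), randn(4, 4), randn(4)
    residual(z, x) = tanh.(W * z .+ U * x .+ b) .- z

    function deq(x)
        prob = NonlinearProblem(residual, zeros(4), x)
        solve(prob, NewtonRaphson()).u  # forward pass: a Newton solve
    end

    # Backward pass hits the nonlinear-solve adjoint (implicit function
    # theorem), so no Newton iterations are stored or differentiated through.
    g, = Zygote.gradient(x -> sum(deq(x)), randn(4))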

The recent one that I found funny was the Second-Order Neural ODE (https://arxiv.org/abs/2109.14158). The second-order adjoint is rather old, with a canonical implementation in Sundials (https://github.com/LLNL/sundials/raw/master/doc/cvodes/cvs_g...) based on a very good analysis of how to do second-order adjoints fast (https://epubs.siam.org/doi/abs/10.1137/030601582?journalCode...). But what about putting a neural network in there? In Julia it's just composing forward-over-reverse to get the optimal adjoint that matches Sundials, so there's been a tutorial on it in DiffEqFlux since 2019 (https://diffeqflux.sciml.ai/dev/examples/second_order_adjoin...) with both Newton and Newton-Krylov methods. The rest of the optimizations from the paper follow using BacksolveAdjoint and relying on dead code elimination (DCE), a compiler pass that wouldn't exist in Python, so I guess they have to care? I assumed that was too trivial to publish given the history of prior work, but somehow that paper got a NeurIPS Spotlight with "a novel computational framework for computing higher-order derivatives of deep continuous-time models". At this point I just kind of shrug, though; I think it's a symptom of conference culture not giving people enough time to review the literature thoroughly, so random things seep through. Take it as one reason among many to treat any ML conference paper similar to an unreviewed preprint.
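
For reference, forward-over-reverse really is just composing the two AD modes. A sketch (the hvp helper is illustrative, not a library function, and assumes f is generic enough for Zygote to handle the dual numbers):

    using ForwardDiff, Zygote

    # Hessian-vector product: a forward-mode derivative of a reverse-mode gradient.
    hvp(f, x, v) = ForwardDiff.derivative(t -> Zygote.gradient(f, x .+ t .* v)[1], 0.0)

    f(x) = sum(abs2, x) + prod(x)
    x, v = randn(3), randn(3)
    hvp(f, x, v)  # ≈ Hessian(f)(x) * v, without ever forming the Hessian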

All of that together, that's why the Julia universe of tools has mostly been focusing on improvements and performance vs Sundials, PETSc, etc., since those are the real challengers. You can see we benchmark against Sundials all the time in our stiff ODE benchmarks, and we recently started outperforming it, with QNDF pulling 2x-5x wins in various ways (https://benchmarks.sciml.ai/html/Bio/BCR.html, see https://sciml.ai/news/2021/05/24/QNDF/). The stiff neural ODE paper describes 3 ways to achieve an improvement on the Sundials adjoint in terms of complexity (https://aip.scitation.org/doi/10.1063/5.0060697), and our adjoint benchmarking paper shows how putting all of these various pieces together leads to about 2-3 orders of magnitude improvement over the naive CVODES adjoint (https://arxiv.org/abs/1812.01892) (against the method; the follow-up of course will be against the direct wrapping).

What's funny though is that people then get upset when we show benchmarks against some of the Python tools, like >100x improvements in solver speeds on physical and biological problems (https://gist.github.com/ChrisRackauckas/cc6ac746e2dfd285c28e...) and adjoints (https://gist.github.com/ChrisRackauckas/4a4d526c15cc4170ce37...). I don't understand why anyone would be surprised though: we've spent years "competing" against the C and Fortran codes and only recently started pulling ahead due to the combined effort of a whole community, while the Python tools were just a few people with simple methods who never benchmarked against the previous tools. If they benchmarked enough they would see that we're not an outlier claiming to be 100x faster than everyone else; instead, we're in (and slightly ahead of) the pack, and they are the outlier that is 100x behind the whole group. Personally, I would require every paper to at least have a benchmark against Sundials (and/or its methods) as a baseline, which is the standard we tend to hold ourselves to.

orbifold · 3 years ago

Ha, yeah, I knew that analysis of how to do second-order adjoints fast :), in what is basically documentation, no less. I figured someone would probably do an ML paper on that eventually. I have had a hard time judging what is non-trivial in the past as well. The next thing they will probably discover is jet spaces and Hopf algebras.

I'm also slightly salty because a recent paper (https://arxiv.org/pdf/2011.03902.pdf) claimed "Differentiable event handling generalizes many numerical methods that often have specialized methods for gradient computation,...", citing one of our results. We instead went through the trouble of finding a semi-complete list of prior work, which would make it abundantly clear that what they claim as new is in fact traceable to a paper by Rozenvasser from ~1965 (we translated it from Russian to make sure), with a long history of more recent work. In fact, I think DifferentialEquations.jl just supports this out of the box and has non-broken event handling (their implementation tests against a product of jump conditions).
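
For the curious, the out-of-the-box event handling looks roughly like this (a standard bouncing-ball sketch with toy parameters, not from the paper under discussion):

    using OrdinaryDiffEq

    # Ball under gravity; an "event" fires when the height crosses zero.
    f(u, p, t) = [u[2], -p[1]]              # u = (height, velocity), p = (g,)
    condition(u, t, integrator) = u[1]      # event when height == 0
    affect!(integrator) = (integrator.u[2] = -0.9 * integrator.u[2])  # lossy bounce
    cb = ContinuousCallback(condition, affect!)

    prob = ODEProblem(f, [1.0, 0.0], (0.0, 5.0), [9.81])
    sol = solve(prob, Tsit5(), callback = cb)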

I am aware of those benchmarks :). I will probably adopt your library for this very reason, although I have to say that I really don't like Julia as a language. It is pretty clear that besides Sundials, PETSc, and other C++-based libraries, you are rapidly becoming the only game in town. As much as I am tempted, I really shouldn't be hand-writing integration routines :). I don't think you need a random internet stranger to tell you that, but it is abundantly clear that DifferentialEquations.jl provides far more long-term value than these types of machine learning papers.

UncleOxidant · 3 years ago

There's some info from the Python/PyTorch camp: https://towardsdatascience.com/neural-odes-with-pytorch-ligh...

I suspect neural ODE work was done in Julia earlier because it was easier given some language features and libraries. But there does seem to be some work on neural ODEs in Python/PyTorch.

ssivark · 3 years ago

The original Neural ODEs paper is quite readable, and by now there are loads of blog posts and even a few talks on the subject.

The basic idea is inspired by the “adjoint method” for ODE solving (so you don’t have to hold in memory all the intermediate layer outputs — which is otherwise necessary to compute the backpropagated gradient signal).
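
Sketched in symbols (modulo sign conventions; this is the continuous adjoint from the Neural ODEs paper): for dz/dt = f(z, t, \theta) with loss L(z(t_1)), define the adjoint a(t) = \partial L / \partial z(t). Then

    da/dt = -a(t)^\top \partial f/\partial z,    a(t_1) = \partial L/\partial z(t_1)
    dL/d\theta = -\int_{t_1}^{t_0} a(t)^\top \partial f/\partial \theta \, dt

so the backward pass is itself an ODE solved in reverse time, and intermediate states can be recomputed on the fly rather than stored.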

ChrisRackauckas · 3 years ago

Yeah, though with the method described in that paper you do have to be very careful, since it has exponential error growth with the Lipschitz constant of the ODE. See https://aip.scitation.org/doi/10.1063/5.0060697 for details. But that is generally the case in numerical analysis: there's always a simple way to do things, and then there's the way that prevents error growth. Both have different pros and cons.
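
In symbols, the issue is the standard Gronwall bound (sketched): a perturbation \delta to the recomputed backward trajectory satisfies

    \|\delta(t)\| \le \|\delta(0)\|\, e^{L t}

where L is the Lipschitz constant of f, so the reconstruction error is amplified exponentially precisely on stiff (large-L) problems.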

adgjlsfhk1 · 3 years ago

What do you mean by "the coupling of algorithm types to tools"? (Not judging, just curious.)

jstx1 · 3 years ago

More bluntly, my question is: if SciML is that good, why aren't more people doing it yet? Why is it limited to a small group of Julia developers and packages?

(There are good possible explanations: it could be very new, it could have only niche applications, Julia could be uniquely suited for it, etc. I don't know.)

krastanov · 3 years ago

Some minor clarifications: Neural ODEs are not a Julia invention. I am pretty sure the first papers on the topic were using a Python package implementing a rather crude ODE solver in PyTorch or TensorFlow. Julia just happens to be light-years ahead of any other tool when it comes to solving ODEs, while also having many high-quality autodifferentiation packages, so it feels natural to use it for these problems. But more importantly, SciML is not just for your typical machine learning tasks: being able to solve ODEs and have autodiff over them is incredibly empowering for boring old science and engineering, and SciML has become one of the most popular sets of libraries when it comes to unwieldy ODEs.

ChrisRackauckas · 3 years ago

Lots to say here. First of all, the community growth has been pretty tremendous and I couldn't really ask for more. We're seeing tens of thousands of visitors to the documentation of various packages, and have some high profile users. For example, NASA showing a 15,000x acceleration (https://www.youtube.com/watch?v=tQpqsmwlfY0) and the Head of Clinical Pharmacology at Moderna saying SciML-based Pumas "has emerged as our 'go-to' tool for most of our analyses in recent months" in 2020 (see https://pumas.ai/). We try to keep a showcase (https://sciml.ai/showcase/) but at this point it's hard to stay on top of the growth. I think anyone would be excited to see an open source side project reach that level of use. Since we tend to focus on core numerical issues (stiffness) and performance, we target the more "hardcore" people in scientific disciplines who really need these aspects and those communities are the ones seeing the most adoption (pharmacology, systems biology, combustion modeling, etc.). Indeed the undergrad classes using a non-stiff ODE solver on small ODEs or training a neural ODE on MNIST don't really have those issues so they aren't our major growth areas. That's okay and that's probably the last group that would move.

In terms of the developer team, throughout the SciML organization repositories we have had around 30 people with over 100 commits each, which is similar in number to NumPy and SciPy. Julia naturally has a much lower barrier to entry in terms of committing to such packages (since the packages are all in Julia rather than C/Fortran), so the percentage of users who become developers is much higher, which is probably why you see a lot more developer activity in contrast to "pure" users. With something like the Python community, you have a group of people who write blog posts and teach the tools in courses without ever hacking on the packages or their ecosystem. In Julia, that background is sufficient knowledge to also be developing the package, so everyone writing about Julia seems to also be associated with developing Julia packages somehow. I tend to think that's a positive, but it does make the community look insular, as everyone you see writing about Julia is also a developer of packages.

Lastly, since we have been focusing on people with big systems and numerically hard problems, we have had the benefit of being able to overlook some "simple user" issues so far. We are starting to do a big push to clean up things like compile times (https://github.com/SciML/DifferentialEquations.jl/issues/786), improve the documentation, throw better errors, support older versions longer, etc. One way to think about SciML is that it's somewhat the Linux to the monolithic Python packages' Windows. We give modular tools in a bunch of different packages that work together, get high performance, and become "more than the sum of the parts", but sometimes people are fine with the simple app made for one purpose. With DEQs, there's a Python package specifically for DEQs (https://github.com/locuslab/deq). Does it have all of the Newton-Krylov choices for the different classes of Jacobians and all of that? No, but it gets something simple done and puts an easily Google-able face to it. So while all it takes in Julia with SciML is to stick a nonlinear solver in the right spot in the right way and know how the adjoint codegen will bring it all together, the majority want Visual Studio instead of Awk+Sed or Vim. We understand that, and so the DiffEqFlux.jl package is essentially just a repository of tutorials and prebuilt architectures that people tend to want (https://diffeqflux.sciml.ai/dev/), but we need to continue improving that "simplified experience". The age of Linux is more about making desktop managers that act sufficiently like Windows and less about trying to get everyone building Arch Linux from source. Right now we are too much like Arch Linux and need to build more of the Ubuntu-like pieces. We thus have similarly loyal hardcore followers, but need to focus a bit on making the installation process easier and the error messages shorter to attract a larger crowd.

ViralBShah · 3 years ago

What do you mean by "more" people? Perhaps you mean people who know? Anyone who solves a differential equation in Julia is using the SciML ecosystem of packages. The Julia ecosystem is about 1M users, and lots of people in that ecosystem use these tools.

There are over 100 dependent packages: https://juliahub.com/ui/Packages/OrdinaryDiffEq/DlSvy/5.64.1...

UncleOxidant · 3 years ago

> Why is it limited to a small group of Julia developers and packages?

I don't think there are any gatekeepers limiting its use. Articles like the one highlighted here help get the word out to more potential users.

cs702 · 3 years ago

Very cool. I have only one question:

Has anyone successfully applied DEQs to larger-scale cognitive tasks or benchmarks, as opposed to MNIST, which is a tiny trivial task by today's standards?

Think ImageNet-1000, COCO, LVIS, WMT language translation, ..., SuperGLUE. There's a long list of datasets and benchmarks that regular boring fixed-depth NNs tackle with remarkable ease these days.

Has anyone anywhere applied DEQs to any of those datasets / benchmarks?

avikpal1410 · 3 years ago

The MDEQ work applies DEQs to some of the large-scale benchmarks you mention: https://arxiv.org/pdf/2006.08656.pdf

cs702 · 3 years ago

Thank you! The results don't look that great (e.g., EfficientNet models achieve greater accuracy on ImageNet-1000 with ~5x fewer parameters), but the work looks interesting and worthwhile. I'll take a look.

gugagore · 3 years ago

> For example, when we apply convolution filters on images the network consists of repetitive blocks of convolutional layers, and one linear output layer at the very end. It's essentially f(f(f(...f(x))...)) where f is the neural network, and we call this "deep" because of the layers of composition. But what if we make this composition go to infinity?

This really does not jibe with my understanding. Each layer of, e.g., VGG-16 [1] does not implement the same function. Each layer has its own weights. There are certain architectures that tie weights across layers, but not all of them.

[1] https://neurohive.io/en/popular-networks/vgg16/

taeric · 3 years ago

I think this works if you consider subscripted fs. That is, each layer always has the same arity, and the output shape is the same. So it would be better if they said f_n, where n is the layer of the network.

(I mean this as a question, but don't see an obvious place for a question mark...)

gugagore · 3 years ago

Well, I think it has to be a fixed f for the interpretation as a (time-invariant) dynamical system x_{n+1} = f(x_n) to work.

taeric · 3 years ago

But deeper layers of the network don't get the original input, do they?

ChrisRackauckas · 3 years ago

Correct, not all models have equivalent weights at each step. But it is an abstraction that needs to be made to send the depth to infinity.
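
A sketch of that abstraction (toy f and arbitrary sizes, for illustration): a weight-tied residual step z_{n+1} = z_n + h f(z_n) is explicit Euler on dz/dt = f(z), so taking the depth to infinity with h -> 0 gives the continuous limit.

    # Weight-tied "ResNet" as Euler steps of dz/dt = f(z); depth -> infinity
    # with h -> 0 recovers the ODE. Toy f, not from the post.
    f(z) = tanh.(0.1 .* z)
    function resnet(z, depth)
        h = 1 / depth
        for _ in 1:depth
            z = z .+ h .* f(z)   # z_{n+1} = z_n + h f(z_n)
        end
        return z
    end
    # resnet(z0, 1000) approximates the ODE solution at t = 1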

savant_penguin · 3 years ago

15,000x speedup compared to what?

adgjlsfhk1 · 3 years ago

The previous implementation used by NASA, I believe in Simulink. See https://www.youtube.com/watch?v=tQpqsmwlfY0 for more details.

snicker7 · 3 years ago

Simulink -- probably the most popular modeling toolbox.
