2012-11-06 03:03:14 utc |
mburnett |
so i have a situation where i have N concurrent jobs that get submitted to a remote process via AMQP, then i need to wait a long time and then resume the process once those concurrent jobs have all finished |
2012-11-06 03:03:34 utc |
mburnett |
what's a good/the right way to approach that? |
2012-11-06 03:53:04 utc |
mburnett |
nevermind, i was being foolish about how receivers worked |
2012-11-06 16:49:25 utc |
mburnett |
is there a typical way of reporting an error on a workitem received via AMQP? I see a thread from mid-to-late 2011 in the mailing list, but I'm having a hard time understanding how to apply that to my case. |
2012-11-06 20:04:33 utc |
mburnett |
ah, it seems that Ruote::Amqp::Receiver flunk has a different interface from Ruote::Receiver flunk |
2012-11-06 21:02:31 utc |
jmettraux |
mburnett: hello, yes, #flunk is used to pass errors back from the receivers |
2012-11-06 21:02:52 utc |
mburnett |
yeah, i was just passing it all the wrong stuff :) |
2012-11-06 21:03:59 utc |
mburnett |
now i just need to get a curl-friendly interface up to have a complete tracer bullet |
2012-11-06 22:58:37 utc |
mburnett |
how do i abort components of a process that depend on a failed component without killing everything? |
2012-11-06 22:58:50 utc |
jmettraux |
what is a component? |
2012-11-06 22:59:09 utc |
mburnett |
an Amqp::Receiver in this case |
2012-11-06 22:59:40 utc |
jmettraux |
it's not a component of a process |
2012-11-06 22:59:42 utc |
mburnett |
i know that the idea is that failed processes will be administered and error sections corrected |
2012-11-06 23:00:02 utc |
mburnett |
maybe I should just put up a gist |
2012-11-06 23:00:09 utc |
jmettraux |
I can tell you how to cancel parts of a workflow instance |
2012-11-06 23:00:29 utc |
mburnett |
ok |
2012-11-06 23:00:32 utc |
jmettraux |
do you need a way to unregister an Amqp::Receiver? |
2012-11-06 23:00:43 utc |
jmettraux |
and make it unsubscribe? |
2012-11-06 23:00:47 utc |
mburnett |
maybe i should just fill in some background |
2012-11-06 23:00:54 utc |
mburnett |
and you can tell me how that's the wrong design :) |
2012-11-06 23:01:10 utc |
jmettraux |
maybe an email to the mailing list would be more appropriate |
2012-11-06 23:01:28 utc |
mburnett |
ok |
2012-11-06 23:02:37 utc |
jmettraux |
breakfast here |
2012-11-06 23:02:49 utc |
jmettraux |
I'm OK to help via IRC, but please remember I cannot read your mind |
2012-11-06 23:13:54 utc |
mburnett |
well, here's the gist: https://gist.github.com/4028287 |
2012-11-06 23:14:00 utc |
mburnett |
if you like, i'll post more details to the mailing list |
2012-11-06 23:15:39 utc |
jmettraux |
what is the question? |
2012-11-06 23:16:23 utc |
mburnett |
so the question is basically "what's the right way to handle failed grid jobs" |
2012-11-06 23:16:35 utc |
mburnett |
right now i'm doing flunk() |
2012-11-06 23:16:37 utc |
jmettraux |
that's very deep |
2012-11-06 23:16:47 utc |
mburnett |
ok, so then let's narrow the scope |
2012-11-06 23:16:52 utc |
jmettraux |
flunk() will pass the error to ruote |
2012-11-06 23:16:54 utc |
mburnett |
what's a reasonable way to handle failed grid jobs here |
2012-11-06 23:17:20 utc |
jmettraux |
so flunk() is read, IMHO |
2012-11-06 23:17:27 utc |
jmettraux |
so flunk() is right, IMHO |
2012-11-06 23:17:42 utc |
mburnett |
right, that seems to work. i guess the behavior that most closely matches our existing infrastructure is that the process is marked as failed, but any non-dependent parts of the process are still run |
2012-11-06 23:18:02 utc |
jmettraux |
that's the default behaviour |
2012-11-06 23:18:10 utc |
mburnett |
ah |
2012-11-06 23:18:44 utc |
jmettraux |
if you have two concurrent ruote branches and one ends up in an error, the other will go on |
2012-11-06 23:18:45 utc |
mburnett |
so basically i just need to monitor the failures so that i can flag the whole process as failed? |
2012-11-06 23:18:53 utc |
mburnett |
right |
2012-11-06 23:19:02 utc |
mburnett |
i just need to notify users that this process has failed |
2012-11-06 23:19:30 utc |
jmettraux |
ruote-wise, a branch of the process failed |
2012-11-06 23:19:32 utc |
mburnett |
so your initial recommendation would basically be to just flunk and do nothing else inside ruote? |
2012-11-06 23:19:43 utc |
jmettraux |
yes |
2012-11-06 23:19:48 utc |
mburnett |
ok |
2012-11-06 23:20:09 utc |
mburnett |
i plan to set up a historian service listening to messages on amqp for stuff like this |
2012-11-06 23:20:19 utc |
mburnett |
to create entries in our existing tracking system |
2012-11-06 23:20:45 utc |
jmettraux |
as long as everything goes through AMQP, it's great |
2012-11-06 23:31:52 utc |
jmettraux |
you're building an AMQP powered interface in front of your grid |
2012-11-06 23:32:06 utc |
jmettraux |
your clients are ruote or whatever talks AMQP |
2012-11-06 23:32:49 utc |
jmettraux |
services and orchestration of services |
2012-11-06 23:38:58 utc |
mburnett |
that's right |
2012-11-06 23:41:15 utc |
mburnett |
i really like this architecture |
2012-11-06 23:42:15 utc |
mburnett |
is there a way to query ruote about whether a process has any possible ways to proceed without intervention? i.e. has every branch not blocked by an error completed? |
2012-11-06 23:43:55 utc |
mburnett |
is leaves() the best attempt? |
2012-11-06 23:44:39 utc |
mburnett |
and then check each one for error state |
2012-11-07 00:05:20 utc |
jmettraux |
yes, leaves could help |
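The check discussed at the end (look at the leaves of a process and see whether each one is in an error state) could be sketched roughly as below. This is a plain-Ruby model, not ruote's API: `Leaf`, `runnable_leaves` and `needs_intervention?` are hypothetical names made up for illustration, and the leaf data is invented. With a real engine the leaves would come from the process status, and how an error is exposed on each leaf is an assumption to verify against the ruote documentation.

```ruby
# Hypothetical stand-in for a leaf expression of a process instance.
# It models only the two facts the check needs: which branch the leaf
# belongs to and whether that branch is currently stopped by an error.
Leaf = Struct.new(:name, :error)

# Leaves that can still make progress on their own (no error attached).
def runnable_leaves(leaves)
  leaves.reject { |leaf| leaf.error }
end

# True when there is at least one leaf and every leaf is blocked by an
# error, i.e. nothing can proceed without someone intervening.
def needs_intervention?(leaves)
  !leaves.empty? && runnable_leaves(leaves).empty?
end

# Invented example data: two concurrent grid jobs, one of them flunked.
leaves = [
  Leaf.new('grid_job_a', 'job exited with status 1'),
  Leaf.new('grid_job_b', nil)
]

puts needs_intervention?(leaves)   # one branch can still run

leaves[1].error = 'timeout'
puts needs_intervention?(leaves)   # now every branch is blocked
```

A historian service like the one mentioned above could run a check of this kind whenever it sees an error message, and only flag the whole process as failed once no runnable leaves remain.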