| 2012-11-06 03:03:14 utc | mburnett | so i have a situation where i have N concurrent jobs that get submitted to a remote process via AMQP, then i need to wait a long time and then resume the process once those concurrent jobs have all finished |
| 2012-11-06 03:03:34 utc | mburnett | what's a good/the right way to approach that? |
| 2012-11-06 03:53:04 utc | mburnett | nevermind, i was being foolish about how receivers worked |
| 2012-11-06 16:49:25 utc | mburnett | is there a typical way of reporting an error on a workitem received via AMQP? I see a thread from mid-to-late 2011 on the mailing list, but I'm having a hard time understanding how to apply that to my case. |
| 2012-11-06 20:04:33 utc | mburnett | ah, it seems that Ruote::Amqp::Receiver flunk has a different interface from Ruote::Receiver flunk |
| 2012-11-06 21:02:31 utc | jmettraux | mburnett: hello, yes, #flunk is used to pass errors back from the receivers |
| 2012-11-06 21:02:52 utc | mburnett | yeah, i was just passing it all the wrong stuff :) |
| 2012-11-06 21:03:59 utc | mburnett | now i just need to get a curl-friendly interface up to have a complete tracer bullet |
| 2012-11-06 22:58:37 utc | mburnett | how do i abort components of a process that depend on a failed component without killing everything? |
| 2012-11-06 22:58:50 utc | jmettraux | what is a component? |
| 2012-11-06 22:59:09 utc | mburnett | an Amqp::Receiver in this case |
| 2012-11-06 22:59:40 utc | jmettraux | it's not a component of a process |
| 2012-11-06 22:59:42 utc | mburnett | i know that the idea is that failed processes will be administered and error sections corrected |
| 2012-11-06 23:00:02 utc | mburnett | maybe I should just put up a gist |
| 2012-11-06 23:00:09 utc | jmettraux | I can tell you how to cancel parts of a workflow instance |
| 2012-11-06 23:00:29 utc | mburnett | ok |
| 2012-11-06 23:00:32 utc | jmettraux | do you need a way to unregister an Amqp::Receiver? |
| 2012-11-06 23:00:43 utc | jmettraux | and make it unsubscribe? |
| 2012-11-06 23:00:47 utc | mburnett | maybe i should just fill in some background |
| 2012-11-06 23:00:54 utc | mburnett | and you can tell me how that's the wrong design :) |
| 2012-11-06 23:01:10 utc | jmettraux | maybe an email to the mailing list would be more appropriate |
| 2012-11-06 23:01:28 utc | mburnett | ok |
| 2012-11-06 23:02:37 utc | jmettraux | breakfast here |
| 2012-11-06 23:02:49 utc | jmettraux | I'm OK to help via IRC, but please remember I cannot read your mind |
| 2012-11-06 23:13:54 utc | mburnett | well, here's the gist: https://gist.github.com/4028287 |
| 2012-11-06 23:14:00 utc | mburnett | if you like, i'll post more details to the mailing list |
| 2012-11-06 23:15:39 utc | jmettraux | what is the question? |
| 2012-11-06 23:16:23 utc | mburnett | so the question is basically "what's the right way to handle failed grid jobs" |
| 2012-11-06 23:16:35 utc | mburnett | right now i'm doing flunk() |
| 2012-11-06 23:16:37 utc | jmettraux | that's very deep |
| 2012-11-06 23:16:47 utc | mburnett | ok, so then let's narrow the scope |
| 2012-11-06 23:16:52 utc | jmettraux | flunk() will pass the error to ruote |
| 2012-11-06 23:16:54 utc | mburnett | what's a reasonable way to handle failed grid jobs here |
| 2012-11-06 23:17:20 utc | jmettraux | so flunk() is read, IMHO |
| 2012-11-06 23:17:27 utc | jmettraux | so flunk() is right, IMHO |
| 2012-11-06 23:17:42 utc | mburnett | right, that seems to work. i guess the behavior that most closely matches our existing infrastructure is that the process is marked as failed, but any non-dependent parts of the process are still run |
| 2012-11-06 23:18:02 utc | jmettraux | that's the default behaviour |
| 2012-11-06 23:18:10 utc | mburnett | ah |
| 2012-11-06 23:18:44 utc | jmettraux | if you have two concurrent ruote branches and one ends up in an error, the other will go on |
| 2012-11-06 23:18:45 utc | mburnett | so basically i just need to monitor the failures so that i can flag the whole process as failed? |
| 2012-11-06 23:18:53 utc | mburnett | right |
| 2012-11-06 23:19:02 utc | mburnett | i just need to notify users that this process has failed |
| 2012-11-06 23:19:30 utc | jmettraux | ruote-wise, a branch of the process failed |
| 2012-11-06 23:19:32 utc | mburnett | so your initial recommendation would basically be to just flunk and do nothing else inside ruote? |
| 2012-11-06 23:19:43 utc | jmettraux | yes |
| 2012-11-06 23:19:48 utc | mburnett | ok |
| 2012-11-06 23:20:09 utc | mburnett | i plan to set up a historian service listening to messages on amqp for stuff like this |
| 2012-11-06 23:20:19 utc | mburnett | to create entries in our existing tracking system |
| 2012-11-06 23:20:45 utc | jmettraux | as long as everything goes through AMQP, it's great |
| 2012-11-06 23:31:52 utc | jmettraux | you're building an AMQP powered interface in front of your grid |
| 2012-11-06 23:32:06 utc | jmettraux | your clients are ruote or whatever talks AMQP |
| 2012-11-06 23:32:49 utc | jmettraux | services and orchestration of services |
| 2012-11-06 23:38:58 utc | mburnett | that's right |
| 2012-11-06 23:41:15 utc | mburnett | i really like this architecture |
| 2012-11-06 23:42:15 utc | mburnett | is there a way to query ruote about whether a process has any possible ways to proceed without intervention? i.e. has every branch not blocked by an error completed? |
| 2012-11-06 23:43:55 utc | mburnett | is leaves() the best attempt? |
| 2012-11-06 23:44:39 utc | mburnett | and then check each one for error state |
| 2012-11-07 00:05:20 utc | jmettraux | yes, leaves could help |
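
The check discussed at the end of the log (take the leaves of a process and look at each one's error state) could be sketched roughly as follows. This is a minimal illustration, not ruote's own code: `Leaf` is a hypothetical stand-in struct, and `stalled?` and `failed_leaves` are made-up helper names. It assumes that the leaves obtained from ruote (for example via something like `dashboard.process(wfid).leaves`) respond to `#error`, returning `nil` when the expression is not in error; verify that against the ruote version in use.

```ruby
# Hypothetical stand-in for a ruote leaf expression; only the two
# fields this sketch needs. Real leaves would come from ruote itself.
Leaf = Struct.new(:fei, :error)

# Leaves that are currently blocked by an error.
def failed_leaves(leaves)
  leaves.reject { |leaf| leaf.error.nil? }
end

# True when there is at least one leaf and every leaf is in error,
# i.e. no branch can advance without someone intervening.
def stalled?(leaves)
  !leaves.empty? && failed_leaves(leaves).size == leaves.size
end

# Example: one branch still waiting on a grid job, one branch flunked.
waiting = Leaf.new('0_0_0', nil)
flunked = Leaf.new('0_0_1', { 'message' => 'grid job failed' })

still_running = stalled?([waiting, flunked]) # a branch can still proceed
needs_help    = stalled?([flunked])          # only errored leaves remain
```

A historian service like the one mentioned above could run such a check whenever it sees an error message, and flag the process as failed for users once `stalled?` holds.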