2013-03-14 00:20:14 utc |
ypz_ |
what would be the correct way to get the name from a Ruote.define object ? empirically, it is the value of pdef[1]['name'] |
2013-03-14 05:35:52 utc |
jmettraux |
ypz: hello, yes pdef[1]['name'] is probably the shortest way |
2013-03-14 05:36:18 utc |
ypz |
hi, |
2013-03-14 05:36:42 utc |
ypz |
so it is acceptable to access it this way ? I was hoping there was a getter for it |
2013-03-14 05:36:45 utc |
jmettraux |
although, in the flow, the workitem handed to participants has a #wf_name and a #wf_revision method |
2013-03-14 05:36:54 utc |
jmettraux |
it's totally acceptable |
2013-03-14 05:37:06 utc |
jmettraux |
process definitions are just trees |
2013-03-14 05:37:14 utc |
jmettraux |
(once "generated") |
2013-03-14 05:37:24 utc |
jmettraux |
feel free to wrap that in any class you like |
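A minimal sketch of such a wrapper, assuming the generated tree has the `[expression_name, attributes, children]` shape with string keys that `pdef[1]['name']` implies. `ProcessDefinitionInfo` is a hypothetical class name and the sample tree is written by hand for illustration. Nothing here requires ruote, so it would also suit a script that reads the tree back from a database.

```ruby
# Hypothetical wrapper around a generated process definition tree.
# Assumes the tree looks like [expression_name, attributes, children].
class ProcessDefinitionInfo
  def initialize(tree)
    @tree = tree
  end

  # Same lookup as pdef[1]['name'], with a guard for missing attributes.
  def name
    (@tree[1] || {})['name']
  end
end

# Hand-written sample tree (illustrative, not produced by ruote here).
pdef = ['define', { 'name' => 'review_flow' },
        [['participant', { 'ref' => 'alice' }, []]]]

info = ProcessDefinitionInfo.new(pdef)
puts info.name
```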
2013-03-14 05:37:42 utc |
jmettraux |
the "process portfolio management" is left to integrators |
2013-03-14 05:38:18 utc |
jmettraux |
there are people trying to build things around that: https://github.com/coffeeaddict/ruote-registry |
2013-03-14 05:38:25 utc |
ypz |
well, I generated pdef objects and stored them in the DB, and another script (not aware of anything about Ruote) is reading them out of the db directly |
2013-03-14 05:39:00 utc |
jmettraux |
ok |
2013-03-14 05:39:28 utc |
ypz |
great, thanks |
2013-03-14 22:09:26 utc |
ypz |
hi, jmettraux |
2013-03-14 22:09:40 utc |
jmettraux |
hello, good afternoon |
2013-03-14 22:09:54 utc |
ypz |
what time is it at your place ? |
2013-03-14 22:10:01 utc |
jmettraux |
0657 |
2013-03-14 22:10:13 utc |
ypz |
then good morning to you |
2013-03-14 22:10:18 utc |
jmettraux |
you're in SF iirc |
2013-03-14 22:10:21 utc |
jmettraux |
thanks! |
2013-03-14 22:10:42 utc |
ypz |
yea, I am in the SF Bay area |
2013-03-14 22:11:22 utc |
jmettraux |
how can I help you? |
2013-03-14 22:11:31 utc |
ypz |
when I use a participant to handle on_error conditions, the process itself is removed from the engine, correct ? |
2013-03-14 22:12:03 utc |
jmettraux |
ACTION looks again at the docs |
2013-03-14 22:13:06 utc |
ypz |
i am trying to figure out how to handle various types of errors one may encounter while processing a workflow |
2013-03-14 22:15:13 utc |
jmettraux |
if you use dashboard.on_error = 'participant', the process should not be removed |
2013-03-14 22:15:22 utc |
jmettraux |
is that what you're using? |
2013-03-14 22:16:15 utc |
ypz |
i used sequence :on_error => 'error_handler' |
2013-03-14 22:16:44 utc |
jmettraux |
is the sequence the top "embracing" block? |
2013-03-14 22:16:45 utc |
phaeron |
jmettraux: finally setup a staging environment where I can run stuff in a vm , under valgrind |
2013-03-14 22:17:04 utc |
jmettraux |
phaeron: hello, good good |
2013-03-14 22:17:12 utc |
ypz |
jmettraux, yes, |
2013-03-14 22:17:32 utc |
jmettraux |
ypz: then the sequence will execute the participant and then be "over" |
2013-03-14 22:17:57 utc |
jmettraux |
ypz: since it's the top sequence, the process terminates as well (unless the on_error participant doesn't reply immediately) |
2013-03-14 22:18:43 utc |
jmettraux |
ypz: maybe a good rule of thumb would be to deal with known errors "in participants", and let the rest of the errors jam their processes |
2013-03-14 22:19:02 utc |
phaeron |
jmettraux: I compared the setups between the two vms ( leaking vs. non leaking ) and couldn't find any difference. |
2013-03-14 22:19:49 utc |
jmettraux |
ypz: then when you have a good grip on the thing, you can start using those block on_error constructs |
2013-03-14 22:20:14 utc |
jmettraux |
ypz: but please experiment and have fun |
2013-03-14 22:20:32 utc |
jmettraux |
phaeron: can you reproduce the leak? |
2013-03-14 22:20:36 utc |
ypz |
jmettraux, by saying to deal with known errors "in participants", do you mean implementing an on_error method for that participant ? |
2013-03-14 22:21:12 utc |
jmettraux |
ypz: sorry, I meant regular rescue/ensure blocks inside of the participant implementations to deal with local issues |
2013-03-14 22:21:25 utc |
jmettraux |
ypz: those that can be handled at the participant level |
2013-03-14 22:21:52 utc |
jmettraux |
ypz: (and that you don't want to jam their processes) |
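A rough sketch of that rule of thumb in plain Ruby. `FetchReport`, its `handle` method and the field names are all hypothetical; a plain Hash stands in for the workitem fields, and in a real setup the same rescue/ensure would sit inside the participant's workitem-handling method. Known errors are handled locally, anything else is left to propagate.

```ruby
# Hypothetical participant-like class, not tied to the ruote API.
class FetchReport
  # fetcher: any callable that may raise (stands in for the real work).
  def initialize(fetcher)
    @fetcher = fetcher
  end

  # fields: a plain Hash standing in for the workitem fields.
  def handle(fields)
    fields['report'] = @fetcher.call
  rescue IOError => e
    # Known, local problem: record it and carry on.
    fields['report'] = nil
    fields['report_error'] = e.message
  ensure
    # Runs on success and on failure alike.
    fields['attempted'] = true
  end
end

ok = {}
FetchReport.new(lambda { 'data' }).handle(ok)
puts ok.inspect

bad = {}
FetchReport.new(lambda { raise IOError, 'disk gone' }).handle(bad)
puts bad.inspect
```

An error class that is not rescued here (an `ArgumentError`, say) still escapes `handle`, which is the "let the rest jam their processes" half of the advice.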
2013-03-14 22:23:42 utc |
phaeron |
jmettraux: yes. as far as I can see , but valgrind is not reporting it yet |
2013-03-14 22:24:22 utc |
phaeron |
jmettraux: the ruote setup is a bit custom https://github.com/MeeGoIntegration/boss/blob/bundled/Gemfile.lock |
2013-03-14 22:24:32 utc |
phaeron |
opensuse 12.1 64bit |
2013-03-14 22:25:20 utc |
phaeron |
ruby 1.8.7 |
2013-03-14 22:26:34 utc |
jmettraux |
phaeron: this vm is a ruote-worker vm? Do you have an array of ruote worker vms? Or is it an amqp worker vm? |
2013-03-14 22:27:13 utc |
ypz |
jmettraux, in the sequence :on_error => 'error_handler' construct, are the work item and error message available to the 'error_handler' participant to examine what caused the error condition? my simple test error_handler just does "pp workitem" and it doesn't produce any output |
2013-03-14 22:27:43 utc |
phaeron |
jmettraux: single ruote fs engine with one amqp worker (same vm for this test) |
2013-03-14 22:28:32 utc |
jmettraux |
phaeron: and the leak is coming from the ruote process or the amqp worker process? |
2013-03-14 22:28:50 utc |
jmettraux |
ypz: looking at the doc... |
2013-03-14 22:29:47 utc |
jmettraux |
ypz: the workitem handed to the error handler should have an __error__ field, the workitem class has a #error method to get it directly |
2013-03-14 22:30:44 utc |
phaeron |
this 'boss' script https://github.com/MeeGoIntegration/boss/blob/bundled/boss |
2013-03-14 22:30:50 utc |
phaeron |
eventually eats lots of memory |
2013-03-14 22:31:04 utc |
phaeron |
pmap says it is all heap |
2013-03-14 22:31:47 utc |
jmettraux |
phaeron: that's the script that contains the ruote worker |
2013-03-14 22:32:03 utc |
phaeron |
and initializes the engine too |
2013-03-14 22:32:38 utc |
phaeron |
storage , I mean |
2013-03-14 22:33:33 utc |
jmettraux |
ypz: here's a test (a bit convoluted) that leverages the #error method: https://github.com/jmettraux/ruote/blob/master/test/functional/ft_5_on_error.rb#L269-L305 |
2013-03-14 22:33:54 utc |
jmettraux |
phaeron: I'm looking forward to the valgrind results |
2013-03-14 22:35:08 utc |
ypz |
let me look at the test |
2013-03-14 22:43:53 utc |
jmettraux |
ypz: not sure if I should have shown this test, it's a bit raw and convoluted, it uses a stash trick, it's probably not a good example |
2013-03-14 22:44:42 utc |
ypz |
is "stash" special in any way ? |
2013-03-14 22:45:22 utc |
jmettraux |
yes, it's only available in ruote functional tests |
2013-03-14 22:45:53 utc |
ypz |
is there any reason I can't extract error into from work item inside my error_handler, such as write it to a log file on file system ? |
2013-03-14 22:46:13 utc |
ypz |
s/error into/error info/ |
2013-03-14 22:46:21 utc |
jmettraux |
ypz: you should have no problem doing that |
2013-03-14 22:47:08 utc |
jmettraux |
if it doesn't work, what are the symptoms? |
2013-03-14 22:47:14 utc |
ypz |
good to know that! |
2013-03-14 22:47:30 utc |
ypz |
right now, I got nothing, no errors and no output |
2013-03-14 22:49:13 utc |
jmettraux |
maybe an error in your error_handler |
2013-03-14 22:49:53 utc |
jmettraux |
add some puts statements to determine where it stops behaving, maybe add a rescue block |
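A minimal sketch of an error handler body that logs the error info and does not fail silently. It uses a plain Hash for the workitem fields; the `__error__` key is the one mentioned earlier in this log, while the `class` and `message` keys inside it are an assumed shape (hence the `inspect` fallback). `describe_error` and `log_error` are hypothetical helper names.

```ruby
require 'logger'
require 'stringio'

# Builds a one-line description from a workitem-fields Hash.
# The inner 'class' / 'message' keys are an assumption about the shape.
def describe_error(fields)
  err = fields['__error__']
  return 'no error recorded' if err.nil?
  if err.is_a?(Hash) && err['message']
    "#{err['class']}: #{err['message']}"
  else
    err.inspect
  end
end

# Stand-in for the body of an error_handler participant.
def log_error(fields, logger)
  logger.error(describe_error(fields))
rescue StandardError => e
  # Surface failures of the handler itself instead of losing them.
  warn "error_handler failed: #{e.class}: #{e.message}"
end

out = StringIO.new
log_error({ '__error__' => { 'class' => 'RuntimeError', 'message' => 'boom' } },
          Logger.new(out))
puts out.string
```

Swapping the `StringIO` for a file path in `Logger.new` would write the same line to a log file on disk.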
2013-03-14 22:50:07 utc |
ypz |
in the document, http://ruote.rubyforge.org/exp/on_error.html, it mentions (error) messages, any doc on how to receive such messages ? |
2013-03-14 22:50:08 utc |
jmettraux |
ascertain the thing before it goes hiding under the rug |
2013-03-14 22:50:57 utc |
ypz |
yea, I'll try to trim my error handler to its minimum to figure out what's going on there, now I know that it should work |
2013-03-14 22:50:59 utc |
jmettraux |
in the same way, by writing a participant or a subprocess |
2013-03-14 22:59:15 utc |
jmettraux |
ypz: here is a simple example, it digs into workitem.error: https://gist.github.com/anonymous/5165918 |
2013-03-14 23:02:27 utc |
ypz |
alright, my abs. bare bone error handler is able to puts out the work item along with error message ! |
2013-03-14 23:06:10 utc |
ypz |
jmettraux that's plenty of info to get me going for now, thanks a lot! |
2013-03-14 23:09:03 utc |
jmettraux |
ypz: you're welcome! |
2013-03-14 23:09:27 utc |
ypz |
bye |
2013-03-14 23:09:32 utc |
jmettraux |
bye! |
2013-03-14 23:21:54 utc |
phaeron |
jmettraux: sorry , this is the script that is running https://github.com/MeeGoIntegration/boss/blob/0.8.0/boss |
2013-03-14 23:25:40 utc |
phaeron |
it's in master now |
2013-03-14 23:35:36 utc |
phaeron |
https://github.com/MeeGoIntegration/boss-standard-workflow/blob/master/processes/SRCSRV_REQUEST_CREATE.BOSS_handle_SR.pdef#L454 |
2013-03-14 23:35:50 utc |
phaeron |
similar constructs cause very high cpu usage |
2013-03-14 23:37:05 utc |
jmettraux |
phaeron: it iterates on how many actions? |
2013-03-14 23:37:42 utc |
phaeron |
varies. usually 2-4 and doesn't cause much trouble. recently a big request had about 130 actions |
2013-03-14 23:38:23 utc |
jmettraux |
what causes the high cpu usage? What is do_wait_for_build? |
2013-03-14 23:38:48 utc |
phaeron |
https://github.com/MeeGoIntegration/boss-standard-workflow/blob/master/processes/SRCSRV_REQUEST_CREATE.BOSS_handle_SR.pdef#L510 |
2013-03-14 23:39:15 utc |
phaeron |
is_repo_published is an amqp participant |
2013-03-14 23:39:24 utc |
phaeron |
that checks an external system |
2013-03-14 23:39:50 utc |
jmettraux |
cannot pinpoint the real cpu hog? |
2013-03-14 23:40:30 utc |
phaeron |
not really. I wrote a similar smaller process and got the high cpu usage similarly |
2013-03-14 23:41:27 utc |
jmettraux |
I'm afraid I cannot help much |
2013-03-14 23:42:01 utc |
phaeron |
yeah I am still trying to find a single point of failure |
2013-03-14 23:42:16 utc |
phaeron |
jmettraux: don't worry I am not giving up yet :) |
2013-03-14 23:42:36 utc |
jmettraux |
well, lots of suspects |
2013-03-14 23:43:55 utc |
jmettraux |
it'd be interesting to run a process that just contains an invocation to do_wait_for_build and measure |
2013-03-14 23:44:15 utc |
jmettraux |
(just a few simplification iterations: https://gist.github.com/anonymous/5166219 ) |
2013-03-14 23:46:41 utc |
phaeron |
I am doing the last simpler form with a dumper amqp participant but it doesn't call to the external system |
2013-03-14 23:46:52 utc |
phaeron |
and I can see the memory increase slowly in top |
2013-03-14 23:47:32 utc |
jmettraux |
then try removing your participant |
2013-03-14 23:48:23 utc |
jmettraux |
if you're sure it's ruote's fault, it's pretty easy to prove it |
2013-03-14 23:48:39 utc |
jmettraux |
without any amqp stuff |
2013-03-14 23:49:08 utc |
jmettraux |
just write a plain ruote test case and state your measurements points and results |
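One possible shape for such measurement points, sketched in plain Ruby: sample the resident set size of the current process after each round of work. `rss_kb` and `sample_memory` are hypothetical helper names, and the workload in the block is only a placeholder; a real test case would launch a process definition and wait for it there.

```ruby
# Resident set size of the current process in kilobytes.
# Reads /proc on Linux and falls back to ps elsewhere.
def rss_kb
  status = '/proc/self/status'
  if File.readable?(status)
    line = File.readlines(status).find { |l| l.start_with?('VmRSS:') }
    return line.split[1].to_i if line
  end
  `ps -o rss= -p #{Process.pid}`.to_i
end

# Runs the block `rounds` times and records RSS after each round.
def sample_memory(rounds)
  samples = [rss_kb]
  rounds.times do
    yield
    GC.start # collect first, so lingering growth is less likely to be garbage
    samples << rss_kb
  end
  samples
end

# Placeholder workload standing in for "launch a process and wait for it".
samples = sample_memory(5) { Array.new(10_000) { |i| i.to_s } }
puts samples.inspect
```

A series that keeps climbing over many rounds, with the amqp parts removed, would be the kind of evidence asked for here; a flat series points away from the code inside the block.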
2013-03-14 23:49:13 utc |
phaeron |
ok |
2013-03-14 23:49:28 utc |
phaeron |
I am still also figuring out how to produce the measurements |
2013-03-14 23:49:35 utc |
jmettraux |
valgrind? |
2013-03-14 23:49:40 utc |
jmettraux |
top? |
2013-03-14 23:50:06 utc |
jmettraux |
btw, what are the symptoms of the memory leak in the wild? |
2013-03-14 23:50:49 utc |
phaeron |
increasing memory usage , eventually swapping , and then system thrashing |
2013-03-14 23:51:18 utc |
phaeron |
(I know what tools to use for measurements but how to show them to you :) ) |
2013-03-14 23:51:35 utc |
jmettraux |
maybe you have to first prove it's ruote's fault |
2013-03-14 23:51:40 utc |
phaeron |
I'll collect them in a report and paste them |
2013-03-14 23:53:27 utc |
jmettraux |
in the end, I'd love to have a simple test case that tells me that ruote is leaking memory and how |
2013-03-14 23:53:45 utc |
phaeron |
yes |
2013-03-14 23:53:50 utc |
jmettraux |
platform details included |
2013-03-14 23:54:23 utc |
jmettraux |
but the culprit could be somewhere around your participant |
2013-03-14 23:54:37 utc |
jmettraux |
and also remember that you have an identical vm that doesn't leak |
2013-03-14 23:54:52 utc |
phaeron |
but the participant is a python process that runs elsewhere |
2013-03-14 23:54:53 utc |
jmettraux |
have you looked at the Ubuntu package versions? |
2013-03-14 23:55:13 utc |
phaeron |
all the heap usage is by that script I linked to above (boss) |
2013-03-14 23:55:20 utc |
phaeron |
what ubuntu package versions ? |
2013-03-14 23:55:36 utc |
jmettraux |
the "local" participant that dispatches over AMQP, is that a vanilla ruote-amqp participant or something that you guys developed or modified? |
2013-03-14 23:55:51 utc |
jmettraux |
are your two vms' packages identical? |
2013-03-14 23:56:05 utc |
jmettraux |
you only showed the Gemfile |
2013-03-14 23:56:08 utc |
phaeron |
the two vms should be identical yes |
2013-03-14 23:56:16 utc |
jmettraux |
you didn't report the ruby patch level |
2013-03-14 23:56:18 utc |
jmettraux |
should be |
2013-03-14 23:56:43 utc |
phaeron |
https://github.com/MeeGoIntegration/boss/blob/master/boss#L81 |
2013-03-14 23:57:02 utc |
phaeron |
ruby 1.8.7 (2011-12-28 patchlevel 357) [x86_64-linux] |
2013-03-14 23:57:17 utc |
jmettraux |
that's the receiver, there's also the participant involved |
2013-03-14 23:57:18 utc |
phaeron |
same on both systems |
2013-03-14 23:57:36 utc |
jmettraux |
are the system packages identical? |
2013-03-14 23:57:50 utc |
jmettraux |
are the two vms running on the same host? |
2013-03-14 23:57:52 utc |
phaeron |
yes |
2013-03-14 23:57:56 utc |
phaeron |
not same host |
2013-03-14 23:58:04 utc |
phaeron |
same packages installed on both sides |
2013-03-14 23:58:33 utc |
jmettraux |
not the same host... does moving the OK vm to the NotOK host make the vm go NotOK? |
2013-03-14 23:58:34 utc |
phaeron |
usually physical host doesn't affect vm internals |
2013-03-14 23:58:51 utc |
phaeron |
I can't migrate the vms, at least not very easily |
2013-03-14 23:59:39 utc |
phaeron |
if by participant you mean the remote on the other side of amqp , it is a python script , custom ruote-amqp |
2013-03-14 23:59:59 utc |
jmettraux |
I meant the local participant, the one that places the message in AMQP |
2013-03-15 00:00:13 utc |
jmettraux |
for that "real participant", the python one |
2013-03-15 00:00:15 utc |
phaeron |
lbt: can you help |
2013-03-15 00:01:50 utc |
phaeron |
I hope he's still awake :) |
2013-03-15 00:05:09 utc |
phaeron |
jmettraux: the launchers are also remote amqp python scripts. ruote is intermediate |
2013-03-15 00:05:19 utc |
jmettraux |
good |
2013-03-15 00:05:20 utc |
phaeron |
I am sorry I might be confusing you |
2013-03-15 00:05:28 utc |
jmettraux |
no worries |
2013-03-15 00:05:55 utc |
lbt |
hey ... sure |
2013-03-15 00:06:01 utc |
lbt |
hi jmettraux |
2013-03-15 00:06:12 utc |
jmettraux |
lbt: hello, good late evening |
2013-03-15 00:06:12 utc |
phaeron |
but as far as I understand : "python launcher (process + workitem ) " -> amqp -> ruote -> ( python amqp participants ) |
2013-03-15 00:09:24 utc |
lbt |
so just catching up on backlog |
2013-03-15 00:11:52 utc |
lbt |
so the process that grows is the "boss" script in that ^^ url |
2013-03-15 00:13:19 utc |
lbt |
and it is essentially just a wrapper around ruote Dash/Worker/FsStorage |
2013-03-15 00:13:52 utc |
lbt |
phaeron: I wonder if we could use a different storage? |