| 2013-03-14 00:20:14 utc | ypz_ | what would be the correct way to get the name from a Ruote.define object ? empirically, it is the value of pdef[1]['name'] |
| 2013-03-14 05:35:52 utc | jmettraux | ypz: hello, yes pdef[1]['name'] is probably the shortest way |
| 2013-03-14 05:36:18 utc | ypz | hi, |
| 2013-03-14 05:36:42 utc | ypz | so it is acceptable to access it this way ? I was hoping there was a getter for it |
| 2013-03-14 05:36:45 utc | jmettraux | although, in the flow, the workitem handed to participants has a #wf_name and a #wf_revision method |
| 2013-03-14 05:36:54 utc | jmettraux | it's totally acceptable |
| 2013-03-14 05:37:06 utc | jmettraux | process definitions are just trees |
| 2013-03-14 05:37:14 utc | jmettraux | (once "generated") |
| 2013-03-14 05:37:24 utc | jmettraux | feel free to wrap that in any class you like |
| 2013-03-14 05:37:42 utc | jmettraux | the "process portfolio management" is left to integrators |
| 2013-03-14 05:38:18 utc | jmettraux | there are people trying to build things around that: https://github.com/coffeeaddict/ruote-registry |
| 2013-03-14 05:38:25 utc | ypz | well, I generated pdef objects and stored them in the DB, and another script (not aware of anything about Ruote) is reading them out from the db directly |
| 2013-03-14 05:39:00 utc | jmettraux | ok |
| 2013-03-14 05:39:28 utc | ypz | great, thanks |
| 2013-03-14 22:09:26 utc | ypz | hi, jmettraux |
| 2013-03-14 22:09:40 utc | jmettraux | hello, good afternoon |
| 2013-03-14 22:09:54 utc | ypz | what time is it at your place ? |
| 2013-03-14 22:10:01 utc | jmettraux | 0657 |
| 2013-03-14 22:10:13 utc | ypz | then good morning to you |
| 2013-03-14 22:10:18 utc | jmettraux | you're in SF iirc |
| 2013-03-14 22:10:21 utc | jmettraux | thanks! |
| 2013-03-14 22:10:42 utc | ypz | yea, I am in the SF Bay area |
| 2013-03-14 22:11:22 utc | jmettraux | how can I help you? |
| 2013-03-14 22:11:31 utc | ypz | when I use a participant to handle on_error conditions, the process itself is removed from the engine, correct ? |
| 2013-03-14 22:12:03 utc | jmettraux | ACTION looks again at the docs |
| 2013-03-14 22:13:06 utc | ypz | i am trying to figure out how to handle various types of errors one may encounter while processing a workflow |
| 2013-03-14 22:15:13 utc | jmettraux | if you use dashboard.on_error = 'participant', the process should not be removed |
| 2013-03-14 22:15:22 utc | jmettraux | is that what you're using? |
| 2013-03-14 22:16:15 utc | ypz | i used sequence :on_error => 'error_handler' |
| 2013-03-14 22:16:44 utc | jmettraux | is the sequence the top "embracing" block? |
| 2013-03-14 22:16:45 utc | phaeron | jmettraux: finally setup a staging environment where I can run stuff in a vm , under valgrind |
| 2013-03-14 22:17:04 utc | jmettraux | phaeron: hello, good good |
| 2013-03-14 22:17:12 utc | ypz | jmettraux, yes, |
| 2013-03-14 22:17:32 utc | jmettraux | ypz: then the sequence will execute the participant and then be "over" |
| 2013-03-14 22:17:57 utc | jmettraux | ypz: since it's the top sequence, the process terminates as well (unless the on_error participant doesn't reply immediately) |
| 2013-03-14 22:18:43 utc | jmettraux | ypz: maybe a good rule of thumb would be to deal with known errors "in participants", and let the rest of the errors jam their processes |
| 2013-03-14 22:19:02 utc | phaeron | jmettraux: I compared the setups between the two vms ( leaking vs. non leaking ) and couldn't find any difference. |
| 2013-03-14 22:19:49 utc | jmettraux | ypz: then when you have a good grip on the thing, you can start using those block on_error constructs |
| 2013-03-14 22:20:14 utc | jmettraux | ypz: but please experiment and have fun |
| 2013-03-14 22:20:32 utc | jmettraux | phaeron: can you reproduce the leak? |
| 2013-03-14 22:20:36 utc | ypz | jmettraux by saying " to deal with known errors "in participants", do you mean to implement on_error method for that participant ? |
| 2013-03-14 22:21:12 utc | jmettraux | ypz: sorry, I meant regular rescue/ensure blocks inside of the participant implementations to deal with local issues |
| 2013-03-14 22:21:25 utc | jmettraux | ypz: those that can be handled at the participant level |
| 2013-03-14 22:21:52 utc | jmettraux | ypz: (and that you don't want to jam their processes) |
| 2013-03-14 22:23:42 utc | phaeron | jmettraux: yes. as far as I can see , but valgrind is not reporting it yet |
| 2013-03-14 22:24:22 utc | phaeron | jmettraux: the ruote setup is a bit custom https://github.com/MeeGoIntegration/boss/blob/bundled/Gemfile.lock |
| 2013-03-14 22:24:32 utc | phaeron | opensuse 12.1 64bit |
| 2013-03-14 22:25:20 utc | phaeron | ruby 1.8.7 |
| 2013-03-14 22:26:34 utc | jmettraux | phaeron: this vm is a ruote-worker vm? Do you have an array of ruote worker vms? Or is it an amqp worker vm? |
| 2013-03-14 22:27:13 utc | ypz | jmettraux, in the sequence :on_error => 'error_handler' construct, are the workitem and error message available to the 'error_handler' participant to examine what caused the error condition? my simple test error_handler just does "pp workitem" and it doesn't produce any output |
| 2013-03-14 22:27:43 utc | phaeron | jmettraux: single ruote fs engine with one amqp worker (same vm for this test) |
| 2013-03-14 22:28:32 utc | jmettraux | phaeron: and the leak is coming from the ruote process or the amqp worker process? |
| 2013-03-14 22:28:50 utc | jmettraux | ypz: looking at the doc... |
| 2013-03-14 22:29:47 utc | jmettraux | ypz: the workitem handed to the error handler should have an __error__ field, the workitem class has a #error method to get it directly |
| 2013-03-14 22:30:44 utc | phaeron | this 'boss' script https://github.com/MeeGoIntegration/boss/blob/bundled/boss |
| 2013-03-14 22:30:50 utc | phaeron | eventually eats lots of memory |
| 2013-03-14 22:31:04 utc | phaeron | pmap says it is all heap |
| 2013-03-14 22:31:47 utc | jmettraux | phaeron: that's the script that contains the ruote worker |
| 2013-03-14 22:32:03 utc | phaeron | and initializes the engine too |
| 2013-03-14 22:32:38 utc | phaeron | storage , I mean |
| 2013-03-14 22:33:33 utc | jmettraux | ypz: here's a test (a bit convoluted) that leverages the #error method: https://github.com/jmettraux/ruote/blob/master/test/functional/ft_5_on_error.rb#L269-L305 |
| 2013-03-14 22:33:54 utc | jmettraux | phaeron: I'm looking forward to the valgrind results |
| 2013-03-14 22:35:08 utc | ypz | let me look at the test |
| 2013-03-14 22:43:53 utc | jmettraux | ypz: not sure if I should have shown this test, it's a bit raw and convoluted, it uses a stash trick, it's probably not a good example |
| 2013-03-14 22:44:42 utc | ypz | is "stash" special in any way ? |
| 2013-03-14 22:45:22 utc | jmettraux | yes, it's only available in ruote functional tests |
| 2013-03-14 22:45:53 utc | ypz | is there any reason I can't extract error into from work item inside my error_handler, such as write it to a log file on file system ? |
| 2013-03-14 22:46:13 utc | ypz | s/error into/error info/ |
| 2013-03-14 22:46:21 utc | jmettraux | ypz: you should have no problem doing that |
| 2013-03-14 22:47:08 utc | jmettraux | if it doesn't work, what are the symptoms? |
| 2013-03-14 22:47:14 utc | ypz | good to know that! |
| 2013-03-14 22:47:30 utc | ypz | right now, I get nothing, no errors and no output |
| 2013-03-14 22:49:13 utc | jmettraux | maybe an error in your error_handler |
| 2013-03-14 22:49:53 utc | jmettraux | add some puts statements to determine where it stops behaving, maybe add a rescue block |
| 2013-03-14 22:50:07 utc | ypz | the document http://ruote.rubyforge.org/exp/on_error.html mentions (error) messages, any doc on how to receive such messages ? |
| 2013-03-14 22:50:08 utc | jmettraux | ascertain the thing before it goes hiding under the rug |
| 2013-03-14 22:50:57 utc | ypz | yea, I'll try to trim my error handler to its minimum to figure out what's going on there, now I know that it should work |
| 2013-03-14 22:50:59 utc | jmettraux | in the same way, by writing a participant or a subprocess |
| 2013-03-14 22:59:15 utc | jmettraux | ypz: here is a simple example, it digs into workitem.error: https://gist.github.com/anonymous/5165918 |
| 2013-03-14 23:02:27 utc | ypz | alright, my abs. bare bone error handler is able to puts out the work item along with error message ! |
| 2013-03-14 23:06:10 utc | ypz | jmettraux that's plenty of info to get me going for now, thanks a lot! |
| 2013-03-14 23:09:03 utc | jmettraux | ypz: you're welcome! |
| 2013-03-14 23:09:27 utc | ypz | bye |
| 2013-03-14 23:09:32 utc | jmettraux | bye! |
| 2013-03-14 23:21:54 utc | phaeron | jmettraux: sorry , this is the script that is running https://github.com/MeeGoIntegration/boss/blob/0.8.0/boss |
| 2013-03-14 23:25:40 utc | phaeron | it's in master now |
| 2013-03-14 23:35:36 utc | phaeron | https://github.com/MeeGoIntegration/boss-standard-workflow/blob/master/processes/SRCSRV_REQUEST_CREATE.BOSS_handle_SR.pdef#L454 |
| 2013-03-14 23:35:50 utc | phaeron | similar constructs cause very high cpu usage |
| 2013-03-14 23:37:05 utc | jmettraux | phaeron: it iterates on how many actions? |
| 2013-03-14 23:37:42 utc | phaeron | varies. usually 2-4 and doesn't cause much trouble. recently a big request had about 130 actions |
| 2013-03-14 23:38:23 utc | jmettraux | what causes the high cpu usage? What is do_wait_for_build? |
| 2013-03-14 23:38:48 utc | phaeron | https://github.com/MeeGoIntegration/boss-standard-workflow/blob/master/processes/SRCSRV_REQUEST_CREATE.BOSS_handle_SR.pdef#L510 |
| 2013-03-14 23:39:15 utc | phaeron | is_repo_published is an amqp participant |
| 2013-03-14 23:39:24 utc | phaeron | that checks an external system |
| 2013-03-14 23:39:50 utc | jmettraux | cannot pinpoint the real cpu hog? |
| 2013-03-14 23:40:30 utc | phaeron | not really. I wrote a similar smaller process and got the high cpu usage similarly |
| 2013-03-14 23:41:27 utc | jmettraux | I'm afraid I cannot help much |
| 2013-03-14 23:42:01 utc | phaeron | yeah I am still trying to find a single point of failure |
| 2013-03-14 23:42:16 utc | phaeron | jmettraux: don't worry I am not giving up yet :) |
| 2013-03-14 23:42:36 utc | jmettraux | well, lots of suspects |
| 2013-03-14 23:43:55 utc | jmettraux | it'd be interesting to run a process that just contains an invocation to do_wait_for_build and measure |
| 2013-03-14 23:44:15 utc | jmettraux | (just a few simplification iterations: https://gist.github.com/anonymous/5166219 ) |
| 2013-03-14 23:46:41 utc | phaeron | I am doing the last simpler form with a dumper amqp participant but it doesn't call to the external system |
| 2013-03-14 23:46:52 utc | phaeron | and I can see the memory increase slowly in top |
| 2013-03-14 23:47:32 utc | jmettraux | then try removing your participant |
| 2013-03-14 23:48:23 utc | jmettraux | if you're sure it's ruote's fault, it's pretty easy to prove it |
| 2013-03-14 23:48:39 utc | jmettraux | without any amqp stuff |
| 2013-03-14 23:49:08 utc | jmettraux | just write a plain ruote test case and state your measurement points and results |
| 2013-03-14 23:49:13 utc | phaeron | ok |
| 2013-03-14 23:49:28 utc | phaeron | I am still also figuring out how to produce the measurements |
| 2013-03-14 23:49:35 utc | jmettraux | valgrind? |
| 2013-03-14 23:49:40 utc | jmettraux | top? |
| 2013-03-14 23:50:06 utc | jmettraux | btw, what are the symptoms of the memory leak in the wild? |
| 2013-03-14 23:50:49 utc | phaeron | increasing memory usage , eventually swapping , and then system thrashing |
| 2013-03-14 23:51:18 utc | phaeron | (I know what tools to use for measurements but how to show them to you :) ) |
| 2013-03-14 23:51:35 utc | jmettraux | maybe you have to first prove it's ruote's fault |
| 2013-03-14 23:51:40 utc | phaeron | I'll collect them in a report and paste them |
| 2013-03-14 23:53:27 utc | jmettraux | in the end, I'd love to have a simple test case that tells me that ruote is leaking memory and how |
| 2013-03-14 23:53:45 utc | phaeron | yes |
| 2013-03-14 23:53:50 utc | jmettraux | platform details included |
| 2013-03-14 23:54:23 utc | jmettraux | but the culprit could be somewhere around your participant |
| 2013-03-14 23:54:37 utc | jmettraux | and also remember that you have an identical vm that doesn't leak |
| 2013-03-14 23:54:52 utc | phaeron | but the participant is a python process that runs elsewhere |
| 2013-03-14 23:54:53 utc | jmettraux | have you looked at the Ubuntu package versions? |
| 2013-03-14 23:55:13 utc | phaeron | all the heap usage is by that script I linked to above (boss) |
| 2013-03-14 23:55:20 utc | phaeron | what ubuntu package versions ? |
| 2013-03-14 23:55:36 utc | jmettraux | the "local" participant that dispatches over AMQP, is that a vanilla ruote-amqp participant or something that you guys developed or modified? |
| 2013-03-14 23:55:51 utc | jmettraux | are your two vms' packages identical? |
| 2013-03-14 23:56:05 utc | jmettraux | you only showed the Gemfile |
| 2013-03-14 23:56:08 utc | phaeron | the two vms should be identical yes |
| 2013-03-14 23:56:16 utc | jmettraux | you didn't report the ruby patch level |
| 2013-03-14 23:56:18 utc | jmettraux | should be |
| 2013-03-14 23:56:43 utc | phaeron | https://github.com/MeeGoIntegration/boss/blob/master/boss#L81 |
| 2013-03-14 23:57:02 utc | phaeron | ruby 1.8.7 (2011-12-28 patchlevel 357) [x86_64-linux] |
| 2013-03-14 23:57:17 utc | jmettraux | that's the receiver, there's also the participant involved |
| 2013-03-14 23:57:18 utc | phaeron | same on both systems |
| 2013-03-14 23:57:36 utc | jmettraux | are the system packages identical? |
| 2013-03-14 23:57:50 utc | jmettraux | are the two vms running on the same host? |
| 2013-03-14 23:57:52 utc | phaeron | yes |
| 2013-03-14 23:57:56 utc | phaeron | not same host |
| 2013-03-14 23:58:04 utc | phaeron | same packages installed on both sides |
| 2013-03-14 23:58:33 utc | jmettraux | not the same host... does moving the OK vm to the NotOK host make the vm go NotOK? |
| 2013-03-14 23:58:34 utc | phaeron | usually physical host doesn't affect vm internals |
| 2013-03-14 23:58:51 utc | phaeron | I can't migrate the vms, at least not very easily |
| 2013-03-14 23:59:39 utc | phaeron | if by participant you mean the remote on the other side of amqp , it is a python script , custom ruote-amqp |
| 2013-03-14 23:59:59 utc | jmettraux | I meant the local participant, the one that places the message in AMQP |
| 2013-03-15 00:00:13 utc | jmettraux | for that "real participant", the python one |
| 2013-03-15 00:00:15 utc | phaeron | lbt: can you help |
| 2013-03-15 00:01:50 utc | phaeron | I hope he's still awake :) |
| 2013-03-15 00:05:09 utc | phaeron | jmettraux: the launchers are also remote amqp python scripts. ruote is intermediate |
| 2013-03-15 00:05:19 utc | jmettraux | good |
| 2013-03-15 00:05:20 utc | phaeron | I am sorry I might be confusing you |
| 2013-03-15 00:05:28 utc | jmettraux | no worries |
| 2013-03-15 00:05:55 utc | lbt | hey ... sure |
| 2013-03-15 00:06:01 utc | lbt | hi jmettraux |
| 2013-03-15 00:06:12 utc | jmettraux | lbt: hello, good late evening |
| 2013-03-15 00:06:12 utc | phaeron | but as far as I understand : "python launcher (process + workitem ) " -> amqp -> ruote -> ( python amqp participants ) |
| 2013-03-15 00:09:24 utc | lbt | so just catching up on backlog |
| 2013-03-15 00:11:52 utc | lbt | so the process that grows is the "boss" script in that ^^ url |
| 2013-03-15 00:13:19 utc | lbt | and it is essentially just a wrapper around ruote Dash/Worker/FsStorage |
| 2013-03-15 00:13:52 utc | lbt | phaeron: I wonder if we could use a different storage? |
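
The earlier exchange about `pdef[1]['name']` and "feel free to wrap that in any class you like" can be sketched roughly as below. This is only an illustration under stated assumptions: the tree literal is a hand-written stand-in for what `Ruote.define` generates (assumed shape: expression name, attributes hash, children), the `ProcessDefinition` class and its method names are hypothetical and not part of ruote, and the JSON round trip merely imitates storing the tree in a DB and reading it back from a script that knows nothing about ruote.

```ruby
require 'json'

# Hand-written stand-in for a generated process definition tree.
# Assumed shape: [expression_name, attributes_hash, children].
PDEF = ['define', { 'name' => 'review_flow', 'revision' => '0.1' }, [
  ['sequence', { 'on_error' => 'error_handler' }, [
    ['participant', { 'ref' => 'reviewer' }, []]
  ]]
]]

# Hypothetical wrapper (not a ruote class) that hides the
# pdef[1]['name'] indexing behind getters.
class ProcessDefinition
  def initialize(tree)
    @tree = tree
  end

  # Same value as tree[1]['name']; nil when the attribute is absent.
  def name
    attributes['name']
  end

  def revision
    attributes['revision']
  end

  private

  # The attributes hash is the second element of the tree.
  def attributes
    @tree[1] || {}
  end
end

# Imitate "stored in DB, read by another script": serialize the tree,
# parse it back, and wrap the result. No ruote code is involved here.
stored = JSON.generate(PDEF)
pdef = ProcessDefinition.new(JSON.parse(stored))
puts pdef.name
```

Because the tree is just nested arrays and hashes with string keys, it should survive a JSON round trip unchanged, which is why the reading side does not need ruote loaded.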