ruote tmp/log_2013-03-14.html

2013-03-14 00:20:14 utc ypz_ what would be the correct way to get the name from a Ruote.define object ? empirically, it is the value of pdef[1]['name']
2013-03-14 05:35:52 utc jmettraux ypz: hello, yes pdef[1]['name'] is probably the shortest way
2013-03-14 05:36:18 utc ypz hi,
2013-03-14 05:36:42 utc ypz so it is acceptable to access it this way ? I was hoping there was a getter for it
2013-03-14 05:36:45 utc jmettraux although, in the flow, the workitem handed to participants has a #wf_name and a #wf_revision method
2013-03-14 05:36:54 utc jmettraux it's totally acceptable
2013-03-14 05:37:06 utc jmettraux process definitions are just trees
2013-03-14 05:37:14 utc jmettraux (once "generated")
2013-03-14 05:37:24 utc jmettraux feel free to wrap that in any class you like
2013-03-14 05:37:42 utc jmettraux the "process portfolio management" is left to integrators
2013-03-14 05:38:18 utc jmettraux there are people trying to build things around that: https://github.com/coffeeaddict/ruote-registry
2013-03-14 05:38:25 utc ypz well, I generated pdef objects and stored them in the DB, and another script (not aware of anything about Ruote) is reading them out from the db directly
2013-03-14 05:39:00 utc jmettraux ok
2013-03-14 05:39:28 utc ypz great, thanks
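For reference, a minimal sketch of the "wrap that in any class you like" suggestion. It does not load ruote: the tree literal is a hand-written stand-in for what a generated definition looks like ([expression name, attributes, children]), the class name ProcessDefinition is hypothetical, and JSON is only assumed as the storage format for the DB round trip.

```ruby
require 'json'

# Hand-written stand-in for a generated process definition tree:
# [ expression_name, attributes, children ]
pdef = [
  'define', { 'name' => 'my_flow' }, [
    [ 'sequence', {}, [
      [ 'participant', { 'ref' => 'alice' }, [] ],
      [ 'participant', { 'ref' => 'bob' }, [] ]
    ] ]
  ]
]

# Hypothetical wrapper: gives the tree a getter instead of pdef[1]['name'].
class ProcessDefinition
  attr_reader :tree

  def initialize(tree)
    @tree = tree
  end

  # The attributes hash sits at index 1 of the root node.
  def name
    tree[1]['name']
  end
end

# The tree holds only arrays, hashes and strings, so it survives a JSON
# round trip and can be read back by a script that knows nothing of ruote.
stored = JSON.generate(pdef)
reloaded = ProcessDefinition.new(JSON.parse(stored))

puts reloaded.name
```

Since the tree is plain data, the reading script only needs the `[1]['name']` convention, not ruote itself.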
2013-03-14 22:09:26 utc ypz hi, jmettraux
2013-03-14 22:09:40 utc jmettraux hello, good afternoon
2013-03-14 22:09:54 utc ypz what time is it at your place ?
2013-03-14 22:10:01 utc jmettraux 0657
2013-03-14 22:10:13 utc ypz then good morning to you
2013-03-14 22:10:18 utc jmettraux you're in SF iirc
2013-03-14 22:10:21 utc jmettraux thanks!
2013-03-14 22:10:42 utc ypz yea, I am in the SF Bay area
2013-03-14 22:11:22 utc jmettraux how can I help you?
2013-03-14 22:11:31 utc ypz when I use a participant to handle on_error conditions, the process itself is removed from the engine, correct ?
2013-03-14 22:12:03 utc jmettraux ACTION looks again at the docs
2013-03-14 22:13:06 utc ypz i am trying to figure out how to handle various types of errors one may encounter while processing a workflow
2013-03-14 22:15:13 utc jmettraux if you use dashboard.on_error = 'participant', the process should not be removed
2013-03-14 22:15:22 utc jmettraux is that what you're using?
2013-03-14 22:16:15 utc ypz i used sequence :on_error => 'error_handler'
2013-03-14 22:16:44 utc jmettraux is the sequence the top "embracing" block?
2013-03-14 22:16:45 utc phaeron jmettraux: finally setup a staging environment where I can run stuff in a vm , under valgrind
2013-03-14 22:17:04 utc jmettraux phaeron: hello, good good
2013-03-14 22:17:12 utc ypz jmettraux, yes,
2013-03-14 22:17:32 utc jmettraux ypz: then the sequence will execute the participant and then be "over"
2013-03-14 22:17:57 utc jmettraux ypz: since it's the top sequence, the process terminates as well (unless the on_error participant doesn't reply immediately)
2013-03-14 22:18:43 utc jmettraux ypz: maybe a good rule of thumb would be to deal with known errors "in participants", and let the rest of the errors jam their processes
2013-03-14 22:19:02 utc phaeron jmettraux: I compared the setups between the two vms ( leaking vs. non leaking ) and couldn't find any difference.
2013-03-14 22:19:49 utc jmettraux ypz: then when you have a good grip on the thing, you can start using those block on_error constructs
2013-03-14 22:20:14 utc jmettraux ypz: but please experiment and have fun
2013-03-14 22:20:32 utc jmettraux phaeron: can you reproduce the leak?
2013-03-14 22:20:36 utc ypz jmettraux by saying " to deal with known errors "in participants", do you mean to implement on_error method for that participant ?
2013-03-14 22:21:12 utc jmettraux ypz: sorry, I meant regular rescue/ensure blocks inside of the participant implementations to deal with local issues
2013-03-14 22:21:25 utc jmettraux ypz: those that can be handled at the participant level
2013-03-14 22:21:52 utc jmettraux ypz: (and that you don't want to jam their processes)
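A rough sketch of that rule of thumb, assuming a participant that calls some remote service. Nothing here loads ruote: Workitem is a stand-in with just a fields hash, and FetchParticipant, RemoteTimeout and the injected fetcher are hypothetical names. A real participant would also hand the workitem back to the engine when done.

```ruby
# Stand-in for a ruote workitem: only a fields hash, enough for this sketch.
class Workitem
  attr_reader :fields

  def initialize(fields = {})
    @fields = fields
  end
end

# Hypothetical error class for a known, locally recoverable problem.
class RemoteTimeout < StandardError; end

class FetchParticipant
  # fetcher is any callable; it is injected so the sketch runs offline.
  def initialize(fetcher)
    @fetcher = fetcher
  end

  def on_workitem(workitem)
    workitem.fields['result'] = @fetcher.call
  rescue RemoteTimeout => e
    # Known error: note it in the workitem and let the flow continue.
    workitem.fields['result'] = nil
    workitem.fields['fetch_error'] = e.message
  ensure
    # Runs whether the fetch succeeded, timed out, or raised something else.
    workitem.fields['attempted'] = true
  end
end

ok = Workitem.new
FetchParticipant.new(-> { 'payload' }).on_workitem(ok)

timed_out = Workitem.new
FetchParticipant.new(-> { raise RemoteTimeout, 'no answer' }).on_workitem(timed_out)

puts ok.fields.inspect
puts timed_out.fields.inspect
```

Anything other than RemoteTimeout is deliberately not rescued, so it propagates out of the participant and is left to jam the process.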
2013-03-14 22:23:42 utc phaeron jmettraux: yes. as far as I can see , but valgrind is not reporting it yet
2013-03-14 22:24:22 utc phaeron jmettraux: the ruote setup is a bit custom https://github.com/MeeGoIntegration/boss/blob/bundled/Gemfile.lock
2013-03-14 22:24:32 utc phaeron opensuse 12.1 64bit
2013-03-14 22:25:20 utc phaeron ruby 1.8.7
2013-03-14 22:26:34 utc jmettraux phaeron: this vm is a ruote-worker vm? Do you have an array of ruote worker vms? Or is it an amqp worker vm?
2013-03-14 22:27:13 utc ypz jmettraux, in sequence :on_error => 'error_handler' construct, is the work item and error message available to the 'error_handler' participant to examine what caused the error condition? my simple test error_handler just does "pp workitem" and it doesn't produce any output
2013-03-14 22:27:43 utc phaeron jmettraux: single ruote fs engine with one amqp worker (same vm for this test)
2013-03-14 22:28:32 utc jmettraux phaeron: and the leak is coming from the ruote process or the amqp worker process?
2013-03-14 22:28:50 utc jmettraux ypz: looking at the doc...
2013-03-14 22:29:47 utc jmettraux ypz: the workitem handed to the error handler should have an __error__ field, the workitem class has a #error method to get it directly
2013-03-14 22:30:44 utc phaeron this 'boss' script https://github.com/MeeGoIntegration/boss/blob/bundled/boss
2013-03-14 22:30:50 utc phaeron eventually eats lots of memory
2013-03-14 22:31:04 utc phaeron pmap says it is all heap
2013-03-14 22:31:47 utc jmettraux phaeron: that's the script that contains the ruote worker
2013-03-14 22:32:03 utc phaeron and initializes the engine too
2013-03-14 22:32:38 utc phaeron storage , I mean
2013-03-14 22:33:33 utc jmettraux ypz: here's a test (a bit convoluted) that leverages the #error method: https://github.com/jmettraux/ruote/blob/master/test/functional/ft_5_on_error.rb#L269-L305
2013-03-14 22:33:54 utc jmettraux phaeron: I'm looking forward to the valgrind results
2013-03-14 22:35:08 utc ypz let me look at the test
2013-03-14 22:43:53 utc jmettraux ypz: not sure if I should have shown this test, it's a bit raw and convoluted, it uses a stash trick, it's probably not a good example
2013-03-14 22:44:42 utc ypz is "stash" special in any way ?
2013-03-14 22:45:22 utc jmettraux yes, it's only available in ruote functional tests
2013-03-14 22:45:53 utc ypz is there any reason I can't extract error into from work item inside my error_handler, such as write it to a log file on file system ?
2013-03-14 22:46:13 utc ypz s/error into/error info/
2013-03-14 22:46:21 utc jmettraux ypz: you should have no problem doing that
2013-03-14 22:47:08 utc jmettraux if it doesn't work, what are the symptoms?
2013-03-14 22:47:14 utc ypz good to know that!
2013-03-14 22:47:30 utc ypz right now, I got nothing, no errors and no output
2013-03-14 22:49:13 utc jmettraux maybe an error in your error_handler
2013-03-14 22:49:53 utc jmettraux add some puts statements to determine where it stops behaving, maybe add a rescue block
2013-03-14 22:50:07 utc ypz in the document, http://ruote.rubyforge.org/exp/on_error.html, it mentions (error) messages, any doc on how to receive such messages ?
2013-03-14 22:50:08 utc jmettraux ascertain the thing before it goes hiding under the rug
2013-03-14 22:50:57 utc ypz yea, I'll try to trim my error handler to its minimum to figure out what's going on there, now I know that it should work
2013-03-14 22:50:59 utc jmettraux in the same way, by writing a participant or a subprocess
2013-03-14 22:59:15 utc jmettraux ypz: here is a simple example, it digs into workitem.error: https://gist.github.com/anonymous/5165918
2013-03-14 23:02:27 utc ypz alright, my abs. bare bone error handler is able to puts out the work item along with error message !
2013-03-14 23:06:10 utc ypz jmettraux that's plenty of info to get me going for now, thanks a lot!
2013-03-14 23:09:03 utc jmettraux ypz: you're welcome!
2013-03-14 23:09:27 utc ypz bye
2013-03-14 23:09:32 utc jmettraux bye!
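A bare-bones sketch of the kind of error handler discussed above, written without loading ruote. The Workitem class is a stand-in that only mirrors the __error__ field and #error shortcut mentioned in the conversation; the ErrorHandler class, the log format and the contents of the sample error hash are assumptions made for illustration.

```ruby
require 'stringio'

# Stand-in for a ruote workitem, mirroring the __error__ field and the
# #error shortcut mentioned in the conversation.
class Workitem
  attr_reader :fields

  def initialize(fields)
    @fields = fields
  end

  def error
    fields['__error__']
  end
end

# Hypothetical handler: appends whatever it finds to an IO. In practice
# that could be File.open('errors.log', 'a'); here it is a StringIO.
class ErrorHandler
  def initialize(io)
    @io = io
  end

  def on_workitem(workitem)
    err = workitem.error
    if err.nil?
      @io.puts 'error_handler called without an __error__ field'
    else
      # inspect avoids assuming anything about the shape of the error.
      @io.puts "process error: #{err.inspect}"
    end
  rescue StandardError => e
    # A bug in the handler itself should be loud rather than silent.
    warn "error_handler failed: #{e.class}: #{e.message}"
    raise
  end
end

log = StringIO.new
# The content of this hash is made up for the example.
failed = Workitem.new('__error__' => { 'class' => 'RuntimeError', 'message' => 'boom' })
ErrorHandler.new(log).on_workitem(failed)
ErrorHandler.new(log).on_workitem(Workitem.new({}))

puts log.string
```

The rescue block follows the advice given in the conversation: if the handler itself raises, the failure is printed before it can go unnoticed.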
2013-03-14 23:21:54 utc phaeron jmettraux: sorry , this is the script that is running https://github.com/MeeGoIntegration/boss/blob/0.8.0/boss
2013-03-14 23:25:40 utc phaeron it's in master now
2013-03-14 23:35:36 utc phaeron https://github.com/MeeGoIntegration/boss-standard-workflow/blob/master/processes/SRCSRV_REQUEST_CREATE.BOSS_handle_SR.pdef#L454
2013-03-14 23:35:50 utc phaeron similar constructs cause very high cpu usage
2013-03-14 23:37:05 utc jmettraux phaeron: it iterates on how many actions?
2013-03-14 23:37:42 utc phaeron varies. usually 2-4 and doesn't cause much trouble. recently a big request had about 130 actions
2013-03-14 23:38:23 utc jmettraux what causes the high cpu usage? What is do_wait_for_build?
2013-03-14 23:38:48 utc phaeron https://github.com/MeeGoIntegration/boss-standard-workflow/blob/master/processes/SRCSRV_REQUEST_CREATE.BOSS_handle_SR.pdef#L510
2013-03-14 23:39:15 utc phaeron is_repo_published is an amqp participant
2013-03-14 23:39:24 utc phaeron that checks an external system
2013-03-14 23:39:50 utc jmettraux cannot pinpoint on the real cpu hog?
2013-03-14 23:40:30 utc phaeron not really. I wrote a similar smaller process and got the high cpu usage similarly
2013-03-14 23:41:27 utc jmettraux I'm afraid I cannot help much
2013-03-14 23:42:01 utc phaeron yeah I am still trying to find a single point of failure
2013-03-14 23:42:16 utc phaeron jmettraux: don't worry I am not giving up yet :)
2013-03-14 23:42:36 utc jmettraux well, lots of suspects
2013-03-14 23:43:55 utc jmettraux it'd be interesting to run a process that just contains an invocation to do_wait_for_build and measure
2013-03-14 23:44:15 utc jmettraux (just a few simplification iterations: https://gist.github.com/anonymous/5166219 )
2013-03-14 23:46:41 utc phaeron I am doing the last simpler form with a dumper amqp participant but it doesn't call to the external system
2013-03-14 23:46:52 utc phaeron and I can see the memory increase slowly in top
2013-03-14 23:47:32 utc jmettraux then try removing your participant
2013-03-14 23:48:23 utc jmettraux if you're sure it's ruote's fault, it's pretty easy to prove it
2013-03-14 23:48:39 utc jmettraux without any amqp stuff
2013-03-14 23:49:08 utc jmettraux just write a plain ruote test case and state your measurements points and results
2013-03-14 23:49:13 utc phaeron ok
2013-03-14 23:49:28 utc phaeron I am still also figuring out how to produce the measurements
2013-03-14 23:49:35 utc jmettraux valgrind?
2013-03-14 23:49:40 utc jmettraux top?
2013-03-14 23:50:06 utc jmettraux btw, what are the symptoms of the memory leak in the wild?
2013-03-14 23:50:49 utc phaeron increasing memory usage , eventually swapping , and then system thrashing
2013-03-14 23:51:18 utc phaeron (I know what tools to use for measurements but how to show them to you :) )
2013-03-14 23:51:35 utc jmettraux maybe you have to first prove it's ruote's fault
2013-03-14 23:51:40 utc phaeron I'll collect them in a report and paste them
2013-03-14 23:53:27 utc jmettraux in the end, I'd love to have a simple test case that tells me that ruote is leaking memory and how
2013-03-14 23:53:45 utc phaeron yes
2013-03-14 23:53:50 utc jmettraux platform details included
2013-03-14 23:54:23 utc jmettraux but the culprit could be somewhere around your participant
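One possible way to "state your measurement points and results" from inside a plain Ruby test script, sketched under assumptions: rss_kb reads the resident set size from /proc/self/status (Linux only, returns nil elsewhere), and measure is a hypothetical helper. The block passed to it is a placeholder workload; a real test would launch a ruote process and wait for it there.

```ruby
# Resident set size of the current process in kB, read from /proc.
# Linux only; returns nil where /proc/self/status is not available.
def rss_kb
  path = '/proc/self/status'
  return nil unless File.readable?(path)

  line = File.foreach(path).find { |l| l.start_with?('VmRSS:') }
  line ? line.split[1].to_i : nil
end

# Hypothetical helper: runs the block `iterations` times and records
# [iteration_count, rss_kb] every `sample_every` iterations.
def measure(iterations, sample_every)
  samples = []
  iterations.times do |i|
    yield i
    next unless ((i + 1) % sample_every).zero?

    GC.start # collect first, so garbage is not mistaken for a leak
    samples << [i + 1, rss_kb]
  end
  samples
end

# Placeholder workload standing in for "launch a process and wait for it".
samples = measure(1_000, 250) { |i| Array.new(100) { i.to_s } }

samples.each do |count, kb|
  puts "after #{count} iterations: #{kb.inspect} kB"
end
```

A figure that keeps climbing across samples, even after GC.start, is the kind of result worth pasting into a report alongside the platform details.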
2013-03-14 23:54:37 utc jmettraux and also remember that you have an identical vm that doesn't leak
2013-03-14 23:54:52 utc phaeron but the participant is a python process that runs elsewhere
2013-03-14 23:54:53 utc jmettraux have you looked at the Ubuntu package versions?
2013-03-14 23:55:13 utc phaeron all the heap usage is by that script I linked to above (boss)
2013-03-14 23:55:20 utc phaeron what ubuntu package versions ?
2013-03-14 23:55:36 utc jmettraux the "local" participant that dispatches over AMQP, is that a vanilla ruote-amqp participant or something that you guys developed or modified?
2013-03-14 23:55:51 utc jmettraux are your two vms' packages identical?
2013-03-14 23:56:05 utc jmettraux you only showed the Gemfile
2013-03-14 23:56:08 utc phaeron the two vms should be identical yes
2013-03-14 23:56:16 utc jmettraux you didn't report the ruby patch level
2013-03-14 23:56:18 utc jmettraux should be
2013-03-14 23:56:43 utc phaeron https://github.com/MeeGoIntegration/boss/blob/master/boss#L81
2013-03-14 23:57:02 utc phaeron ruby 1.8.7 (2011-12-28 patchlevel 357) [x86_64-linux]
2013-03-14 23:57:17 utc jmettraux that's the receiver, there's also the participant involved
2013-03-14 23:57:18 utc phaeron same on both systems
2013-03-14 23:57:36 utc jmettraux are the system packages identical?
2013-03-14 23:57:50 utc jmettraux are the two vms running on the same host?
2013-03-14 23:57:52 utc phaeron yes
2013-03-14 23:57:56 utc phaeron not same host
2013-03-14 23:58:04 utc phaeron same packages installed on both sides
2013-03-14 23:58:33 utc jmettraux not the same host... does moving the OK vm to the NotOK host make the vm go NotOK?
2013-03-14 23:58:34 utc phaeron usually physical host doesn't affect vm internals
2013-03-14 23:58:51 utc phaeron I can't migrate the vms, at least not very easily
2013-03-14 23:59:39 utc phaeron if by participant you mean the remote on the other side of amqp , it is a python script , custom ruote-amqp
2013-03-14 23:59:59 utc jmettraux I meant the local participant, the one that places the message in AMQP
2013-03-15 00:00:13 utc jmettraux for that "real participant", the python one
2013-03-15 00:00:15 utc phaeron lbt: can you help
2013-03-15 00:01:50 utc phaeron I hope he's still awake :)
2013-03-15 00:05:09 utc phaeron jmettraux: the launchers are also remote amqp python scripts. ruote is intermediate
2013-03-15 00:05:19 utc jmettraux good
2013-03-15 00:05:20 utc phaeron I am sorry I might be confusing you
2013-03-15 00:05:28 utc jmettraux no worries
2013-03-15 00:05:55 utc lbt hey ... sure
2013-03-15 00:06:01 utc lbt hi jmettraux
2013-03-15 00:06:12 utc jmettraux lbt: hello, good late evening
2013-03-15 00:06:12 utc phaeron but as far as I understand : "python launcher (process + workitem ) " -> amqp -> ruote -> ( python amqp participants )
2013-03-15 00:09:24 utc lbt so just catching up on backlog
2013-03-15 00:11:52 utc lbt so the process that grows is the "boss" script in that ^^ url
2013-03-15 00:13:19 utc lbt and it is essentially just a wrapper around ruote Dash/Worker/FsStorage
2013-03-15 00:13:52 utc lbt phaeron: I wonder if we could use a different storage?