sys-tango-benchmark results
|
|
---|---|
Hi All, SKA is interested in benchmarking the performance of a large TANGO system in our environment. The https://github.com/tango-controls/sys-tango-benchmark tool looks very useful in this regard. Does anyone have results from previous runs that they'd be willing to share? We're looking at 10k to 100k devices, but even smaller systems would be interesting for comparison. Maybe this is already available somewhere online? Regards, Anton |
|
|
---|---|
Hi Anton, You are asking just in time. We are now preparing a request for institutes to run a unified set of benchmarks and collect the results. Up to now, we have only run it on local virtual machines for test purposes. Early next week we will provide a proposed configuration .yml file. All the best, Piotr |
|
|
---|---|
Hi Anton, this sounds like a very interesting use case and, as Piotr pointed out, it arrives just when Piotr is requesting benchmarks from all sites. We need these results before ICALEPCS. In your case, I wonder what kind of metrics you are planning to measure. I can imagine measuring performance as a function of the number of clients per server, but in the case of many devices, what values do you want to measure: startup times, grouped calls, an individual client accessing one or more device servers, event performance? As you know, the Tango model implements point-to-point connections between clients and servers. Multiplying the number of device servers does not necessarily impact the performance of individual client-server connections. Are you planning on putting 10k devices in one device server? Or are you looking to optimise the number of devices per device server? Andy |
|
|
---|---|
Hi Andy, Piotr,

Thanks for the replies. Glad to hear some tests and reports are planned. Andy, you make a good point about the point-to-point communications, which should remain very efficient.

We're thinking of looking at metrics like these:
- Start / initialisation time.
- Peak memory usage.
- Peak CPU usage.
- Some measure of query response time for attributes, commands and events, to N devices concurrently.
- How many devices we can run on a VM with, say, 1 CPU and 4 GB RAM. Does doubling the resources allow twice as many devices?
- Possibly the TANGO DB registration time (first-time population of the DB with all devices, attributes and properties), although this isn't a recurring cost, so it is not that important.

In our environment everything is Dockerised. The plan is to use Kubernetes to orchestrate the TANGO control system. Early tests have shown that 1 device per device server per container doesn't scale very well, e.g. there were problems starting 2000 on a single machine. Multiple devices per server works better, with maybe 100 containers on a machine. We are looking at how to spread the load out and give guidelines to developers, answering questions like:
- How many devices per device server?
- How many device servers per container?
- How many containers per VM?
- How much CPU and RAM per VM?

Obviously, it depends on what each device is doing, but we'd start with something simple.

Anton |
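As a rough illustration of the "query response time to N devices concurrently" metric above, here is a minimal Python sketch. It is not taken from sys-tango-benchmark; the helper names (`timed_call`, `measure_concurrent_latency`, `fake_read`) and the device names are hypothetical, and the query itself is a stand-in so the snippet runs without a Tango installation.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(operation, target):
    """Return the wall-clock duration of one operation(target) call, in seconds."""
    start = time.perf_counter()
    operation(target)
    return time.perf_counter() - start


def measure_concurrent_latency(operation, targets, max_workers=None):
    """Call operation on every target from a thread pool and summarise latencies."""
    workers = max_workers or len(targets)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda t: timed_call(operation, t), targets))
    return {
        "count": len(latencies),
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
    }


def fake_read(device_name):
    # Stand-in for a real query. With PyTango installed, this could be
    # something like tango.DeviceProxy(device_name).read_attribute("State")
    # (an assumption, not exercised here).
    time.sleep(0.001)


if __name__ == "__main__":
    # Hypothetical device names, for illustration only.
    names = [f"test/device/{i}" for i in range(100)]
    print(measure_concurrent_latency(fake_read, names, max_workers=20))
```

Swapping `fake_read` for a real attribute read, command call or event subscription would give one number per query type, which could then be repeated for different values of N.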
|
|
---|---|
Hi Anton, You can start with a set of standard tests (prepared by Michal), so that results from different institutes can be compared. See: https://github.com/tango-controls/sys-tango-benchmark-standard-tests Regarding already available benchmarks, there are measurements of:
Feel free to propose or (even better) to write additional tests. Piotr |
|
|
---|---|
Thanks, Piotr - we'll take a look. |
|
|
---|---|
Hi, Back to Anton's initial question: are there any results available already? Are they published somewhere? Thanks! Cheers, |
|
|
---|---|
@Ingvord, there are at least the results of tests made on AWS for the ICALEPCS paper: https://github.com/tango-controls/sys-tango-benchmark-standard-tests/tree/master/aws-ec2-tests All the best, Piotr |
|
|
---|---|
Hi Piotr, Thanks a lot! That is already interesting to see! Cheers, |
|
|
---|---|
I have looked through the test results and it seems to me that Java is unfairly slow. First of all, some questions about the benchmark itself:
Sorry if these questions have been answered somewhere; I could not find the answers. I have extracted the test server from the benchmark and written a simple test here. Running the test for 15 s with 64 clients (all on a single machine, though) gave me 124371 from WriteAttributeCounterCount, while the Amazon results are typically 60K (about 2 times slower). Anyway, I have started to investigate this; you can track the progress here |
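For readers unfamiliar with this kind of measurement, the sketch below shows the general shape of a fixed-duration counting test: a number of client threads repeat one operation until a deadline, and the total count is reported. It is only an assumption about the style of test described above, not the actual benchmark or the extracted test; `count_operations` and `fake_write` are hypothetical names, and the write is a stand-in so that the snippet runs without Tango.

```python
import threading
import time


def count_operations(operation, clients, duration_s):
    """Run operation in `clients` threads for duration_s seconds; return total calls."""
    counts = [0] * clients  # one slot per thread, so no lock is needed
    deadline = time.monotonic() + duration_s

    def worker(index):
        while time.monotonic() < deadline:
            operation()
            counts[index] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(clients)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(counts)


def fake_write():
    # Stand-in for a real attribute write against a test device.
    time.sleep(0.0005)


if __name__ == "__main__":
    duration_s = 1.0  # the post above used 15 s; shortened here
    total = count_operations(fake_write, clients=8, duration_s=duration_s)
    print(f"{total} operations, {total / duration_s:.0f} per second")
```

With the figures quoted above, 124371 / 60000 is roughly 2.07, which is where the "about 2 times" comparison comes from. Counts from different machines are only comparable if the client count and duration match.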