.. index:: device integration - adding new device-types

.. _adding_new_device_types:

Adding new device types
#######################

.. warning:: This is the most complex part of LAVA and it can be a lot
  of work (sometimes several months) to integrate a completely new
  device into LAVA. V2 offers a different and wider range of support to
  V1 but some devices will need new support to be written within
  ``lava-dispatcher``. **It is not always possible to automate a new
  device**, depending on how the device connects to LAVA, how the
  device is powered and whether the software on the device allows the
  device to be controlled remotely. However, do not be tempted into
  using this complexity as an excuse to fall into the trap of
  :ref:`simplistic testing <simplistic_testing_problems>`.

Experience is the hardest and the most expensive teacher of all. This
section is an attempt to gather a set of guidelines from the collective
experience of a range of developers based on a wide range of devices
and the attempts to integrate those devices into LAVA. **Not all such
integrations succeeded** and more than one attempt resulted in broken
hardware. Most labs will be asked to integrate prototype or
pre-production hardware, which always brings its own unique mix of
unexpected errors, limitations and failure modes.

The integration process is different for every new device. Therefore,
this documentation can only provide hints about such devices, based on
experience within the LAVA software and lab teams. **Please** talk to
us on the :ref:`mailing_lists` **before** starting on the integration
of a new device. Include full details of the type of device, the
bootloader specifications, hardware support and anything you have done
so far to automate the device. Sometimes, the supplied bootloader
**must** be modified to allow automation. Some devices need electrical
modifications or specialised hardware to be automated.

Integrating a new device type will involve some level of development
work because the device type templates are more than configuration. Testing
new device type templates requires setting up a developer workflow and
running unit tests as well as running test jobs on a LAVA instance. If
the new device type involves a new boot or deployment method, there
will also need to be changes in the ``lava-dispatcher`` codebase. New
elements of the test job submissions and device configuration may also
need changes to the schema in ``lava-server``. Some new device types
will be a lot easier than others - for example U-Boot tends to have a
reasonably consistent interface across multiple devices, so changes for
a new U-Boot device could be as little as setting variables after
extending the ``base-uboot.jinja2`` template.

The LAVA developers encourage new device type templates to be
:ref:`contributed upstream <contribute_upstream>` as a :ref:`community
contribution <community_contributions>` to LAVA.

.. seealso:: :ref:`growing_your_lab`, including :ref:`lab_scaling`.
   Also :ref:`developing_device_type_templates`,
   :ref:`developing_new_classes` and
   :ref:`migrating_known_device_example`

.. index:: device integration - device requirements

.. _device_requirements:

Device Requirements
*******************

The LAVA software and lab teams have built up a set of guidelines
relating to the integration of new device-types. The further a device
deviates from one or more of these guidelines, the harder it will
become to automate such a device. Always remember that the way that the
device is supported **must** scale to large labs which already contain
a range of other devices, each with their own issues. It is **not**
acceptable to add a new device-type which is incompatible with devices
which are already supported or which imposes restrictions on how many
devices of any type can be used in any one lab.

The guidelines only consider a limited number of possible problems with
device integration. The guidelines are written using our experiences of
a variety of poorly behaving devices over years of development of
automation software like LAVA. Depending on local admins, some labs can
cope with hardware which does not comply with all guidelines,
particularly if the devices are not being used at scale. However, the
more devices are deployed in any lab, the more it will be necessary for
every device to fully comply or such labs will quickly deteriorate,
generating unreliable results.

These guidelines describe device behaviour as a whole. This is a
combination of the device hardware and the firmware. Some devices
support replacing the firmware. Sometimes this can aid automation,
sometimes it can cause more problems and complexity.

Device integration issues are often invisible when testing with a
single device attached to a single developer machine, so the single
device implementation **must** be proven to be reliable and
reproducible before starting to add more devices. For best results,
**only ever change one thing at a time**.

It is not possible to automate every piece of hardware; there are a
number of critical limitations.

.. seealso:: :ref:`pipeline_device_requirements`

.. _integration_reproducibility:

Reproducibility
===============

Reproducibility is the ability to deploy exactly the same software to
the same board(s) and run exactly the same tests many times in a
row, getting exactly the same results each time.

For automation to work, all device functions which need to be used in
automation **must** always produce the same results on each device of a
specific device type, irrespective of any previous operations on that
device, given the same starting hardware configuration.

There is no way to automate a device which behaves unpredictably.
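
As a minimal sketch of what "exactly the same results" can mean in
practice, the following assumes that the results of each run have
already been collected into a dictionary which maps test case names to
results. The function names and the data layout are illustrative
assumptions and are not part of LAVA.

.. code-block:: python

    # Hypothetical helpers, not part of LAVA: compare the results of
    # repeated runs of the same test job on the same device.

    def is_reproducible(runs):
        """Return True if every run produced exactly the same results."""
        if not runs:
            return False  # no runs means no evidence either way
        first = runs[0]
        return all(run == first for run in runs[1:])

    def differing_cases(runs):
        """Return the sorted names of test cases whose results differ."""
        names = set()
        for run in runs:
            names.update(run)
        # a test case missing from a run counts as a difference
        return sorted(
            name for name in names
            if len({run.get(name) for run in runs}) > 1
        )

Running ``differing_cases`` over the collected runs gives a starting
point for investigating which parts of the deployment or test behave
unpredictably.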

Example One
-----------

Some devices have a mode which boots one boot method on the first boot
and then a different boot method on the second boot without allowing
for failures or cancelled boot operations. This alternating boot is
**not** suitable for automation because it would require the automation
to keep state and does not take account of test job failures and
cancellations.

Example Two
-----------

A device which supports jumpers or DIP switches **must** respect those
hardware settings no matter what software is deployed to the device,
including when that software is buggy, broken or written to the wrong
location. It **must not** be possible for test jobs to *brick* the
device, that is to prevent the device from being able to start the next
test job without admin intervention.

.. _integration_reliability:

Reliability
===========

Reliability is the ability to run a wide range of test jobs, stressing
different parts of the overall deployment, with a variety of tests and
**always** getting a ``Complete`` test job. There must be no
``JobError`` or ``InfrastructureError`` failures and there should be
limited variability in the time taken to run the test jobs to avoid the
need for excessive :ref:`timeouts`.
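
One way to keep track of this during integration is to summarise the
outcome and duration of each test job run on the device. The sketch
below assumes that the job outcomes have already been exported as
``(outcome, duration_in_seconds)`` tuples; the function name and the
input format are assumptions made for illustration only.

.. code-block:: python

    import statistics

    # Hypothetical helper, not part of LAVA: summarise job outcomes
    # which have been exported as (outcome, duration_in_seconds) tuples.

    def summarise_jobs(jobs):
        """Report whether all jobs completed and how much durations vary."""
        durations = [
            seconds for outcome, seconds in jobs if outcome == "Complete"
        ]
        return {
            "total": len(jobs),
            "all_complete": bool(jobs) and len(durations) == len(jobs),
            "mean_duration": statistics.mean(durations) if durations else None,
            # the standard deviation needs at least two data points
            "duration_stdev": (
                statistics.stdev(durations) if len(durations) > 1 else None
            ),
        }

A large ``duration_stdev`` relative to ``mean_duration`` is a hint that
timeouts would need to be padded, which is itself worth investigating
before adding more devices.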

The same hardware configuration and infrastructure **must** always
behave in precisely the same way. The same commands and operations to
the device **must** always generate the same behaviour.

* If a device does not always recognise a critical component, for
  example the network hardware, then that device cannot be automated.

* If a device drops the serial connection or resets the connection in
  some situations during image deployment, then the device is not
  sufficiently reliable to be integrated.

* If a device relies on USB, it is possible that errors in the device
  hardware or software can cause instability in the USB stack of the
  worker to which it is connected. (Unlike ethernet, USB is a direct
  metal to metal connection and cannot be electrically isolated.) This
  can potentially cause issues with unrelated devices on the same
  worker.

.. note:: Many reliability issues can be symptoms of infrastructure
   problems but many devices can also exacerbate these failures by
   behaving in ways which do not fully comply with the standards and
   expectations of the infrastructure. It is **essential** that
   reliability issues are debugged during the process of scaling up the
   number of devices and complexity of your LAVA lab. Do **not** wait
   to debug reliability problems until after you have many devices.
   Quite how many devices counts as too many will vary massively
   according to the complexity of the requirements for each device.
   Sometimes, the only way to tackle reliability problems is to scale
   back, take devices offline or disconnect entire groups of devices
   and infrastructure. Debug your reliability issues **before** putting
   such devices into a production lab to minimise the risk of scheduled
   downtime.

.. _integration_scriptable:

Scriptability
=============

The device **must** support deployment of files and booting of the
device without **any** need for a human to monitor or interact with the
process. The need to press buttons is undesirable but can be managed in
some cases by using relays. However, every extra layer of complexity
reduces the overall reliability of the automation process and the need
for buttons should be limited or eliminated wherever possible. If a
device uses LEDs to indicate the success or failure of operations,
such LEDs **must only be indicative**. The device **must** support full
control of that process using **only** commands and operations which do
not rely on observation.

.. _integration_scalability:

Scalability
===========

.. seealso:: :ref:`growing_your_lab`

All methods used to automate a device **must** have minimal footprint
in terms of load on the workers, complexity of scripting support and
infrastructure requirements. This is a complex area and can trivially
impact on both reliability and reproducibility as well as making it
much more difficult to debug problems which do arise. Admins must also
consider the complexity of combining multiple different devices which
each require multiple layers of support.

Some devices may need:

* relays to work around buttons,

* specialised hardware to work around deployment limitations,

* complex scripting around power control,

* the use of :term:`LXC` for automation.

Any one of these burdens will make debugging issues on the worker and
on the devices difficult. Any combination of these burdens makes
debugging many times more difficult than any one burden alone.

.. caution:: **ALWAYS START SMALL** and move forward in **small
   steps**. Remember that many of the deployment methods and tools used
   with some devices have been developed and tested only on the
   single-developer, single-device model. Once a single device is
   working, scale up **slowly**, make **one change at a time** then run
   dozens, preferably hundreds, of tests before stepping up in scale.
   It can make a significant difference even scaling up from one device
   to two, let alone to four or ten. Even the best behaved devices will
   need care to scale up to dozens of devices. LAVA can work with
   hundreds of devices but the only way to know how to deploy hundreds
   of **your** devices is to build slowly from one to two and then
   four, ten and beyond. To use thousands of devices, it is usually
   best to consider a :term:`frontend` which pulls results from several
   :ref:`micro_instances`.

Every LAVA lab is different. Planning is essential. When there is any
expectation that the lab will grow to support a lot of devices, take
care at the earliest stages to plan for infrastructure that
can cope with the expected scale (and then add a bit more). It can be
very expensive (in time and money) to replace the initial
infrastructure like :abbr:`UPS (Uninterruptible Power Supply)` or
network switches or :term:`PDU`.

.. index:: device integration - power

.. _integration_power:

Power
=====

Devices **MUST** support automated resets either by the removal of all
power supplied to the :term:`DUT` or a full reboot or other reset which
clears all previous state of the DUT.

**Every** boot **must** reliably start, without interaction, directly
from the first application of power without the limitation of needing
to press buttons or requiring other interaction. Relays and other
arrangements can be used at the cost of increasing the overall
complexity of the solution, so should be avoided wherever possible.

Devices which have internal batteries become difficult to reliably
automate, unless the battery can be permanently removed. Forced reboots
become impossible without electrical modification of the device to
temporarily take the battery out of circuit. This means that it is much
easier to cause the device to go offline because of a broken kernel
build or broken image.

Battery charging can be an issue - devices may not behave normally when
held in ``fastboot`` mode or with a broken kernel build or image
deployed to the system. This can cause the device to fail to keep
charge in the battery or fail to recharge the battery, despite having
power available.

.. caution:: **Serial power leaks** -
   some devices are capable of drawing power over the serial line used
   to control the device, despite the actual power supply being
   disconnected. Sometimes this requires a period of time to discharge
   capacitors on the board (fixable by adding a ``sleep`` in the
   :ref:`power_off_command <power_commands>`). Sometimes this power
   leak can cause the device to ``latch`` into a particular bootloader
   mode or other state which prevents the automation from proceeding.

.. index:: device integration - reset

.. _integration_reset:

Reset
=====

For a lot of devices, simply cycling power is sufficient for a full
reset. If the device supports reset by other means, for example when a
serial connection is made, then these resets **must** completely reset
the device so as to clear all buffers from previous test runs or
deployments, **including** when such test runs or deployments failed in
unexpected ways.

.. note:: It is recommended for all devices that admins disable the
   ability of the device to automatically boot anything, so that the
   device simply drops to the bootloader prompt.

.. index:: device integration - networking

.. _integration_networking:

Networking
==========

.. to be expanded as more specific content is added.

**Ethernet** - all devices using ethernet interfaces in LAVA **must**
have a unique MAC address on each interface. The MAC address **must**
be persistent across reboots. No assumptions should be made about fixed
IP addresses, address ranges or pre-defined routes. If more than one
interface is available, the boot process **must** be configurable to
always use the same interface every time the device is booted.
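
A lab inventory can be checked for clashing MAC addresses before new
devices are connected. The following is only a sketch: it assumes a
hypothetical inventory which maps each device hostname to the list of
MAC addresses of its interfaces, which is not a format used by LAVA
itself.

.. code-block:: python

    from collections import defaultdict

    # Hypothetical helper, not part of LAVA. The inventory format
    # ({hostname: [mac, ...]}) is an assumption for this example.

    def duplicate_macs(inventory):
        """Map each MAC address used more than once to its users."""
        seen = defaultdict(list)
        for device, macs in inventory.items():
            for index, mac in enumerate(macs):
                # normalise case and separator so that DE-AD-... and
                # de:ad:... are treated as the same address
                normalised = mac.strip().lower().replace("-", ":")
                seen[normalised].append(f"{device}#{index}")
        return {mac: users for mac, users in seen.items() if len(users) > 1}

An empty dictionary means that no address is shared between any two
interfaces in the inventory.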

**WiFi** - is not currently supported as a method of booting devices.

.. index:: device integration - serial console

.. _integration_serial:

Serial console
==============

.. to be expanded as more specific content is added.

LAVA expects to automate devices by interacting with the serial port
immediately after power is applied to the device. The bootloader
**must** interact with the serial port. If a serial port is not
available on the device, suitable additional hardware **must** be
provided before integration can begin. All messages about the boot
process must be visible using the serial port and the serial port
should remain usable for the duration of all test jobs on the device.

.. OS what OSes are you expecting to run as test jobs? How will that
   change your integration requirements? testing of firmware - what
   software is to be tested? BMC?

.. index:: device integration - integration process

.. _integration_process:

Integration process
*******************

To add support for a new :term:`device type`, a certain amount of
development and testing **will** be required.

For some new device types, only a new :ref:`device type jinja2 template
<developing_device_type_templates>` will be required. Every new
template requires testing and a certain amount of debugging. Device
type templates need to be considered as code, not only configuration.
Some familiarity with how to :ref:`debug a LAVA instance
<admin_triage>` will be necessary.

For other device types, :ref:`new dispatcher Action classes
<adding_new_classes>` and new or modified :ref:`strategy classes
<using_strategy_classes>` will be needed. This typically involves a lot
of development time - make sure that you :ref:`contribute_upstream` so
that your local changes do not break when you next upgrade your LAVA
instance(s).

In addition, every new device type will need to be tested on a local
LAVA instance, so an amount of LAVA administration work will be
necessary.

It is **strongly** recommended that everyone who starts work to
integrate a new device type into LAVA is already familiar with
administering their own LAVA instance and has submitted dozens of LAVA
test jobs on at least two different device types already known to work
in LAVA V2. In most cases, a development instance will be needed as
well, so some familiarity with installing and upgrading a LAVA instance
is also recommended.

This means that developers adding new device types should already be
familiar with:

* :ref:`development_pre_requisites`

* :ref:`device_type_templates`

* :ref:`developing_device_type_templates`

* :ref:`testing_pipeline_code`

* :ref:`Administrator triage <admin_triage>`

* :ref:`admin_debug_information`

* :ref:`create_device_dictionary`

* :ref:`writing_tests`

* :ref:`debian_installation`

* :ref:`setting_up_pipeline_instance`

* :ref:`using_gold_standard_files`

* :ref:`debugging_test_failures`

* :ref:`debugging_v2`

* :ref:`unit_tests`

In addition, some device types will require the developer to also be
familiar with:

* :ref:`adding_new_classes`

* :ref:`using_strategy_classes`

* :ref:`contribute_upstream` - maintaining new dispatcher classes
  without upstream support is **not** recommended. LAVA development
  moves relatively quickly.

* :ref:`pipeline_schema` - if your new device type needs changes to the
  test job submission schema.

* :ref:`deploy_using_lxc`

* :ref:`lava_lxc_protocol_android`

* :ref:`debugging_multinode`

.. caution:: Before going any further, **please** talk to us using the
   :ref:`mailing_lists`. Do **not** rush into integration. It is
   tempting to ask a lot of questions on :ref:`support_irc` but other
   conversations will overlap and pasting logs can become a burden. Use
   the mailing list and attach all the relevant data.

.. _integration_similarity:

Find a similar existing device type
***********************************

There are a number of places to check for similar types of device which
are already supported in LAVA V2.

#. https://staging.validation.linaro.org/scheduler/

#. https://validation.linaro.org/scheduler/

#. https://lng.validation.linaro.org/scheduler/

#. https://playground.validation.linaro.org/scheduler/

#. https://git.linaro.org/lava/lava.git/tree/lava_scheduler_app/tests/device-types

#. https://review.linaro.org/#/dashboard/self

Check for:

* similar bootloader

* similar deployment type

* similar deployment or boot process

* similar sequence of boot steps

If you do not find something similar, we strongly recommend that you
**stop here** and :ref:`talk to us <mailing_lists>` before doing
anything else. Be clear about exactly what kind of device you are
trying to integrate. Include details of exactly how the device
currently boots and exactly how new files are deployed to the device.
Do not resort to :ref:`simplistic testing
<simplistic_testing_problems>`.

.. _integration_extend_template:

Extend from an existing device type template
********************************************

All new device type templates need to ``extend 'base.jinja2'`` but
there are also other base templates which simplify the process for
certain bootloaders. For example, all new U-Boot device type templates
should ``extend 'base-uboot.jinja2'``. Many new fastboot device type
templates can ``extend 'base-fastboot.jinja2'``. Avoid directly
extending any of the templates which do not have the ``base`` prefix -
instead copy the existing template for your new device type. When this
template is :ref:`contributed upstream <contribute_upstream>`, a new
``base`` template can be considered as part of the review process.
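
The rule about the ``base`` prefix can be checked mechanically when
reviewing a new device type template. The sketch below is a
hypothetical review aid, not part of LAVA; it only inspects the text of
the template for the Jinja2 ``extends`` statement and does not render
anything. It applies to device type templates only, as device
dictionaries are expected to extend a device type template.

.. code-block:: python

    import re

    # Hypothetical review aid, not part of LAVA: find the parent named
    # in the extends statement of a device type template.
    EXTENDS = re.compile(
        r'''\{%-?\s*extends\s+["']([^"']+)["']\s*-?%\}'''
    )

    def parent_template(text):
        """Return the template named in the extends statement, or None."""
        match = EXTENDS.search(text)
        return match.group(1) if match else None

    def extends_base_template(text):
        """Return True if the template extends a 'base' prefixed template."""
        parent = parent_template(text)
        return parent is not None and parent.startswith("base")

For example, a template starting with
``{% extends 'base-uboot.jinja2' %}`` passes the check, whilst one
starting with ``{% extends 'qemu.jinja2' %}`` does not.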

.. _integration_unit_test:

Extend the template unit tests
******************************

.. seealso:: :ref:`testing_new_devicetype_templates` and
   :ref:`debugging configuration files <debugging_configuration>`

All device type template files in
``lava_scheduler_app/tests/device-types`` will be checked for simple
YAML validity by the ``test_all_templates`` unit test. However, a
dedicated unit test is recommended for all but the simplest of new
device type templates. At the very least, having a unit test for your
new device type template will assist in debugging why the test job does
not run to completion. The full device configuration can be output as
part of running the unit test by changing the ``debug`` value to
``True`` at the top of the ``TestTemplates`` class in
``test_templates.py``.

Add your new device-type template to
``lava_scheduler_app/tests/device-types``. Edit
``lava_scheduler_app/tests/test_templates.py`` and add a new unit test
for your device-type based on one of the existing test functions.
Create a dummy device dictionary as a ``data`` string and ensure that
the combination of the template and the dictionary creates a valid
device. This can be as simple as:

.. code-block:: python

    def test_pixel_template(self):
        self.assertTrue(self.validate_data('staging-pixel-01', """{% extends 'pixel.jinja2' %}
 {% set adb_serial_number = 'FDAC1231DAD' %}
 {% set fastboot_serial_number = 'FDAC1231DAD' %}
 {% set device_info = [{'board_id': 'FDAC1231DAD'}] %}
 """))

In many cases, some of the default values in the base template will
need to be altered for your new device-type. For example:

.. code-block:: jinja

 {% set boot_character_delay = 150 %}

If the value may also need to be extended for some devices of this
device type, you should provide the new value as a default in the
template so that a device dictionary can set an override:

.. code-block:: jinja

 {% set baud_rate = baud_rate | default(115200) %}

.. note:: When setting updated values for defaults in the base
   template, ensure that the line setting the new value is **above**
   the start of the important ``body`` block which will contain the
   output of that value.

   .. code-block:: jinja

    {% extends 'base.jinja2' %}
    {% set boot_character_delay = 150 %}
    device_type: thunderx
    {% set console_device = console_device | default('ttyAMA0') %}
    {% set baud_rate = baud_rate | default(115200) %}

    {% set base_nfsroot_args = nfsroot_args | default(base_nfsroot_args) -%}
    {% set kernel_args = kernel_args | default('acpi=force') %}

    {% block body %}

Every time you make a change to the new template in
``lava_scheduler_app/tests/device-types``, re-run the specific unit
test for your new device type. For example, a new unit test function
defined as ``test_foobar_template`` can be run without running the rest
of the unit tests:

.. code-block:: shell

 $ python3 -m unittest -vcf lava_scheduler_app.tests.test_templates.TestTemplates.test_foobar_template

Remember that device type templates are not just configuration files -
the templates are processed as source code at runtime and can use
various types of logic to substitute the correct variables and omit
other variables. **Always** make your changes in
``lava_scheduler_app/tests/device-types`` and **always** run the unit
test to ensure that changes to the template continue to produce a valid
device configuration after each change.

Only when the unit test passes should the new device type template be
copied to ``/etc/lava-server/dispatcher-config/device-types/``. If the
scheduler tries to assign a test job to a device using this template, a
check will be made to ensure that the output of the template and the
device dictionary is valid. If that check fails, the test job will not
start and the failure will be logged:

.. code-block:: none

 [WARNING] [lava-master] [9] Refusing to reserve for broken V2 device intel-smecher

This message indicates that test job ID ``9`` will never start to run
until the device dictionary and the device type template for the device
``intel-smecher`` are fixed so that the output is valid. It is common
for the rendering of new device type templates to cause subtle YAML
syntax errors. It is also common for the output to be valid YAML but
not valid device configuration. The unit test **must** check for a
valid device configuration, not simply valid YAML. In addition,
whenever it is imperative that a certain value is overridden in the
device type template compared to the base template, the unit test
**must** check that this value has been correctly set in the generated
pipeline. Check the other unit tests in the ``test_*_templates.py``
files to see how this is done. For example, for QEMU from
``test_qemu_templates.py``:

.. code-block:: python

    def test_qemu_installer(self):
        data = """{% extends 'qemu.jinja2' %}
 {% set mac_addr = 'DE:AD:BE:EF:28:01' %}
 {% set memory = 512 %}"""
        job_ctx = {'arch': 'amd64'}
        test_template = prepare_jinja_template('staging-qemu-01', data)
        rendered = test_template.render(**job_ctx)
        template_dict = yaml.safe_load(rendered)
        self.assertEqual(
            'c',
            template_dict['actions']['boot']['methods']['qemu']['parameters']['boot_options']['boot_order']
        )

.. note:: This section only covers the unit tests in the
   ``lava_scheduler_app`` directories in the LAVA codebase. If your
   device integration process requires changes in the
   ``lava_dispatcher`` directory, a set of unit tests will also be
   required there to ensure that the new code operates correctly.