NDGF All Hands 2019 2

Europe/Copenhagen
Universität Bern / UniS, room A-124
Schanzeneckstr. 1, 3012 Bern, Switzerland
Description

NT1 staff and local NDGF site sysadmin meeting. The meeting location is Bern, Universität Bern / UniS, Schanzeneckstr. 1, room A-124.

There will be a dinner the evening before at Altes Tramdepot.

Bern is normally reached by flying to Zurich and taking a train. Trains from Geneva take longer, but are also a possibility. The meeting location is right next to Bern main station.


Site Roundtable

UCPH

    The past ½ year:
    - OPN on redundant(?) 100 Gb. (Should be redundant, but fail-over failed when needed.)
    - Added 0.5 PB disk (tarpool).
    - Added 2 tape pools (tarpool).
    - Compute nodes moved to CentOS 7.
    - CEs moved to CentOS 7 and ARC 6.
    - Facility people got tired of bothering us and moved on to chemistry
    
    Future:
    - Alice VOBOX (alice01) moves to CentOS 7.
    - Order more compute this month (10-15 kHS expected from benchmark runs).
    - Convert all pools to tarpools.
    - IPv6 on compute.
    - 2020: New tape library, or an upgrade of the old one.

Linköping

  NTR

CSC

 The Alice cluster will be upgraded to CentOS 7 and the LRMS will switch to SLURM very soon - maybe this week.

HPC2N

   - Tarpooled all disk pools.
   - Cache machines upgraded to Xenial; the upgrade to Bionic did not work.
   - dCache tarpools readied for "reboot-required" handling.
   - Tape - TSM mishap: 2 tapes got lost:
      - Issue found in ENDIT: the lost tapes caused the logs to fill up.
   - Compute - Nested singularity seems to work now.
      - Uploads from Abisko were not preferring storage at HPC2N; remember to tell Ops when changing IP addresses.
   - Net - Half of disk pools using BBR.
      - no obvious improvement
      - expected to perform better worst-case on bad links.

  Future:
    - Before end of 2019: new tape pools
    - Before end of 2020: new disk pools

Slovenia

  NTR

UiB

   - Working on the new system for the last half year.
   - Alice jobs are now running, but there are still performance issues.
   - The Alice queue has 200 nodes with 8 cores each. There might be another 100 nodes available.
   - Lots of validation errors on the jobs.
   - Storage:
     - dCache - 140 TB (evacuating runs at about 600 MB/s, so around 64 hours to a week; see the sketch after this list)
     - Ganglia is almost ready
   - There might be some leftover money for more compute and dCache. This could mean that pools could be reduced in size to gain performance.
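   A back-of-the-envelope check of the evacuation estimate above (a sketch only, in Python, using the quoted 140 TB and 600 MB/s figures):

     # Evacuation-time estimate for the UiB dCache figures above (illustrative only).
     capacity_bytes = 140e12        # 140 TB of pool data to evacuate
     rate_bytes_per_s = 600e6       # ~600 MB/s sustained evacuation throughput

     hours = capacity_bytes / rate_bytes_per_s / 3600
     print(f"~{hours:.0f} hours (~{hours / 24:.1f} days) of pure transfer time")
     # ~65 hours, a bit under three days; "up to a week" once interruptions
     # and competing traffic are included.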
  

UiO

   - Abel is being decommissioned.
       - downtime since 15/9 - 2019
       - Low prio kept running. Will probably run until the end of the year.
       - The current downtime runs until 30/10 and should be extended.
   - The new Tier 1 will be run on OpenStack.
       - Test cluster since June: AMD EPYC - 12.5 HEPSPEC per core with HT
              - 8 vCPU per node
             - 30 nodes from June
             - Disk limited - running in pilot mode
       - October: More disk available - installed as ARC cache
              - 10 more nodes
       - More servers have arrived
              - 12 servers, each with 2 CPUs of 48 cores (×2 threads)
              - in total 2304 more vCPU (12 × 2 × 48 × 2)
       - ARC Datastaging
               - There is trouble keeping the nodes fed with input. It looks like there might be a 1 Gb link somewhere in the system. To keep a cluster filled the link should be at least 10 Gb.

       - The Norwegian pledge has been saved by Oracle Cloud.

   - Storage:
     - New disk pools are on Ceph
     - Erasure coded
     - 2PB available space
     - 300 MB/s speed, should ideally be 4 times more
     - New pools are being commissioned

     - Minimal ALICE:
        - VOBOX is there. No jobs are run. Should this be closed?
     

Norway net

    LHCOPN 100 Gb: The 100 Gb link to Oslo is possibly ready. There are no plans to upgrade the link to Bergen.


Tarpool follow up


  Almost all disk pools are tarpools.
  Missing:
         UCPH - disk and some tape (by end of year)
         HPC2N - tape (Done with new pools by end of year)
         IJS - old pools
         
  Tape:
    Is the line of responsibility clear?
      ENDIT daemons are run by local admins, ENDIT plugins by NDGF Ops.
    UiB is missing a TSM person. Another option would be Enstore: cheaper, but UiB would be the only ones running it.

  Automatic reboots:
    Seems to work at HPC2N. They made sure to only reboot during office hours.
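    A minimal sketch of such a policy in Python, assuming the Debian/Ubuntu
    reboot-required flag file; the office-hours window and shutdown command are
    illustrative, not the actual HPC2N implementation:

      # reboot_in_office_hours.py - reboot only when an update requests it and
      # only during office hours (run e.g. from cron).
      import os
      import subprocess
      from datetime import datetime

      REBOOT_FLAG = "/var/run/reboot-required"   # created by packages that need a reboot
      OFFICE_HOURS = range(9, 16)                # 09:00-15:59 local time (assumption)

      def maybe_reboot():
          if not os.path.exists(REBOOT_FLAG):
              return  # nothing pending
          if datetime.now().hour not in OFFICE_HOURS:
              return  # outside office hours, try again later
          # In a real setup the dCache pool would be drained or paused first.
          subprocess.run(["/sbin/shutdown", "-r", "+5", "pending updates"], check=True)

      if __name__ == "__main__":
          maybe_reboot()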

  Ansible scripts should be cleaned up. The start/stop issue should be investigated. (Who?)


New Ore machines


   Note: Psql transfer compression is useful, but not above level 2 as it will wait on the CPU - Znikke (see the sketch after these notes).
   DRBD tuning: HPC2N has reached 10 Gb/s, but not in conjunction with Ganeti. We would like to reach 25 Gb/s.

   The same disk on Dulkis has dropped out of the RAID a number of times. A ticket with Dell must be created. (Chrulle)
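   One way to read the Psql compression note above (a sketch only): assuming the
   database move is done with a pg_dump custom-format dump, capping compression at
   level 2 would look like the Python snippet below. Host, path and database names
   are placeholders, not the actual Ore setup.

     # dump_with_light_compression.py - custom-format dump at compression level 2,
     # since higher levels make the transfer CPU-bound rather than network-bound.
     import subprocess

     subprocess.run(
         [
             "pg_dump",
             "--format=custom",                # custom format supports built-in compression
             "--compress=2",                   # level 2: cheap on CPU, still shrinks the transfer
             "--host=old-db-host",             # placeholder source host
             "--file=/srv/dump/chimera.dump",  # placeholder output path
             "chimera",                        # placeholder database name
         ],
         check=True,
     )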

   Plan:
     Ganeti install - Petter
     Psql - Chrulle
     Power cap should be tested - Chrulle
     Monitoring/ganglia - Petter/Jens  (nagios test for different pw hashes)
     Network failover test - Maswan
     Syncauth - could be deprecated - Petter
     Add migration to weekly page - Chrulle
     

Tape carousel


  Is the 200,000 request limit in FTS reasonable? We should be able to support this - Maswan.
  During the test we saw files being removed before being transferred out. It looks to be the sweeper stepping in. Vincent will investigate. Possibly caused by missing protection on the files.
  Monitoring on the tape pools? Stuff missing from pinboard.
  Pools have a timeout (96 hours) for a restore; ENDIT does not. We should make the timeout a bit longer.
 
  The dashboard is missing information needed to debug the checksumming issue at HPC2N.

  The Tape Carousel has high priority from ATLAS. ATLAS is running out of disk space and would like to put more data on tape. NDGF would like an actual official bandwidth requirement for tape. This should be communicated to ATLAS.

  Znikke suggests that we finish the tape validation for all sites: https://wiki.neic.no/wiki/NDGF_dCache_tape_pool_validation

 

LHCOPN Networking

Requirements

  • Be fast enough not to be a problem, e.g. compute in Oslo with data in Umeå.
  • Internal network: Mostly done with the upgrade from 10 Gb/s; Slovenia and parts of Norway are still missing.

Upgrade Plans

  • Bergen and Oslo will stay at 10 Gb/s unless the issue is pushed to decision makers.
  • Slovenia: Need to upgrade backbone of NREN first.
  • HPC2N: Has 4 × 10 Gb/s. Purchase likely next year.

Bottlenecks

  • The biggest bottleneck at the moment is that much of the ATLAS data goes over a limited connection. Can we move the most frequently used data elsewhere?

BBR

  • BBR: Needs a very modern OS (like CentOS 8).
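  A minimal sketch (Python) for checking whether BBR is active on a pool node; the
  sysctl names are standard Linux, and BBR needs a 4.9+ kernel and is typically
  paired with the fq qdisc, hence the modern-OS requirement:

    # check_bbr.py - report the congestion control and qdisc currently in use.
    from pathlib import Path

    def read_sysctl(name: str) -> str:
        return Path("/proc/sys/" + name.replace(".", "/")).read_text().strip()

    available = read_sysctl("net.ipv4.tcp_available_congestion_control").split()
    current = read_sysctl("net.ipv4.tcp_congestion_control")
    qdisc = read_sysctl("net.core.default_qdisc")

    print(f"available: {available}, current: {current}, qdisc: {qdisc}")
    if "bbr" not in available:
        print("BBR module not available on this kernel")
    elif current != "bbr":
        print("enable with: sysctl -w net.ipv4.tcp_congestion_control=bbr"
              " net.core.default_qdisc=fq")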

Security Scan

  • NORDUnet has bought a license for Nessus.
  • Shall we allow them to mount an NFS volume? Yes, but don’t destroy data.
  • They found NFS possibly open, but did not actually try to mount it.
  • For NDGF things we have control over, they should be allowed to do anything.
  • Mattias asked if anyone in the room has objections to being scanned.
  • Christian writes to NORDUnet.

Next Meeting

  • If we continue the current schedule: Oslo, Espoo, Bergen
  • Spring: Espoo, April 22 full day.
  • Fall: Oslo, October 21 to 22 lunch to lunch.

Tape technology update

Niklas reports on current and future technology.

Tape pool discussion

Niklas presents various hardware configurations, slides available. Timescale: In production before Christmas, should last 4 years. No firm conclusion, but mixed use configuration likely.

Prometheus and Grafana

Maiken presents and demonstrates.

  • We also want this for NT1.
  • Do we want to publish ARC worker data node centrally?

ARC News

Maiken presents.

  • There are now progress tickets for the next release.
  • More people running ARC testing are always wanted.

dCache News

Vincent presents.

Other

  • Darren raises the question of using software RAID 5. It will be interesting to test.