NDGF All Hands 2019 2

Europe/Copenhagen
Universität Bern / UniS, room A-124
Schanzeneckstr. 1, 3012 Bern, Switzerland
Description

NT1 staff and local NDGF site sysadmin meeting. The meeting location is Bern, Universität Bern / UniS, Schanzeneckstr. 1, room A-124.

There will be a dinner the evening before at Altes Tramdepot.

Bern is normally reached by flying to Zurich and taking a train. Trains from Geneva take longer, but are also a possibility. The meeting location is right next to Bern main station.


Site Roundtable

UCPH

    The past ½ year:
    - OPN on redundant(?) 100 Gb. (Should be redundant, but fail-over failed when needed.)
    - Added 0.5 PB disk (tarpool).
    - Added 2 tape pools (tarpool).
    - Compute nodes moved to CentOS 7.
    - CEs moved to CentOS 7 and ARC 6.
    - Facility people got tired of bothering us and moved on to chemistry
    
    Future:
    - Alice VOBOX (alice01) moves to CentOS 7.
    - Order more compute this month (10-15 kHS expected from benchmark runs).
    - Convert all pools to tarpools.
    - IPv6 on compute.
    - 2020: New tape library, or an upgrade of the old one.

Linköping

  NTR

CSC

 The Alice cluster will be upgraded to CentOS 7 and the LRMS will switch to SLURM very soon - maybe this week.

HPC2N

   - Tarpooled all disk pools.
   - Cache machines upgraded to Xenial; the upgrade to Bionic did not work.
   - dCache tarpools readied for "reboot-required" handling.
   - Tape - TSM mishap: 2 tapes got lost:
      - Issue found in ENDIT: the lost tapes caused the logs to fill up.
   - Compute - Nested singularity seems to work now.
      - Uploads from Abisko were not preferring storage at HPC2N; remember to tell Ops when changing IP addresses.
   - Net - Half of disk pools using BBR.
      - no obvious improvement
      - expected to perform better worst-case on bad links.

  Future:
    - Before end of 2019: new tape pools
    - Before end of 2020: new disk pools

Slovenia

  NTR

UiB

   - Working on the new system for the last half year.
   - Alice jobs are now running, but there are still performance issues.
   - The Alice queue has 200 nodes with 8 cores each. There might be another 100 nodes available.
   - Lots of validation errors on the jobs.
   - Storage:
     - dCache - 140 TB (evacuating runs at about 600 MB/s, so around 64 hours to a week; see the sketch after this list)
     - Ganglia is almost ready
   - There might be some leftover money for more compute and dCache. This could mean that pools could be reduced in size to gain performance.
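   A back-of-the-envelope check of the evacuation estimate above (a sketch only, in Python, using the quoted 140 TB and 600 MB/s figures):

     # Evacuation-time estimate for the UiB dCache figures above (illustrative only).
     capacity_bytes = 140e12        # 140 TB of pool data to evacuate
     rate_bytes_per_s = 600e6       # ~600 MB/s sustained evacuation throughput

     hours = capacity_bytes / rate_bytes_per_s / 3600
     print(f"~{hours:.0f} hours (~{hours / 24:.1f} days) of pure transfer time")
     # ~65 hours, a bit under three days; "up to a week" once interruptions
     # and competing traffic are included.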
  

UiO

   - Abel is being decommissioned.
       - downtime since 15/9 - 2019
       - Low prio kept running. Will probably run until the end of the year.
       - The current downtime runs until 30/10 and should be extended.
   - The new Tier 1 will be run on OpenStack.
       - Test cluster since June: AMD EPYC - 12.5 HEPSPEC per core with HT
              - 8 vCPU per node
             - 30 nodes from June
             - Disk limited - running in pilot mode
       - October: More disk available - installed as ARC cache
              - 10 more nodes
       - More servers have arrived
              - 12 servers, each with 2 CPUs of 48 cores (×2 threads)
              - in total 2304 more vCPU (12 × 2 × 48 × 2)
       - ARC Datastaging
               - There is trouble keeping the nodes fed with input. It looks like there might be a 1 Gb link somewhere in the system. To keep a cluster filled the link should be at least 10 Gb.

       - The Norwegian pledge has been saved by Oracle Cloud.

   - Storage:
     - New disk pools are on Ceph
     - Erasure coded
     - 2PB available space
     - 300 MB/s speed, should ideally be 4 times more
     - New pools are being commissioned

     - Minimal ALICE:
        - VOBOX is there. No jobs are run. Should this be closed?
     

Norway net

    LHCOPN 100 Gb: The 100 Gb link to Oslo is possibly ready. There are no plans to upgrade the link to Bergen.


Tarpool follow up


  Almost all disk pools are tarpools.
  Missing:
         UCPH - disk and some tape (by end of year)
         HPC2N - tape (Done with new pools by end of year)
         IJS - old pools
         
  Tape:
    Is the line of responsibility clear?
      ENDIT daemons are run by local admins, ENDIT plugins by NDGF Ops.
    UiB is missing a TSM person. Another option would be Enstore: cheaper, but UiB would be the only ones running it.

  Automatic reboots:
    Seems to work at HPC2N. They made sure to only reboot during office hours.
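    A minimal sketch of such a policy in Python, assuming the Debian/Ubuntu
    reboot-required flag file; the office-hours window and shutdown command are
    illustrative, not the actual HPC2N implementation:

      # reboot_in_office_hours.py - reboot only when an update requests it and
      # only during office hours (run e.g. from cron).
      import os
      import subprocess
      from datetime import datetime

      REBOOT_FLAG = "/var/run/reboot-required"   # created by packages that need a reboot
      OFFICE_HOURS = range(9, 16)                # 09:00-15:59 local time (assumption)

      def maybe_reboot():
          if not os.path.exists(REBOOT_FLAG):
              return  # nothing pending
          if datetime.now().hour not in OFFICE_HOURS:
              return  # outside office hours, try again later
          # In a real setup the dCache pool would be drained or paused first.
          subprocess.run(["/sbin/shutdown", "-r", "+5", "pending updates"], check=True)

      if __name__ == "__main__":
          maybe_reboot()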

  Ansible scripts should be cleaned up. The start/stop issue should be investigated. (Who?)


New Ore machines


   Note: Psql transfer compression is useful, but not above level 2 as it will wait on the CPU - Znikke (see the sketch after these notes).
   DRBD tuning: HPC2N has reached 10 Gb/s, but not in conjunction with Ganeti. We would like to reach 25 Gb/s.

   The same disk on Dulkis has dropped out of the RAID a number of times. A ticket with Dell must be created. (Chrulle)
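   One way to read the Psql compression note above (a sketch only): assuming the
   database move is done with a pg_dump custom-format dump, capping compression at
   level 2 would look like the Python snippet below. Host, path and database names
   are placeholders, not the actual Ore setup.

     # dump_with_light_compression.py - custom-format dump at compression level 2,
     # since higher levels make the transfer CPU-bound rather than network-bound.
     import subprocess

     subprocess.run(
         [
             "pg_dump",
             "--format=custom",                # custom format supports built-in compression
             "--compress=2",                   # level 2: cheap on CPU, still shrinks the transfer
             "--host=old-db-host",             # placeholder source host
             "--file=/srv/dump/chimera.dump",  # placeholder output path
             "chimera",                        # placeholder database name
         ],
         check=True,
     )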

   Plan:
     Ganeti install - Petter
     Psql - Chrulle
     Power cap should be tested - Chrulle
     Monitoring/ganglia - Petter/Jens  (nagios test for different pw hashes)
     Network failover test - Maswan
     Syncauth - could be deprecated - Petter
     Add migration to weekly page - Chrulle
     

Tape carousel


  Is the 200,000 request limit in FTS reasonable? We should be able to support this - Maswan.
  During the test we saw files being removed before being transferred out. It looks to be the sweeper stepping in. Vincent will investigate. Possibly caused by missing protection on the files.
  Monitoring on the tape pools? Stuff missing from pinboard.
  Pools have a timeout (96 hours) for a restore; ENDIT does not. We should make the timeout a bit longer.
 
  The dashboard is missing information needed to debug the checksumming issue at HPC2N.

  The Tape Carousel has high priority from ATLAS. ATLAS is running out of disk space and would like to put more data on tape. NDGF would like an actual official bandwidth requirement for tape. This should be communicated to ATLAS.

  Znikke suggests that we finish the tape validation for all sites: https://wiki.neic.no/wiki/NDGF_dCache_tape_pool_validation

 

LHCOPN Networking

Requirements

  • Be fast enough not to be a problem, e.g. compute in Oslo with data in Umeå.
  • Internal network: Mostly done with the upgrade from 10 Gb/s; Slovenia and parts of Norway are still missing.

Upgrade Plans

  • Bergen and Oslo will stay at 10 Gb/s unless the issue is pushed to decision makers.
  • Slovenia: Need to upgrade backbone of NREN first.
  • HPC2N: Has 4 × 10 Gb/s. Purchase likely next year.

Bottlenecks

  • The biggest bottleneck at the moment is that much of the ATLAS data goes over a limited connection. Can we move the most frequently used data elsewhere?

BBR

  • BBR: Needs a very modern OS (like CentOS 8).
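  A minimal sketch (Python) for checking whether BBR is active on a pool node; the
  sysctl names are standard Linux, and BBR needs a 4.9+ kernel and is typically
  paired with the fq qdisc, hence the modern-OS requirement:

    # check_bbr.py - report the congestion control and qdisc currently in use.
    from pathlib import Path

    def read_sysctl(name: str) -> str:
        return Path("/proc/sys/" + name.replace(".", "/")).read_text().strip()

    available = read_sysctl("net.ipv4.tcp_available_congestion_control").split()
    current = read_sysctl("net.ipv4.tcp_congestion_control")
    qdisc = read_sysctl("net.core.default_qdisc")

    print(f"available: {available}, current: {current}, qdisc: {qdisc}")
    if "bbr" not in available:
        print("BBR module not available on this kernel")
    elif current != "bbr":
        print("enable with: sysctl -w net.ipv4.tcp_congestion_control=bbr"
              " net.core.default_qdisc=fq")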

Security Scan

  • NORDUnet has bought a license for Nessus.
  • Shall we allow them to mount an NFS volume? Yes, but don’t destroy data.
  • They found NFS possibly open, but did not actually try to mount it.
  • For NDGF things we have control over, they should be allowed to do anything.
  • Mattias asked if anyone in the room has objections to being scanned.
  • Christian writes to NORDUnet.

Next Meeting

  • If we continue the current schedule: Oslo, Espoo, Bergen
  • Spring: Espoo, April 22 full day.
  • Fall: Oslo, October 21 to 22 lunch to lunch.

Tape technology update

Niklas reports on current and future technology.

Tape pool discussion

Niklas presents various hardware configurations, slides available. Timescale: In production before Christmas, should last 4 years. No firm conclusion, but mixed use configuration likely.

Prometheus and Grafana

Maiken presents and demonstrates.

  • We also want this for NT1.
  • Do we want to publish ARC worker data node centrally?

ARC News

Maiken presents.

  • There are now progress tickets for the next release.
  • More people running ARC testing are always wanted.

dCache News

Vincent presents.

Other

  • Darren raises the question of using software RAID 5. It will be interesting to test.