Tuesday Site reports

Notetaking for the AHM week:
Tue: Chrulle
Wed: Ville
Thu: Petter
Fri: Jens

Intro round

NT1 staff:
Maswan (Mattias Wadenstein), HPC2N: NT1 boss
Petter (Petter Urkedal), NBI: Operations, monitoring; sysadmin
Chrulle (Christian Søttrup), NBI: Ops, security; ARC dev
Jens (Jens Larsson), NSC: Ops, dCache; sysadmin
Krishnaveni (Krishnaveni Chitrapu), NSC: Storage/dCache developer
Salmela (Ville Salmela), CSC: Ops; sysadmin
ErikE (Erik Edelmann), CSC: Alice, accounting
Darren (Darren Starr), UiO: Ops; storage sysadmin
Maikenp (Maiken Pedersen), UiO: NT1 ARC
Oxana (Oxana Smirnova), Lund: Experiments liaison

Local staff:
Kildetoft (Rune Kildetoft), NBI: Local sysadmin
Happe (Hans Henrik Happe), NBI: HPC center boss
Znikke (Niklas Zenström), HPC2N: Tape
Roger (Roger Oscarsson), HPC2N: Compute
Michael Rännar, HPC2N
Borisw (Boris Wagner), UiB: Tier-1 admin
Thomas Linden, HIP: Compute, CMS
Garvin (Vincent Garonne), UiO: Former NT1 storage developer
Caela (Michaela Barth), PDC: NeIC NT1 PO
Dejan Lesjak, IJS: Sysadmin
Andrej Filipcic, IJS: Atlas
Marcos Acebes, Lunarc: Storage
Robert (Robert Grabowski), Lunarc: Compute
Florido, Lund: Local support, ARC dev

Central services (Maswan)

Krishnaveni is a new hire for storage. The new servers have been fully provisioned in Ørestaden and the old servers have been torn down.

Services have been offered to EISCAT 3D; we should have an answer before the end of the year. We would prefer to offer something very similar to what we are already doing.

We are working towards a more "DevOps" approach to provisioning the central services. We would also like to finish tarpooling and tape pool performance work. For the future, we would like to consolidate more user communities onto our storage to increase cost efficiency.

Florido asks what would be needed to accommodate a new storage user. Maswan answers that we would be able to do that if they cover the cost of hardware plus their share of ops staff and incidentals.

Central site technical overview (Jens)

We are now completely on the new machines. The old servers are moving to Umeå to be included in mordor. We now have 3 Ganeti machines and 2 PostgreSQL machines. We are moving to 3 separate levels of storage:
Production
Pre-production (for stress testing the setup before moving to production)
Experimental (new!) (for testing things out, experimenting with fault recovery, etc.)

We are possibly the first to run dCache 6.2 in production. Next we need to upgrade the OS on the head nodes and move to three of them instead of two. There have been issues with the switch from DigiCert to Sectigo; that should be sorted by now.

Sites roundtable

Bergen

Bergen is an Alice-only site, using AliEn, dCache and TSM. It is now part of the Norwegian Science Cloud. The site is fully virtualized: compute and storage run on OpenStack VMs, backed by Ceph. The idea was to fully disassociate the T1 service from the actual hardware; Boris thinks it was a beautiful idea, but that it does not really work in practice. They use Terraform as a machine-readable description of the infrastructure, and Ansible is used for provisioning, configured automatically from Terraform.
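The notes do not spell out how the Terraform-to-Ansible handover works at Bergen; purely as a minimal sketch of the pattern (the output name "pool_nodes", the group name, and the file name are all hypothetical), a small script can read `terraform output -json` and emit a static Ansible inventory:

    #!/usr/bin/env python3
    """Sketch: derive an Ansible inventory from Terraform outputs.

    Assumes a Terraform output named "pool_nodes" holding a list of
    IP addresses; all names here are hypothetical.
    """
    import json
    import subprocess

    # Terraform prints its outputs in machine-readable JSON form.
    raw = subprocess.run(["terraform", "output", "-json"],
                         capture_output=True, text=True, check=True).stdout
    outputs = json.loads(raw)

    # Each output is wrapped as {"value": ..., "type": ..., "sensitive": ...}.
    hosts = outputs["pool_nodes"]["value"]

    # Write a static INI inventory that ansible-playbook can consume via -i.
    with open("inventory.ini", "w") as inv:
        inv.write("[pool_nodes]\n")
        for ip in hosts:
            inv.write(ip + "\n")

    print("wrote", len(hosts), "hosts to inventory.ini")

Re-running something like this after each `terraform apply` would keep the Ansible side in step with the infrastructure description.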
Bergen has had understaffing issues. The current staff is:
Raimund Kristensen (group leader)
Tor Ladre (network and storage)
Boris (compute and T1)

Efficiency has been an issue in 2020. When there is a general efficiency issue, it hits Bergen much harder and for longer than other NT1 sites. The advantages of cloudification should be there, but have not been seen in practice; on the contrary, the complexity makes it very hard to fix problems, and there is also a performance loss, mostly in wasted space.

Thomas asks if Alice has standard monitoring jobs that could be used for tracking the efficiency. HIP had also seen issues with efficiency; they were solved by tuning the Lustre stripe size (paper at CHEP 2012).

UCPH (Happe)

The latest thing is building a new data center. UCPH will have a new 200 m² room, with space for about 60 racks at 50 kW per rack. Preparation is ongoing, but the tender has not been done yet.

With regard to the T1, there have been issues with ARC stalling. The session dir has been moved to a separate server, which has helped. New cache servers are in the works. We are working on buying 2 PB of disk storage, while 1 PB is about to be decommissioned. The tapes are not yet full: there is room for this year's and next year's pledges in the current library, but a new library is planned for purchase next year. This will wait until the new computer room is ready.

Maswan asks about plans for the new room. Happe will look up the 3D scans. Water cooling is being planned, with rear-door cooling.

UiO (Maiken)

UiO has reached its final form: 270 nodes of 8-core Epyc, installed in the Norwegian OpenStack cloud. For now it is running with 16 DDR servers; this is perhaps overkill, but it works and should not hurt anything. UiO gave up on ElastiCluster (too much overhead) and extracted the needed Ansible roles instead.

The latest issue was saturation of the DDR network; the problem was caused by a gateway needing updates and a reboot after a 5 year uptime. There are some issues with filling the nodes: backfilling is being used, but the low-priority jobs are not always available when needed. The file system is not performing optimally and will be upgraded with SSDs.

Storage is being transitioned away from Ceph, which is not necessary and too complex. Instead it is moving to XFS on mdraid running on ARM machines, as sketched below. There are issues with performance degradation under heavy load, which is being looked into. It is planned to be in production within a month.
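The notes do not give the exact layout; purely as an illustration of the XFS-on-mdraid approach (device names, RAID level, and mount point are all assumptions), building one such pool could look like this:

    #!/usr/bin/env python3
    """Sketch: build one XFS-on-mdraid pool filesystem.

    Device names, RAID level, and mount point are assumptions; the notes
    only say the new storage is XFS on mdraid on ARM machines.
    """
    import subprocess

    DISKS = ["/dev/sd%s" % c for c in "bcdefghij"]  # 9 member disks (assumed)

    def run(cmd):
        # Echo each command before running it; abort on any failure.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Assemble the member disks into a RAID-6 md array.
    run(["mdadm", "--create", "/dev/md0", "--level=6",
         "--raid-devices=%d" % len(DISKS)] + DISKS)

    # One XFS filesystem across the whole array.
    run(["mkfs.xfs", "/dev/md0"])

    # Mount it where the pool software expects its data (path assumed).
    run(["mkdir", "-p", "/srv/pool1"])
    run(["mount", "/dev/md0", "/srv/pool1"])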
IJS (Dejan)

About 1 PB of storage has been added. Work is ongoing on getting 100 GbE; the switch has been procured, but it is not yet connected to a 100 G fibre, so it is not of much use yet. There is work on better switching for the nodes, which might solve the issues with difficult pools; the problem there seems to be network cards getting stuck.

There will be a new Atlas site in Slovenia, VEGA, which should be ready by March 2021. Jens asks about increased upstream bandwidth; that is expected to happen at the beginning of next year.

HPC2N (Znikke)

New dCache pools are being bought, both for NT1 and Swestore; the service contract on the current nodes expires 24/12 this year. The new nodes are hoped to be delivered by mid-November to ensure timely installation and migration. Dell was chosen. New switches are also being bought (100/25 GbE). The goal is to get 100 GbE LHCOPN connectivity once UmU and SUNET work out how.

The ARC 6 deployment got delayed: the primary person got sick and Sectigo certificates were hard to get. Test jobs have run and Nagios is mostly happy. It is expected to be in production by the end of October.

Abisko will be decommissioned, but will be kept alive until Kebnekaise is in full production. Kebnekaise will be just about enough for the pledge, but NSC should be able to help out and complete the Swedish pledge. New ARC cache machines are in the works and should be in place around the time of the next AHM; HPC2N is toying with the idea of moving to SSDs.

CSC/HIP (ErikE, Thomas)

Alicetron is a virtual cluster on the CSC OpenStack. It was increased from 13*24 to 20*24 cores. The hardware is ageing and will need to be regenerated on new hardware. Job efficiency is low, not as bad as Bergen but worse than other sites; the likely reason is network congestion. It has otherwise been working reliably.

CMS is running on two clusters at HIP. The old one is almost out of production. The new one has had issues with Lustre, and the fix has caused problems with running CMS jobs; it is hoped this will be fixed this week. There is another new cluster on which ARC 6 is being installed; Alcyon will be retired once this is working. There is a grant for new storage. The issues with compute have put a spanner in the works, but it is in the pipeline.

The CSC dCache pools are almost out of warranty. There is money from the Academy of Finland, and they will be replaced as soon as possible.

Lund

Not much to tell. A new ARC 6 frontend with a Sectigo certificate is ready and needs to be put into production. The nodes will be replaced next year, so Lund is looking for suggestions. Swestore T2 storage has been consolidated onto a couple of handfuls of nodes. Christian asks if we can just start production; Robert agrees and will coordinate with Gianfranco.

NSC

New nodes have been procured. Benchmarking suggests that the EPYC CPUs are very efficient with HT, so it is expected that the nodes will run with HT enabled. NSC will also get a very expensive and powerful NVIDIA AI machine.

UPS

NSC lost power again. The procurement and installation of the new UPS in Hangaren (WLCG) went well. Unfortunately, it turns out that with all power modules enabled, the transformer gets a 450 Hz disturbance on the power feed. The system is running, but with degraded UPS capacity.

The UPS in Kärnhuset died completely. It was turned off and taken out of the power loop; being offline then killed most of the batteries. It is too expensive to replace, so instead a new procurement is being looked into. This also means that one compute room is currently running without UPS.