Diff

Shows the differences between two versions of this page.

--- web:google:colaboratory [2020/06/10 07:21] – created ともやん
+++ web:google:colaboratory [2020/09/11 16:46] (current) – ともやん
@@ Line 1: / Line 1: @@
-<html>
-  <style>
-    #result pre, #mincode pre {
-      overflow: hidden;
-      font-size: 10px;
-    }
-    #result_long pre, #mincode_long pre {
-      height: 250px;
-      overflow: scroll;
-      overflow-x: hidden;
-      font-size: 10px;
-    }
-    #mintbl table {
-      font-size: 12px;
-    }
-    #mintbl td pre {
-      margin: 0;
-    }
-    #img_long {
-      height: 400px;
-      overflow: scroll;
-      overflow-x: hidden;
-    }
-    .dokuwiki .plugin_wrap table {
-      width: auto;
-    }
-    #logo {
-      background-color: white;
-      padding: 10px;
-      width: fit-content;
-    }
-    #logo p {
-      margin: 0;
-    }
-  </style>
-</html>
 ====== Google Colaboratory (abbreviation: Colab) ======
@@ Line 120: / Line 84: @@
 NUMA node0 CPU(s):   0,1
 Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities
+</code>
+</WRAP>
+===== OpenCL =====
+<WRAP prewrap 100%>
+<code python>
+!clinfo
+</code>
+</WRAP>
+<WRAP prewrap 100% #result>
+<code python>
+Number of platforms                               0
+</code>
+</WRAP>
+From the menu, choose [Runtime] - [Change runtime type] and set "Hardware accelerator" under "Notebook settings".\\
+When **Hardware accelerator: GPU** is selected:\\
+<WRAP prewrap 100% #result_long>
+<code python>
+Number of platforms                               1
+  Platform Name                                   NVIDIA CUDA
+  Platform Vendor                                 NVIDIA Corporation
+  Platform Version                                OpenCL 1.2 CUDA 10.1.152
+  Platform Profile                                FULL_PROFILE
+  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
+  Platform Extensions function suffix             NV
+  Platform Name                                   NVIDIA CUDA
+Number of devices                                 1
+  Device Name                                     Tesla T4
+  Device Vendor                                   NVIDIA Corporation
+  Device Vendor ID                                0x10de
+  Device Version                                  OpenCL 1.2 CUDA
+  Driver Version                                  418.67
+  Device OpenCL C Version                         OpenCL C 1.2
+  Device Type                                     GPU
+  Device Topology (NV)                            PCI-E, 00:00.4
+  Device Profile                                  FULL_PROFILE
+  Device Available                                Yes
+  Compiler Available                              Yes
+  Linker Available                                Yes
+  Max compute units                               40
+  Max clock frequency                             1590MHz
+  Compute Capability (NV)                         7.5
+  Device Partition                                (core)
+    Max number of sub-devices                     1
+    Supported partition types                     None
+  Max work item dimensions                        3
+  Max work item sizes                             1024x1024x64
+  Max work group size                             1024
+  Preferred work group size multiple              32
+  Warp size (NV)                                  32
+  Preferred / native vector sizes
+    char                                                 1 / 1
+    short                                                1 / 1
+    int                                                  1 / 1
+    long                                                 1 / 1
+    half                                                 0 / 0        (n/a)
+    float                                                1 / 1
+    double                                               1 / 1        (cl_khr_fp64)
+  Half-precision Floating-point support           (n/a)
+  Single-precision Floating-point support         (core)
+    Denormals                                     Yes
+    Infinity and NANs                             Yes
+    Round to nearest                              Yes
+    Round to zero                                 Yes
+    Round to infinity                             Yes
+    IEEE754-2008 fused multiply-add               Yes
+    Support is emulated in software               No
+    Correctly-rounded divide and sqrt operations  Yes
+  Double-precision Floating-point support         (cl_khr_fp64)
+    Denormals                                     Yes
+    Infinity and NANs                             Yes
+    Round to nearest                              Yes
+    Round to zero                                 Yes
+    Round to infinity                             Yes
+    IEEE754-2008 fused multiply-add               Yes
+    Support is emulated in software               No
+  Address bits                                    64, Little-Endian
+  Global memory size                              15812263936 (14.73GiB)
+  Error Correction support                        Yes
+  Max memory allocation                           3953065984 (3.682GiB)
+  Unified memory for Host and Device              No
+  Integrated memory (NV)                          No
+  Minimum alignment for any data type             128 bytes
+  Alignment of base address                       4096 bits (512 bytes)
+  Global Memory cache type                        Read/Write
+  Global Memory cache size                        655360 (640KiB)
+  Global Memory cache line size                   128 bytes
+  Image support                                   Yes
+    Max number of samplers per kernel             32
+    Max size for 1D images from buffer            134217728 pixels
+    Max 1D or 2D image array size                 2048 images
+    Max 2D image size                             32768x32768 pixels
+    Max 3D image size                             16384x16384x16384 pixels
+    Max number of read image args                 256
+    Max number of write image args                32
+  Local memory type                               Local
+  Local memory size                               49152 (48KiB)
+  Registers per block (NV)                        65536
+  Max number of constant args                     9
+  Max constant buffer size                        65536 (64KiB)
+  Max size of kernel argument                     4352 (4.25KiB)
+  Queue properties
+    Out-of-order execution                        Yes
+    Profiling                                     Yes
+  Prefer user sync for interop                    No
+  Profiling timer resolution                      1000ns
+  Execution capabilities
+    Run OpenCL kernels                            Yes
+    Run native kernels                            No
+    Kernel execution timeout (NV)                 No
+  Concurrent copy and kernel execution (NV)       Yes
+    Number of async copy engines                  3
+  printf() buffer size                            1048576 (1024KiB)
+  Built-in kernels
+  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
+NULL platform behavior
+  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
+  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
+  clCreateContext(NULL, ...) [default]            No platform
+  clCreateContext(NULL, ...) [other]              Success [NV]
+  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No platform
+  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
+  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
+  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
+  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
+  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform
+</code>
+</WRAP>
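The two `clinfo` outputs above differ in the `Number of platforms` line (0 without an accelerator, 1 with the GPU runtime). As a minimal sketch, that line could be checked programmatically before running any OpenCL code. The helper name `count_opencl_platforms` and the sample strings are illustrative assumptions, not part of the original page.

```python
import re

def count_opencl_platforms(clinfo_text):
    """Return the platform count reported by clinfo, or None if the line is missing.

    Hypothetical helper: it only looks for the 'Number of platforms' line
    shown in the clinfo output on this page.
    """
    match = re.search(r"^Number of platforms\s+(\d+)\s*$", clinfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None

# Sample text copied from the two outputs on this page (shortened).
no_accelerator = "Number of platforms                               0\n"
gpu_runtime = (
    "Number of platforms                               1\n"
    "  Platform Name                                   NVIDIA CUDA\n"
)

for label, text in [("none", no_accelerator), ("GPU", gpu_runtime)]:
    count = count_opencl_platforms(text)
    print(label, "->", "OpenCL platform found" if count else "no OpenCL platform")
```

In a notebook cell, the text could presumably be captured with IPython's `out = !clinfo` and joined with `"\n".join(out)` before being passed to the helper.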
+==== OpenCL benchmark with PyOpenCL ====
+[[python:pyopencl|PyOpenCL]]\\
+Install **PyOpenCL**.\\
+<WRAP prewrap 100%>
+<code python>
+!pip install pyopencl
+</code>
+</WRAP>
+<WRAP prewrap 100% #result>
+<code python>
+Collecting pyopencl
+  Downloading https://files.pythonhosted.org/packages/0d/ab/aa0ec8018066a7a70a8a7d5e342cce6d5f35058bed7c22fb6ce78ab7c963/pyopencl-2020.1-cp36-cp36m-manylinux1_x86_64.whl (728kB)
+     |████████████████████████████████| 737kB 12.5MB/s
+Requirement already satisfied: decorator>=3.2.0 in /usr/local/lib/python3.6/dist-packages (from pyopencl) (4.4.2)
+Collecting pytools>=2017.6
+  Downloading https://files.pythonhosted.org/packages/56/4c/a04ed1882ae0fd756b787be4d0f15d81c137952d83cf9b991bba0bbb54ba/pytools-2020.2.tar.gz (63kB)
+     |████████████████████████████████| 71kB 10.1MB/s
+Collecting appdirs>=1.4.0
+  Downloading https://files.pythonhosted.org/packages/3b/00/2344469e2084fb287c2e0b57b72910309874c3245463acd6cf5e3db69324/appdirs-1.4.4-py2.py3-none-any.whl
+Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from pyopencl) (1.12.0)
+Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from pyopencl) (1.18.5)
+Building wheels for collected packages: pytools
+  Building wheel for pytools (setup.py) ... done
+  Created wheel for pytools: filename=pytools-2020.2-py2.py3-none-any.whl size=62338 sha256=9aa0450004dbf633f7584e5914d50999d697d018a341b86a7c499bc1fbfd5281
+  Stored in directory: /root/.cache/pip/wheels/a7/d6/ac/03a67d071bde6d272d1f7c9ab7f4344fa9d7b9d98bda7fd127
+Successfully built pytools
+Installing collected packages: appdirs, pytools, pyopencl
+Successfully installed appdirs-1.4.4 pyopencl-2020.1 pytools-2020.2
+</code>
+</WRAP>
+Save **benchmark-all.py**.\\
+<WRAP prewrap 100% #mincode_long>
+<code python>
+%%file benchmark-all.py
+# example provided by Roger Pau Monné
+import pyopencl as cl
+import numpy
+import numpy.linalg as la
+import datetime
+from time import time
+a = numpy.random.rand(1000).astype(numpy.float32)
+b = numpy.random.rand(1000).astype(numpy.float32)
+c_result = numpy.empty_like(a)
+# Speed in normal CPU usage
+time1 = time()
+for i in range(1000):
+        for j in range(1000):
+                c_result[i] = a[i] + b[i]
+                c_result[i] = c_result[i] * (a[i] + b[i])
+                c_result[i] = c_result[i] * (a[i] / 2.0)
+time2 = time()
+print("Execution time of test without OpenCL: ", time2 - time1, "s")
+for platform in cl.get_platforms():
+    for device in platform.get_devices():
+        print("===============================================================")
+        print("Platform name:", platform.name)
+        print("Platform profile:", platform.profile)
+        print("Platform vendor:", platform.vendor)
+        print("Platform version:", platform.version)
+        print("---------------------------------------------------------------")
+        print("Device name:", device.name)
+        print("Device type:", cl.device_type.to_string(device.type))
+        print("Device memory: ", device.global_mem_size//1024//1024, 'MB')
+        print("Device max clock speed:", device.max_clock_frequency, 'MHz')
+        print("Device compute units:", device.max_compute_units)
+        # Simple speed test
+        ctx = cl.Context([device])
+        queue = cl.CommandQueue(ctx,
+                properties=cl.command_queue_properties.PROFILING_ENABLE)
+        mf = cl.mem_flags
+        a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
+        b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
+        dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)
+        prg = cl.Program(ctx, """
+            __kernel void sum(__global const float *a,
+            __global const float *b, __global float *c)
+            {
+                int loop;
+                int gid = get_global_id(0);
+                for(loop=0; loop<1000;loop++)
+                {
+                    c[gid] = a[gid] + b[gid];
+                    c[gid] = c[gid] * (a[gid] + b[gid]);
+                    c[gid] = c[gid] * (a[gid] / 2.0);
+                }
+            }
+        """).build()
+        exec_evt = prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)
+        exec_evt.wait()
+        elapsed = 1e-9*(exec_evt.profile.end - exec_evt.profile.start)
+        #print("Execution time of test: %g s" % elapsed)
+        print("Execution time of test: %.10f s" % elapsed)
+        c = numpy.empty_like(a)
+        #cl.enqueue_read_buffer(queue, dest_buf, c).wait()
+        cl.enqueue_copy(queue, c, dest_buf)
+        error = 0
+        for i in range(1000):
+                if c[i] != c_result[i]:
+                        error = 1
+        if error:
+                print("Results don't match!!")
+        else:
+                print("Results OK")
+</code>
+</WRAP>
+Run **benchmark-all.py**.\\
+<WRAP prewrap 100%>
+<code python>
+%run benchmark-all.py
+</code>
+</WRAP>
+<WRAP prewrap 100% #result>
+<code python>
+Execution time of test without OpenCL:  5.938735008239746 s
+===============================================================
+Platform name: NVIDIA CUDA
+Platform profile: FULL_PROFILE
+Platform vendor: NVIDIA Corporation
+Platform version: OpenCL 1.2 CUDA 10.1.152
+---------------------------------------------------------------
+Device name: Tesla P4
+Device type: ALL | GPU
+Device memory:  7611 MB
+Device max clock speed: 1113 MHz
+Device compute units: 20
+Execution time of test: 0.0010557440 s
+Results OK
 </code>
 </WRAP>
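The benchmark derives the kernel time from the OpenCL profiling counters, which are in nanoseconds, by multiplying the difference by `1e-9`. The sketch below repeats that conversion and computes a rough speedup from the two times printed above. The function names are hypothetical, and the ratio is only indicative: the CPU figure is dominated by the Python interpreter loop, while the device figure covers kernel execution only (no buffer transfers or kernel compilation).

```python
def profile_ns_to_seconds(start_ns, end_ns):
    """Convert a pair of profiling timestamps (nanoseconds) to elapsed seconds."""
    return 1e-9 * (end_ns - start_ns)

def speedup(cpu_seconds, device_seconds):
    """Ratio of CPU time to device time; rejects non-positive device times."""
    if device_seconds <= 0:
        raise ValueError("device_seconds must be positive")
    return cpu_seconds / device_seconds

# Times copied from the benchmark output on this page.
cpu_seconds = 5.938735008239746
device_seconds = 0.0010557440

print("kernel time from counters:", profile_ns_to_seconds(0, 1_055_744), "s")
print(f"approximate speedup: {speedup(cpu_seconds, device_seconds):.0f}x")
```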
 ===== Current GPU allocation status =====
-<WRAP prewrap 100% #mincode>
+<WRAP prewrap 100%>
 <code python>
 !nvidia-smi